Fix speed issue on LUMI with 7B model #270
Merged
Our FSDP wrapping policy (`Olmo.fsdp_wrap_fn`) used to have a bug which accidentally made training faster. See: https://github.com/allenai/LLM/compare/152a3595f4a523a8044f23eada7da068b301158d..main#diff-ef8ab7279deeec716e70a1cc9ab2accaaa60f27b301cc0733f1e00a9e39c07d1R905-R907

The result of the bug was that our model was wrapped in a single top-level FSDP instance, instead of the intended behavior of wrapping each block in its own FSDP instance.

Naturally it makes sense to wrap block-by-block, but as @dirkgr found out, this actually slows training down substantially on LUMI with the 7B model on more than 32 nodes.
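For context, here is a minimal sketch (not the actual OLMo code) contrasting the two behaviors. The toy `TransformerBlock`, the tiny model, and the callable `auto_wrap_policy` are illustrative stand-ins; it also assumes `torch.distributed` has already been initialized.

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


class TransformerBlock(nn.Module):
    """Illustrative stand-in for a real transformer block."""

    def __init__(self, d_model: int = 64):
        super().__init__()
        self.ff = nn.Linear(d_model, d_model)

    def forward(self, x):
        return self.ff(x)


model_a = nn.Sequential(*[TransformerBlock() for _ in range(4)])
model_b = nn.Sequential(*[TransformerBlock() for _ in range(4)])

# (a) What the buggy wrap policy effectively produced: one top-level FSDP
#     instance holding every parameter, gathered in one big chunk.
flat_fsdp_model = FSDP(model_a)


# (b) The intended behavior: each block becomes its own nested FSDP instance,
#     so parameters are gathered block-by-block during forward/backward.
def wrap_each_block(module, recurse, **kwargs):
    if recurse:
        return True  # keep traversing into children
    return isinstance(module, TransformerBlock)


blockwise_fsdp_model = FSDP(model_b, auto_wrap_policy=wrap_each_block)
```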
So I've added a new configuration option `fsdp.wrapping_strategy` that defaults to `None`, which gives us back the original (unintended) behavior. We'll have to tune this again when we scale up to 70B params.
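A hypothetical sketch of how the new option could gate the wrap policy when building the FSDP model; only `fsdp.wrapping_strategy` (default `None`) and `Olmo.fsdp_wrap_fn` come from this PR, while the `"by_block"` value and the helper below are assumptions for illustration.

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def build_fsdp_model(olmo_model, wrapping_strategy=None):
    """Wrap the model in FSDP according to `wrapping_strategy`.

    None (the default) means no auto-wrap policy, i.e. a single top-level
    FSDP instance -- the behavior this PR restores. "by_block" stands in
    for a per-block strategy and is an illustrative value, not necessarily
    the real config syntax.
    """
    wrap_policy = None
    if wrapping_strategy == "by_block":
        # Assumes the model exposes a per-block wrap policy callable like
        # Olmo.fsdp_wrap_fn mentioned above.
        wrap_policy = olmo_model.fsdp_wrap_fn
    return FSDP(olmo_model, auto_wrap_policy=wrap_policy)
```

With the default `None`, no auto-wrap policy is passed and the whole model lands in one FSDP instance, matching the faster (previously accidental) configuration on LUMI.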