Fix speed issue on LUMI with 7B model #270
Merged
Our FSDP wrapping policy (`Olmo.fsdp_wrap_fn`) used to have a bug which accidentally made training faster. See: https://github.com/allenai/LLM/compare/152a3595f4a523a8044f23eada7da068b301158d..main#diff-ef8ab7279deeec716e70a1cc9ab2accaaa60f27b301cc0733f1e00a9e39c07d1R905-R907

The result of the bug was that our model was wrapped in a single top-level FSDP instance, instead of the intended behavior of wrapping each block in its own FSDP instance.

Naturally it makes sense to wrap block-by-block, but as @dirkgr found out, this actually slows training down substantially on LUMI with the 7B model on more than 32 nodes.
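For context, here is a minimal sketch (not the actual OLMo code) contrasting the two behaviors. The toy `TransformerBlock`, the tiny model, and the callable `auto_wrap_policy` are illustrative stand-ins; it also assumes `torch.distributed` has already been initialized.

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


class TransformerBlock(nn.Module):
    """Illustrative stand-in for a real transformer block."""

    def __init__(self, d_model: int = 64):
        super().__init__()
        self.ff = nn.Linear(d_model, d_model)

    def forward(self, x):
        return self.ff(x)


model_a = nn.Sequential(*[TransformerBlock() for _ in range(4)])
model_b = nn.Sequential(*[TransformerBlock() for _ in range(4)])

# (a) What the buggy wrap policy effectively produced: one top-level FSDP
#     instance holding every parameter, gathered in one big chunk.
flat_fsdp_model = FSDP(model_a)


# (b) The intended behavior: each block becomes its own nested FSDP instance,
#     so parameters are gathered block-by-block during forward/backward.
def wrap_each_block(module, recurse, **kwargs):
    if recurse:
        return True  # keep traversing into children
    return isinstance(module, TransformerBlock)


blockwise_fsdp_model = FSDP(model_b, auto_wrap_policy=wrap_each_block)
```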
So I've added a new configuration option `fsdp.wrapping_strategy` that defaults to `None`, which gives us back the original (unintended) behavior. We'll have to tune this again when we scale up to 70B params.
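A hypothetical sketch of how the new option could gate the wrap policy when building the FSDP model; only `fsdp.wrapping_strategy` (default `None`) and `Olmo.fsdp_wrap_fn` come from this PR, while the `"by_block"` value and the helper below are assumptions for illustration.

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def build_fsdp_model(olmo_model, wrapping_strategy=None):
    """Wrap the model in FSDP according to `wrapping_strategy`.

    None (the default) means no auto-wrap policy, i.e. a single top-level
    FSDP instance -- the behavior this PR restores. "by_block" stands in
    for a per-block strategy and is an illustrative value, not necessarily
    the real config syntax.
    """
    wrap_policy = None
    if wrapping_strategy == "by_block":
        # Assumes the model exposes a per-block wrap policy callable like
        # Olmo.fsdp_wrap_fn mentioned above.
        wrap_policy = olmo_model.fsdp_wrap_fn
    return FSDP(olmo_model, auto_wrap_policy=wrap_policy)
```

With the default `None`, no auto-wrap policy is passed and the whole model lands in one FSDP instance, matching the faster (previously accidental) configuration on LUMI.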