Allow specifying max_duration in terms of tokens #385
Conversation
@@ -96,7 +96,6 @@ def build_train_dataloader(train_config: TrainConfig) -> DataLoader:
         seed=train_config.seed,
         shuffle=True,
         drop_last=train_config.data.drop_last,
-        max_examples=train_config.global_train_batch_size * train_config.max_duration,
This wasn't needed anyway and shouldn't change anything since we've always set max_duration
to the equivalent of 2T tokens, and our dataset is not quite that big.
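As a back-of-the-envelope check (the numbers below are made up, not from the actual run), the removed cap could never bind when `max_duration` corresponds to roughly 2T tokens and the dataset is smaller than that:

```python
# Hypothetical values, only to illustrate why removing max_examples is a no-op.
max_duration = 477_000                # steps, roughly a 2T-token budget (made-up)
global_train_batch_size = 2048        # sequences per step (made-up)

max_examples = global_train_batch_size * max_duration   # ≈ 977M sequences
dataset_size = 700_000_000            # hypothetical number of examples in the dataset

# The dataset runs out long before the cap is reached, so dropping it changes nothing.
assert dataset_size < max_examples
```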
olmo/config.py
Outdated
""" | ||
Maximum number of batches to train for. | ||
How long to train for. If specified as an integer (the default), the units are assumped to be steps. |
nit: typo
How long to train for. If specified as an integer (the default), the units are assumped to be steps.
How long to train for. If specified as an integer (the default), the units are assumed to be steps.
Also, the distinction seems to be less about the typing and more about the suffix (no suffix vs "T"). It would be good to clarify that the suffix determines whether the unit is steps or tokens. You could even add a separate suffix for steps if you want.
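For illustration, suffix-based parsing could look roughly like this (a hypothetical helper, not the code in this PR), with a bare integer keeping its old meaning of steps:

```python
# A rough sketch of suffix-based duration parsing (hypothetical, not the PR's code).
from typing import Tuple, Union


def parse_duration(value: Union[int, str]) -> Tuple[int, str]:
    """Return (amount, unit) where unit is 'steps' or 'tokens'."""
    if isinstance(value, int):
        return value, "steps"                      # backwards compatible: plain ints are steps
    text = value.strip().upper()
    if text.endswith("T"):
        return int(float(text[:-1])), "tokens"     # e.g. "2e12T" -> 2 trillion tokens
    if text.endswith("S"):
        return int(float(text[:-1])), "steps"      # optional explicit suffix for steps
    return int(float(text)), "steps"               # bare numeric string -> steps
```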
It needs to be backwards compatible though.
You think this is solid enough that I could queue up the 2x batch size for the 65B without trying it in salloc first?
Yea I feel good about this.
Though actually I'll need to fix loading unsharded checkpoints first. So I can't just schedule a job.
For example, `max_duration="2e12T"` means train until 2T tokens. This is much more clear than specifying in terms of steps, which requires doing the math to convert tokens to steps based on batch size and sequence length. It will also make it much easier to resume runs after changing the batch size, as you won't need to change `max_duration` at all.
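As a rough illustration (with made-up batch-size and sequence-length values), this is the conversion the trainer can now do internally instead of the user working it out by hand:

```python
# Made-up values; only the arithmetic matters here.
target_tokens = 2e12                     # parsed from max_duration="2e12T"
global_train_batch_size = 2048           # sequences per step (hypothetical)
sequence_length = 2048                   # tokens per sequence (hypothetical)

tokens_per_step = global_train_batch_size * sequence_length   # ~4.2M tokens/step
equivalent_steps = int(target_tokens // tokens_per_step)      # ~476,837 steps

# Doubling the batch size halves the equivalent step count automatically;
# max_duration="2e12T" itself never needs to change when resuming.
```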