Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow specify max_duration in terms of tokens #385

Merged
merged 4 commits into from
Nov 27, 2023

Conversation

epwalsh
Copy link
Member

@epwalsh epwalsh commented Nov 27, 2023

For example, max_duration="2e12T" means train until 2T tokens. This is much more clear than specifying in terms of steps, which requires doing the math to convert tokens to steps based on batch size and sequence length. It will also make it much easier to resume runs after changing the batch size, as you won't need to change max_duration at all.

@@ -96,7 +96,6 @@ def build_train_dataloader(train_config: TrainConfig) -> DataLoader:
seed=train_config.seed,
shuffle=True,
drop_last=train_config.data.drop_last,
max_examples=train_config.global_train_batch_size * train_config.max_duration,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wasn't needed anyway and shouldn't change anything since we've always set max_duration to the equivalent of 2T tokens, and our dataset is not quite that big.

olmo/config.py Outdated
"""
Maximum number of batches to train for.
How long to train for. If specified as an integer (the default), the units are assumped to be steps.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: typo

Suggested change
How long to train for. If specified as an integer (the default), the units are assumped to be steps.
How long to train for. If specified as an integer (the default), the units are assumed to be steps.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the distinction seems to be less about the typing and more about the suffix (no suffix vs "T"). It would be good to clarify that the suffix determines whether the unit is steps or tokens. You could even add a separate suffix for steps if you want.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needs to be backwards compatible though.

Copy link
Member

@dirkgr dirkgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You think this is solid enough that I could queue up the 2x batch size for the 65B without trying it in salloc first?

@epwalsh
Copy link
Member Author

epwalsh commented Nov 27, 2023

You think this is solid enough that I could queue up the 2x batch size for the 65B without trying it in salloc first?

Yea I feel good about this.

@epwalsh epwalsh merged commit ff883e5 into main Nov 27, 2023
10 checks passed
@epwalsh epwalsh deleted the epwalsh/max-duration-tokens branch November 27, 2023 21:54
@dirkgr
Copy link
Member

dirkgr commented Nov 28, 2023

Though actually I'll need to fix loading unsharded checkpoints first. So I can't just schedule a job.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants