Problems with multi-epoch training #584

Open
Muennighoff opened this issue May 18, 2024 · 0 comments
Labels
type/bug An issue about a bug

Comments

@Muennighoff
Collaborator

🐛 Describe the bug

I think there are two problems with multi-epoch training:

  • Training stops after one epoch when the duration is given in tokens, e.g. duration: 2e12T, and one epoch contains fewer than 2e12 tokens. Going past the first epoch currently requires setting duration: 2ep, but I think it should also work with T (also mentioned here: Break at 1 epoch "Training epoch complete", can't pretraining beyond 1 epoch ? #554)
    olmo.train:816 INFO [step=817847/1430511]
    train/CrossEntropyLoss=2.341
    train/Perplexity=10.39
    throughput/total_tokens=1,715,149,471,744
    throughput/device/tokens_per_second=18,573
    throughput/device/batches_per_second=0.5668
    olmo.train:1172 INFO Training epoch complete
    olmo.train:1194 INFO Saving final checkpoint...
    train:238 INFO Training complete
  • Afaict, resuming a run that is already past its first epoch requires manually setting epoch: num_epochs in the config to ensure that the data is in a different order, because the shuffle seed depends on it:
    seed=seed + (train_config.epoch or 0),

    I think we should just load the epoch from the trainer state dict. However, afaict this does not happen at the moment because the checkpoint is loaded only after the IterableDataset has been created. I.e. the data loader is built first:
    train_loader = build_train_dataloader(cfg)

    Then the checkpoint with the epoch value is loaded:
    trainer.restore_checkpoint(

    and the data loader remains unchanged.
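
The ordering problem in the second bullet can be illustrated with a minimal sketch. It does not use the OLMo code: `epoch_order`, `checkpoint_state`, and the seed value are hypothetical stand-ins, and only the `seed + (epoch or 0)` expression is taken from the snippet above.

```python
import random
from typing import List, Optional


def epoch_order(num_items: int, seed: int, epoch: Optional[int]) -> List[int]:
    """Shuffle indices with seed + (epoch or 0), as in the quoted snippet."""
    rng = random.Random(seed + (epoch or 0))
    order = list(range(num_items))
    rng.shuffle(order)
    return order


SEED = 1234  # arbitrary value for illustration
NUM_ITEMS = 1000

# Hypothetical stand-in for the trainer state stored in the checkpoint:
# the run has already finished its first epoch.
checkpoint_state = {"epoch": 1}

# Order used during the first epoch.
first_epoch_order = epoch_order(NUM_ITEMS, SEED, epoch=0)

# Flow described in the issue: the loader is built from the config, where
# epoch is unset, and restoring the checkpoint afterwards does not rebuild it.
config_epoch = None
order_built_before_restore = epoch_order(NUM_ITEMS, SEED, epoch=config_epoch)

# Suggested flow: read the epoch from the checkpoint before building the loader.
order_built_after_restore = epoch_order(NUM_ITEMS, SEED, epoch=checkpoint_state["epoch"])

# True: the resumed run would repeat the first epoch's order.
print(order_built_before_restore == first_epoch_order)
```

In this sketch the only change needed is the point at which the epoch becomes known: either the checkpoint is read before the loader is built, or the loader is rebuilt (or reseeded) after the restore.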

Without knowing this, people will train the 2nd epoch with the same data order as the 1st.
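
For the first problem, a rough sketch of how a token-based duration could be mapped onto epochs is shown below. The helper names `tokens_to_train` and `epochs_needed` are hypothetical and are not part of OLMo, and the epoch size is an assumption: it takes the `throughput/total_tokens` value logged at "Training epoch complete" as the size of one epoch, which only holds if that run started from scratch.

```python
def tokens_to_train(duration: str, tokens_per_epoch: int) -> int:
    """Convert a duration such as '2e12T' (tokens) or '2ep' (epochs) to a token budget."""
    if duration.endswith("ep"):
        return int(float(duration[:-2]) * tokens_per_epoch)
    if duration.endswith("T"):
        return int(float(duration[:-1]))
    raise ValueError(f"unsupported duration: {duration!r}")


def epochs_needed(duration: str, tokens_per_epoch: int) -> int:
    """Number of (possibly partial) passes over the data needed to reach the budget."""
    budget = tokens_to_train(duration, tokens_per_epoch)
    return -(-budget // tokens_per_epoch)  # integer ceiling division


# Assumed epoch size, taken from the log line above.
tokens_per_epoch = 1_715_149_471_744

print(epochs_needed("2e12T", tokens_per_epoch))  # 2
print(epochs_needed("2ep", tokens_per_epoch))    # 2
```

Under these assumptions, duration: 2e12T needs a second pass over the data, so stopping at "Training epoch complete" ends the run roughly 285B tokens short of the requested budget.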

Versions

latest main
