Add ability to restart on new epoch #383

epwalsh · 2023-11-24T20:16:00Z

You can set the epoch via the option `--epoch=[INTEGER]`. This
automatically changes the data order each epoch by setting the
data seed to `seed + epoch`, so `--epoch` is the only flag you need to
set when restarting on a new epoch. Everything else in the config can
stay the same.

Note that epochs are counted from 0, so to start the 2nd epoch you would add
the flag `--epoch=1`.
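
To make the `seed + epoch` rule concrete, here is a minimal sketch, not the code from this PR. The helper names `data_seed_for_epoch` and `shuffled_indices` are hypothetical, and Python's `random.Random` is an assumed stand-in for whatever shuffling the real data loader does.

```python
import random
from typing import List


def data_seed_for_epoch(seed: int, epoch: int) -> int:
    """Derive the data seed from the base seed and the 0-indexed epoch."""
    return seed + epoch


def shuffled_indices(num_examples: int, seed: int, epoch: int) -> List[int]:
    """Return a permutation of example indices that depends on (seed, epoch)."""
    rng = random.Random(data_seed_for_epoch(seed, epoch))
    indices = list(range(num_examples))
    rng.shuffle(indices)
    return indices


# Same seed and epoch: the order is reproducible across restarts.
first_epoch = shuffled_indices(100, seed=6198, epoch=0)
first_epoch_again = shuffled_indices(100, seed=6198, epoch=0)

# Same seed, next epoch (what --epoch=1 would select): a different order
# over the same set of examples.
second_epoch = shuffled_indices(100, seed=6198, epoch=1)
```

Because only the epoch feeds into the derived seed, the base seed in the config never has to change between epochs.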

I cherry-picked this commit from #350, which has now started its 2nd epoch.

2015aroras · 2023-11-27T17:19:43Z

olmo/train.py

@@ -147,40 +152,47 @@ def load_trainer_state_dict(self, state_dict: Dict[str, Any]) -> None:
        ]

        # Dataset / dataloader position.
+        checkpoint_epoch = state_dict.get("epoch", 0)
        self.global_step = state_dict["global_step"]
        self.global_data_step = state_dict["global_data_step"]
        self.global_train_examples_seen = state_dict.get(  # newer addition


Is there any reason to keep `self.global_train_examples_seen`? You can perform this `state_dict.get()` backwards-compatibility check without keeping the `global_train_examples_seen` variable.

Yeah, good point. Turns out we don't need `global_data_step` either. 4d6e61c

`global_train_examples_seen` and `global_data_step` no longer needed

2015aroras · 2023-11-27T20:32:40Z

olmo/train.py

            "global_train_tokens_seen",
-            self.global_data_step * self.cfg.global_train_batch_size * self.cfg.model.max_sequence_length,
+            state_dict.get("global_data_step", 0)  # for backwards compatibility


This will result in `throughput/total_tokens` being reset to 0 if `global_train_tokens_seen` and `global_data_step` are both absent from the checkpoint. Maybe `state_dict.get("global_data_step", self.global_step)` is safer?

Absolutely, good catch: b8ca94d
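
For readers following the thread, the fallback chain discussed above can be sketched roughly as follows. This is an illustration under assumptions, not the merged code: `restore_tokens_seen` is a hypothetical standalone helper, and its two size parameters stand in for `cfg.global_train_batch_size` and `cfg.model.max_sequence_length` from the diff.

```python
from typing import Any, Dict


def restore_tokens_seen(
    state_dict: Dict[str, Any],
    global_train_batch_size: int,
    max_sequence_length: int,
) -> int:
    """Recover the token count from a checkpoint, tolerating older formats."""
    global_step = state_dict["global_step"]
    # Oldest checkpoints have neither key, so fall back to global_step
    # instead of 0 to avoid resetting the token counter.
    data_step = state_dict.get("global_data_step", global_step)
    return state_dict.get(
        "global_train_tokens_seen",
        data_step * global_train_batch_size * max_sequence_length,
    )


# Newest format: the stored value wins.
new_ckpt = {"global_step": 100, "global_train_tokens_seen": 999}
# Intermediate format: derived from global_data_step.
mid_ckpt = {"global_step": 100, "global_data_step": 50}
# Oldest format: derived from global_step rather than reset to 0.
old_ckpt = {"global_step": 100}
```

With a default of 0 instead of `global_step`, the oldest format would report zero tokens seen, which is the reset the review comment warns about.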

Remove redundant Trainer fields

4d6e61c

default to global_step

b8ca94d

2015aroras approved these changes Nov 27, 2023

epwalsh merged commit e16e606 into main Nov 27, 2023
10 checks passed

epwalsh deleted the epwalsh/start-new-epoch branch November 27, 2023 21:13
