Train a few steps after time limit reached #362

epwalsh · 2023-11-06T21:08:59Z

This expands on the cancellation logic so that when a run is canceled due to reaching the time limit, it will train for 10 more steps after the cancellation goes into effect and after saving the final checkpoint. That way when we restart the run from the latest checkpoint we'll have some overlap in metrics on W&B, which is good for verifying that the restart worked properly.
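
For readers skimming the thread, the behavior described above can be sketched roughly as follows. This is a minimal, standalone simulation and not the code in olmo/train.py: `run_training`, `should_cancel`, and `save_checkpoints` are illustrative names assumed here, while `cancel_initiated` and `stop_at` follow the names that appear in the diff.

```python
from typing import Callable, List, Optional, Tuple


def run_training(
    max_steps: int,
    should_cancel: Callable[[int], bool],
    extra_steps: int = 10,
) -> Tuple[List[int], List[int]]:
    """Simulate the loop; returns (steps trained, steps checkpointed)."""
    trained: List[int] = []
    checkpoints: List[int] = []
    cancel_initiated = False
    stop_at: Optional[int] = None
    save_checkpoints = True  # assumed flag: turned off after the final save

    for global_step in range(1, max_steps + 1):
        trained.append(global_step)  # stand-in for one optimizer step

        # Begin the cancellation once and record where the hard stop falls.
        if not cancel_initiated and should_cancel(global_step):
            cancel_initiated = True
            stop_at = global_step + extra_steps

        # Save the final checkpoint once, at the step where cancellation begins.
        if cancel_initiated and save_checkpoints:
            checkpoints.append(global_step)
            save_checkpoints = False

        # Hard stop once the extra steps have run.
        if stop_at is not None and global_step >= stop_at:
            break

    return trained, checkpoints
```

In this sketch, a cancellation that begins at step 5 produces one checkpoint at step 5 and training through step 15, so a restart from that checkpoint would re-log steps 6 to 15.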

2015aroras · 2023-11-06T21:41:00Z

olmo/train.py

+                    )
+
+                if stop_at is not None and self.global_step >= stop_at:
+                    canceled = hard_stop = True

                # Maybe save sharded checkpoint.
                if canceled or (


This will save a checkpoint on every one of the extra steps. Consider making this condition, and some of the later code, check hard_stop instead.

Alternatively, you could have canceled represent a hard stop and cancel_initiated represent the beginning of a cancellation.

Ah, good catch!

2015aroras · 2023-11-06T21:45:39Z

olmo/train.py

        if get_global_rank() == 0:
            if self.cfg.time_limit is not None and time.time() - self._start_time >= self.cfg.time_limit:
                # First check if we've reached the training time limit.
                should_cancel = True
                cancel_reason = "time limit reached"
+                extra_steps = 10  # train for 10 extra steps so we get an overlap in metrics when we restart


Consider making this a config setting
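
One way the suggestion could look, as a sketch only: the field name `extra_steps_after_cancel`, the `TrainConfig` class, and the `check_time_limit` helper are assumptions made for illustration, not names confirmed by this thread. Only `time_limit` and the default of 10 come from the diff above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TrainConfig:
    # Wall-clock budget in seconds; None disables the check.
    time_limit: Optional[float] = None
    # Hypothetical setting replacing the hard-coded `extra_steps = 10`.
    extra_steps_after_cancel: int = 10


def check_time_limit(cfg: TrainConfig, start_time: float, now: float) -> Tuple[bool, int]:
    """Return (should_cancel, extra_steps) for the time-limit case only."""
    if cfg.time_limit is not None and now - start_time >= cfg.time_limit:
        return True, cfg.extra_steps_after_cancel
    return False, 0
```

Setting the value to 0 in such a config would restore the old behavior of stopping immediately after the final checkpoint.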

2015aroras · 2023-11-06T23:48:36Z

olmo/train.py

@@ -849,7 +875,7 @@ def on_trace_ready(p):
                    speed_monitor.reset()

                # Maybe run evaluations.
-                if not canceled and self.global_step % self.cfg.eval_interval == 0:
+                if not cancel_initiated and self.global_step % self.cfg.eval_interval == 0:


To be clear, you don't want eval metrics if they happen in those extra steps?

Right... though it's debatable. I think when we cancel we want to stop ASAP, and the eval loop adds time.

Yeah, no eval loops. This is a sanity check.

olmo/util.py

Co-authored-by: Dirk Groeneveld <dirkg@allenai.org>

dirkgr · 2024-01-04T01:45:41Z

Do we still want this? Can we get it merged?

epwalsh · 2024-01-04T19:00:30Z

@dirkgr yes, do you want to give a final review? Otherwise I think we're good to go with this.

dirkgr

I did not review again. I was fine with it last time, except for those spelling errors.

Train a few steps after time limit reached

b828938

epwalsh requested review from dirkgr and 2015aroras November 6, 2023 21:09

2015aroras reviewed Nov 6, 2023

epwalsh added 2 commits November 6, 2023 14:40

fix: canceled vs cancel_initiated

a1c32e9

add configuration option

4ed81c6

epwalsh requested a review from 2015aroras November 6, 2023 22:43

2015aroras approved these changes Nov 6, 2023

Merge branch 'main' into epwalsh/train-after-cancel

56fc2cb

dirkgr requested changes Nov 8, 2023

olmo/util.py (2 outdated review comments, resolved)

I never won a spelling bee

84ad7a1

Co-authored-by: Dirk Groeneveld <dirkg@allenai.org>

epwalsh requested a review from dirkgr November 9, 2023 01:26

epwalsh added 2 commits January 4, 2024 10:45

fix merge conflicts

7e044cd

Clean up

c38b642

dirkgr approved these changes Jan 4, 2024

epwalsh merged commit 23eb949 into main Jan 4, 2024
10 checks passed

epwalsh deleted the epwalsh/train-after-cancel branch January 4, 2024 22:47
