Add TimeLimitCallback to mx/trainer callbacks. #1631

Merged
16 commits merged into awslabs:dev on Jun 15, 2022

Conversation

@yx1215 (Contributor) commented on Jul 19, 2021

Issue #, if available:

Description of changes:
Add TimeLimitCallback so that users can set a time limit on the training process.
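
For illustration, usage could look roughly like the sketch below (it assumes the callback name proposed in this PR and the mx Trainer's callbacks argument; DeepAREstimator is only an example model, and names may differ from the merged code):

```python
from gluonts.model.deepar import DeepAREstimator
from gluonts.mx.trainer import Trainer
from gluonts.mx.trainer.callback import TimeLimitCallback  # name as proposed here

# Cap training at roughly one hour of wall-clock time, independent of epochs.
estimator = DeepAREstimator(
    freq="H",
    prediction_length=24,
    trainer=Trainer(
        epochs=9999,  # effectively unbounded; the callback ends training
        callbacks=[TimeLimitCallback(time_limit=3600)],
    ),
)
```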

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Please tag this pr with at least one of these labels to make our release process faster: BREAKING, new feature, bug fix, other change, dev setup

@@ -338,3 +339,31 @@ def on_network_initializing_end(
self, training_network: nn.HybridBlock
) -> None:
copy_parameters(self.predictor.prediction_net, training_network)


class TimeLimitCallback(Callback):
Contributor

Can we have a short doc-string describing the class and its parameters?

Contributor

Let's give it a more descriptive name.

Contributor Author

How about TrainingTimeLimitCallback?


class TimeLimitCallback(Callback):
    @validated()
    def __init__(self, time_limit=None):
Contributor

`validated` only really makes sense if you have type annotations.
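
For context, `validated` builds its validation (and serialization) model from the constructor's type annotations, so an un-annotated signature gives it little to work with. A minimal sketch, assuming the class shape under discussion:

```python
from typing import Optional

from gluonts.core.component import validated
from gluonts.mx.trainer.callback import Callback


class TimeLimitCallback(Callback):
    @validated()
    def __init__(self, time_limit: Optional[int] = None) -> None:
        # The annotation lets `validated` type-check `time_limit` and makes
        # the callback round-trip through gluonts' serialization helpers.
        self.time_limit = time_limit
        self.start_time: Optional[float] = None
```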

Contributor

Suggested change
-    def __init__(self, time_limit=None):
+    def __init__(self, time_limit: Optional[int] = None) -> None:

Comment on lines 347 to 378
self.start_time = None
self.time_limit = time_limit
Contributor

Can we reverse these? Parameters which are passed should be handled first.

if self.time_limit is not None:
    cur_time = time.time()
    if cur_time - self.start_time > self.time_limit:
        logging.warning(
Contributor

We don't want to use `logging` directly, but use a logger instance instead.
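
The usual pattern, for reference (a sketch; the module may already define such a logger):

```python
import logging

logger = logging.getLogger(__name__)

# ...inside the callback, instead of logging.warning(...):
logger.warning("Time limit exceeded during training, stopping training.")
```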

@jaheba (Contributor) left a comment

Before I forget: Thanks for the PR!

Comment on lines 373 to 374
cur_time = time.time()
if cur_time - self.start_time > self.time_limit:
Contributor

Suggested change
-    cur_time = time.time()
-    if cur_time - self.start_time > self.time_limit:
+    elapsed = time.time() - self.start_time
+    if elapsed > self.time_limit:

@@ -338,3 +341,39 @@ def on_network_initializing_end(
self, training_network: nn.HybridBlock
) -> None:
copy_parameters(self.predictor.prediction_net, training_network)


class TrainingTimeLimitCallback(Callback):
Contributor

I think we omitted the Callback suffix from the names of other callbacks.

Suggested change
- class TrainingTimeLimitCallback(Callback):
+ class TrainingTimeLimit(Callback):


class TimeLimitCallback(Callback):
    @validated()
    def __init__(self, time_limit=None):
Contributor

Suggested change
-    def __init__(self, time_limit=None):
+    def __init__(self, time_limit: Optional[int] = None) -> None:

Comment on lines 354 to 355
time_limit: int
    time in seconds, after which your training process will end.
Contributor

Suggested change
- time_limit: int
-     time in seconds, after which your training process will end.
+ time_limit: int
+     time in seconds, after which training ends

src/gluonts/mx/trainer/callback.py (outdated review thread, resolved)
cur_time = time.time()
if cur_time - self.start_time > self.time_limit:
    logger.warning(
        "Time limit exceed during training, stop training."
Contributor

Suggested change
-    "Time limit exceed during training, stop training."
+    "Time limit exceeded during training, stopping training."

@jaheba requested a review from borchero on July 19, 2021 16:00
@jaheba (Contributor) commented on Jul 19, 2021

@borchero Can you also take a look?

@borchero (Contributor) left a comment

This is very nice in general; I was implementing something similar very recently :D However, I would consider a couple more points:

  • Should we time validation? We could "stop" timing on_validation_epoch_start and resume on on_train_epoch_start.
  • Should we extend the Callback base class to return a value from on_train_batch_end? This way, we could stop training after the first batch exceeding the time limit instead of the first epoch (which is more useful in my opinion).
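
For the second point, the shape of the change could be something like the following sketch (illustrative only; not the actual base-class code):

```python
class Callback:
    # ...existing hooks...

    def on_train_batch_end(self, training_network) -> bool:
        # Return False to ask the trainer to stop training early;
        # returning True keeps the previous behaviour.
        return True


# Schematically, the trainer's batch loop would then do something like:
#
#   should_continue = all(cb.on_train_batch_end(net) for cb in callbacks)
#   if not should_continue:
#       break
```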

src/gluonts/mx/trainer/callback.py (outdated review thread, resolved)
src/gluonts/mx/trainer/callback.py (outdated review thread, resolved)
if self.time_limit is not None:
    elapsed = time.time() - self.start_time
    if elapsed > self.time_limit:
        logger.warning(
Contributor

logger.info?

@jaheba (Contributor) commented on Jul 19, 2021

  • Should we time validation? We could "stop" timing on_validation_epoch_start and resume on on_train_epoch_start.

What is the intuition here? Why would I want to stop training after a certain amount of time?

  • Should we extend the Callback base class to return a value from on_train_batch_end? This way, we could stop training after the first batch exceeding the time limit instead of the first epoch (which is more useful in my opinion).

Would there be other places where we also would like to be able to stop training?

@yx1215 (Contributor, Author) commented on Jul 20, 2021

  • Should we time validation? We could "stop" timing on_validation_epoch_start and resume on on_train_epoch_start.
  • What is the intuition here? Why would I want to stop training after a certain amount of time?

Sometimes users might only have a limited time budget. They won't know how many epochs they can run within that limit, so they can set the epoch count to 9999 and give a time limit to make training stop once the time is used up.

  • Should we extend the Callback base class to return a value from on_train_batch_end? This way, we could stop training after the first batch exceeding the time limit instead of the first epoch (which is more useful in my opinion).
  • Would there be other places where we also would like to be able to stop training?

Maybe we can do what @borchero says: record only the training process, not the validation process. And if we want to be more precise with the time limit, we can check after every batch.

@borchero (Contributor) commented on Jul 20, 2021

What is the intuition here? Why would I want to stop training after a certain amount of time?

One very common use case is hyperparameter optimization. In the successive halving algorithm, you train many configurations for a "budget" N, then take the best 50% of configurations, train for a total budget of 2N and so on ... While you can look at the budget as the number of epochs, it is often useful to actually use time as the budget (as it also often determines the money spent) -- especially if you want your budget to be independent of model size and/or dataset size.

Would there be other places where we also would like to be able to stop training?

I think we could consider allowing that after every hook which is called at the end of some iteration (i.e. end of training/validation batch, end of training/validation epoch, end of epoch).

@borchero (Contributor)

Maybe we can do what @borchero says: record only the training process, not the validation process. And if we want to be more precise with the time limit, we can check after every batch.

I would say that we should have a flag in the __init__ of the callback to determine whether validation should be recorded.

@jaheba (Contributor) commented on Jul 20, 2021

I think I'm against treating validation differently and having a net-train mode. It makes the code more complicated and is less intuitive (if I have a budget, I don't care too much how it is spent). If we realise there is still a need for this, we can add it later.

@jaheba (Contributor) commented on Jul 20, 2021

We can have a flag, which controls whether we check after each batch or after each full epoch.

What happens if we stop after a batch? Is that entire epoch invalidated?

@borchero (Contributor)

I think I'm against treating validation differently and having a net-train mode. It makes the code more complicated and is less intuitive (if I have a budget, I don't care too much how it is spent). If we realise there is still a need for this, we can add it later.

I actually needed exactly this (i.e. only track training time and not callbacks/validation, since callbacks were potentially very time-consuming) a few weeks ago and ended up rewriting the Trainer class (for some other reasons as well). It would be nice to have this option out-of-the-box, and it's not too much work imo.

@borchero (Contributor)

What happens if we stop after a batch? Is that entire epoch invalidated?

I would just stop the epoch prematurely and treat it like the epoch has been completed.

@jaheba (Contributor) commented on Jul 20, 2021

I think I'm against treating validation differently and having a net-train mode. It makes the code more complicated and is less intuitive (if I have a budget, I don't care too much how it is spent). If we realise there is still a need for this, we can add it later.

I actually needed exactly this (i.e. only track training time and not callbacks/validation, since callbacks were potentially very time-consuming) a few weeks ago and ended up rewriting the Trainer class (for some other reasons as well). It would be nice to have this option out-of-the-box, and it's not too much work imo.

Fair enough.

@jaheba (Contributor) commented on Jul 20, 2021

What happens if we stop after a batch? Is that entire epoch invalidated?

I would just stop the epoch prematurely and treat it like the epoch has been completed.

Then let's do this.

@yx1215 (Contributor, Author) commented on Jul 20, 2021

I'm writing the conclusion here to make sure we are on the same page.
We will have to add the following:

  • a flag that controls whether validation epochs should be counted toward the time limit
  • a flag that controls whether we stop at the end of each epoch or after each batch; if we stop after a batch, we will treat it as if the whole epoch had ended.

Did I miss anything?
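
Putting the agreed points together, a callback along these lines could look roughly like the sketch below (flag and attribute names taken from later revisions in this thread; hook signatures are simplified, see the Callback base class for the exact ones):

```python
import logging
import time
from typing import Optional

from mxnet.gluon import nn

from gluonts.core.component import validated
from gluonts.mx.trainer.callback import Callback

logger = logging.getLogger(__name__)


class TrainingTimeLimit(Callback):
    """Stop training once `time_limit` seconds have been spent, optionally
    excluding validation time, and optionally stopping mid-epoch."""

    @validated()
    def __init__(
        self,
        time_limit: int,
        track_validation_duration: bool = True,
        stop_during_epoch: bool = False,
    ) -> None:
        self.time_limit = time_limit
        self.track_validation_duration = track_validation_duration
        self.stop_during_epoch = stop_during_epoch
        self.time_spent = 0.0
        self.checkpoint: Optional[float] = None

    def on_train_start(self, max_epochs: int) -> None:
        self.checkpoint = time.time()

    def _bank_time(self) -> None:
        # Add the time elapsed since the last checkpoint to the budget used.
        now = time.time()
        if self.checkpoint is not None:
            self.time_spent += now - self.checkpoint
        self.checkpoint = now

    def _should_continue(self) -> bool:
        if self.time_spent > self.time_limit:
            logger.warning(
                "Time limit exceeded during training, stopping training."
            )
            return False
        return True

    def on_validation_epoch_start(
        self, training_network: nn.HybridBlock
    ) -> None:
        # Bank the training time accumulated so far; if validation is not
        # tracked, pause the clock until training resumes.
        self._bank_time()
        if not self.track_validation_duration:
            self.checkpoint = None

    def on_train_epoch_start(self, training_network: nn.HybridBlock) -> None:
        if self.checkpoint is None:
            self.checkpoint = time.time()

    def on_train_batch_end(self, training_network: nn.HybridBlock) -> bool:
        self._bank_time()
        return self._should_continue() if self.stop_during_epoch else True

    def on_train_epoch_end(self, training_network: nn.HybridBlock) -> bool:
        self._bank_time()
        return self._should_continue()
```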

@borchero (Contributor) left a comment

Nice, thanks!!

src/gluonts/mx/trainer/callback.py (review thread, resolved)
src/gluonts/mx/trainer/callback.py (review thread, resolved)
def __init__(
    self,
    time_limit: int,
    include_validation_in_time_limit: bool = True,
Contributor

What about track_validation_duration?

@@ -355,25 +362,96 @@ def __init__(self, time_limit: Optional[int] = None) -> None:
time in seconds, after which training ends
Contributor

We should properly document the parameters.

def on_train_batch_end(self, training_network: nn.HybridBlock) -> bool:
    print(
        "on_train_batch_end", self.time_spent
    )  # for debugging purpose, will be deleted before merging
Contributor

Just to keep track of the comment^^

self.checkpoint = time.time()
print(
    "on_train_epoch_end", self.time_spent
)  # for debugging purpose, will be deleted before merging
Contributor

Only for tracking.

self.time_spent += time.time() - self.checkpoint
self.checkpoint = time.time()
if self.stop_during_epoch:
    if self.time_spent > self.time_limit:
Contributor

Log for consistency?

@yx1215 (Contributor, Author), Jul 21, 2021

I think we don't need a log here; otherwise the message will be printed twice when we stop after one batch (once after the batch, and once after the epoch), because regardless of whether we are to stop after one batch, the time limit will always be checked after each epoch.


if self.stop_during_epoch:
    if self.time_spent > self.time_limit:
        return False
Contributor

Logging for consistency

Contributor Author

Same as above.

@jaheba (Contributor) commented on Jul 24, 2021

Looks like there is some code duplication. Can we make the checking more reusable?

@yx1215 (Contributor, Author) commented on Jul 25, 2021

Looks like there is some code duplication. Can we make the checking more reusable?

I've just reduced the redundancy of the code.

@@ -404,6 +408,8 @@ def loop( # todo call run epoch
logger.info(
    f"Number of parameters in {net_name}: {num_model_param}"
)
if not should_continue:
    break
Contributor

This doesn't set the outer should_continue and thus we will call loop again and again.

Contributor Author

I fixed this by setting self.halt=True before we break from the loop.
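
Schematically, the fix described here looks like this (a sketch; the real trainer loop has more going on):

```python
# inside the trainer's loop, after the callbacks have been consulted:
if not should_continue:
    self.halt = True  # let the outer loop see the stop signal
    break
```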

def __init__(
    self,
    time_limit: int,
    track_validation_duration: bool = True,
Contributor

I think something like use_net_training is better, since time can be spent in a lot of places, including validation.

Contributor Author

I'm not sure what this actually means.... What would we use use_net_training to record?

def on_train_start(self, max_epochs: int) -> None:
    self.checkpoint = time.time()

def should_continue_by_timelimit(self, record_time=True, should_stop=True):
Contributor

I think this should be two separate methods, which do not have default arguments. It's not really clear to me at a quick glance what this is supposed to be doing.
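
One possible split, just to illustrate the shape (hypothetical method names, to be defined inside the callback):

```python
def _record_time(self) -> None:
    # Bank the time elapsed since the last checkpoint.
    now = time.time()
    self.time_spent += now - self.checkpoint
    self.checkpoint = now

def _within_time_limit(self) -> bool:
    # Pure check: no side effects and no default arguments.
    return self.time_spent <= self.time_limit
```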

@yx1215 requested a review from borchero on August 4, 2021 18:47
@lostella added the "new feature" and "BREAKING" labels on Aug 9, 2021
@lostella (Contributor)

Marked as breaking because of the change in the output signature for the on_train_batch_end and on_validation_batch_end methods
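
Concretely, the breaking part is the new boolean return value (a sketch of the changed signatures; see the merged callback.py for the exact ones):

```python
def on_train_batch_end(self, training_network: nn.HybridBlock) -> bool:
    ...  # return False to stop training early

def on_validation_batch_end(self, training_network: nn.HybridBlock) -> bool:
    ...  # return False to stop training early
```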

@lostella added this to the v0.9 milestone on Aug 24, 2021
@lostella modified the milestones: v0.9, v0.10 on Feb 17, 2022
@jaheba changed the title from "add TimeLimitCallback in callback.py" to "Add TimeLimitCallback to mx/trainer callbacks." on Jun 14, 2022
@jaheba (Contributor) commented on Jun 14, 2022

@lostella Should we have tests, at least for serde?

@lostella (Contributor)

@lostella Should we have tests, at least for serde?

Yes that would be good. For example the TerminateOnNan callback doesn't seem to use any of the serialization mechanisms other classes rely on.
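
A minimal serde round-trip test could look like the following sketch (it assumes the callback name and arguments discussed above; the merged names may differ):

```python
from gluonts.core import serde
from gluonts.mx.trainer.callback import TrainingTimeLimit


def test_training_time_limit_serde() -> None:
    callback = TrainingTimeLimit(time_limit=60, stop_during_epoch=True)
    decoded = serde.decode(serde.encode(callback))
    # The round-tripped object should encode back to the same representation.
    assert serde.encode(decoded) == serde.encode(callback)
```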

@jaheba (Contributor) commented on Jun 15, 2022

However, I'm not sure we can really test it functionally. It would also take some time, which we don't have.

@jaheba merged commit 7032ada into awslabs:dev on Jun 15, 2022
@jaheba mentioned this pull request on Jun 17, 2022
kashif pushed a commit to kashif/gluon-ts that referenced this pull request Jun 24, 2022
Co-authored-by: Jasper <schjaspe@amazon.de>
Labels
BREAKING (this is a breaking change), new feature

4 participants