resume training from previous epoch #1872
Do you want to do something special with the history? If not, you can just call |
When I call model.fit() after loading the model and weights, it shows epoch = 1. If I stop training at epoch 100, I want to resume training with epoch = 101. |
I think it does not matter whether it shows that training is at epoch = 1 or epoch = 101. |
Thank you. |
But there is a problem with this approach: what about hyperparameters that change with the epoch, say a learning rate with decay? Just restarting with the fit method doesn't take that into account. |
Yeah, this happens to me when I resume training by loading weights. I was training ResNet-18 on the ImageNet dataset; the model saved its weights at the 1st epoch, with lr = 0.1 at the start. I stopped it, then tried the resume functionality, and the model started again with the same lr = 0.1, and the loss increased at each iteration. To set the lr to its state at the 1st epoch, I changed it according to the SGD update rule, lr = lr * (1. / (1 + decay * iterations)); however, it didn't work: the loss still increases, just more slowly than with lr = 0.1. Probably I should lower the lr further, but I don't understand why the loss still increases even when the lr is set accordingly. |
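One detail worth stressing about the update rule quoted above: the decay is applied per iteration (batch), not per epoch, so reproducing the schedule after a resume requires the global batch count, not the epoch number. A minimal sketch in pure Python, assuming the legacy time-based SGD decay (`initial_lr`, `decay`, and `iterations` are placeholder values):

```python
def decayed_lr(initial_lr, decay, iterations):
    """Time-based decay used by the legacy Keras SGD optimizer:
    lr = initial_lr * 1 / (1 + decay * iterations).
    Note that `iterations` counts batches, not epochs."""
    return initial_lr * (1.0 / (1.0 + decay * iterations))

# With lr=0.1 and decay=1e-4, the rate is halved after 10,000 batches,
# which is why resuming with the raw initial lr=0.1 overshoots.
print(decayed_lr(0.1, 1e-4, 0))      # 0.1
print(decayed_lr(0.1, 1e-4, 10000))  # 0.05
```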
Try the initial_epoch argument of the .fit method. |
using initial_epoch didn't work in this case |
Setting the |
Besides using the
This allows using the same callback and it just appends to the end. @fchollet should I post a pull request for this? It seems to me that this is more useful than the current behaviour of overwriting the |
@MartinThoma ,
to make your suggestion work, right? I've tried to make this work, and eventually ran into the feeling that too many different things would have to change; see #6697. What do you think? |
If you want to resume from epoch 101, simply use initial_epoch=101 in model.fit(). From the docs: initial_epoch: epoch at which to start training (useful for resuming a previous training run). |
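A minimal, self-contained sketch of that suggestion (assumes TensorFlow 2.x; the tiny random-data model is only for illustration). Note that initial_epoch just shifts the epoch counter seen by logs and callbacks; it does not restore any weights by itself:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, standing in for a real training setup.
x = np.random.rand(32, 4).astype("float32")
y = np.random.rand(32, 1).astype("float32")

model = tf.keras.Sequential([tf.keras.Input(shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")

# Pretend a previous run covered epochs 0-2 (e.g. weights were loaded
# from a checkpoint); resume at epoch 3 and stop before epoch 5.
history = model.fit(x, y, epochs=5, initial_epoch=3, verbose=0)
print(history.epoch)  # [3, 4] -- only two epochs actually run
```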
It seems that TensorFlow Estimators also support resuming training: "Since the state of the model is persisted (in model_dir=PATH above), the model will improve the more iterations you train it, until it settles." |
Related question: what happens to the gradient computations that rely on a history of the gradients (when momentum is present, as in Adam and most gradient-descent variants)? Does the checkpoint store these as well? Thanks! |
@bupedroni: As far as I know, every time I loaded an existing model, all the hyperparameters were reset to their default values. The best way to resume is to write a custom callback that stores all the hyperparameters, then start training as mentioned by @MartinThoma. |
@MartinThoma I'd like a pull request implementing that. Basically, I'm training a model, and if I notice the metrics haven't diverged I'd like to train for another x epochs, and also be able to plot the overall history additively. For now I'm just accumulating histories like this: https://www.kaggle.com/morenoh149/keras-continue-training |
Still have this issue... any update on it? |
anything new here? |
just port your code to pytorch 😆 |
Ya. That actually worked for me. 2 years and counting. |
So does that mean that if I call … and model.fit(epochs=5), both are the same? |
Yes, they are equivalent. At least that is what I found using the TensorFlow Keras API in TensorFlow 2.0 |
How can I get the epoch at which the model was saved by ModelCheckpoint? |
Save the epoch number in the name of the model file, then fetch that number with a regex when resuming training. |
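A minimal sketch of that approach in pure Python. The `weights.{epoch:03d}.h5` filename pattern is an assumption here; adjust the regex to whatever filepath template you give ModelCheckpoint:

```python
import re

def epoch_from_checkpoint(filename):
    """Extract the epoch number from a checkpoint name such as
    'weights.042.h5'; return 0 (start fresh) if none is found."""
    match = re.search(r"\.(\d+)\.h5$", filename)
    return int(match.group(1)) if match else 0

initial_epoch = epoch_from_checkpoint("weights.042.h5")
print(initial_epoch)  # 42
# Then resume with, e.g.:
# model.fit(x, y, epochs=100, initial_epoch=initial_epoch)
```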
I managed to do this with an optimizer whose learning rate depends on the iteration count, e.g. Adam. Here is the pseudo-code:

```python
if os.path.isfile(checkpoint_path + ".index"):
    # This loads `(root).optimizer.iter` from the checkpoint
    model.load_weights(checkpoint_path)

# Recover the iteration count from the optimizer and convert it to epochs
initial_epoch = model.optimizer.iterations.numpy() // STEPS_PER_EPOCH

callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                              save_weights_only=True)
model.fit(train_data, epochs=NUM_EPOCHS, initial_epoch=initial_epoch,
          callbacks=[callback])
```

Hope this helps :-) |
I got tired of this so I ended up writing a Keras wrapper that autosaves and restores the epoch number, training history, and model weights:
Let me know what you think. PRs more than welcome. |
@dorukkarinca is this handled in TensorFlow v2? That's supposed to supersede Keras. |
@morenoh149 not to the best of my knowledge. This wrapper wraps tensorflow.keras anyway. |
@dorukkarinca UwU and Orz. Your wrapper helped me so much. I don't know why I wasted a week retraining from the start. |
While they are the same, is there a simple way to append the history of each call, so that both cases end up with the same history as well? |
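Not built in, as far as I know, but since History.history is a plain dict of per-epoch lists, two runs can be concatenated by hand. A minimal sketch in pure Python (`h1` and `h2` stand in for the `.history` dicts of two consecutive fit calls):

```python
def merge_histories(h1, h2):
    """Concatenate two Keras `History.history` dicts (metric name ->
    list of per-epoch values) into a single combined record."""
    merged = {}
    for key in sorted(set(h1) | set(h2)):
        merged[key] = list(h1.get(key, [])) + list(h2.get(key, []))
    return merged

h1 = {"loss": [0.9, 0.7], "val_loss": [1.0, 0.8]}
h2 = {"loss": [0.6], "val_loss": [0.7]}
print(merge_histories(h1, h2))
# {'loss': [0.9, 0.7, 0.6], 'val_loss': [1.0, 0.8, 0.7]}
```

Metrics present in only one run (e.g. a metric added for the second fit) keep whatever values they have; the per-metric lists simply end up with different lengths.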
Just use the callback function below to resume training from the epoch where you stopped. |
Hello, I am loading data using a function, def load_data(labels_file, test_size). Now how do I access the data in fit to resume model training? Please help me out. |
Did you declare any X_train? Or did you split your dataset into train and test sets? The error shown in your screenshot tells me that you haven't declared X_train, which is where your training features go.

> Hello, I am loading data using this function. Now how do I access it in fit to resume model training?
>
> ```python
> def load_data(labels_file, test_size):
>     """
>     Load the labels CSV and split it into training and validation sets.
>     Parameters:
>         labels_file: The labels CSV file.
>         test_size: The size of the testing set.
>     """
>     labels = pd.read_csv(labels_file)
>     X = labels[['center', 'left', 'right']].values
>     y = labels['steering'].values
>     X_train, X_valid, y_train, y_valid = train_test_split(
>         X, y, test_size=test_size, random_state=0)
>     return X_train, X_valid, y_train, y_valid
> ```
>
> Please help me out.
|
First fix your issue with the train and test split, and then declare a callback as below.

```python
callback = tf.keras.callbacks.experimental.BackupAndRestore(
    backup_dir="temp")
```

Now use this callback while fitting your model. It will let you resume your training where you left off. Thank you and all the best.
|
Use:

X_train, X_valid, y_train, y_valid = load_data(your_labels_file, your_test_size)

Then try to fit your model in the same code, where history = model.fit(your code).

> Thanks for the reply. Yes, it is declared here. Now how do I access the local variable's value?
|
Okay, let me try this out. Thanks for the quick reply. |
I saved the model and weights after each epoch using callbacks.ModelCheckpoint. Now I want to resume training from the last epoch.
How do I set up the model.fit() call so that it starts from the previous epoch?