resume training from previous epoch #1872
Do you want to do something special with the history? If not, you can just call |
When I call model.fit() after loading the model and weights, it shows epoch = 1. If I stop training at epoch 100, I want to resume training with epoch = 101. |
I think it does not matter whether it shows that training is at epoch = 1 or epoch = 101. |
Thank you. |
But there is a problem with this approach: what about hyperparameters that change with the epoch, say a learning rate with decay? Just restarting with the fit method doesn't take that into account. |
Yeah, this happens to me when I resume training by loading weights. I was training ResNet-18 on the ImageNet dataset; the model saved its weights at the 1st epoch, with lr = 0.1 at the start. I stopped it, then tried the resume functionality, and the model started again with the same lr = 0.1, and the loss increased at each iteration. To set the lr to its state at the 1st epoch, I changed it according to the SGD update rule, lr = lr * (1. / (1 + decay * iterations)); however, it didn't work: the loss still increases, just more slowly than with lr = 0.1. Probably I should lower the lr further, but I don't understand why the loss still increases even when the lr is set accordingly. |
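One detail worth stressing about the update rule quoted above: the decay is applied per iteration (batch), not per epoch, so reproducing the schedule after a resume requires the global batch count, not the epoch number. A minimal sketch in pure Python, assuming the legacy time-based SGD decay (`initial_lr`, `decay`, and `iterations` are placeholder values):

```python
def decayed_lr(initial_lr, decay, iterations):
    """Time-based decay used by the legacy Keras SGD optimizer:
    lr = initial_lr * 1 / (1 + decay * iterations).
    Note that `iterations` counts batches, not epochs."""
    return initial_lr * (1.0 / (1.0 + decay * iterations))

# With lr=0.1 and decay=1e-4, the rate is halved after 10,000 batches,
# which is why resuming with the raw initial lr=0.1 overshoots.
print(decayed_lr(0.1, 1e-4, 0))      # 0.1
print(decayed_lr(0.1, 1e-4, 10000))  # 0.05
```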
Try the initial_epoch argument of the .fit method. |
using initial_epoch didn't work in this case |
Setting the |
Besides using the
This allows using the same callback and it just appends to the end. @fchollet should I post a pull request for this? It seems to me that this is more useful than the current behaviour of overwriting the |
@MartinThoma ,
to make your suggestion work, right? I've tried to make this work, and eventually ran into the feeling that too many different things would have to change; see #6697. What do you think? |
If you want to resume from epoch 101, simply use initial_epoch=101 in model.fit(). From the docs: initial_epoch: epoch at which to start training (useful for resuming a previous training run). |
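A minimal, self-contained sketch of that suggestion (assumes TensorFlow 2.x; the tiny random-data model is only for illustration). Note that initial_epoch just shifts the epoch counter seen by logs and callbacks; it does not restore any weights by itself:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, standing in for a real training setup.
x = np.random.rand(32, 4).astype("float32")
y = np.random.rand(32, 1).astype("float32")

model = tf.keras.Sequential([tf.keras.Input(shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")

# Pretend a previous run covered epochs 0-2 (e.g. weights were loaded
# from a checkpoint); resume at epoch 3 and stop before epoch 5.
history = model.fit(x, y, epochs=5, initial_epoch=3, verbose=0)
print(history.epoch)  # [3, 4] -- only two epochs actually run
```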
It seems that TensorFlow Estimators also support resuming training: "Since the state of the model is persisted (in model_dir=PATH above), the model will improve the more iterations you train it, until it settles." |
Related question: what happens to the gradient computations that rely on a history of the gradients (when momentum is present, as in Adam and most gradient-descent variants)? Does the checkpoint store these as well? Thanks! |
@bupedroni: As far as I know, every time I loaded an existing model, all the hyperparameters were reset to their default values. The best way to resume is to write a custom callback that stores all the hyperparameters, then start training as mentioned by @MartinThoma. |
@MartinThoma I'd like a pull request implementing that. Basically, I'm training a model, and if I notice the metrics haven't diverged I'd like to train for another x epochs, and also be able to plot the overall history additively. For now I'm just accumulating histories like this: https://www.kaggle.com/morenoh149/keras-continue-training |
Still have this issue... any update on it? |
anything new here? |
just port your code to pytorch 😆 |
Ya. That actually worked for me. 2 years and counting. |
So does that mean that if I call … and model.fit(epochs=5), both are the same? |
Yes, they are equivalent. At least that is what I found using the TensorFlow Keras API in TensorFlow 2.0 |
How can I get the epoch at which the model was saved by ModelCheckpoint? |
Save the epoch number in the name of the model file, then fetch that number with a regex when resuming training. |
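A minimal sketch of that approach in pure Python. The `weights.{epoch:03d}.h5` filename pattern is an assumption here; adjust the regex to whatever filepath template you give ModelCheckpoint:

```python
import re

def epoch_from_checkpoint(filename):
    """Extract the epoch number from a checkpoint name such as
    'weights.042.h5'; return 0 (start fresh) if none is found."""
    match = re.search(r"\.(\d+)\.h5$", filename)
    return int(match.group(1)) if match else 0

initial_epoch = epoch_from_checkpoint("weights.042.h5")
print(initial_epoch)  # 42
# Then resume with, e.g.:
# model.fit(x, y, epochs=100, initial_epoch=initial_epoch)
```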
I managed to do this with an optimizer whose learning rate depends on the iteration count, e.g. Adam. Here is the pseudo-code:

```python
if os.path.isfile(checkpoint_path + ".index"):
    # This loads `(root).optimizer.iter` from the checkpoint
    model.load_weights(checkpoint_path)

# Recover the iteration count from the optimizer and convert it to epochs
initial_epoch = model.optimizer.iterations.numpy() // STEPS_PER_EPOCH

callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                              save_weights_only=True)
model.fit(train_data, epochs=NUM_EPOCHS, initial_epoch=initial_epoch,
          callbacks=[callback])
```

Hope this helps :-) |
I got tired of this so I ended up writing a Keras wrapper that autosaves and restores the epoch number, training history, and model weights:
Let me know what you think. PRs more than welcome. |
@dorukkarinca is this handled in TensorFlow v2? That's supposed to supersede Keras. |
@morenoh149 not to the best of my knowledge. This wrapper wraps tensorflow.keras anyway. |
@dorukkarinca UwU and Orz. Your wrapper helped me so much. I don't know why I wasted a week retraining from the start. |
While they are the same, is there a simple way to append the history of each call, so that both cases end up with the same history as well? |
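Not built in, as far as I know, but since History.history is a plain dict of per-epoch lists, two runs can be concatenated by hand. A minimal sketch in pure Python (`h1` and `h2` stand in for the `.history` dicts of two consecutive fit calls):

```python
def merge_histories(h1, h2):
    """Concatenate two Keras `History.history` dicts (metric name ->
    list of per-epoch values) into a single combined record."""
    merged = {}
    for key in sorted(set(h1) | set(h2)):
        merged[key] = list(h1.get(key, [])) + list(h2.get(key, []))
    return merged

h1 = {"loss": [0.9, 0.7], "val_loss": [1.0, 0.8]}
h2 = {"loss": [0.6], "val_loss": [0.7]}
print(merge_histories(h1, h2))
# {'loss': [0.9, 0.7, 0.6], 'val_loss': [1.0, 0.8, 0.7]}
```

Metrics present in only one run (e.g. a metric added for the second fit) keep whatever values they have; the per-metric lists simply end up with different lengths.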
Just use the callback function below to resume training from the epoch where you stopped. |
Hello, I am loading data using a function, def load_data(labels_file, test_size). Now how do I access the data in fit to resume model training? Please help me out. |
Did you declare any X_train? Or did you split your dataset into train and test sets? The error shown in your screenshot tells me that you haven't declared X_train, which is where your training features go.

> Hello, I am loading data using this function. Now how do I access it in fit to resume model training?
>
> ```python
> def load_data(labels_file, test_size):
>     """
>     Load the labels CSV and split it into training and validation sets.
>     Parameters:
>         labels_file: The labels CSV file.
>         test_size: The size of the testing set.
>     """
>     labels = pd.read_csv(labels_file)
>     X = labels[['center', 'left', 'right']].values
>     y = labels['steering'].values
>     X_train, X_valid, y_train, y_valid = train_test_split(
>         X, y, test_size=test_size, random_state=0)
>     return X_train, X_valid, y_train, y_valid
> ```
>
> Please help me out.
|
First fix your issue with the train and test split, and then declare a callback as below.

```python
callback = tf.keras.callbacks.experimental.BackupAndRestore(
    backup_dir="temp")
```

Now use this callback while fitting your model. It will let you resume your training where you left off. Thank you and all the best.
|
Use:

X_train, X_valid, y_train, y_valid = load_data(your_labels_file, your_test_size)

Then try to fit your model in the same code, where history = model.fit(your code).

> Thanks for the reply. Yes, it is declared here. Now how do I access the local variable's value?
|
Okay, let me try this out. Thanks for the quick reply. |
I saved the model and weights after each epoch using callbacks.ModelCheckpoint. Now I want to resume training from the last epoch.
How do I set up the model.fit() call so that it starts from the previous epoch?