Fix Train Step calculations for Checkpointing #279

tanaysoni · 2020-03-16T10:36:32Z

Currently, checkpointing is performed at the beginning of the train loop, which causes a training step to overlap on resume. For instance, suppose checkpoint_every is set to 100 and a checkpoint is saved at step 300. Training resumed from that checkpoint starts at step 300, so step 300 is executed twice.

This PR moves checkpointing to the end of the train loop. Thus, when training resumes from a checkpoint saved at step 300, it starts from step 301.
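The step arithmetic can be illustrated with a small simulation. This is only a sketch and not the code in farm/train.py: the functions `train` and `run_with_interruption` and the parameter `resume_offset` are hypothetical, steps are numbered from 1 to match the example above, and a checkpoint is assumed to record the last step whose update is already contained in the saved state.

```python
from collections import Counter


def train(first_step, last_step, checkpoint_every, executed):
    """Run steps first_step..last_step (inclusive); return the last checkpointed step."""
    last_checkpoint = None
    for step in range(first_step, last_step + 1):
        executed.append(step)  # stand-in for forward/backward/optimizer update
        # Checkpoint at the END of the iteration: the saved state already
        # includes the update made in `step`.
        if step % checkpoint_every == 0:
            last_checkpoint = step
    return last_checkpoint


def run_with_interruption(resume_offset):
    """Train to step 300, stop, then resume to step 400; count executions per step."""
    executed = []
    saved = train(1, 300, checkpoint_every=100, executed=executed)
    train(saved + resume_offset, 400, checkpoint_every=100, executed=executed)
    return Counter(executed)


overlap = run_with_interruption(resume_offset=0)  # resume AT the saved step
fixed = run_with_interruption(resume_offset=1)    # resume at the step after it

print(overlap[300], fixed[300])  # 2 1
```

Resuming at the saved step runs step 300 a second time, while resuming at the following step runs each of the 400 steps exactly once.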

farm/train.py

Fix train step calculations for train checkpointing

eae80f1

tanaysoni force-pushed the fix-train-checkpoint-steps branch from f61b3ee to eae80f1 Compare March 16, 2020 10:39

tanaysoni commented Mar 16, 2020

farm/train.py (outdated review thread, resolved)

tholor approved these changes Mar 16, 2020

Update train loop to start epoch/step iterators from 0

c3fd3df

tanaysoni requested a review from tholor March 16, 2020 11:47

tholor approved these changes Mar 16, 2020

Fix docstring

92aefd3

tanaysoni merged commit 5d78cbe into master Mar 16, 2020

tanaysoni deleted the fix-train-checkpoint-steps branch March 16, 2020 12:49
