This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Reloading model and params from Checkpoint #51

Closed
odel-odel opened this issue Apr 8, 2019 · 5 comments

Comments

@odel-odel

Hi,
How can I reload the checkpoint and model file in order to continue from the last epoch I reached in a previous (aborted) run? I want to do this in the pretraining stage and also in the training stage.

Thanks,
Odel

@glample
Contributor

glample commented Apr 8, 2019

34825ea should do the trick. You still have to provide the parameters though: copy-paste the "running command" logged at the beginning of the train.log of the experiment whose checkpoint you want to reload, and add --reload_checkpoint EXP_PATH/checkpoint.pth to it.
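
For context, reloading a checkpoint boils down to the standard PyTorch pattern below (a minimal, self-contained sketch; the toy model and the key names "model" / "optimizer" / "epoch" are illustrative and may not match XLM's exact checkpoint.pth layout):

```python
import torch
import torch.nn as nn

# Toy stand-in for the real model; the checkpoint keys used here are
# illustrative, not necessarily what XLM writes into checkpoint.pth.
model = nn.Linear(8, 8)
optimizer = torch.optim.Adam(model.parameters())

# Save a checkpoint ...
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": 3}, "checkpoint.pth")

# ... and reload it to resume training from the following epoch.
checkpoint = torch.load("checkpoint.pth", map_location="cpu")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
start_epoch = checkpoint["epoch"] + 1
```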

@odel-odel
Author

Thank you for the quick response.
Now I'm getting a runtime error:

Traceback (most recent call last):
  File "train.py", line 330, in <module>
    main(params)
  File "train.py", line 250, in main
    trainer = SingleTrainer(model, data, params)
  File "/NMT/XLM/src/trainer.py", line 704, in __init__
    super().__init__(data, params)
  File "/NMT/XLM/src/trainer.py", line 94, in __init__
    self.reload_checkpoint()
  File "/NMT/XLM/src/trainer.py", line 457, in reload_checkpoint
    getattr(self, name).load_state_dict(data[name])
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for TransformerModel:
Missing key(s) in state_dict: "position_embeddings.weight", "lang_embeddings.weight", "embeddings.weight", "layer_norm_emb.bias", "layer_norm_emb.weight", "attentions.0.q_lin.bias", "attentions.0.q_lin.weight", "attentions.0.k_lin.bias", "attentions.0.k_lin.weight", "attentions.0.v_lin.bias", "attentions.0.v_lin.weight", "attentions.0.out_lin.bias", "attentions.0.out_lin.weight", "attentions.1.q_lin.bias", "attentions.1.q_lin.weight", "attentions.1.k_lin.bias", "attentions.1.k_lin.weight", "attentions.1.v_lin.bias", "attentions.1.v_lin.weight", "attentions.1.out_lin.bias", "attentions.1.out_lin.weight", "attentions.2.q_lin.bias", "attentions.2.q_lin.weight", "attentions.2.k_lin.bias", "attentions.2.k_lin.weight", "attentions.2.v_lin.bias", "attentions.2.v_lin.weight", "attentions.2.out_lin.bias", "attentions.2.out_lin.weight", "attentions.3.q_lin.bias", "attentions.3.q_lin.weight", "attentions.3.k_lin.bias", "attentions.3.k_lin.weight", "attentions.3.v_lin.bias", "attentions.3.v_lin.weight", "attentions.3.out_lin.bias", "attentions.3.out_lin.weight", "attentions.4.q_lin.bias", "attentions.4.q_lin.weight", "attentions.4.k_lin.bias", "attentions.4.k_lin.weight", "attentions.4.v_lin.bias", "attentions.4.v_lin.weight", "attentions.4.out_lin.bias", "attentions.4.out_lin.weight", "attentions.5.q_lin.bias", "attentions.5.q_lin.weight", "attentions.5.k_lin.bias", "attentions.5.k_lin.weight", "attentions.5.v_lin.bias", "attentions.5.v_lin.weight", "attentions.5.out_lin.bias", "attentions.5.out_lin.weight", "layer_norm1.0.bias", "layer_norm1.0.weight", "layer_norm1.1.bias", "layer_norm1.1.weight", "layer_norm1.2.bias", "layer_norm1.2.weight", "layer_norm1.3.bias", "layer_norm1.3.weight", "layer_norm1.4.bias", "layer_norm1.4.weight", "layer_norm1.5.bias", "layer_norm1.5.weight", "ffns.0.lin1.bias", "ffns.0.lin1.weight", "ffns.0.lin2.bias", "ffns.0.lin2.weight", "ffns.1.lin1.bias", "ffns.1.lin1.weight", "ffns.1.lin2.bias", "ffns.1.lin2.weight", "ffns.2.lin1.bias", "ffns.2.lin1.weight", "ffns.2.lin2.bias", "ffns.2.lin2.weight", "ffns.3.lin1.bias", "ffns.3.lin1.weight", "ffns.3.lin2.bias", "ffns.3.lin2.weight", "ffns.4.lin1.bias", "ffns.4.lin1.weight", "ffns.4.lin2.bias", "ffns.4.lin2.weight", "ffns.5.lin1.bias", "ffns.5.lin1.weight", "ffns.5.lin2.bias", "ffns.5.lin2.weight", "layer_norm2.0.bias", "layer_norm2.0.weight", "layer_norm2.1.bias", "layer_norm2.1.weight", "layer_norm2.2.bias", "layer_norm2.2.weight", "layer_norm2.3.bias", "layer_norm2.3.weight", "layer_norm2.4.bias", "layer_norm2.4.weight", "layer_norm2.5.bias", "layer_norm2.5.weight", "pred_layer.proj.bias", "pred_layer.proj.weight".
Unexpected key(s) in state_dict: "module.position_embeddings.weight", "module.lang_embeddings.weight", "module.embeddings.weight", "module.layer_norm_emb.weight", "module.layer_norm_emb.bias", "module.attentions.0.q_lin.weight", "module.attentions.0.q_lin.bias", "module.attentions.0.k_lin.weight", "module.attentions.0.k_lin.bias", "module.attentions.0.v_lin.weight", "module.attentions.0.v_lin.bias", "module.attentions.0.out_lin.weight", "module.attentions.0.out_lin.bias", "module.attentions.1.q_lin.weight", "module.attentions.1.q_lin.bias", "module.attentions.1.k_lin.weight", "module.attentions.1.k_lin.bias", "module.attentions.1.v_lin.weight", "module.attentions.1.v_lin.bias", "module.attentions.1.out_lin.weight", "module.attentions.1.out_lin.bias", "module.attentions.2.q_lin.weight", "module.attentions.2.q_lin.bias", "module.attentions.2.k_lin.weight", "module.attentions.2.k_lin.bias", "module.attentions.2.v_lin.weight", "module.attentions.2.v_lin.bias", "module.attentions.2.out_lin.weight", "module.attentions.2.out_lin.bias", "module.attentions.3.q_lin.weight", "module.attentions.3.q_lin.bias", "module.attentions.3.k_lin.weight", "module.attentions.3.k_lin.bias", "module.attentions.3.v_lin.weight", "module.attentions.3.v_lin.bias", "module.attentions.3.out_lin.weight", "module.attentions.3.out_lin.bias", "module.attentions.4.q_lin.weight", "module.attentions.4.q_lin.bias", "module.attentions.4.k_lin.weight", "module.attentions.4.k_lin.bias", "module.attentions.4.v_lin.weight", "module.attentions.4.v_lin.bias", "module.attentions.4.out_lin.weight", "module.attentions.4.out_lin.bias", "module.attentions.5.q_lin.weight", "module.attentions.5.q_lin.bias", "module.attentions.5.k_lin.weight", "module.attentions.5.k_lin.bias", "module.attentions.5.v_lin.weight", "module.attentions.5.v_lin.bias", "module.attentions.5.out_lin.weight", "module.attentions.5.out_lin.bias", "module.layer_norm1.0.weight", "module.layer_norm1.0.bias", "module.layer_norm1.1.weight", "module.layer_norm1.1.bias", "module.layer_norm1.2.weight", "module.layer_norm1.2.bias", "module.layer_norm1.3.weight", "module.layer_norm1.3.bias", "module.layer_norm1.4.weight", "module.layer_norm1.4.bias", "module.layer_norm1.5.weight", "module.layer_norm1.5.bias", "module.ffns.0.lin1.weight", "module.ffns.0.lin1.bias", "module.ffns.0.lin2.weight", "module.ffns.0.lin2.bias", "module.ffns.1.lin1.weight", "module.ffns.1.lin1.bias", "module.ffns.1.lin2.weight", "module.ffns.1.lin2.bias", "module.ffns.2.lin1.weight", "module.ffns.2.lin1.bias", "module.ffns.2.lin2.weight", "module.ffns.2.lin2.bias", "module.ffns.3.lin1.weight", "module.ffns.3.lin1.bias", "module.ffns.3.lin2.weight", "module.ffns.3.lin2.bias", "module.ffns.4.lin1.weight", "module.ffns.4.lin1.bias", "module.ffns.4.lin2.weight", "module.ffns.4.lin2.bias", "module.ffns.5.lin1.weight", "module.ffns.5.lin1.bias", "module.ffns.5.lin2.weight", "module.ffns.5.lin2.bias", "module.layer_norm2.0.weight", "module.layer_norm2.0.bias", "module.layer_norm2.1.weight", "module.layer_norm2.1.bias", "module.layer_norm2.2.weight", "module.layer_norm2.2.bias", "module.layer_norm2.3.weight", "module.layer_norm2.3.bias", "module.layer_norm2.4.weight", "module.layer_norm2.4.bias", "module.layer_norm2.5.weight", "module.layer_norm2.5.bias", "module.pred_layer.proj.weight", "module.pred_layer.proj.bias".

@glample
Contributor

glample commented Apr 8, 2019

Ah yes, I also had this because I tried to reload on a single GPU a model trained on multiple GPUs. The problem in that case is that with multi-GPU training the model is encapsulated in a module (this is why every parameter key in the reloaded checkpoint carries the extra module. prefix).

See 34825ea#diff-e750911d9404a6f817e2015251a4a654R458
I added a commented-out alternative there. Comment out:
getattr(self, name).load_state_dict(data[name])
and uncomment:
getattr(self, name).load_state_dict({k[len('module.'):]: v for k, v in data[name].items()})
That should solve the issue.
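
Outside of XLM, the same workaround in isolation looks like this (a minimal sketch; nn.DataParallel is what prepends the module. prefix when training on multiple GPUs):

```python
import torch.nn as nn

# Training with nn.DataParallel wraps the model, so every key in its
# state_dict gets a "module." prefix.
model = nn.Linear(8, 8)
parallel_model = nn.DataParallel(model)
state_dict = parallel_model.state_dict()
print(list(state_dict))  # ['module.weight', 'module.bias']

# To reload such a checkpoint into a plain (single-GPU) model, strip the
# prefix from every key before calling load_state_dict.
stripped = {k[len("module."):]: v for k, v in state_dict.items()}
model.load_state_dict(stripped)
```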

@odel-odel
Author

Thanks !!

@glample glample closed this as completed Apr 10, 2019
@bhardwaj1230

The fix above solved my issue: I trained TLM on multiple GPUs and am translating using just 1 GPU.
