This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Reloading model and params from Checkpoint #51

Closed
odel-odel opened this issue Apr 8, 2019 · 5 comments

Comments

@odel-odel

Hi,
How can I reload the checkpoint and model file in order to continue from the last epoch I reached in a previous (aborted) run? I want to do this in the pretraining stage and also in the training stage.

Thanks,
Odel

@glample
Contributor

glample commented Apr 8, 2019

34825ea should do the trick. You still have to provide the parameters though: copy-paste the "running command" logged at the beginning of the train.log of the experiment whose checkpoint you want to reload, and add --reload_checkpoint EXP_PATH/checkpoint.pth to it.
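
For context, reloading a checkpoint boils down to the standard PyTorch pattern below (a minimal, self-contained sketch; the toy model and the key names "model" / "optimizer" / "epoch" are illustrative and may not match XLM's exact checkpoint.pth layout):

```python
import torch
import torch.nn as nn

# Toy stand-in for the real model; the checkpoint keys used here are
# illustrative, not necessarily what XLM writes into checkpoint.pth.
model = nn.Linear(8, 8)
optimizer = torch.optim.Adam(model.parameters())

# Save a checkpoint ...
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": 3}, "checkpoint.pth")

# ... and reload it to resume training from the following epoch.
checkpoint = torch.load("checkpoint.pth", map_location="cpu")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
start_epoch = checkpoint["epoch"] + 1
```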

@odel-odel
Author

Thank you for the quick response.
Now I'm getting a runtime error:

Traceback (most recent call last):
  File "train.py", line 330, in <module>
    main(params)
  File "train.py", line 250, in main
    trainer = SingleTrainer(model, data, params)
  File "/NMT/XLM/src/trainer.py", line 704, in __init__
    super().__init__(data, params)
  File "/NMT/XLM/src/trainer.py", line 94, in __init__
    self.reload_checkpoint()
  File "/NMT/XLM/src/trainer.py", line 457, in reload_checkpoint
    getattr(self, name).load_state_dict(data[name])
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for TransformerModel:
Missing key(s) in state_dict: "position_embeddings.weight", "lang_embeddings.weight", "embeddings.weight", "layer_norm_emb.bias", "layer_norm_emb.weight", "attentions.0.q_lin.bias", "attentions.0.q_lin.weight", "attentions.0.k_lin.bias", "attentions.0.k_lin.weight", "attentions.0.v_lin.bias", "attentions.0.v_lin.weight", "attentions.0.out_lin.bias", "attentions.0.out_lin.weight", "attentions.1.q_lin.bias", "attentions.1.q_lin.weight", "attentions.1.k_lin.bias", "attentions.1.k_lin.weight", "attentions.1.v_lin.bias", "attentions.1.v_lin.weight", "attentions.1.out_lin.bias", "attentions.1.out_lin.weight", "attentions.2.q_lin.bias", "attentions.2.q_lin.weight", "attentions.2.k_lin.bias", "attentions.2.k_lin.weight", "attentions.2.v_lin.bias", "attentions.2.v_lin.weight", "attentions.2.out_lin.bias", "attentions.2.out_lin.weight", "attentions.3.q_lin.bias", "attentions.3.q_lin.weight", "attentions.3.k_lin.bias", "attentions.3.k_lin.weight", "attentions.3.v_lin.bias", "attentions.3.v_lin.weight", "attentions.3.out_lin.bias", "attentions.3.out_lin.weight", "attentions.4.q_lin.bias", "attentions.4.q_lin.weight", "attentions.4.k_lin.bias", "attentions.4.k_lin.weight", "attentions.4.v_lin.bias", "attentions.4.v_lin.weight", "attentions.4.out_lin.bias", "attentions.4.out_lin.weight", "attentions.5.q_lin.bias", "attentions.5.q_lin.weight", "attentions.5.k_lin.bias", "attentions.5.k_lin.weight", "attentions.5.v_lin.bias", "attentions.5.v_lin.weight", "attentions.5.out_lin.bias", "attentions.5.out_lin.weight", "layer_norm1.0.bias", "layer_norm1.0.weight", "layer_norm1.1.bias", "layer_norm1.1.weight", "layer_norm1.2.bias", "layer_norm1.2.weight", "layer_norm1.3.bias", "layer_norm1.3.weight", "layer_norm1.4.bias", "layer_norm1.4.weight", "layer_norm1.5.bias", "layer_norm1.5.weight", "ffns.0.lin1.bias", "ffns.0.lin1.weight", "ffns.0.lin2.bias", "ffns.0.lin2.weight", "ffns.1.lin1.bias", "ffns.1.lin1.weight", "ffns.1.lin2.bias", "ffns.1.lin2.weight", "ffns.2.lin1.bias", "ffns.2.lin1.weight", "ffns.2.lin2.bias", "ffns.2.lin2.weight", "ffns.3.lin1.bias", "ffns.3.lin1.weight", "ffns.3.lin2.bias", "ffns.3.lin2.weight", "ffns.4.lin1.bias", "ffns.4.lin1.weight", "ffns.4.lin2.bias", "ffns.4.lin2.weight", "ffns.5.lin1.bias", "ffns.5.lin1.weight", "ffns.5.lin2.bias", "ffns.5.lin2.weight", "layer_norm2.0.bias", "layer_norm2.0.weight", "layer_norm2.1.bias", "layer_norm2.1.weight", "layer_norm2.2.bias", "layer_norm2.2.weight", "layer_norm2.3.bias", "layer_norm2.3.weight", "layer_norm2.4.bias", "layer_norm2.4.weight", "layer_norm2.5.bias", "layer_norm2.5.weight", "pred_layer.proj.bias", "pred_layer.proj.weight".
Unexpected key(s) in state_dict: "module.position_embeddings.weight", "module.lang_embeddings.weight", "module.embeddings.weight", "module.layer_norm_emb.weight", "module.layer_norm_emb.bias", "module.attentions.0.q_lin.weight", "module.attentions.0.q_lin.bias", "module.attentions.0.k_lin.weight", "module.attentions.0.k_lin.bias", "module.attentions.0.v_lin.weight", "module.attentions.0.v_lin.bias", "module.attentions.0.out_lin.weight", "module.attentions.0.out_lin.bias", "module.attentions.1.q_lin.weight", "module.attentions.1.q_lin.bias", "module.attentions.1.k_lin.weight", "module.attentions.1.k_lin.bias", "module.attentions.1.v_lin.weight", "module.attentions.1.v_lin.bias", "module.attentions.1.out_lin.weight", "module.attentions.1.out_lin.bias", "module.attentions.2.q_lin.weight", "module.attentions.2.q_lin.bias", "module.attentions.2.k_lin.weight", "module.attentions.2.k_lin.bias", "module.attentions.2.v_lin.weight", "module.attentions.2.v_lin.bias", "module.attentions.2.out_lin.weight", "module.attentions.2.out_lin.bias", "module.attentions.3.q_lin.weight", "module.attentions.3.q_lin.bias", "module.attentions.3.k_lin.weight", "module.attentions.3.k_lin.bias", "module.attentions.3.v_lin.weight", "module.attentions.3.v_lin.bias", "module.attentions.3.out_lin.weight", "module.attentions.3.out_lin.bias", "module.attentions.4.q_lin.weight", "module.attentions.4.q_lin.bias", "module.attentions.4.k_lin.weight", "module.attentions.4.k_lin.bias", "module.attentions.4.v_lin.weight", "module.attentions.4.v_lin.bias", "module.attentions.4.out_lin.weight", "module.attentions.4.out_lin.bias", "module.attentions.5.q_lin.weight", "module.attentions.5.q_lin.bias", "module.attentions.5.k_lin.weight", "module.attentions.5.k_lin.bias", "module.attentions.5.v_lin.weight", "module.attentions.5.v_lin.bias", "module.attentions.5.out_lin.weight", "module.attentions.5.out_lin.bias", "module.layer_norm1.0.weight", "module.layer_norm1.0.bias", "module.layer_norm1.1.weight", "module.layer_norm1.1.bias", "module.layer_norm1.2.weight", "module.layer_norm1.2.bias", "module.layer_norm1.3.weight", "module.layer_norm1.3.bias", "module.layer_norm1.4.weight", "module.layer_norm1.4.bias", "module.layer_norm1.5.weight", "module.layer_norm1.5.bias", "module.ffns.0.lin1.weight", "module.ffns.0.lin1.bias", "module.ffns.0.lin2.weight", "module.ffns.0.lin2.bias", "module.ffns.1.lin1.weight", "module.ffns.1.lin1.bias", "module.ffns.1.lin2.weight", "module.ffns.1.lin2.bias", "module.ffns.2.lin1.weight", "module.ffns.2.lin1.bias", "module.ffns.2.lin2.weight", "module.ffns.2.lin2.bias", "module.ffns.3.lin1.weight", "module.ffns.3.lin1.bias", "module.ffns.3.lin2.weight", "module.ffns.3.lin2.bias", "module.ffns.4.lin1.weight", "module.ffns.4.lin1.bias", "module.ffns.4.lin2.weight", "module.ffns.4.lin2.bias", "module.ffns.5.lin1.weight", "module.ffns.5.lin1.bias", "module.ffns.5.lin2.weight", "module.ffns.5.lin2.bias", "module.layer_norm2.0.weight", "module.layer_norm2.0.bias", "module.layer_norm2.1.weight", "module.layer_norm2.1.bias", "module.layer_norm2.2.weight", "module.layer_norm2.2.bias", "module.layer_norm2.3.weight", "module.layer_norm2.3.bias", "module.layer_norm2.4.weight", "module.layer_norm2.4.bias", "module.layer_norm2.5.weight", "module.layer_norm2.5.bias", "module.pred_layer.proj.weight", "module.pred_layer.proj.bias".

@glample
Contributor

glample commented Apr 8, 2019

Ah yes, I also had this because I tried to reload on a single GPU a model trained on multiple GPUs. The problem in that case is that with multi-GPU training the model is encapsulated in a module (this is why every parameter key in the reloaded checkpoint carries the extra module. prefix).

See 34825ea#diff-e750911d9404a6f817e2015251a4a654R458
I added a commented-out alternative there. Comment out:
getattr(self, name).load_state_dict(data[name])
and uncomment:
getattr(self, name).load_state_dict({k[len('module.'):]: v for k, v in data[name].items()})
That should solve the issue.
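
Outside of XLM, the same workaround in isolation looks like this (a minimal sketch; nn.DataParallel is what prepends the module. prefix when training on multiple GPUs):

```python
import torch.nn as nn

# Training with nn.DataParallel wraps the model, so every key in its
# state_dict gets a "module." prefix.
model = nn.Linear(8, 8)
parallel_model = nn.DataParallel(model)
state_dict = parallel_model.state_dict()
print(list(state_dict))  # ['module.weight', 'module.bias']

# To reload such a checkpoint into a plain (single-GPU) model, strip the
# prefix from every key before calling load_state_dict.
stripped = {k[len("module."):]: v for k, v in state_dict.items()}
model.load_state_dict(stripped)
```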

@odel-odel
Author

Thanks !!

@glample glample closed this as completed Apr 10, 2019
@bhardwaj1230

The fix above solved my issue: I trained TLM on multiple GPUs and am translating using just 1 GPU.
