
Model save not working #3

Closed
jpilaul opened this issue Aug 2, 2021 · 6 comments

Comments

@jpilaul

jpilaul commented Aug 2, 2021

There are a few checkpoint_callbacks being created in lightning_base.py, and I think that the callback on line https://github.com/XiangLi1999/PrefixTuning/blob/cleaned/seq2seq/lightning_base.py#L749 does not allow us to save the model. I am rerunning the model right now, without that line, to verify. However, since training takes a long time, I was hoping you could help me fix model saving.
Thanks
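
For reference, here is roughly how a checkpoint callback gets wired into the Trainer in pytorch-lightning 0.9.x; this is only a minimal sketch of the standard API to show which object I mean, and the actual arguments used in lightning_base.py may differ:

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import ModelCheckpoint

    # Keep the single best checkpoint according to validation loss.
    checkpoint_callback = ModelCheckpoint(
        filepath="checkpoints/",  # where checkpoints are written (hypothetical path)
        monitor="val_loss",
        mode="min",
        save_top_k=1,
    )

    # Passing the callback here is what triggers saving at validation end.
    trainer = Trainer(checkpoint_callback=checkpoint_callback)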

@XiangLi1999
Owner

XiangLi1999 commented Aug 2, 2021

I wonder what the issue with model saving is? Could you be more specific? Is it not saving any models at all? If so, I think you should check the version of pytorch-lightning you installed. I think pytorch-lightning==0.8.5 should work!

Edit: it should be pytorch-lightning==0.9.0, NOT 0.8.5.
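
To double-check which version actually ends up installed, a quick sanity check (nothing specific to this repo):

    import pytorch_lightning
    print(pytorch_lightning.__version__)  # should print 0.9.0, per the edit above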

@jpilaul
Copy link
Author

jpilaul commented Aug 4, 2021

Nope, that doesn't work either. I tried pytorch-lightning==0.8.5 and reverted the changes I had made to get the code running. I am getting the following error with your current code version:

Traceback (most recent call last):
  File "/home/ubuntu/Projects/PrefixTuning/seq2seq/finetune.py", line 879, in <module>
    main(args)
  File "/home/ubuntu/Projects/PrefixTuning/seq2seq/finetune.py", line 787, in main
    logger=logger,
  File "/home/ubuntu/Projects/PrefixTuning/seq2seq/lightning_base.py", line 792, in generic_train
    trainer.fit(model)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 992, in fit
    results = self.spawn_ddp_children(model)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 462, in spawn_ddp_children
    results = self.ddp_train(local_rank, q=None, model=model, is_master=True)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 560, in ddp_train
    results = self.run_pretrain_routine(model)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1213, in run_pretrain_routine
    self.train()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 370, in train
    self.run_training_epoch()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 470, in run_training_epoch
    self.run_evaluation(test_mode=False)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 430, in run_evaluation
    self.on_validation_end()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/callback_hook.py", line 112, in on_validation_end
    callback.on_validation_end(self, self.get_model())
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py", line 12, in wrapped_fn
    return fn(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 318, in on_validation_end
    self._save_model(filepath)
TypeError: _save_model() missing 2 required positional arguments: 'trainer' and 'pl_module'
Exception ignored in: <bound method tqdm.__del__ of <tqdm.asyncio.tqdm_asyncio object at 0x7f3cc022f4a8>>
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/tqdm/std.py", line 1138, in __del__
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/tqdm/std.py", line 1285, in close
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/tqdm/std.py", line 1478, in display
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/tqdm/std.py", line 1141, in __str__
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/tqdm/std.py", line 1436, in format_dict
TypeError: 'NoneType' object is not iterable
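
My guess (just a reading of the traceback, not a confirmed diagnosis) is that the first TypeError comes from a signature mismatch around ModelCheckpoint._save_model across pytorch-lightning releases: the installed trainer code calls the hook with only a filepath, while the method that ends up bound also expects trainer and pl_module. Roughly:

    # In pytorch-lightning 0.8.x/0.9.x, on_validation_end calls the hook as
    # self._save_model(filepath), matching this signature:
    def _save_model(self, filepath):
        ...

    # In later releases the hook also receives the trainer and the LightningModule:
    def _save_model(self, filepath, trainer, pl_module):
        ...

Running code written against one of these conventions with a package installed at the other version produces exactly this kind of "missing required positional arguments" error.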

@XiangLi1999
Owner

XiangLi1999 commented Aug 4, 2021

I still think this is a package version issue. Here is a solution. Could you configure your virtual env using this docker image? xlisali/xlisali:prefix3
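
(For example, docker pull xlisali/xlisali:prefix3 followed by docker run -it xlisali/xlisali:prefix3 /bin/bash should drop you into that environment, assuming the image ships an interactive shell.)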

@luxuantao

I have the same problem

@XiangLi1999
Owner

XiangLi1999 commented Aug 13, 2021

Could you try pip install pytorch-lightning==0.9.0 and let me know if this solves the problem?
(I will edit my previous post if this solves the problem for both of you!)

(Side note: I looked into the problem and realized that I had actually used version 0.9.0. I had previously relied on `conda env export`, but it doesn't print the pytorch-lightning version I actually used.)

@luxuantao


It works! Thanks!

@jpilaul jpilaul closed this as completed Aug 27, 2021