
ModelOnCPU() has destructive side effects? #2541

Closed
kennysong opened this issue Apr 9, 2020 · 6 comments

Comments

@kennysong

kennysong commented Apr 9, 2020

Describe the bug

I'm training a `language_model_learner` to generate company names.

Surprisingly, I noticed that calling `learn.export()` mutates `learn` in some way that significantly drops the quality of the model.

I manually called each line of `learn.export()` to identify where things were going wrong, and found that the model quality drops on entering the `with ModelOnCPU(learn.model) as m:` context.

You can see what happens here:

[screenshot: Screen Shot 2020-04-09 at 4 07 25 PM]

(FYI, loading the exported model with `load_learner` also gives the same garbage predictions.)

Provide your installation details

```
=== Software ===
python        : 3.7.6
fastai        : 1.0.60
fastprogress  : 0.2.2
torch         : 1.4.0
nvidia driver : 418.87
torch cuda    : 10.1 / is available
torch cudnn   : 7603 / is enabled

=== Hardware ===
nvidia gpus   : 1
torch devices : 1
  - gpu0      : 15079MB | Tesla T4

=== Environment ===
platform      : Linux-4.9.0-12-amd64-x86_64-with-debian-9.12
distro        : #1 SMP Debian 4.9.210-1 (2020-01-20)
conda env     : base
python        : /opt/conda/bin/python
sys.path      : /home/jupyter/data/models
                /opt/conda/lib/python37.zip
                /opt/conda/lib/python3.7
                /opt/conda/lib/python3.7/lib-dynload
                /opt/conda/lib/python3.7/site-packages
                /opt/conda/lib/python3.7/site-packages/IPython/extensions
```

To Reproduce

Perhaps you can reproduce with a smaller example? (It takes a few hours to train the model.)

If not, here is my code:

1. To download the dataset: https://gist.github.com/kennysong/daf7b77a5f42860059b312dd57968512
2. To train the language model and generate predictions: https://gist.github.com/kennysong/9572da21461e10c9c6e7d7c8f6902ab4

Expected behavior

`learn.export()` should not alter `learn` at all.

Screenshots

See above.

Additional context

Also, a similar drop in performance happens if I do `learn.save()` and `learn2.load()`. `learn2` will generate the same garbage predictions, while `learn` is unmodified.

@sgugger
Contributor

sgugger commented Apr 9, 2020

Since you have the model available, it's going to be easier for you to test than for me. `ModelOnCPU` just does

```python
model.cpu()
model.to(previous_device)
```

Could you try those two lines directly (instead of `with ModelOnCPU(m) as m: pass`) and see if it still gives you garbage predictions?
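For anyone following along, the round-trip check described above can be sketched like this (a minimal sketch: `nn.Linear` stands in for the real learner's model, and on a CPU-only machine the move is a no-op, so reproducing the bug requires the actual GPU setup):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)          # stand-in for learn.model
x = torch.randn(1, 4)

before = model(x).detach().clone()

# The two lines ModelOnCPU effectively runs:
device = next(model.parameters()).device
model.cpu()
model.to(device)

after = model(x).detach()
# A well-behaved model round-trips exactly; a mismatch here would mean
# the device move itself is destructive.
print(torch.allclose(before, after))
```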

@kennysong
Author

Yep, it happens if I call those two lines as well.

[screenshot: Screen Shot 2020-04-09 at 11 06 01 PM]

Let me know if there's something else I should try.

@sgugger
Contributor

sgugger commented Apr 11, 2020

Just an update: I managed to reproduce this bug on a smaller scale and am trying to build a minimal reproducer to pin down its cause. I'm not sure yet whether it's inside fastai or PyTorch. It's a very subtle one, so it might take me a while to get to the root of it.

@kennysong
Author

Got it, thanks for the update – let me know if I can help.

@sgugger
Contributor

sgugger commented Apr 13, 2020

OK, I finally got to the bottom of this. It was due to `WeightDropout` not actually working. I fixed that, but note that you may need to adjust your weight dropout value.
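For readers wondering how a wrapper like `WeightDropout` can break this way, here is a torch-free toy analogy (hypothetical names, not fastai's actual code): the wrapper keeps a raw master weight alongside a derived effective weight, and an operation that rebuilds only the registered parameter leaves the derived copy pointing at stale storage.

```python
class ToyWeightDrop:
    """Toy analogy, not fastai code: holds a raw master weight
    (the registered parameter) and a derived effective weight."""
    def __init__(self, w):
        self.weight_raw = w   # master copy, what .cpu()/.to() would migrate
        self.weight = w       # effective copy used by forward()

    def to_device(self):
        # Simulated device move: only the registered parameter gets new
        # storage; the plain 'weight' attribute is left untouched.
        self.weight_raw = list(self.weight_raw)

    def forward(self, x):
        return [x * w for w in self.weight]

m = ToyWeightDrop([1.0, 2.0])
m.weight_raw[0] = 5.0          # before any move, both names share storage
print(m.forward(1.0))          # [5.0, 2.0]

m.to_device()
m.weight_raw[0] = 9.0          # e.g. loading saved weights
print(m.forward(1.0))          # still [5.0, 2.0]: forward() sees stale weights
```

Once the two copies diverge like this, the model silently runs on outdated weights, which matches the "garbage predictions after export/load" symptom in this thread.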

@kennysong
Author

Awesome, verified that it's working correctly on fastai master. Thank you Sylvain!
