
Save load #1502

Merged
merged 3 commits into fastai:master on Jan 23, 2019

Conversation

pouannes

The export and load_learner methods were only working when a GPU with CUDA was available, so there was no way to export a model and then load it on a CPU-only device.
You can now do that by specifying device='cpu' when calling load_learner.
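
For illustration, a minimal sketch of the intended workflow; the export directory and the device keyword are taken from this PR's description, and the final merged API may differ (see the discussion below):

# On the GPU machine: serialize the whole Learner state to 'export.pkl'.
learn.export()

# On the CPU-only machine: load it with the state mapped to the CPU.
from fastai.basic_train import load_learner
learn = load_learner('path/to/export/dir', device='cpu')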

@sgugger
Contributor

sgugger commented Jan 23, 2019

Good suggestion. I may change it a little bit because I'm not sure what happens if we pass a GPU as the device (since the state contains other things besides the model), so it may end up just being a flag cpu=True if device doesn't work for every device.
Thanks a lot!

@sgugger sgugger merged commit 90c6dd4 into fastai:master Jan 23, 2019
@njaremko

njaremko commented Jan 30, 2019

@sgugger The changes you made after this merge have broken this again. Trying to load an exported model on a CPU-only machine gets the CUDA error in 1.0.42 and on master.

@sgugger
Contributor

sgugger commented Jan 30, 2019

You have to re-export any previously saved Learner with fastai v1.0.42 (or later), but normally the bug isn't there anymore.

@njaremko

njaremko commented Jan 30, 2019

Before making my comment, I trained a Learner from scratch with 1.0.42 and with master to make sure. It's still happening for me.

Jan 30 09:24:39 PM  Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location='cpu' to map your storages to the CPU.
Jan 30 09:24:39 PM  Traceback (most recent call last):
  File "app/server.py", line 31, in setup_learner
    learn = load_learner(path, export_file_name, True)
  File "/usr/local/lib/python3.7/site-packages/fastai/basic_train.py", line 469, in load_learner
    state = torch.load(open(Path(path)/fname, 'rb'))
  File "/usr/local/lib/python3.7/site-packages/torch/serialization.py", line 367, in load
    return _load(f, map_location, pickle_module)
  File "/usr/local/lib/python3.7/site-packages/torch/serialization.py", line 538, in _load
    result = unpickler.load()
  File "/usr/local/lib/python3.7/site-packages/torch/serialization.py", line 504, in persistent_load
    data_type(size), location)
  File "/usr/local/lib/python3.7/site-packages/torch/serialization.py", line 113, in default_restore_location
    result = fn(storage, location)
  File "/usr/local/lib/python3.7/site-packages/torch/serialization.py", line 94, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/usr/local/lib/python3.7/site-packages/torch/serialization.py", line 78, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location='cpu' to map your storages to the CPU.
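
For reference, the workaround the error message itself suggests looks like this (a sketch only; it bypasses load_learner, and the file name is the fastai default):

import torch

# Force every pickled storage onto the CPU while deserializing.
state = torch.load('export.pkl', map_location='cpu')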

@sgugger
Contributor

sgugger commented Jan 30, 2019

You should check the fastai version on the machine where you export your 'export.pkl' file, then provide us with the code you're running. I just confirmed I had no problem loading an exported file on a CPU-only instance.

@njaremko

Double-checked both machines. Both are running 1.0.42.

Code used to generate model:

from fastai import *
from fastai.vision import *

# Loading data
bs = 32
df = pd.read_csv('./labels.csv', header='infer')
data = (ImageItemList.from_df(df, path='.', folder='train')
        .random_split_by_pct()
        .label_from_df(label_cls=FloatList)
        .transform(get_transforms(), size=299)
        .databunch(bs=bs).normalize(imagenet_stats))

# Create model
learn = create_cnn(data, models.resnet101)

# Use mixed precision
learn.to_fp16()
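# NB: to_fp16 attaches the MixedPrecision callback to the Learner;
# as noted further down in this thread, that callback kept GPU tensors
# in the exported state at the time.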

# Find Learning Rate
learn.lr_find()
learn.recorder.plot()

# Fit
learn.fit_one_cycle(10)
learn.save('stage-1-101')

# Prep for fine tuning
learn.unfreeze()
learn.lr_find()
learn.recorder.plot()

# Fine Tuning
learn.fit_one_cycle(10, max_lr=slice(1e-6, 3e-4))
learn.save('stage-2-101')

# Export
learn.to_fp32()
learn.export('prod-101.pkl')

@sgugger
Contributor

sgugger commented Jan 31, 2019

Ok, the problem was different: to_fp32 doesn't properly remove the MixedPrecision callback, and even if you don't call it, the MixedPrecision callback saves some stuff on the GPU. Both should be fixed now.
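
Until that fix reached a release, one manual workaround was to strip the callback yourself before exporting. A sketch, assuming fastai v1's learn.callbacks list and the MixedPrecision class name:

# Convert the weights back to FP32 and drop the half-precision
# callback, so no CUDA-specific state gets pickled into the export.
learn = learn.to_fp32()
learn.callbacks = [cb for cb in learn.callbacks
                   if cb.__class__.__name__ != 'MixedPrecision']
learn.export()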

@DnzzL

DnzzL commented Jan 31, 2019

Hello.
Both machines have fastai 1.0.42 installed.
I have retrained my model with the latest version, but I still have this issue too.
Why was the boolean cpu removed from the loading function in commit a92d6c6?

@njaremko

njaremko commented Jan 31, 2019

You'll need to install fastai from master if you want this fix, as it's not in a release yet (assuming you have the same fp16 issue).

It was removed because the logic was changed to always save models for CPU.
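
Conceptually, "always save for CPU" means something like the following (a sketch of the idea, not fastai's actual implementation):

import torch

def export_cpu(model, path):
    # Move every tensor to the CPU before pickling, so the file can be
    # deserialized anywhere, regardless of CUDA availability.
    state = {k: v.cpu() for k, v in model.state_dict().items()}
    torch.save(state, path)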

@DnzzL

DnzzL commented Feb 1, 2019

I have installed the latest version of fastai with pip install git+https://github.com/fastai/fastai.git, so I have version 1.0.43.dev0 for training; I ran the whole pipeline again and saved the classifier.
The inference machine is still on version 1.0.42, but I still have the problem:

If you are running on a CPU-only machine, please use torch.load with map_location='cpu' to map your storages to the CPU.

I think it's more flexible to be able to choose the device at loading time rather than at saving time.

@sgugger
Contributor

sgugger commented Feb 1, 2019

The boolean cpu wasn't working: there was still an error on some CPU-only machines, which is why it was removed. If you still get the error, it means some tensors were saved on the GPU, and they shouldn't really be there. Sharing your code would help us debug this.

@DnzzL

DnzzL commented Feb 1, 2019

You'll have déjà vu: it's based on your DeepFrench notebook ;)

import fastai
import torch
print(fastai.__version__) # 1.0.43.dev0
print(torch.__version__) # 1.0.0

from fastai import *
from fastai.text import *
import pandas as pd

lang = "en" 
if lang == "fr":
  weights_pretrained = 'wref30k'
  itos_pretrained = 'itosref30k'
  pretained_data = (weights_pretrained, itos_pretrained)
if lang == "es":
  weights_pretrained = 'model-30k-vocab-noqrnn'
  itos_pretrained = 'itos_pretrained'
if lang != "en":
  pretained_data = (weights_pretrained, itos_pretrained)

PATH_LM = Path(f'data/{lang}')
PATH_CS = Path(f'data/{lang}')

train_df = pd.read_csv(f"{PATH_LM}/train.csv")
valid_df = pd.read_csv(f"{PATH_LM}/valid.csv")
test_df = pd.read_csv(f"{PATH_LM}/test.csv")

tokenizer = Tokenizer(lang=f'{lang}', n_cpus=8)
data_lm = TextLMDataBunch.from_df(PATH_LM, tokenizer=tokenizer, bs=32, train_df=train_df, valid_df=valid_df, test_df=test_df)

"""# Fine tuning LM"""

if lang == "en":
  learn = language_model_learner(data_lm, pretrained_model=URLs.WT103_1, drop_mult=0)
else:
  learn = language_model_learner(data_lm, pretrained_fnames=pretrained_data, drop_mult=0)
learn.freeze()

learn.lr_find()
learn.recorder.plot(skip_start=0)
learn.fit_one_cycle(1, 1e-2)
learn.save(f'{lang}_head_pretrained')

learn.unfreeze()
learn.fit_one_cycle(2, 1e-3, moms=(0.8,0.7))
learn.save(f'{lang}_lm_fine_tuned')
learn.save_encoder(f'{lang}_ft_enc')

"""# Classification task"""

train_df = pd.read_csv(f"{PATH_CS}/train.csv", names=["label", "text"])
valid_df = pd.read_csv(f"{PATH_CS}/valid.csv")
test_df = pd.read_csv(f"{PATH_CS}/test.csv")

labelcounts = train_df.groupby(["label"]).size()
label_sum = len(train_df["label"])
class_imbalance = [(count/label_sum) for count in labelcounts]
print(class_imbalance)

data_clas = TextClasDataBunch.from_df(PATH_CS, train_df=train_df, valid_df=valid_df, test_df=test_df, tokenizer=tokenizer, vocab=data_lm.train_ds.vocab, bs=32)

learn = text_classifier_learner(data_clas, drop_mult=0.3)
learn.load_encoder(f'{lang}_ft_enc')
learn.freeze()

"""## Take class imbalance into account"""

weights_balance = [(1-count/label_sum) for count in labelcounts]
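# NB: .cuda() below puts the weight tensor on the GPU; as diagnosed
# further down in this thread, it gets pickled into the export file
# and breaks loading on CPU-only machines.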
loss_weights = torch.FloatTensor(weights_balance).cuda()
learn.crit = partial(F.cross_entropy, weight=loss_weights)

"""## Train classifier"""

learn.lr_find()
learn.recorder.plot(skip_start=0)

starting_lr = 1e-2

learn.fit_one_cycle(2, starting_lr, moms=(0.8,0.7))

learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(starting_lr/(2.6**4),1e-2), moms=(0.8,0.7))

learn.freeze_to(-3)
lr = starting_lr/2
learn.fit_one_cycle(1, slice(lr/(2.6**4),5e-3), moms=(0.8,0.7))

learn.unfreeze()
lr = starting_lr/10
learn.fit_one_cycle(2, slice(lr/(2.6**4),1e-3), moms=(0.8,0.7))
learn.fit_one_cycle(2, slice(lr/(2.6**4),1e-3), moms=(0.8,0.7))

learn.save(f'{lang}_ulmfit')
learn.export(f'{lang}_ulmfit.pkl') 

@sgugger
Contributor

sgugger commented Feb 4, 2019

Ok, so it's the weights of your loss function that are the culprit. I'll try to find a more general fix that goes through everything in the state and saves it on the CPU.

In the meantime, I pushed a fix that loads the Learner on the CPU if that's the default device. Since I had reports that map_location wasn't working before (though I couldn't reproduce them), it may not work in 100% of cases.

@DnzzL

DnzzL commented Feb 4, 2019

Nice catch!
It works when I save after removing .cuda() for the metrics.
Thank you for your help.
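
For anyone else hitting this, a minimal sketch of that workaround, reusing the names from the script above: keep the GPU copy for training, then rebuild the loss function around a CPU copy just before exporting.

# During training, the class weights live on the GPU:
loss_weights = torch.FloatTensor(weights_balance).cuda()
learn.crit = partial(F.cross_entropy, weight=loss_weights)

# ... training ...

# Before export, swap in a CPU copy so no CUDA tensor is pickled:
learn.crit = partial(F.cross_entropy, weight=loss_weights.cpu())
learn.export(f'{lang}_ulmfit.pkl')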

@anurag

anurag commented Feb 4, 2019 via email

@alex000kim

alex000kim commented Feb 6, 2019

Seeing the same error as @njaremko with 1.0.42

@sgugger
Contributor

sgugger commented Feb 6, 2019

It's only fixed in master, there hasn't been a new release yet.

@alex000kim

Thanks @sgugger. Any estimates on the next release date?

@sgugger
Contributor

sgugger commented Feb 6, 2019

I think sometime next week. There have been a lot of changes with my digging-out of kwargs everywhere, so we want to make sure everything is stable and nothing is broken before the next release.
In the meantime, you can use a dev install.

@stas00

stas00 commented Feb 6, 2019

In the meantime, you can use a dev install.

The simplest way to accomplish that is with:

pip install git+https://github.com/fastai/fastai
