
Save load #1502

Merged
merged 3 commits into fastai:master on Jan 23, 2019

Conversation

pouannes

The export and load_learner methods were only working when a GPU with CUDA was available, so there was no way to export a model and then load it on a CPU-only device.
You can now do that by specifying device='cpu' when calling load_learner.
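
For illustration, a minimal sketch of the intended workflow; the export directory and the device keyword are taken from this PR's description, and the final merged API may differ (see the discussion below):

# On the GPU machine: serialize the whole Learner state to 'export.pkl'.
learn.export()

# On the CPU-only machine: load it with the state mapped to the CPU.
from fastai.basic_train import load_learner
learn = load_learner('path/to/export/dir', device='cpu')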

@sgugger
Contributor

sgugger commented Jan 23, 2019

Good suggestion. I may change it a little bit because I'm not sure what happens if we pass a GPU as the device (since the state contains other things besides the model), so it may end up just being a flag cpu=True if device doesn't work for every device.
Thanks a lot!

@sgugger sgugger merged commit 90c6dd4 into fastai:master Jan 23, 2019
@njaremko

njaremko commented Jan 30, 2019

@sgugger The changes you made after this merge have broken this again. Trying to load an exported model on a CPU-only machine gets the CUDA error in 1.0.42 and on master.

@sgugger
Contributor

sgugger commented Jan 30, 2019

You have to re-export any previously saved Learner with fastai v1.0.42 (or later), but normally the bug isn't there anymore.

@njaremko

njaremko commented Jan 30, 2019

Before making my comment, I trained a Learner from scratch with 1.0.42 and with master to make sure. It's still happening for me.

Jan 30 09:24:39 PM  Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location='cpu' to map your storages to the CPU.
Jan 30 09:24:39 PM  Traceback (most recent call last):
  File "app/server.py", line 31, in setup_learner
    learn = load_learner(path, export_file_name, True)
  File "/usr/local/lib/python3.7/site-packages/fastai/basic_train.py", line 469, in load_learner
    state = torch.load(open(Path(path)/fname, 'rb'))
  File "/usr/local/lib/python3.7/site-packages/torch/serialization.py", line 367, in load
    return _load(f, map_location, pickle_module)
  File "/usr/local/lib/python3.7/site-packages/torch/serialization.py", line 538, in _load
    result = unpickler.load()
  File "/usr/local/lib/python3.7/site-packages/torch/serialization.py", line 504, in persistent_load
    data_type(size), location)
  File "/usr/local/lib/python3.7/site-packages/torch/serialization.py", line 113, in default_restore_location
    result = fn(storage, location)
  File "/usr/local/lib/python3.7/site-packages/torch/serialization.py", line 94, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/usr/local/lib/python3.7/site-packages/torch/serialization.py", line 78, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location='cpu' to map your storages to the CPU.
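
For reference, the workaround the error message itself suggests looks like this (a sketch only; it bypasses load_learner, and the file name is the fastai default):

import torch

# Force every pickled storage onto the CPU while deserializing.
state = torch.load('export.pkl', map_location='cpu')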

@sgugger
Contributor

sgugger commented Jan 30, 2019

You should check the fastai version on the machine where you export your 'export.pkl' file, then provide us with the code you're running. I just confirmed I had no problem loading an exported file on a CPU-only instance.

@njaremko

Double-checked both machines. Both are running 1.0.42.

Code used to generate model:

from fastai import *
from fastai.vision import *

# Loading data
bs = 32
df = pd.read_csv('./labels.csv', header='infer')
data = (ImageItemList.from_df(df, path='.', folder='train')
        .random_split_by_pct()
        .label_from_df(label_cls=FloatList)
        .transform(get_transforms(), size=299)
        .databunch(bs=bs).normalize(imagenet_stats))

# Create model
learn = create_cnn(data, models.resnet101)

# Use mixed precision
learn.to_fp16()
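# NB: to_fp16 attaches the MixedPrecision callback to the Learner;
# as noted further down in this thread, that callback kept GPU tensors
# in the exported state at the time.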

# Find Learning Rate
learn.lr_find()
learn.recorder.plot()

# Fit
learn.fit_one_cycle(10)
learn.save('stage-1-101')

# Prep for fine tuning
learn.unfreeze()
learn.lr_find()
learn.recorder.plot()

# Fine Tuning
learn.fit_one_cycle(10, max_lr=slice(1e-6, 3e-4))
learn.save('stage-2-101')

# Export
learn.to_fp32()
learn.export('prod-101.pkl')

@sgugger
Contributor

sgugger commented Jan 31, 2019

Ok, the problem was different: to_fp32 doesn't properly remove the MixedPrecision callback, and even if you don't call it, the MixedPrecision callback saves some stuff on the GPU. Both should be fixed now.
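
Until that fix reached a release, one manual workaround was to strip the callback yourself before exporting. A sketch, assuming fastai v1's learn.callbacks list and the MixedPrecision class name:

# Convert the weights back to FP32 and drop the half-precision
# callback, so no CUDA-specific state gets pickled into the export.
learn = learn.to_fp32()
learn.callbacks = [cb for cb in learn.callbacks
                   if cb.__class__.__name__ != 'MixedPrecision']
learn.export()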

@DnzzL

DnzzL commented Jan 31, 2019

Hello.
Both machines have fastai 1.0.42 installed.
I have retrained my model with the latest version, but I still have this issue too.
Why was the boolean cpu removed from the loading function in commit a92d6c6?

@njaremko

njaremko commented Jan 31, 2019

You'll need to install fastai from master if you want this fix, as it's not in a release yet (assuming you have the same fp16 issue).

It was removed because the logic was changed to always save models for CPU.
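
Conceptually, "always save for CPU" means something like the following (a sketch of the idea, not fastai's actual implementation):

import torch

def export_cpu(model, path):
    # Move every tensor to the CPU before pickling, so the file can be
    # deserialized anywhere, regardless of CUDA availability.
    state = {k: v.cpu() for k, v in model.state_dict().items()}
    torch.save(state, path)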

@DnzzL

DnzzL commented Feb 1, 2019

I have installed the latest version of fastai with pip install git+https://github.com/fastai/fastai.git, so I have version 1.0.43.dev0 for training; I ran the whole pipeline again and saved the classifier.
The inference machine is still on version 1.0.42, but I still have the problem:

If you are running on a CPU-only machine, please use torch.load with map_location='cpu' to map your storages to the CPU.

I think it's more flexible to be able to choose the device at loading time rather than at saving time.

@sgugger
Contributor

sgugger commented Feb 1, 2019

The boolean cpu wasn't working: there was still an error on some CPU-only machines, which is why it was removed. If you still get the error, it means some tensors were saved on the GPU, and they shouldn't really be there. Sharing your code would help us debug this.

@DnzzL

DnzzL commented Feb 1, 2019

You'll have déjà vu: it's based on your DeepFrench notebook ;)

import fastai
import torch
print(fastai.__version__) # 1.0.43.dev0
print(torch.__version__) # 1.0.0

from fastai import *
from fastai.text import *
import pandas as pd

lang = "en" 
if lang == "fr":
  weights_pretrained = 'wref30k'
  itos_pretrained = 'itosref30k'
  pretained_data = (weights_pretrained, itos_pretrained)
if lang == "es":
  weights_pretrained = 'model-30k-vocab-noqrnn'
  itos_pretrained = 'itos_pretrained'
if lang != "en":
  pretained_data = (weights_pretrained, itos_pretrained)

PATH_LM = Path(f'data/{lang}')
PATH_CS = Path(f'data/{lang}')

train_df = pd.read_csv(f"{PATH_LM}/train.csv")
valid_df = pd.read_csv(f"{PATH_LM}/valid.csv")
test_df = pd.read_csv(f"{PATH_LM}/test.csv")

tokenizer = Tokenizer(lang=f'{lang}', n_cpus=8)
data_lm = TextLMDataBunch.from_df(PATH_LM, tokenizer=tokenizer, bs=32, train_df=train_df, valid_df=valid_df, test_df=test_df)

"""# Fine tuning LM"""

if lang == "en":
  learn = language_model_learner(data_lm, pretrained_model=URLs.WT103_1, drop_mult=0)
else:
  learn = language_model_learner(data_lm, pretrained_fnames=pretrained_data, drop_mult=0)
learn.freeze()

learn.lr_find()
learn.recorder.plot(skip_start=0)
learn.fit_one_cycle(1, 1e-2)
learn.save(f'{lang}_head_pretrained')

learn.unfreeze()
learn.fit_one_cycle(2, 1e-3, moms=(0.8,0.7))
learn.save(f'{lang}_lm_fine_tuned')
learn.save_encoder(f'{lang}_ft_enc')

"""# Classification task"""

train_df = pd.read_csv(f"{PATH_CS}/train.csv", names=["label", "text"])
valid_df = pd.read_csv(f"{PATH_CS}/valid.csv")
test_df = pd.read_csv(f"{PATH_CS}/test.csv")

labelcounts = train_df.groupby(["label"]).size()
label_sum = len(train_df["label"])
class_imbalance = [(count/label_sum) for count in labelcounts]
print(class_imbalance)

data_clas = TextClasDataBunch.from_df(PATH_CS, train_df=train_df, valid_df=valid_df, test_df=test_df, tokenizer=tokenizer, vocab=data_lm.train_ds.vocab, bs=32)

learn = text_classifier_learner(data_clas, drop_mult=0.3)
learn.load_encoder(f'{lang}_ft_enc')
learn.freeze()

"""## Take class imbalance into account"""

weights_balance = [(1-count/label_sum) for count in labelcounts]
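# NB: .cuda() below puts the weight tensor on the GPU; as diagnosed
# further down in this thread, it gets pickled into the export file
# and breaks loading on CPU-only machines.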
loss_weights = torch.FloatTensor(weights_balance).cuda()
learn.crit = partial(F.cross_entropy, weight=loss_weights)

"""## Train classifier"""

learn.lr_find()
learn.recorder.plot(skip_start=0)

starting_lr = 1e-2

learn.fit_one_cycle(2, starting_lr, moms=(0.8,0.7))

learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(starting_lr/(2.6**4),1e-2), moms=(0.8,0.7))

learn.freeze_to(-3)
lr = starting_lr/2
learn.fit_one_cycle(1, slice(lr/(2.6**4),5e-3), moms=(0.8,0.7))

learn.unfreeze()
lr = starting_lr/10
learn.fit_one_cycle(2, slice(lr/(2.6**4),1e-3), moms=(0.8,0.7))
learn.fit_one_cycle(2, slice(lr/(2.6**4),1e-3), moms=(0.8,0.7))

learn.save(f'{lang}_ulmfit')
learn.export(f'{lang}_ulmfit.pkl') 

@sgugger
Contributor

sgugger commented Feb 4, 2019

Ok, so it's the weights of your loss function that are the culprit. I'll try to find a more general fix that goes through everything in the state and saves it on the CPU.

In the meantime, I pushed a fix that loads the Learner on the CPU if that's the default device. Since I had reports that map_location wasn't working before (though I couldn't reproduce them), it may not work in 100% of cases.

@DnzzL

DnzzL commented Feb 4, 2019

Nice catch!
It works when I save after removing .cuda() for the metrics.
Thank you for your help.
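
For anyone else hitting this, a minimal sketch of that workaround, reusing the names from the script above: keep the GPU copy for training, then rebuild the loss function around a CPU copy just before exporting.

# During training, the class weights live on the GPU:
loss_weights = torch.FloatTensor(weights_balance).cuda()
learn.crit = partial(F.cross_entropy, weight=loss_weights)

# ... training ...

# Before export, swap in a CPU copy so no CUDA tensor is pickled:
learn.crit = partial(F.cross_entropy, weight=loss_weights.cpu())
learn.export(f'{lang}_ulmfit.pkl')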

@anurag

anurag commented Feb 4, 2019 via email

@alex000kim

alex000kim commented Feb 6, 2019

Seeing the same error as @njaremko with 1.0.42

@sgugger
Contributor

sgugger commented Feb 6, 2019

It's only fixed in master, there hasn't been a new release yet.

@alex000kim

Thanks @sgugger. Any estimates on the next release date?

@sgugger
Contributor

sgugger commented Feb 6, 2019

I think sometime next week. There have been a lot of changes with my digging-out of kwargs everywhere, so we want to make sure everything is stable and nothing is broken before the next release.
In the meantime, you can use a dev install.

@stas00

stas00 commented Feb 6, 2019

In the meantime, you can use a dev install.

The simplest way to accomplish that is with:

pip install git+https://github.com/fastai/fastai
