added model saving, loading and checkpointing support to PyTorch #8

dmudiger · 2019-06-26T19:10:17Z

Adds model saving/loading support for the PyTorch implementation:
--save-model="model file" (e.g: model.pt)
--load-model="model file" (e.g: model.pt)

Also, the saved model can be used with --inference-only to run only testing on a previously trained (and saved) model.
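
As a rough illustration of how these three flags fit together, here is a minimal argparse sketch. Only the flag names come from this PR; the types, defaults, the `build_parser` and `check_args` helpers, and the rule that `--inference-only` needs `--load-model` are assumptions made for the example and may not match dlrm_s_pytorch.py.

```python
import argparse


def build_parser():
    # Flag names are taken from the PR description; types and defaults are assumed.
    parser = argparse.ArgumentParser(description="save/load flag sketch")
    parser.add_argument("--save-model", type=str, default="")
    parser.add_argument("--load-model", type=str, default="")
    parser.add_argument("--inference-only", action="store_true", default=False)
    return parser


def check_args(args):
    """Return a short description of the run the flags ask for.

    Hypothetical helper: it rejects inference without a loaded model,
    which is a design choice of this sketch, not confirmed script behavior.
    """
    if args.inference_only and not args.load_model:
        raise ValueError("--inference-only needs --load-model in this sketch")
    if args.inference_only:
        return "test only, using " + args.load_model
    if args.load_model:
        return "resume training from " + args.load_model
    return "train from scratch"
```

For example, `check_args(build_parser().parse_args(["--load-model=model.pt", "--inference-only"]))` describes a test-only run on the saved model.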

berryweinst · 2019-07-02T15:15:28Z

I ran this command line after your addition:
CUDA_VISIBLE_DEVICES=0 python dlrm_s_pytorch.py --arch-sparse-feature-size=16 --arch-mlp-bot="13-512-256-64-16" --arch-mlp-top="512-256-1" --data-generation=dataset --data-set=kaggle --processed-data-file=./kaggle_data/kaggleAdDisplayChallenge_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=128 --print-freq=1024 --print-time --raw-data-file /media/drive/Datasets/criteo/train.txt --use-gpu --save-model ./model/checkpoint.pt

But nothing was saved. I removed the condition in the main script as a workaround for now to see if it works, but the main problem is that this stage of reading the batches takes all day, so if the model isn't saved (or there is any other issue) I need to start all over again. Is there anything that can be done about it?

Converted to tensors...done!
Sparse features = 26, Dense features = 13
Training data
Total number of batches 263115
Reading in batch 14345/263115

berryweinst · 2019-07-02T15:39:11Z

After it finished reading the data, it crashed.

Reading in batch: 263115 / 263115

Testing data
Total number of batches 21926
Reading in batch: 21926 / 21926

time/loss/accuracy (if enabled):
Saving model to ./model/checkpoint.pt
Traceback (most recent call last):
File "dlrm_s_pytorch.py", line 855, in <module>
"train_acc": gA,
NameError: name 'gA' is not defined

dmudiger · 2019-07-02T18:49:21Z

Hi, can you try with the latest version of the code? I'm not able to reproduce the issue on my end.

Also, the model is saved only when testing is done, so you need to add the --test-freq argument with the number of iterations between test runs. It is best to set this sufficiently high, since each test run adds the overhead of evaluating on the whole test data set.
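
To make the dependency concrete, here is a minimal sketch of a loop in which saving is tied to the test step. The function name `checkpoint_iterations`, the treatment of a non-positive `test_freq` as "testing disabled", and the exact condition are assumptions for illustration; the real check in dlrm_s_pytorch.py may differ.

```python
def checkpoint_iterations(num_batches, test_freq, save_model):
    """Return the iteration counts after which a test pass and a save would run."""
    saved_at = []
    for j in range(num_batches):
        # ... forward pass, backward pass and optimizer step would go here ...

        # Assumed condition: test every `test_freq` iterations, never if it is <= 0.
        should_test = test_freq > 0 and (j + 1) % test_freq == 0
        if should_test:
            # ... evaluation over the whole test set would go here ...
            if save_model:
                # The save only happens inside the test branch.
                saved_at.append(j + 1)
    return saved_at
```

With `test_freq` left at a non-positive value the function returns an empty list even when a save path is given, which matches the "nothing was saved" symptom above.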

berryweinst · 2019-07-03T11:35:16Z

I guess what was missing was --test-freq. Now it seems to work.

dmudiger · 2019-07-03T14:52:59Z

Happy to hear!

Also, for quick tests you can use the --num-batches argument to run with only a limited number of batches; once that works, you can run with the full data set.

berryweinst · 2019-07-03T16:33:02Z

Thanks a lot.
As for data preparation: the reading stage (Reading in batch 14345/263115) can take all day before training starts. What exactly is taking so long? Is it reading and processing each record in Criteo? Could this be done in a threaded way inside the batch pre-fetch, with even a few workers involved?
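
To illustrate the kind of pre-fetch meant here, a minimal sketch using only the standard library. The names `prefetch` and `parse_line` are hypothetical, and the record layout (label, 13 integer features, 26 hexadecimal categorical features, tab-separated) is an assumption about the Criteo text file based on the feature counts printed above. Note that a single Python thread overlaps reading with training but will not speed up CPU-bound parsing because of the GIL; that would likely need worker processes instead.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def parse_line(line):
    """Split one tab-separated record into (label, dense, sparse)."""
    fields = line.rstrip("\n").split("\t")
    label = int(fields[0])
    dense = [int(x) if x else 0 for x in fields[1:14]]        # 13 integer features
    sparse = [int(x, 16) if x else 0 for x in fields[14:40]]  # 26 hex-coded features
    return label, dense, sparse


def prefetch(items, depth=4):
    """Yield from `items`, letting a worker thread run up to `depth` items ahead."""
    buf = queue.Queue(maxsize=depth)

    def worker():
        try:
            for item in items:
                buf.put(item)  # blocks while the buffer is full
        finally:
            buf.put(_DONE)     # always signal the consumer, even on error

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item
```

A possible use is `for label, dense, sparse in prefetch(parse_line(l) for l in open(path)): ...`, where `path` points at the raw data file.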

added model saving, loading and checkpointing support to PyTorch

a349c0f

facebook-github-bot added the CLA Signed label Jun 26, 2019

dmudiger mentioned this pull request Jun 26, 2019

Checkpoint #7

Closed

minor fixes; updated README

6a8f868

mnaumovfb merged commit e83538f into facebookresearch:master Jul 1, 2019

jithunnair-amd pushed a commit to jithunnair-amd/dlrm that referenced this pull request Feb 25, 2020

Merge pull request facebookresearch#8 from jithunnair-amd/IFU_20191203

9f4b5ee

IFU 20191203

berryweinst mentioned this pull request Apr 16, 2020

Caffe2 using 8 GPUs #70

Closed
