added model saving, loading and checkpointing support to PyTorch #8

Merged: 2 commits, Jul 1, 2019


dmudiger
Contributor

Adds model saving/loading support for the PyTorch implementation:
--save-model="model file" (e.g. model.pt)
--load-model="model file" (e.g. model.pt)

Also, the saved model can be used with --inference-only to run only testing on a previously trained (and saved) model.
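How the three flags could interact can be sketched with argparse alone. The flag names come from the description above; the defaults and the `plan_run` helper are assumptions made for illustration, not the code that was merged.

```python
import argparse


def build_parser():
    # Flag names are from the PR description; the defaults are assumptions.
    parser = argparse.ArgumentParser(description="save/load flag sketch")
    parser.add_argument("--save-model", type=str, default="")
    parser.add_argument("--load-model", type=str, default="")
    parser.add_argument("--inference-only", action="store_true", default=False)
    return parser


def plan_run(args):
    """Return the ordered steps a run would take (hypothetical helper)."""
    steps = []
    if args.load_model:
        steps.append("load " + args.load_model)
    if not args.inference_only:
        steps.append("train")
    steps.append("test")
    # Assumption: nothing new is written when only running inference.
    if args.save_model and not args.inference_only:
        steps.append("save " + args.save_model)
    return steps


args = build_parser().parse_args(["--load-model=model.pt", "--inference-only"])
print(plan_run(args))
```

With `--load-model` and `--inference-only` together, the sketch loads the saved model and goes straight to testing.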

@facebook-github-bot added the CLA Signed label Jun 26, 2019
@dmudiger mentioned this pull request Jun 26, 2019
@mnaumovfb merged commit e83538f into facebookresearch:master Jul 1, 2019
@berryweinst

I ran this command line after your addition:
CUDA_VISIBLE_DEVICES=0 python dlrm_s_pytorch.py --arch-sparse-feature-size=16 --arch-mlp-bot="13-512-256-64-16" --arch-mlp-top="512-256-1" --data-generation=dataset --data-set=kaggle --processed-data-file=./kaggle_data/kaggleAdDisplayChallenge_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=128 --print-freq=1024 --print-time --raw-data-file /media/drive/Datasets/criteo/train.txt --use-gpu --save-model ./model/checkpoint.pt

But nothing was saved. As a workaround, I removed the condition in the main script to see whether saving works. The bigger problem is that the batch-reading stage takes all day, so if the model isn't saved (or anything else goes wrong) I have to start all over again. Is there anything that can be done about it?

Converted to tensors...done!
Sparse features = 26, Dense features = 13
Training data
Total number of batches 263115
Reading in batch 14345/263115

@berryweinst

After it finished reading the data, it crashed.

Reading in batch: 263115 / 263115

Testing data
Total number of batches 21926
Reading in batch: 21926 / 21926

time/loss/accuracy (if enabled):
Saving model to ./model/checkpoint.pt
Traceback (most recent call last):
File "dlrm_s_pytorch.py", line 855, in
"train_acc": gA,
NameError: name 'gA' is not defined
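The traceback is consistent with `gA` being assigned only inside a testing branch that never ran, which is an assumption about the script rather than something the log proves. A minimal reproduction under that assumption follows; every name except `gA` and `train_acc` is hypothetical, and because the sketch uses a function the error surfaces as `UnboundLocalError`, a subclass of `NameError`.

```python
def build_checkpoint(run_test):
    if run_test:
        gA = 0.78  # assigned only when the testing branch runs
    # Fails with a NameError subclass when run_test is False.
    return {"train_acc": gA}


def build_checkpoint_safe(run_test):
    gA = None  # default, so the key can be filled even if testing never ran
    if run_test:
        gA = 0.78
    return {"train_acc": gA}


try:
    build_checkpoint(False)
except NameError as exc:
    print("failed:", type(exc).__name__)

print(build_checkpoint_safe(False))
```

Initializing the metric before the branch is one way to keep the save step from depending on whether a test pass happened.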

@dmudiger
Contributor Author

dmudiger commented Jul 2, 2019

Hi, can you try with the latest version of the code? I'm not able to reproduce the issue on my end.

Also, the model is saved only when testing is done, so you need to add the --test-freq argument with the number of iterations between tests. It is best to set this sufficiently high, since each test adds the overhead of evaluating on the whole test data set.
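The coupling between testing and saving can be illustrated with a toy loop. This is a sketch under stated assumptions: `train`, the "0 disables testing" convention, and the checkpoint keys are hypothetical, and `pickle` stands in for the framework's own serialization so the example runs with the standard library only.

```python
import os
import pickle
import tempfile


def train(num_batches, test_freq, save_path):
    """Toy loop: test every `test_freq` batches and save only after a test."""
    saves = 0
    state = {"weights": 0.0}
    for j in range(num_batches):
        state["weights"] += 0.1  # stand-in for one optimizer step
        # Assumption: a non-positive test_freq means testing never runs.
        should_test = test_freq > 0 and (j + 1) % test_freq == 0
        if should_test:
            test_acc = 0.5  # stand-in for a full pass over the test set
            if save_path:
                checkpoint = {"iter": j + 1, "test_acc": test_acc, "state": state}
                with open(save_path, "wb") as f:
                    pickle.dump(checkpoint, f)
                saves += 1
    return saves


path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
print(train(10, 0, path), os.path.exists(path))  # no test interval set
print(train(10, 5, path), os.path.exists(path))  # test (and save) every 5 batches
```

Without a test interval the loop finishes without ever writing a file, which matches the "nothing was saved" symptom reported earlier in the thread.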

@berryweinst

I guess what was missing was --test-freq. Now it seems to work.

@dmudiger
Contributor Author

dmudiger commented Jul 3, 2019

Happy to hear!

Also, for quick tests you can use the --num-batches argument to run with only a limited number of batches; once that works, you can run with the full data set.
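The idea of capping a run at a few batches can be sketched with `itertools.islice`. The `limited` helper and the "0 means no limit" convention are assumptions for illustration, not the script's actual handling of the flag.

```python
from itertools import islice


def limited(batches, num_batches):
    """Yield at most `num_batches` items; 0 means no limit (assumed convention)."""
    return islice(batches, num_batches) if num_batches > 0 else iter(batches)


# Smoke test: touch only the first 4 of 263115 batch indices.
smoke = list(limited(range(263115), 4))
print(smoke)
```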

@berryweinst

Thanks a lot.
As for the data preparation: this reading stage (Reading in batch 14345/263115) can take all day before training starts. What exactly is taking so long? Is it reading and processing each record in Criteo? Is there a way to do this in a threaded way inside the batch pre-fetch, with even a few workers involved?
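One possible direction for the question above, not an answer from the maintainers, is to parse records with a small worker pool. The sketch assumes each record is tab-separated as a label, 13 integer fields, and 26 hexadecimal categorical fields (matching the 13 dense and 26 sparse features in the log), and that missing values are empty strings; `parse_line` and `parse_parallel` are hypothetical helpers. Threads are used to keep the example simple; for CPU-bound parsing in Python, a process pool is usually the better fit.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_line(line):
    """Split one tab-separated record into (label, 13 dense, 26 sparse)."""
    fields = line.rstrip("\n").split("\t")
    label = int(fields[0])
    # Assumption: empty fields mean "missing" and are mapped to 0.
    dense = [int(x) if x else 0 for x in fields[1:14]]
    sparse = [int(x, 16) if x else 0 for x in fields[14:40]]
    return label, dense, sparse


def parse_parallel(lines, workers=4):
    # Executor.map returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_line, lines))


# Synthetic record for illustration only; not real Criteo data.
sample = "\t".join(["1"] + ["3"] * 13 + ["a1b2"] * 26)
records = parse_parallel([sample] * 8, workers=2)
print(len(records), records[0][0], len(records[0][1]), len(records[0][2]))
```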

jithunnair-amd pushed a commit to jithunnair-amd/dlrm that referenced this pull request Feb 25, 2020
@berryweinst mentioned this pull request Apr 16, 2020
YoungsukKim12 pushed a commit to YoungsukKim12/dlrm that referenced this pull request Oct 29, 2023
…ebookresearch#8)

* added model saving, loading and checkpointing support to PyTorch

* minor fixes; updated README

4 participants