TeraByte training starts with high accuracy #65
The accuracy is initially high because the dataset is very skewed in terms of clicks vs. non-clicks (~3%/97%), so the model can always guess non-click and still attain high accuracy. If you are not sub-sampling the non-clicks, I would suggest using the --mlperf-logging flag, which lets you track multiple metrics, including AUC, which is better suited to this situation. P.S. You should also add --test-freq=10240 to see the test metrics (i.e. AUC) at fixed intervals.
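To make the imbalance point concrete, here is a small self-contained sketch (illustrative numbers only, not the actual Criteo data): a classifier that always predicts "non-click" reaches roughly 97% accuracy on a ~3%/97% split, yet its AUC is exactly 0.5, i.e. no discriminative power at all.

```python
import random

random.seed(0)
n = 5_000
labels = [1 if random.random() < 0.03 else 0 for _ in range(n)]  # ~3% clicks
scores = [0.0] * n  # constant "always non-click" prediction

# Accuracy of the constant classifier: simply the fraction of non-clicks.
accuracy = sum(1 for y in labels if y == 0) / n

def auc(scores, labels):
    """Probability that a random click outranks a random non-click
    (ties count 0.5) -- the rank definition of ROC AUC."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

print(f"accuracy = {accuracy:.3f}, AUC = {auc(scores, labels):.2f}")
```

Since every score is identical, every click/non-click pair is a tie, so the AUC is 0.5 regardless of how high the accuracy looks.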
Thanks, @mnaumovfb, for the explanation. Nevertheless, I am trying to reproduce the plots that were generated with the suggested command line (plus the flag for sub-sampling), according to the instructions:
However, the training accuracy in the plot starts at ~79% and the training loss at 0.48, while my run still starts with the following: Can you please suggest how I can reproduce the exact result shown in the plots? Thanks again.
In your original command above you do not have the --data-sub-sample-rate=0.875 flag. I suspect that is the reason for the high accuracy (i.e. you are running on the full dataset). If the flag is added, I would expect the accuracy to drop and match the one reported in the README.
@mnaumovfb, but I am using this flag. I added it after your first comment. The command line I use is:
But still, the accuracy is very high. This is after 1024 steps: Finished training it 1024/2048437 of epoch 0, 63.01 ms/it, loss 0.138724, accuracy 96.726 %
This flag affects the pre-processing of the dataset itself. If you did not have it during pre-processing and just add it to the command line while training, it will have no effect. Is that what you are doing, or did you add it from the beginning?
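Conceptually, what the sub-sample flag does during pre-processing can be sketched as follows. This is an illustrative stand-in (the function name and data layout are hypothetical, not the repo's actual code): all clicks are kept, while 87.5% of the non-clicks are dropped as the processed file is built, which is why adding the flag only at training time changes nothing.

```python
import random

def subsample_negatives(samples, rate=0.875, seed=0):
    """Keep every click (y == 1) but drop a fraction `rate` of non-clicks.
    Sketch of what --data-sub-sample-rate=0.875 means: it is applied while
    the processed dataset is generated, not during training."""
    rng = random.Random(seed)
    return [(x, y) for x, y in samples if y == 1 or rng.random() >= rate]

# A 3%/97% split becomes roughly 20%/80% after dropping 87.5% of non-clicks:
# 0.03 / (0.03 + 0.97 * 0.125) ≈ 0.198.
data = [(i, 1 if i % 100 < 3 else 0) for i in range(100_000)]  # 3% clicks
kept = subsample_negatives(data)
click_rate = sum(y for _, y in kept) / len(kept)
print(f"click rate after sub-sampling: {click_rate:.3f}")
```

A ~20% positive rate also explains why the README plots can start around 79% accuracy rather than 97%: on the rebalanced data, guessing the majority class is no longer nearly free.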
I added it after the pre-processing. |
If you simply run the full dataset with the above options, the accuracy metric will just oscillate around 97%. Therefore, when running with the full dataset it is more meaningful to look at the AUC metric. You can obtain it by adding the --mlperf-logging and --test-freq flags, as I mentioned earlier. There is a caveat, though: by default the --mlperf-logging flag uses a different loader, so to switch to the default loader, which you have already used for pre-processing, you have to change the if statement on line 381 in dlrm_data_pytorch.py to "if False:". I think this might work, but I do not guarantee it.
Thank you, I will try that. |
Just FYI, I had the same issue when I saw this thread. I tried the suggested mlperf option, but I still saw 96.x% accuracy at the beginning of training.
@sgao3 Yeah, exactly, same for me. Closing this issue.
Hi,
I finally managed to get TeraByte training started, but the average accuracy seems suspiciously high. Does this mean that something wasn't processed correctly?
dlrm_s_pytorch.py --arch-sparse-feature-size=64 --arch-mlp-bot=13-512-256-64 --arch-mlp-top=512-512-256-1 --max-ind-range=10000000 --data-generation=dataset --data-set=terabyte --raw-data-file=./data/day --processed-data-file=./input/terabyte_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=2048 --print-freq=1024 --print-time --test-mini-batch-size=16384 --test-num-workers=16 --use-gpu --memory-map
Using 4 GPU(s)...
Reading pre-processed data=./input/terabyte_processed.npz
Sparse features= 26, Dense features= 13
Reading pre-processed data=./input/terabyte_processed.npz
Sparse features= 26, Dense features= 13
time/loss/accuracy (if enabled):
Finished training it 1024/2048437 of epoch 0, 53.58 ms/it, loss 0.138725, accuracy 96.726 %
Finished training it 2048/2048437 of epoch 0, 35.21 ms/it, loss 0.136623, accuracy 96.696 %
Finished training it 3072/2048437 of epoch 0, 35.06 ms/it, loss 0.135998, accuracy 96.691 %
Finished training it 4096/2048437 of epoch 0, 34.12 ms/it, loss 0.135519, accuracy 96.685 %
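As a back-of-the-envelope check (assuming a click rate of roughly 3%, which is an approximation, not the dataset's exact figure), the logged loss of ~0.136 is close to the binary cross-entropy of a model that always outputs the base rate, which is consistent with the model having collapsed to predicting the prior:

```python
import math

p = 0.03  # assumed approximate click rate; the exact value is dataset-specific
# Binary cross-entropy of always predicting probability p on data whose
# positive rate is also p: this equals the entropy of the label distribution.
baseline_bce = -(p * math.log(p) + (1 - p) * math.log(1 - p))
print(f"baseline BCE ≈ {baseline_bce:.4f}")  # in the ballpark of the ~0.136 logged above
```

If the training loss hovers near this entropy floor while accuracy sits at the majority-class rate, the model is effectively learning nothing beyond the prior, which is exactly why AUC is the metric to watch here.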