Dnnl perf issue #605

Merged
merged 7 commits into from
Feb 21, 2020
Conversation

dcslin
Member

@dcslin dcslin commented Feb 20, 2020

No description provided.

@dcslin
Member Author

dcslin commented Feb 20, 2020

Hi, this fixes the performance issue #591.

The issue is very weird, and I spent a lot of time checking the small differences between the example code and our code. The only observation so far is that creating the dnnl primitive desc and memory desc in the convhandle dramatically slows performance (from 2 ms to 80 ms per operation). I have therefore moved all the pd and md into the scope of each individual operation for now.

Extra performance tests have also been added.
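
For readers who want to reproduce a per-operation comparison like the 2 ms versus 80 ms figure above, here is a minimal timing sketch. It is not the test code added in this PR; the helper names (`mean_ms_per_call`, `compare`) and the two placeholder variants are hypothetical and only illustrate how two implementations of the same operation could be timed side by side.

```python
import time


def mean_ms_per_call(op, warmup=3, repeats=50):
    """Return the mean wall-clock time of one op() call in milliseconds."""
    for _ in range(warmup):  # warm-up calls are not timed
        op()
    start = time.perf_counter()
    for _ in range(repeats):
        op()
    return (time.perf_counter() - start) * 1000.0 / repeats


def compare(variants, **kwargs):
    """Time each named callable and return {name: mean ms per call}."""
    return {name: mean_ms_per_call(op, **kwargs) for name, op in variants.items()}


if __name__ == "__main__":
    # Hypothetical placeholders; replace them with the real operations under test.
    results = compare({
        "variant_a": lambda: sum(range(1000)),
        "variant_b": lambda: sum(range(100000)),
    })
    for name, ms in results.items():
        print(f"{name}: {ms:.3f} ms per call")
```

Timing many repeats after a few warm-up calls keeps one-off setup costs out of the per-call average, which matters when the suspected overhead is tied to where descriptors are created.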

@dcslin dcslin marked this pull request as ready for review February 20, 2020 09:46
@dcslin
Member Author

dcslin commented Feb 20, 2020

Currently, the dnnl MNIST (60000x1x28x28) training time (batch size 64, 1 epoch) is 123 s.
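
As a rough sanity check, the reported figure can be converted into throughput. This is a sketch only: it assumes the full 123 s is spent on training batches and that a final partial batch is counted, neither of which is stated above. The function name `epoch_throughput` is hypothetical.

```python
import math


def epoch_throughput(num_samples, batch_size, epoch_seconds):
    """Derive batch count, samples per second, and ms per batch for one epoch."""
    # Assumption: a trailing partial batch is counted as one batch.
    num_batches = math.ceil(num_samples / batch_size)
    return {
        "batches": num_batches,
        "samples_per_sec": num_samples / epoch_seconds,
        "ms_per_batch": epoch_seconds * 1000.0 / num_batches,
    }


# Figures taken from the comment above: 60000 images, batch size 64, 123 s.
stats = epoch_throughput(60000, 64, 123.0)
print(stats["batches"])  # 938 batches per epoch
print(f"{stats['samples_per_sec']:.1f} samples/s, {stats['ms_per_batch']:.1f} ms/batch")
```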

@chrishkchris
Contributor

@dcslin I have retested your branch after you cleaned up the code again. I think this is ready to merge.

The CPU mnist cnn result is below:

root@71b7b910ae0b:~/dcsysh/singa/examples/autograd# python3 mnist_cnn.py
Starting Epoch 0:
Training loss = 582.440613, training accuracy = 0.794307
Evaluation accuracy = 0.926983, Elapsed Time = 138.448628s
Starting Epoch 1:
Training loss = 233.955933, training accuracy = 0.921958
Evaluation accuracy = 0.961038, Elapsed Time = 139.953837s
Starting Epoch 2:
Training loss = 167.640701, training accuracy = 0.944003
Evaluation accuracy = 0.970853, Elapsed Time = 149.805821s
Starting Epoch 3:
Training loss = 136.292664, training accuracy = 0.955176
Evaluation accuracy = 0.974159, Elapsed Time = 141.567021s
Starting Epoch 4:
Training loss = 116.591469, training accuracy = 0.961196
Evaluation accuracy = 0.971655, Elapsed Time = 139.504096s
Starting Epoch 5:
Training loss = 105.760490, training accuracy = 0.965015
Evaluation accuracy = 0.978966, Elapsed Time = 139.674059s
Starting Epoch 6:
Training loss = 93.552238, training accuracy = 0.968733
Evaluation accuracy = 0.977764, Elapsed Time = 140.324435s
Starting Epoch 7:
Training loss = 85.340057, training accuracy = 0.970818
Evaluation accuracy = 0.978365, Elapsed Time = 139.976697s
Starting Epoch 8:
Training loss = 84.529572, training accuracy = 0.971752
Evaluation accuracy = 0.981270, Elapsed Time = 138.948784s
Starting Epoch 9:
Training loss = 77.371544, training accuracy = 0.974019
Evaluation accuracy = 0.982572, Elapsed Time = 139.120852s

I think issue #591 will be fully resolved once this PR is merged. Thanks a lot for your help, @dcslin.
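
The per-epoch times in the log above can be summarized with a short script. This is a sketch that assumes the log format shown here (`Elapsed Time = <seconds>s`); the function name `elapsed_times` is hypothetical, and only the first three epochs are embedded to keep the sample short.

```python
import re
import statistics

# Matches the "Elapsed Time = 138.448628s" field printed once per epoch.
ELAPSED_RE = re.compile(r"Elapsed Time = ([0-9.]+)s")


def elapsed_times(log_text):
    """Return the elapsed time of every epoch found in the log, in seconds."""
    return [float(m) for m in ELAPSED_RE.findall(log_text)]


# First three epochs copied from the log above.
SAMPLE_LOG = """\
Evaluation accuracy = 0.926983, Elapsed Time = 138.448628s
Evaluation accuracy = 0.961038, Elapsed Time = 139.953837s
Evaluation accuracy = 0.970853, Elapsed Time = 149.805821s
"""

times = elapsed_times(SAMPLE_LOG)
print(f"epochs: {len(times)}")
print(f"mean: {statistics.mean(times):.2f} s, max: {max(times):.2f} s")
```

Feeding the full output of `mnist_cnn.py` to `elapsed_times` gives one value per epoch, which makes it easy to compare runs before and after a change.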

3 participants