Dnnl perf issue #605

dcslin · 2020-02-20T09:30:27Z

No description provided.

dcslin · 2020-02-20T09:46:22Z

Hi, this PR fixes the performance issue #591.

The issue is very strange, and I spent a lot of time checking the small differences between the example code and our code. The only observation so far is that creating the DNNL primitive descriptors (pd) and memory descriptors (md) in the ConvHandle dramatically slows performance (from 2 ms to 80 ms per operation). For now, I have therefore moved all pd and md creation into the scope of each individual operation.

Extra performance tests are also added.
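A per-operation figure like the 2 ms versus 80 ms above can be measured with a small timing harness. The following is only a minimal sketch, not the test code added in this PR: `mean_ms_per_call` and `dummy_op` are hypothetical names, and `dummy_op` is a stand-in for one convolution forward call.

```python
import time


def mean_ms_per_call(fn, repeats=100, warmup=10):
    """Return the mean wall-clock time of fn() in milliseconds."""
    # Warm-up calls are excluded so one-time setup cost does not skew the mean.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / repeats


def dummy_op():
    # Hypothetical stand-in for a single convolution forward pass.
    return sum(i * i for i in range(1000))


if __name__ == "__main__":
    print(f"{mean_ms_per_call(dummy_op):.3f} ms per operation")
```

Running the same harness against both variants of the operation (descriptors built in the handle versus built per operation) would give directly comparable numbers.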

dcslin · 2020-02-20T09:51:36Z

Currently, DNNL MNIST (60000x1x28x28) training (batch size 64, 1 epoch) takes 123 sec.
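For a rough sense of scale, the epoch time can be converted to a per-iteration time. This is a back-of-the-envelope sketch that assumes the last partial batch is also processed (the training script may drop it instead). The result covers the whole training step, so it is not directly comparable to the per-operation convolution timings.

```python
import math

num_samples = 60000    # MNIST training images, from the comment above
batch_size = 64        # from the comment above
epoch_seconds = 123.0  # reported time for one epoch

# Assumption: the final partial batch counts as one iteration.
iterations = math.ceil(num_samples / batch_size)
ms_per_iteration = epoch_seconds / iterations * 1000.0

print(f"{iterations} iterations, {ms_per_iteration:.1f} ms per iteration")
```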

chrishkchris · 2020-02-20T11:37:02Z

@dcslin I have retested your branch after you cleaned up the code. I think it is ready to merge.

The CPU mnist cnn result is below:

root@71b7b910ae0b:~/dcsysh/singa/examples/autograd# python3 mnist_cnn.py
Starting Epoch 0:
Training loss = 582.440613, training accuracy = 0.794307
Evaluation accuracy = 0.926983, Elapsed Time = 138.448628s
Starting Epoch 1:
Training loss = 233.955933, training accuracy = 0.921958
Evaluation accuracy = 0.961038, Elapsed Time = 139.953837s
Starting Epoch 2:
Training loss = 167.640701, training accuracy = 0.944003
Evaluation accuracy = 0.970853, Elapsed Time = 149.805821s
Starting Epoch 3:
Training loss = 136.292664, training accuracy = 0.955176
Evaluation accuracy = 0.974159, Elapsed Time = 141.567021s
Starting Epoch 4:
Training loss = 116.591469, training accuracy = 0.961196
Evaluation accuracy = 0.971655, Elapsed Time = 139.504096s
Starting Epoch 5:
Training loss = 105.760490, training accuracy = 0.965015
Evaluation accuracy = 0.978966, Elapsed Time = 139.674059s
Starting Epoch 6:
Training loss = 93.552238, training accuracy = 0.968733
Evaluation accuracy = 0.977764, Elapsed Time = 140.324435s
Starting Epoch 7:
Training loss = 85.340057, training accuracy = 0.970818
Evaluation accuracy = 0.978365, Elapsed Time = 139.976697s
Starting Epoch 8:
Training loss = 84.529572, training accuracy = 0.971752
Evaluation accuracy = 0.981270, Elapsed Time = 138.948784s
Starting Epoch 9:
Training loss = 77.371544, training accuracy = 0.974019
Evaluation accuracy = 0.982572, Elapsed Time = 139.120852s

I think issue #591 is totally resolved after merging this PR. Thanks a lot for your help @dcslin
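The per-epoch timings in the log above can be summarised with a few lines of Python. This is an illustrative sketch only: the `LOG` string copies the "Elapsed Time" lines from the log, and the regular expression assumes that exact output format.

```python
import re

# "Evaluation accuracy ... Elapsed Time" lines copied from the log above.
LOG = """\
Evaluation accuracy = 0.926983, Elapsed Time = 138.448628s
Evaluation accuracy = 0.961038, Elapsed Time = 139.953837s
Evaluation accuracy = 0.970853, Elapsed Time = 149.805821s
Evaluation accuracy = 0.974159, Elapsed Time = 141.567021s
Evaluation accuracy = 0.971655, Elapsed Time = 139.504096s
Evaluation accuracy = 0.978966, Elapsed Time = 139.674059s
Evaluation accuracy = 0.977764, Elapsed Time = 140.324435s
Evaluation accuracy = 0.978365, Elapsed Time = 139.976697s
Evaluation accuracy = 0.981270, Elapsed Time = 138.948784s
Evaluation accuracy = 0.982572, Elapsed Time = 139.120852s
"""

# Extract the elapsed seconds of every epoch.
times = [float(t) for t in re.findall(r"Elapsed Time = ([0-9.]+)s", LOG)]
mean_time = sum(times) / len(times)
slowest_epoch = times.index(max(times))

print(f"epochs: {len(times)}, mean: {mean_time:.2f}s, "
      f"slowest: epoch {slowest_epoch} ({max(times):.2f}s)")
```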

dcslin added 7 commits February 20, 2020 04:41

dnnl-performance-issue-investigation

06ec6ce

fixed dnnl conv forward

8e85fdc

fix dnnl conv

0aa590c

test operation convolution update for performance test

7a88251

fix dnnl conv performance drop

0c8f418

Merge remote-tracking branch 'upstream/dev' into dnnl-perf-issue

d9ba4aa

code format

758c9e8

dcslin marked this pull request as ready for review February 20, 2020 09:46

chrishkchris mentioned this pull request Feb 20, 2020

Dev branch cpu training problem (with conv and pool) #591

Closed

nudles merged commit fec5f8a into apache:dev Feb 21, 2020
