[MXNET-362] ensure same mkldnn engine is used for consistency #10616

ashokei · 2018-04-19T18:55:26Z

Description

Gluon data iterators may run the execution context on a different thread, which makes the MKL-DNN engine inconsistent across threads. The following snippet reproduces the issue.

import numpy as np
import mxnet as mx
from mxnet import gluon, nd
 
net = gluon.nn.HybridSequential()
with net.name_scope():
    net.add(gluon.nn.Conv2D(channels=32, kernel_size=3, activation=None))
net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=mx.cpu())
 
val_data = gluon.data.DataLoader(
    gluon.data.vision.CIFAR10(train=False),
    batch_size=32, shuffle=False, num_workers=1)
 
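# Reference output, computed before iterating over the DataLoader.
# output should be 0.57521844
X = (32, 3, 32, 32)
y = net(nd.array(np.ones(X))).asnumpy()
print(y[0][0][0][0])

# Replacing the loop header below with `for _ in range(1):` works.
# Iterating over the DataLoader (num_workers=1) triggers the bug.
for _ in val_data:
    y = net(nd.array(np.ones(X))).asnumpy()
    print(y[0][0][0][0])
    break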
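
To illustrate the kind of inconsistency described above, here is a minimal standard-library sketch. It assumes the problem comes from each thread seeing its own engine instance; the PR text does not spell this out. `Engine`, `get_thread_local_engine`, `get_shared_engine`, and `engine_seen_from_worker` are hypothetical names for illustration only and are not part of MXNet or MKL-DNN.

```python
import threading


class Engine:
    """Hypothetical stand-in for a CPU engine handle (not an MXNet class)."""


_local = threading.local()
_shared = Engine()


def get_thread_local_engine():
    # Each thread lazily creates and caches its own Engine instance.
    if not hasattr(_local, "engine"):
        _local.engine = Engine()
    return _local.engine


def get_shared_engine():
    # Every thread receives the same process-wide Engine instance.
    return _shared


def engine_seen_from_worker(getter):
    # Call `getter` on a separate thread and return what it produced.
    result = []
    worker = threading.Thread(target=lambda: result.append(getter()))
    worker.start()
    worker.join()
    return result[0]


main_local = get_thread_local_engine()
worker_local = engine_seen_from_worker(get_thread_local_engine)
print(main_local is worker_local)    # False: one engine per thread

main_shared = get_shared_engine()
worker_shared = engine_seen_from_worker(get_shared_engine)
print(main_shared is worker_shared)  # True: a single engine for all threads
```

With the per-thread getter, state tied to one engine is not visible through the engine another thread obtains. The shared getter avoids that by construction, which is the consistency property the PR title asks for.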

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with this change

Changes

Feature1, tests, (and when applicable, API doc)
Feature2, tests, (and when applicable, API doc)

Comments

If this change is a backward incompatible change, why must this change be made.
Interesting edge cases to note here

TaoLv · 2018-04-20T01:47:43Z

src/operator/nn/mkldnn/mkldnn_base-inl.h

@@ -67,7 +67,8 @@ class CpuEngine {
 public:
  static CpuEngine *Get() {
    // I's thread-safe in C++11.


So do we need to remove this comment line?

The comment was correct: it says the mkldnn engine is thread-safe to use in the mkldnn C++ API.

ashokei · 2018-04-25T20:02:14Z

@zheng-da I added a unit test. @marcoabreu, can you please merge if it looks OK? Thanks.

zheng-da · 2018-04-26T05:43:01Z

tests/python/mkl/test_mkldnn.py

+
+    val_data = gluon.data.DataLoader(
+        gluon.data.vision.CIFAR10(train=False),
+        batch_size=32, shuffle=False, num_workers=1)


Can you remove this data loader? It doesn't seem that this test needs it.

Actually, the DataLoader is what triggers the different thread context. I tried without a DataLoader (and also with num_workers=0 passed to the DataLoader above); in both cases everything runs on the same thread, so the bug won't happen. The Gluon DataLoader lets us create a new thread as it iterates over data batches.

Thanks a lot for this test! Could you please describe the exact behaviour of this unit test in a block comment in the test? I agree with Da that, at the moment, it's hard to grasp the exact problem from reading the code. It's hard for me to understand when the different threads are started and what the exact issue is.

Yes, please add comments explaining why we need the data loader here. Thanks.

The Gluon DataLoader lets us create a new thread as it iterates over data batches. I added comments to the PR.
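
The thread behaviour discussed in this review thread can be sketched with the standard library alone. This is only an analogy under stated assumptions: `load_batches` is a hypothetical stand-in for a data loader, not Gluon's implementation, and its `num_workers` argument merely mimics the parameter named above. It shows that work done inline stays on the calling thread, while work handed to a worker runs under a different thread id.

```python
import threading


def load_batches(batches, num_workers=0):
    """Return (batch, producer_thread_id) pairs.

    Hypothetical stand-in for a data loader; not Gluon's implementation.
    """
    produced = []

    def produce():
        for batch in batches:
            produced.append((batch, threading.get_ident()))

    if num_workers == 0:
        # Batches are produced inline, on the caller's thread.
        produce()
    else:
        # Batches are produced on a separate worker thread.
        worker = threading.Thread(target=produce)
        worker.start()
        worker.join()
    return produced


main_id = threading.get_ident()
inline = load_batches([[1, 2], [3, 4]], num_workers=0)
threaded = load_batches([[1, 2], [3, 4]], num_workers=1)

same_thread_inline = all(tid == main_id for _, tid in inline)
same_thread_threaded = all(tid == main_id for _, tid in threaded)
print(same_thread_inline)    # True
print(same_thread_threaded)  # False
```

A regression test in this spirit would compute a reference output first and then recompute it while a second thread is involved, asserting that both results match.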

marcoabreu

Thanks a lot! Please add a jira ticket and we're good to go!

ashokei · 2018-04-26T23:11:50Z

@marcoabreu https://issues.apache.org/jira/browse/MXNET-362

ashokei · 2018-04-28T03:47:25Z

@zheng-da I updated the test to use dummy data. Can you please review and accept if it looks OK?

zheng-da · 2018-04-28T07:51:25Z

@ashokei It seems that the bug can't be reproduced when using dummy data.

TaoLv reviewed Apr 20, 2018

ashokei force-pushed the mkldnn_engine_threading branch from 2446017 to a293d8c Compare April 25, 2018 20:01

zheng-da reviewed Apr 26, 2018

ashokei force-pushed the mkldnn_engine_threading branch from a293d8c to 3181b8c Compare April 26, 2018 20:31

marcoabreu approved these changes Apr 26, 2018

ashokei changed the title ~~ensure same mkldnn engine is used for consistency~~ [MXNET-362] ensure same mkldnn engine is used for consistency Apr 26, 2018

ashokei force-pushed the mkldnn_engine_threading branch from 3181b8c to ff93243 Compare April 26, 2018 23:47

ashokei added 5 commits April 27, 2018 20:20

ensure same mkldnn engine is used for consistency

7c5d29b

add unittest for mkldnn engine thread testing

49ffacf

add comments for thread context switching

108b443

fix lint issue

518109f

use dummy data

104781c

ashokei force-pushed the mkldnn_engine_threading branch from ff93243 to 104781c Compare April 28, 2018 03:31

marcoabreu merged commit f0d2776 into apache:master Apr 28, 2018
