This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-362] ensure same mkldnn engine is used for consistency #10616

Merged
5 commits merged on Apr 28, 2018

@ashokei ashokei commented Apr 19, 2018

Description

Gluon data iterators may trigger a different thread for the execution context, which causes the mkl-dnn engine in use to be inconsistent. The following snippet reproduces the issue.

import numpy as np
import mxnet as mx
from mxnet import gluon, nd

# Single convolution layer, initialized on the CPU context.
net = gluon.nn.HybridSequential()
with net.name_scope():
    net.add(gluon.nn.Conv2D(channels=32, kernel_size=3, activation=None))
net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=mx.cpu())

# With num_workers=1, iterating the loader triggers a different thread context.
val_data = gluon.data.DataLoader(
    gluon.data.vision.CIFAR10(train=False),
    batch_size=32, shuffle=False, num_workers=1)

# output should be 0.57521844
X = (32, 3, 32, 32)
y = net(nd.array(np.ones(X))).asnumpy()
print(y[0][0][0][0])

# Looping with `for _ in range(1):` instead works.
# Looping over the DataLoader, as below, triggers the bug.
for _ in val_data:
    y = net(nd.array(np.ones(X))).asnumpy()
    print(y[0][0][0][0])
    break
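
To picture the kind of inconsistency described above, here is a minimal pure-Python sketch. It assumes, which the description does not state explicitly, that the problem comes from each thread ending up with its own engine object. `Engine`, `get_thread_local_engine`, and `get_shared_engine` are hypothetical names used only for illustration; they are not MXNet or MKL-DNN APIs.

```python
import threading


class Engine:
    """Hypothetical stand-in for a CPU engine object (not an MXNet class)."""


_local = threading.local()  # one attribute namespace per thread
_shared = Engine()          # one instance for the whole process


def get_thread_local_engine():
    # Each thread lazily creates its own engine on first use.
    if not hasattr(_local, "engine"):
        _local.engine = Engine()
    return _local.engine


def get_shared_engine():
    # Every thread receives the same engine object.
    return _shared


def engines_seen(getter):
    """Return the engine seen by the main thread and by a worker thread."""
    seen = [getter()]
    worker = threading.Thread(target=lambda: seen.append(getter()))
    worker.start()
    worker.join()
    return seen


local_engines = engines_seen(get_thread_local_engine)
shared_engines = engines_seen(get_shared_engine)
print("thread-local getter, same engine:", local_engines[0] is local_engines[1])
print("shared getter, same engine:", shared_engines[0] is shared_engines[1])
```

With the per-thread getter the worker thread gets a different object than the main thread; with the shared getter both threads see the same one, which is the property the PR title asks for.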

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@@ -67,7 +67,8 @@ class CpuEngine {
public:
static CpuEngine *Get() {
// It's thread-safe in C++11.
Member

So do we need to remove this comment line?

Contributor Author

The comment was correct: it says the mkldnn engine is thread-safe to use in the mkldnn C++ API.
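
As background for the code comment under discussion: in C++11, a function-local `static` object is initialized exactly once, even if several threads call the function at the same time. The sketch below is a rough Python analogue of that guarantee, using an explicit lock. `Engine` and `get_engine` are hypothetical names for illustration only, not the MXNet implementation.

```python
import threading


class Engine:
    """Hypothetical stand-in for the engine object (not an MXNet class)."""
    created = 0  # counts how many engines were ever constructed

    def __init__(self):
        Engine.created += 1


_engine = None
_engine_lock = threading.Lock()


def get_engine():
    """Create the engine on the first call; later calls return the same object."""
    global _engine
    with _engine_lock:  # only one thread at a time may run the construction
        if _engine is None:
            _engine = Engine()
    return _engine


results = []
threads = [threading.Thread(target=lambda: results.append(get_engine()))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("engines constructed:", Engine.created)
print("all threads saw one object:", all(e is results[0] for e in results))
```

Note that once-only initialization and sharing are separate properties: a getter can initialize safely and still hand a different object to each thread if the storage is per-thread.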

ashokei commented Apr 25, 2018

@zheng-da I added a unit test. @marcoabreu can you please merge if it looks OK? Thanks.


val_data = gluon.data.DataLoader(
    gluon.data.vision.CIFAR10(train=False),
    batch_size=32, shuffle=False, num_workers=1)
Contributor

Can you remove this data loader? It doesn't seem that this test needs it.

Contributor Author

@ashokei ashokei Apr 26, 2018

Actually, the DataLoader is what triggers the different thread context. I tried without a DataLoader (and also with num_workers=0 passed to the DataLoader above); in both cases everything runs on the same thread, so the bug does not occur. The Gluon DataLoader lets us create a new thread as it iterates over the data batches.
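
The behaviour described in this comment can be pictured with a toy loader. This is only a sketch under stated assumptions: `ToyLoader` is a hypothetical class that produces batches on the caller's thread when `num_workers` is 0 and on a background thread otherwise. It is not how `gluon.data.DataLoader` is implemented; it only illustrates why the worker setting changes which thread is involved.

```python
import queue
import threading


class ToyLoader:
    """Hypothetical loader; a stand-in, not gluon.data.DataLoader."""

    def __init__(self, batches, num_workers=0):
        self.batches = list(batches)
        self.num_workers = num_workers

    def __iter__(self):
        if self.num_workers == 0:
            # No worker: batches are produced on the caller's thread.
            for batch in self.batches:
                yield batch, threading.get_ident()
            return
        # With a worker: a background thread puts batches into a queue.
        out = queue.Queue()

        def produce():
            for batch in self.batches:
                out.put((batch, threading.get_ident()))
            out.put(None)  # sentinel: no more batches

        threading.Thread(target=produce, daemon=True).start()
        while True:
            item = out.get()
            if item is None:
                return
            yield item


main_id = threading.get_ident()
# For each batch, record whether it was produced on the main thread.
same_thread = [tid == main_id for _, tid in ToyLoader([1, 2], num_workers=0)]
other_thread = [tid == main_id for _, tid in ToyLoader([1, 2], num_workers=1)]
print("num_workers=0, produced on main thread:", same_thread)
print("num_workers=1, produced on main thread:", other_thread)
```

In this toy version, only the `num_workers=1` case ever involves a second thread, which mirrors the observation that the bug does not appear with `num_workers=0`.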

Contributor

Thanks a lot for this test! Could you please explain the exact behaviour of this unit test in a block comment inside the test? I agree with Da that, at the moment, it's hard to grasp the exact problem from reading the code. For me, it's hard to understand when the different threads get started and what the exact issue is.

Contributor

Yes, please add comments explaining why we need the data loader here. Thanks.

Contributor Author

The Gluon DataLoader lets us create a new thread as it iterates over the data batches. I added comments to the PR.

Contributor

@marcoabreu marcoabreu left a comment

Thanks a lot! Please add a jira ticket and we're good to go!

@ashokei ashokei changed the title ensure same mkldnn engine is used for consistency [MXNET-362] ensure same mkldnn engine is used for consistency Apr 26, 2018
ashokei commented Apr 26, 2018

ashokei commented Apr 28, 2018

@zheng-da I updated the test to use dummy data. Can you please review and accept if it looks OK?

@marcoabreu marcoabreu merged commit f0d2776 into apache:master Apr 28, 2018
@zheng-da
Contributor

@ashokei it seems that the test can't reproduce the bug when it uses dummy data.

anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request May 7, 2018
…#10616)

* ensure same mkldnn engine is used for consistency

* add unittest for mkldnn engine thread testing

* add comments for thread context switching

* fix lint issue

* use dummy data
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 8, 2018
anirudh2290 pushed a commit that referenced this pull request Jun 13, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018