Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-266] Fix cudnn_conv and cudnn_deconv deadlock #10392

Merged
merged 6 commits into from
Apr 4, 2018

Conversation

reminisce
Copy link
Contributor

@reminisce reminisce commented Apr 4, 2018

Description

This PR is expected to address the deadlock issue #10341 introduced by #9677.

Fixed the issue by not pushing the async function into the engine since this block of code is already being executed by an worker thread for a gpu context. Both cudnn_conv and cudnn_deconv are fixed. Previous unit test that was temporarily disabled due to the deadlock is re-enabled.

@piiswrong @eric-haibin-lin @zheng-da

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

Copy link
Contributor

@marcoabreu marcoabreu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job, thanks a lot for fixing this so quickly!

@reminisce
Copy link
Contributor Author

@KellenSunderland @marcoabreu Thanks for the initial investigation of the root cause. Even though this should be a safe fix, I am still not clear why the deadlock could happen. It seems that a thread was running the pushed async function again and again, and hence the worker thread has to wait until the var is released. Could you provide some insights. GDB doesn't seem to help very much here. I had to add a lot of logging messages to see the execution.

@reminisce reminisce changed the title Fix cudnn_conv and cudnn_deconv deadlock [MXNET-266] Fix cudnn_conv and cudnn_deconv deadlock Apr 4, 2018
&back_algo_w_)) {
// Not in algo registry, must determine via *Get*() or *Find*()
Engine::VarHandle var = Engine::Get()->NewVariable();
Engine::Get()->PushAsync([=](RunContext rctx, Engine::CallbackOnComplete on_complete) {
mshadow::Stream<gpu> *s = rctx.get_stream<gpu>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indentation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The indentation of this piece of code is indeed not right. I changed it in a previous commit but it resulted in huge code difference that makes code review much harder. I can make the indentation right either after this PR is merged or after everyone finishes reviewing the current code changes. :)

&back_algo_, &back_algo_w_)) {
// Not in algo registry, must determine via *Get*() or *Find*()
Engine::VarHandle var = Engine::Get()->NewVariable();
Engine::Get()->PushAsync([=](RunContext rctx, Engine::CallbackOnComplete on_complete) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indentation?

@piiswrong
Copy link
Contributor

Please fix indentation and we can merge

@reminisce
Copy link
Contributor Author

Indentation fixed. Let's wait till the CI passes.

@piiswrong piiswrong merged commit f0f8af1 into apache:master Apr 4, 2018
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
* Fix deadlock of cudnn_conv wrapper

* Fix deconv deadlock

* Fix lint

* Revert "Fix lint"

This reverts commit 66f0936.

* Fix lint

* Fix indentation
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
* Fix deadlock of cudnn_conv wrapper

* Fix deconv deadlock

* Fix lint

* Revert "Fix lint"

This reverts commit 66f0936.

* Fix lint

* Fix indentation
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants