Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Fix omp fork race conditions and assert issue #17039

Merged
merged 1 commit into from Dec 11, 2019

Conversation

cjolivier01
Copy link
Member

Description

Fix for: #14979

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • [ X] Changes are complete (i.e. I finished coding on this PR)
  • [X ] All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • [ X] Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • [X ] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Comments

  • When forking, sometimes mxnet can hook its fork handler before openmp's, depending upon static startup order as well as variables such as whether OMP_NUM_THREADS is set. This is mitigated by forcing an OMP call before mxnet hooks a fork handler as well as first-thing after forking as the child. This avoids two scenarios:

    1. An OMP API call by mxnet before openmp's child fork handler has run, reinitializing its state to the new process
    2. Starting a thread which may call an OpenMP API call (or open a parallel region) before or during OMP library's child fork handler.
  • A considerable amount of multithreading is done by mxnext both during static init as ell as in atfork() handlers, and we may wish to change this over time. Due to the non-deterministic startup order of static init (and thus sometimes fork handling), it can cause unpredictable results.

  • Build omp in release mode to stop harmless assert. Comment notes that assert is harmless and represents a variable that fails to get set to null in their atfork_child handler (they set other similar variables to to null). The variable is set to NULL in the next line of code.

@cjolivier01 cjolivier01 changed the title Fix omp assert issue Fix omp fork race conditions and assert issue Dec 11, 2019
@cjolivier01
Copy link
Member Author

I actually also did change (removed from this PR in order to keep it simple) that stops forcing the creation of the engine during static init and the fork handlers, if it is desired. This has a couple of effects including reducing the overhead of resources and threads created simply by the library being loaded, which then, in turn, causes a lot of thread churn (Stop(), Start()) when forking. Even if hey never use mxnet, we spin up a lot of stuff unnecessarily. It actually gets lot faster on startp when this is done.

Copy link
Member

@anirudh2290 anirudh2290 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix! LGTM!

@szha
Copy link
Member

szha commented Dec 11, 2019

I actually also did change (removed from this PR in order to keep it simple) that stops forcing the creation of the engine during static init and the fork handlers, if it is desired. This has a couple of effects including reducing the overhead of resources and threads created simply by the library being loaded, which then, in turn, causes a lot of thread churn (Stop(), Start()) when forking. Even if hey never use mxnet, we spin up a lot of stuff unnecessarily. It actually gets lot faster on startp when this is done.

The change sounds useful. Do you plan to commit the change?

@cjolivier01
Copy link
Member Author

I actually also did change (removed from this PR in order to keep it simple) that stops forcing the creation of the engine during static init and the fork handlers, if it is desired. This has a couple of effects including reducing the overhead of resources and threads created simply by the library being loaded, which then, in turn, causes a lot of thread churn (Stop(), Start()) when forking. Even if hey never use mxnet, we spin up a lot of stuff unnecessarily. It actually gets lot faster on startp when this is done.

The change sounds useful. Do you plan to commit the change?

If I have time soon

leezu pushed a commit to leezu/mxnet that referenced this pull request Dec 30, 2019
leezu added a commit that referenced this pull request Dec 30, 2019
* Upgrade 3rdparty/openmp to release_90 version (#17012)

Fixes #10856

* Fix omp assert issue (#17039)

* Disable OpenMP offloading support for 3rdparty/openmp (#17098)

* Disable OpenMP offloading support for 3rdparty/openmp

OpenMP offloading was introduced some time during the past two years and is
enabled by default. With upgrading 3rdparty/openmp in
#17012
it was made part of the MXNet CMake build.

But we don't use OpenMP offloading and the Cuda target in the llvm OpenMP
Offloading build is broken in our setting.

* Update CMake on CI

* Fix reshape interoperability test (#17155)

* fix reshape interoperability test

* fix for scipy import

Co-authored-by: Chris Olivier <cjolivier01@gmail.com>
Co-authored-by: Hao Jin <hjjn.amzn@gmail.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants