Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Fix Horovod build error due to missing exported symbols #17348

Merged
merged 4 commits into from
Jan 22, 2020

Conversation

apeforest
Copy link
Contributor

@apeforest apeforest commented Jan 16, 2020

Description

Export MXAPIHandle symbol and fixes #17292

Here are the causes of the problem:

  1. Horovod uses MX_API_BEGIN() and MX_API_END() from mxnet/c_api_error.h to catch and throw errors in horovod APIs: https://github.com/horovod/horovod/blob/master/horovod/mxnet/mpi_ops.cc#L224
    MX_API_BEGIN() is a macro that calls MXAPIHandleException https://github.com/apache/incubator-mxnet/blob/master/include/mxnet/c_api_error.h#L36. Before [DEBUG] enable custom error type #17128, MXAPIHandleException is an inline function. And therefore when [DEBUG] enable custom error type #17128 introduced a new function call NormalizeError() inside MXAPIHandleException it broke Horovod integration because the symbol of NormalizeError is not whitelist by MXNet distribution.
  2. [MXNET-1438] Adding SDML loss function #17298 removed NormalizeError() from MXAPIHandleException and made it not inline. https://github.com/apache/incubator-mxnet/pull/17208/files#diff-875aa4c013dbd73b044531e439e8afddR67. This time the error becomes undefined symbol of MXAPIHandleException.

So to summarize, the problem is not that Horovod requires MXAPIHandleException function to be inline. The rootcause is that MXNet did not whitelist the symbol MXAPIHandleException but only whitelisted the symbols that are being used inside MXAPIHandleException function. It was okay when the function MXAPIHandleException is inline, but became a problem when it's not. A good practice is to whitelist symbol MXAPIHandleException instead of its internal functions.

Comments

@szha Please help to review.

@apeforest apeforest requested a review from szha as a code owner January 16, 2020 22:34
make/config/libmxnet.sym Outdated Show resolved Hide resolved
@apeforest apeforest force-pushed the bugfix/horovod_build branch 2 times, most recently from 264b53b to a275cf0 Compare January 19, 2020 17:18
@apeforest
Copy link
Contributor Author

@szha Thanks for the review and update. Please let me know if this is good to ship.

@szha szha merged commit a967d37 into apache:master Jan 22, 2020
@apeforest apeforest deleted the bugfix/horovod_build branch January 22, 2020 00:34
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Can't run horovod with latest nightly wheel
2 participants