horovod seg-fault with mxnet pip wheels #18772
Comments
|
Horovod includes the MXNet C++ headers and uses them to interact with the Engine. However, C++ does not have a stable ABI, and your Horovod build may not have been compiled with the same ABI as the MXNet binary wheel. Could this be the source of the crash?
I tried the following steps to compile in the container, and it works fine. I think we can conclude that there is an ABI mismatch between the gcc7 compiler provided for CentOS 7 (devtoolset-7, https://www.softwarecollections.org/en/scls/rhscl/devtoolset-7/) and the compiler you used on AL2.
Output
|
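For anyone who wants a quick sanity check of the ABI theory, here is a rough sketch (not something that was run in this thread). It scans a shared library for the mangled `std::__cxx11` namespace prefix, which libstdc++ uses for its newer `std::string` and `std::list` types. The function names are made up for illustration, and the check is only a heuristic: a stripped library, or one that never exposes those types, will report `False` regardless of how it was built, and ABI mismatches can have other causes as well.

```python
from pathlib import Path

# Itanium-mangled prefix of the std::__cxx11 namespace used by libstdc++
# for its newer (GCC 5+) std::string / std::list implementations.
CXX11_MARKER = b"St7__cxx11"


def references_cxx11_abi(library_path):
    """Heuristic: True if the file contains symbols from std::__cxx11."""
    return CXX11_MARKER in Path(library_path).read_bytes()


def abi_report(paths):
    """Map each path (as a string) to the result of the heuristic check."""
    return {str(p): references_cxx11_abi(p) for p in paths}
```

Calling `abi_report([...])` with the path to `libmxnet.so` from the wheel and the path to the compiled Horovod MXNet extension (both paths depend on your environment) gives one boolean per file. If the two values differ, that is a hint, not proof, that the binaries were built with different libstdc++ ABI settings.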
We may want to remove the C++ API headers from the pip package to prevent anyone from relying on the C++ ABI by mistake. As soon as someone uses the C++ API headers to create C++ objects in their library and then passes them to libmxnet.so via some C API or even the Python API, an ABI mismatch can cause a crash. |
Thanks for the investigation, and good catch about the C++ headers. I agree: we need to rewrite the integration code using only the C APIs to avoid ABI compatibility issues. |
I am working on a bug fix for mxnet master with my horovod branch: https://github.com/eric-haibin-lin/horovod/tree/mx2
I noticed that the example passes if I use mxnet built from source:
However, it segfaults immediately after the first broadcast call if I use an mxnet nightly pip wheel from https://repo.mxnet.io/dist/python, such as:
https://repo.mxnet.io/dist/python/cpu/mxnet-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl