This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

horovod seg-fault with mxnet pip wheels #18772

Open

eric-haibin-lin opened this issue Jul 23, 2020 · 4 comments

@eric-haibin-lin
Member

I am working on a bug fix for mxnet master with my horovod branch: https://github.com/eric-haibin-lin/horovod/tree/mx2

I noticed that the example passes if I use mxnet built from source:

# install mxnet 
git clone --recursive https://github.com/apache/incubator-mxnet.git mxnet
cd mxnet
cp config/linux.cmake config.cmake
rm -rf build
mkdir -p build && cd build
cmake -GNinja ..
cmake --build . --parallel 48
cd ../python; python setup.py develop --user; 
cd ./mxnet; ln -s ../../include include; ln -s ../../3rdparty 3rdparty; 

# install horovod 
cd horovod; python setup.py install --user; 

# run example 
cd example; horovodrun -np 2 python mxnet2_mnist.py 

However, it segfaults immediately after the first broadcast call if I use the mxnet nightly pip wheel from https://repo.mxnet.io/dist/python, such as:
https://repo.mxnet.io/dist/python/cpu/mxnet-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl

----------Python Info----------
Version      : 3.7.6
Compiler     : GCC 7.3.1 20180712 (Red Hat 7.3.1-6)
Build        : ('default', 'Feb 26 2020 20:54:15')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 20.1.1
Directory    : /home/ec2-user/.local/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version      : 2.0.0
Directory    : /home/ec2-user/src/mxnet/python/mxnet
Num GPUs     : 0
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Linux-4.14.173-137.229.amzn2.x86_64-x86_64-with-glibc2.2.5
system       : Linux
node         : ip-172-31-81-80.ec2.internal
release      : 4.14.173-137.229.amzn2.x86_64
version      : #1 SMP Wed Apr 1 18:06:08 UTC 2020
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping:            7
CPU MHz:             1208.761
BogoMIPS:            4999.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-47
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
@eric-haibin-lin
Member Author

[1,0]<stdout>:(gdb) bt
[1,0]<stdout>:#0  0x00007ffff7419b80 in pthread_mutex_lock () from /lib64/libpthread.so.0
[1,0]<stdout>:#1  0x00007fff68a1b81d in mxnet::engine::ThreadedVar::AppendWriteDependency(mxnet::engine::OprBlock*) ()
[1,0]<stdout>:   from /home/ec2-user/.local/lib/python3.7/site-packages/mxnet/libmxnet.so
[1,0]<stdout>:#2  0x00007fff68a176ff in mxnet::engine::ThreadedEngine::Push(mxnet::engine::Opr*, mxnet::Context, int, bool) ()
[1,0]<stdout>:   from /home/ec2-user/.local/lib/python3.7/site-packages/mxnet/libmxnet.so
[1,0]<stdout>:#3  0x00007fff68a147a7 in mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool) ()
[1,0]<stdout>:   from /home/ec2-user/.local/lib/python3.7/site-packages/mxnet/libmxnet.so
[1,0]<stdout>:#4  0x00007fff688f5f42 in MXEnginePushAsync ()
[1,0]<stdout>:   from /home/ec2-user/.local/lib/python3.7/site-packages/mxnet/libmxnet.so
[1,0]<stdout>:#5  0x00007ffdcc11ace9 in horovod::mxnet::PushHorovodOperation (
[1,0]<stdout>:    op_type=op_type@entry=horovod::common::Request::BROADCAST,
[1,0]<stdout>:    input=input@entry=0x182fb90, output=output@entry=0x182fb90,
[1,0]<stdout>:    name=name@entry=0x7ffdd5e63f20 "0.bias", priority=priority@entry=0,
[1,0]<stdout>:    root_rank=root_rank@entry=0) at horovod/mxnet/mpi_ops.cc:138
[1,0]<stdout>:#6  0x00007ffdcc116010 in horovod::mxnet::horovod_mxnet_broadcast_async (
[1,0]<stdout>:    input=0x182fb90, output=0x182fb90, name=0x7ffdd5e63f20 "0.bias",
[1,0]<stdout>:    root_rank=0, priority=0) at horovod/mxnet/mpi_ops.cc:301

@leezu
Contributor

leezu commented Jul 23, 2020

Horovod includes the MXNet C++ headers and interacts with the Engine through them:

https://github.com/horovod/horovod/blob/cf022be959a7c9431a8415729758b26dec1a87e5/horovod/mxnet/mpi_ops.h#L23-L24

But C++ does not have a stable ABI, and your Horovod build may not use the same ABI as the MXNet binary wheel. Could this be the source of the crash? Have you tried reproducing this by building Horovod in the same container that is used for building the binary wheels?

I tried the following steps to compile in the container and it works fine. I think we can conclude that there is an ABI mismatch between the gcc7 toolchain provided by CentOS 7 (https://www.softwarecollections.org/en/scls/rhscl/devtoolset-7/), which is used to build the binary wheels, and the compiler you used on AL2.

docker run --privileged --cap-add=NET_ADMIN --gpus=all  -it mxnetci/build.centos7_gpu_cu102 /usr/sbin/init
docker container list
docker container exec -it aa5253f2282f bash
source /opt/rh/devtoolset-7/enable
source /opt/rh/rh-python36/enable
pip install pyyaml cffi
cd /usr/local/src
wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.4.tar.gz
tar xf openmpi-4.0.4.tar.gz
cd openmpi-4.0.4
./configure --prefix=/usr/local
make all install -j$(nproc)
git clone --recursive -b mx2 https://github.com/eric-haibin-lin/horovod.git
cd horovod
pip install https://repo.mxnet.io/dist/python/cu102/mxnet_cu102-2.0.0b20200721-py2.py3-none-manylinux2014_x86_64.whl
HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_TENSORFLOW=1 python setup.py install --user

cd examples
yum install openssh-server
systemctl start sshd
/root/.local/bin/horovodrun -np 2 python /mnt/horovod/examples/mxnet2_mnist.py

Output

Thu Jul 23 21:04:17 2020[1]<stderr>:INFO:root:[Epoch 0 Batch 100] Training: accuracy=0.860938
Thu Jul 23 21:04:17 2020[0]<stderr>:INFO:root:[Epoch 0 Batch 100] Training: accuracy=0.853594
Thu Jul 23 21:04:18 2020[0]<stderr>:INFO:root:[Epoch 0 Batch 200] Training: accuracy=0.908203
Thu Jul 23 21:04:18 2020[1]<stderr>:INFO:root:[Epoch 0 Batch 200] Training: accuracy=0.913125
Thu Jul 23 21:04:19 2020[1]<stderr>:INFO:root:[Epoch 0 Batch 300] Training: accuracy=0.933281
Thu Jul 23 21:04:19 2020[0]<stderr>:INFO:root:[Epoch 0 Batch 300] Training: accuracy=0.930937
Thu Jul 23 21:04:20 2020[0]<stderr>:INFO:root:[Epoch 0 Batch 400] Training: accuracy=0.942305
Thu Jul 23 21:04:20 2020[1]<stderr>:INFO:root:[Epoch 0 Batch 400] Training: accuracy=0.943477
Thu Jul 23 21:04:20 2020[0]<stderr>:INFO:root:Epoch[0]  Speed=15403.68 samples/s        Time cost=3.888941
Thu Jul 23 21:04:21 2020[0]<stderr>:INFO:root:Epoch[0]  Train: accuracy=0.947683        Validation: accuracy=0.981370
Thu Jul 23 21:04:22 2020[0]<stderr>:INFO:root:[Epoch 1 Batch 100] Training: accuracy=0.982031
Thu Jul 23 21:04:22 2020[1]<stderr>:INFO:root:[Epoch 1 Batch 100] Training: accuracy=0.980938
Thu Jul 23 21:04:23 2020[0]<stderr>:INFO:root:[Epoch 1 Batch 200] Training: accuracy=0.984453
Thu Jul 23 21:04:23 2020[1]<stderr>:INFO:root:[Epoch 1 Batch 200] Training: accuracy=0.982266
Thu Jul 23 21:04:24 2020[0]<stderr>:INFO:root:[Epoch 1 Batch 300] Training: accuracy=0.985000
Thu Jul 23 21:04:24 2020[1]<stderr>:INFO:root:[Epoch 1 Batch 300] Training: accuracy=0.983958
Thu Jul 23 21:04:25 2020[0]<stderr>:INFO:root:[Epoch 1 Batch 400] Training: accuracy=0.984883
Thu Jul 23 21:04:25 2020[1]<stderr>:INFO:root:[Epoch 1 Batch 400] Training: accuracy=0.983828
Thu Jul 23 21:04:25 2020[0]<stderr>:INFO:root:Epoch[1]  Speed=14106.52 samples/s        Time cost=4.246548
Thu Jul 23 21:04:26 2020[0]<stderr>:INFO:root:Epoch[1]  Train: accuracy=0.985443        Validation: accuracy=0.985877
Thu Jul 23 21:04:27 2020[0]<stderr>:INFO:root:[Epoch 2 Batch 100] Training: accuracy=0.988594
Thu Jul 23 21:04:27 2020[1]<stderr>:INFO:root:[Epoch 2 Batch 100] Training: accuracy=0.987656
Thu Jul 23 21:04:28 2020[0]<stderr>:INFO:root:[Epoch 2 Batch 200] Training: accuracy=0.989922
Thu Jul 23 21:04:28 2020[1]<stderr>:INFO:root:[Epoch 2 Batch 200] Training: accuracy=0.988125
Thu Jul 23 21:04:29 2020[0]<stderr>:INFO:root:[Epoch 2 Batch 300] Training: accuracy=0.989948
Thu Jul 23 21:04:29 2020[1]<stderr>:INFO:root:[Epoch 2 Batch 300] Training: accuracy=0.988958
Thu Jul 23 21:04:30 2020[0]<stderr>:INFO:root:[Epoch 2 Batch 400] Training: accuracy=0.989805
Thu Jul 23 21:04:30 2020[1]<stderr>:INFO:root:[Epoch 2 Batch 400] Training: accuracy=0.989062
Thu Jul 23 21:04:30 2020[0]<stderr>:INFO:root:Epoch[2]  Speed=14098.05 samples/s        Time cost=4.249099
Thu Jul 23 21:04:31 2020[0]<stderr>:INFO:root:Epoch[2]  Train: accuracy=0.990051        Validation: accuracy=0.988181
Thu Jul 23 21:04:32 2020[0]<stderr>:INFO:root:[Epoch 3 Batch 100] Training: accuracy=0.993281
Thu Jul 23 21:04:32 2020[1]<stderr>:INFO:root:[Epoch 3 Batch 100] Training: accuracy=0.990625
Thu Jul 23 21:04:33 2020[0]<stderr>:INFO:root:[Epoch 3 Batch 200] Training: accuracy=0.993359
Thu Jul 23 21:04:33 2020[1]<stderr>:INFO:root:[Epoch 3 Batch 200] Training: accuracy=0.991172
Thu Jul 23 21:04:34 2020[1]<stderr>:INFO:root:[Epoch 3 Batch 300] Training: accuracy=0.991927
Thu Jul 23 21:04:34 2020[0]<stderr>:INFO:root:[Epoch 3 Batch 300] Training: accuracy=0.993125
Thu Jul 23 21:04:35 2020[0]<stderr>:INFO:root:[Epoch 3 Batch 400] Training: accuracy=0.993008
Thu Jul 23 21:04:35 2020[1]<stderr>:INFO:root:[Epoch 3 Batch 400] Training: accuracy=0.992031
Thu Jul 23 21:04:35 2020[0]<stderr>:INFO:root:Epoch[3]  Speed=14035.98 samples/s        Time cost=4.267888
Thu Jul 23 21:04:36 2020[0]<stderr>:INFO:root:Epoch[3]  Train: accuracy=0.993323        Validation: accuracy=0.989984
Thu Jul 23 21:04:37 2020[0]<stderr>:INFO:root:[Epoch 4 Batch 100] Training: accuracy=0.995625
Thu Jul 23 21:04:37 2020[1]<stderr>:INFO:root:[Epoch 4 Batch 100] Training: accuracy=0.994219
Thu Jul 23 21:04:38 2020[1]<stderr>:INFO:root:[Epoch 4 Batch 200] Training: accuracy=0.995000
Thu Jul 23 21:04:38 2020[0]<stderr>:INFO:root:[Epoch 4 Batch 200] Training: accuracy=0.996250
Thu Jul 23 21:04:39 2020[1]<stderr>:INFO:root:[Epoch 4 Batch 300] Training: accuracy=0.995260
Thu Jul 23 21:04:39 2020[0]<stderr>:INFO:root:[Epoch 4 Batch 300] Training: accuracy=0.995313
Thu Jul 23 21:04:40 2020[0]<stderr>:INFO:root:[Epoch 4 Batch 400] Training: accuracy=0.995039
Thu Jul 23 21:04:40 2020[1]<stderr>:INFO:root:[Epoch 4 Batch 400] Training: accuracy=0.995195
Thu Jul 23 21:04:40 2020[0]<stderr>:INFO:root:Epoch[4]  Speed=14055.33 samples/s        Time cost=4.262014
Thu Jul 23 21:04:41 2020[0]<stderr>:INFO:root:Epoch[4]  Train: accuracy=0.995493        Validation: accuracy=0.991486

@leezu
Contributor

leezu commented Jul 23, 2020

We may want to remove the C++ API headers from the pip package, to prevent anyone from relying on the C++ ABI by mistake. As soon as someone uses the C++ API headers to create C++ objects in their library and then passes them to libmxnet.so via some C API or even the Python API, an ABI mismatch can cause a crash.

@eric-haibin-lin
Member Author

Thanks for the investigation, and good catch about the C++ headers. I agree. We need to rewrite the integration code using only the C APIs to avoid ABI compatibility issues.
