Build Error with mxnet extension #5

Closed
haoxintong opened this issue Jun 27, 2019 · 9 comments
Labels
bug Something isn't working

Comments

@haoxintong
Contributor

Describe the bug
Building BytePS with the MXNet extension fails.

The output of import byteps.mxnet as bps is:

OSError: /home/anaconda3/lib/python3.5/site-packages/byteps-0.1.0-py3.5-linux-x86_64.egg/byteps/mxnet/c_lib.cpython-35m-x86_64-linux-gnu.so: cannot open shared object file: No such file or directory

Envs

  • OS: ubuntu16.04 and 18.04
  • GCC version: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
  • CUDA and NCCL version: cuda10.0
  • MXNet version: cu100 1.5.0b20190418

To Reproduce

python setup.py install

Error Info:

byteps/mxnet/tensor_util.cc: In static member function ‘static void byteps::mxnet::TensorUtil::ResizeNd(mxnet::NDArray*, int, int64_t*)’:
byteps/mxnet/tensor_util.cc:139:29: error: no matching function for call to ‘mxnet::TShape::TShape(int&)’
   TShape mx_shape(nDimension);
                             ^

In mxnet/tuple.h there is no constructor that takes only a dimension argument, so I changed the code from TShape mx_shape(nDimension) to TShape mx_shape(nDimension, 0).
After that change, it works fine for me.

I'm not sure whether the cause is the MXNet version.

@ymjiang
Member

ymjiang commented Jun 27, 2019

Mind sharing your mxnet version? In our tests, MXNet 1.4.1 should work normally.
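
For anyone answering the version question, here is a minimal sketch of one way to report it from Python. The helper name report_version is hypothetical and not part of BytePS or MXNet; it only reads the module's __version__ attribute, and it also catches OSError because a broken native library can surface that way on import, as the traceback earlier in this thread shows.

```python
import importlib


def report_version(module_name="mxnet"):
    """Return the module's __version__ string, or None if it cannot be loaded.

    Hypothetical helper for bug reports; not part of BytePS or MXNet.
    """
    try:
        module = importlib.import_module(module_name)
    except (ImportError, OSError):
        # ImportError: the package is not installed.
        # OSError: a shared library behind the package failed to load.
        return None
    return getattr(module, "__version__", None)


if __name__ == "__main__":
    print("mxnet version:", report_version("mxnet"))
```

Pasting the printed value into the issue lets maintainers compare it against the versions they test with.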

@bobzhuyb
Member

Thank you for the report! It's probably an MXNet version issue. We will fix it soon.

@haoxintong
Contributor Author

I hit another problem when I try to run the test:

DMLC_ROLE=worker \
DMLC_PS_ROOT_URI=10.0.0.1 \
DMLC_PS_ROOT_PORT=9000 \
DMLC_WORKER_ID=0 \
DMLC_NUM_WORKER=1 \
DMLC_NUM_SERVER=1 \
python ./launcher/launch.py python ./tests/test_mxnet.py 

it fails with:

BytePS launching worker
[2019-06-27 19:08:17.340669: F byteps/common/communicator.cc:135] Check failed: (ret) >= (0) /usr/local/socket_send_0 bind failed: Permission denied
Aborted (core dumped)
Exception in thread Thread-1:
Traceback (most recent call last):
 ...
 ...
subprocess.CalledProcessError: Command 'python test_mxnet.py' returned non-zero exit status 134

Before this, I also had to convert the Python 2 style print statements in launcher/launch.py to Python 3 print() calls. :P
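
The print conversion mentioned above looks roughly like the sketch below. The function announce is hypothetical and is not copied from launcher/launch.py; only the message text comes from the log output in this thread.

```python
# Python 2 statement form, which is a SyntaxError on Python 3:
#     print "BytePS launching " + role
#
# Python 3 function form. It also runs on Python 2 if the file starts with
# "from __future__ import print_function".


def announce(role):
    """Print and return the launch message for the given role (hypothetical helper)."""
    message = "BytePS launching " + role
    print(message)
    return message


if __name__ == "__main__":
    announce("worker")
```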

@ymjiang
Member

ymjiang commented Jun 27, 2019

@haoxintong Before you launch the job, can you check if /usr/local/socket_send_0 exists? (it shouldn't) If it does, rm it and try again.

@bobzhuyb
Member

Do you have write permission to /usr/local/socket_send_0?

@ymjiang I think we should at least make this path configurable.
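
Both checks suggested above (a leftover socket file, and write permission on its directory) can be run before launching. This is only a sketch under assumptions: diagnose_socket_path is a hypothetical helper, not a BytePS API, and the default path is simply the one from the error message.

```python
import os


def diagnose_socket_path(path="/usr/local/socket_send_0"):
    """Report the two conditions that can make bind() fail on a Unix socket path.

    Hypothetical helper; the default path is taken from the error message above.
    """
    parent = os.path.dirname(path) or "."
    return {
        # A leftover file from an earlier run would have to be removed first.
        "stale_file_exists": os.path.lexists(path),
        # Creating the socket file needs write and search permission on the directory.
        "parent_writable": os.access(parent, os.W_OK | os.X_OK),
    }


if __name__ == "__main__":
    print(diagnose_socket_path())
```

If parent_writable is False for the current user, a "Permission denied" on bind is expected regardless of whether a stale file exists.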

@bobzhuyb added the bug label Jun 27, 2019
@bobzhuyb
Member

By the way, we are facing an issue similar to horovod/horovod#884:
GCC 5.x causes trouble for BytePS + MXNet.

We'll apply a fix similar to Horovod's. Until then, I suggest you try our Dockerfile for MXNet.
https://github.com/bytedance/byteps/blob/master/docker/Dockerfile.worker.mxnet

We'll keep you posted.

@haoxintong
Contributor Author

Hi,
@ymjiang, I checked, and there is no /usr/local/socket_send_0.
@bobzhuyb I ran the command as a non-root user, using an Anaconda Python 3.5 environment, and I have no write access to /usr/local.

When I ran it with "sudo ...", it seemed OK, but I hit a new problem like the one in horovod/horovod#884.

[2019-06-28 10:29:50.923619: F byteps/common/core_loops.cc:267] Check failed: r == ncclSuccess NCCL error: unhandled cuda error
Segmentation fault: 11
Aborted (core dumped)

Besides using Docker, any ideas for solving the permission-denied error and the segmentation fault?

@ymjiang
Member

ymjiang commented Jun 28, 2019

The socket problem should be solved by 5dabf0c.

Regarding the segmentation fault, mind sharing your gcc version? Could you try pinning gcc to 4.9 and then installing BytePS? Like this.

@haoxintong
Contributor Author

I followed the Dockerfile, and it works for me now.
Thank you.
