[Installation] Troubleshooting llvmlite and NCCL_ERROR_UNHANDLED_CUDA_ERROR
#496
Comments
@JiahaoYao thanks! You may consider contributing a section to this troubleshooting doc: https://alpa-projects.github.io/install.html#troubleshooting :) |
Sure, happy to work on that! |
Issue: installed and ran the alpa examples in Docker and hit the error below, even though the cupy communication works.
(base) /alpa% python3 tests/test_install.py
.2022-06-08 17:37:21,276 INFO packaging.py:323 -- Pushing file package 'gcs://_ray_pkg_a61b0f419e5596bad8db35aa15c8886e.zip' (66.42MiB) to Ray cluster...
2022-06-08 17:37:21,970 INFO packaging.py:332 -- Successfully pushed file package 'gcs://_ray_pkg_a61b0f419e5596bad8db35aa15c8886e.zip'.
E2022-06-08 17:37:34,340 ERROR worker.py:244 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::MeshHostWorker.init_p2p_communicator() (pid=18797, ip=172.31.6.91, repr=<alpa.device_mesh.MeshHostWorker object at 0x7efaa83b3370>)
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/device_mesh.py", line 474, in init_p2p_communicator
g.create_p2p_communicator(my_gpu_idx, peer_rank, peer_gpu_idx, nccl_uid)
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 749, in create_p2p_communicator
self._get_nccl_p2p_communicator(comm_key, my_gpu_idx, peer_rank,
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 620, in _get_nccl_p2p_communicator
comm = nccl_util.create_nccl_communicator(2, nccl_uid, my_p2p_rank)
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_util.py", line 108, in create_nccl_communicator
comm = NcclCommunicator(world_size, nccl_unique_id, rank)
File "cupy_backends/cuda/libs/nccl.pyx", line 283, in cupy_backends.cuda.libs.nccl.NcclCommunicator.__init__
File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error
======================================================================
ERROR: test_2_pipeline_parallel (__main__.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "tests/test_install.py", line 105, in test_2_pipeline_parallel
actual_state = p_train_step(state, batch)
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/api.py", line 100, in __call__
executable, _, out_tree, args_flat = self._decode_args_and_get_executable(*args)
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/api.py", line 168, in _decode_args_and_get_executable
executable = _compile_parallel_executable(
File "/home/ray/anaconda3/lib/python3.8/site-packages/jax/linear_util.py", line 272, in memoized_fun
ans = call(fun, *args)
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/api.py", line 194, in _compile_parallel_executable
return method.compile_executable(fun, in_tree, out_tree_thunk,
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/parallel_method.py", line 168, in compile_executable
return compile_pipeshard_executable(
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/pipeline_parallel/compile_executable.py", line 186, in compile_pipeshard_executable
executable = PipeshardDriverExecutable(
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/pipeline_parallel/pipeshard_executable.py", line 125, in __init__
task.create_resharding_communicators()
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 528, in create_resharding_communicators
ray.get(task_dones)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/worker.py", line 2012, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(NcclError): ray::MeshHostWorker.init_p2p_communicator() (pid=18796, ip=172.31.6.91, repr=<alpa.device_mesh.MeshHostWorker object at 0x7fe9bc3812e0>)
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/device_mesh.py", line 474, in init_p2p_communicator
g.create_p2p_communicator(my_gpu_idx, peer_rank, peer_gpu_idx, nccl_uid)
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 749, in create_p2p_communicator
self._get_nccl_p2p_communicator(comm_key, my_gpu_idx, peer_rank,
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 620, in _get_nccl_p2p_communicator
comm = nccl_util.create_nccl_communicator(2, nccl_uid, my_p2p_rank)
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_util.py", line 108, in create_nccl_communicator
comm = NcclCommunicator(world_size, nccl_unique_id, rank)
File "cupy_backends/cuda/libs/nccl.pyx", line 283, in cupy_backends.cuda.libs.nccl.NcclCommunicator.__init__
File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error
----------------------------------------------------------------------
Ran 2 tests in 20.574s
FAILED (errors=1)
(base) ray:j/alpa% python3 -c "from cupy.cuda import nccl"
Solution: use Amazon EC2 / GCP instead. |
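The one-liner above can be wrapped in a slightly friendlier import check that prints the underlying error instead of a raw traceback. This is a sketch; `check_import` is a hypothetical helper, not part of alpa or cupy:

```python
import importlib

def check_import(module_name: str) -> bool:
    """Return True if the module imports cleanly; print the failure otherwise."""
    try:
        importlib.import_module(module_name)
        return True
    except Exception as exc:  # ImportError, cupy backend errors, etc.
        print(f"{module_name}: {type(exc).__name__}: {exc}")
        return False

# e.g. verify the cupy NCCL binding before launching Ray workers:
# check_import("cupy.cuda.nccl")
```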
Issue: Traceback (most recent call last):
File "alpa_mnist_example.py", line 208, in <module>
train_mnist()
File "alpa_mnist_example.py", line 183, in train_mnist
train_func({"train": train_dataset}, config)
File "alpa_mnist_example.py", line 158, in train_func
state, train_loss, train_accuracy = train_epoch(state, jax_dataset)
File "alpa_mnist_example.py", line 121, in train_epoch
state, loss, accuracy = train_step(state, batch_images, batch_labels)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/api.py", line 100, in __call__
executable, _, out_tree, args_flat = self._decode_args_and_get_executable(*args)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/api.py", line 168, in _decode_args_and_get_executable
executable = _compile_parallel_executable(
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/jax/linear_util.py", line 272, in memoized_fun
ans = call(fun, *args)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/api.py", line 194, in _compile_parallel_executable
return method.compile_executable(fun, in_tree, out_tree_thunk,
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/parallel_method.py", line 90, in compile_executable
return compile_shard_executable(fun, in_tree, out_tree_thunk,
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/shard_parallel/compile_executable.py", line 72, in compile_shard_executable
return shard_parallel_internal(
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/shard_parallel/compile_executable.py", line 114, in shard_parallel_internal
hlo_module, strategy_config = run_auto_sharding_pass(
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/shard_parallel/auto_sharding.py", line 309, in run_auto_sharding_pass
compiled_module = xe.run_auto_sharding(xla_computation, compile_options)
TypeError: run_auto_sharding(): incompatible function arguments. The following argument types are supported:
1. (hlo_module: jaxlib.xla_extension.HloModule, compile_options: jaxlib.xla_extension.CompileOptions = <jaxlib.xla_extension.CompileOptions object at 0x7f98f2387670>) -> Status
Invoked with: <jaxlib.xla_extension.XlaComputation object at 0x7f96a004c330>, <jaxlib.xla_extension.CompileOptions object at 0x7f96a7f296f0>
Solution: pip install -U alpa |
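The `run_auto_sharding(): incompatible function arguments` TypeError typically comes from mismatched alpa/jaxlib builds, which `pip install -U alpa` resolves. A quick sketch to print the installed versions when diagnosing this (`installed_version` is a hypothetical helper using only the stdlib):

```python
from importlib import metadata

def installed_version(pkg: str):
    """Return the installed version string for pkg, or None if it is absent."""
    try:
        return metadata.version(pkg)
    except metadata.PackageNotFoundError:
        return None

# alpa requires a matching jaxlib build; mismatched versions trigger the
# "incompatible function arguments" TypeError shown above.
for pkg in ("alpa", "jaxlib", "jax"):
    print(pkg, installed_version(pkg))
```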
Hi Jiahao! It would be great if you could add the issues and their corresponding solutions to this doc and make a PR. Really appreciate your help! |
Issue: the CUDA driver version is too old.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05 Driver Version: 450.51.05 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1B.0 Off | 0 |
| N/A 38C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 On | 00000000:00:1C.0 Off | 0 |
| N/A 38C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 On | 00000000:00:1D.0 Off | 0 |
| N/A 40C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 38C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Reference:
Solution: update the driver to a version that supports CUDA 11.2. |
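Whether a driver is new enough can be checked by comparing it against the minimum Linux driver listed for each CUDA toolkit. A sketch, where the `MIN_DRIVER` table is an assumption taken from NVIDIA's CUDA Toolkit release notes (verify the values for your exact toolkit):

```python
def parse_driver(version: str) -> tuple:
    """Parse an NVIDIA driver version like '450.51.05' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))

# Assumed minimum Linux driver per CUDA toolkit (from NVIDIA release notes;
# double-check against the official docs for your toolkit version):
MIN_DRIVER = {
    "11.0": "450.36.06",
    "11.1": "455.23.05",
    "11.2": "460.27.03",
    "11.3": "465.19.01",
}

def driver_supports(driver: str, cuda: str) -> bool:
    """True if the installed driver meets the assumed minimum for this CUDA version."""
    return parse_driver(driver) >= parse_driver(MIN_DRIVER[cuda])

print(driver_supports("450.51.05", "11.2"))  # False: 450.51.05 predates the 11.2 minimum
```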
Forgot to mention: the CUDA version on the Amazon EC2 instance is not the same as the one shown in e.g.
I am changing the
If CUDA is still messed up, one might also need to run this command: python -m cupyx.tools.install_library --cuda 11.3 --library nccl |
I see @zhuohan123, cool! Will do! |
The error: scipy fails with /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.30' not found.
The fix: reinstall scipy:
pip uninstall scipy
pip install scipy |
(tensorflow2_p39) ubuntu@ip-172-31-54-179:~$ strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX
GLIBCXX_3.4
GLIBCXX_3.4.1
GLIBCXX_3.4.2
GLIBCXX_3.4.3
GLIBCXX_3.4.4
GLIBCXX_3.4.5
GLIBCXX_3.4.6
GLIBCXX_3.4.7
GLIBCXX_3.4.8
GLIBCXX_3.4.9
GLIBCXX_3.4.10
GLIBCXX_3.4.11
GLIBCXX_3.4.12
GLIBCXX_3.4.13
GLIBCXX_3.4.14
GLIBCXX_3.4.15
GLIBCXX_3.4.16
GLIBCXX_3.4.17
GLIBCXX_3.4.18
GLIBCXX_3.4.19
GLIBCXX_3.4.20
GLIBCXX_3.4.21
GLIBCXX_3.4.22
GLIBCXX_3.4.23
GLIBCXX_3.4.24
GLIBCXX_3.4.25
GLIBCXX_3.4.26
GLIBCXX_3.4.27
GLIBCXX_3.4.28
GLIBCXX_3.4.29
GLIBCXX_DEBUG_MESSAGE_LENGTH |
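The `strings` dump above lists every GLIBCXX symbol version the system libstdc++ exports; the newest one here is 3.4.29, while this scipy build needs 3.4.30. A small sketch (`max_glibcxx` is a hypothetical helper) that extracts the newest version from such output:

```python
import re

def max_glibcxx(strings_output: str):
    """Return the highest GLIBCXX_x.y.z version found in `strings` output."""
    versions = [
        tuple(int(p) for p in m.group(1).split("."))
        for m in re.finditer(r"GLIBCXX_(\d+(?:\.\d+)*)\b", strings_output)
    ]
    return max(versions) if versions else None

sample = "GLIBCXX_3.4\nGLIBCXX_3.4.29\nGLIBCXX_DEBUG_MESSAGE_LENGTH\n"
print(max_glibcxx(sample))  # (3, 4, 29) -- short of the 3.4.30 this scipy needs
```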
(tensorflow2_p39) ubuntu@ip-172-31-54-179:~$ python
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import jax
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/scipy/__init__.py", line 154, in <module>
from . import fft
File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/scipy/fft/__init__.py", line 79, in <module>
from ._helper import next_fast_len
File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/scipy/fft/_helper.py", line 3, in <module>
from ._pocketfft import helper as _helper
File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/scipy/fft/_pocketfft/__init__.py", line 3, in <module>
from .basic import *
File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/scipy/fft/_pocketfft/basic.py", line 6, in <module>
from . import pypocketfft as pfft
ImportError: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/scipy/fft/_pocketfft/pypocketfft.cpython-39-x86_64-linux-gnu.so)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/jax/__init__.py", line 37, in <module>
from jax import config as _config_module
File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/jax/config.py", line 18, in <module>
from jax._src.config import config
File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/jax/_src/config.py", line 27, in <module>
from jax._src import lib
File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/jax/_src/lib/__init__.py", line 101, in <module>
import jaxlib.lapack as lapack
File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/jaxlib/lapack.py", line 24, in <module>
from . import _lapack
ImportError: initialization failed
>>> exit() |
conda install -c conda-forge gcc=12.1.0
pip uninstall scipy
pip install scipy==1.6 |
Another error:
(alpa) ubuntu@ip-172-31-54-179:~/anaconda3/envs/alpa/lib/python3.8/site-packages/ray$ python -c 'import alpa'
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/cupy/__init__.py", line 18, in <module>
from cupy import _core # NOQA
File "/home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/cupy/_core/__init__.py", line 1, in <module>
from cupy._core import core # NOQA
File "cupy/_core/core.pyx", line 1, in init cupy._core.core
File "/home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/cupy/cuda/__init__.py", line 8, in <module>
from cupy.cuda import compiler # NOQA
File "/home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/cupy/cuda/compiler.py", line 13, in <module>
from cupy.cuda import device
File "cupy/cuda/device.pyx", line 1, in init cupy.cuda.device
ImportError: /home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/cupy_backends/cuda/api/runtime.cpython-38-x86_64-linux-gnu.so: symbol cudaMemPoolSetAttribute version libcudart.so.11.0 not defined in file libcudart.so.11.0 with link time reference
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/alpa/__init__.py", line 3, in <module>
from alpa.api import (init, shutdown, parallelize, grad, value_and_grad,
File "/home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/alpa/api.py", line 13, in <module>
from alpa.device_mesh import (init_global_cluster, shutdown_global_cluster)
File "/home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/alpa/device_mesh.py", line 45, in <module>
import cupy
File "/home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/cupy/__init__.py", line 20, in <module>
raise ImportError(f'''
ImportError:
================================================================
Failed to import CuPy.
If you installed CuPy via wheels (cupy-cudaXXX or cupy-rocm-X-X), make sure that the package matches with the version of CUDA or ROCm installed.
On Linux, you may need to set LD_LIBRARY_PATH environment variable depending on how you installed CUDA/ROCm.
On Windows, try setting CUDA_PATH environment variable.
Check the Installation Guide for details:
https://docs.cupy.dev/en/latest/install.html
Original error:
ImportError: /home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/cupy_backends/cuda/api/runtime.cpython-38-x86_64-linux-gnu.so: symbol cudaMemPoolSetAttribute version libcudart.so.11.0 not defined in file libcudart.so.11.0 with link time reference
================================================================ my solution: LD_LIBRARY_PATH=/usr/local/cuda:$LD_LIBRARY_PATH
LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH Reference: |
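After adjusting LD_LIBRARY_PATH in the shell (it must be exported before the Python process starts), one rough way to check whether the loader can now resolve a library is `ctypes.util.find_library`. A sketch with the caveat that `find_library` on Linux consults ldconfig/gcc and does not honor LD_LIBRARY_PATH in every code path, so treat a negative result as a hint rather than proof:

```python
import ctypes.util

def loader_finds(libname: str) -> bool:
    """True if ctypes' library search can resolve the name (e.g. 'cudart')."""
    return ctypes.util.find_library(libname) is not None

# Run after exporting LD_LIBRARY_PATH in the shell, e.g.:
#   export LD_LIBRARY_PATH=/usr/local/cuda:$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
print(loader_finds("cudart"))
```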
In my case, I first set NCCL_DEBUG=WARN and found that some GPUs could not communicate with each other in the P2P setting. Then I used nvidia-smi topo -m to show the connection matrix between the GPUs and CPUs; it showed that only SYS and PIX are supported. Setting NCCL_P2P_LEVEL=PIX and NCCL_SHM_DISABLE=1 fixed it. |
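Those NCCL settings can also be applied from Python, as long as it happens before any NCCL communicator is created (i.e. before alpa/Ray initialization). A minimal sketch using the values from the comment above:

```python
import os

# Must run before the first NCCL communicator is created (e.g. before alpa.init()):
os.environ["NCCL_DEBUG"] = "WARN"        # surface NCCL warnings in worker logs
os.environ["NCCL_P2P_LEVEL"] = "PIX"     # allow P2P only across a single PCIe bridge
os.environ["NCCL_SHM_DISABLE"] = "1"     # disable the shared-memory transport

print(os.environ["NCCL_P2P_LEVEL"])
```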
Thanks @zhisbug and @merrymercy for helping and guiding me through the installation. Here are troubleshooting notes for each error message; I hope they are helpful to others.
Issue: cupy reports cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error
Trigger:
Error Message:
Solution: I might have messed up the cupy version when switching CUDA versions.
Error Message: llvmlite or Numba issue
Solution: Fixed by installing the latest version of alpa.
Trigger:
Error Message:
Solution: Fixed by removing the lock on Ubuntu: link. Then reinstall the package.