[Installation] Troubleshooting llvmlite and NCCL_ERROR_UNHANDLED_CUDA_ERROR
#496
Comments
@JiahaoYao thanks! You may consider contributing a section to this troubleshooting doc: https://alpa-projects.github.io/install.html#troubleshooting :) |
Sure, happy to work on that! |
Issue: installed and ran the alpa examples in Docker and hit the error below, even though the cupy communication works.
(base) /alpa% python3 tests/test_install.py
.2022-06-08 17:37:21,276 INFO packaging.py:323 -- Pushing file package 'gcs://_ray_pkg_a61b0f419e5596bad8db35aa15c8886e.zip' (66.42MiB) to Ray cluster...
2022-06-08 17:37:21,970 INFO packaging.py:332 -- Successfully pushed file package 'gcs://_ray_pkg_a61b0f419e5596bad8db35aa15c8886e.zip'.
E2022-06-08 17:37:34,340 ERROR worker.py:244 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::MeshHostWorker.init_p2p_communicator() (pid=18797, ip=172.31.6.91, repr=<alpa.device_mesh.MeshHostWorker object at 0x7efaa83b3370>)
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/device_mesh.py", line 474, in init_p2p_communicator
g.create_p2p_communicator(my_gpu_idx, peer_rank, peer_gpu_idx, nccl_uid)
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 749, in create_p2p_communicator
self._get_nccl_p2p_communicator(comm_key, my_gpu_idx, peer_rank,
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 620, in _get_nccl_p2p_communicator
comm = nccl_util.create_nccl_communicator(2, nccl_uid, my_p2p_rank)
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_util.py", line 108, in create_nccl_communicator
comm = NcclCommunicator(world_size, nccl_unique_id, rank)
File "cupy_backends/cuda/libs/nccl.pyx", line 283, in cupy_backends.cuda.libs.nccl.NcclCommunicator.__init__
File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error
======================================================================
ERROR: test_2_pipeline_parallel (__main__.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "tests/test_install.py", line 105, in test_2_pipeline_parallel
actual_state = p_train_step(state, batch)
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/api.py", line 100, in __call__
executable, _, out_tree, args_flat = self._decode_args_and_get_executable(*args)
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/api.py", line 168, in _decode_args_and_get_executable
executable = _compile_parallel_executable(
File "/home/ray/anaconda3/lib/python3.8/site-packages/jax/linear_util.py", line 272, in memoized_fun
ans = call(fun, *args)
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/api.py", line 194, in _compile_parallel_executable
return method.compile_executable(fun, in_tree, out_tree_thunk,
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/parallel_method.py", line 168, in compile_executable
return compile_pipeshard_executable(
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/pipeline_parallel/compile_executable.py", line 186, in compile_pipeshard_executable
executable = PipeshardDriverExecutable(
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/pipeline_parallel/pipeshard_executable.py", line 125, in __init__
task.create_resharding_communicators()
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 528, in create_resharding_communicators
ray.get(task_dones)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/worker.py", line 2012, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(NcclError): ray::MeshHostWorker.init_p2p_communicator() (pid=18796, ip=172.31.6.91, repr=<alpa.device_mesh.MeshHostWorker object at 0x7fe9bc3812e0>)
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/device_mesh.py", line 474, in init_p2p_communicator
g.create_p2p_communicator(my_gpu_idx, peer_rank, peer_gpu_idx, nccl_uid)
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 749, in create_p2p_communicator
self._get_nccl_p2p_communicator(comm_key, my_gpu_idx, peer_rank,
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 620, in _get_nccl_p2p_communicator
comm = nccl_util.create_nccl_communicator(2, nccl_uid, my_p2p_rank)
File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_util.py", line 108, in create_nccl_communicator
comm = NcclCommunicator(world_size, nccl_unique_id, rank)
File "cupy_backends/cuda/libs/nccl.pyx", line 283, in cupy_backends.cuda.libs.nccl.NcclCommunicator.__init__
File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error
----------------------------------------------------------------------
Ran 2 tests in 20.574s
FAILED (errors=1)
(base) ray:j/alpa% python3 -c "from cupy.cuda import nccl"
Solution: use Amazon EC2 / GCP instead. |
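The one-liner above can be wrapped in a slightly friendlier import check that prints the underlying error instead of a raw traceback. This is a sketch; `check_import` is a hypothetical helper, not part of alpa or cupy:

```python
import importlib

def check_import(module_name: str) -> bool:
    """Return True if the module imports cleanly; print the failure otherwise."""
    try:
        importlib.import_module(module_name)
        return True
    except Exception as exc:  # ImportError, cupy backend errors, etc.
        print(f"{module_name}: {type(exc).__name__}: {exc}")
        return False

# e.g. verify the cupy NCCL binding before launching Ray workers:
# check_import("cupy.cuda.nccl")
```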
Issue: Traceback (most recent call last):
File "alpa_mnist_example.py", line 208, in <module>
train_mnist()
File "alpa_mnist_example.py", line 183, in train_mnist
train_func({"train": train_dataset}, config)
File "alpa_mnist_example.py", line 158, in train_func
state, train_loss, train_accuracy = train_epoch(state, jax_dataset)
File "alpa_mnist_example.py", line 121, in train_epoch
state, loss, accuracy = train_step(state, batch_images, batch_labels)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/api.py", line 100, in __call__
executable, _, out_tree, args_flat = self._decode_args_and_get_executable(*args)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/api.py", line 168, in _decode_args_and_get_executable
executable = _compile_parallel_executable(
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/jax/linear_util.py", line 272, in memoized_fun
ans = call(fun, *args)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/api.py", line 194, in _compile_parallel_executable
return method.compile_executable(fun, in_tree, out_tree_thunk,
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/parallel_method.py", line 90, in compile_executable
return compile_shard_executable(fun, in_tree, out_tree_thunk,
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/shard_parallel/compile_executable.py", line 72, in compile_shard_executable
return shard_parallel_internal(
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/shard_parallel/compile_executable.py", line 114, in shard_parallel_internal
hlo_module, strategy_config = run_auto_sharding_pass(
File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/shard_parallel/auto_sharding.py", line 309, in run_auto_sharding_pass
compiled_module = xe.run_auto_sharding(xla_computation, compile_options)
TypeError: run_auto_sharding(): incompatible function arguments. The following argument types are supported:
1. (hlo_module: jaxlib.xla_extension.HloModule, compile_options: jaxlib.xla_extension.CompileOptions = <jaxlib.xla_extension.CompileOptions object at 0x7f98f2387670>) -> Status
Invoked with: <jaxlib.xla_extension.XlaComputation object at 0x7f96a004c330>, <jaxlib.xla_extension.CompileOptions object at 0x7f96a7f296f0>
Solution: pip install -U alpa |
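The `run_auto_sharding(): incompatible function arguments` TypeError typically comes from mismatched alpa/jaxlib builds, which `pip install -U alpa` resolves. A quick sketch to print the installed versions when diagnosing this (`installed_version` is a hypothetical helper using only the stdlib):

```python
from importlib import metadata

def installed_version(pkg: str):
    """Return the installed version string for pkg, or None if it is absent."""
    try:
        return metadata.version(pkg)
    except metadata.PackageNotFoundError:
        return None

# alpa requires a matching jaxlib build; mismatched versions trigger the
# "incompatible function arguments" TypeError shown above.
for pkg in ("alpa", "jaxlib", "jax"):
    print(pkg, installed_version(pkg))
```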
Hi Jiahao! It would be great if you could add the issues and their corresponding solutions to this doc and make a PR. Really appreciate your help! |
Issue: the CUDA driver version is too old.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05 Driver Version: 450.51.05 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1B.0 Off | 0 |
| N/A 38C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 On | 00000000:00:1C.0 Off | 0 |
| N/A 38C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 On | 00000000:00:1D.0 Off | 0 |
| N/A 40C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 38C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Reference:
Solution: update the driver to a version that supports CUDA 11.2. |
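Whether a driver is new enough can be checked by comparing it against the minimum Linux driver listed for each CUDA toolkit. A sketch, where the `MIN_DRIVER` table is an assumption taken from NVIDIA's CUDA Toolkit release notes (verify the values for your exact toolkit):

```python
def parse_driver(version: str) -> tuple:
    """Parse an NVIDIA driver version like '450.51.05' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))

# Assumed minimum Linux driver per CUDA toolkit (from NVIDIA release notes;
# double-check against the official docs for your toolkit version):
MIN_DRIVER = {
    "11.0": "450.36.06",
    "11.1": "455.23.05",
    "11.2": "460.27.03",
    "11.3": "465.19.01",
}

def driver_supports(driver: str, cuda: str) -> bool:
    """True if the installed driver meets the assumed minimum for this CUDA version."""
    return parse_driver(driver) >= parse_driver(MIN_DRIVER[cuda])

print(driver_supports("450.51.05", "11.2"))  # False: 450.51.05 predates the 11.2 minimum
```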
Forgot to mention: the CUDA version on the Amazon EC2 instance is not the same as the one shown in e.g.
I am changing the
If CUDA is still messed up, one might also need to run this command: python -m cupyx.tools.install_library --cuda 11.3 --library nccl |
I see @zhuohan123, cool! Will do! |
The error: scipy fails with /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.30' not found.
The fix: reinstall scipy:
pip uninstall scipy
pip install scipy |
(tensorflow2_p39) ubuntu@ip-172-31-54-179:~$ strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX
GLIBCXX_3.4
GLIBCXX_3.4.1
GLIBCXX_3.4.2
GLIBCXX_3.4.3
GLIBCXX_3.4.4
GLIBCXX_3.4.5
GLIBCXX_3.4.6
GLIBCXX_3.4.7
GLIBCXX_3.4.8
GLIBCXX_3.4.9
GLIBCXX_3.4.10
GLIBCXX_3.4.11
GLIBCXX_3.4.12
GLIBCXX_3.4.13
GLIBCXX_3.4.14
GLIBCXX_3.4.15
GLIBCXX_3.4.16
GLIBCXX_3.4.17
GLIBCXX_3.4.18
GLIBCXX_3.4.19
GLIBCXX_3.4.20
GLIBCXX_3.4.21
GLIBCXX_3.4.22
GLIBCXX_3.4.23
GLIBCXX_3.4.24
GLIBCXX_3.4.25
GLIBCXX_3.4.26
GLIBCXX_3.4.27
GLIBCXX_3.4.28
GLIBCXX_3.4.29
GLIBCXX_DEBUG_MESSAGE_LENGTH |
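The `strings` dump above lists every GLIBCXX symbol version the system libstdc++ exports; the newest one here is 3.4.29, while this scipy build needs 3.4.30. A small sketch (`max_glibcxx` is a hypothetical helper) that extracts the newest version from such output:

```python
import re

def max_glibcxx(strings_output: str):
    """Return the highest GLIBCXX_x.y.z version found in `strings` output."""
    versions = [
        tuple(int(p) for p in m.group(1).split("."))
        for m in re.finditer(r"GLIBCXX_(\d+(?:\.\d+)*)\b", strings_output)
    ]
    return max(versions) if versions else None

sample = "GLIBCXX_3.4\nGLIBCXX_3.4.29\nGLIBCXX_DEBUG_MESSAGE_LENGTH\n"
print(max_glibcxx(sample))  # (3, 4, 29) -- short of the 3.4.30 this scipy needs
```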
(tensorflow2_p39) ubuntu@ip-172-31-54-179:~$ python
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import jax
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/scipy/__init__.py", line 154, in <module>
from . import fft
File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/scipy/fft/__init__.py", line 79, in <module>
from ._helper import next_fast_len
File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/scipy/fft/_helper.py", line 3, in <module>
from ._pocketfft import helper as _helper
File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/scipy/fft/_pocketfft/__init__.py", line 3, in <module>
from .basic import *
File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/scipy/fft/_pocketfft/basic.py", line 6, in <module>
from . import pypocketfft as pfft
ImportError: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/scipy/fft/_pocketfft/pypocketfft.cpython-39-x86_64-linux-gnu.so)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/jax/__init__.py", line 37, in <module>
from jax import config as _config_module
File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/jax/config.py", line 18, in <module>
from jax._src.config import config
File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/jax/_src/config.py", line 27, in <module>
from jax._src import lib
File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/jax/_src/lib/__init__.py", line 101, in <module>
import jaxlib.lapack as lapack
File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/jaxlib/lapack.py", line 24, in <module>
from . import _lapack
ImportError: initialization failed
>>> exit() |
conda install -c conda-forge gcc=12.1.0
pip uninstall scipy
pip install scipy==1.6 |
Another error:
(alpa) ubuntu@ip-172-31-54-179:~/anaconda3/envs/alpa/lib/python3.8/site-packages/ray$ python -c 'import alpa'
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/cupy/__init__.py", line 18, in <module>
from cupy import _core # NOQA
File "/home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/cupy/_core/__init__.py", line 1, in <module>
from cupy._core import core # NOQA
File "cupy/_core/core.pyx", line 1, in init cupy._core.core
File "/home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/cupy/cuda/__init__.py", line 8, in <module>
from cupy.cuda import compiler # NOQA
File "/home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/cupy/cuda/compiler.py", line 13, in <module>
from cupy.cuda import device
File "cupy/cuda/device.pyx", line 1, in init cupy.cuda.device
ImportError: /home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/cupy_backends/cuda/api/runtime.cpython-38-x86_64-linux-gnu.so: symbol cudaMemPoolSetAttribute version libcudart.so.11.0 not defined in file libcudart.so.11.0 with link time reference
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/alpa/__init__.py", line 3, in <module>
from alpa.api import (init, shutdown, parallelize, grad, value_and_grad,
File "/home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/alpa/api.py", line 13, in <module>
from alpa.device_mesh import (init_global_cluster, shutdown_global_cluster)
File "/home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/alpa/device_mesh.py", line 45, in <module>
import cupy
File "/home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/cupy/__init__.py", line 20, in <module>
raise ImportError(f'''
ImportError:
================================================================
Failed to import CuPy.
If you installed CuPy via wheels (cupy-cudaXXX or cupy-rocm-X-X), make sure that the package matches with the version of CUDA or ROCm installed.
On Linux, you may need to set LD_LIBRARY_PATH environment variable depending on how you installed CUDA/ROCm.
On Windows, try setting CUDA_PATH environment variable.
Check the Installation Guide for details:
https://docs.cupy.dev/en/latest/install.html
Original error:
ImportError: /home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/cupy_backends/cuda/api/runtime.cpython-38-x86_64-linux-gnu.so: symbol cudaMemPoolSetAttribute version libcudart.so.11.0 not defined in file libcudart.so.11.0 with link time reference
================================================================ my solution: LD_LIBRARY_PATH=/usr/local/cuda:$LD_LIBRARY_PATH
LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH Reference: |
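After adjusting LD_LIBRARY_PATH in the shell (it must be exported before the Python process starts), one rough way to check whether the loader can now resolve a library is `ctypes.util.find_library`. A sketch with the caveat that `find_library` on Linux consults ldconfig/gcc and does not honor LD_LIBRARY_PATH in every code path, so treat a negative result as a hint rather than proof:

```python
import ctypes.util

def loader_finds(libname: str) -> bool:
    """True if ctypes' library search can resolve the name (e.g. 'cudart')."""
    return ctypes.util.find_library(libname) is not None

# Run after exporting LD_LIBRARY_PATH in the shell, e.g.:
#   export LD_LIBRARY_PATH=/usr/local/cuda:$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
print(loader_finds("cudart"))
```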
In my case, I first set NCCL_DEBUG=WARN and found that some GPUs could not communicate with each other in the P2P setting. Then I used nvidia-smi topo -m to show the connection matrix between the GPUs and CPUs; it showed that only SYS and PIX are supported. Setting NCCL_P2P_LEVEL=PIX and NCCL_SHM_DISABLE=1 fixed it. |
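Those NCCL settings can also be applied from Python, as long as it happens before any NCCL communicator is created (i.e. before alpa/Ray initialization). A minimal sketch using the values from the comment above:

```python
import os

# Must run before the first NCCL communicator is created (e.g. before alpa.init()):
os.environ["NCCL_DEBUG"] = "WARN"        # surface NCCL warnings in worker logs
os.environ["NCCL_P2P_LEVEL"] = "PIX"     # allow P2P only across a single PCIe bridge
os.environ["NCCL_SHM_DISABLE"] = "1"     # disable the shared-memory transport

print(os.environ["NCCL_P2P_LEVEL"])
```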
Thanks @zhisbug and @merrymercy for helping and guiding me through the installation. Here are troubleshooting notes for each error message; I hope they are helpful to others.
Issue: cupy reports cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error
Trigger:
Error Message:
Solution: I might have messed up the cupy version when switching CUDA versions.
Error Message: llvmlite or Numba issue
Solution: Fixed by installing the latest version of alpa.
Trigger:
Error Message:
Solution: Fixed by removing the lock on Ubuntu: link. Then reinstall the package.