[Installation] Troubleshooting llvmlite and NCCL_ERROR_UNHANDLED_CUDA_ERROR #496

Closed

JiahaoYao opened this issue Jun 9, 2022 · 15 comments

@JiahaoYao
Contributor

JiahaoYao commented Jun 9, 2022

Thanks @zhisbug and @merrymercy for helping and guiding me through the installation. Here are troubleshooting notes for several error messages; I hope they can be helpful to others.

Issue: cupy reports cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error
Trigger:

(tensorflow2_p38) ubuntu@ip-10-0-0-171:~/alpa/benchmark/cupy$ python profile_communication.py

Error Message:

Traceback (most recent call last):
  File "profile_communication.py", line 261, in <module>
    ray.get([w.profile.remote() for w in workers])
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/ray/worker.py", line 1843, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(NcclError): ray::GpuHost.profile() (pid=5622, ip=10.0.0.171, repr=<profile_communication.GpuHost object at 0x7f3c56d05ac0>)
  File "profile_communication.py", line 199, in profile
    self.profile_allreduce(1 << i, cp.float32, [list(range(self.world_size))])
  File "profile_communication.py", line 80, in profile_allreduce
    comm = self.init_communicator(groups)
  File "profile_communication.py", line 73, in init_communicator
    comm = cp.cuda.nccl.NcclCommunicator(
  File "cupy_backends/cuda/libs/nccl.pyx", line 283, in cupy_backends.cuda.libs.nccl.NcclCommunicator.__init__
  File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error

Solution:

python -m cupyx.tools.install_library --cuda 11.3 --library nccl

I might have messed up the cupy version when switching CUDA versions.
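
A quick way to confirm that the CuPy NCCL binding loads after installing the library (a minimal sketch; get_build_version is assumed to be available in your CuPy release):

python3 -c "from cupy.cuda import nccl; print(nccl.get_build_version())"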

Error Message:

numba OSError: Could not load shared object file: libllvmlite.so

or the equivalent Numba traceback:

Traceback (most recent call last):
  File "tests/test_install.py", line 13, in <module>
    from alpa import (init, parallelize, grad, ShardParallel,
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/__init__.py", line 1, in <module>
    from alpa.api import (init, shutdown, parallelize, grad, value_and_grad,
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/api.py", line 15, in <module>
    from alpa.parallel_method import ParallelMethod, ShardParallel
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/parallel_method.py", line 26, in <module>
    from alpa.pipeline_parallel.compile_executable import compile_pipeshard_executable
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/pipeline_parallel/compile_executable.py", line 28, in <module>
    from alpa.pipeline_parallel.stage_construction import (
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/pipeline_parallel/stage_construction.py", line 10, in <module>
    import numba
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/numba/__init__.py", line 19, in <module>
    from numba.core import config
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/numba/core/config.py", line 16, in <module>
    import llvmlite.binding as ll
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/llvmlite/binding/__init__.py", line 4, in <module>
    from .dylib import *
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/llvmlite/binding/dylib.py", line 3, in <module>
    from llvmlite.binding import ffi
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/llvmlite/binding/ffi.py", line 191, in <module>
    raise OSError("Could not load shared object file: {}".format(_lib_name))

Solution:
Fixed by installing the latest version of alpa

pip uninstall alpa
pip install pip --upgrade
pip install alpa --upgrade
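
After upgrading, a quick import check can confirm that llvmlite and numba now load (a minimal sketch):

python -c "import llvmlite.binding; import numba; print(numba.__version__)"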

Trigger:

sudo apt install coinor-cbc

Error Message:

[E: Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)](https://askubuntu.com/questions/1109982/e-could-not-get-lock-var-lib-dpkg-lock-frontend-open-11-resource-temporari)

Solution:
Fixed by removing the locks on Ubuntu (see the askubuntu link above):

sudo rm /var/lib/apt/lists/lock
sudo rm /var/cache/apt/archives/lock
sudo rm /var/lib/dpkg/lock*

sudo dpkg --configure -a
sudo apt update

And then reinstall the package.
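
Before deleting the lock files, it may be safer to first check whether another apt/dpkg process is still holding them (a sketch using standard Ubuntu tooling):

sudo lsof /var/lib/dpkg/lock-frontend   # show the process holding the lock, if any
ps aux | grep -E 'apt|dpkg'             # look for a running apt/dpkg process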

  • still wip
@zhisbug
Member

zhisbug commented Jun 9, 2022

@JiahaoYao thanks! You may consider contributing a section to this troubleshoot doc: https://alpa-projects.github.io/install.html#troubleshooting :)

@JiahaoYao
Contributor Author

Sure, happy to work on that!

@JiahaoYao
Contributor Author

JiahaoYao commented Jun 13, 2022

Issue: installing and running the alpa examples in Docker gives the error below, even though the standalone cupy communication test works.

(base) /alpa% python3 tests/test_install.py

.2022-06-08 17:37:21,276        INFO packaging.py:323 -- Pushing file package 'gcs://_ray_pkg_a61b0f419e5596bad8db35aa15c8886e.zip' (66.42MiB) to Ray cluster...
2022-06-08 17:37:21,970 INFO packaging.py:332 -- Successfully pushed file package 'gcs://_ray_pkg_a61b0f419e5596bad8db35aa15c8886e.zip'.
E2022-06-08 17:37:34,340        ERROR worker.py:244 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::MeshHostWorker.init_p2p_communicator() (pid=18797, ip=172.31.6.91, repr=<alpa.device_mesh.MeshHostWorker object at 0x7efaa83b3370>)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/device_mesh.py", line 474, in init_p2p_communicator
    g.create_p2p_communicator(my_gpu_idx, peer_rank, peer_gpu_idx, nccl_uid)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 749, in create_p2p_communicator
    self._get_nccl_p2p_communicator(comm_key, my_gpu_idx, peer_rank,
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 620, in _get_nccl_p2p_communicator
    comm = nccl_util.create_nccl_communicator(2, nccl_uid, my_p2p_rank)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_util.py", line 108, in create_nccl_communicator
    comm = NcclCommunicator(world_size, nccl_unique_id, rank)
  File "cupy_backends/cuda/libs/nccl.pyx", line 283, in cupy_backends.cuda.libs.nccl.NcclCommunicator.__init__
  File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error

======================================================================
ERROR: test_2_pipeline_parallel (__main__.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests/test_install.py", line 105, in test_2_pipeline_parallel
    actual_state = p_train_step(state, batch)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/api.py", line 100, in __call__
    executable, _, out_tree, args_flat = self._decode_args_and_get_executable(*args)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/api.py", line 168, in _decode_args_and_get_executable
    executable = _compile_parallel_executable(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/jax/linear_util.py", line 272, in memoized_fun
    ans = call(fun, *args)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/api.py", line 194, in _compile_parallel_executable
    return method.compile_executable(fun, in_tree, out_tree_thunk,
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/parallel_method.py", line 168, in compile_executable
    return compile_pipeshard_executable(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/pipeline_parallel/compile_executable.py", line 186, in compile_pipeshard_executable
    executable = PipeshardDriverExecutable(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/pipeline_parallel/pipeshard_executable.py", line 125, in __init__
    task.create_resharding_communicators()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 528, in create_resharding_communicators
    ray.get(task_dones)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/worker.py", line 2012, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(NcclError): ray::MeshHostWorker.init_p2p_communicator() (pid=18796, ip=172.31.6.91, repr=<alpa.device_mesh.MeshHostWorker object at 0x7fe9bc3812e0>)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/device_mesh.py", line 474, in init_p2p_communicator
    g.create_p2p_communicator(my_gpu_idx, peer_rank, peer_gpu_idx, nccl_uid)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 749, in create_p2p_communicator
    self._get_nccl_p2p_communicator(comm_key, my_gpu_idx, peer_rank,
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 620, in _get_nccl_p2p_communicator
    comm = nccl_util.create_nccl_communicator(2, nccl_uid, my_p2p_rank)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_util.py", line 108, in create_nccl_communicator
    comm = NcclCommunicator(world_size, nccl_unique_id, rank)
  File "cupy_backends/cuda/libs/nccl.pyx", line 283, in cupy_backends.cuda.libs.nccl.NcclCommunicator.__init__
  File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error

----------------------------------------------------------------------
Ran 2 tests in 20.574s

FAILED (errors=1)
(base) ray:j/alpa% python3 -c "from cupy.cuda import nccl"                  

Solution: use an Amazon EC2 / GCP instance instead.

@JiahaoYao
Contributor Author

Issue:

Traceback (most recent call last):
  File "alpa_mnist_example.py", line 208, in <module>
    train_mnist()
  File "alpa_mnist_example.py", line 183, in train_mnist
    train_func({"train": train_dataset}, config)
  File "alpa_mnist_example.py", line 158, in train_func
    state, train_loss, train_accuracy = train_epoch(state, jax_dataset)
  File "alpa_mnist_example.py", line 121, in train_epoch
    state, loss, accuracy = train_step(state, batch_images, batch_labels)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/api.py", line 100, in __call__
    executable, _, out_tree, args_flat = self._decode_args_and_get_executable(*args)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/api.py", line 168, in _decode_args_and_get_executable
    executable = _compile_parallel_executable(
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/jax/linear_util.py", line 272, in memoized_fun
    ans = call(fun, *args)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/api.py", line 194, in _compile_parallel_executable
    return method.compile_executable(fun, in_tree, out_tree_thunk,
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/parallel_method.py", line 90, in compile_executable
    return compile_shard_executable(fun, in_tree, out_tree_thunk,
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/shard_parallel/compile_executable.py", line 72, in compile_shard_executable
    return shard_parallel_internal(
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/shard_parallel/compile_executable.py", line 114, in shard_parallel_internal
    hlo_module, strategy_config = run_auto_sharding_pass(
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/alpa/shard_parallel/auto_sharding.py", line 309, in run_auto_sharding_pass
    compiled_module = xe.run_auto_sharding(xla_computation, compile_options)
TypeError: run_auto_sharding(): incompatible function arguments. The following argument types are supported:
    1. (hlo_module: jaxlib.xla_extension.HloModule, compile_options: jaxlib.xla_extension.CompileOptions = <jaxlib.xla_extension.CompileOptions object at 0x7f98f2387670>) -> Status

Invoked with: <jaxlib.xla_extension.XlaComputation object at 0x7f96a004c330>, <jaxlib.xla_extension.CompileOptions object at 0x7f96a7f296f0>

Solution:
The alpa version is stale; update the alpa package:

pip install -U alpa
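
A quick way to confirm which alpa version ends up installed (a minimal sketch using standard pip tooling):

pip show alpa | grep -i version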

@zhuohan123
Member

Hi Jiahao! It would be great if you can add the issues and their corresponding solutions to this doc and make a PR. Really appreciate your help!

@JiahaoYao
Contributor Author

Issue: the CUDA driver version is too old.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1B.0 Off |                    0 |
| N/A   38C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:1C.0 Off |                    0 |
| N/A   38C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:00:1D.0 Off |                    0 |
| N/A   40C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   38C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Reference:
CUDA Compatibility :: NVIDIA Data Center GPU Driver Documentation


Solution: update the driver for CUDA 11.2 (see the compatibility reference above).
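
On Ubuntu, a hedged sketch of updating the driver (the exact driver branch to install depends on the CUDA compatibility table referenced above; nvidia-driver-470 is only an example):

sudo ubuntu-drivers devices            # list the drivers recommended for this GPU
sudo apt install nvidia-driver-470     # example branch; 470 supports CUDA 11.x up to 11.4
sudo reboot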

@JiahaoYao
Contributor Author

(Same Docker NcclError issue and solution as in my earlier comment above.)

Forgot to mention: the CUDA toolkit version on the Amazon EC2 instance is not necessarily the same as the one shown by nvidia-smi. One might need to manually create the symbolic link for CUDA.

e.g.

(tensorflow2_p38) ubuntu@ip-10-0-0-171:~/alpa/tests$ ls /usr/local/
bin/       cuda/      cuda-11.0/ cuda-11.1/ cuda-11.2/ cuda-11.3/ cuda-11.4/ cuda-11.5/ cuda-11.6/ etc/       games/     include/   lib/       man/       sbin/      share/     src/

I am switching CUDA to cuda-11.3 by using the symbolic link:

(tensorflow2_p38) ubuntu@ip-10-0-0-171:~/alpa/tests$ ls /usr/local/cuda -al
lrwxrwxrwx 1 root root 20 Jun  9 03:53 /usr/local/cuda -> /usr/local/cuda-11.3
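
A minimal sketch of creating that symlink, assuming the CUDA 11.3 toolkit is already installed under /usr/local/cuda-11.3:

sudo ln -sfn /usr/local/cuda-11.3 /usr/local/cuda
ls -al /usr/local/cuda   # should now point to /usr/local/cuda-11.3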

If CUDA is still messed up, one might also need to run this command:

python -m cupyx.tools.install_library --cuda 11.3 --library nccl

@JiahaoYao
Contributor Author

Hi Jiahao! It would be great if you can add the issues and their corresponding solutions to this doc and make a PR. Really appreciate your help!

I see @zhuohan123, cool! will do!

@JiahaoYao
Contributor Author

The error:

scipy fails to import: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.30' not found

The fix: reinstall scipy

pip uninstall scipy 
pip install scipy
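
A quick check that the reinstalled scipy imports cleanly (a minimal sketch):

python -c "import scipy, scipy.fft; print(scipy.__version__)"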

@JiahaoYao
Contributor Author

For reference, the system libstdc++ only provides symbols up to GLIBCXX_3.4.29, which is why `GLIBCXX_3.4.30' cannot be found:

(tensorflow2_p39) ubuntu@ip-172-31-54-179:~$ strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX
GLIBCXX_3.4
GLIBCXX_3.4.1
GLIBCXX_3.4.2
GLIBCXX_3.4.3
GLIBCXX_3.4.4
GLIBCXX_3.4.5
GLIBCXX_3.4.6
GLIBCXX_3.4.7
GLIBCXX_3.4.8
GLIBCXX_3.4.9
GLIBCXX_3.4.10
GLIBCXX_3.4.11
GLIBCXX_3.4.12
GLIBCXX_3.4.13
GLIBCXX_3.4.14
GLIBCXX_3.4.15
GLIBCXX_3.4.16
GLIBCXX_3.4.17
GLIBCXX_3.4.18
GLIBCXX_3.4.19
GLIBCXX_3.4.20
GLIBCXX_3.4.21
GLIBCXX_3.4.22
GLIBCXX_3.4.23
GLIBCXX_3.4.24
GLIBCXX_3.4.25
GLIBCXX_3.4.26
GLIBCXX_3.4.27
GLIBCXX_3.4.28
GLIBCXX_3.4.29
GLIBCXX_DEBUG_MESSAGE_LENGTH

@JiahaoYao
Contributor Author

(tensorflow2_p39) ubuntu@ip-172-31-54-179:~$ python
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import jax
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/scipy/__init__.py", line 154, in <module>
    from . import fft
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/scipy/fft/__init__.py", line 79, in <module>
    from ._helper import next_fast_len
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/scipy/fft/_helper.py", line 3, in <module>
    from ._pocketfft import helper as _helper
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/scipy/fft/_pocketfft/__init__.py", line 3, in <module>
    from .basic import *
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/scipy/fft/_pocketfft/basic.py", line 6, in <module>
    from . import pypocketfft as pfft
ImportError: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/scipy/fft/_pocketfft/pypocketfft.cpython-39-x86_64-linux-gnu.so)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/jax/__init__.py", line 37, in <module>
    from jax import config as _config_module
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/jax/config.py", line 18, in <module>
    from jax._src.config import config
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/jax/_src/config.py", line 27, in <module>
    from jax._src import lib
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/jax/_src/lib/__init__.py", line 101, in <module>
    import jaxlib.lapack as lapack
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p39/lib/python3.9/site-packages/jaxlib/lapack.py", line 24, in <module>
    from . import _lapack
ImportError: initialization failed
>>> exit()

@JiahaoYao
Contributor Author

My fix, from https://stackoverflow.com/questions/72540359/glibcxx-3-4-30-not-found-for-librosa-in-conda-virtual-environment-after-tryin:

conda install -c conda-forge gcc=12.1.0
pip uninstall scipy 
pip install scipy==1.6
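
After this fix, one can verify that the conda environment's libstdc++ provides the missing symbol and that the imports work (a sketch, assuming the environment is active so $CONDA_PREFIX is set):

strings $CONDA_PREFIX/lib/libstdc++.so.6 | grep GLIBCXX_3.4.30
python -c "import scipy.fft, jax; print('ok')"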

@JiahaoYao
Contributor Author

JiahaoYao commented Jul 18, 2022

Another error:

(alpa) ubuntu@ip-172-31-54-179:~/anaconda3/envs/alpa/lib/python3.8/site-packages/ray$ python -c 'import alpa'
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/cupy/__init__.py", line 18, in <module>
    from cupy import _core  # NOQA
  File "/home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/cupy/_core/__init__.py", line 1, in <module>
    from cupy._core import core  # NOQA
  File "cupy/_core/core.pyx", line 1, in init cupy._core.core
  File "/home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/cupy/cuda/__init__.py", line 8, in <module>
    from cupy.cuda import compiler  # NOQA
  File "/home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/cupy/cuda/compiler.py", line 13, in <module>
    from cupy.cuda import device
  File "cupy/cuda/device.pyx", line 1, in init cupy.cuda.device
ImportError: /home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/cupy_backends/cuda/api/runtime.cpython-38-x86_64-linux-gnu.so: symbol cudaMemPoolSetAttribute version libcudart.so.11.0 not defined in file libcudart.so.11.0 with link time reference

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/alpa/__init__.py", line 3, in <module>
    from alpa.api import (init, shutdown, parallelize, grad, value_and_grad,
  File "/home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/alpa/api.py", line 13, in <module>
    from alpa.device_mesh import (init_global_cluster, shutdown_global_cluster)
  File "/home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/alpa/device_mesh.py", line 45, in <module>
    import cupy
  File "/home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/cupy/__init__.py", line 20, in <module>
    raise ImportError(f'''
ImportError: 
================================================================
Failed to import CuPy.

If you installed CuPy via wheels (cupy-cudaXXX or cupy-rocm-X-X), make sure that the package matches with the version of CUDA or ROCm installed.

On Linux, you may need to set LD_LIBRARY_PATH environment variable depending on how you installed CUDA/ROCm.
On Windows, try setting CUDA_PATH environment variable.

Check the Installation Guide for details:
  https://docs.cupy.dev/en/latest/install.html

Original error:
  ImportError: /home/ubuntu/anaconda3/envs/alpa/lib/python3.8/site-packages/cupy_backends/cuda/api/runtime.cpython-38-x86_64-linux-gnu.so: symbol cudaMemPoolSetAttribute version libcudart.so.11.0 not defined in file libcudart.so.11.0 with link time reference
================================================================

My solution: prepend the CUDA and conda library paths to LD_LIBRARY_PATH (note the export, so that child processes such as Ray workers also see it):

export LD_LIBRARY_PATH=/usr/local/cuda:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
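
To make these paths persistent for the conda environment, one option is conda's standard activate.d hook (a sketch; the file name cuda_paths.sh is just an example):

mkdir -p $CONDA_PREFIX/etc/conda/activate.d
cat > $CONDA_PREFIX/etc/conda/activate.d/cuda_paths.sh <<'EOF'
export LD_LIBRARY_PATH=/usr/local/cuda:$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
EOF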


@dumpmemory
Contributor

(Quoting the troubleshooting notes from the original post above.)

In my case, I first used NCCL_DEBUG=WARN and found that some GPUs could not communicate with each other in the P2P setting. Then I used nvidia-smi topo -m to show the connection matrix between the GPUs and the CPUs; it showed that only SYS and PIX are supported. Finally I set NCCL_P2P_LEVEL=PIX and NCCL_SHM_DISABLE=1, and it is fixed now.
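
A rough sketch of the debugging steps described above (these are standard NCCL environment variables):

export NCCL_DEBUG=WARN         # surface NCCL warnings about failed P2P setup
nvidia-smi topo -m             # inspect the GPU/CPU connection matrix
export NCCL_P2P_LEVEL=PIX      # allow P2P only across a single PCIe bridge
export NCCL_SHM_DISABLE=1      # disable the shared-memory transport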
