This repository was archived by the owner on Oct 19, 2024. It is now read-only.

[BUG]Backend 'gpu' failed to initialize: FAILED_PRECONDITION: No visible GPU devices. #452

@zyc-bit

Description

Hi,
I'm not sure whether this should be filed as a bug, but it has been bothering me for a long time and I haven't been able to fix it.

I'm working on a cluster. Ray sees my GPU, but Alpa does not. I followed the installation documentation (Install Alpa) and confirmed that I passed --enable_cuda when compiling jax-alpa. Running tests/test_install.py reports errors; the full error log is attached below.

System information and environment

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04, docker):
  • Python version: 3.7
  • CUDA version: 11.2
  • NCCL version: 2.9.8
  • cupy version: 112
  • GPU model and memory: A100 80G
  • Alpa version: 0.0.0 (conda list and pip list show the Alpa version is 0.0.0)
  • TensorFlow version:
  • JAX version: 0.3.5

To Reproduce
I ran

RAY_ADDRESS="10.140.1.112:6379" XLA_FLAGS="--xla_gpu_cuda_data_dir=/mnt/cache/share/cuda-11.2" srun -p caif_dev --gres=gpu:1 -w SH-IDC1-10-140-1-112 -n1 bash test_install.sh

and my test_install.sh is:

echo "ray being starting"
ray start --head --node-ip-address 10.140.0.112 --address='10.140.0.112:6379'
echo "succeed==========="
ray status
echo "now running python script"
XLA_FLAGS="--xla_gpu_cuda_data_dir=/mnt/cache/share/cuda-11.2" python /mnt/cache/zhangyuchang/750/alpa/tests/test_install.py
ray stop
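
To record what the job step itself can see, a small check could be run inside the same srun step, just before the Python test. This is only a sketch: `describe_visible_gpus` is a hypothetical helper name, and it assumes the scheduler or Ray communicates the GPU allocation through `CUDA_VISIBLE_DEVICES`, which may not hold on every cluster.

```python
import os
from typing import Mapping, Optional


def describe_visible_gpus(env: Optional[Mapping[str, str]] = None) -> str:
    """Summarize CUDA_VISIBLE_DEVICES as seen by the current process.

    Hypothetical diagnostic helper; it only inspects the environment and
    does not talk to the CUDA driver.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        # With the variable unset, CUDA does not filter the device list.
        return "CUDA_VISIBLE_DEVICES is unset: CUDA does not filter devices"
    devices = [d.strip() for d in value.split(",") if d.strip()]
    if not devices:
        # An empty value hides every GPU from CUDA applications.
        return "CUDA_VISIBLE_DEVICES is empty: CUDA hides every device"
    return f"CUDA_VISIBLE_DEVICES exposes {len(devices)} device(s): {devices}"


if __name__ == "__main__":
    print(describe_visible_gpus())
```

Comparing its output on the login node and inside the srun step shows whether the two environments differ.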

Log

phoenix-srun: Job 126515 scheduled successfully!
Current QUOTA_TYPE is [reserved], which means the job has occupied quota in RESERVED_TOTAL under your partition.
Current PHX_PRIORITY is normal

ray being starting
Usage stats collection will be enabled by default in the next release. See https://github.com/ray-project/ray/issues/20857 for more details.
2022-05-10 19:50:46,487 ERROR services.py:1474 -- Failed to start the dashboard: Failed to start the dashboard
 The last 10 lines of /tmp/ray/session_2022-05-10_19-50-20_209620_45113/logs/dashboard.log:
2022-05-10 19:50:38,947 INFO utils.py:99 -- Get all modules by type: DashboardHeadModule

2022-05-10 19:50:46,487 ERROR services.py:1475 -- Failed to start the dashboard
 The last 10 lines of /tmp/ray/session_2022-05-10_19-50-20_209620_45113/logs/dashboard.log:
2022-05-10 19:50:38,947 INFO utils.py:99 -- Get all modules by type: DashboardHeadModule
Traceback (most recent call last):
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/ray/_private/services.py", line 1451, in start_dashboard
    raise Exception(err_msg + last_log_str)
Exception: Failed to start the dashboard
 The last 10 lines of /tmp/ray/session_2022-05-10_19-50-20_209620_45113/logs/dashboard.log:
2022-05-10 19:50:38,947 INFO utils.py:99 -- Get all modules by type: DashboardHeadModule

2022-05-10 19:50:20,155 WARN scripts.py:652 -- Specifying --address for external Redis address is deprecated. Please specify environment variable RAY_REDIS_ADDRESS=10.140.1.112:6379 instead.
2022-05-10 19:50:20,155 INFO scripts.py:659 -- Will use `10.140.1.112:6379` as external Redis server address(es). If the primary one is not reachable, we starts new one(s) with `--port` in local.
2022-05-10 19:50:20,155 INFO scripts.py:681 -- The primary external redis server `10.140.1.112:6379` is not reachable. Will starts new one(s) with `--port` in local.
2022-05-10 19:50:20,190 INFO scripts.py:697 -- Local node IP: 10.140.1.112
2022-05-10 19:50:47,423 SUCC scripts.py:739 -- --------------------
2022-05-10 19:50:47,423 SUCC scripts.py:740 -- Ray runtime started.
2022-05-10 19:50:47,423 SUCC scripts.py:741 -- --------------------
2022-05-10 19:50:47,423 INFO scripts.py:743 -- Next steps
2022-05-10 19:50:47,423 INFO scripts.py:744 -- To connect to this Ray runtime from another node, run
2022-05-10 19:50:47,423 INFO scripts.py:749 --   ray start --address='10.140.1.112:6379'
2022-05-10 19:50:47,423 INFO scripts.py:752 -- Alternatively, use the following Python code:
2022-05-10 19:50:47,423 INFO scripts.py:754 -- import ray
2022-05-10 19:50:47,424 INFO scripts.py:767 -- ray.init(address='auto', _node_ip_address='10.140.1.112')
2022-05-10 19:50:47,424 INFO scripts.py:771 -- To connect to this Ray runtime from outside of the cluster, for example to
2022-05-10 19:50:47,424 INFO scripts.py:775 -- connect to a remote cluster from your laptop directly, use the following
2022-05-10 19:50:47,424 INFO scripts.py:778 -- Python code:
2022-05-10 19:50:47,424 INFO scripts.py:780 -- import ray
2022-05-10 19:50:47,424 INFO scripts.py:786 -- ray.init(address='ray://<head_node_ip_address>:10001')
2022-05-10 19:50:47,424 INFO scripts.py:792 -- If connection fails, check your firewall settings and network configuration.
2022-05-10 19:50:47,424 INFO scripts.py:798 -- To terminate the Ray runtime, run
2022-05-10 19:50:47,424 INFO scripts.py:799 --   ray stop
succeed===========
Node status
---------------------------------------------------------------
Healthy:
 1 node_42a384b6d502cd18b6d052e98420df74116e83b73ef8146e44596910
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/128.0 CPU
 0.0/1.0 GPU
 0.00/804.774 GiB memory
 0.00/186.265 GiB object_store_memory

Demands:
 (no resource demands)
now running python script
2022-05-10 19:53:29.234043: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:272] failed call to cuInit: UNKNOWN ERROR (34)
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
EE
======================================================================
ERROR: test_1_shard_parallel (__main__.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/cache/zhangyuchang/750/alpa/tests/test_install.py", line 128, in <module>
    runner.run(suite())
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/unittest/runner.py", line 176, in run
    test(result)
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/unittest/suite.py", line 122, in run
    test(result)
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/unittest/case.py", line 676, in __call__
    return self.run(*args, **kwds)
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/unittest/case.py", line 628, in run
    testMethod()
  File "/mnt/cache/zhangyuchang/750/alpa/tests/test_install.py", line 80, in test_1_shard_parallel
    actual_state = parallel_train_step(state, batch)
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
    return fun(*args, **kwargs)
  File "/mnt/cache/zhangyuchang/750/alpa/alpa/api.py", line 108, in ret_func
    global_config.memory_budget_per_device, *abstract_args)
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/jax/linear_util.py", line 272, in memoized_fun
    ans = call(fun, *args)
  File "/mnt/cache/zhangyuchang/750/alpa/alpa/api.py", line 176, in parallelize_callable
    memory_budget_per_device, *avals)
  File "/mnt/cache/zhangyuchang/750/alpa/alpa/shard_parallel/shard_callable.py", line 127, in shard_parallel_callable
    *avals)
  File "/mnt/cache/zhangyuchang/750/alpa/alpa/shard_parallel/shard_callable.py", line 235, in shard_parallel_internal_gradient_accumulation
    backend = xb.get_backend("gpu")
  File "/mnt/cache/zhangyuchang/750/alpa/alpa/monkey_patch.py", line 41, in override_get_backend
    return default_get_backend(*args, **kwargs)
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/jax/_src/lib/xla_bridge.py", line 314, in get_backend
    return _get_backend_uncached(platform)
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/jax/_src/lib/xla_bridge.py", line 304, in _get_backend_uncached
    raise RuntimeError(f"Backend '{platform}' failed to initialize: "
jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: Backend 'gpu' failed to initialize: FAILED_PRECONDITION: No visible GPU devices.

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/cache/zhangyuchang/750/alpa/tests/test_install.py", line 80, in test_1_shard_parallel
    actual_state = parallel_train_step(state, batch)
  File "/mnt/cache/zhangyuchang/750/alpa/alpa/api.py", line 108, in ret_func
    global_config.memory_budget_per_device, *abstract_args)
  File "/mnt/cache/zhangyuchang/750/alpa/alpa/api.py", line 176, in parallelize_callable
    memory_budget_per_device, *avals)
  File "/mnt/cache/zhangyuchang/750/alpa/alpa/shard_parallel/shard_callable.py", line 127, in shard_parallel_callable
    *avals)
  File "/mnt/cache/zhangyuchang/750/alpa/alpa/shard_parallel/shard_callable.py", line 235, in shard_parallel_internal_gradient_accumulation
    backend = xb.get_backend("gpu")
  File "/mnt/cache/zhangyuchang/750/alpa/alpa/monkey_patch.py", line 41, in override_get_backend
    return default_get_backend(*args, **kwargs)
RuntimeError: Backend 'gpu' failed to initialize: FAILED_PRECONDITION: No visible GPU devices.

======================================================================
ERROR: test_2_pipeline_parallel (__main__.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/cache/zhangyuchang/750/alpa/tests/test_install.py", line 86, in test_2_pipeline_parallel
    ray.init(address="auto")
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/ray/worker.py", line 1072, in init
    connect_only=True,
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/ray/node.py", line 177, in __init__
    self.validate_ip_port(self.address)
  File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/ray/node.py", line 332, in validate_ip_port
    _ = int(port)
ValueError: invalid literal for int() with base 10: '10.140.1.112'

----------------------------------------------------------------------
Ran 2 tests in 3.069s

FAILED (errors=2)
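
For reference, the ValueError in the second test is the message Python's `int()` produces when it is given a dotted IP address where a port number was expected. The sketch below only reproduces that message and shows a stricter `host:port` check. `split_host_port` is a hypothetical helper and not part of Ray, and the log does not show how the address reached `validate_ip_port` in that form.

```python
from typing import Tuple


def split_host_port(address: str) -> Tuple[str, int]:
    """Split 'host:port' into (host, port).

    Hypothetical helper for illustration; this is not Ray's implementation.
    """
    host, sep, port = address.rpartition(":")
    if not sep or not host:
        # No ':' present (or nothing before it), so there is no port to parse.
        raise ValueError(f"expected 'host:port', got {address!r}")
    return host, int(port)


def reproduce_int_error(text: str) -> str:
    """Return the message int() raises for a non-numeric string."""
    try:
        int(text)
    except ValueError as exc:
        return str(exc)
    return ""
```

Here `split_host_port("10.140.1.112:6379")` returns `("10.140.1.112", 6379)`, while passing `"10.140.1.112"` alone is rejected before `int()` is reached.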

My environment variables are:

CC=/mnt/cache/share/gcc/gcc-7.5.0/bin/gcc-7.5.0/bin/gcc
CONDA_DEFAULT_ENV=alpa_ray_7.5.0
CONDA_EXE=/mnt/cache/share/platform/env/miniconda3.7/bin/conda
CONDA_PREFIX=/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0
CONDA_PROMPT_MODIFIER='(alpa_ray_7.5.0) '
CONDA_PYTHON_EXE=/mnt/cache/share/platform/env/miniconda3.7/bin/python
CONDA_SHLVL=1
CPATH=/mnt/cache/share/cuda-11.2/targets/x86_64-linux/include/:
CUDACXX=/mnt/cache/share/cuda-11.2/bin/nvcc
CUDA_HOME=/mnt/cache/share/cuda-11.2
CUDA_PATH=/mnt/cache/share/cuda-11.2
CUDA_TOOLKIT_ROOT_DIR=/mnt/cache/share/cuda-11.2
CXX=/mnt/cache/share/gcc/gcc-7.5.0/bin/g++
HISTCONTROL=ignoredups
HISTSIZE=50000
HISTTIMEFORMAT='%F %T zhangyuchang '
HOME=/mnt/lustre/zhangyuchang
HOSTNAME=SH-IDC1-10-140-0-32
LANG=en_US.UTF-8
LD_LIBRARY_PATH=/mnt/cache/share/cuda-11.2/lib64:/mnt/cache/share/cuda-11.2/extras/CUPTI/lib64:/mnt/cache/share/gcc/gcc-7.5.0/lib:/mnt/cache/share/gcc/gcc-7.5.0/lib64:/mnt/cache/share/gcc/gcc-7.5.0/include:/mnt/cache/share/gcc/gcc-7.5.0/bin:/mnt/cache/share/gcc/gmp-4.3.2/lib/:/mnt/cache/share/gcc/mpfr-2.4.2/lib/:/mnt/cache/share/gcc/mpc-0.8.1/lib/:/mnt/cache/share/cuda-11.2/targets/x86_64-linux/lib:/mnt/lustre/zhangyuchang/bin:/mnt/cache/share/platform/dep/nccl-2.9.8-cuda11.0/lib/:/mnt/cache/share/platform/dep/binutils-2.27/lib:/mnt/cache/share/platform/dep/openmpi-4.0.5-cuda11.0/lib:/mnt/cache/share/platform/dep/cuda11.0-cudnn8.0/lib64:/mnt/cache/share/platform/dep/cuda11.0-cudnn8.0/extras/CUPTI/lib64/:/mnt/cache/share/platform/env/miniconda3.6/lib:/mnt/cache/share/platform/dep/nccl-2.9.8-cuda11.0/lib/:/mnt/cache/share/platform/dep/binutils-2.27/lib:/mnt/cache/share/platform/dep/openmpi-4.0.5-cuda11.0/lib:/mnt/cache/share/platform/dep/cuda11.0-cudnn8.0/lib64:/mnt/cache/share/platform/dep/cuda11.0-cudnn8.0/extras/CUPTI/lib64/:/mnt/cache/share/platform/env/miniconda3.6/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64/
LESS=-R
LESSOPEN='||/usr/bin/lesspipe.sh %s'
LIBRARY_PATH=/mnt/cache/share/cuda-11.2/lib64:
LOADEDMODULES=''
LOGNAME=zhangyuchang
LSCOLORS=Gxfxcxdxbxegedabagacad
LS_COLORS='rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:'
MAIL=/var/spool/mail/zhangyuchang
MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles
MODULESHOME=/usr/share/Modules
NCCL_INSTALL_PATH=/mnt/cache/share/platform/dep/nccl-2.9.8-cuda11.0
NCCL_SOCKET_IFNAME=eth0
PAGER=less
PATH=/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/bin:/mnt/cache/share/cuda-11.2/bin:/mnt/cache/share/gcc/gcc-7.5.0/bin/:/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray/bin:/mnt/cache/share/platform/env/miniconda3.7/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/mnt/lustre/zhangyuchang/bin:/mnt/lustre/zhangyuchang/bin
PWD=/mnt/cache/zhangyuchang/750/alpa
QT_GRAPHICSSYSTEM_CHECKED=1
SHELL=/usr/local/bash/bin/bash
SHLVL=3
SSH_CLIENT='10.201.32.68 45569 22'
SSH_CONNECTION='10.201.36.3 52001 10.140.0.32 22'
SSH_TTY=/dev/pts/267
TERM=screen
TF_PATH=/mnt/cache/zhangyuchang/750/tensorflow-alpa
TMUX=/tmp/tmux-200000422/default,194688,3
TMUX_PANE=%3
USER=zhangyuchang
XDG_RUNTIME_DIR=/run/user/200000422
XDG_SESSION_ID=680243
ZSH=/mnt/lustre/zhangyuchang/.oh-my-zsh
_=export
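
The LD_LIBRARY_PATH above lists libraries built for both cuda-11.2 and cuda11.0, and several entries appear twice. Whether that is related to the cuInit failure is not established here. As a sketch, a short script can make this kind of mix easier to spot. `audit_library_path` is a hypothetical helper, and the version pattern is an assumption based on the directory names shown in this dump.

```python
import re
from collections import Counter
from typing import Dict, List


def audit_library_path(value: str) -> Dict[str, List[str]]:
    """Report duplicate entries and CUDA versions named in a search path.

    Hypothetical helper; it inspects path strings only and never touches
    the filesystem.
    """
    # Drop empty segments and trailing slashes so equal paths compare equal.
    entries = [p.rstrip("/") for p in value.split(":") if p]
    counts = Counter(entries)
    duplicates = sorted(p for p, n in counts.items() if n > 1)
    # Matches names such as 'cuda-11.2' and 'cuda11.0' (assumed convention).
    versions = sorted(
        {m for p in entries for m in re.findall(r"cuda-?(\d+\.\d+)", p)}
    )
    return {"duplicates": duplicates, "cuda_versions": versions}


if __name__ == "__main__":
    import os

    print(audit_library_path(os.environ.get("LD_LIBRARY_PATH", "")))
```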
