I'm operating on a cluster. Ray saw my GPU but alpa didn't. I followed the installation documentation Install Alpa. And I confirmed I used --enable_cuda when I compiled jax-alpa. When running tests/test_install.py errors are reported, you can see the error log attached below for more details.
RAY_ADDRESS="10.140.1.112:6379" XLA_FLAGS="--xla_gpu_cuda_data_dir=/mnt/cache/share/cuda-11.2" srun -p caif_dev --gres=gpu:1 -w SH-IDC1-10-140-1-112 -n1 bash test_install.sh
echo "ray being starting"
ray start --head --node-ip-address 10.140.0.112 --address='10.140.0.112:6379'
echo "succeed==========="
ray status
echo "now running python script"
XLA_FLAGS="--xla_gpu_cuda_data_dir=/mnt/cache/share/cuda-11.2" python /mnt/cache/zhangyuchang/750/alpa/tests/test_install.py
ray stop
phoenix-srun: Job 126515 scheduled successfully!
Current QUOTA_TYPE is [reserved], which means the job has occupied quota in RESERVED_TOTAL under your partition.
Current PHX_PRIORITY is normal
ray being starting
Usage stats collection will be enabled by default in the next release. See https://github.com/ray-project/ray/issues/20857 for more details.
2022-05-10 19:50:46,487 ERROR services.py:1474 -- Failed to start the dashboard: Failed to start the dashboard
The last 10 lines of /tmp/ray/session_2022-05-10_19-50-20_209620_45113/logs/dashboard.log:
2022-05-10 19:50:38,947 INFO utils.py:99 -- Get all modules by type: DashboardHeadModule
2022-05-10 19:50:46,487 ERROR services.py:1475 -- Failed to start the dashboard
The last 10 lines of /tmp/ray/session_2022-05-10_19-50-20_209620_45113/logs/dashboard.log:
2022-05-10 19:50:38,947 INFO utils.py:99 -- Get all modules by type: DashboardHeadModule
Traceback (most recent call last):
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/ray/_private/services.py", line 1451, in start_dashboard
raise Exception(err_msg + last_log_str)
Exception: Failed to start the dashboard
The last 10 lines of /tmp/ray/session_2022-05-10_19-50-20_209620_45113/logs/dashboard.log:
2022-05-10 19:50:38,947 INFO utils.py:99 -- Get all modules by type: DashboardHeadModule
2022-05-10 19:50:20,155 WARN scripts.py:652 -- Specifying --address for external Redis address is deprecated. Please specify environment variable RAY_REDIS_ADDRESS=10.140.1.112:6379 instead.
2022-05-10 19:50:20,155 INFO scripts.py:659 -- Will use `10.140.1.112:6379` as external Redis server address(es). If the primary one is not reachable, we starts new one(s) with `--port` in local.
2022-05-10 19:50:20,155 INFO scripts.py:681 -- The primary external redis server `10.140.1.112:6379` is not reachable. Will starts new one(s) with `--port` in local.
2022-05-10 19:50:20,190 INFO scripts.py:697 -- Local node IP: 10.140.1.112
2022-05-10 19:50:47,423 SUCC scripts.py:739 -- --------------------
2022-05-10 19:50:47,423 SUCC scripts.py:740 -- Ray runtime started.
2022-05-10 19:50:47,423 SUCC scripts.py:741 -- --------------------
2022-05-10 19:50:47,423 INFO scripts.py:743 -- Next steps
2022-05-10 19:50:47,423 INFO scripts.py:744 -- To connect to this Ray runtime from another node, run
2022-05-10 19:50:47,423 INFO scripts.py:749 -- ray start --address='10.140.1.112:6379'
2022-05-10 19:50:47,423 INFO scripts.py:752 -- Alternatively, use the following Python code:
2022-05-10 19:50:47,423 INFO scripts.py:754 -- import ray
2022-05-10 19:50:47,424 INFO scripts.py:767 -- ray.init(address='auto', _node_ip_address='10.140.1.112')
2022-05-10 19:50:47,424 INFO scripts.py:771 -- To connect to this Ray runtime from outside of the cluster, for example to
2022-05-10 19:50:47,424 INFO scripts.py:775 -- connect to a remote cluster from your laptop directly, use the following
2022-05-10 19:50:47,424 INFO scripts.py:778 -- Python code:
2022-05-10 19:50:47,424 INFO scripts.py:780 -- import ray
2022-05-10 19:50:47,424 INFO scripts.py:786 -- ray.init(address='ray://<head_node_ip_address>:10001')
2022-05-10 19:50:47,424 INFO scripts.py:792 -- If connection fails, check your firewall settings and network configuration.
2022-05-10 19:50:47,424 INFO scripts.py:798 -- To terminate the Ray runtime, run
2022-05-10 19:50:47,424 INFO scripts.py:799 -- ray stop
succeed===========
Node status
---------------------------------------------------------------
Healthy:
1 node_42a384b6d502cd18b6d052e98420df74116e83b73ef8146e44596910
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/128.0 CPU
0.0/1.0 GPU
0.00/804.774 GiB memory
0.00/186.265 GiB object_store_memory
Demands:
(no resource demands)
now running python script
2022-05-10 19:53:29.234043: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:272] failed call to cuInit: UNKNOWN ERROR (34)
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
EE
======================================================================
ERROR: test_1_shard_parallel (__main__.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/mnt/cache/zhangyuchang/750/alpa/tests/test_install.py", line 128, in <module>
runner.run(suite())
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/unittest/runner.py", line 176, in run
test(result)
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/unittest/suite.py", line 84, in __call__
return self.run(*args, **kwds)
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/unittest/suite.py", line 122, in run
test(result)
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/unittest/case.py", line 676, in __call__
return self.run(*args, **kwds)
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/unittest/case.py", line 628, in run
testMethod()
File "/mnt/cache/zhangyuchang/750/alpa/tests/test_install.py", line 80, in test_1_shard_parallel
actual_state = parallel_train_step(state, batch)
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
return fun(*args, **kwargs)
File "/mnt/cache/zhangyuchang/750/alpa/alpa/api.py", line 108, in ret_func
global_config.memory_budget_per_device, *abstract_args)
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/jax/linear_util.py", line 272, in memoized_fun
ans = call(fun, *args)
File "/mnt/cache/zhangyuchang/750/alpa/alpa/api.py", line 176, in parallelize_callable
memory_budget_per_device, *avals)
File "/mnt/cache/zhangyuchang/750/alpa/alpa/shard_parallel/shard_callable.py", line 127, in shard_parallel_callable
*avals)
File "/mnt/cache/zhangyuchang/750/alpa/alpa/shard_parallel/shard_callable.py", line 235, in shard_parallel_internal_gradient_accumulation
backend = xb.get_backend("gpu")
File "/mnt/cache/zhangyuchang/750/alpa/alpa/monkey_patch.py", line 41, in override_get_backend
return default_get_backend(*args, **kwargs)
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/jax/_src/lib/xla_bridge.py", line 314, in get_backend
return _get_backend_uncached(platform)
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/jax/_src/lib/xla_bridge.py", line 304, in _get_backend_uncached
raise RuntimeError(f"Backend '{platform}' failed to initialize: "
jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: Backend 'gpu' failed to initialize: FAILED_PRECONDITION: No visible GPU devices.
The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.
--------------------
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/mnt/cache/zhangyuchang/750/alpa/tests/test_install.py", line 80, in test_1_shard_parallel
actual_state = parallel_train_step(state, batch)
File "/mnt/cache/zhangyuchang/750/alpa/alpa/api.py", line 108, in ret_func
global_config.memory_budget_per_device, *abstract_args)
File "/mnt/cache/zhangyuchang/750/alpa/alpa/api.py", line 176, in parallelize_callable
memory_budget_per_device, *avals)
File "/mnt/cache/zhangyuchang/750/alpa/alpa/shard_parallel/shard_callable.py", line 127, in shard_parallel_callable
*avals)
File "/mnt/cache/zhangyuchang/750/alpa/alpa/shard_parallel/shard_callable.py", line 235, in shard_parallel_internal_gradient_accumulation
backend = xb.get_backend("gpu")
File "/mnt/cache/zhangyuchang/750/alpa/alpa/monkey_patch.py", line 41, in override_get_backend
return default_get_backend(*args, **kwargs)
RuntimeError: Backend 'gpu' failed to initialize: FAILED_PRECONDITION: No visible GPU devices.
======================================================================
ERROR: test_2
_pipeline_parallel (__main__.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/mnt/cache/zhangyuchang/750/alpa/tests/test_install.py", line 86, in test_2_pipeline_parallel
ray.init(address="auto")
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/ray/worker.py", line 1072, in init
connect_only=True,
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/ray/node.py", line 177, in __init__
self.validate_ip_port(self.address)
File "/mnt/lustre/zhangyuchang/.conda/envs/alpa_ray_7.5.0/lib/python3.7/site-packages/ray/node.py", line 332, in validate_ip_port
_ = int(port)
ValueError: invalid literal for int() with base 10: '10.140.1.112'
----------------------------------------------------------------------
Ran 2 tests in 3.069s
FAILED (errors=2)
Hi,
I don't know if it's appropriate to file this as a bug, but it's been bugging me for a long time and I have no way to fix it.
I'm operating on a cluster. Ray saw my GPU but alpa didn't. I followed the installation documentation Install Alpa. And I confirmed I used --enable_cuda when I compiled jax-alpa. When running
tests/test_install.pyerrors are reported, you can see the error log attached below for more details.System information and environment
conda listandpip listshow the Alpa version is 0.0.0)To Reproduce
I ran
and my test_install.sh is:
Log
and my Environment Variables are: