This repository has been archived by the owner on Oct 19, 2024. It is now read-only.
Hi Alpa developers, thanks for implementing such an impressive project.
I'm trying to use Alpa but have run into a problem.
After building Alpa successfully by following the installation guide, I cannot pass test_install.py because of Ray errors. (The Ray examples in the docs run successfully, though.)
The log from running python3 tests/test_install.py is included below. I'm using the latest commit.
The environment is:
Hardware: 8 NVIDIA Tesla P100 GPUs
docker image: nvidia/cuda:11.1.1-cudnn8-devel-ubuntu18.04
Python 3.8
It seems to fail in alpa/device_mesh.py (#L848) because the runtime_env cannot be set up successfully.
Hope you can give some suggestions, thanks!
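One check that may help narrow this down: the earliest exception in the log is a socket.gaierror ("Name or service not known") raised from socket.getaddrinfo inside the raylet's metrics agent. The sketch below only tests whether name resolution works inside the container. It assumes the address that fails is the container's own hostname or localhost, which the log does not confirm, and the helper name can_resolve is made up for illustration.

```python
import socket


def can_resolve(host: str, port: int = 0) -> bool:
    """Return True if socket.getaddrinfo can resolve `host`, False on gaierror."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        return False


if __name__ == "__main__":
    # Assumption: the unresolved name is the container hostname or localhost.
    for name in (socket.gethostname(), "localhost"):
        status = "resolves" if can_resolve(name) else "does NOT resolve"
        print(f"{name!r}: {status}")
```

If the hostname does not resolve here, the problem is probably in the container's name resolution setup (for example /etc/hosts) rather than in Alpa itself.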
.(raylet) Traceback (most recent call last):
(raylet) File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/agent.py", line 391, in <module>
(raylet) loop.run_until_complete(agent.run())
(raylet) File "/usr/lib/python3.8/asyncio/base_events.py", line 608, in run_until_complete
(raylet) return future.result()
(raylet) File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/agent.py", line 178, in run
(raylet) modules = self._load_modules()
(raylet) File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/agent.py", line 120, in _load_modules
(raylet) c = cls(self)
(raylet) File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 161, in __init__
(raylet) self._metrics_agent = MetricsAgent(
(raylet) File "/usr/local/lib/python3.8/dist-packages/ray/_private/metrics_agent.py", line 75, in __init__
(raylet) prometheus_exporter.new_stats_exporter(
(raylet) File "/usr/local/lib/python3.8/dist-packages/ray/_private/prometheus_exporter.py", line 332, in new_stats_exporter
(raylet) exporter = PrometheusStatsExporter(
(raylet) File "/usr/local/lib/python3.8/dist-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
(raylet) self.serve_http()
(raylet) File "/usr/local/lib/python3.8/dist-packages/ray/_private/prometheus_exporter.py", line 319, in serve_http
(raylet) start_http_server(
(raylet) File "/usr/local/lib/python3.8/dist-packages/prometheus_client/exposition.py", line 167, in start_wsgi_server
(raylet) TmpServer.address_family, addr = _get_best_family(addr, port)
(raylet) File "/usr/local/lib/python3.8/dist-packages/prometheus_client/exposition.py", line 156, in _get_best_family
(raylet) infos = socket.getaddrinfo(address, port)
(raylet) File "/usr/lib/python3.8/socket.py", line 914, in getaddrinfo
(raylet) for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
(raylet) socket.gaierror: [Errno -2] Name or service not known
(raylet)
(raylet) During handling of the above exception, another exception occurred:
(raylet)
(raylet) Traceback (most recent call last):
(raylet) File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/agent.py", line 407, in <module>
(raylet) gcs_publisher = GcsPublisher(args.gcs_address)
(raylet) TypeError: __init__() takes 1 positional argument but 2 were given
Exception in thread Thread-9:
Exception in thread Thread-10:
Traceback (most recent call last):
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
self.run()
File "/usr/lib/python3.8/threading.py", line 870, in run
File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/alpa-2/alpa/alpa/pipeline_parallel/pipeshard_parallel.py", line 325, in launch_func
self._target(*self._args, **self._kwargs)
File "/home/alpa-2/alpa/alpa/pipeline_parallel/pipeshard_parallel.py", line 325, in launch_func
physical_meshes[i] = virtual_meshes[i].get_physical_mesh()
File "/home/alpa-2/alpa/alpa/device_mesh.py", line 1357, in get_physical_mesh
physical_meshes[i] = virtual_meshes[i].get_physical_mesh()
File "/home/alpa-2/alpa/alpa/device_mesh.py", line 1357, in get_physical_mesh
return DistributedPhysicalDeviceMesh(
File "/home/alpa-2/alpa/alpa/device_mesh.py", line 791, in __init__
return DistributedPhysicalDeviceMesh(
File "/home/alpa-2/alpa/alpa/device_mesh.py", line 791, in __init__
self._launch_xla_servers()
File "/home/alpa-2/alpa/alpa/device_mesh.py", line 848, in _launch_xla_servers
self._launch_xla_servers()
File "/home/alpa-2/alpa/alpa/device_mesh.py", line 848, in _launch_xla_servers
self.sync_workers()
File "/home/alpa-2/alpa/alpa/device_mesh.py", line 1063, in sync_workers
self.sync_workers()
File "/home/alpa-2/alpa/alpa/device_mesh.py", line 1063, in sync_workers
ray.get([w.sync.remote() for w in self.workers])
File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
ray.get([w.sync.remote() for w in self.workers])
File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/ray/worker.py", line 1765, in get
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/ray/worker.py", line 1765, in get
raise value
raise value
ray.exceptions.RuntimeEnvSetupError: The runtime_env failed to be set up.
ray.exceptions.RuntimeEnvSetupError: The runtime_env failed to be set up.
(pid=gcs_server) [2022-04-07 08:47:35,923 E 8360 8360] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node cd6438e3757c0527cd8742226b4af76c9a331c638b64b7d93f66506d for actor 714e6419c01bcbd1b050271001000000(MeshHostWorker.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-04-07 08:47:35,923 E 8360 8360] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node cd6438e3757c0527cd8742226b4af76c9a331c638b64b7d93f66506d for actor c741609f8db7a71210f89e1501000000(MeshHostWorker.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
E
======================================================================
ERROR: test_2_pipeline_parallel (__main__.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "tests/test_install.py", line 127, in <module>
runner.run(suite())
File "/usr/lib/python3.8/unittest/runner.py", line 176, in run
test(result)
File "/usr/lib/python3.8/unittest/suite.py", line 84, in __call__
return self.run(*args, **kwds)
File "/usr/lib/python3.8/unittest/suite.py", line 122, in run
test(result)
File "/usr/lib/python3.8/unittest/case.py", line 736, in __call__
return self.run(*args, **kwds)
File "/usr/lib/python3.8/unittest/case.py", line 676, in run
self._callTestMethod(testMethod)
File "/usr/lib/python3.8/unittest/case.py", line 633, in _callTestMethod
method()
File "tests/test_install.py", line 112, in test_2_pipeline_parallel
actual_state = parallel_train_step(state, batch)
File "/home/jax-alpa/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
return fun(*args, **kwargs)
File "/home/alpa-2/alpa/alpa/api.py", line 107, in ret_func
compiled_func = parallelize_callable(
File "/home/jax-alpa/jax/linear_util.py", line 263, in memoized_fun
ans = call(fun, *args)
File "/home/alpa-2/alpa/alpa/api.py", line 180, in parallelize_callable
return pipeshard_parallel_callable(fun, in_tree, out_tree_thunk,
File "/home/jax-alpa/jax/linear_util.py", line 263, in memoized_fun
ans = call(fun, *args)
File "/home/alpa-2/alpa/alpa/pipeline_parallel/pipeshard_parallel.py", line 204, in pipeshard_parallel_callable
jp = DecentralizedDistributedRuntime(pipeline_stages=xla_stages,
File "/home/alpa-2/alpa/alpa/pipeline_parallel/decentralized_distributed_runtime.py", line 176, in __init__
super().__init__(pipeline_stages=pipeline_stages,
File "/home/alpa-2/alpa/alpa/pipeline_parallel/base_runtime.py", line 122, in __init__
self._establish_nccl_groups()
File "/home/alpa-2/alpa/alpa/pipeline_parallel/base_runtime.py", line 202, in _establish_nccl_groups
self.physical_meshes.establish_nccl_group(i, j)
File "/home/alpa-2/alpa/alpa/pipeline_parallel/device_mesh_group.py", line 36, in establish_nccl_group
device_strs = OrderedSet(src_mesh.device_strs + dst_mesh.device_strs)
jax._src.traceback_util.UnfilteredStackTrace: AttributeError: 'NoneType' object has no attribute 'device_strs'
The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.
--------------------
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "tests/test_install.py", line 112, in test_2_pipeline_parallel
actual_state = parallel_train_step(state, batch)
File "/home/alpa-2/alpa/alpa/api.py", line 107, in ret_func
compiled_func = parallelize_callable(
File "/home/alpa-2/alpa/alpa/api.py", line 180, in parallelize_callable
return pipeshard_parallel_callable(fun, in_tree, out_tree_thunk,
File "/home/alpa-2/alpa/alpa/pipeline_parallel/pipeshard_parallel.py", line 204, in pipeshard_parallel_callable
jp = DecentralizedDistributedRuntime(pipeline_stages=xla_stages,
File "/home/alpa-2/alpa/alpa/pipeline_parallel/decentralized_distributed_runtime.py", line 176, in __init__
super().__init__(pipeline_stages=pipeline_stages,
File "/home/alpa-2/alpa/alpa/pipeline_parallel/base_runtime.py", line 122, in __init__
self._establish_nccl_groups()
File "/home/alpa-2/alpa/alpa/pipeline_parallel/base_runtime.py", line 202, in _establish_nccl_groups
self.physical_meshes.establish_nccl_group(i, j)
File "/home/alpa-2/alpa/alpa/pipeline_parallel/device_mesh_group.py", line 36, in establish_nccl_group
device_strs = OrderedSet(src_mesh.device_strs + dst_mesh.device_strs)
AttributeError: 'NoneType' object has no attribute 'device_strs'
----------------------------------------------------------------------
Ran 2 tests in 59.183s
FAILED (errors=1)
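The final AttributeError looks like a secondary symptom. In the log above, Thread-9 and Thread-10 both die with RuntimeEnvSetupError inside launch_func while assigning physical_meshes[i], and the main thread later reads a None mesh. The sketch below reproduces that pattern in isolation. It is not Alpa's code: launch_all, failing, and the placeholder list are hypothetical, and it assumes the mesh list is pre-filled with None before the threads start.

```python
import threading


def launch_all(factories):
    """Run each factory in its own thread and collect results by index."""
    results = [None] * len(factories)  # assumed placeholder initialisation

    def launch(i):
        # An exception raised here ends only this thread; results[i] stays None.
        results[i] = factories[i]()

    threads = [threading.Thread(target=launch, args=(i,))
               for i in range(len(factories))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def failing():
    # Stands in for the RuntimeEnvSetupError raised in the worker threads.
    raise RuntimeError("runtime_env setup failed")


meshes = launch_all([failing, lambda: "mesh-1"])
# The failing thread prints its traceback to stderr, but the caller still
# gets a list back, with None in the failed slot.
print(meshes)
```

So the error to fix first is probably the RuntimeEnvSetupError (and the raylet traceback before it), not the AttributeError reported by the test.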