This repository has been archived by the owner on Oct 19, 2024. It is now read-only.

runtime_env failed to be set up #377

Closed
Vamix opened this issue Apr 7, 2022 · 2 comments

Comments

@Vamix

Vamix commented Apr 7, 2022

Hi Alpa developers, thanks for implementing such an impressive project.
I'm currently trying to use Alpa, but I've run into a problem.
After successfully building Alpa following the guide, I cannot pass tests/test_install.py due to Ray issues. (I can run the Ray examples in the docs successfully, though.)

The following is the log after I run python3 tests/test_install.py. I'm using the latest commit.

The environment is:
Hardware: 8 NVIDIA Tesla P100 GPUs
Docker image: nvidia/cuda:11.1.1-cudnn8-devel-ubuntu18.04
Python: 3.8

It seems to fail in alpa/device_mesh.py (#L848), because the runtime_env cannot be set up successfully.

I hope you can give some suggestions, thanks!

.(raylet) Traceback (most recent call last):
(raylet)   File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/agent.py", line 391, in <module>
(raylet)     loop.run_until_complete(agent.run())
(raylet)   File "/usr/lib/python3.8/asyncio/base_events.py", line 608, in run_until_complete
(raylet)     return future.result()
(raylet)   File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/agent.py", line 178, in run
(raylet)     modules = self._load_modules()
(raylet)   File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/agent.py", line 120, in _load_modules
(raylet)     c = cls(self)
(raylet)   File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 161, in __init__
(raylet)     self._metrics_agent = MetricsAgent(
(raylet)   File "/usr/local/lib/python3.8/dist-packages/ray/_private/metrics_agent.py", line 75, in __init__
(raylet)     prometheus_exporter.new_stats_exporter(
(raylet)   File "/usr/local/lib/python3.8/dist-packages/ray/_private/prometheus_exporter.py", line 332, in new_stats_exporter
(raylet)     exporter = PrometheusStatsExporter(
(raylet)   File "/usr/local/lib/python3.8/dist-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
(raylet)     self.serve_http()
(raylet)   File "/usr/local/lib/python3.8/dist-packages/ray/_private/prometheus_exporter.py", line 319, in serve_http
(raylet)     start_http_server(
(raylet)   File "/usr/local/lib/python3.8/dist-packages/prometheus_client/exposition.py", line 167, in start_wsgi_server
(raylet)     TmpServer.address_family, addr = _get_best_family(addr, port)
(raylet)   File "/usr/local/lib/python3.8/dist-packages/prometheus_client/exposition.py", line 156, in _get_best_family
(raylet)     infos = socket.getaddrinfo(address, port)
(raylet)   File "/usr/lib/python3.8/socket.py", line 914, in getaddrinfo
(raylet)     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
(raylet) socket.gaierror: [Errno -2] Name or service not known
(raylet) 
(raylet) During handling of the above exception, another exception occurred:
(raylet) 
(raylet) Traceback (most recent call last):
(raylet)   File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/agent.py", line 407, in <module>
(raylet)     gcs_publisher = GcsPublisher(args.gcs_address)
(raylet) TypeError: __init__() takes 1 positional argument but 2 were given
Exception in thread Thread-9:
Exception in thread Thread-10:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/alpa-2/alpa/alpa/pipeline_parallel/pipeshard_parallel.py", line 325, in launch_func
    self._target(*self._args, **self._kwargs)
  File "/home/alpa-2/alpa/alpa/pipeline_parallel/pipeshard_parallel.py", line 325, in launch_func
    physical_meshes[i] = virtual_meshes[i].get_physical_mesh()
  File "/home/alpa-2/alpa/alpa/device_mesh.py", line 1357, in get_physical_mesh
    physical_meshes[i] = virtual_meshes[i].get_physical_mesh()
  File "/home/alpa-2/alpa/alpa/device_mesh.py", line 1357, in get_physical_mesh
    return DistributedPhysicalDeviceMesh(
  File "/home/alpa-2/alpa/alpa/device_mesh.py", line 791, in __init__
    return DistributedPhysicalDeviceMesh(
  File "/home/alpa-2/alpa/alpa/device_mesh.py", line 791, in __init__
    self._launch_xla_servers()
  File "/home/alpa-2/alpa/alpa/device_mesh.py", line 848, in _launch_xla_servers
    self._launch_xla_servers()
  File "/home/alpa-2/alpa/alpa/device_mesh.py", line 848, in _launch_xla_servers
    self.sync_workers()
  File "/home/alpa-2/alpa/alpa/device_mesh.py", line 1063, in sync_workers
    self.sync_workers()
  File "/home/alpa-2/alpa/alpa/device_mesh.py", line 1063, in sync_workers
    ray.get([w.sync.remote() for w in self.workers])
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    ray.get([w.sync.remote() for w in self.workers])
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/worker.py", line 1765, in get
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/worker.py", line 1765, in get
    raise value
    raise value
ray.exceptions.RuntimeEnvSetupError: The runtime_env failed to be set up.
ray.exceptions.RuntimeEnvSetupError: The runtime_env failed to be set up.
(pid=gcs_server) [2022-04-07 08:47:35,923 E 8360 8360] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node cd6438e3757c0527cd8742226b4af76c9a331c638b64b7d93f66506d for actor 714e6419c01bcbd1b050271001000000(MeshHostWorker.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-04-07 08:47:35,923 E 8360 8360] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node cd6438e3757c0527cd8742226b4af76c9a331c638b64b7d93f66506d for actor c741609f8db7a71210f89e1501000000(MeshHostWorker.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
E
======================================================================
ERROR: test_2_pipeline_parallel (__main__.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests/test_install.py", line 127, in <module>
    runner.run(suite())
  File "/usr/lib/python3.8/unittest/runner.py", line 176, in run
    test(result)
  File "/usr/lib/python3.8/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/usr/lib/python3.8/unittest/suite.py", line 122, in run
    test(result)
  File "/usr/lib/python3.8/unittest/case.py", line 736, in __call__
    return self.run(*args, **kwds)
  File "/usr/lib/python3.8/unittest/case.py", line 676, in run
    self._callTestMethod(testMethod)
  File "/usr/lib/python3.8/unittest/case.py", line 633, in _callTestMethod
    method()
  File "tests/test_install.py", line 112, in test_2_pipeline_parallel
    actual_state = parallel_train_step(state, batch)
  File "/home/jax-alpa/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
    return fun(*args, **kwargs)
  File "/home/alpa-2/alpa/alpa/api.py", line 107, in ret_func
    compiled_func = parallelize_callable(
  File "/home/jax-alpa/jax/linear_util.py", line 263, in memoized_fun
    ans = call(fun, *args)
  File "/home/alpa-2/alpa/alpa/api.py", line 180, in parallelize_callable
    return pipeshard_parallel_callable(fun, in_tree, out_tree_thunk,
  File "/home/jax-alpa/jax/linear_util.py", line 263, in memoized_fun
    ans = call(fun, *args)
  File "/home/alpa-2/alpa/alpa/pipeline_parallel/pipeshard_parallel.py", line 204, in pipeshard_parallel_callable
    jp = DecentralizedDistributedRuntime(pipeline_stages=xla_stages,
  File "/home/alpa-2/alpa/alpa/pipeline_parallel/decentralized_distributed_runtime.py", line 176, in __init__
    super().__init__(pipeline_stages=pipeline_stages,
  File "/home/alpa-2/alpa/alpa/pipeline_parallel/base_runtime.py", line 122, in __init__
    self._establish_nccl_groups()
  File "/home/alpa-2/alpa/alpa/pipeline_parallel/base_runtime.py", line 202, in _establish_nccl_groups
    self.physical_meshes.establish_nccl_group(i, j)
  File "/home/alpa-2/alpa/alpa/pipeline_parallel/device_mesh_group.py", line 36, in establish_nccl_group
    device_strs = OrderedSet(src_mesh.device_strs + dst_mesh.device_strs)
jax._src.traceback_util.UnfilteredStackTrace: AttributeError: 'NoneType' object has no attribute 'device_strs'

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "tests/test_install.py", line 112, in test_2_pipeline_parallel
    actual_state = parallel_train_step(state, batch)
  File "/home/alpa-2/alpa/alpa/api.py", line 107, in ret_func
    compiled_func = parallelize_callable(
  File "/home/alpa-2/alpa/alpa/api.py", line 180, in parallelize_callable
    return pipeshard_parallel_callable(fun, in_tree, out_tree_thunk,
  File "/home/alpa-2/alpa/alpa/pipeline_parallel/pipeshard_parallel.py", line 204, in pipeshard_parallel_callable
    jp = DecentralizedDistributedRuntime(pipeline_stages=xla_stages,
  File "/home/alpa-2/alpa/alpa/pipeline_parallel/decentralized_distributed_runtime.py", line 176, in __init__
    super().__init__(pipeline_stages=pipeline_stages,
  File "/home/alpa-2/alpa/alpa/pipeline_parallel/base_runtime.py", line 122, in __init__
    self._establish_nccl_groups()
  File "/home/alpa-2/alpa/alpa/pipeline_parallel/base_runtime.py", line 202, in _establish_nccl_groups
    self.physical_meshes.establish_nccl_group(i, j)
  File "/home/alpa-2/alpa/alpa/pipeline_parallel/device_mesh_group.py", line 36, in establish_nccl_group
    device_strs = OrderedSet(src_mesh.device_strs + dst_mesh.device_strs)
AttributeError: 'NoneType' object has no attribute 'device_strs'

----------------------------------------------------------------------
Ran 2 tests in 59.183s

FAILED (errors=1)
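
(An aside on a possible root cause, offered as a guess rather than a confirmed diagnosis: the raylet traceback above bottoms out in socket.gaierror from getaddrinfo, which usually means a hostname cannot be resolved. The sketch below only checks whether the container's own hostname resolves; the workaround in the comments is an assumption, not a verified fix.)

import socket

# Hypothetical diagnostic, assuming the gaierror above comes from the
# container's hostname not being resolvable. Run inside the same container
# that runs the raylet.
hostname = socket.gethostname()
try:
    socket.getaddrinfo(hostname, None)
    print(f"'{hostname}' resolves; hostname resolution is probably not the problem")
except socket.gaierror as err:
    print(f"'{hostname}' does not resolve: {err}")
    # If this fails, mapping the hostname in /etc/hosts or starting the
    # container with an explicit --hostname may be a workaround.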
@merrymercy
Member

This seems to be a bug in Ray. Try the nightlies: https://docs.ray.io/en/latest/ray-overview/installation.html#daily-releases-nightlies
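
(A hedged aside, not part of the original reply: after installing a nightly wheel, it can be worth confirming which Ray build the workers actually import, since a stale install would reproduce the same agent errors.)

# Illustrative check only: print the installed Ray version and commit hash
# to confirm the nightly wheel replaced the previous build.
import ray
print(ray.__version__, ray.__commit__)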

@Vamix
Author

Vamix commented Apr 9, 2022

This seems to be a bug in Ray. Try the nightlies: https://docs.ray.io/en/latest/ray-overview/installation.html#daily-releases-nightlies

Thanks for your reply! I fixed this by switching to another Docker image, nvcr.io/nvidia/tensorflow:22.02-tf2-py3.
