Open
Description
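I am running two programs on a single machine, each using a different pair of GPUs. The first sh file sets:
export MULTI_TENANT=1
export MASTER_PORT=6379
export DASHBOARD_PORT=8265
and the second sets:
export MULTI_TENANT=1
export MASTER_PORT=6380
export DASHBOARD_PORT=8266
After task 1 finished, task 2 failed with the following error: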
Traceback (most recent call last):
  File "/fs/fast/ROLL/examples/start_rlvr_vl_custom_pipeline.py", line 34, in <module>
    main()
  File "/fs/fast/ROLL/examples/start_rlvr_vl_custom_pipeline.py", line 30, in main
    pipeline.run()
  File "/fs/fast/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/fs/fast/ROLL/roll/pipeline/rlvr/rlvr_custom_vlm_pipeline.py", line 524, in run
    self.do_checkpoint(global_step=global_step)
  File "/fs/fast/ROLL/roll/pipeline/base_pipeline.py", line 84, in do_checkpoint
    ckpt_metrics = DataProto.materialize_concat(data_refs=ckpt_metrics_refs)
  File "/fs/fast/ROLL/roll/distributed/scheduler/protocol.py", line 854, in materialize_concat
    data: List["DataProto"] = ray.get(data_refs, timeout=timeout)
  File "/fs/fast/anaconda3/envs/qwen/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/fs/fast/anaconda3/envs/qwen/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/fs/fast/anaconda3/envs/qwen/lib/python3.10/site-packages/ray/_private/worker.py", line 2822, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/fs/fast/anaconda3/envs/qwen/lib/python3.10/site-packages/ray/_private/worker.py", line 932, in get_objects
    raise value
ray.exceptions.ActorUnavailableError: The actor 1483e6d031abd82a654eb09902000000 is unavailable: The actor is temporarily unavailable: RpcError: RPC Error message: Socket closed; RPC Error details: rpc_code: 14. The task may or maynot have been executed on the actor.
[2026-02-08 19:39:17,054 E 16388 17032] gcs_rpc_client.h:196: Failed to connect to GCS within 60 seconds. GCS may have been killed. It's either GCS is terminated by `ray stop` or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#logging-directory-structure. The program will terminate.
I have also run three tasks this way before, and after task 1 finished, tasks 2 and 3 both failed with the same error. I therefore suspect that running the tasks in parallel on a single machine causes the remaining tasks to be killed. Is there any way to solve this?
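One way to narrow this down is to watch the two MASTER_PORT values while both jobs run, and again right after task 1 exits. The following is only a minimal diagnostic sketch, not part of ROLL or Ray: the helper names `port_is_open` and `check_ports` are hypothetical, and it assumes MASTER_PORT is the port each job's Ray GCS listens on, which the scripts suggest but which is not confirmed here.

```python
import socket
from typing import Dict, Iterable


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError on refusal or timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_ports(host: str, ports: Iterable[int]) -> Dict[int, bool]:
    """Probe each port once and report whether it is reachable."""
    return {port: port_is_open(host, port) for port in ports}


if __name__ == "__main__":
    # Assumption: 6379 and 6380 are the MASTER_PORT values of task 1 and task 2.
    for port, is_up in check_ports("127.0.0.1", [6379, 6380]).items():
        print(f"port {port}: {'listening' if is_up else 'not reachable'}")
```

If 6380 stops accepting connections at the moment task 1 finishes, that would be consistent with task 1's shutdown also stopping the GCS that task 2 depends on (for example through a machine-wide `ray stop`, which the log message itself names as one possible cause). If 6380 stays up, the gcs_server.out log mentioned in the error is the next place to look.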