Here is the traceback.
Traceback (most recent call last):
File "/ROLL/roll/distributed/scheduler/decorator.py", line 296, in inner [repeated 7x across cluster]
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/ROLL/roll/distributed/strategy/megatron_strategy.py", line 978, in initialize [repeated 14x across cluster]
self.strategy.initialize(model_provider=default_actor_model_provider) [repeated 7x across cluster]
self.forward_backward_func = get_forward_backward_func()
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lib/python3.11/site-packages/megatron/core/pipeline_parallel/schedules.py", line 114, in get_forward_backward_func [repeated 7x across cluster]
pipeline_model_parallel_size = parallel_state.get_pipeline_model_parallel_world_size()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lib/python3.11/site-packages/megatron/core/parallel_state.py", line 1559, in get_pipeline_model_parallel_world_size [repeated 7x across cluster]
pp_group = get_pipeline_model_parallel_group()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lib/python3.11/site-packages/megatron/core/parallel_state.py", line 1400, in get_pipeline_model_parallel_group [repeated 7x across cluster]
_PIPELINE_MODEL_PARALLEL_GROUP is not None
AssertionError: pipeline_model parallel group is not initialized
I hit this error when running https://github.com/alibaba/ROLL/blob/main/examples/qwen2.5-7B-sft_megatron/sft_config.yaml with the megatron backend. It fails with AssertionError: pipeline_model parallel group is not initialized.