Description
<x_grpo_trainer.XGRPOTrainer object at 0x7f49e9163e50>
2025-03-26 23:00:15 - INFO - __main__ - *** Train ***
[INFO|deepspeed.py:386] 2025-03-26 23:00:15,856 >> Detected ZeRO Offload and non-DeepSpeed optimizers: This combination should work as long as the custom optimizer has both CPU and GPU implementation (except LAMB)
Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py311_cu124/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
[1/1] c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/opt/conda/lib -L/opt/conda/lib/python3.11/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
FAILED: cpu_adam.so
c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/opt/conda/lib -L/opt/conda/lib/python3.11/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
/usr/bin/ld: cannot find -lcurand: No such file or directory
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
Loading extension module cpu_adam...
[rank1]: Traceback (most recent call last):
[rank1]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2104, in _run_ninja_build
[rank1]: subprocess.run(
[rank1]: File "/opt/conda/lib/python3.11/subprocess.py", line 571, in run
[rank1]: raise CalledProcessError(retcode, process.args,
[rank1]: subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
[rank1]: The above exception was the direct cause of the following exception:
[rank1]: Traceback (most recent call last):
[rank1]: File "/root/hy-nas/X-R1/src/x_r1/grpo.py", line 275, in <module>
[rank1]: main(script_args, training_args, model_args )
[rank1]: File "/root/hy-nas/X-R1/src/x_r1/grpo.py", line 239, in main
[rank1]: train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2245, in train
[rank1]: return inner_training_loop(
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2369, in _inner_training_loop
[rank1]: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/accelerate/accelerator.py", line 1392, in prepare
[rank1]: result = self._prepare_deepspeed(*args)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/accelerate/accelerator.py", line 1942, in _prepare_deepspeed
[rank1]: optimizer = map_pytorch_optim_to_deepspeed(optimizer)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 97, in map_pytorch_optim_to_deepspeed
[rank1]: return optimizer_class(optimizer.param_groups, **defaults)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
[rank1]: self.ds_opt_adam = CPUAdamBuilder().load()
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 540, in load
[rank1]: return self.jit_load(verbose)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 587, in jit_load
[rank1]: op_module = load(name=self.name,
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1314, in load
[rank1]: return _jit_compile(
[rank1]: ^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1721, in _jit_compile
[rank1]: _write_ninja_file_and_build_library(
[rank1]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1833, in _write_ninja_file_and_build_library
[rank1]: _run_ninja_build(
[rank1]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2120, in _run_ninja_build
[rank1]: raise RuntimeError(message) from e
[rank1]: RuntimeError: Error building extension 'cpu_adam'
[rank2]: Traceback (most recent call last):
[rank2]: File "/root/hy-nas/X-R1/src/x_r1/grpo.py", line 275, in <module>
[rank2]: main(script_args, training_args, model_args )
[rank2]: File "/root/hy-nas/X-R1/src/x_r1/grpo.py", line 239, in main
[rank2]: train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2245, in train
[rank2]: return inner_training_loop(
[rank2]: ^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2369, in _inner_training_loop
[rank2]: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/accelerate/accelerator.py", line 1392, in prepare
[rank2]: result = self._prepare_deepspeed(*args)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/accelerate/accelerator.py", line 1942, in _prepare_deepspeed
[rank2]: optimizer = map_pytorch_optim_to_deepspeed(optimizer)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 97, in map_pytorch_optim_to_deepspeed
[rank2]: return optimizer_class(optimizer.param_groups, **defaults)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
[rank2]: self.ds_opt_adam = CPUAdamBuilder().load()
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 540, in load
[rank2]: return self.jit_load(verbose)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 587, in jit_load
[rank2]: op_module = load(name=self.name,
[rank2]: ^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1314, in load
[rank2]: return _jit_compile(
[rank2]: ^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1746, in _jit_compile
[rank2]: return _import_module_from_library(name, build_directory, is_python_module)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2140, in _import_module_from_library
[rank2]: module = importlib.util.module_from_spec(spec)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "<frozen importlib._bootstrap>", line 573, in module_from_spec
[rank2]: File "<frozen importlib._bootstrap_external>", line 1233, in create_module
[rank2]: File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
[rank2]: ImportError: /root/.cache/torch_extensions/py311_cu124/cpu_adam/cpu_adam.so: cannot open shared object file: No such file or directory
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f82abc1d8a0>
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
self.ds_opt_adam.destroy_adam(self.opt_id)
^^^^^^^^^^^^^^^^
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py311_cu124/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/opt/conda/lib -L/opt/conda/lib/python3.11/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
FAILED: cpu_adam.so
c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/opt/conda/lib -L/opt/conda/lib/python3.11/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
/usr/bin/ld: cannot find -lcurand: No such file or directory
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
[rank0]: Traceback (most recent call last):
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2104, in _run_ninja_build
[rank0]: subprocess.run(
[rank0]: File "/opt/conda/lib/python3.11/subprocess.py", line 571, in run
[rank0]: raise CalledProcessError(retcode, process.args,
[rank0]: subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
[rank0]: The above exception was the direct cause of the following exception: