During fine-tuning: RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn #242

Open
Dagoli opened this issue Oct 19, 2023 · 3 comments


Dagoli commented Oct 19, 2023

[INFO|trainer.py:1712] 2023-10-19 09:44:55,247 >> ***** Running training *****
[INFO|trainer.py:1713] 2023-10-19 09:44:55,247 >> Num examples = 9,861
[INFO|trainer.py:1714] 2023-10-19 09:44:55,247 >> Num Epochs = 10
[INFO|trainer.py:1715] 2023-10-19 09:44:55,247 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1718] 2023-10-19 09:44:55,247 >> Total train batch size (w. parallel, distributed & accumulation) = 64
[INFO|trainer.py:1719] 2023-10-19 09:44:55,247 >> Gradient Accumulation steps = 8
[INFO|trainer.py:1720] 2023-10-19 09:44:55,247 >> Total optimization steps = 1,540
[INFO|trainer.py:1721] 2023-10-19 09:44:55,252 >> Number of trainable parameters = 19,988,480
0%|          | 0/1540 [00:00<?, ?it/s]
[WARNING|logging.py:305] 2023-10-19 09:44:55,318 >> use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
/usr/local/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
[... the same use_cache / requires_grad warning pair repeats 7 more times, once per rank ...]
Traceback (most recent call last):
  File "/home/xxx/Llama2-Chinese/train/sft/finetune_clm_lora.py", line 690, in <module>
    main()
  File "/home/xxx/Llama2-Chinese/train/sft/finetune_clm_lora.py", line 651, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1553, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1835, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2690, in training_step
    self.accelerator.backward(loss)
  File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1979, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1895, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1902, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
[... the identical traceback is raised on each of the other 7 ranks; repeats omitted ...]
[2023-10-19 09:44:57,212] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2092520
[2023-10-19 09:44:57,257] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2092521
[2023-10-19 09:44:57,413] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2092522
[2023-10-19 09:44:57,590] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2092523
[2023-10-19 09:44:57,632] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2092524
[2023-10-19 09:44:57,650] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2092525
[2023-10-19 09:44:57,650] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2092526
[2023-10-19 09:44:57,668] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2092527


abulice commented Oct 19, 2023 via email

Dagoli (author) commented Oct 20, 2023

@abulice Solved it: open /usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py in vim and add loss.requires_grad_() at line 1902.
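
For anyone hitting this: the root cause is that LoRA freezes every base-model parameter, so with gradient checkpointing enabled the inputs to each checkpointed segment have requires_grad=False (the UserWarning in the log above), the recomputed graph is detached, and the loss ends up with no grad_fn. A less invasive fix than editing site-packages is to make the embedding outputs require gradients before training. A minimal sketch, assuming a standard transformers + peft setup (the model name and LoRA hyperparameters below are illustrative, not taken from this repo's script):

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    model.gradient_checkpointing_enable()
    # Registers a forward hook on the input embeddings so the activations
    # entering the frozen base model carry requires_grad=True, which keeps
    # the autograd graph alive through the checkpointed segments and gives
    # the loss a grad_fn.
    model.enable_input_require_grads()

    lora_config = LoraConfig(
        r=8,                                  # illustrative values
        lora_alpha=16,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)

Note that the loss.requires_grad_() patch silences the error but almost certainly trains nothing: a loss without a grad_fn is disconnected from the LoRA parameters, so backward() produces no gradients for them. The hook above keeps the graph intact, and it also survives a deepspeed upgrade, which an edit under site-packages does not.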

