When I run the run_ecthr.sh script in the experiments folder, the following error occurs:
Traceback (most recent call last):
File "main_ecthr.py", line 505, in <module>
main()
File "main_ecthr.py", line 454, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/transformers/trainer.py", line 1498, in train
return inner_training_loop(
File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/transformers/trainer.py", line 1832, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/transformers/trainer.py", line 2038, in _maybe_log_save_evaluate
metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/transformers/trainer.py", line 2758, in evaluate
output = eval_loop(
File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/transformers/trainer.py", line 2936, in evaluation_loop
loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/transformers/trainer.py", line 3177, in prediction_step
loss, outputs = self.compute_loss(model, inputs, return_outputs=True)
File "/workspace/MaxPlain/lexglue/experiments/trainer.py", line 8, in compute_loss
outputs = model(**inputs)
File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 1556, in forward
outputs = self.bert(
File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/MaxPlain/lexglue/models/hierbert.py", line 100, in forward
seg_encoder_outputs = self.seg_encoder(encoder_outputs)
File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/torch/nn/modules/transformer.py", line 238, in forward
output = mod(output, src_mask=mask, src_key_padding_mask=src_key_padding_mask)
File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/torch/nn/modules/transformer.py", line 437, in forward
return torch._transformer_encoder_layer_fwd(
RuntimeError: expected scalar type Half but found Float
I tried to debug it, and it may be that the trainer fails to cast the model to dtype=torch.float16. I also tried running evaluation only; it fails with the same error.
# Evaluation
if training_args.do_eval:
logger.info("*** Evaluate ***")
metrics = trainer.evaluate(eval_dataset=eval_dataset)
max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset)
metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)
After I remove --fp16 --fp16_full_eval from run_ecthr.sh, it works as expected.
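For anyone hitting the same crash: a minimal sketch of a possible workaround (not the repo's actual fix, and the dimensions below are made up for illustration). With --fp16_full_eval the whole model is cast to fp16, so if the activations fed into the seg_encoder are still fp32, torch's fused transformer path raises "expected scalar type Half but found Float". Casting the input to the module's parameter dtype avoids the mismatch:

```python
import torch
import torch.nn as nn

# Stand-in for the hierarchical model's segment encoder; d_model/nhead
# are illustrative, not the values used in hierbert.py.
layer = nn.TransformerEncoderLayer(d_model=8, nhead=2, batch_first=True)
seg_encoder = nn.TransformerEncoder(layer, num_layers=1)

encoder_outputs = torch.randn(2, 4, 8)  # fp32 segment embeddings

# Match the input dtype to whatever the encoder's weights currently are
# (fp16 under --fp16_full_eval, fp32 otherwise).
target_dtype = next(seg_encoder.parameters()).dtype
out = seg_encoder(encoder_outputs.to(target_dtype))
print(out.shape)
```

The same `.to(target_dtype)` pattern could be applied in hierbert.py just before the seg_encoder call, so the code works whether or not the fp16 flags are set.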
RichardHGL changed the title from "But about fp16 in experiment run_ecthr.sh" to "Bug about fp16 in experiment run_ecthr.sh" on Aug 10, 2022.
Hi @RichardHGL, these HF arguments (--fp16, --fp16_full_eval) only work when NVIDIA GPUs are available (and correctly configured) on the machine (server or cluster), and torch is correctly configured to use these compute resources.
Did you use the script in such an environment? If yes, what kind of GPUs were available?
Thanks for your reply! I used an RTX A6000, which is an NVIDIA GPU, and I believe torch is correctly configured. But it's okay; I'll just use fp32 to run the code.