When I run the run_ecthr.sh script in the experiments folder, the following error occurs:
Traceback (most recent call last):
File "main_ecthr.py", line 505, in <module>
main()
File "main_ecthr.py", line 454, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/transformers/trainer.py", line 1498, in train
return inner_training_loop(
File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/transformers/trainer.py", line 1832, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/transformers/trainer.py", line 2038, in _maybe_log_save_evaluate
metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/transformers/trainer.py", line 2758, in evaluate
output = eval_loop(
File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/transformers/trainer.py", line 2936, in evaluation_loop
loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/transformers/trainer.py", line 3177, in prediction_step
loss, outputs = self.compute_loss(model, inputs, return_outputs=True)
File "/workspace/MaxPlain/lexglue/experiments/trainer.py", line 8, in compute_loss
outputs = model(**inputs)
File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 1556, in forward
outputs = self.bert(
File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/MaxPlain/lexglue/models/hierbert.py", line 100, in forward
seg_encoder_outputs = self.seg_encoder(encoder_outputs)
File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/torch/nn/modules/transformer.py", line 238, in forward
output = mod(output, src_mask=mask, src_key_padding_mask=src_key_padding_mask)
File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/torch/nn/modules/transformer.py", line 437, in forward
return torch._transformer_encoder_layer_fwd(
RuntimeError: expected scalar type Half but found Float
I tried to debug it, and it may be that the trainer fails to cast the model to dtype=torch.float16. I also tried running evaluation only; it fails with the same error.
# Evaluation
if training_args.do_eval:
logger.info("*** Evaluate ***")
metrics = trainer.evaluate(eval_dataset=eval_dataset)
max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset)
metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)
After I remove --fp16 --fp16_full_eval from run_ecthr.sh, it works as expected.
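For anyone hitting the same crash: a minimal sketch of a possible workaround (not the repo's actual fix, and the dimensions below are made up for illustration). With --fp16_full_eval the whole model is cast to fp16, so if the activations fed into the seg_encoder are still fp32, torch's fused transformer path raises "expected scalar type Half but found Float". Casting the input to the module's parameter dtype avoids the mismatch:

```python
import torch
import torch.nn as nn

# Stand-in for the hierarchical model's segment encoder; d_model/nhead
# are illustrative, not the values used in hierbert.py.
layer = nn.TransformerEncoderLayer(d_model=8, nhead=2, batch_first=True)
seg_encoder = nn.TransformerEncoder(layer, num_layers=1)

encoder_outputs = torch.randn(2, 4, 8)  # fp32 segment embeddings

# Match the input dtype to whatever the encoder's weights currently are
# (fp16 under --fp16_full_eval, fp32 otherwise).
target_dtype = next(seg_encoder.parameters()).dtype
out = seg_encoder(encoder_outputs.to(target_dtype))
print(out.shape)
```

The same `.to(target_dtype)` pattern could be applied in hierbert.py just before the seg_encoder call, so the code works whether or not the fp16 flags are set.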
RichardHGL changed the title from "But about fp16 in experiment run_ecthr.sh" to "Bug about fp16 in experiment run_ecthr.sh" on Aug 10, 2022.
Hi @RichardHGL, these HF arguments (--fp16, --fp16_full_eval) only work when NVIDIA GPUs are available (and correctly configured) on the machine (server or cluster), and torch is correctly configured to use these compute resources.
Did you use the script in such an environment? If yes, what kind of GPUs were available?
Thanks for your reply! I used an RTX A6000, which is an NVIDIA GPU, and I believe torch is correctly configured. But it's okay; I'll just use fp32 to run the code.