
Bug about fp16 in experiment run_ecthr.sh #26

Closed
RichardHGL opened this issue Aug 10, 2022 · 2 comments
Comments

@RichardHGL

When I run the run_ecthr.sh script in the experiments folder, the following error occurs:

Traceback (most recent call last):
  File "main_ecthr.py", line 505, in <module>
    main()
  File "main_ecthr.py", line 454, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/transformers/trainer.py", line 1498, in train
    return inner_training_loop(
  File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/transformers/trainer.py", line 1832, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/transformers/trainer.py", line 2038, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/transformers/trainer.py", line 2758, in evaluate
    output = eval_loop(
  File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/transformers/trainer.py", line 2936, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/transformers/trainer.py", line 3177, in prediction_step
    loss, outputs = self.compute_loss(model, inputs, return_outputs=True)
  File "/workspace/MaxPlain/lexglue/experiments/trainer.py", line 8, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 1556, in forward
    outputs = self.bert(
  File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/MaxPlain/lexglue/models/hierbert.py", line 100, in forward
    seg_encoder_outputs = self.seg_encoder(encoder_outputs)
  File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/torch/nn/modules/transformer.py", line 238, in forward
    output = mod(output, src_mask=mask, src_key_padding_mask=src_key_padding_mask)
  File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/OmniXAI/lib/python3.8/site-packages/torch/nn/modules/transformer.py", line 437, in forward
    return torch._transformer_encoder_layer_fwd(
RuntimeError: expected scalar type Half but found Float

I tried to debug this, and it may be because the trainer fails to cast the model to dtype=torch.float16. I also tried running evaluation only; it fails with the same error.

# Evaluation
if training_args.do_eval:
    logger.info("*** Evaluate ***")
    metrics = trainer.evaluate(eval_dataset=eval_dataset)

    max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset)
    metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))

    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)

After removing --fp16 --fp16_full_eval from run_ecthr.sh, it works as expected.
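If the diagnosis above is right (the segment encoder holds half-precision weights while encoder_outputs arrives as float32), one possible workaround is to cast the input to the sub-module's parameter dtype just before the call. A minimal sketch of that pattern, with illustrative names and shapes rather than the exact code from hierbert.py:

```python
import torch
import torch.nn as nn

# Stand-in for the segment-level encoder from hierbert.py (sizes are toy values).
seg_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=8, nhead=2, batch_first=True),
    num_layers=1,
)

# (batch, num_segments, hidden) activations, here still in float32.
encoder_outputs = torch.randn(2, 4, 8)

# Match the input dtype to the sub-module's parameters, so a half-precision
# seg_encoder never sees float32 activations (and vice versa).
param_dtype = next(seg_encoder.parameters()).dtype
seg_encoder_outputs = seg_encoder(encoder_outputs.to(param_dtype))
```

This keeps the forward pass dtype-consistent regardless of whether the surrounding model was converted with --fp16_full_eval.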

@RichardHGL RichardHGL changed the title But about fp16 in experiment run_ecthr.sh Bug about fp16 in experiment run_ecthr.sh Aug 10, 2022
@iliaschalkidis (Collaborator)

Hi @RichardHGL, these HF arguments (--fp16, --fp16_full_eval) only work when NVIDIA GPUs are available and correctly configured on the machine (server or cluster), and when torch is set up to use them.

Did you use the script in such an environment? If yes, what kind of GPUs were available?
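The precondition described here can also be guarded programmatically before enabling the flags; a small sketch (the variable names are hypothetical, not part of the repo's scripts):

```python
import torch

# Only pass the half-precision flags when a CUDA device is actually usable;
# on a CPU-only machine (or a misconfigured torch install) fall back to fp32.
use_fp16 = torch.cuda.is_available()
extra_args = ["--fp16", "--fp16_full_eval"] if use_fp16 else []
print("training with:", extra_args or ["(fp32 defaults)"])
```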

@RichardHGL (Author)

RichardHGL commented Aug 11, 2022

Thanks for your reply! I used an RTX A6000, which is an NVIDIA GPU, and I believe torch is correctly configured. But it's okay; I'll just run the code in fp32.
