fp16 compatibility #17
I read that this is fixed by downgrading PyTorch to 1.14.0, but when I do that I get a segmentation fault. Nothing is easy.
I forced n_gpu to be 1, and now it crashes with a different error.
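If `nn.DataParallel` is the culprit, one way to actually pin the run to a single GPU (a generic sketch relying on standard CUDA/PyTorch behavior, not a command quoted from this thread) is to hide the other devices before `torch` is imported, so `torch.cuda.device_count()` reports 1 and the example scripts never wrap the model in `DataParallel`:

```python
import os

# Must run before `import torch`: the CUDA runtime reads the visible-device
# list once at initialization.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only GPU 0

# After this, torch.cuda.device_count() returns 1, so scripts that only wrap
# the model in nn.DataParallel when n_gpu > 1 skip that branch entirely.
print(os.environ["CUDA_VISIBLE_DEVICES"])
```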
Yes, it seems to be a version issue that emerged recently. I revised the script according to huggingface/transformers#10199 (comment), and it works for me.
Yes, I got past all these problems, but still end up with OOM, even on Colab. I raised a new issue.
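On the OOM side, a common mitigation (a hypothetical sketch; the `--per_gpu_train_batch_size` and `--gradient_accumulation_steps` flags are assumed from the standard transformers example scripts, not quoted from this thread) is to shrink the per-step batch and recover the effective batch size through gradient accumulation:

```python
# Keep the effective batch size constant while cutting per-step activation
# memory roughly in proportion to the smaller micro-batch.
target_effective_batch = 32   # hypothetical original batch size
per_gpu_batch = 4             # small enough to fit in GPU memory
accumulation_steps = target_effective_batch // per_gpu_batch

print(per_gpu_batch * accumulation_steps)  # effective batch size: 32
print(accumulation_steps)                  # value for --gradient_accumulation_steps: 8
```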
I am running sh_albert_cls.sh. It crashed with
Iteration:   0%|          | 0/10860 [00:03<?, ?it/s]
Epoch:   0%|          | 0/2 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "./examples/run_cls.py", line 645, in <module>
    main()
  File "./examples/run_cls.py", line 533, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "./examples/run_cls.py", line 159, in train
    outputs = model(**inputs)
  File "/data/anaconda3/envs/mrc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/anaconda3/envs/mrc/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/data/anaconda3/envs/mrc/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/data/anaconda3/envs/mrc/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/data/anaconda3/envs/mrc/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/data/anaconda3/envs/mrc/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/data/anaconda3/envs/mrc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/bruce/AwesomeMRC/transformer-mrc/transformers/modeling_albert.py", line 688, in forward
    inputs_embeds=inputs_embeds
  File "/data/anaconda3/envs/mrc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/bruce/AwesomeMRC/transformer-mrc/transformers/modeling_albert.py", line 524, in forward
    extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
StopIteration
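For context (an interpretation based on the referenced transformers issue, not stated explicitly in this thread): from PyTorch 1.5 onward, `nn.DataParallel` replicates a module without re-registering its parameters, so inside a replica `self.parameters()` yields nothing and `next(...)` raises exactly this `StopIteration`. A minimal sketch of the failure mode and a defensive rewrite, using stand-in objects rather than real `torch` modules:

```python
def get_mask_dtype(parameters, default="torch.float16"):
    """Return the first parameter's dtype, falling back to a default.

    `parameters` stands in for self.parameters(); inside a DataParallel
    replica it can be empty, which is what triggers the crash above.
    """
    first = next(iter(parameters), None)  # next(it, default) never raises
    return first.dtype if first is not None else default

class FakeParam:
    # Stand-in for a torch.nn.Parameter with a dtype attribute.
    dtype = "torch.float32"

# Normal module: the first parameter's dtype is used.
print(get_mask_dtype([FakeParam()]))  # torch.float32
# DataParallel replica: empty parameter list, so the fallback is used
# instead of raising StopIteration.
print(get_mask_dtype([]))             # torch.float16
```

The bare `next(self.parameters())` in `modeling_albert.py` line 524 is the fragile pattern; giving `next` a default (or reading a cached dtype attribute, as the linked transformers fix does) avoids the uncatchable `StopIteration` inside the worker thread.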
I commented out the --fp16 argument, but I still get the same error.
The messages really aren't telling me much about what's wrong. Any ideas?