fp16 compatibility #17

Closed
bhomass opened this issue Mar 17, 2021 · 4 comments

bhomass commented Mar 17, 2021

I am running sh_albert_cls.sh. It crashed with

Iteration: 0%| | 0/10860 [00:03<?, ?it/s]
Epoch: 0%| | 0/2 [00:03<?, ?it/s]
Traceback (most recent call last):
File "./examples/run_cls.py", line 645, in
main()
File "./examples/run_cls.py", line 533, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "./examples/run_cls.py", line 159, in train
outputs = model(**inputs)
File "/data/anaconda3/envs/mrc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/anaconda3/envs/mrc/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/data/anaconda3/envs/mrc/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/data/anaconda3/envs/mrc/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/data/anaconda3/envs/mrc/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
File "/data/anaconda3/envs/mrc/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/data/anaconda3/envs/mrc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/bruce/AwesomeMRC/transformer-mrc/transformers/modeling_albert.py", line 688, in forward
inputs_embeds=inputs_embeds
File "/data/anaconda3/envs/mrc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/bruce/AwesomeMRC/transformer-mrc/transformers/modeling_albert.py", line 524, in forward
extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
StopIteration

I commented out the argument

--fp16

but I still get the same error. The messages really aren't telling me much about what's wrong. Any ideas?
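
This failure appears to be unrelated to the --fp16 flag; the crashing line is only commented "fp16 compatibility". On PyTorch 1.5 and later, nn.DataParallel replicas no longer register their parameters, so the next(self.parameters()) call in modeling_albert.py sees an empty iterator inside each replica's forward and raises StopIteration. A minimal sketch of the failing pattern, assuming PyTorch >= 1.5 and at least two visible GPUs:

import torch
import torch.nn as nn

class Probe(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x):
        # Same pattern as modeling_albert.py line 524: take the dtype of the
        # first parameter. Inside a DataParallel replica on PyTorch >= 1.5,
        # self.parameters() yields nothing, so next() raises StopIteration.
        dtype = next(self.parameters()).dtype
        return self.linear(x.to(dtype))

model = nn.DataParallel(Probe().cuda())
out = model(torch.randn(8, 4).cuda())  # StopIteration: Caught StopIteration in replica 0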

bhomass commented Mar 24, 2021

I read that this is fixed by downgrading PyTorch to 1.4.0:
https://github.com/huggingface/transformers/issues/4189

But when I do that, I get a segmentation fault. Nothing is easy.

bhomass commented Mar 24, 2021

I forced n_gpu to be 1, and now it crashes with

RuntimeError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 7.93 GiB total capacity; 7.18 GiB already allocated; 35.94 MiB free; 7.31 GiB reserved in total by PyTorch)
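
An 8 GiB card likely just cannot fit this model at the default batch size. The usual mitigation is to lower the per-step batch and compensate with gradient accumulation; the transformers example scripts generally expose this via --per_gpu_train_batch_size and --gradient_accumulation_steps, though whether run_cls.py uses exactly those flag names is an assumption here. A toy sketch of the accumulation idea in plain PyTorch (the model, data, and optimizer are stand-ins, not taken from the script):

import torch
import torch.nn as nn

# Toy stand-ins for the real model and data loader.
model = nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
loader = [(torch.randn(2, 16), torch.randint(0, 2, (2,))) for _ in range(8)]

accumulation_steps = 4  # effective batch = per-step batch (2) * 4 = 8

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(loader):
    # Scale the loss so the accumulated gradients average over the effective batch.
    loss = criterion(model(inputs), labels) / accumulation_steps
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()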

cooelf commented Mar 25, 2021

Yes, it seems to be a version issue that has emerged recently. I revised the script according to huggingface/transformers#10199 (comment), and it works for me.
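
For reference, a sketch of the kind of change that avoids the StopIteration (not necessarily the exact edit from the linked comment): in transformer-mrc/transformers/modeling_albert.py, around line 524, take the dtype from a concrete weight tensor instead of iterating self.parameters(), which is empty inside DataParallel replicas. This assumes the bundled AlbertModel exposes self.embeddings.word_embeddings as usual:

# before:
# extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility

# after:
extended_attention_mask = extended_attention_mask.to(
    dtype=self.embeddings.word_embeddings.weight.dtype
)  # fp16 compatibility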

bhomass commented Mar 26, 2021

Yes, I got past all these problems, but it still ends up with OOM, even on Colab. I raised a new issue.
