fp16 compatibility #17

Closed
bhomass opened this issue Mar 17, 2021 · 4 comments

bhomass commented Mar 17, 2021

I am running sh_albert_cls.sh. It crashed with

Iteration: 0%| | 0/10860 [00:03<?, ?it/s]
Epoch: 0%| | 0/2 [00:03<?, ?it/s]
Traceback (most recent call last):
File "./examples/run_cls.py", line 645, in
main()
File "./examples/run_cls.py", line 533, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "./examples/run_cls.py", line 159, in train
outputs = model(**inputs)
File "/data/anaconda3/envs/mrc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/anaconda3/envs/mrc/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/data/anaconda3/envs/mrc/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/data/anaconda3/envs/mrc/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/data/anaconda3/envs/mrc/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
File "/data/anaconda3/envs/mrc/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/data/anaconda3/envs/mrc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/bruce/AwesomeMRC/transformer-mrc/transformers/modeling_albert.py", line 688, in forward
inputs_embeds=inputs_embeds
File "/data/anaconda3/envs/mrc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/bruce/AwesomeMRC/transformer-mrc/transformers/modeling_albert.py", line 524, in forward
extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
StopIteration

I commented out the argument

--fp16

but I still get the same error. The messages really aren't telling me much about what's wrong. Any ideas?
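
This failure appears to be unrelated to the --fp16 flag; the crashing line is only commented "fp16 compatibility". On PyTorch 1.5 and later, nn.DataParallel replicas no longer register their parameters, so the next(self.parameters()) call in modeling_albert.py sees an empty iterator inside each replica's forward and raises StopIteration. A minimal sketch of the failing pattern, assuming PyTorch >= 1.5 and at least two visible GPUs:

import torch
import torch.nn as nn

class Probe(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x):
        # Same pattern as modeling_albert.py line 524: take the dtype of the
        # first parameter. Inside a DataParallel replica on PyTorch >= 1.5,
        # self.parameters() yields nothing, so next() raises StopIteration.
        dtype = next(self.parameters()).dtype
        return self.linear(x.to(dtype))

model = nn.DataParallel(Probe().cuda())
out = model(torch.randn(8, 4).cuda())  # StopIteration: Caught StopIteration in replica 0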

bhomass commented Mar 24, 2021

I read that this is fixed by downgrading PyTorch to 1.4.0:
https://github.com/huggingface/transformers/issues/4189

But when I do that, I get a segmentation fault. Nothing is easy.

bhomass commented Mar 24, 2021

I forced n_gpu to be 1, and now it crashes with

RuntimeError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 7.93 GiB total capacity; 7.18 GiB already allocated; 35.94 MiB free; 7.31 GiB reserved in total by PyTorch)
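
An 8 GiB card likely just cannot fit this model at the default batch size. The usual mitigation is to lower the per-step batch and compensate with gradient accumulation; the transformers example scripts generally expose this via --per_gpu_train_batch_size and --gradient_accumulation_steps, though whether run_cls.py uses exactly those flag names is an assumption here. A toy sketch of the accumulation idea in plain PyTorch (the model, data, and optimizer are stand-ins, not taken from the script):

import torch
import torch.nn as nn

# Toy stand-ins for the real model and data loader.
model = nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
loader = [(torch.randn(2, 16), torch.randint(0, 2, (2,))) for _ in range(8)]

accumulation_steps = 4  # effective batch = per-step batch (2) * 4 = 8

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(loader):
    # Scale the loss so the accumulated gradients average over the effective batch.
    loss = criterion(model(inputs), labels) / accumulation_steps
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()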

cooelf commented Mar 25, 2021

Yes, it seems to be a version issue that has emerged recently. I revised the script according to huggingface/transformers#10199 (comment), and it works for me.
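
For reference, a sketch of the kind of change that avoids the StopIteration (not necessarily the exact edit from the linked comment): in transformer-mrc/transformers/modeling_albert.py, around line 524, take the dtype from a concrete weight tensor instead of iterating self.parameters(), which is empty inside DataParallel replicas. This assumes the bundled AlbertModel exposes self.embeddings.word_embeddings as usual:

# before:
# extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility

# after:
extended_attention_mask = extended_attention_mask.to(
    dtype=self.embeddings.word_embeddings.weight.dtype
)  # fp16 compatibility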

bhomass commented Mar 26, 2021

Yes, I got past all these problems, but it still ends up with OOM, even on Colab. I raised a new issue.
