Issue with Multi-GPU #17

Closed
bwang482 opened this issue Mar 11, 2021 · 1 comment

Comments


bwang482 commented Mar 11, 2021

  • transformers version: 4.3.3
  • Platform: Linux-4.15.0-132-generic-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.7.1 (True)
  • Tensorflow version (GPU?): 2.3.0 (True)
  • Using GPU in script?: Yes, multi GeForce RTX 2080 Ti GPUs
  • NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2

I use os.environ["CUDA_VISIBLE_DEVICES"] = "6,7" to select the GPUs; everything else in the code is straightforward use of BertClassifier() as the model. Running the same code on CPU works without this issue.

    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "6,7"  # must be set before torch initializes CUDA

    from bert_sklearn import BertClassifier

    model = BertClassifier()
    model.bert_model = 'bert-base-uncased'
    model.max_seq_length = 512
    model.train_batch_size = 8
    model.eval_batch_size = 8

I previously had an issue with Transformers itself, which I resolved by removing the bits of code that set up DataParallel (huggingface/transformers#10634). I am still not sure why this happens; the removed setup is presumably some variant of the standard pattern sketched below.
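For reference, a minimal sketch of typical DataParallel wiring (my assumption of what was removed, not the exact bert_sklearn code):

    import torch

    # Standard pattern: wrap the underlying nn.Module when more than one GPU is visible.
    if torch.cuda.device_count() > 1:
        model = torch.nn.DataParallel(model)
    model = model.to(torch.device("cuda"))

The full output and traceback: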

0it [00:00, ?it/s]Building sklearn text classifier...
Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
Loading Pytorch checkpoint
train data size: 1320, validation data size: 146
Training  :   0%|                                                                                                                                             | 0/42 [00:09<?, ?it/s]
0it [00:27, ?it/s]                                                                                                                                            | 0/42 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "seg_pred_skl.py", line 46, in <module>
    model.fit(X_train, y_train)
  File "/mnt/sdb/env1/lib/python3.6/site-packages/bert_sklearn/sklearn.py", line 374, in fit
    self.model = finetune(self.model, texts_a, texts_b, labels, config)
  File "/mnt/sdb/env1/lib/python3.6/site-packages/bert_sklearn/finetune.py", line 121, in finetune
    loss, _ = model(*batch)
  File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/sdb/env1/lib/python3.6/site-packages/bert_sklearn/model/model.py", line 95, in forward
    output_all_encoded_layers=False)
  File "/mnt/sdb/env1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/sdb/env1/lib/python3.6/site-packages/bert_sklearn/model/pytorch_pretrained/modeling.py", line 959, in forward
    extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
StopIteration
bwang482 (Author) commented:

Issue resolved by following the discussion in pytorch/pytorch#40457.
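For context: on PyTorch >= 1.5, nn.DataParallel replicates a module in a way that leaves .parameters() empty inside each replica, so any forward() that calls next(self.parameters()) raises StopIteration in the worker threads, which is exactly what pytorch/pytorch#40457 tracks. A minimal repro sketch, with an illustrative class name rather than bert_sklearn code:

    import torch
    import torch.nn as nn

    class Toy(nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = nn.Linear(4, 4)

        def forward(self, x):
            # self.parameters() yields nothing inside a DataParallel replica
            # on PyTorch >= 1.5, so this raises StopIteration in the worker thread.
            dtype = next(self.parameters()).dtype
            return self.linear(x.to(dtype))

    # Needs at least two visible GPUs to trigger replication.
    model = nn.DataParallel(Toy().cuda())
    model(torch.randn(8, 4).cuda())  # StopIteration: Caught StopIteration in replica 0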

Updating lines 959 and 973 in bert_sklearn/model/pytorch_pretrained/modeling.py to:

extended_attention_mask = extended_attention_mask.to(dtype=input_ids.dtype) # fp16 compatibility

and

head_mask = head_mask.to(dtype=input_ids.dtype)
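One caveat with the patch above: input_ids is normally a torch.long tensor, so this drops the fp16 intent of the original lines. An alternative that keeps the parameter dtype without iterating self.parameters() (the call that fails inside DataParallel replicas) is to read the dtype off a concrete weight; a sketch, assuming the vendored BertModel keeps the standard pytorch_pretrained_bert layout with self.embeddings.word_embeddings:

    # Look the dtype up via attribute access rather than next(self.parameters()),
    # which yields nothing inside DataParallel replicas on PyTorch >= 1.5.
    param_dtype = self.embeddings.word_embeddings.weight.dtype

    extended_attention_mask = extended_attention_mask.to(dtype=param_dtype)  # fp16 compatibility
    head_mask = head_mask.to(dtype=param_dtype)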
