Colab demo fails #67

Closed
LanceNorskog opened this issue May 26, 2019 · 6 comments

Comments
@LanceNorskog

The Colab demo fails. I believe the problem is that the CUDA version now installed on Colab is not the one this version of PyTorch was built against. I noticed that, in general, the Colab Python package versions are ahead of what this demo uses.

If you want a permanently working Colab demo, I suspect you need to nuke & pave the standard Colab runtime. That is, remove everything installed by pip, the CUDA runtime, whatever else you can think of, and install from the repos. For example, clearing out the Python packages needs:
!pip freeze > /tmp/all_packages.txt
!pip uninstall -y -r /tmp/all_packages.txt   # -y skips the per-package confirmation prompt

Also, the demo requires a GPU and will not work in CPU-only mode. I have not tried the TPU runtime, but I suspect it will also not work.
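For what it's worth, a minimal sketch of the device fallback the demo currently lacks (this is not Pythia's actual code; `nn.Linear` stands in for the real model):

import torch
import torch.nn as nn

# Use CUDA only when a GPU runtime is actually available; otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 2)   # stand-in for the Pythia model
model.to(device)          # instead of a hard-coded model.to("cuda")
model.eval()
print("running on", device)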

Stack trace:

/content/pythia/pythia/.vector_cache/glove.6B.zip: 862MB [01:03, 13.5MB/s]
100%|█████████▉| 399163/400000 [00:50<00:00, 7829.19it/s]

RuntimeError Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 demo = PythiaDemo()

8 frames
in __init__(self)
40 def __init__(self):
41 self._init_processors()
---> 42 self.pythia_model = self._build_pythia_model()
43 self.detection_model = self._build_detection_model()
44 self.resnet_model = self._build_resnet_model()

in _build_pythia_model(self)
82
83 model.load_state_dict(state_dict)
---> 84 model.to("cuda")
85 model.eval()
86

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in to(self, *args, **kwargs)
379 return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
380
--> 381 return self._apply(convert)
382
383 def register_backward_hook(self, hook):

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
185 def _apply(self, fn):
186 for module in self.children():
--> 187 module._apply(fn)
188
189 for param in self._parameters.values():

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
185 def _apply(self, fn):
186 for module in self.children():
--> 187 module._apply(fn)
188
189 for param in self._parameters.values():

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
185 def _apply(self, fn):
186 for module in self.children():
--> 187 module._apply(fn)
188
189 for param in self._parameters.values():

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
185 def _apply(self, fn):
186 for module in self.children():
--> 187 module._apply(fn)
188
189 for param in self._parameters.values():

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py in _apply(self, fn)
115 def _apply(self, fn):
116 ret = super(RNNBase, self)._apply(fn)
--> 117 self.flatten_parameters()
118 return ret
119

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py in flatten_parameters(self)
111 all_weights, (4 if self.bias else 2),
112 self.input_size, rnn.get_cudnn_mode(self.mode), self.hidden_size, self.num_layers,
--> 113 self.batch_first, bool(self.bidirectional))
114
115 def _apply(self, fn):

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

@apsdehal
Contributor

Hi,
Can you rerun the cell that threw this error? This error does happen sometimes; when I rerun that cell, it doesn't happen again.

@LanceNorskog
Author

> Can you rerun the cell that threw this error? This error does happen sometimes; when I rerun that cell, it doesn't happen again.

Thanks! It made it past this error on the second run, but the second run failed with a stack trace I did not capture: the core error was "weak ref set changed during iteration". It ran the third time.

Somewhere in Pythia or one of its constituent packages, some code uses weak-reference sets in an unreliable way. (Note: a weak-reference collection can have elements removed at any time, asynchronously, by the garbage collector.) The library or application author may have assumed that the weak-reference-set implementation would handle this asynchrony, but clearly it does not.
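To illustrate the failure mode (this is not Pythia code, just a minimal sketch using an ordinary set): mutating a set while iterating it raises exactly this kind of error; with a weak-reference set, the mutation comes asynchronously from the garbage collector, which is why it only shows up sometimes.

s = {1, 2, 3}
try:
    for item in s:
        s.discard(item)   # stands in for the GC dropping a weakly-referenced member
except RuntimeError as exc:
    print(exc)            # "Set changed size during iteration"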

Y'all do have some reliability problems, so I'm leaving this open, if only as a signpost to the unwary.

@apsdehal
Contributor

Thanks. I believe this issue happens only in the Colab demo. I will spend some time this week to figure out what the issue is.

@apsdehal
Contributor

apsdehal commented May 30, 2019

The issue was occurring due to compatibility problems between the cuDNN, CUDA, and PyTorch versions. Colab recommends using its default PyTorch version instead of installing your own, which fixes this issue. I have modified the Colab notebook accordingly. Please comment back or reopen if this issue persists.
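For anyone hitting this later, a quick sanity-check cell (not part of the original notebook, just a suggestion) to confirm the preinstalled PyTorch build matches the runtime's CUDA/cuDNN stack before installing anything else:

import torch

print("torch:", torch.__version__)                 # PyTorch version preinstalled by Colab
print("CUDA build:", torch.version.cuda)           # CUDA version this build was compiled against
print("cuDNN:", torch.backends.cudnn.version())    # cuDNN bundled with this build
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))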

@LanceNorskog
Author

Cool!

apsdehal pushed a commit that referenced this issue May 8, 2020
Summary:
Refactor of our BERT-based models. This removes the use of sample list from our `*Pretraining` and `*Classification` heads, so these modules can be called directly with tensor or string inputs. This will help make these modules scriptable; enabling scripting for these models will come in the next set of PRs.

This PR also consolidates all classification heads for the different datasets. `training_head_type` is set to `classification` for all of them, along with a `num_labels` configuration. For nlvr2 we keep `training_head_type` as `nlvr2`, since we need to specially modify the `hidden_size` for that dataset.

Tested with current visual bert/vilbert models as well as loading old models.
Pull Request resolved: fairinternal/mmf-internal#67

Reviewed By: apsdehal

Differential Revision: D21272728

Pulled By: vedanuj

fbshipit-source-id: 715af8be62caa1e4e10f84a63eb45499f30a6362
@adeelahmad-co

Go and change the runtime to GPU, then restart.
