Colab demo fails #67
Hi,
Somewhere in Pythia or its constituent packages is some code that uses weak reference sets in an unreliable way. (Note: a weak-reference collection can have elements removed at random, asynchronously, by the garbage collector.) It is possible that the library/application coder assumed that the implementor of the weak reference set would handle this asynchronous problem, but clearly the implementor did not. Y'all do have some reliability problems, so I'm leaving this open, if only as a signpost to the unwary.
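For readers unfamiliar with the hazard: in CPython, an entry vanishes from a `weakref.WeakSet` the moment its last strong reference dies, which can happen between any two operations on the set. A minimal sketch of the behavior (the names here are illustrative, not taken from Pythia's code):

```python
import weakref

class Resource:
    """Placeholder object; WeakSet members must be weak-referenceable."""
    pass

live = weakref.WeakSet()
r = Resource()
live.add(r)
print(len(live))  # 1 while a strong reference exists

del r  # last strong reference gone; the set entry is reclaimed
# In CPython this happens immediately via reference counting; in other
# implementations the GC may remove it at an arbitrary later point.
print(len(live))  # 0
```

Code that iterates such a set while other references are being dropped, or that assumes `len()` is stable between calls, has exactly the kind of race described above.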
Thanks. I believe this issue happens only in the Colab demo. I will spend some time this week to figure out what the issue is.
The issue was occurring due to compatibility problems between the CUDNN, CUDA, and PyTorch versions. Colab recommends that users use its default PyTorch version instead of installing their own, which fixes this issue. I have modified the Colab file accordingly. Please comment back or reopen if this issue persists.
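The failure mode is worth spelling out: a PyTorch wheel is compiled against a specific CUDA/cuDNN version, and loading it on a runtime that ships a different one produces errors like the `CUDNN_STATUS_EXECUTION_FAILED` in the trace below. A rough sketch of the kind of check involved (the helper and version strings are hypothetical, not part of Pythia or Colab):

```python
def parse_version(v):
    """'10.0.130' -> (10, 0, 130); ignores any non-numeric components."""
    return tuple(int(p) for p in v.split(".") if p.isdigit())

def cuda_compatible(runtime_cuda, wheel_cuda):
    """Heuristic only: a wheel built against one CUDA major version
    generally cannot be expected to run on a different major version."""
    return parse_version(runtime_cuda)[0] == parse_version(wheel_cuda)[0]

print(cuda_compatible("10.0.130", "10.1.243"))  # True  (same major version)
print(cuda_compatible("9.2.148", "10.0.130"))   # False (major mismatch)
```

Using Colab's preinstalled `torch` rather than pip-installing a pinned version sidesteps the mismatch, which is exactly the fix described above.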
Cool!
Summary: Refactor of our BERT-based models. This removes the use of sample list from our `*Pretraining` and `*Classification` heads, so these modules can be called directly with tensor or string inputs. This will help make these modules scriptable; enabling scripting for these models will come in the next set of PRs. This PR also consolidates all classification heads for the different datasets: `training_head_type` is set to `classification` for all of them, along with a `num_labels` configuration. For nlvr2 we are keeping `training_head_type` as `nlvr2`, since we need to specially modify the `hidden_size` for that dataset. Tested with current visual bert/vilbert models as well as loading old models. Pull Request resolved: fairinternal/mmf-internal#67 Reviewed By: apsdehal Differential Revision: D21272728 Pulled By: vedanuj fbshipit-source-id: 715af8be62caa1e4e10f84a63eb45499f30a6362
Go and change the runtime to GPU, then restart.
The Colab demo fails. I believe the problem is that the CUDA version now installed on Colab is not what this version of PyTorch expects. I noticed that, in general, the Colab Python package versions are ahead of what this demo uses.
If you want a permanently working Colab demo, I suspect you need to nuke and pave the standard Colab runtime. That is, remove everything installed by pip, the CUDA runtime, and whatever else you can think of, then install known-good versions from the repos. For example, to remove the preinstalled Python packages:
!pip freeze > /tmp/all_packages.txt
!pip uninstall -y -r /tmp/all_packages.txt
(The -y flag skips the per-package confirmation prompt, which would otherwise block the notebook.)
Also, the demo demands a GPU and will not work in CPU-only mode. I have not tried the TPU runtime, but I suspect it will not work either.
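One way to fail fast on the wrong runtime is to check for the NVIDIA driver before building any models. This sketch is my own suggestion, not something the demo does; it shells out to `nvidia-smi` rather than importing `torch`, so it works even when the PyTorch install itself is broken:

```python
import shutil
import subprocess

def gpu_runtime():
    """True if the NVIDIA driver tool is present and reports a GPU."""
    if shutil.which("nvidia-smi") is None:
        return False  # CPU or TPU runtime: no NVIDIA driver at all
    # Return code 0 means the driver could talk to at least one GPU.
    return subprocess.run(["nvidia-smi"], capture_output=True).returncode == 0

if not gpu_runtime():
    print("No GPU visible: switch the Colab runtime type to GPU and restart.")
```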
Stack trace:
/content/pythia/pythia/.vector_cache/glove.6B.zip: 862MB [01:03, 13.5MB/s]
100%|█████████▉| 399163/400000 [00:50<00:00, 7829.19it/s]
RuntimeError Traceback (most recent call last)
in <module>()
----> 1 demo = PythiaDemo()
8 frames
in __init__(self)
40 def __init__(self):
41 self._init_processors()
---> 42 self.pythia_model = self._build_pythia_model()
43 self.detection_model = self._build_detection_model()
44 self.resnet_model = self._build_resnet_model()
in _build_pythia_model(self)
82
83 model.load_state_dict(state_dict)
---> 84 model.to("cuda")
85 model.eval()
86
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in to(self, *args, **kwargs)
379 return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
380
--> 381 return self._apply(convert)
382
383 def register_backward_hook(self, hook):
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
185 def _apply(self, fn):
186 for module in self.children():
--> 187 module._apply(fn)
188
189 for param in self._parameters.values():
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
185 def _apply(self, fn):
186 for module in self.children():
--> 187 module._apply(fn)
188
189 for param in self._parameters.values():
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
185 def _apply(self, fn):
186 for module in self.children():
--> 187 module._apply(fn)
188
189 for param in self._parameters.values():
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
185 def _apply(self, fn):
186 for module in self.children():
--> 187 module._apply(fn)
188
189 for param in self._parameters.values():
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py in _apply(self, fn)
115 def _apply(self, fn):
116 ret = super(RNNBase, self)._apply(fn)
--> 117 self.flatten_parameters()
118 return ret
119
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py in flatten_parameters(self)
111 all_weights, (4 if self.bias else 2),
112 self.input_size, rnn.get_cudnn_mode(self.mode), self.hidden_size, self.num_layers,
--> 113 self.batch_first, bool(self.bidirectional))
114
115 def _apply(self, fn):
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED