Colab demo fails #67

Closed
LanceNorskog opened this issue May 26, 2019 · 6 comments

Comments
@LanceNorskog

The Colab demo fails. I believe the problem is that the CUDA version now installed on Colab is not the one this version of PyTorch was built against. I noticed that, in general, the Colab Python package versions are ahead of what this demo uses.

If you want a permanently working Colab demo, I suspect you need to nuke & pave the standard Colab runtime. That is, remove everything installed by pip, the CUDA runtime, whatever else you can think of, and install from the repos. For example, clearing out the Python packages needs:
!pip freeze > /tmp/all_packages.txt
!pip uninstall -y -r /tmp/all_packages.txt   # -y skips the per-package confirmation prompt

Also, the demo requires a GPU and will not work in CPU-only mode. I have not tried the TPU runtime, but I suspect it will also not work.
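For what it's worth, a minimal sketch of the device fallback the demo currently lacks (this is not Pythia's actual code; `nn.Linear` stands in for the real model):

import torch
import torch.nn as nn

# Use CUDA only when a GPU runtime is actually available; otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 2)   # stand-in for the Pythia model
model.to(device)          # instead of a hard-coded model.to("cuda")
model.eval()
print("running on", device)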

Stack trace:

/content/pythia/pythia/.vector_cache/glove.6B.zip: 862MB [01:03, 13.5MB/s]
100%|█████████▉| 399163/400000 [00:50<00:00, 7829.19it/s]

RuntimeError Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 demo = PythiaDemo()

8 frames
in __init__(self)
40 def __init__(self):
41 self._init_processors()
---> 42 self.pythia_model = self._build_pythia_model()
43 self.detection_model = self._build_detection_model()
44 self.resnet_model = self._build_resnet_model()

in _build_pythia_model(self)
82
83 model.load_state_dict(state_dict)
---> 84 model.to("cuda")
85 model.eval()
86

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in to(self, *args, **kwargs)
379 return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
380
--> 381 return self._apply(convert)
382
383 def register_backward_hook(self, hook):

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
185 def _apply(self, fn):
186 for module in self.children():
--> 187 module._apply(fn)
188
189 for param in self._parameters.values():

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
185 def _apply(self, fn):
186 for module in self.children():
--> 187 module._apply(fn)
188
189 for param in self._parameters.values():

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
185 def _apply(self, fn):
186 for module in self.children():
--> 187 module._apply(fn)
188
189 for param in self._parameters.values():

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
185 def _apply(self, fn):
186 for module in self.children():
--> 187 module._apply(fn)
188
189 for param in self._parameters.values():

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py in _apply(self, fn)
115 def _apply(self, fn):
116 ret = super(RNNBase, self)._apply(fn)
--> 117 self.flatten_parameters()
118 return ret
119

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py in flatten_parameters(self)
111 all_weights, (4 if self.bias else 2),
112 self.input_size, rnn.get_cudnn_mode(self.mode), self.hidden_size, self.num_layers,
--> 113 self.batch_first, bool(self.bidirectional))
114
115 def _apply(self, fn):

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

@apsdehal
Contributor

Hi,
Can you rerun the cell that threw this error? This error does happen sometimes; when I rerun that cell, it doesn't happen again.

@LanceNorskog
Author

> Can you rerun the cell that threw this error? This error does happen sometimes; when I rerun that cell, it doesn't happen again.

Thanks! It made it past this error on the second run, but the second run failed with a stack trace I did not capture: the core error was "weak ref set changed during iteration". It ran the third time.

Somewhere in Pythia or one of its constituent packages, some code uses weak-reference sets in an unreliable way. (Note: a weak-reference collection can have elements removed at any time, asynchronously, by the garbage collector.) The library or application author may have assumed that the weak-reference-set implementation would handle this asynchrony, but clearly it does not.
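To illustrate the failure mode (this is not Pythia code, just a minimal sketch using an ordinary set): mutating a set while iterating it raises exactly this kind of error; with a weak-reference set, the mutation comes asynchronously from the garbage collector, which is why it only shows up sometimes.

s = {1, 2, 3}
try:
    for item in s:
        s.discard(item)   # stands in for the GC dropping a weakly-referenced member
except RuntimeError as exc:
    print(exc)            # "Set changed size during iteration"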

Y'all do have some reliability problems, so I'm leaving this open, if only as a signpost to the unwary.

@apsdehal
Contributor

Thanks. I believe this issue happens only in the Colab demo. I will spend some time this week to figure out what the issue is.

@apsdehal
Contributor

apsdehal commented May 30, 2019

The issue was occurring due to compatibility problems between the cuDNN, CUDA, and PyTorch versions. Colab recommends using its default PyTorch version instead of installing your own, which fixes this issue. I have modified the Colab notebook accordingly. Please comment back or reopen if this issue persists.
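For anyone hitting this later, a quick sanity-check cell (not part of the original notebook, just a suggestion) to confirm the preinstalled PyTorch build matches the runtime's CUDA/cuDNN stack before installing anything else:

import torch

print("torch:", torch.__version__)                 # PyTorch version preinstalled by Colab
print("CUDA build:", torch.version.cuda)           # CUDA version this build was compiled against
print("cuDNN:", torch.backends.cudnn.version())    # cuDNN bundled with this build
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))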

@LanceNorskog
Author

Cool!

apsdehal pushed a commit that referenced this issue May 8, 2020
Summary:
Refactor of our BERT-based models. This removes the use of sample list from our `*Pretraining` and `*Classification` heads, so these modules can be called directly with tensor or string inputs. This will help make these modules scriptable; enabling scripting for these models will come in the next set of PRs.

This PR also consolidates all classification heads for the different datasets. `training_head_type` is set to `classification` for all of them, along with a `num_labels` configuration. For nlvr2 we keep `training_head_type` as `nlvr2`, since we need to specially modify the `hidden_size` for that dataset.

Tested with current visual bert/vilbert models as well as loading old models.
Pull Request resolved: fairinternal/mmf-internal#67

Reviewed By: apsdehal

Differential Revision: D21272728

Pulled By: vedanuj

fbshipit-source-id: 715af8be62caa1e4e10f84a63eb45499f30a6362
@adeelahmad-co

Go and change the runtime to GPU, then restart.
