
Error in PyTorch-Lightning when Finetuning on VQA #3

Closed
tejas1995 opened this issue May 15, 2021 · 2 comments


tejas1995 commented May 15, 2021

Hello,

I am trying to finetune ViLT on the VQAv2 task - I created the arrow_root directory as instructed, and then ran:
python run.py with data_root=<PROJECT_DIR>/arrow_root/vqav2/ num_gpus=1 num_nodes=1 task_finetune_vqa per_gpu_batchsize=64 load_path="weights/vilt_200k_mlm_itm.ckpt"

However, once the model begins training, I get the following error:

Traceback (most recent calls WITHOUT Sacred internals):
File "run.py", line 71, in main
trainer.fit(model, datamodule=dm)
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
self._run(model)
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
self.dispatch()
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
self.accelerator.start_training(self)
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
self._results = trainer.run_stage()
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
return self.run_train()
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
self.train_loop.run_training_epoch()
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 493, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 711, in run_training_batch
split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 817, in training_step_and_backward
result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 304, in training_step
closure_loss = training_step_output.minimize / self.trainer.accumulate_grad_batches
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

I printed the value of training_step_output right before the error: {'extra': {}, 'minimize': None}. I am not too familiar with PyTorch Lightning, but this does not look like the correct output.

Am I missing any steps here, apart from creating the arrow data and running the model?
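For context, the TypeError comes from Lightning dividing the loss returned by training_step by accumulate_grad_batches; if the step effectively returns no loss, 'minimize' is None and the division fails exactly as in the traceback above. A minimal stand-alone sketch of the failure (plain Python, not the actual Lightning internals):

```python
def training_step_output(batch):
    # Mirrors a training_step that produced no loss: Lightning ends up
    # holding {'extra': {}, 'minimize': None} for this batch.
    return None

accumulate_grad_batches = 1
minimize = training_step_output(batch=None)

try:
    closure_loss = minimize / accumulate_grad_batches
except TypeError as e:
    print(e)  # unsupported operand type(s) for /: 'NoneType' and 'int'
```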

dandelin (Owner) commented May 15, 2021

I've just checked that your command works in my environment: python run.py with data_root=./dataset num_gpus=1 num_nodes=1 task_finetune_vqa per_gpu_batchsize=64 load_path="weights/vilt_200k_mlm_itm.ckpt"

It seems your error is caused by the training_step method returning None, which should not happen: since "vqa" is in self.current_tasks, the dictionary returned by forward should contain {vqa_loss: Tensor}.
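The failure mode described here can be sketched as follows (hypothetical, simplified code with scalar stand-ins for loss tensors, not the actual ViLT LightningModule):

```python
def forward(batch, current_tasks):
    # Each active task contributes a "<task>_loss" entry to the output dict.
    output = {}
    if "vqa" in current_tasks:
        output["vqa_loss"] = 0.5  # stand-in for a loss tensor
    return output

def training_step(batch, current_tasks):
    output = forward(batch, current_tasks)
    losses = [v for k, v in output.items() if "loss" in k]
    # If no task produced a loss, the step returns None and Lightning
    # later fails with: unsupported operand type(s) for /: 'NoneType' and 'int'
    return sum(losses) if losses else None

print(training_step(None, ["vqa"]))  # 0.5
print(training_step(None, []))       # None
```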

Since PyTorch Lightning is a rapidly changing project, I can only guarantee my code for the specific version of PL denoted in requirements.txt: pytorch_lightning==1.1.4.
Please first re-check that the versions in the requirements file match your installed versions. If they do, please report the variables in the scope of compute_vqa for further analysis.

I strongly suspect that your PL version does not match mine, because line 304 of pytorch_lightning/trainer/training_loop.py in PL version 1.1.4 is not closure_loss = ... but this.
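A quick way to confirm the suspected mismatch is to compare the installed PyTorch Lightning version against the pin. This is a generic check (not from the ViLT repo), using only the standard library:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

PINNED = "1.1.4"  # version pinned in ViLT's requirements.txt

def check_pl_version(installed: Optional[str], pinned: str = PINNED) -> bool:
    """True only if the installed PyTorch Lightning version matches the pin exactly."""
    return installed == pinned

try:
    installed = version("pytorch-lightning")
except PackageNotFoundError:
    installed = None

if not check_pl_version(installed):
    print(f"Version mismatch: ViLT is only guaranteed on {PINNED}, found {installed}")
```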

tejas1995 (Author) commented
Sorry - I had been getting an error with the earlier version of pytorch_lightning as well, and upgraded just to check whether it made any difference. I have since identified the bug in my code that caused the earlier error.
