
Error in PyTorch-Lightning when Finetuning on VQA #3

Closed
tejas1995 opened this issue May 15, 2021 · 2 comments


tejas1995 commented May 15, 2021

Hello,

I am trying to finetune ViLT on the VQAv2 task - I created the arrow_root directory as instructed, and then ran:
python run.py with data_root=<PROJECT_DIR>/arrow_root/vqav2/ num_gpus=1 num_nodes=1 task_finetune_vqa per_gpu_batchsize=64 load_path="weights/vilt_200k_mlm_itm.ckpt"

However, once the model begins training, I get the following error:

Traceback (most recent calls WITHOUT Sacred internals):
File "run.py", line 71, in main
trainer.fit(model, datamodule=dm)
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
self._run(model)
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
self.dispatch()
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
self.accelerator.start_training(self)
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
self._results = trainer.run_stage()
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
return self.run_train()
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
self.train_loop.run_training_epoch()
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 493, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 711, in run_training_batch
split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 817, in training_step_and_backward
result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/home/t-tejass/.conda/envs/vilt-real/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 304, in training_step
closure_loss = training_step_output.minimize / self.trainer.accumulate_grad_batches
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

I printed the value of training_step_output right before the error: {'extra': {}, 'minimize': None}. I am not too familiar with PyTorch Lightning, but this does not look like the correct output.

Am I missing any steps here, apart from creating the arrow data and running the model?
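For context, the TypeError comes from Lightning dividing the loss returned by training_step by accumulate_grad_batches; if the step effectively returns no loss, 'minimize' is None and the division fails exactly as in the traceback above. A minimal stand-alone sketch of the failure (plain Python, not the actual Lightning internals):

```python
def training_step_output(batch):
    # Mirrors a training_step that produced no loss: Lightning ends up
    # holding {'extra': {}, 'minimize': None} for this batch.
    return None

accumulate_grad_batches = 1
minimize = training_step_output(batch=None)

try:
    closure_loss = minimize / accumulate_grad_batches
except TypeError as e:
    print(e)  # unsupported operand type(s) for /: 'NoneType' and 'int'
```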

dandelin (Owner) commented May 15, 2021

I've just checked that your command works in my environment: python run.py with data_root=./dataset num_gpus=1 num_nodes=1 task_finetune_vqa per_gpu_batchsize=64 load_path="weights/vilt_200k_mlm_itm.ckpt"

It seems your error is caused by the training_step method returning None, which should not happen: since "vqa" is in self.current_tasks, the dictionary returned by forward should contain {vqa_loss: Tensor}.
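The failure mode described here can be sketched as follows (hypothetical, simplified code with scalar stand-ins for loss tensors, not the actual ViLT LightningModule):

```python
def forward(batch, current_tasks):
    # Each active task contributes a "<task>_loss" entry to the output dict.
    output = {}
    if "vqa" in current_tasks:
        output["vqa_loss"] = 0.5  # stand-in for a loss tensor
    return output

def training_step(batch, current_tasks):
    output = forward(batch, current_tasks)
    losses = [v for k, v in output.items() if "loss" in k]
    # If no task produced a loss, the step returns None and Lightning
    # later fails with: unsupported operand type(s) for /: 'NoneType' and 'int'
    return sum(losses) if losses else None

print(training_step(None, ["vqa"]))  # 0.5
print(training_step(None, []))       # None
```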

Since PyTorch Lightning is a rapidly changing project, I can only guarantee my code for the specific version of PL denoted in requirements.txt: pytorch_lightning==1.1.4.
Please first re-check that the versions in the requirements file match your installed versions. If they do, please report the variables in the scope of compute_vqa for further analysis.

I strongly suspect that your PL version does not match mine, because line 304 of pytorch_lightning/trainer/training_loop.py in PL version 1.1.4 is not closure_loss = ... but this.
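A quick way to confirm the suspected mismatch is to compare the installed PyTorch Lightning version against the pin. This is a generic check (not from the ViLT repo), using only the standard library:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

PINNED = "1.1.4"  # version pinned in ViLT's requirements.txt

def check_pl_version(installed: Optional[str], pinned: str = PINNED) -> bool:
    """True only if the installed PyTorch Lightning version matches the pin exactly."""
    return installed == pinned

try:
    installed = version("pytorch-lightning")
except PackageNotFoundError:
    installed = None

if not check_pl_version(installed):
    print(f"Version mismatch: ViLT is only guaranteed on {PINNED}, found {installed}")
```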

tejas1995 (Author) commented
Sorry - I had been getting an error with the earlier version of pytorch_lightning as well, and upgraded just to check whether it made any difference. I have since identified the bug in my code that caused the earlier error.
