Running training on toy dataset fails #56

prateekdhawalia opened this issue Aug 18, 2022 · 10 comments

I tried running training on the toy dataset using the default Hydra script, and it fails when the loss is set to pca_singleview or pca_multiview, with the following stack trace.
Kindly help in resolving this.

scripts/ UserWarning:
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
@hydra.main(config_path="configs", config_name="config")
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/hydra/_internal/ UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See for more information.
ret = run_job(
Our Hydra config file:

training parameters

train_batch_size: 16
val_batch_size: 16
test_batch_size: 16
train_prob: 0.8
val_prob: 0.1
train_frames: 1
num_gpus: 0
num_workers: 4
early_stop_patience: 3
unfreezing_epoch: 25
dropout_rate: 0.1
min_epochs: 100
max_epochs: 500
log_every_n_steps: 1
check_val_every_n_epoch: 10
gpu_id: 0
unlabeled_sequence_length: 16
rng_seed_data_pt: 42
rng_seed_data_dali: 43
rng_seed_model_pt: 44
limit_train_batches: 10
multiple_trainloader_mode: max_size_cycle
profiler: simple
accumulate_grad_batches: 2
lr_scheduler: multisteplr
lr_scheduler_params: {'multisteplr': {'milestones': [100, 200, 300], 'gamma': 0.5}}

losses parameters

pca_multiview: {'log_weight': 7.0, 'components_to_keep': 3, 'empirical_epsilon_percentile': 1.0, 'empirical_epsilon_multiplier': 1.0, 'epsilon': None, 'error_metric': 'reprojection_error'}
pca_singleview: {'log_weight': 7.25, 'components_to_keep': 0.99, 'empirical_epsilon_percentile': 1.0, 'empirical_epsilon_multiplier': 1.0, 'epsilon': None, 'error_metric': 'reprojection_error'}
temporal: {'log_weight': 7.5, 'epsilon': [12.9, 11.3, 10.5, 12.0, 5.0, 7.3, 0.7, 61.8, 11.2, 9.9, 9.7, 10.1, 4.8, 4.9, 1.0, 19.2, 6.8]}
unimodal_mse: {'log_weight': 6.5, 'prob_threshold': 0.0}
unimodal_kl: {'log_weight': 6.5, 'prob_threshold': 0.0}

data parameters

image_orig_dims: {'width': 396, 'height': 406}
image_resize_dims: {'width': 256, 'height': 256}
data_dir: toy_datasets/toymouseRunningData
video_dir: unlabeled_videos
csv_file: CollectedData_.csv
header_rows: [1, 2]
downsample_factor: 2
num_keypoints: 17
mirrored_column_matches: [[0, 1, 2, 3, 4, 5, 6], [8, 9, 10, 11, 12, 13, 14]]
columns_for_singleview_pca: [0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14]

model parameters

losses_to_use: ['pca_singleview']
learn_weights: False
resnet_version: 50
model_type: heatmap
heatmap_loss_type: mse
model_name: my_base_toy_model

callbacks parameters

anneal_weight: {'attr_name': 'total_unsupervised_importance', 'init_val': 0.0, 'increase_factor': 0.01, 'final_val': 1.0, 'freeze_until_epoch': 0}

/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torch/ UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2895.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Number of labeled images in the full dataset (train+val+test): 90
Size of -- train set: 72, val set: 9, test set: 9
Warning: the argument {farg[0]} shadows a Pipeline constructor argument of the same name.
[/opt/dali/dali/operators/reader/loader/video_loader.h:178] file_list_include_preceding_frame is set to False (or not set at all). In future releases, the default behavior would be changed to True.
[/opt/dali/dali/operators/reader/nvdecoder/] Warning: Decoding on a default stream. Performance may be affected.
Results of running PCA (pca_singleview) on keypoints:
Kept 13/28 components, and found:
Explained variance ratio: [0.315 0.242 0.209 0.073 0.048 0.034 0.021 0.015 0.01 0.007 0.007 0.005
0.004 0.003 0.002 0.001 0.001 0.001 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. ]
Variance explained by 13 components: 0.991
/home/walthamadmin/notebooks/projects/lightning-pose/lightning_pose/losses/ UserWarning: Using empirical epsilon=0.194 * multiplier=1.000 -> total=0.194 for pca_singleview loss
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/core/ LightningDeprecationWarning: pytorch_lightning.core.lightning.LightningModule has been deprecated in v1.7 and will be removed in v1.9. Use the equivalent class from the pytorch_lightning.core.module.LightningModule class instead.

Initializing a SemiSupervisedHeatmapTracker instance.
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torchvision/models/ UserWarning: Using 'weights' as positional parameter(s) is deprecated since 0.13 and will be removed in 0.15. Please use keyword parameter(s) instead.
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torchvision/models/ UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing weights=ResNet50_Weights.IMAGENET1K_V1. You can also use weights=ResNet50_Weights.DEFAULT to get the most up-to-date weights.
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/ LightningDeprecationWarning: Setting Trainer(gpus=[0]) is deprecated in v1.7 and will be removed in v2.0. Please use Trainer(accelerator='gpu', devices=[0]) instead.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/ LightningDeprecationWarning: The Callback.on_epoch_start hook was deprecated in v1.6 and will be removed in v1.8. Please use Callback.on_<train/validation/test>_epoch_start instead.
Missing logger folder: tb_logs/my_base_toy_model
Number of labeled images in the full dataset (train+val+test): 90
Size of -- train set: 72, val set: 9, test set: 9

| Name | Type | Params

0 | backbone | Sequential | 23.5 M
1 | loss_factory | LossFactory | 0
2 | upsampling_layers | Sequential | 81.0 K
3 | rmse_loss | RegressionRMSELoss | 0
4 | loss_factory_unsup | LossFactory | 0

134 K Trainable params
23.5 M Non-trainable params
23.6 M Total params
94.356 Total estimated model params size (MB)
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/ PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the num_workers argument (try 6 which is the number of cpus on this machine) in the DataLoader init to improve performance.
Epoch 0: 0%| | 0/10 [00:00<?, ?it/s]
/home/walthamadmin/notebooks/projects/lightning-pose/lightning_pose/data/ UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
return torch.tensor(
Error executing job with overrides: []
Traceback (most recent call last):
File "scripts/", line 110, in train, datamodule=data_module)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 696, in fit
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 650, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 737, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 1168, in _run
results = self._run_stage()
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 1254, in _run_stage
return self._run_train()
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 1285, in _run_train
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/", line 200, in run
self.advance(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/", line 270, in advance
self._outputs =
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/", line 200, in run
self.advance(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/", line 203, in advance
batch_output =
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/", line 200, in run
self.advance(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/batch/", line 87, in advance
outputs =, kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/", line 200, in run
self.advance(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/", line 201, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/", line 240, in _run_optimization
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/", line 146, in call
self._result = self.closure(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/", line 141, in closure
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/", line 304, in backward_fn
self.trainer._call_strategy_hook("backward", loss, optimizer, opt_idx)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 1706, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/strategies/", line 191, in backward
self.precision_plugin.backward(self.lightning_module, closure_loss, optimizer, optimizer_idx, *args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/", line 80, in backward
model.backward(closure_loss, optimizer, optimizer_idx, *args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/core/", line 1418, in backward
loss.backward(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torch/", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torch/autograd/", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16, 14]], which is output 0 of LinalgVectorNormBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Epoch 0: 0%|
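
For readers who hit the same RuntimeError: PyTorch raises it when a tensor that autograd saved for the backward pass is later changed in place. The thread does not identify the offending line on main, so the following is only a minimal sketch of the general failure mode and an out-of-place alternative. The input shape is chosen so the norm output matches the [16, 14] in the message; the function names and the epsilon value 0.5 are hypothetical and are not taken from lightning-pose.

```python
import torch


def loss_inplace(x: torch.Tensor, eps: float) -> torch.Tensor:
    """Epsilon-insensitive norm loss that subtracts eps IN PLACE."""
    norms = torch.linalg.vector_norm(x, dim=-1)  # autograd saves this output
    norms -= eps  # in-place write bumps the tensor's version counter
    return torch.relu(norms).mean()


def loss_out_of_place(x: torch.Tensor, eps: float) -> torch.Tensor:
    """Same loss, but the subtraction creates a new tensor."""
    norms = torch.linalg.vector_norm(x, dim=-1)
    shifted = norms - eps  # the saved norm output is left untouched
    return torch.relu(shifted).mean()


# Hypothetical input: 16 samples, 14 keypoints, 2 coordinates each.
x = torch.randn(16, 14, 2, requires_grad=True)

try:
    loss_inplace(x, 0.5).backward()
except RuntimeError as err:
    print("in-place version failed:", str(err)[:60], "...")

loss_out_of_place(x, 0.5).backward()
print("out-of-place gradient shape:", tuple(x.grad.shape))
```

As the error hint says, wrapping the failing call in `torch.autograd.set_detect_anomaly(True)` makes PyTorch also report the forward-pass operation whose saved output was modified, which is usually the fastest way to locate the in-place write.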

@prateekdhawalia, this is definitely a new one. I'm on paternity leave; @themattinthehatt will assist you soon.
Please try checking out the develop branch and let us know if you run into the same error.

@danbider I tried the develop branch and it works fine. Got this issue in the main branch. Thanks for the help.

@themattinthehatt when you have a chance please merge develop --> main?

prateekdhawalia commented Aug 25, 2022

@danbider @themattinthehatt Will this framework work on images where the object occupies only 20 to 30% of image area?
Basically detecting keypoints on small objects present on a bigger image.
If yes, kindly suggest the approach.

@danbider I'll run the develop branch through the testing framework then merge into main; will update you all when this is complete.

@prateekdhawalia the framework should work fine if the object is smaller - are you dealing with a freely moving animal in an arena? If not (i.e. the animal is stationary), I'd suggest cropping around the animal first before training the models.
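
As a rough illustration of the cropping suggestion for a stationary animal, one option is to derive a fixed crop box from the labeled keypoints plus a margin, then shift the labels into the cropped frame. This is a sketch and not part of lightning-pose: the function names, the margin of 20 pixels, and the sample coordinates are hypothetical. The image size is the `image_orig_dims` from the toy config earlier in this thread.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[int, int, int, int]


def crop_box_from_keypoints(
    keypoints: Sequence[Point], width: int, height: int, margin: int = 20
) -> Box:
    """Return (left, top, right, bottom) enclosing all keypoints plus a margin.

    The box is clipped to the image; right and bottom are exclusive.
    """
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    left = max(0, int(min(xs)) - margin)
    top = max(0, int(min(ys)) - margin)
    right = min(width, int(max(xs)) + margin + 1)
    bottom = min(height, int(max(ys)) + margin + 1)
    return left, top, right, bottom


def shift_keypoints(keypoints: Sequence[Point], box: Box) -> List[Point]:
    """Express keypoints in the coordinate frame of the cropped image."""
    left, top, _, _ = box
    return [(x - left, y - top) for x, y in keypoints]


# Hypothetical labels pooled over all frames of one video.
labels = [(120.5, 200.0), (180.2, 240.7), (150.0, 390.0)]
box = crop_box_from_keypoints(labels, width=396, height=406)
print(box)  # (100, 180, 201, 406)
print(shift_keypoints(labels, box)[0])  # (20.5, 20.0)
```

After cropping, `image_orig_dims` in the config would need to reflect the new width and height, and the same box has to be applied to the unlabeled videos.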

@prateekdhawalia I've now merged develop into main; please raise another issue if you run into more troubles.

@themattinthehatt Thanks for the response. My use case involves a freely moving object.
Also, I tried both the SemiSupervisedHeatMap and SemiSupervisedRegression models. The heatmap model did not perform well on unlabeled video, but that may be because of the small amount of labeled data (260 images).
The regression model fails during the predict step because it has no implementation of predict_step().
Is this a bug, or is it intentional?
Kindly suggest whether the heatmap or the regression model should be used.

Hi @prateekdhawalia, sorry to hear you didn't see good performance on your unlabeled video. 260 labeled images should be a reasonable amount - how many labeled keypoints do you have per frame?

Apologies for the lack of predict_step() for the regression model - that was not intentional, we just haven't updated that model yet. I just raised an issue to that effect and will fix it asap. In general though we've found much better performance with the heatmap models.

Hi @themattinthehatt, I have 5 labeled keypoints per frame.
Thanks for the info that heatmaps are more accurate.
Also, I have noticed that when I use DLC image augmentation with the image rotation augmentation set above 10, the code throws the error below.

Error executing job with overrides: []
Traceback (most recent call last):
File "scripts/", line 175, in train, datamodule=data_module)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 696, in fit
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 650, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 737, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 1168, in _run
results = self._run_stage()
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 1254, in _run_stage
return self._run_train()
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 1285, in _run_train
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/", line 200, in run
self.advance(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/", line 270, in advance
self._outputs =
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/", line 200, in run
self.advance(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/", line 203, in advance
batch_output =
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/", line 200, in run
self.advance(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/batch/", line 87, in advance
outputs =, kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/", line 200, in run
self.advance(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/", line 201, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/", line 248, in _run_optimization
self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/", line 358, in _optimizer_step
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 1552, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/core/", line 1673, in optimizer_step
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/core/", line 168, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/strategies/", line 216, in optimizer_step
return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/", line 153, in optimizer_step
return optimizer.step(closure=closure, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torch/optim/", line 65, in wrapper
return wrapped(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torch/optim/", line 113, in wrapper
return func(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torch/autograd/", line 27, in decorate_context
return func(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torch/optim/", line 118, in step
loss = closure()
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/", line 138, in _wrap_closure
closure_result = closure()
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/", line 146, in call
self._result = self.closure(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/", line 132, in closure
step_output = self._step_fn()
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/", line 407, in _training_step
training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/", line 1706, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/strategies/", line 358, in training_step
return self.model.training_step(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/typeguard/", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/home/walthamadmin/notebooks/projects/lightning-pose/lightning_pose/models/", line 347, in training_step
loss = self.evaluate_labeled(train_batch, "train")
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/typeguard/", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/home/walthamadmin/notebooks/projects/lightning-pose/lightning_pose/models/", line 321, in evaluate_labeled
data_dict = self.get_loss_inputs_labeled(batch_dict=batch_dict)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/typeguard/", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/home/walthamadmin/notebooks/projects/lightning-pose/lightning_pose/models/", line 233, in get_loss_inputs_labeled
predicted_keypoints, confidence = self.run_subpixelmaxima(predicted_heatmaps)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/typeguard/", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/home/walthamadmin/notebooks/projects/lightning-pose/lightning_pose/models/", line 143, in run_subpixelmaxima
confidences = evaluate_heatmaps_at_location(heatmaps=softmaxes, locs=preds)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/typeguard/", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/home/walthamadmin/notebooks/projects/lightning-pose/lightning_pose/data/", line 333, in evaluate_heatmaps_at_location
heatmaps_padded[i, j, k_offset, l_offset].squeeze(-1).squeeze(-1)
IndexError: index -9223372036854775808 is out of bounds for dimension 2 with size 388

Kindly check if there is a bug and help in correcting it.
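
A note on the index in this error: -9223372036854775808 equals -2^63, the smallest signed 64-bit integer. On many platforms that value is what results from casting a NaN or infinite float to a 64-bit integer, so one plausible but unconfirmed reading is that a predicted or augmented location became non-finite before it was used as an index. The sketch below shows a hypothetical guard for that situation; `safe_pixel_index` is an illustrative name, not a lightning-pose function and not the fix applied by the maintainers.

```python
import math
from typing import Optional

INT64_MIN = -(2 ** 63)  # -9223372036854775808, the value seen in the IndexError


def safe_pixel_index(coord: float, size: int) -> Optional[int]:
    """Convert a float coordinate to a valid index, or None if unusable.

    Rejects NaN/inf and anything outside [0, size) instead of letting an
    invalid integer reach the indexing step.
    """
    if not math.isfinite(coord):
        return None
    idx = int(round(coord))
    if idx < 0 or idx >= size:
        return None
    return idx


print(INT64_MIN)
print(safe_pixel_index(12.4, 388))
print(safe_pixel_index(float("nan"), 388))
```

Checking the tensor of locations with `torch.isfinite` just before the indexing call would be the equivalent diagnostic inside the training loop.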

@prateekdhawalia I opened a new issue for this, will look into it today
#59 (comment)
