Running training on toy dataset fails #56
Comments
@prateekdhawalia, it's definitely a new one. I'm on paternity leave; @themattinthehatt will assist you soon.
@danbider I tried the develop branch and it works fine. I got this issue on the main branch. Thanks for the help.
@themattinthehatt when you have a chance please merge
@danbider @themattinthehatt Will this framework work on images where the object occupies only 20 to 30% of the image area?
@danbider I'll run the @prateekdhawalia the framework should work fine if the object is smaller. Are you dealing with a freely moving animal in an arena? If not (i.e. the animal is stationary), I'd suggest cropping around the animal first before training the models.
@prateekdhawalia I've now merged
@themattinthehatt Thanks for the response. My use case involves a freely moving object.
Hi @prateekdhawalia, sorry to hear you didn't see good performance on your unlabeled video. 260 labeled images should be a reasonable amount. How many labeled keypoints do you have per frame? Apologies for the lack of
Hi @themattinthehatt, I have 5 labeled keypoints per frame. Error executing job with overrides: [] Kindly check if there is a bug and help correct it.
@prateekdhawalia I opened a new issue for this, will look into it today
Hello,
I tried running training on the toy dataset using the default Hydra script, and it fails when the loss is set to pca_singleview or pca_multiview, with the stack trace below.
Kindly help in resolving this.
scripts/train_hydra.py:22: UserWarning:
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
@hydra.main(config_path="configs", config_name="config")
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
Our Hydra config file:
training parameters
train_batch_size: 16
val_batch_size: 16
test_batch_size: 16
train_prob: 0.8
val_prob: 0.1
train_frames: 1
num_gpus: 0
num_workers: 4
early_stop_patience: 3
unfreezing_epoch: 25
dropout_rate: 0.1
min_epochs: 100
max_epochs: 500
log_every_n_steps: 1
check_val_every_n_epoch: 10
gpu_id: 0
unlabeled_sequence_length: 16
rng_seed_data_pt: 42
rng_seed_data_dali: 43
rng_seed_model_pt: 44
limit_train_batches: 10
multiple_trainloader_mode: max_size_cycle
profiler: simple
accumulate_grad_batches: 2
lr_scheduler: multisteplr
lr_scheduler_params: {'multisteplr': {'milestones': [100, 200, 300], 'gamma': 0.5}}
losses parameters
pca_multiview: {'log_weight': 7.0, 'components_to_keep': 3, 'empirical_epsilon_percentile': 1.0, 'empirical_epsilon_multiplier': 1.0, 'epsilon': None, 'error_metric': 'reprojection_error'}
pca_singleview: {'log_weight': 7.25, 'components_to_keep': 0.99, 'empirical_epsilon_percentile': 1.0, 'empirical_epsilon_multiplier': 1.0, 'epsilon': None, 'error_metric': 'reprojection_error'}
temporal: {'log_weight': 7.5, 'epsilon': [12.9, 11.3, 10.5, 12.0, 5.0, 7.3, 0.7, 61.8, 11.2, 9.9, 9.7, 10.1, 4.8, 4.9, 1.0, 19.2, 6.8]}
unimodal_mse: {'log_weight': 6.5, 'prob_threshold': 0.0}
unimodal_kl: {'log_weight': 6.5, 'prob_threshold': 0.0}
data parameters
image_orig_dims: {'width': 396, 'height': 406}
image_resize_dims: {'width': 256, 'height': 256}
data_dir: toy_datasets/toymouseRunningData
video_dir: unlabeled_videos
csv_file: CollectedData_.csv
header_rows: [1, 2]
downsample_factor: 2
num_keypoints: 17
mirrored_column_matches: [[0, 1, 2, 3, 4, 5, 6], [8, 9, 10, 11, 12, 13, 14]]
columns_for_singleview_pca: [0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14]
model parameters
losses_to_use: ['pca_singleview']
learn_weights: False
resnet_version: 50
model_type: heatmap
heatmap_loss_type: mse
model_name: my_base_toy_model
callbacks parameters
anneal_weight: {'attr_name': 'total_unsupervised_importance', 'init_val': 0.0, 'increase_factor': 0.01, 'final_val': 1.0, 'freeze_until_epoch': 0}
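For readers unfamiliar with the `multisteplr` entry in the config above: assuming it follows the semantics of PyTorch's `MultiStepLR`, the learning rate is multiplied by `gamma` each time an epoch milestone is reached. A minimal pure-Python sketch of that schedule follows; the base learning rate of 1e-3 is hypothetical, since the config dump does not show it.

```python
from bisect import bisect_right

# Values copied from lr_scheduler_params in the config dump.
MILESTONES = [100, 200, 300]
GAMMA = 0.5


def multistep_lr(epoch, base_lr, milestones=MILESTONES, gamma=GAMMA):
    """Return base_lr * gamma ** (number of milestones already reached)."""
    reached = bisect_right(milestones, epoch)
    return base_lr * gamma ** reached


if __name__ == "__main__":
    base_lr = 1e-3  # hypothetical; not shown in the log
    for epoch in (0, 99, 100, 250, 300):
        print(epoch, multistep_lr(epoch, base_lr))
```

With `max_epochs: 500`, all three milestones would be reached under this reading, so the final learning rate would be one eighth of the starting value.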
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2895.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Number of labeled images in the full dataset (train+val+test): 90
Size of -- train set: 72, val set: 9, test set: 9
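The split sizes reported above are consistent with `train_prob: 0.8` and `val_prob: 0.1` in the config. A small sketch of the arithmetic, assuming the test set takes whatever remains (the log shows no `test_prob` value) and that counts are rounded to the nearest integer; the function name is hypothetical:

```python
def split_sizes(n_total, train_prob, val_prob):
    """Split n_total items into train/val/test counts.

    Assumption: the test set receives the remainder after train and val.
    """
    n_train = round(n_total * train_prob)
    n_val = round(n_total * val_prob)
    n_test = n_total - n_train - n_val
    return n_train, n_val, n_test


if __name__ == "__main__":
    # 90 labeled images, as reported in the log.
    print(split_sizes(90, train_prob=0.8, val_prob=0.1))  # (72, 9, 9)
```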
Warning: the argument `{farg[0]}` shadows a Pipeline constructor argument of the same name.
[/opt/dali/dali/operators/reader/loader/video_loader.h:178] `file_list_include_preceding_frame` is set to False (or not set at all). In future releases, the default behavior would be changed to True.
[/opt/dali/dali/operators/reader/nvdecoder/nvdecoder.cc:80] Warning: Decoding on a default stream. Performance may be affected.
Results of running PCA (pca_singleview) on keypoints:
Kept 13/28 components, and found:
Explained variance ratio: [0.315 0.242 0.209 0.073 0.048 0.034 0.021 0.015 0.01 0.007 0.007 0.005
0.004 0.003 0.002 0.001 0.001 0.001 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. ]
Variance explained by 13 components: 0.991
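The value `components_to_keep: 0.99` in the config appears to act as a variance threshold, not a component count: the log keeps the smallest number of components whose cumulative explained variance reaches 0.99. The 28 dimensions match the 14 entries of `columns_for_singleview_pca` times two coordinates. Below is a sketch of that selection rule in plain Python, assuming this interpretation; the function name is hypothetical. The ratios printed in the log are rounded to three decimals, so they only approximately reproduce the reported 0.991.

```python
from itertools import accumulate

# Explained variance ratios as printed in the log (rounded to 3 decimals).
LOGGED_RATIOS = [
    0.315, 0.242, 0.209, 0.073, 0.048, 0.034, 0.021, 0.015, 0.01,
    0.007, 0.007, 0.005, 0.004, 0.003, 0.002, 0.001, 0.001, 0.001,
] + [0.0] * 10


def n_components_for_variance(ratios, threshold):
    """Smallest k such that the first k ratios sum to at least `threshold`."""
    for k, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= threshold:
            return k
    return len(ratios)  # threshold never reached: keep everything


if __name__ == "__main__":
    print(len(LOGGED_RATIOS))  # 28
    # Rounded values sum to about 0.99 for the 13 components the log keeps.
    print(round(sum(LOGGED_RATIOS[:13]), 3))
    print(n_components_for_variance(LOGGED_RATIOS, 0.95))  # 8
```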
/home/walthamadmin/notebooks/projects/lightning-pose/lightning_pose/losses/losses.py:326: UserWarning: Using empirical epsilon=0.194 * multiplier=1.000 -> total=0.194 for pca_singleview loss
warnings.warn(
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py:22: LightningDeprecationWarning: pytorch_lightning.core.lightning.LightningModule has been deprecated in v1.7 and will be removed in v1.9. Use the equivalent class from the pytorch_lightning.core.module.LightningModule class instead.
rank_zero_deprecation(
Initializing a SemiSupervisedHeatmapTracker instance.
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torchvision/models/_utils.py:135: UserWarning: Using 'weights' as positional parameter(s) is deprecated since 0.13 and will be removed in 0.15. Please use keyword parameter(s) instead.
warnings.warn(
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:446: LightningDeprecationWarning: Setting `Trainer(gpus=[0])` is deprecated in v1.7 and will be removed in v2.0. Please use `Trainer(accelerator='gpu', devices=[0])` instead.
rank_zero_deprecation(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:285: LightningDeprecationWarning: The `Callback.on_epoch_start` hook was deprecated in v1.6 and will be removed in v1.8. Please use `Callback.on_<train/validation/test>_epoch_start` instead.
rank_zero_deprecation(
Missing logger folder: tb_logs/my_base_toy_model
Number of labeled images in the full dataset (train+val+test): 90
Size of -- train set: 72, val set: 9, test set: 9
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  | Name               | Type               | Params
----------------------------------------------------
0 | backbone           | Sequential         | 23.5 M
1 | loss_factory       | LossFactory        | 0
2 | upsampling_layers  | Sequential         | 81.0 K
3 | rmse_loss          | RegressionRMSELoss | 0
4 | loss_factory_unsup | LossFactory        | 0
----------------------------------------------------
134 K     Trainable params
23.5 M    Non-trainable params
23.6 M    Total params
94.356    Total estimated model params size (MB)
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:219: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 6 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
Epoch 0: 0%| | 0/10 [00:00<?, ?it/s]
/home/walthamadmin/notebooks/projects/lightning-pose/lightning_pose/data/dali.py:103: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
return torch.tensor(
Error executing job with overrides: []
Traceback (most recent call last):
File "scripts/train_hydra.py", line 110, in train
trainer.fit(model=model, datamodule=data_module)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
self._call_and_handle_interrupt(
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1168, in _run
results = self._run_stage()
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1254, in _run_stage
return self._run_train()
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1285, in _run_train
self.fit_loop.run()
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 270, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 203, in advance
batch_output = self.batch_loop.run(kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 87, in advance
outputs = self.optimizer_loop.run(optimizers, kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 201, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 240, in _run_optimization
closure()
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 146, in __call__
self._result = self.closure(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 141, in closure
self._backward_fn(step_output.closure_loss)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 304, in backward_fn
self.trainer._call_strategy_hook("backward", loss, optimizer, opt_idx)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1706, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 191, in backward
self.precision_plugin.backward(self.lightning_module, closure_loss, optimizer, optimizer_idx, *args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 80, in backward
model.backward(closure_loss, optimizer, optimizer_idx, *args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/core/module.py", line 1418, in backward
loss.backward(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16, 14]], which is output 0 of LinalgVectorNormBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Epoch 0: 0%|
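To illustrate the class of error in the trace above: autograd saves the output of a vector norm for its backward pass, so writing into that output in place invalidates it. The sketch below reproduces the same `RuntimeError` on a toy tensor and shows an out-of-place alternative. It is not the lightning-pose code; both function names and the thresholding step are hypothetical, chosen only because the failing tensor has shape `[16, 14]` and the log mentions an epsilon of 0.194.

```python
import torch


def thresholded_norm_inplace(vectors, epsilon):
    """Zero out small norms IN PLACE; this breaks the backward pass."""
    norms = torch.linalg.vector_norm(vectors, dim=-1)  # shape [16, 14]
    # The in-place write bumps the version counter of a tensor that
    # the norm's backward function has saved for gradient computation.
    norms[norms < epsilon] = 0.0
    return norms


def thresholded_norm_out_of_place(vectors, epsilon):
    """Same thresholding, but returning a new tensor."""
    norms = torch.linalg.vector_norm(vectors, dim=-1)
    return norms.masked_fill(norms < epsilon, 0.0)


if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(16, 14, 2, requires_grad=True)

    try:
        thresholded_norm_inplace(x, epsilon=0.194).sum().backward()
    except RuntimeError as err:
        print("in-place version failed:", str(err)[:80])

    thresholded_norm_out_of_place(x, epsilon=0.194).sum().backward()
    print("out-of-place version OK, grad shape:", tuple(x.grad.shape))
```

Calling `torch.autograd.set_detect_anomaly(True)` before training, as the error message suggests, should additionally print the forward-pass traceback of the operation whose backward failed, which helps locate the offending line.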