[Error] RIDE Expert Assignment Module Training (Stage 2) #20

abababa-ai · 2021-07-14T03:08:59Z

Thanks for your great work. I get an error when I try to use the model from the model zoo on the iNaturalist dataset. The model uses a ResNet50 backbone with 4 experts and distillation. When I use it as the pretrained model to initialize the model for stage 2, it produces the errors below. It seems that the parameters of some layers are not loaded correctly.
My command line is:
python train.py -c "configs/config_iNaturalist_resnet50_ride_ea.json" -r afs/RIDE/RIDE_model/imagenet_4experts_distill/checkpoint-epoch5.pth --reduce_dimension 1 --num_experts 4

The reported errors are as follows:

Loading checkpoint: ./RIDE/RIDE_model/imagenet_4experts_distill/checkpoint-epoch5.pth ...
Traceback (most recent call last):
  File "/root/workspace/env_run/utils/util.py", line 59, in load_state_dict
    own_state["module."+name].copy_(param)
RuntimeError: The size of tensor a (64) must match the size of tensor b (128) at non-singleton dimension 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 110, in <module>
    main(config)
  File "train.py", line 73, in main
    lr_scheduler=lr_scheduler)
  File "/root/workspace/env_run/trainer/trainer.py", line 14, in __init__
    super().__init__(model, criterion, metric_ftns, optimizer, config)
  File "/root/workspace/env_run/base/base_trainer.py", line 59, in __init__
    self._resume_checkpoint(config.resume, state_dict_only=state_dict_only)
  File "/root/workspace/env_run/base/base_trainer.py", line 211, in _resume_checkpoint
    load_state_dict(self.model, state_dict)
  File "/root/workspace/env_run/utils/util.py", line 63, in load_state_dict
    print("Error in copying parameter {}, source shape: {}, destination shape: {}".format(name, param.shape, own_state[name].shape))
KeyError: 'backbone.layer1.0.conv1.weight'
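The second exception appears to come from the error message itself: line 59 copies into `own_state["module."+name]`, while the print on line 63 reads `own_state[name]`, and the KeyError shows that the unprefixed key does not exist, so the shape report never gets printed. Below is a hedged sketch of a loader that resolves the destination key once and uses it for both copying and reporting. `load_state_dict_sketch` and its `prefix` parameter are hypothetical, not the repository's code; it only assumes objects with a `.shape` attribute and an in-place `.copy_()` method, as PyTorch tensors have.

```python
def load_state_dict_sketch(own_state, state_dict, prefix="module."):
    """Copy checkpoint tensors into a model's state, reporting shape mismatches.

    own_state:  mapping of parameter name -> tensor (e.g. model.state_dict()).
    state_dict: mapping of checkpoint parameter name -> tensor.
    prefix:     hypothetical key prefix of a wrapped model (e.g. DataParallel).
    Returns a list of (name, source_shape, destination_shape); destination_shape
    is None when the model has no matching key.
    """
    problems = []
    for name, param in state_dict.items():
        # Resolve the destination key once, then use it for both copy and report.
        key = prefix + name if prefix + name in own_state else name
        if key not in own_state:
            problems.append((name, tuple(param.shape), None))
            continue
        dest = own_state[key]
        src_shape, dest_shape = tuple(param.shape), tuple(dest.shape)
        if src_shape != dest_shape:
            # Report with the already-resolved key, so this line cannot raise KeyError.
            print("Error in copying parameter {}, source shape: {}, destination shape: {}"
                  .format(name, src_shape, dest_shape))
            problems.append((name, src_shape, dest_shape))
            continue
        dest.copy_(param)  # in-place copy, as on util.py line 59
    return problems
```

Collecting the mismatches instead of failing on the first one lets every incompatible layer be listed in one run; whether to raise afterwards is left to the caller.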

TonyLianLong · 2021-07-14T03:15:34Z

There is a size mismatch between the model and the checkpoint. As you can see, you are trying to load an ImageNet checkpoint, but the config is for iNaturalist. You could check which model the config builds and which checkpoint you are loading.

The error:

The size of tensor a (64) must match the size of tensor b (128) at non-singleton dimension 0
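To see which layers disagree before resuming, one option is to compare the parameter shapes of the model and the checkpoint by name. This is a minimal sketch in plain Python, assuming both sides have already been reduced to dicts mapping parameter names to shape tuples, and that a wrapped model's keys carry a `module.` prefix (as the `own_state["module."+name]` line in the traceback suggests). `compare_shapes` is a hypothetical helper, not part of the RIDE code.

```python
def compare_shapes(model_shapes, ckpt_shapes, strip_prefix="module."):
    """Diff two {name: shape} dicts after stripping a wrapper prefix.

    Returns (mismatched, only_in_model, only_in_checkpoint), where mismatched
    is a sorted list of (name, model_shape, checkpoint_shape).
    """
    def normalize(shapes):
        # Drop the assumed wrapper prefix so both sides use the same names.
        return {
            (k[len(strip_prefix):] if k.startswith(strip_prefix) else k): tuple(v)
            for k, v in shapes.items()
        }

    m, c = normalize(model_shapes), normalize(ckpt_shapes)
    mismatched = [(k, m[k], c[k]) for k in sorted(m.keys() & c.keys()) if m[k] != c[k]]
    only_in_model = sorted(m.keys() - c.keys())
    only_in_checkpoint = sorted(c.keys() - m.keys())
    return mismatched, only_in_model, only_in_checkpoint
```

With PyTorch, the model side could be built as `{k: tuple(v.shape) for k, v in model.state_dict().items()}` and the checkpoint side from the dict returned by `torch.load(path, map_location="cpu")`; the key under which the checkpoint stores its weights (for example `"state_dict"`) is an assumption to verify against the actual file.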

abababa-ai · 2021-07-14T08:20:44Z

Sorry, that was a careless mistake on my part.

abababa-ai closed this as completed Jul 14, 2021
