
dynamic reweighting causes performance degradation in reproducing #4

Closed · Charles-Xie opened this issue Oct 28, 2021 · 13 comments

@Charles-Xie

Hi,
thanks for sharing the code! Great work!

I have a small question about reproducing your results.
I ran the CDN-S model (res50, 3+3) twice. It gave results of about 31.5 and 31.2 after the first training stage (training the whole model with the regular loss). But after the second training stage (decoupled training) finished, the performance dropped to 31.0 and 30.4 for those two runs respectively. For full mAP, rare mAP and non-rare mAP, this trick does not seem to help.

So I wonder what could have gone wrong in my reproduction, or what the reason might be. I will paste the commands and logs below. Thanks. Nice day :3

@Charles-Xie (Author) commented Oct 28, 2021

Command (exactly the same as the one provided in the README, except the output_dir):

python3 -m torch.distributed.launch \
        --nproc_per_node=8 \
        --use_env \
        main.py \
        --pretrained params/detr-r50-pre-2stage-q64.pth \
        --output_dir logs/base \
        --dataset_file hico \
        --hoi_path data/hico_20160224_det \
        --num_obj_classes 80 \
        --num_verb_classes 117 \
        --backbone resnet50 \
        --num_queries 64 \
        --dec_layers_hopd 3 \
        --dec_layers_interaction 3 \
        --epochs 90 \
        --lr_drop 60 \
        --use_nms_filter

python3 -m torch.distributed.launch \
        --nproc_per_node=8 \
        --use_env \
        main.py \
        --pretrained logs/base/checkpoint_last.pth \
        --output_dir logs/base \
        --dataset_file hico \
        --hoi_path data/hico_20160224_det \
        --num_obj_classes 80 \
        --num_verb_classes 117 \
        --backbone resnet50 \
        --num_queries 64 \
        --dec_layers_hopd 3 \
        --dec_layers_interaction 3 \
        --epochs 10 \
        --freeze_mode 1 \
        --obj_reweight \
        --verb_reweight \
        --use_nms_filter

echo "base"

Corresponding result (log): 31.5 after the 1st training stage and 31.0 after the 2nd training stage:
log.txt
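
(Side note on what --freeze_mode 1 in the second command does: it switches to the decoupled fine-tuning stage, which presumably keeps most of the first-stage weights frozen and only updates part of the network with the re-weighted losses. A rough PyTorch sketch of that freeze-then-fine-tune pattern is below; the function name and the "interaction" keyword are hypothetical, not the repository's actual code.)

import torch

def build_stage2_optimizer(model, trainable_keywords=("interaction",), lr=5e-6):
    # Freeze every parameter first (the first-stage weights stay fixed).
    for p in model.parameters():
        p.requires_grad = False
    # Unfreeze only parameters whose names contain the chosen keywords,
    # e.g. the branch that the re-weighted losses are meant to fine-tune.
    trainable = []
    for name, p in model.named_parameters():
        if any(k in name for k in trainable_keywords):
            p.requires_grad = True
            trainable.append(p)
    return torch.optim.AdamW(trainable, lr=lr)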

@Charles-Xie (Author)

For the 2nd run, the command (exactly the same as the one provided in the README, except the output_dir):

python3 -m torch.distributed.launch \
        --nproc_per_node=8 \
        --use_env \
        main.py \
        --pretrained params/detr-r50-pre-2stage-q64.pth \
        --output_dir logs/base_4worker \
        --dataset_file hico \
        --hoi_path data/hico_20160224_det \
        --num_obj_classes 80 \
        --num_verb_classes 117 \
        --backbone resnet50 \
        --num_queries 64 \
        --dec_layers_hopd 3 \
        --dec_layers_interaction 3 \
        --epochs 90 \
        --lr_drop 60 \
        --use_nms_filter \
        --num_workers 4

python3 -m torch.distributed.launch \
        --nproc_per_node=8 \
        --use_env \
        main.py \
        --pretrained logs/base_4worker/checkpoint_last.pth \
        --output_dir logs/base_4worker \
        --dataset_file hico \
        --hoi_path data/hico_20160224_det \
        --num_obj_classes 80 \
        --num_verb_classes 117 \
        --backbone resnet50 \
        --num_queries 64 \
        --dec_layers_hopd 3 \
        --dec_layers_interaction 3 \
        --epochs 10 \
        --freeze_mode 1 \
        --obj_reweight \
        --verb_reweight \
        --use_nms_filter \
        --num_workers 4

echo "base_4worker"

Corresponding result (log): 31.2 after the 1st training stage and 30.4 after the 2nd training stage:
log.txt

@YueLiao (Owner) commented Oct 28, 2021

This module was implemented by @zhangaixi2008, and he will reply to you later.

@YueLiao (Owner) commented Oct 28, 2021

(Quoting the first run above: 31.5 after the 1st training stage and 31.0 after the 2nd training stage, log.txt)

Aha, 31.5%, a new SOTA with CDN-S.

@zhangaixi2008 commented Oct 28, 2021

You can try the following command:

python -m torch.distributed.launch \
        --master_port 10026 \
        --nproc_per_node=4 \
        --use_env \
        main.py \
        --pretrained logs/base/checkpoint_last.pth \
        --output_dir logs/base \
        --hoi \
        --dataset_file hico \
        --hoi_path data/hico_20160224_det \
        --num_obj_classes 80 \
        --num_verb_classes 117 \
        --backbone resnet50 \
        --set_cost_bbox 2.5 \
        --set_cost_giou 1 \
        --bbox_loss_coef 2.5 \
        --giou_loss_coef 1 \
        --num_queries 64 \
        --dec_layers_stage1 3 \
        --dec_layers_stage2 3 \
        --epochs 10 \
        --freeze_mode 1 \
        --obj_reweight \
        --verb_reweight \
        --queue_size 9408 \
        --p_obj 0.7 \
        --p_verb 0.7 \
        --lr 5e-6 \
        --lr_backbone 5e-7 \
        --use_nms_filter
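
(For anyone wondering what --obj_reweight/--verb_reweight together with --queue_size, --p_obj and --p_verb control: the flags suggest per-class loss weights computed from a fixed-size queue of recently seen labels, tempered by an exponent p. The Python sketch below only illustrates that idea; it is not the repository's implementation, and the class name is made up.)

from collections import deque
import torch

class DynamicReweighter:
    """Illustrative queue-based re-weighting (hypothetical, not CDN's code)."""

    def __init__(self, num_classes, queue_size=9408, p=0.7):
        self.queue = deque(maxlen=queue_size)   # recent ground-truth class ids
        self.num_classes = num_classes
        self.p = p                              # smoothing exponent (cf. --p_obj/--p_verb)

    def update(self, labels):
        # labels: 1-D LongTensor of class ids seen in the current batch
        self.queue.extend(labels.tolist())

    def weights(self):
        counts = torch.ones(self.num_classes)   # add-one smoothing avoids divide-by-zero
        for c in self.queue:
            counts[c] += 1
        freq = counts / counts.sum()
        w = (1.0 / freq) ** self.p              # rarer classes get larger weights
        return w / w.mean()                     # normalize so the mean weight is 1

Weights of this kind would then scale the per-class object/verb classification losses during the 10-epoch fine-tuning stage.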

@boringwar

@zhangaixi2008 It does not work for me either; the re-weighting fine-tuning leads to a performance drop.
Details are as follows, using the given script to train on HICO-DET:

CDN-S:

  • best in the first 90 epochs: 31.71
  • fine-tuning degrades it to 30.96

CDN-B:

  • best in the first 90 epochs: 31.6
  • fine-tuning degrades it to 30.6

I'm re-running the CDN-S fine-tuning with the script above.

@zhangaixi2008

@Haak0 Please upload your model here and let me have a look.

@boringwar

@zhangaixi2008 Hi, some of my checkpoints were overwritten. I am re-running the experiments.

@boringwar

Hi, I re-did the experiments, and here are the logs.
CDN small:

  • best in the first 90 epochs: 30.99
  • fine-tuning degrades it to 30.3
    Here's the script and log: small.txt

CDN base:

  • best in the first 90 epochs: 31.98
  • fine-tuning degrades it to 30.6
    Here's the script and log: base.txt

@zhangaixi2008

Hi, I made a mistake in the previous README for the fine-tuning step. Please use the script I provided above in this issue. As stated in the paper, we use a small learning rate to fine-tune the first-stage model: lr 5e-6 and lr_backbone 5e-7 for bs=8, or lr 1e-5 and lr_backbone 1e-6 for bs=16. Please try again and let us see the results.
Sorry for our carelessness; we have already updated the README.
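
(For reference, the two learning rates map to the usual DETR-style parameter groups, roughly as in the sketch below for bs=8; this is an illustration of the setting, not the repository's exact code.)

import torch

def build_finetune_optimizer(model, lr=5e-6, lr_backbone=5e-7, weight_decay=1e-4):
    # Non-backbone parameters train with --lr, backbone parameters with --lr_backbone.
    param_groups = [
        {"params": [p for n, p in model.named_parameters()
                    if "backbone" not in n and p.requires_grad], "lr": lr},
        {"params": [p for n, p in model.named_parameters()
                    if "backbone" in n and p.requires_grad], "lr": lr_backbone},
    ]
    return torch.optim.AdamW(param_groups, lr=lr, weight_decay=weight_decay)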

@boringwar

@zhangaixi2008 Hi, I reproduced the fine-tuning result following your script, and the result is reasonable.
CDN-base:

  • first 90 epochs: 32.05
  • fine-tune: 32.12
    All results are evaluated with the Python script.

BTW, what is the meaning of "vis_tag" in hico_eval.py?

@zhangaixi2008

For CDN-base, you have already surpassed the results reported in our paper (official MATLAB eval 31.78, Python eval 31.86). Good job ^^
For "vis_tag", you can look at the evaluation script. In short, we filter out ground-truth HOIs that have already been matched when counting TPs and FPs during evaluation.
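
(In other words, during mAP computation each ground truth can be matched at most once: a prediction whose candidate ground truth has already been claimed by a higher-scoring prediction counts as a false positive. A minimal sketch of that bookkeeping, with illustrative names only:)

def assign_tp_fp(predictions, gts, match_fn):
    # predictions: predicted HOIs of one category, sorted by confidence (descending)
    # gts:         ground-truth HOIs of the same category
    # match_fn(pred, gt) -> True if the pair satisfies the matching criteria
    visited = [False] * len(gts)      # the "vis_tag": one flag per ground truth
    tp, fp = [], []
    for pred in predictions:
        hit = None
        for i, gt in enumerate(gts):
            if not visited[i] and match_fn(pred, gt):
                hit = i
                break
        if hit is None:
            tp.append(0); fp.append(1)
        else:
            visited[hit] = True       # this ground truth cannot be matched again
            tp.append(1); fp.append(0)
    return tp, fp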

@YueLiao (Owner) commented Dec 3, 2021

The issue with the re-weighting module seems to be resolved. If you run into any other problems, feel free to open a new issue.
