
dynamic reweighting causes performance degradation in reproducing #4

Closed · Charles-Xie opened this issue Oct 28, 2021 · 13 comments

@Charles-Xie

Hi,
thanks for sharing the code! Great work!

I have a small question about reproducing your results.
I ran the CDN-S model (res50, 3+3) twice. It gave results of about 31.5 and 31.2 after the first training stage (training the whole model with the regular loss). But after the second training stage (decoupled training) finished, the performance dropped to 31.0 and 30.4 for those two runs respectively. For full mAP, rare mAP and non-rare mAP, this trick does not seem to help.

So I wonder what could have gone wrong in my reproduction, or what the reason might be. I will paste the commands and logs below. Thanks. Nice day :3

@Charles-Xie (Author) commented Oct 28, 2021

Command (exactly the same as the one provided in the README, except the output_dir):

python3 -m torch.distributed.launch \
        --nproc_per_node=8 \
        --use_env \
        main.py \
        --pretrained params/detr-r50-pre-2stage-q64.pth \
        --output_dir logs/base \
        --dataset_file hico \
        --hoi_path data/hico_20160224_det \
        --num_obj_classes 80 \
        --num_verb_classes 117 \
        --backbone resnet50 \
        --num_queries 64 \
        --dec_layers_hopd 3 \
        --dec_layers_interaction 3 \
        --epochs 90 \
        --lr_drop 60 \
        --use_nms_filter

python3 -m torch.distributed.launch \
        --nproc_per_node=8 \
        --use_env \
        main.py \
        --pretrained logs/base/checkpoint_last.pth \
        --output_dir logs/base \
        --dataset_file hico \
        --hoi_path data/hico_20160224_det \
        --num_obj_classes 80 \
        --num_verb_classes 117 \
        --backbone resnet50 \
        --num_queries 64 \
        --dec_layers_hopd 3 \
        --dec_layers_interaction 3 \
        --epochs 10 \
        --freeze_mode 1 \
        --obj_reweight \
        --verb_reweight \
        --use_nms_filter

echo "base"

Corresponding result (log): 31.5 after the 1st training stage and 31.0 after the 2nd training stage:
log.txt
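
(Side note on what --freeze_mode 1 in the second command does: it switches to the decoupled fine-tuning stage, which presumably keeps most of the first-stage weights frozen and only updates part of the network with the re-weighted losses. A rough PyTorch sketch of that freeze-then-fine-tune pattern is below; the function name and the "interaction" keyword are hypothetical, not the repository's actual code.)

import torch

def build_stage2_optimizer(model, trainable_keywords=("interaction",), lr=5e-6):
    # Freeze every parameter first (the first-stage weights stay fixed).
    for p in model.parameters():
        p.requires_grad = False
    # Unfreeze only parameters whose names contain the chosen keywords,
    # e.g. the branch that the re-weighted losses are meant to fine-tune.
    trainable = []
    for name, p in model.named_parameters():
        if any(k in name for k in trainable_keywords):
            p.requires_grad = True
            trainable.append(p)
    return torch.optim.AdamW(trainable, lr=lr)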

@Charles-Xie (Author)

For the 2nd run, the command (exactly the same as the one provided in the README, except the output_dir):

python3 -m torch.distributed.launch \
        --nproc_per_node=8 \
        --use_env \
        main.py \
        --pretrained params/detr-r50-pre-2stage-q64.pth \
        --output_dir logs/base_4worker \
        --dataset_file hico \
        --hoi_path data/hico_20160224_det \
        --num_obj_classes 80 \
        --num_verb_classes 117 \
        --backbone resnet50 \
        --num_queries 64 \
        --dec_layers_hopd 3 \
        --dec_layers_interaction 3 \
        --epochs 90 \
        --lr_drop 60 \
        --use_nms_filter \
        --num_workers 4

python3 -m torch.distributed.launch \
        --nproc_per_node=8 \
        --use_env \
        main.py \
        --pretrained logs/base_4worker/checkpoint_last.pth \
        --output_dir logs/base_4worker \
        --dataset_file hico \
        --hoi_path data/hico_20160224_det \
        --num_obj_classes 80 \
        --num_verb_classes 117 \
        --backbone resnet50 \
        --num_queries 64 \
        --dec_layers_hopd 3 \
        --dec_layers_interaction 3 \
        --epochs 10 \
        --freeze_mode 1 \
        --obj_reweight \
        --verb_reweight \
        --use_nms_filter \
        --num_workers 4

echo "base_4worker"

Corresponding result (log): 31.2 after the 1st training stage and 30.4 after the 2nd training stage:
log.txt

@YueLiao (Owner) commented Oct 28, 2021

This module was implemented by @zhangaixi2008, and he will reply to you later.

@YueLiao (Owner) commented Oct 28, 2021

(Quoting the first run above: 31.5 after the 1st training stage and 31.0 after the 2nd training stage, log.txt)

Aha, 31.5%, a new SOTA with CDN-S.

@zhangaixi2008 commented Oct 28, 2021

You can try the following command:

python -m torch.distributed.launch \
        --master_port 10026 \
        --nproc_per_node=4 \
        --use_env \
        main.py \
        --pretrained logs/base/checkpoint_last.pth \
        --output_dir logs/base \
        --hoi \
        --dataset_file hico \
        --hoi_path data/hico_20160224_det \
        --num_obj_classes 80 \
        --num_verb_classes 117 \
        --backbone resnet50 \
        --set_cost_bbox 2.5 \
        --set_cost_giou 1 \
        --bbox_loss_coef 2.5 \
        --giou_loss_coef 1 \
        --num_queries 64 \
        --dec_layers_stage1 3 \
        --dec_layers_stage2 3 \
        --epochs 10 \
        --freeze_mode 1 \
        --obj_reweight \
        --verb_reweight \
        --queue_size 9408 \
        --p_obj 0.7 \
        --p_verb 0.7 \
        --lr 5e-6 \
        --lr_backbone 5e-7 \
        --use_nms_filter
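
(For anyone wondering what --obj_reweight/--verb_reweight together with --queue_size, --p_obj and --p_verb control: the flags suggest per-class loss weights computed from a fixed-size queue of recently seen labels, tempered by an exponent p. The Python sketch below only illustrates that idea; it is not the repository's implementation, and the class name is made up.)

from collections import deque
import torch

class DynamicReweighter:
    """Illustrative queue-based re-weighting (hypothetical, not CDN's code)."""

    def __init__(self, num_classes, queue_size=9408, p=0.7):
        self.queue = deque(maxlen=queue_size)   # recent ground-truth class ids
        self.num_classes = num_classes
        self.p = p                              # smoothing exponent (cf. --p_obj/--p_verb)

    def update(self, labels):
        # labels: 1-D LongTensor of class ids seen in the current batch
        self.queue.extend(labels.tolist())

    def weights(self):
        counts = torch.ones(self.num_classes)   # add-one smoothing avoids divide-by-zero
        for c in self.queue:
            counts[c] += 1
        freq = counts / counts.sum()
        w = (1.0 / freq) ** self.p              # rarer classes get larger weights
        return w / w.mean()                     # normalize so the mean weight is 1

Weights of this kind would then scale the per-class object/verb classification losses during the 10-epoch fine-tuning stage.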

@boringwar

@zhangaixi2008 It does not work for me either; the re-weighting fine-tuning leads to a performance drop.
Details are as follows, using the given script to train on HICO-DET:

CDN-S:

  • best in the first 90 epochs: 31.71
  • fine-tuning degrades it to 30.96

CDN-B:

  • best in the first 90 epochs: 31.6
  • fine-tuning degrades it to 30.6

I'm re-running the CDN-S fine-tuning with the script above.

@zhangaixi2008

@Haak0 Please upload your model here and let me have a look.

@boringwar

@zhangaixi2008 Hi, some of my checkpoints were overwritten. I am re-running the experiments.

@boringwar

Hi, I re-did the experiments, and here are the logs.
CDN small:

  • best in the first 90 epochs: 30.99
  • fine-tuning degrades it to 30.3
    Here's the script and log: small.txt

CDN base:

  • best in the first 90 epochs: 31.98
  • fine-tuning degrades it to 30.6
    Here's the script and log: base.txt

@zhangaixi2008

Hi, I made a mistake in the previous README for the fine-tuning step. Please use the script I provided above in this issue. As stated in the paper, we use a small learning rate to fine-tune the first-stage model: lr 5e-6 and lr_backbone 5e-7 for bs=8, or lr 1e-5 and lr_backbone 1e-6 for bs=16. Please try again and let us see the results.
Sorry for our carelessness; we have already updated the README.
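
(For reference, the two learning rates map to the usual DETR-style parameter groups, roughly as in the sketch below for bs=8; this is an illustration of the setting, not the repository's exact code.)

import torch

def build_finetune_optimizer(model, lr=5e-6, lr_backbone=5e-7, weight_decay=1e-4):
    # Non-backbone parameters train with --lr, backbone parameters with --lr_backbone.
    param_groups = [
        {"params": [p for n, p in model.named_parameters()
                    if "backbone" not in n and p.requires_grad], "lr": lr},
        {"params": [p for n, p in model.named_parameters()
                    if "backbone" in n and p.requires_grad], "lr": lr_backbone},
    ]
    return torch.optim.AdamW(param_groups, lr=lr, weight_decay=weight_decay)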

@boringwar

@zhangaixi2008 Hi, I reproduced the fine-tuning result following your script, and the result is reasonable.
CDN-base:

  • first 90 epochs: 32.05
  • fine-tune: 32.12
    All results are evaluated with the Python script.

BTW, what is the meaning of "vis_tag" in hico_eval.py?

@zhangaixi2008

For CDN-base, you have already surpassed the results reported in our paper (official MATLAB eval 31.78, Python eval 31.86). Good job ^^
For "vis_tag", you can look at the evaluation script. In short, we filter out ground-truth HOIs that have already been matched when counting TPs and FPs during evaluation.
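
(In other words, during mAP computation each ground truth can be matched at most once: a prediction whose candidate ground truth has already been claimed by a higher-scoring prediction counts as a false positive. A minimal sketch of that bookkeeping, with illustrative names only:)

def assign_tp_fp(predictions, gts, match_fn):
    # predictions: predicted HOIs of one category, sorted by confidence (descending)
    # gts:         ground-truth HOIs of the same category
    # match_fn(pred, gt) -> True if the pair satisfies the matching criteria
    visited = [False] * len(gts)      # the "vis_tag": one flag per ground truth
    tp, fp = [], []
    for pred in predictions:
        hit = None
        for i, gt in enumerate(gts):
            if not visited[i] and match_fn(pred, gt):
                hit = i
                break
        if hit is None:
            tp.append(0); fp.append(1)
        else:
            visited[hit] = True       # this ground truth cannot be matched again
            tp.append(1); fp.append(0)
    return tp, fp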

@YueLiao (Owner) commented Dec 3, 2021

The issue with the re-weighting module seems to be resolved. If you run into any other problems, feel free to open a new issue.
