
Flickr30k Finetune results does not match the provided checkpoint #13

Open

JACKHAHA363 opened this issue Jun 15, 2021 · 10 comments

@JACKHAHA363

Hi authors,

I took the provided pretrained 200k checkpoint and fine-tuned it on Flickr30k. The resulting IR and TR scores are 64.5 and 81.7; the TR score is lower than the one in the paper. My fine-tuning command is

$PYTHONBIN run.py with data_root=vilt_dataset/ \
        num_gpus=8 num_nodes=1 task_finetune_irtr_f30k \
        per_gpu_batchsize=4 load_path="weights/vilt_200k.ckpt" \
        exp_name="f30k/finetune_official" 

[Screenshot of fine-tuning results, 2021-06-15]

I also tested the given vilt_irtr_f30k.ckpt and the results are good, with IR=65.3, TR=83.5. Can I ask what the process was for producing vilt_irtr_f30k.ckpt?
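(For reference, a sketch of how such a checkpoint can be evaluated directly, assuming the `test_only=True` override described in the repo's EVAL.md; `data_root` and the checkpoint path are illustrative:)

```shell
# Sketch: evaluate a fine-tuned checkpoint without training
# (assumes the test_only flag from ViLT's EVAL.md; paths are illustrative)
python run.py with data_root=vilt_dataset/ \
        num_gpus=8 num_nodes=1 task_finetune_irtr_f30k \
        per_gpu_batchsize=4 test_only=True \
        load_path="weights/vilt_irtr_f30k.ckpt"
```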

@dandelin
Owner

@JACKHAHA363

The fine-tuning results can be unstable due to augmentations. Also, we trained the IR/TR fine-tuning models only a single time.
You may increase the number of training epochs (more than 10 epochs, maybe 20?) to get more stable and better results.
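(A sketch of that suggestion, assuming `max_epoch` is exposed as a sacred config key in the repo's config.py; the exact key name and value are assumptions:)

```shell
# Sketch: the same fine-tuning run with a longer schedule
# (assumes a max_epoch key in the sacred config; 20 is illustrative)
python run.py with data_root=vilt_dataset/ \
        num_gpus=8 num_nodes=1 task_finetune_irtr_f30k \
        per_gpu_batchsize=4 max_epoch=20 \
        load_path="weights/vilt_200k.ckpt" \
        exp_name="f30k/finetune_20ep"
```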

@JACKHAHA363
Author

I tried training for more epochs, but that ended up overfitting, with increasing validation loss. Would you mind also providing the checkpoint at 100k steps?

@dandelin
Owner

@JACKHAHA363
Sure, you can grab it here: https://www.dropbox.com/s/lcqmbx587szaox3/vilt_100k_wwm_pretrain.ckpt?dl=0 (the link may expire someday)

@JACKHAHA363
Author

thanks @dandelin!

@yangxiaofeng

Were you able to solve this issue, @JACKHAHA363? I have similar issues on both Flickr and COCO retrieval.

@byougert

Hi,
I found that the IR/TR evaluation results on Flickr are still unstable even when using the official fine-tuned checkpoint. Sometimes I get 63.94 (IR) / 83.6 (TR); other times it changes to 64.3 (IR) / 83.7 (TR). What do you think? @dandelin @JACKHAHA363

@byougert

byougert commented Dec 30, 2021 via email

@dandelin
Copy link
Owner

Hi @byougert

Oops, you got the mail. I deleted the comment right after posting it, as I noticed I had put shuffle=False in DistributedSampler(image_dset, shuffle=False).

Though after a quick investigation, I found the true reason.
It was precision=16, set in https://github.com/dandelin/ViLT/blob/master/run.py#L51.
After setting precision=32 during evaluation, I was able to get stable results.

I guess the scores from rank_output are very close together, so they need higher precision.
Thanks for the report; I will revise EVAL.md. :)
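(To illustrate why half precision can destabilize a ranking: float16 has much coarser resolution than float32/64, so two retrieval scores that are close at full precision can collapse to the same value in float16, making the resulting rank order tie-dependent. The score values below are made up for illustration:)

```python
import numpy as np

# Two matching scores that differ by less than float16 resolution
# near 8.0 (float16 spacing there is 2**-7 = 0.0078125)
scores = np.array([8.1234, 8.1239], dtype=np.float64)

print(np.argmax(scores))   # 1 -- full precision keeps the ordering
half = scores.astype(np.float16)
print(half[0] == half[1])  # True -- float16 collapses the gap, so the
                           # ranking of these two items becomes arbitrary
```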

@byougert

Hi,
Yes, I received your message in my mail but couldn't find the reply on GitHub.
Thanks for your reply and nice work.

@byougert

Hi, @dandelin
I'm sorry to say the results still seem puzzling. Last night, when I changed the precision to 32 during evaluation, two similar but NOT identical results appeared: one was 0.6480 (IR) / 0.8370 (TR), but the other was 0.6460 (IR) / 0.8370 (TR).
Actually, the seed is fixed to exactly 0. I have no idea what causes the difference. Y_Y
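(One possible source of such last-digit differences even with a fixed seed: floating-point addition is not associative, so a different reduction order, e.g. from multi-GPU gathers or non-deterministic CUDA kernels, can change the final digits of an aggregated score. A minimal pure-Python illustration:)

```python
# Floating-point addition is not associative: summing the same numbers
# in a different order can produce a slightly different result.
vals = [0.1] * 10

sequential = sum(vals)                   # one fixed left-to-right order
grouped = sum(vals[:5]) + sum(vals[5:])  # a different grouping

print(sequential == grouped)             # False
print(sequential, grouped)               # 0.9999999999999999 1.0
```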
