
Reproducing results #41

Closed · MauritsBleeker opened this issue Jun 3, 2021 · 5 comments

MauritsBleeker commented Jun 3, 2021

Hi,

First of all, thanks for sharing this great work!

I'm having difficulties reproducing the results from the paper as a baseline. In this issue, I'll focus on experiment #3.15: VSE++ (ResNet), Flickr30k.

From what I gather from the paper, the config is the following:

  • 30 epochs
  • load images from disk, no precomputed features?
  • lower the lr after 15 epochs
  • lr goes from 0.0002 -> 0.00002 (see the sketch after this list)
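
To check my reading of that schedule, here is a minimal sketch (the helper below is hypothetical, not the repo's code), assuming a plain step decay: start at lr = 2e-4 and divide by 10 after epoch 15, for 30 epochs total.

import torch

model = torch.nn.Linear(10, 10)  # stand-in for the VSE++ model
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

def adjust_learning_rate(optimizer, epoch, base_lr=2e-4, lr_update=15):
    # Decay the learning rate by 10x every lr_update epochs.
    lr = base_lr * (0.1 ** (epoch // lr_update))
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

for epoch in range(30):
    adjust_learning_rate(optimizer, epoch)
    # ... one training epoch: 2e-4 for epochs 0-14, 2e-5 afterwards ...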

My question is: is the image encoder trained end-to-end or not? In other words, is ResNet152 used only as a fixed feature extractor, or is it optimized as well?

According to your documentation, VSE++ (and therefore, I assume, 3.14) can be reproduced by only adding the --max_violation flag, but I get (way) lower results. Do I need the --finetune flag as well?
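
For context, this is what I understand --max_violation to switch on, per the paper: the max-of-hinges (MH) triplet loss over the hardest in-batch negatives. A sketch of my understanding, not the repo's exact implementation:

import torch

def contrastive_loss(im, s, margin=0.2, max_violation=True):
    # im, s: L2-normalized image and caption embeddings, both [B, D].
    # scores[i, j] = sim(image i, caption j); the diagonal holds positives.
    scores = im @ s.t()
    diagonal = scores.diag().view(-1, 1)
    d1 = diagonal.expand_as(scores)      # positive score per image row
    d2 = diagonal.t().expand_as(scores)  # positive score per caption column

    cost_s = (margin + scores - d1).clamp(min=0)   # caption negatives
    cost_im = (margin + scores - d2).clamp(min=0)  # image negatives

    # Zero out the positive pairs on the diagonal.
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_s = cost_s.masked_fill(mask, 0)
    cost_im = cost_im.masked_fill(mask, 0)

    if max_violation:
        # MH loss: keep only the hardest negative per positive pair;
        # otherwise all negatives are summed (the SH loss).
        cost_s = cost_s.max(1)[0]
        cost_im = cost_im.max(0)[0]
    return cost_s.sum() + cost_im.sum()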

Thanks,
Maurits


fartashf commented Jun 4, 2021

Hi Maurits,

I just reproduced the results for row 3.15 (Flickr30k ResNet without finetune) using the following command:

python train.py --logger_name runs/X --data_name f30k --cnn_type resnet152 --max_violation --num_epochs 30 --lr_update 15

The setup is PyTorch 1.4.0 and Python 3.7.1, using the changes in the pytorch4.1 and python3 branches. The final result as printed is:

Image to text: 43.8, 72.4, 81.8, 2.0, 13.4
Text to image: 31.6, 59.6, 69.7, 3.0, 26.6
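
For reference, the five printed numbers are R@1, R@5, R@10, median rank, and mean rank. A simplified sketch of how such recall metrics fall out of a similarity matrix, assuming one ground-truth caption per image (the actual evaluation handles five captions per Flickr30k image):

import numpy as np

def i2t_recall(sims):
    # sims[i, j] = similarity of image i and caption j; caption i is
    # taken to be the single ground-truth match for image i.
    ranks = np.empty(len(sims))
    for i, row in enumerate(sims):
        order = np.argsort(row)[::-1]          # captions by decreasing score
        ranks[i] = np.where(order == i)[0][0]  # 0-based rank of the match
    r = lambda k: 100.0 * np.mean(ranks < k)
    # R@1, R@5, R@10, median rank, mean rank (ranks reported 1-based)
    return r(1), r(5), r(10), np.median(ranks) + 1, ranks.mean() + 1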

Were you running the same command as above? How big is the gap?

MauritsBleeker commented

Yeah, I use the same command.

Do you use a random seed? I use different Python and Torch versions, but that shouldn't make that much of a difference, right? I will share my results later today.

Thanks again,

Maurits


fartashf commented Jun 4, 2021

The seed is not fixed, sorry; unfortunately, I had not done that in the original code and did not report standard deviations. Nevertheless, I don't expect the std to be higher than 1%.
Make sure the experiment runs for the entire length of training: the recall at the end of the first few epochs of VSE++ is near zero, but it picks up quickly.
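
If you do want to pin it down for a side-by-side check, standard PyTorch seeding along these lines should work (this is not in the original code, and full cuDNN determinism can slow training):

import random
import numpy as np
import torch

def set_seed(seed=42):
    # Seed every RNG the training loop touches.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade speed for determinism in cuDNN.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False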

MauritsBleeker commented

Okay, I've managed to reproduce the results (finally). I still don't know what the problem was in the end.

Thanks for your feedback,

Maurits

Average i2t Recall: 66.9
Image to text: 45.1 73.5 82.0 2.0 11.7

Average t2i Recall: 55.7
Text to image: 32.8 62.3 72.2 3.0 21.9


fartashf commented Jun 5, 2021

Sounds great. Thanks for reporting the result.

fartashf closed this as completed Jun 5, 2021