Is an image worth five sentences? A new look into semantics of image-text matching - WACV 2022

furkanbiten/ncs_metric

Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching

Welcome! In this repo, you can find the code for our paper published in WACV 2022.

Adaptive Margin Model!

The main idea of this metric is to use the classic image captioning metrics CIDEr or SPICE to better evaluate retrieval models.

The code for using our metric as an adaptive margin can be found at https://github.com/andrespmd/semantic_adaptive_margin. Roughly, we want to take into account the effect of the NON-GROUND TRUTH items among the top-k retrieved items to better evaluate what our models do.
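To make the idea concrete, here is a rough toy sketch of separating ground-truth hits from non-ground-truth items in a top-k list and scoring the latter with a semantic relevance value. It is not the NCS formula from the paper: the function names, the toy numbers, and the simple averaging are all assumptions made for illustration, and the relevance scores stand in for whatever CIDEr or SPICE would give.

```python
def split_top_k(sim_row, gt_captions, k):
    """Return (ground-truth hits, non-ground-truth items) among the top-k captions."""
    # Rank caption indices by descending similarity to the image.
    ranked = sorted(range(len(sim_row)), key=lambda j: sim_row[j], reverse=True)
    top_k = ranked[:k]
    hits = [j for j in top_k if j in gt_captions]
    non_gt = [j for j in top_k if j not in gt_captions]
    return hits, non_gt


def mean_non_gt_relevance(non_gt, relevance_row):
    """Average a (hypothetical) semantic relevance score over non-GT items."""
    if not non_gt:
        return 0.0
    return sum(relevance_row[j] for j in non_gt) / len(non_gt)


# Toy example: one image, six captions; captions 0 and 1 are its ground truth.
sim_row = [0.9, 0.2, 0.8, 0.7, 0.1, 0.3]
gt_captions = {0, 1}
# Made-up CIDEr/SPICE-style relevance of each caption to this image.
relevance_row = [1.0, 1.0, 0.5, 0.25, 0.0, 0.125]

hits, non_gt = split_top_k(sim_row, gt_captions, k=3)
score = mean_non_gt_relevance(non_gt, relevance_row)
print(hits, non_gt, score)
```

A plain Recall@3 would only see the single hit here; the two non-ground-truth captions in the top 3 still carry some semantic relevance, which is the effect we want to capture.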

Now, this repo is divided into two main sections. The former is for the curious, while the latter is for the pragmatists!

To those who are curious! (How did we do it?)

First off, we had to change the code of SPICE to save all the pairwise distances.

If you would like to compile from scratch or would like to see the changes we made to SPICE, please check the submodule!

Here is the link to download the compiled version: SPICE.zip. After downloading, unzip the file and run python get_stanford models and then run

java -Xmx8G -jar spice-1.0.jar ./example.json

to see if it works. This should result in a file called spice_pairwise.csv.

Now, to obtain the pairwise distances of captions with CIDEr, we run:

python custom_cider.py --dataset [coco/f30k]

To obtain these distances we used MSCOCO and Flickr30k; both are available for you to download. We run these commands to precompute all the pairwise distances, which reduces the time it takes to run the NCS metric.

To those who are impatient! (I just wanna use the metric and nothing more!)

You are a pragmatist and just wanna use the code (I feel you!). Download the precomputed pairwise distances here.

As a format, we expect a similarity matrix saved as JSON, where each row is an image and each column is a sentence. For example, for Flickr30k, the matrix would have dimensions of 1000x5000: 1000 images, 5000 sentences. The choice of distance metric doesn't matter; you can use anything. As an example of the format, we provide the similarity matrices of some models.
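As a rough sketch of what such a file could look like, the snippet below writes and reloads a small images-by-sentences matrix. It assumes the JSON is a plain nested list (one inner list per image); the helper names and the file name are hypothetical, so check the provided example matrices for the exact layout that eval.py expects.

```python
import json
import os
import random
import tempfile


def save_similarity_matrix(sims, path):
    """Write an images-by-sentences matrix as a nested JSON list (assumed layout)."""
    with open(path, "w") as f:
        json.dump(sims, f)


def load_similarity_matrix(path):
    """Read the matrix back as a list of rows, one row per image."""
    with open(path) as f:
        return json.load(f)


# Small stand-in sizes; the Flickr30k case described above would be 1000 x 5000.
n_images, n_sentences = 4, 20

# Placeholder scores; in practice these come from your retrieval model.
random.seed(0)
sims = [[random.random() for _ in range(n_sentences)] for _ in range(n_images)]

path = os.path.join(tempfile.mkdtemp(), "sims_example.json")  # hypothetical file name
save_similarity_matrix(sims, path)
loaded = load_similarity_matrix(path)

print(len(loaded), len(loaded[0]))
```

If your model produces a NumPy array or a tensor, converting it to nested lists first (for example with `.tolist()`) gives the same structure.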

Finally, to get the results, just run:

python eval.py --dataset [coco/f30k] --metric_name [spice/cider] --model_path [ThePathToSimilarityMatrix]

There are more options available; you can read about them inside the code.

Conclusion

To err is human.
