questions about MSVD #11

Open
xixiareone opened this issue May 25, 2020 · 12 comments

@xixiareone

The article mentions: "where they randomly chose 5 ground-truth sentences per video. We use the same setting when we compare with that approach". Does this mean that the training, validation and test sets each take only 5 randomly chosen sentences per video?
In other words, not all of the sentences are used in the training, validation and test sets?

Thank you very much!

@albanie
Owner

albanie commented May 25, 2020

Hi @xixiareone, do you have a pointer to that sentence (e.g. in which section in the article it appears)? Thanks! For reference, in our setting, we train by randomly sampling from all possible training captions, then we test each sentence query independently.

@xixiareone
Author

Sorry, I didn't explain the source. I came across this while reading other articles.

So I would like to ask: in the MSVD dataset, especially in the test phase, do you evaluate all sentences, or do you randomly select just 5 sentences per video for evaluation?

@albanie
Owner

albanie commented May 25, 2020

No worries! In the test phase, all sentences are used (independently). The evaluation we use was based on the protocol used here: https://github.com/niluthpol/multimodal_vtt

@xixiareone
Author

Eh, I'm a little confused.

For example, this line in https://github.com/niluthpol/multimodal_vtt:

npts = videos.shape[0] / 20

Here, 20 means that each video corresponds to 20 descriptions, which is the case for the MSR-VTT dataset.
But in the MSVD dataset the number of descriptions per video varies. How do you deal with each sentence independently in that case?
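
My rough understanding of what that line assumes (just a toy sketch; the shapes and feature size below are made up, not taken from that repo's code):

import numpy as np

# In the MSR-VTT evaluation, the video feature matrix appears to be stored
# with one row per caption (each video's features repeated 20 times), so
# dividing by the fixed caption count recovers the number of unique videos.
captions_per_video = 20                                     # fixed for MSR-VTT
videos = np.random.randn(670 * captions_per_video, 1024)    # toy feature matrix
npts = videos.shape[0] // captions_per_video                # 670 unique videos
unique_video_feats = videos[::captions_per_video]           # every 20th row is a new video
print(npts, unique_video_feats.shape)                       # 670 (670, 1024)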

This problem has bothered me for a long time. Thank you very much!

@albanie
Owner

albanie commented May 26, 2020

I agree it's confusing! I've summarised below my understanding of the evaluation protocols for MSVD.

Design choices

There are two choices to be made for datasets (like MSVD) that contain a variable number of sentences per video:

  1. (Assignment) Which sentences should be assigned to each video (i.e. whether they should be subsampled to a fixed number per video, or whether all available sentences should be used?)
  2. (Evaluation) How should the system be evaluated for retrieval performance when multiple sentences are assigned to each video (for example, should multiple sentences be used together to retrieve the target video, or should they be used independently of the knowledge of other description assignments for the same video?)

Previous works using MSVD

When reporting numbers in our paper, I looked at the following papers using MSVD for retrieval, to try to understand the different protocols.

  1. Learning Joint Representations of Videos and Sentences with Web Image Search
  2. Predicting Visual Features from Text for Image and Video Caption Retrieval
  3. Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval
  4. Dual Encoding for Zero-Example Video Retrieval

I've added my notes below. Direct quotes from each paper are written in quotation marks.


Learning Joint Representations of Videos and Sentences with Web Image Search

Assignment: "We first divided the dataset into 1,200, 100, and 670 videos for training, validation, and test, respectively, as in [35,34,11]. Then, we extracted five-second clips from each original video in a sliding-window manner. As a result, we obtained 8,001, 628, and 4,499 clips for the training, validation, and test sets, respectively. For each clip, we picked five ground truth descriptions out of those associated with its original video."

Evaluation: (Retrieving videos with text) "Given a video and a query sentence, we extracted five-second video clips from the video and computed Euclidean distances from the query to the clips. We used their median as the distance of the original video and the query. We ranked the videos based on the distance to each query and recorded the rank of the ground truth video." (Retrieving text with videos) "We computed the distances between a sentence and a query video in the same way as the video retrieval task. Note that each video has five ground truth sentences; thus, we recorded the highest rank among them. The test set has 3,500 sentences."

Summary: Taken together, my interpretation is that the authors first randomly assign 5 sentences per video before any experiments are done. They then break the videos into clips and further randomly assign five sentences to each clip (i.e. sampling with replacement from the initial pool of 5 sentences that were assigned to the video). Since the test set has 670 videos and 670 * 5 = 3350, this approximately lines up with the comment that the test set has 3,500 sentences. In terms of evaluation, when retrieving videos with text, each query is performed and evaluated independently.
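
A rough sketch of my reading of this protocol (purely illustrative; the toy numbers and variable names below are mine, not the paper's code):

import numpy as np

# Text -> video: a query sentence's distance to a video is the median of its
# Euclidean distances to that video's five-second clips; videos are then
# ranked by this distance and the rank of the ground-truth video is recorded.
clip_dists = [np.array([0.9, 1.1]),          # distances to the clips of video 0
              np.array([0.3, 0.5, 0.4]),     # video 1 (the ground truth here)
              np.array([0.8])]               # video 2
video_dists = np.array([np.median(d) for d in clip_dists])
order = np.argsort(video_dists)              # ascending distance = best first
gt_rank = int(np.where(order == 1)[0][0]) + 1
print(gt_rank)                               # 1

# Video -> text: each video has 5 ground-truth sentences, and the recorded
# rank is the best (minimum) rank achieved by any of them.
sentence_ranks = [7, 2, 15, 4, 9]            # toy ranks of the 5 ground truths
print(min(sentence_ranks))                   # 2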


Predicting Visual Features from Text for Image and Video Caption Retrieval

Assignment: "For the ease of cross-paper comparison, we follow the identical data partitions as used in [5], [7], [58] for images and [60] for videos. That is, training / validation / test is 6k / 1k / 1k for Flickr8k, 29K / 1,014 / 1k for Flickr30k, and 1,200 / 100 / 670 for MSVD." (The reference [60] here refers to Translating videos to natural language using deep recurrent neural network).

Evaluation: "The training, validation and test set are used for model training, model selection and performance evaluation, respectively, and exclusively. For performance evaluation, each test caption is first vectorized by a trained Word2VisualVec. Given a test image/video query, we then rank all the test captions in terms of their similarities with the image/video query in the visual feature space. The performance is evaluated based on the caption ranking."

Summary: The paper doesn't mention that they perform subsampling to five captions per video, so it's probably safe to assume that they don't. The evaluation code they have made available for image/text retrieval does assume a fixed number of captions per image (with a default value of 5), but (as of 26/05/20) the MSVD code is not available (I guess this is what you were asking about here) and the comment by the author in that issue implies that all sentences (rather than just 5) are used.


Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval

Assignment: "For a fair comparison, we used the same splits utilized in prior works [32], with 1200 videos for training, 100 videos for validation, and 670 videos for testing. The MSVD dataset is also used in [24] for video-text retrieval task, where they randomly chose 5 ground-truth sentences per video. We use the same setting when we compare with that approach." (Note the references here are: [24] Learning Joint Representations of Videos and Sentences with Web Image Search, [32] Sequence to Sequence – Video to Text.)

Evaluation: The code given here implements the evaluation protocol described in Learning Joint Representations of Videos and Sentences with Web Image Search.

Summary: Retrieval performance is reported on two splits in Table 2 and Table 3 of the paper (one described as LJRV, the other as JMET and JMDV). My interpretation was that the LJRV split refers to the practice of sampling 5 descriptions per video, and that the JMET & JMDV split refers to using all captions from a video (since this is what is used in Predicting Visual Features from Text for Image and Video Caption Retrieval, and this number is reported in the table).


Dual Encoding for Zero-Example Video Retrieval

Assignment: I wasn't quite able to determine the splits from the paper, but the comment here suggests that the test set remains the same as each of the other papers above (670 videos).

Evaluation: This is performed in a zero-shot setting.

Summary: This work is by the same author as Predicting Visual Features from Text for Image and Video Caption Retrieval, so in the absence of extra comments in the paper, it's probably reasonable to assume that the same protocol is used in both works (using all descriptions per video).


Use What You Have: Video Retrieval Using Representations From Collaborative Experts

Assignment: As with all four papers above, this repo uses the 1200, 100, 670 split between train, val and test. It uses all captions associated with each video (sampling one caption per video randomly during training).

Evaluation: (Retrieving videos with text) Each query retrieval is evaluated independently of the others, and all test set queries (i.e. more than 5 per video) are used. (Retrieving text with videos) As with the other papers above, if a video has multiple descriptions, we evaluate each independently and then take the minimum rank (this is what I was referring to in my comment above when I said that we used the same protocol as Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval).
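
To make the video-to-text direction concrete, here is a minimal sketch of the min-rank idea (illustrative only, not the exact code used in this repo):

import numpy as np

# Rank all test captions against one video (higher similarity = better) and
# record the best (minimum) rank achieved by any caption belonging to it.
def v2t_min_rank(sims_for_video, caption_video_ids, video_id):
    order = np.argsort(-sims_for_video)                    # best caption first
    gt_captions = np.where(caption_video_ids == video_id)[0]
    positions = np.where(np.isin(order, gt_captions))[0]   # 0-based positions of its captions
    return positions.min() + 1                             # 1-based best rank

# toy example: 6 captions over 2 videos, querying with video 0
sims_for_video = np.array([0.1, 0.3, 0.2, 0.9, 0.7, 0.8])
caption_video_ids = np.array([0, 0, 0, 1, 1, 1])
print(v2t_min_rank(sims_for_video, caption_video_ids, video_id=0))  # 4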


Summary

My interpretation is that:

  1. Learning Joint Representations of Videos and Sentences with Web Image Search and the LJRV split reported by Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval use the same protocol: randomly sampling 5 descriptions per video before any experiments are run, then using these same descriptions throughout.
  2. Predicting Visual Features from Text for Image and Video Caption Retrieval, Dual Encoding for Zero-Example Video Retrieval, the JMET and JMDV split reported by Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval and Use What You Have: Video Retrieval Using Representations From Collaborative Experts use all descriptions for each video.
  3. For all protocols used above, when retrieving text with videos the evaluation is "optimistic" in the sense that the highest rank among the possible retrieved candidates is used.

If the specific five captions that were sampled for each video (used for the LJRV split) are still available, I will implement this as an additional protocol for our codebase so that comparisons can also be made under this setting (I will follow up with the authors by email to find out).

Either way, an important takeaway is that the protocols are different. Thanks a lot for drawing attention to this issue!

@xixiareone
Author

xixiareone commented May 26, 2020 via email

@albanie
Owner

albanie commented May 26, 2020

Hi @xixiareone,

81 is the maximum number of sentences for a single video used in MSVD. For efficiency, we compute a similarity matrix with a fixed number of sentences per video (we use 81 since this corresponds to the maximum number of sentences assigned to any individual video). We then mask out all the positions that correspond to sentences that are not present (for videos that have fewer than 81 captions) so that they do not affect the evaluation.
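
Here is a small sketch of the padding-and-masking idea (the shapes and variable names are illustrative, not taken directly from our code):

import numpy as np

max_caps = 81                                  # most captions for any MSVD test video
caps_per_vid = [2, 81, 5]                      # toy caption counts for 3 videos
num_vids = len(caps_per_vid)

# similarity matrix with a fixed number of caption slots per video
sims = np.random.randn(num_vids * max_caps, num_vids)       # (caption slots, videos)

# mask out the padded slots so they cannot affect the evaluation
mask = np.zeros(num_vids * max_caps, dtype=bool)
for v, n in enumerate(caps_per_vid):
    mask[v * max_caps : v * max_caps + n] = True            # real caption slots only

valid_sims = sims[mask]                                      # drop padded rows
gt_video = np.repeat(np.arange(num_vids), caps_per_vid)      # ground truth per real caption
order = np.argsort(-valid_sims, axis=1)                      # best video first
ranks = np.argmax(order == gt_video[:, None], axis=1) + 1    # one rank per caption query
print(ranks.shape)                                           # (88,)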

I'm sorry, I didn't quite understand the second part of your question?

@xixiareone
Author

xixiareone commented May 26, 2020 via email

@albanie
Owner

albanie commented May 26, 2020

Yes, for MSVD, that's correct.

@xixiareone
Author

xixiareone commented May 26, 2020 via email

@xixiareone
Author

xixiareone commented May 28, 2020 via email

@albanie
Owner

albanie commented May 31, 2020

No problem! Do you have a reference for where this model checkpoint comes from?
The example given on the main README is:

# fetch the pretrained experts for MSVD 
python3 misc/sync_experts.py --dataset MSVD

# find the name of a pretrained model using the links in the tables above 
export MODEL=data/models/msvd-train-full-ce/5bb8dda1/seed-0/2020-01-30_12-29-56/trained_model.pth

# create a local directory and download the model into it 
mkdir -p $(dirname "${MODEL}")
wget --output-document="${MODEL}" "http://www.robots.ox.ac.uk/~vgg/research/collaborative-experts/${MODEL}"

# Evaluate the model
python3 test.py --config configs/msvd/train-full-ce.json --resume ${MODEL} --device 0 --eval_from_training_config

Do these steps fail for you?
