Can't reproduce results for YouCookII #2

Closed
dzabraev opened this issue May 3, 2020 · 7 comments


dzabraev commented May 3, 2020

I took this model

wget https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/s3d_howto100m.pth
wget https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/s3d_dict.npy

and the code from this repository. I took the validation split of YouCookII and tried to reproduce the numbers reported in the paper End-to-End Learning of Visual Representations from Uncurated Instructional Videos.
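For reference, I load the model following the README of this repository; this is a minimal sketch, assuming the `S3D` class from `s3dg.py` and the README's output keys (`video_embedding`, `text_embedding`):

```python
import torch
from s3dg import S3D  # model definition from this repository

# 512-d joint text-video embedding space, word dictionary from s3d_dict.npy
net = S3D('s3d_dict.npy', 512)
net.load_state_dict(torch.load('s3d_howto100m.pth'))
net.eval()  # keep batch norm statistics fixed during evaluation

# video: [batch, 3, num_frames, height, width], float values in [0, 1]
video = torch.rand(1, 3, 32, 224, 224)
with torch.no_grad():
    video_embedding = net(video)['video_embedding']                              # [1, 512]
    text_embedding = net.text_module(['add oil to the pan'])['text_embedding']   # [1, 512]
```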

[Screenshot: YouCook2 retrieval results table from the paper]

It is unclear which protocol you used for testing. In the table below I show several experiments, and none of them reaches your results. Could you clarify which test protocol you used? It would also be helpful if you published the evaluation script.

What I tried:

  • T: time in seconds. I split each clip into subclips of length T seconds and compute an embedding for each subclip.
  • pooling: if a clip was split into more than one subclip, the subclip embeddings are aggregated with this pooling.
  • imgsize: the short side of each source video is rescaled to imgsize, preserving the aspect ratio; then a center crop is taken from each frame.
  • normalize: whether the sentence embedding and each video embedding are L2-normalized before the dot product.
  • num frames: from each T-second subclip, this many frames are taken, uniformly spaced in time.
  • num resample: for each clip, this many different sets of frames are sampled and an embedding is computed for each; with pooling, all of these embeddings are pooled into a single one. LCR means sampling each clip 3 times: num frames left crops, num frames right crops, num frames center crops.
| T (s) | imgsize (px) | pooling | normalize | num frames | num resample | R@1 | R@5 | R@10 | MedR |
|------:|-------------:|---------|-----------|-----------:|-------------:|----:|----:|-----:|-----:|
| 250 | 200 | max  | False | 32 | 1   | 11.478 | 27.610 | 37.453 | 21  |
| 250 | 224 | max  | False | 32 | 1   | 8.774  | 22.044 | 30.975 | 32  |
| 250 | 256 | max  | False | 32 | 1   | 5.912  | 15.503 | 21.038 | 104 |
| 1.5 | 200 | max  | False | 32 | 1   | 8.333  | 23.208 | 31.981 | 31  |
| 3.2 | 200 | max  | False | 32 | 1   | 9.497  | 24.969 | 34.654 | 24  |
| 8   | 200 | max  | False | 32 | 1   | 10.094 | 25.818 | 35.849 | 23  |
| 16  | 200 | max  | False | 32 | 1   | 10.755 | 26.478 | 36.541 | 21  |
| 32  | 200 | max  | False | 32 | 1   | 11.164 | 27.484 | 37.296 | 21  |
| 64  | 200 | max  | False | 32 | 1   | 11.415 | 27.704 | 37.547 | 21  |
| 128 | 200 | max  | False | 32 | 1   | 11.447 | 27.610 | 37.453 | 21  |
| 250 | 200 | max  | True  | 32 | 1   | 9.906  | 25.031 | 34.748 | 25  |
| 250 | 200 | max  | False | 32 | 2   | 11.604 | 28.270 | 37.987 | 20  |
| 250 | 200 | max  | False | 32 | 3   | 11.918 | 28.396 | 38.333 | 21  |
| 250 | 200 | max  | False | 32 | 4   | 11.509 | 28.082 | 38.365 | 21  |
| 250 | 200 | max  | False | 32 | LCR | 11.384 | 27.138 | 37.704 | 22  |
| 250 | 200 | mean | False | 32 | 4   | 12.075 | 28.805 | 38.459 | 20  |
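The R@K and MedR values above are computed from the text-video similarity matrix; here is a minimal sketch of my own helper (not the authors' script), assuming `text_emb` and `video_emb` are the pooled embeddings with matching rows:

```python
import numpy as np

def retrieval_metrics(text_emb, video_emb):
    # text_emb, video_emb: [N, 512] arrays; row i of both belongs to the same clip
    sims = text_emb @ video_emb.T                      # [N, N] dot-product similarities
    order = np.argsort(-sims, axis=1)                  # video ranking for each caption
    ranks = np.array([np.where(order[i] == i)[0][0] for i in range(len(order))])
    return {
        'R@1':  100.0 * np.mean(ranks < 1),
        'R@5':  100.0 * np.mean(ranks < 5),
        'R@10': 100.0 * np.mean(ranks < 10),
        'MedR': float(np.median(ranks) + 1),           # ranks are 0-based here
    }
```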
antoine77340 (Owner) commented May 3, 2020

Hi,

Thank you for your comment. Here is how we did it:
Each testing video is sampled at 10 fps and rescaled so that min(height, width) = 224.
For each YouCook2 video clip, we sample 5 x 32-frame clips, linearly spaced (so each clip is 3.2 seconds long), center crop them to 224x224, and compute the 512-d video embedding for each of them. We then average pool the embeddings.
Finally, there was no normalization.
I guess the main difference from what you are doing is that you are uniformly sampling the 32 frames over the whole video, right? Or are the 32 sampled frames always consecutive?

Also, please make sure to put the model in eval mode; otherwise you will recompute the batch norm statistics over the running batches.
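Roughly, the clip embedding looks like this (an untested sketch; `frames` stands for your own decoding at 10 fps, already rescaled so min(height, width) = 224 and center-cropped to 224x224, and the `video_embedding` output key follows the README):

```python
import torch

def youcook2_clip_embedding(frames, net, num_windows=5, window=32):
    # frames: float tensor [T, 3, 224, 224] in [0, 1], decoded at 10 fps
    net.eval()  # keep batch norm statistics fixed
    num_frames = frames.shape[0]
    # start indices of 5 linearly spaced 32-frame windows (32 frames at 10 fps = 3.2 s)
    starts = torch.linspace(0, max(num_frames - window, 0), num_windows).long().tolist()
    embeddings = []
    with torch.no_grad():
        for s in starts:
            clip = frames[s:s + window]
            if clip.shape[0] < window:  # pad clips shorter than 32 frames
                pad = clip[-1:].expand(window - clip.shape[0], -1, -1, -1)
                clip = torch.cat([clip, pad], dim=0)
            clip = clip.permute(1, 0, 2, 3).unsqueeze(0)        # [1, 3, 32, 224, 224]
            embeddings.append(net(clip)['video_embedding'])     # [1, 512]
    return torch.cat(embeddings, dim=0).mean(dim=0)             # average pool -> [512]
```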

One thing to note is that this PyTorch model is a port of the official TensorFlow release from https://tfhub.dev/deepmind/mil-nce/s3d/1.
I converted the weights to PyTorch and ran a benchmark on CrossTask to check that the numbers were similar, but I did not check on YouCook2. If you still run into problems, please let me know and I will check YouCook2 myself.

dzabraev (Author) commented May 4, 2020

Do you L2-normalize the text embeddings and video embeddings before average pooling?
If not, is it OK that the text embedding has an L2-norm of ~175 while the video embedding has ~0.25?

antoine77340 (Owner) commented

No normalization is needed. I managed to rerun the YouCook2 evaluation with this PyTorch model using new code (different from my codebase at DeepMind) and with a validation set slightly larger than the one I had at DeepMind, and got 49.5 R@10.
I assume there is a problem in how you sample the 32-frame video clips. Are they always 32 contiguous frames? If not, you may run into issues if you randomly sample 32 frames within a large clip of 250 seconds.

dzabraev (Author) commented May 5, 2020

Thank you for the explanation. I managed to get the numbers from the paper. The main problem was JPEG compression: by default, ffmpeg applies JPEG compression when unpacking a video into images. I disabled the compression and was able to reach the reported numbers.

xiangyh9988 commented

> I disabled the compression and was able to reach the reported numbers.

Hi, sorry to bother you. Could you please share how to disable the JPEG compression in ffmpeg-python? I searched the arguments but could not find how to disable it.

dzabraev (Author) commented Jun 1, 2022

  1. Add -q:v 1 to the ffmpeg arguments.
  2. You can unpack the video to BMP instead. BMP is a lossless format, so it gives the best possible quality, but each image file will be large. (See the sketch after this list.)
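A minimal sketch of both options, calling the ffmpeg command line from Python via subprocess (the paths and the 10 fps rate are placeholders, the rate chosen to match the evaluation protocol above):

```python
import subprocess

# Option 1: keep JPEG output but request the highest quality (-q:v 1, least compression).
subprocess.run([
    'ffmpeg', '-i', 'video.mp4',
    '-vf', 'fps=10',      # decode at 10 fps, as in the evaluation protocol
    '-q:v', '1',
    'frames/%06d.jpg',    # output directory must already exist
], check=True)

# Option 2: dump lossless BMP frames instead (no compression artifacts, much larger files).
subprocess.run([
    'ffmpeg', '-i', 'video.mp4',
    '-vf', 'fps=10',
    'frames/%06d.bmp',
], check=True)
```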

xiangyh9988 commented

Got it, thank you. After reading your other issue, I see that you used the ffmpeg command line to unpack videos. I had misunderstood and thought the compression needed to be disabled in ffmpeg-python. My bad.
