Can't reproduce results for YouCookII #2
Hi, thank you for your comment. Here is how we have done it: Also, please make sure to put the model in eval mode, otherwise you will recompute the batch norm statistics over running batches. One thing to note is that this PyTorch model is a port of the official TensorFlow release model from: https://tfhub.dev/deepmind/mil-nce/s3d/1
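For illustration, here is a minimal sketch of the eval-mode point; `net` is a stand-in module with batch norm, since loading the actual S3D checkpoint is repo-specific and not shown here:

```python
import torch
import torch.nn as nn

# Stand-in network containing BatchNorm (not the actual S3D model).
net = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

net.eval()  # freeze BatchNorm running statistics (and disable dropout)
with torch.no_grad():  # inference only, no gradient tracking
    out = net(torch.randn(1, 3, 32, 32))

print(tuple(out.shape))  # (1, 8, 30, 30)
```

Without `net.eval()`, each forward pass in training mode would update the BatchNorm running mean/variance from the evaluation batches themselves, changing the embeddings.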
Do you L2-normalize text embeddings and video embeddings before avg-pooling?
No normalization is needed. I managed to rerun the YouCook2 evaluation using this PyTorch model with new code (different from my codebase at DeepMind) and with a validation set slightly larger than the one I had at DeepMind, and got 49.5 in R@10.
Thank you for the explanation. I succeeded in reproducing the numbers from the article. The main reason was JPEG compression: by default, ffmpeg applies JPEG compression when unpacking a video into images. I disabled the compression and managed to get the required numbers.
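The exact flags used are not shown in the thread; one common way to avoid lossy JPEG artifacts when dumping frames is to write a lossless format such as PNG, or to force near-lossless JPEG quality (paths and frame rate below are placeholders):

```shell
# Extract frames losslessly as PNG instead of lossy JPEG
ffmpeg -i input.mp4 -vf "fps=10" frames/%06d.png

# Or, if JPEG output is required, use the best quality setting
# (-q:v ranges from 1 = best to 31 = worst for the MJPEG encoder)
ffmpeg -i input.mp4 -vf "fps=10" -q:v 1 frames/%06d.jpg
```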
Hi, sorry to bother you. Could you please share how to disable the JPEG compression in ffmpeg-python? I tried to search the arguments but didn't find how to disable it.
Got it, thank you. After seeing your other issue, I see that you used the ffmpeg command line to unpack videos. I misunderstood that and thought the compression needed to be disabled in ffmpeg-python.
I took this model and the code from this repository, took the validation part of YouCookII, and tried to reproduce the numbers mentioned in the article End-to-End Learning of Visual Representations from Uncurated Instructional Videos.
It is unclear which protocol you used for testing. In the following table I show several experiments, and none of them could achieve your results. Could you clarify which test protocol you used? It would be good if you published a script for testing.
What I tried (the parameters in the table mean the following):

- `T`: subclip length in seconds. Each clip is split into subclips of length `T`, and an embedding is computed for each subclip.
- `pooling`: if a clip was split into more than one subclip, the subclip embeddings are pooled (averaged) into a single clip embedding.
- `imgsz`: the short side of each source video is rescaled to `imgsz`, preserving the h:w aspect ratio; then a center crop is taken from each frame.
- `normalize`: whether or not the sentence embedding and each video embedding were L2-normalized before the dot product.
- `num frames`: from each `T`-second clip, `num frames` frames are taken uniformly.
- `num resample`: for each clip, sample `num resample` different sets of frames and compute an embedding for each resample; all embeddings are then pooled into a single one with `pooling`. LCR means sampling each clip 3 times: `num frames` left crops, `num frames` right crops, `num frames` center crops.
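The subclip pooling and retrieval metric described above can be sketched as follows; this is a minimal illustration with random placeholder embeddings, not the repository's actual evaluation script:

```python
import numpy as np

def pool_subclip_embeddings(subclip_embs: np.ndarray) -> np.ndarray:
    """Average subclip embeddings into one clip embedding (`pooling` = avg)."""
    return subclip_embs.mean(axis=0)

def recall_at_k(text_embs: np.ndarray, video_embs: np.ndarray, k: int = 10) -> float:
    """Text-to-video retrieval R@k via dot-product similarity.

    Row i of `text_embs` is the query whose ground-truth match is video i.
    """
    sims = text_embs @ video_embs.T                  # (n_text, n_video)
    ranks = (-sims).argsort(axis=1)                  # best match first
    hits = (ranks[:, :k] == np.arange(len(sims))[:, None]).any(axis=1)
    return float(hits.mean())

# Toy check: identical text/video embeddings give perfect recall.
rng = np.random.default_rng(0)
embs = rng.normal(size=(50, 512))
print(recall_at_k(embs, embs, k=10))  # 1.0
```

Whether L2 normalization is applied before the dot product, and whether pooling happens before or after normalization, are exactly the protocol choices the table varies.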