[2023.8.22] Code and pre-trained models of Tubelet Contrast will be released soon! Keep an eye on this repo!
[2023.8.22] Code for evaluating Tubelet Contrast pretrained models has been added to this repo. 🎉
[2023.7.13] Our [Tubelet Contrast](https://arxiv.org/abs/2303.11003) paper has been accepted to ICCV 2023! 🎉
Official code for our ECCV 2022 paper How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning?
TL;DR. We propose the SEVERE (SEnsitivity of VidEo REpresentations) benchmark for evaluating the generalizability of representations obtained by existing and future self-supervised video learning methods.
We evaluate 9 video self-supervised learning (VSSL) methods on 7 video datasets for 6 video understanding tasks.
Below are the video self-supervised methods that we evaluate.
- For SeLaVi, MoCo, VideoMoCo, Pretext-Contrast, CtP, TCLR and GDT, we use the Kinetics-400 pretrained R(2+1)D-18 weights provided by the authors.
- For RSPNet and AVID-CMA, the author-provided R(2+1)D-18 weights differ from the R(2+1)D-18 architecture defined in 'A Closer Look at Spatiotemporal Convolutions for Action Recognition'. Thus we use the official implementations of RSPNet and AVID-CMA to pretrain with the common R(2+1)D-18 backbone on the Kinetics-400 dataset.
- For Supervised, we use the Kinetics-400 pretrained R(2+1)D-18 weights from the PyTorch (torchvision) library.
Download the Kinetics-400 pretrained R(2+1)D-18 weights for each method from here. Unzip the downloaded file; it will create a folder checkpoints_pretraining/ containing all the pretrained model weights.
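For reference, here is a minimal sketch of how one of the downloaded checkpoints might be loaded into the common R(2+1)D-18 backbone. The checkpoint path and key layout are assumptions; adapt them to the actual contents of checkpoints_pretraining/.

```python
import torch
from torchvision.models.video import r2plus1d_18

# Standard R(2+1)D-18 backbone with random init.
# For the Supervised baseline, pretrained=True (or the `weights=` argument
# in newer torchvision versions) loads torchvision's Kinetics-400 weights.
model = r2plus1d_18(pretrained=False)

# Hypothetical checkpoint path; real file names depend on the method.
ckpt = torch.load("checkpoints_pretraining/some_method/kinetics400.pth",
                  map_location="cpu")

# Some checkpoints wrap the weights in a "state_dict" entry and may carry
# prefixes such as "module." from distributed training.
state_dict = ckpt.get("state_dict", ckpt)
state_dict = {k.replace("module.", ""): v for k, v in state_dict.items()}

# strict=False tolerates missing/extra keys, e.g. a projection head that
# was only used during self-supervised pretraining.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing keys:", missing)
print("unexpected keys:", unexpected)
```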
We divide these downstream evaluations across four axes:
We evaluate the sensitivity of self-supervised methods to domain shift in the downstream dataset with respect to the pre-training dataset, i.e. Kinetics-400.
Please refer to action_recognition/README.md for steps to reproduce the experiments with varying downstream domains.
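The downstream evaluation here is full fine-tuning of the pretrained backbone with a freshly initialized classification head. Below is a minimal, hypothetical sketch of that recipe; the class count, hyperparameters, and data loading are placeholders, and the actual training configuration is in action_recognition/README.md.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r2plus1d_18

# Backbone initialized from a self-supervised checkpoint (see the loading
# sketch above); a fresh model is used here only for brevity.
model = r2plus1d_18(pretrained=False)

# Replace the Kinetics-400 head with one for the downstream dataset,
# e.g. 101 classes for UCF-101 (placeholder value).
num_classes = 101
model.fc = nn.Linear(model.fc.in_features, num_classes)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def finetune_one_epoch(train_loader, device="cuda"):
    # `train_loader` is assumed to yield (clip, label) batches of shape
    # (B, C, T, H, W); building it is dataset-specific.
    model.to(device).train()
    for clips, labels in train_loader:
        clips, labels = clips.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(clips), labels)
        loss.backward()
        optimizer.step()
```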
We evaluate the sensitivity of self-supervised methods to the number of downstream samples available for fine-tuning.
Please refer to action_recognition/README.md for steps to reproduce the experiments with varying downstream samples.
We investigate whether self-supervised methods can learn fine-grained features required for recognizing semantically similar actions.
Please refer to action_recognition/README.md for steps to reproduce the experiments with varying downstream actions.
We study the sensitivity of video self-supervised methods to the nature of the downstream task.
In-domain task shift: For task shift within the pre-training domain, we evaluate on the UCF dataset for the task of repetition counting. Please refer to Repetition-Counting/README.md for steps to reproduce the experiments.
Out-of-domain task shift: For task shift combined with domain shift, we evaluate multi-label action classification on Charades and action detection on AVA. Please refer to action_detection_multi_label_classification/README.md for steps to reproduce the experiments.
From our analysis we distill the SEVERE benchmark, a subset of our experiments that can be useful for evaluating current and future video representations beyond standard benchmarks.
If you use our work or code, kindly consider citing our paper:
@inproceedings{thoker2022severe,
  author    = {Thoker, Fida Mohammad and Doughty, Hazel and Bagad, Piyush and Snoek, Cees},
  title     = {How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning?},
  booktitle = {ECCV},
  year      = {2022},
}
🔔 If you face an issue or have suggestions, please create a GitHub issue and we will try our best to address it soon.