[NEW!] 2022/7/8 - Our paper has been accepted to ECCV 2022.
2022/5/5 - We have released the code and models.
This is a PyTorch/GPU implementation of the paper In Defense of Image Pre-Training for Spatiotemporal Recognition.
- The Image Pre-Training code is located in Image_Pre_Training, which is based on the timm repo.
- The Spatiotemporal Fine-tuning code is a modification of mmaction2; installation and preparation follow that repo.
- You can find the proposed STS 3D convolution in STS_Conv; an illustrative sketch follows below.
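To give a rough intuition for what a spatiotemporal-separable convolution looks like, here is a minimal PyTorch sketch of a generic (2+1)D-style factorization into spatial and temporal parts. This is illustrative only and is not the repo's STS Conv; the actual design lives in STS_Conv:

```python
import torch
import torch.nn as nn

class FactorizedConv3d(nn.Module):
    """Generic spatial/temporal factorization of a 3D convolution.

    Illustrative only: the actual STS convolution is implemented in
    STS_Conv in this repo and may differ from this decomposition.
    """

    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # 1 x k x k spatial convolution: its kernel shape matches a 2D conv,
        # which is what makes image (2D) pre-trained weights easy to reuse
        self.spatial = nn.Conv3d(in_channels, out_channels,
                                 kernel_size=(1, kernel_size, kernel_size),
                                 padding=(0, pad, pad), bias=False)
        # k x 1 x 1 temporal convolution: mixes information across frames
        self.temporal = nn.Conv3d(out_channels, out_channels,
                                  kernel_size=(kernel_size, 1, 1),
                                  padding=(pad, 0, 0), bias=False)

    def forward(self, x):  # x: (N, C, T, H, W)
        return self.temporal(self.spatial(x))

x = torch.randn(2, 16, 8, 56, 56)   # (batch, channels, frames, height, width)
y = FactorizedConv3d(16, 32)(x)
print(y.shape)                      # torch.Size([2, 32, 8, 56, 56])
```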
The code is built with the following libraries:
- Python 3.8.5 or higher
- PyTorch 1.10.0+cu113
- torchvision 0.11.1+cu113
- opencv-python 4.4.0
- mmcv 1.4.6
- mmaction2 0.20.0+
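For reference, one possible installation sequence matching these versions is sketched below; the exact pins (e.g. the opencv-python build number) and the use of mmcv-full instead of mmcv are assumptions you may need to adapt to your CUDA setup:

```bash
# PyTorch / torchvision built against CUDA 11.3
pip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 \
    -f https://download.pytorch.org/whl/torch_stable.html
pip install opencv-python==4.4.0.46   # a 4.4.0.x build (exact pin is an assumption)
pip install mmcv-full==1.4.6          # mmcv-full assumed for the CUDA ops mmaction2 needs
pip install mmaction2==0.20.0
```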
We mainly focus on two widely used video classification benchmarks: Kinetics-400 and Something-Something V2.
Some notes before preparing the two datasets:
- We decode videos online to reduce storage cost; a sketch of such a decoding pipeline follows this list. In our experiments, the CPU only becomes a bottleneck when more than 8 input frames are used.
- The Kinetics-400 frames we used have a short side of 320. Our experiments use 240,436 training and 19,796 validation videos. We also provide the train/val lists.
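For intuition, online (on-the-fly) decoding in mmaction2 is typically configured with decord-based pipeline steps like the following; this is a sketch of a standard mmaction2 training pipeline, not necessarily the exact configs shipped in this repo:

```python
# typical mmaction2-style training pipeline: frames are decoded on the fly,
# so only the compressed videos need to be stored on disk
train_pipeline = [
    dict(type='DecordInit'),                 # open the video container
    dict(type='SampleFrames', clip_len=8, frame_interval=8, num_clips=1),
    dict(type='DecordDecode'),               # decode only the sampled frames
    dict(type='Resize', scale=(-1, 256)),    # resize short side to 256
    dict(type='RandomResizedCrop'),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='Normalize',
         mean=[123.675, 116.28, 103.53],
         std=[58.395, 57.12, 57.375],
         to_bgr=False),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]
```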
We provide our annotation format and data structure below for easy installation.
- Generate the annotation.
The annotation usually includes train.txt and val.txt. Each line of a *.txt file holds one video name and its label:
```
video_1 label_1
video_2 label_2
video_3 label_3
...
video_N label_N
```
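As an illustration, an annotation file in this format could be generated with a few lines of Python. This is a sketch: labels.csv and the directory layout are hypothetical placeholders for wherever your labels actually come from:

```python
import csv
from pathlib import Path

# hypothetical mapping file: each row is "video_filename,label_id"
with open('labels.csv') as f:
    labels = {name: label for name, label in csv.reader(f)}

# write one "video label" pair per line, as described above
with open('train.txt', 'w') as out:
    for video in sorted(Path('datasets/Kinetics400/videos').iterdir()):
        if video.name in labels:
            out.write(f'{video.name} {labels[video.name]}\n')
```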
The pre-processed dataset is organized with the following structure:
```
datasets
|_ Kinetics400
   |_ videos
   |  |_ video_0
   |  |_ video_1
   |  |_ ...
   |  |_ video_N
   |_ train.txt
   |_ val.txt
```
Here we provide video dataset list and pretrained weights in this OneDrive or GoogleDrive.
We provide ImageNet-1k pre-trained weights for five video models. All models are trained for 300 epochs. Please follow the provided scripts to evaluate them or fine-tune them on video datasets.
Models/Configs | Resolution | Top-1 | Checkpoints |
---|---|---|---|
ir-CSN50 | 224 * 224 | 78.8% | ckpt |
R2plus1d34 | 224 * 224 | 79.6% | ckpt |
SlowFast50-4x16 | 224 * 224 | 79.9% | ckpt |
SlowFast50-8x8 | 224 * 224 | 79.1% | ckpt |
Slowonly50 | 224 * 224 | 79.9% | ckpt |
X3D-S | 224 * 224 | 74.8% | ckpt |
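For example, a downloaded pre-training checkpoint can be inspected roughly as follows. This is a sketch: the filename is hypothetical, and the assumption that weights may be nested under a 'state_dict' key (common for timm-style checkpoints) should be verified against the actual files:

```python
import torch

# hypothetical filename; substitute the checkpoint you downloaded
ckpt = torch.load('ircsn50_imagenet.pth', map_location='cpu')

# timm-style checkpoints often nest weights under 'state_dict' (an assumption here)
state_dict = ckpt.get('state_dict', ckpt)
print(sorted(state_dict.keys())[:5])  # peek at the first few parameter names
```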
Here we provide the 50-epoch fine-tuning configs and checkpoints. We also include some 100-epoch checkpoints, which reach better performance at comparable computation.
Models/Configs | Resolution | Frames * Crops * Clips | 50-epoch Top-1 | 100-epoch Top-1 | Checkpoints folder |
---|---|---|---|---|---|
ir-CSN50 | 256 * 256 | 32 * 3 * 10 | 76.8% | 76.7% | ckpt |
R2plus1d34 | 256 * 256 | 8 * 3 * 10 | 76.2% | Over training budget | ckpt |
SlowFast50-4x16 | 256 * 256 | 32 * 3 * 10 | 76.2% | 76.9% | ckpt |
SlowFast50-8x8 | 256 * 256 | 32 * 3 * 10 | 77.2% | 77.9% | ckpt |
Slowonly50 | 256 * 256 | 8 * 3 * 10 | 75.7% | Over training budget | ckpt |
X3D-S | 192 * 192 | 13 * 3 * 10 | 72.5% | 73.9% | ckpt |
For Something-Something V2, we provide the fine-tuning configs and checkpoints below:
Models/Configs | Resolution | Frames * Crops * Clips | Top-1 | Checkpoints |
---|---|---|---|---|
ir-CSN50 | 256 * 256 | 8 * 3 * 1 | 61.4% | ckpt |
R2plus1d34 | 256 * 256 | 8 * 3 * 1 | 63.0% | ckpt |
SlowFast50-4x16 | 256 * 256 | 32 * 3 * 1 | 57.2% | ckpt |
Slowonly50 | 256 * 256 | 8 * 3 * 1 | 62.7% | ckpt |
X3D-S | 256 * 256 | 8 * 3 * 1 | 58.3% | ckpt |
After downloading the checkpoints and placing them in the target path, you can fine-tune or test the models with the corresponding configs by following the instructions below.
Once the above dependencies are installed, run:
```bash
git clone https://github.com/UCSC-VLAA/Image-Pretraining-for-Video
cd Image_Pre_Training          # first, pre-train the 3D model on ImageNet
cd Spatiotemporal_Finetuning   # then, fine-tune the model on the target video dataset
```
We have provided some widely-used 3D model pre-trained weights that you can directly use for evaluation or fine-tuning.
After downloading the pre-trained weights, you can, for example, evaluate the CSN model on ImageNet by running:
```bash
bash scripts/csn/distributed_eval.sh [number of gpus]
```
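For instance, a run on 8 GPUs would look like this (the GPU count is the only argument, as in the usage line above):

```bash
bash scripts/csn/distributed_eval.sh 8
```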
The pre-training scripts for the listed models are located in scripts. Before training a model on ImageNet, you should specify your data path and the path where checkpoints will be stored via --output. By default, we use wandb to log the training curves.
For example, to pre-train a CSN model on ImageNet:
```bash
bash scripts/csn/distributed_train.sh [number of gpus]
```
After pre-training, you can use the following command to fine-tune a video model.
Some notes:
- In the config file, set load_from = [your pre-trained model path].
- Setting reshape_t or reshape_st in the model config to False disables the STS Conv; a config sketch follows this list.
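Concretely, the relevant part of a fine-tuning config might look like the following. This is a hypothetical excerpt: load_from, reshape_t, and reshape_st come from the notes above, while the surrounding structure and the checkpoint path are placeholders:

```python
# hypothetical excerpt of a fine-tuning config
load_from = 'checkpoints/ircsn50_imagenet.pth'  # your pre-trained model path

model = dict(
    backbone=dict(
        # set either flag to False to disable the corresponding STS Conv reshape
        reshape_t=True,
        reshape_st=True,
    ),
)
```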
Then you can use the following command to fine-tune the models.
```bash
bash tools/dist_train.sh ${CONFIG_FILE} [optional arguments]
```
Example: train a CSN model on the Kinetics-400 dataset with periodic validation:
```bash
bash tools/dist_train.sh configs/recognition/csn/ircsn50_32x2_STS_k400_video.py [number of gpus] --validate
```
You can use the following command to test a model.
```bash
bash tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
```
Example: test a CSN model on the Kinetics-400 dataset and dump the result to a JSON file:
```bash
bash tools/dist_test.sh configs/recognition/csn/ircsn50_32x2_STS_k400_video.py \
    checkpoints/SOME_CHECKPOINT.pth [number of gpus] --eval top_k_accuracy mean_class_accuracy \
    --out result.json --average-clips prob
```
This repo is based on timm and mmaction2. Thanks to the contributors of these repos!
```bibtex
@inproceedings{li2022videopretraining,
  title     = {In Defense of Image Pre-Training for Spatiotemporal Recognition},
  author    = {Xianhang Li and Huiyu Wang and Chen Wei and Jieru Mei and Alan Yuille and Yuyin Zhou and Cihang Xie},
  booktitle = {ECCV},
  year      = {2022},
}
```