
Rethinking Resolution in the Context of Efficient Video Recognition (NeurIPS 2022)
By Chuofan Ma, Qiushan Guo, Yi Jiang, Ping Luo, Zehuan Yuan, and Xiaojuan Qi.

Introduction

We introduce cross-resolution knowledge distillation (ResKD) to make the most of low-resolution frames for efficient video recognition. During training, a pre-trained teacher network that takes high-resolution frames as input guides the learning of a student network on low-resolution frames. At evaluation time, only the student is deployed to make predictions. This simple but effective method largely boosts recognition accuracy on low-resolution frames and is compatible with state-of-the-art architectures such as 3D-CNNs and Video Transformers.

(Figure: overview of the ResKD framework)
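
To make the training recipe concrete, below is a minimal PyTorch sketch of cross-resolution distillation. It is an illustration, not the code used in this repo: the loss weighting (alpha), temperature (T), downsampling factor, and the assumption of TSN-style (N*T, C, H, W) inputs are all ours; see the paper and the configs for the actual formulation.

import torch
import torch.nn.functional as F

def reskd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Hinton-style KD: softened KL between teacher and student predictions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Standard cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

def train_step(teacher, student, hr_frames, labels, scale=0.5):
    # The frozen teacher sees the high-resolution frames ...
    with torch.no_grad():
        teacher_logits = teacher(hr_frames)
    # ... while the student sees a bilinearly downsampled copy of the same clip.
    lr_frames = F.interpolate(hr_frames, scale_factor=scale,
                              mode="bilinear", align_corners=False)
    student_logits = student(lr_frames)
    return reskd_loss(student_logits, teacher_logits, labels)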

Setup Environment

This project is developed with CUDA 11.0, PyTorch 1.7.1, and Python 3.7. Be aware of possible code compatibility issues if you use other versions. The following is an example of setting up the experimental environment:

git clone https://github.com/CVMI-Lab/ResKD.git
cd ResKD
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
pip install mmcv-full==1.4.0 -f https://download.openmmlab.com/mmcv/dist/cu110/torch1.7.0/index.html
pip install -r requirements/build.txt
pip install -v -e .
pip install tqdm
pip install timm
# NVIDIA apex is needed for mixed-precision training; if the apex/ directory
# is not already present, clone it first: git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
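
After installation, a quick sanity check of the environment (expected versions follow the setup above):

import torch
import mmcv

print(torch.__version__)          # expect 1.7.1+cu110
print(torch.version.cuda)         # expect 11.0
print(torch.cuda.is_available())  # expect True on a CUDA-capable machine
print(mmcv.__version__)           # expect 1.4.0

try:
    from apex import amp  # noqa: F401 -- mixed-precision support
    print("apex OK")
except ImportError:
    print("apex is not installed")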

Download Datasets

Four benchmarks are used for training and evaluation. Please download the corresponding dataset(s) from the official websites and place or symlink them under $ResKD_ROOT/data/ (you do not need to download all of them at once).

$ResKD_ROOT/data/
    actnet/
    sthv2/
    fcvid/
    kinetics400/
  • ActivityNet. After downloading the raw videos, extract frames using tools/data/activitynet/video2img.py (a minimal sketch of this step is given below). To reproduce the results in our paper, extract frames in png format at a frame rate of 4. The extracted frames take roughly 1.9 TB of disk space. If you do not have enough space, consider extracting frames in jpg format at the default frame rate, which sacrifices accuracy slightly.
  • Mini-Kinetics and Kinetics-400. We use the Kinetics-400 version provided by the Common Visual Data Foundation. Remember to filter out corrupted videos before using the dataset. Mini-Kinetics is a subset of Kinetics-400; you can get the train/val split files from AR-Net.
  • FCVID. Follow the same frame-extraction pipeline as ActivityNet.
  • Something Something V2.

You may need to modify the corresponding file paths in the config files after data preparation.
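
For reference, the frame extraction performed by tools/data/activitynet/video2img.py boils down to dumping frames at a fixed rate. Below is a hypothetical minimal sketch using ffmpeg; the output naming pattern and paths are assumptions, so use the bundled script to reproduce our setup exactly.

import os
import subprocess

def extract_frames(video_path, out_dir, fps=4, ext="png"):
    # png at fps=4 matches the paper; jpg at the default rate saves space.
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}",
         os.path.join(out_dir, f"img_%05d.{ext}")],
        check=True,
    )

# Example with illustrative paths:
# extract_frames("data/actnet/videos/v_00001.mp4", "data/actnet/rawframes/v_00001")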

Pretrained Teacher Models

| Backbone       | Dataset | Config                              | Model |
| -------------- | ------- | ----------------------------------- | ----- |
| TSN_Res50      | actnet  | tsn_r50_1x1x16_50e_actnet_rgb.py    | ckpt  |
| TSM_Res50      | sthv2   | tsm_r50_1x1x8_50e_sthv2_rgb.py      | ckpt  |
| TSN_Res152     | actnet  | tsn_r152_1x1x16_50e_actnet_rgb.py   | ckpt  |
| TSN_Res152     | minik   | tsn_r152_1x1x8_50e_minik_rgb.py     | ckpt  |
| TSN_Res152     | fcvid   | tsn_r152_1x1x16_50e_fcvid_rgb.py    | ckpt  |
| Slowonly_Res50 | k400    | slowonly_r50_8x8x1_150e_k400_rgb.py | ckpt  |
| Swin_Base      | k400    | swin_base_32x2x1_50e_k400.py        | ckpt  |
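
Since the codebase follows MMAction2, a downloaded teacher checkpoint can be loaded with the standard MMAction2 API. A minimal sketch; the config and checkpoint paths below are illustrative:

from mmaction.apis import init_recognizer

config = "configs/tsn_r50_1x1x16_50e_actnet_rgb.py"  # illustrative path
checkpoint = "checkpoints/tsn_r50_actnet.pth"        # downloaded ckpt
model = init_recognizer(config, checkpoint, device="cuda:0")
model.eval()  # the teacher stays frozen during distillation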

Training and Evaluation Scripts

Here we provide some examples of training and testing a model. For more details on the training and evaluation scripts, please refer to the MMAction2 documentation.

Inference with pretrained models

  • Evaluation on ActivityNet
./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} --eval mean_average_precision
  • Evaluation on Mini-kinetics, Something Something V2, and FCVID
./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} --eval top_k_accuracy
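
For example, evaluating the TSN_Res50 teacher from the table above on ActivityNet with 8 GPUs might look like this (config and checkpoint paths are illustrative):

./tools/dist_test.sh configs/tsn_r50_1x1x16_50e_actnet_rgb.py checkpoints/tsn_r50_actnet.pth 8 --eval mean_average_precision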

Training from scratch

./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} --validate --test-last
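
For example, with 8 GPUs and an illustrative config path:

./tools/dist_train.sh configs/tsn_r50_1x1x16_50e_actnet_rgb.py 8 --validate --test-last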

Citation

If you find this repo useful for your research, please consider citing our paper:

@inproceedings{
    ma2022rethinking,
    title={Rethinking Resolution in the Context of Efficient Video Recognition},
    author={Chuofan Ma and Qiushan Guo and Yi Jiang and Ping Luo and Zehuan Yuan and Xiaojuan Qi},
    booktitle={Advances in Neural Information Processing Systems},
    year={2022},
}

Acknowledgment

Our codebase builds upon several existing publicly available codebases. In particular, we have modified and integrated code from MMAction2 and NVIDIA Apex into this project.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
