
Rethinking Resolution in the Context of Efficient Video Recognition (NeurIPS 2022)
By Chuofan Ma, Qiushan Guo, Yi Jiang, Ping Luo, Zehuan Yuan, and Xiaojuan Qi.

Introduction

We introduce cross-resolution knowledge distillation (ResKD) to make the most of low-resolution frames for efficient video recognition. During training, a pre-trained teacher network that takes high-resolution frames as input guides the learning of a student network on low-resolution frames. At evaluation time, only the student is deployed to make predictions. This simple but effective method largely boosts recognition accuracy on low-resolution frames and is compatible with state-of-the-art architectures such as 3D-CNNs and Video Transformers.

(Figure: overview of the ResKD framework)
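
To make the training recipe concrete, below is a minimal PyTorch sketch of cross-resolution distillation. It is an illustration, not the code used in this repo: the loss weighting (alpha), temperature (T), downsampling factor, and the assumption of TSN-style (N*T, C, H, W) inputs are all ours; see the paper and the configs for the actual formulation.

import torch
import torch.nn.functional as F

def reskd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Hinton-style KD: softened KL between teacher and student predictions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Standard cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

def train_step(teacher, student, hr_frames, labels, scale=0.5):
    # The frozen teacher sees the high-resolution frames ...
    with torch.no_grad():
        teacher_logits = teacher(hr_frames)
    # ... while the student sees a bilinearly downsampled copy of the same clip.
    lr_frames = F.interpolate(hr_frames, scale_factor=scale,
                              mode="bilinear", align_corners=False)
    student_logits = student(lr_frames)
    return reskd_loss(student_logits, teacher_logits, labels)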

Setup Environment

This project is developed with CUDA 11.0, PyTorch 1.7.1, and Python 3.7. Be aware of possible code compatibility issues if you use other versions. The following is an example of setting up the experimental environment:

git clone https://github.com/CVMI-Lab/ResKD.git
cd ResKD
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
pip install mmcv-full==1.4.0 -f https://download.openmmlab.com/mmcv/dist/cu110/torch1.7.0/index.html
pip install -r requirements/build.txt
pip install -v -e .
pip install tqdm
pip install timm
# NVIDIA apex is needed for mixed-precision training; if the apex/ directory
# is not already present, clone it first: git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
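
After installation, a quick sanity check of the environment (expected versions follow the setup above):

import torch
import mmcv

print(torch.__version__)          # expect 1.7.1+cu110
print(torch.version.cuda)         # expect 11.0
print(torch.cuda.is_available())  # expect True on a CUDA-capable machine
print(mmcv.__version__)           # expect 1.4.0

try:
    from apex import amp  # noqa: F401 -- mixed-precision support
    print("apex OK")
except ImportError:
    print("apex is not installed")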

Download Datasets

Four benchmarks are used for training and evaluation. Please download the corresponding dataset(s) from the official websites and place or symlink them under $ResKD_ROOT/data/ (you do not need to download all of them at once).

$ResKD_ROOT/data/
    actnet/
    sthv2/
    fcvid/
    kinetics400/
  • ActivityNet. After downloading the raw videos, extract frames using tools/data/activitynet/video2img.py (a minimal sketch of this step is given below). To reproduce the results in our paper, extract frames in png format at a frame rate of 4. The extracted frames take roughly 1.9 TB of disk space. If you do not have enough space, consider extracting frames in jpg format at the default frame rate, which sacrifices accuracy slightly.
  • Mini-Kinetics and Kinetics-400. We use the Kinetics-400 version provided by the Common Visual Data Foundation. Remember to filter out corrupted videos before using the dataset. Mini-Kinetics is a subset of Kinetics-400; you can get the train/val split files from AR-Net.
  • FCVID. Follow the same frame-extraction pipeline as ActivityNet.
  • Something Something V2.

You may need to modify the corresponding file paths in the config files after data preparation.
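
For reference, the frame extraction performed by tools/data/activitynet/video2img.py boils down to dumping frames at a fixed rate. Below is a hypothetical minimal sketch using ffmpeg; the output naming pattern and paths are assumptions, so use the bundled script to reproduce our setup exactly.

import os
import subprocess

def extract_frames(video_path, out_dir, fps=4, ext="png"):
    # png at fps=4 matches the paper; jpg at the default rate saves space.
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}",
         os.path.join(out_dir, f"img_%05d.{ext}")],
        check=True,
    )

# Example with illustrative paths:
# extract_frames("data/actnet/videos/v_00001.mp4", "data/actnet/rawframes/v_00001")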

Pretrained Teacher Models

| Backbone       | Dataset | Config                              | Model |
| -------------- | ------- | ----------------------------------- | ----- |
| TSN_Res50      | actnet  | tsn_r50_1x1x16_50e_actnet_rgb.py    | ckpt  |
| TSM_Res50      | sthv2   | tsm_r50_1x1x8_50e_sthv2_rgb.py      | ckpt  |
| TSN_Res152     | actnet  | tsn_r152_1x1x16_50e_actnet_rgb.py   | ckpt  |
| TSN_Res152     | minik   | tsn_r152_1x1x8_50e_minik_rgb.py     | ckpt  |
| TSN_Res152     | fcvid   | tsn_r152_1x1x16_50e_fcvid_rgb.py    | ckpt  |
| Slowonly_Res50 | k400    | slowonly_r50_8x8x1_150e_k400_rgb.py | ckpt  |
| Swin_Base      | k400    | swin_base_32x2x1_50e_k400.py        | ckpt  |
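
Since the codebase follows MMAction2, a downloaded teacher checkpoint can be loaded with the standard MMAction2 API. A minimal sketch; the config and checkpoint paths below are illustrative:

from mmaction.apis import init_recognizer

config = "configs/tsn_r50_1x1x16_50e_actnet_rgb.py"  # illustrative path
checkpoint = "checkpoints/tsn_r50_actnet.pth"        # downloaded ckpt
model = init_recognizer(config, checkpoint, device="cuda:0")
model.eval()  # the teacher stays frozen during distillation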

Training and Evaluation Scripts

Here we provide some examples of training and testing a model. For more details on the training and evaluation scripts, please refer to the MMAction2 documentation.

Inference with pretrained models

  • Evaluation on ActivityNet
./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} --eval mean_average_precision
  • Evaluation on Mini-kinetics, Something Something V2, and FCVID
./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} --eval top_k_accuracy
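
For example, evaluating the TSN_Res50 teacher from the table above on ActivityNet with 8 GPUs might look like this (config and checkpoint paths are illustrative):

./tools/dist_test.sh configs/tsn_r50_1x1x16_50e_actnet_rgb.py checkpoints/tsn_r50_actnet.pth 8 --eval mean_average_precision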

Training from scratch

./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} --validate --test-last
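
For example, with 8 GPUs and an illustrative config path:

./tools/dist_train.sh configs/tsn_r50_1x1x16_50e_actnet_rgb.py 8 --validate --test-last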

Citation

If you find this repo useful for your research, please consider citing our paper:

@inproceedings{
    ma2022rethinking,
    title={Rethinking Resolution in the Context of Efficient Video Recognition},
    author={Chuofan Ma and Qiushan Guo and Yi Jiang and Ping Luo and Zehuan Yuan and Xiaojuan Qi},
    booktitle={Advances in Neural Information Processing Systems},
    year={2022},
}

Acknowledgment

Our codebase builds upon several existing publicly available codebases. In particular, we have modified and integrated code from MMAction2 and NVIDIA Apex into this project.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
