[MM' 2023] Fine-Grained Spatiotemporal Motion Alignment for Contrastive Video Representation Learning
Motion information is critical to a robust and generalized video representation. However, representations learned with a vanilla instance-level contrastive loss are easily overwhelmed by static background cues and fail to capture dynamic motion. Recent works adopt the frame difference as the source of motion information and align globally pooled motion features at the instance level, which suffers from weak spatial and temporal alignment between the RGB and frame-difference modalities. We present a Fine-grained Motion Alignment (FIMA) framework that introduces well-aligned motion information from the noisy frame difference. With this fine-grained motion information, the representations learned by FIMA generalize and transfer well to various downstream tasks.
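As a rough illustration of the idea (a sketch, not the paper's actual pipeline), the frame difference is a cheap motion cue: subtracting consecutive RGB frames cancels the static background and highlights moving regions. A minimal NumPy example, assuming a clip stored as a `(T, H, W, C)` uint8 array:

```python
import numpy as np

def frame_difference(frames: np.ndarray) -> np.ndarray:
    """Crude motion map: absolute difference of consecutive frames.

    frames: (T, H, W, C) uint8 clip; returns (T-1, H, W, C) float32.
    Static background pixels cancel out; moving regions keep large values.
    """
    f = frames.astype(np.float32)
    return np.abs(f[1:] - f[:-1])

# Toy clip: a bright square moves one pixel to the right each frame.
clip = np.zeros((3, 8, 8, 3), dtype=np.uint8)
for t in range(3):
    clip[t, 2:5, t + 1:t + 4, :] = 255

motion = frame_difference(clip)
print(motion.shape, motion.max())  # (2, 8, 8, 3) 255.0 -- nonzero only at the moving edges
```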
- PyTorch >= 1.10.1
- opencv-python (`cv2`)
- kornia
- PyAV (`av`)
- tensorboard
- Download the UCF101 dataset from the official website.
- Download the Kinetics400 dataset from the official website.
- We rescale all videos to a height of 256 pixels. This step is not strictly necessary, but it saves a lot of storage space and speeds up I/O.
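For reference, the new width that preserves the aspect ratio at height 256 can be computed as below; `rescaled_size` is a hypothetical helper (not part of this repo), and the rounding to an even width reflects the requirement of common codecs such as H.264 with `yuv420p`:

```python
def rescaled_size(width: int, height: int, target_height: int = 256) -> tuple:
    """Return (new_width, target_height), preserving the aspect ratio.

    The width is bumped to an even integer because many video codecs
    (e.g. H.264 with yuv420p) require even frame dimensions.
    """
    new_width = round(width * target_height / height)
    new_width += new_width % 2  # bump odd widths up to the next even value
    return new_width, target_height

print(rescaled_size(1920, 1080))  # (456, 256)
```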
By default, we pretrain networks on K400 for 100 epochs:
```shell
python3 train.py \
  --log_dir $PATH_TO_LOG_DIR \
  --ckp_dir $PATH_TO_CKP_DIR \
  -a I3D \
  --dataset ucf101 \
  --lr 0.01 \
  -fpc 16 \
  -cs 224 \
  -b 64 \
  -j 16 \
  --cos \
  --epochs 100 \
  --pos_ratio 0.7 \
  --dist_url 'tcp://localhost:10001' --multiprocessing_distributed --world_size 1 --rank 0 \
  $PATH_TO_K400
```
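The `--cos` flag suggests cosine learning-rate annealing as used in the MoCo codebase; a sketch of that schedule, under the assumption that the flag decays `--lr` from 0.01 toward 0 over the 100 epochs:

```python
import math

def cosine_lr(base_lr: float, epoch: int, total_epochs: int) -> float:
    """Half-cosine decay from base_lr at epoch 0 toward 0 at total_epochs."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))

print(cosine_lr(0.01, 0, 100))   # 0.01 at the start
print(cosine_lr(0.01, 50, 100))  # ~0.005 at the midpoint
```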
By default, we fine-tune the pretrained model on UCF101 for 150 epochs:
```shell
python3 eval.py \
  --log_dir $PATH_TO_LOG \
  --pretrained $PATH_TO_PRETRAINED_MODEL \
  -a I3D \
  --seed 42 \
  --num_class 101 \
  --lr 0.01 \
  --weight_decay 0.0001 \
  --lr_decay 0.1 \
  -fpc 16 \
  -b 16 \
  -j 16 \
  -cs 224 \
  --finetune \
  --epochs 150 \
  --schedule 60 120 \
  --dist_url 'tcp://localhost:10001' --multiprocessing_distributed --world_size 1 --rank 0 \
  $PATH_TO_UCF101
```
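`--schedule 60 120` together with `--lr_decay 0.1` suggests a step schedule that multiplies the learning rate by 0.1 at epochs 60 and 120; a sketch under that assumption (`step_lr` is a hypothetical helper, not this repo's code):

```python
def step_lr(base_lr: float, epoch: int,
            milestones=(60, 120), decay: float = 0.1) -> float:
    """Multiply base_lr by `decay` once for every milestone already passed."""
    return base_lr * decay ** sum(epoch >= m for m in milestones)

print(step_lr(0.01, 0))    # 0.01 before the first milestone
print(step_lr(0.01, 60))   # 0.001 after epoch 60
print(step_lr(0.01, 125))  # 0.0001 after epoch 120
```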
| Method | UCF101 | HMDB51 |
| --- | --- | --- |
| MoCo baseline | 76.7 | 48.8 |
| FIMA | 84.2 | 57.8 |
| Method | UCF101 | HMDB51 |
| --- | --- | --- |
| MoCo baseline | 58.1 | 26.7 |
| FIMA | 75.3 | 42.8 |
We visualize the model attention by applying class-agnostic activation maps. FIMA effectively alleviates the background bias.
Our code builds on MoCo and FAME. Thanks to the authors of these excellent works!
If our work is useful to you, please consider citing our paper using the following BibTeX entry.
```bibtex
@inproceedings{FIMA,
  title={Fine-Grained Spatiotemporal Motion Alignment for Contrastive Video Representation Learning},
  author={Zhu, Minghao and Lin, Xiao and Dang, Ronghao and Liu, Chengju and Chen, Qijun},
  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
  pages={4725--4736},
  year={2023}
}
```