
HMMC

Paper: End-to-end Pre-training with Hierarchical Matching and Momentum Contrast for Text-Video Retrieval

Try the demo here.

Our paper has been published in IEEE Transactions on Image Processing (TIP), 2023.


The implementation of the paper "End-to-end Pre-training with Hierarchical Matching and Momentum Contrast for Text-Video Retrieval".

HMMC (Hierarchical Matching and Momentum Contrast) is a text-video retrieval model (supporting both Chinese and English) built on CLIP, which was pre-trained on 400M image-text pairs, and is trained in an end-to-end manner. We introduce HMMC for video-language pre-training, taking advantage of both the global video representation and frame-level features through a hierarchical matching mechanism. We also collected a large-scale Chinese video-language dataset (over 763k unique videos) named CHVTT to explore the multilevel semantic connections between videos and texts. Experimental results on two major text-video retrieval benchmarks demonstrate the advantages of our method.

Model Architecture

Overall Architecture:

[Figure: overall architecture of HMMC]

Hierarchical Matching:

[Figure: hierarchical matching mechanism]
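
As the figure illustrates, the text is matched against both the global video representation (video-text matching) and the individual frame features (frame-text matching), where only the top-k most relevant frames contribute to the frame-level score. The following is a minimal sketch of this hierarchical matching idea, not the repository's exact implementation; the tensor shapes and the averaging of top-k frame scores are assumptions:

```python
import torch
import torch.nn.functional as F

def hierarchical_similarity(text_emb, video_emb, frame_embs, top_k=3):
    """Sketch of hierarchical matching: combine video-level and
    top-k frame-level similarities (shapes and pooling are assumptions).

    text_emb:   (B, D)    pooled text features
    video_emb:  (B, D)    global (temporally aggregated) video features
    frame_embs: (B, T, D) per-frame features
    """
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    frame_embs = F.normalize(frame_embs, dim=-1)

    # Video-Text Matching (VTM): one score per text-video pair.
    vtm = text_emb @ video_emb.t()                                # (B, B)

    # Frame-Text Matching (FTM): score every frame against every text,
    # then keep only the k best-matching frames for each pair.
    ftm_all = torch.einsum('id,jtd->ijt', text_emb, frame_embs)   # (B, B, T)
    ftm = ftm_all.topk(top_k, dim=-1).values.mean(-1)             # (B, B)

    return vtm, ftm
```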

Requirements

pip install -r requirements.txt

Data Preparation

Public Datasets

Write video frames to lmdb

We found that pre-training on videos is time-consuming, especially when raw frames are extracted from videos via OpenCV on the fly. To address this, we pre-extract the raw frames and save them to LMDB. The dataloaders then read video frames directly from LMDB, which speeds up training. (e.g., fine-tuning the model for 5 epochs with a batch size of 256 on 8 NVIDIA Tesla V100 GPUs takes about 1 hour on MSR-VTT.)

Change parameters such as video_dir and lmdb_path in tools/frame2lmdb.py as needed, then run:

python tools/frame2lmdb.py
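
For reference, the core of such a conversion looks roughly like the following. This is a hedged sketch, not the actual tools/frame2lmdb.py; the key naming scheme, map_size, JPEG re-encoding and the single write transaction are illustrative assumptions:

```python
import os
import cv2   # OpenCV, used to decode videos
import lmdb

def videos_to_lmdb(video_dir, lmdb_path):
    """Extract raw frames from every video under video_dir and store them
    JPEG-encoded in an LMDB database keyed by '<video_id>/<frame_idx>'."""
    env = lmdb.open(lmdb_path, map_size=1 << 40)  # allow up to ~1 TB
    with env.begin(write=True) as txn:            # one transaction for brevity
        for name in os.listdir(video_dir):
            video_id = os.path.splitext(name)[0]
            cap = cv2.VideoCapture(os.path.join(video_dir, name))
            idx = 0
            while True:
                ok, frame = cap.read()
                if not ok:
                    break
                # Re-encode each frame as JPEG so the database stays compact.
                ok, buf = cv2.imencode('.jpg', frame)
                if ok:
                    txn.put(f'{video_id}/{idx}'.encode(), buf.tobytes())
                idx += 1
            cap.release()
    env.close()
```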

How to Run

main_pretrain.py is for pre-training, while main_task_retrieval.py is for fine-tuning and testing.

--do_pretrain, --do_train and --do_eval mean pre-training, fine-tuning and testing, respectively

--output_dir means the path where the results are saved

--contrast_num_negative means the momentum queue size (number of negatives)

--contrast_momentum means the momentum coefficient used to update the momentum encoder (see the sketch after this list)

--top_frames means k in the top-k frame selection

--language means the text language (choose from [chinese, english])

--dataset means the dataset to use (choose from [chvtt, msrvtt, vatex])
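
The momentum-contrast flags follow the MoCo recipe: a momentum (key) encoder is updated as an exponential moving average of the query encoder with coefficient --contrast_momentum, and its outputs are pushed into a fixed-size queue of --contrast_num_negative negatives. A minimal sketch of this generic MoCo-style bookkeeping (not the repository's exact code) is:

```python
import torch

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.99):
    """EMA update of the momentum (key) encoder: k = m * k + (1 - m) * q."""
    for q_param, k_param in zip(query_encoder.parameters(),
                                key_encoder.parameters()):
        k_param.data.mul_(m).add_(q_param.data, alpha=1.0 - m)

@torch.no_grad()
def dequeue_and_enqueue(queue, queue_ptr, keys):
    """Replace the oldest entries of the negative queue with new keys.
    queue: (D, K) buffer, queue_ptr: 1-element long tensor, keys: (B, D)."""
    batch_size = keys.shape[0]
    k = queue.shape[1]
    ptr = int(queue_ptr)
    assert k % batch_size == 0          # simplification also used by MoCo
    queue[:, ptr:ptr + batch_size] = keys.t()
    queue_ptr[0] = (ptr + batch_size) % k
```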

Pre-training:

For example, run:

python -m torch.distributed.launch --nproc_per_node=8 --master_port 11000 \
main_pretrain.py --do_pretrain --num_thread_reader=8 \
--epochs=11 --batch_size=128 --n_display=596 \
--contrast_num_negative 1024 --contrast_momentum 0.99 \
--output_dir ckpts/pretrain/ablation7w_11epoch/FAM0.05_VTM0.45_FTM0.45_MLM0.05_m0.99_q1024_f30_e-7_e-4_11epoch_8GPU \
--top_frames 3 \
--max_frames 12 --frame_sample random  --use_temp --use_frame_fea \
--lr 1e-4 --text_lr 1e-4 --coef_lr 1e-3 --batch_size_val 128  \
--language chinese --dataset chvtt

to pre-train the model on CHVTT in Chinese for 11 epochs with 8 GPUs.

Fine-tuning:

--init_model means the pre-trained model path

For example, run:

python -m torch.distributed.launch --nproc_per_node=8 --master_port 11000 \
main_task_retrieval.py --do_train --num_thread_reader=8 \
--epochs=5 --batch_size=256 --n_display=100 \
--output_dir ckpts/msrvtt/vit32_VTM0.85_FTM0.15_cross3_frame30_256_8GPU_pretrained \
--top_frames 2 \
--max_frames 12 --frame_sample random  --use_temp --use_frame_fea \
--lr 1e-4 --text_lr 1e-7 --coef_lr 1e-3 --batch_size_val 256  \
--task retrieval --dataset msrvtt --language english \
--init_model ckpts/pretrain/english/FAM0_VTM0.45_FTM0.45_MLM0_m0.99_q1024_f30_whole/pytorch_model.bin.2

to fine-tune the pre-trained model on MSR-VTT in English for 5 epochs with 8 GPUs.

Testing:

For example, run:

CUDA_VISIBLE_DEVICES=0  \
python -m torch.distributed.launch --nproc_per_node=1 --master_port 12000 \
main_task_retrieval.py --do_eval --num_thread_reader=8 \
--epochs=1 --batch_size=64 --n_display=1 \
--output_dir ckpts/val \
--max_frames 12 --frame_sample random --use_temp --use_frame_fea \
--lr 1e-4 --text_lr 3e-5 --coef_lr 8e-1 --batch_size_val 256  \
--task retrieval --dataset vatex --language english \
--init_model ckpts/vatex/VTM0.85_FTM0.15_cross3_frame30_8GPU_pretrain/pytorch_model.bin.4

to test the model on VATEX in English with 1 GPU.

Test Model Checkpoint

We provide three model checkpoints, each pre-trained on CHVTT and then fine-tuned on MSR-VTT, VATEX (English) or VATEX (Chinese).

| Model | Link |
| --- | --- |
| MSR-VTT (English) | download |
| VATEX (English) | download |
| VATEX (Chinese) | download |

Main Results of HMMC

T2V(zero-shot):

| Methods | Dataset | R@1 | R@5 | R@10 | Median Rank | Mean Rank |
| --- | --- | --- | --- | --- | --- | --- |
| HMMC (pre-trained on CHVTT) | MSR-VTT | 30.3 | 56.6 | 68.4 | 4.0 | 23.9 |
| HMMC (pre-trained on CHVTT) | VATEX (English) | 40.5 | 73.4 | 83.1 | 2.0 | 10.3 |
| HMMC (pre-trained on CHVTT) | VATEX (Chinese) | 15.5 | 39.9 | 52.5 | 9.0 | 53.4 |

T2V(fine-tune):

| Methods | Dataset | R@1 | R@5 | R@10 | Median Rank | Mean Rank |
| --- | --- | --- | --- | --- | --- | --- |
| HMMC | MSR-VTT | 46.7 | 73.9 | 83.3 | 2.0 | 14.9 |
| HMMC (pre-trained on CHVTT) | MSR-VTT | 47.8 | 74.1 | 83.7 | 2.0 | 14.3 |
| HMMC | VATEX (English) | 61.0 | 91.2 | 95.7 | 1.0 | 3.2 |
| HMMC (pre-trained on CHVTT) | VATEX (English) | 61.5 | 91.6 | 96.1 | 1.0 | 3.2 |
| HMMC | VATEX (Chinese) | 51.4 | 84.8 | 92.0 | 1.0 | 6.1 |
| HMMC (pre-trained on CHVTT) | VATEX (Chinese) | 54.1 | 86.2 | 92.4 | 1.0 | 5.6 |

V2T(zero-shot):

| Methods | Dataset | R@1 | R@5 | R@10 | Median Rank | Mean Rank |
| --- | --- | --- | --- | --- | --- | --- |
| HMMC (pre-trained on CHVTT) | MSR-VTT | 28.8 | 57.0 | 68.6 | 4.0 | 22.6 |
| HMMC (pre-trained on CHVTT) | VATEX (English) | 54.0 | 87.3 | 93.9 | 1.0 | 3.6 |
| HMMC (pre-trained on CHVTT) | VATEX (Chinese) | 23.9 | 58.1 | 71.4 | 4.0 | 18.1 |

V2T(fine-tune):

| Methods | Dataset | R@1 | R@5 | R@10 | Median Rank | Mean Rank |
| --- | --- | --- | --- | --- | --- | --- |
| HMMC | MSR-VTT | 46.3 | 74.5 | 83.4 | 2.0 | 9.5 |
| HMMC (pre-trained on CHVTT) | MSR-VTT | 46.4 | 75.0 | 84.0 | 2.0 | 9.0 |
| HMMC | VATEX (English) | 79.0 | 97.5 | 99.5 | 1.0 | 1.5 |
| HMMC (pre-trained on CHVTT) | VATEX (English) | 77.9 | 97.8 | 99.2 | 1.0 | 1.5 |
| HMMC | VATEX (Chinese) | 70.7 | 95.3 | 98.3 | 1.0 | 2.0 |
| HMMC (pre-trained on CHVTT) | VATEX (Chinese) | 71.9 | 96.3 | 98.5 | 1.0 | 1.9 |
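
R@K, Median Rank and Mean Rank in the tables above are the standard retrieval metrics computed from the query-to-candidate similarity matrix. A minimal sketch of how these metrics are typically computed (not the repository's evaluation code; a square matrix with ground-truth matches on the diagonal is assumed):

```python
import numpy as np

def retrieval_metrics(sim):
    """sim: (N, N) similarity matrix, sim[i, j] = score of query i vs. item j,
    with the ground-truth match for query i at column i."""
    # Rank of the correct item for every query (1 = best).
    order = np.argsort(-sim, axis=1)
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(len(sim))])
    return {
        'R@1': float(np.mean(ranks <= 1) * 100),
        'R@5': float(np.mean(ranks <= 5) * 100),
        'R@10': float(np.mean(ranks <= 10) * 100),
        'MedianRank': float(np.median(ranks)),
        'MeanRank': float(np.mean(ranks)),
    }
```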

Visualization

Results:

[Figure: qualitative retrieval results]

Attention map:

[Figure: attention map visualization]

Citation

If you find this work useful in your research, please cite the following paper:

@ARTICLE{HMMC,
  author={Shen, Wenxue and Song, Jingkuan and Zhu, Xiaosu and Li, Gongfu and Shen, Heng Tao},
  journal={IEEE Transactions on Image Processing}, 
  title={End-to-End Pre-Training With Hierarchical Matching and Momentum Contrast for Text-Video Retrieval}, 
  year={2023},
  volume={32},
  number={},
  pages={5017-5030},
  doi={10.1109/TIP.2023.3275071}}

Acknowledgments

Our code is based on CLIP and CLIP4Clip. We sincerely appreciate their contributions.
