In this repo, we provide code and pretrained models for the paper "Relevance-based Margin for Contrastively-trained Video Retrieval Models", accepted for presentation at the ACM International Conference on Multimedia Retrieval (ICMR 2022). We also provide code and pretrained models for RelevanceMargin-HGR.
The environment is based on the JPoSE environment. To create a Conda environment from RelMarg_environment.yml, type:

```
conda env create -f RelMarg_environment.yml
conda activate JPoSE
```

Then clone the repository and type:

```
export PYTHONPATH=src/
```
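As a quick sanity check that the environment is active (assuming PyTorch is among the dependencies, as in the original JPoSE environment):

```
# should print the installed PyTorch version without raising ImportError
python -c "import torch; print(torch.__version__)"
```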
- Features:
  - EPIC-Kitchens-100: video features and text features, both from JPoSE's repo.
  - YouCook2: video features and text features; the video features come from the VALUE benchmark, while the text features are precomputed using code from the JPoSE repo.
- Additional:
  - relational files: EPIC-Kitchens-100 and YouCook2
  - relevancy files: EPIC-Kitchens-100 and YouCook2
To launch a training (with JPoSE) on EPIC-Kitchens-100:

```
python -m train.train_jpose_tripletRelBased
```

- To use the proposed relevance margin, specify `--rel-margin --all-noun-classes`.
- Add `--rgb` to use only RGB features, or `--rgb-flow` to use RGB+Flow; otherwise, do not add anything to use RGB+Flow+Audio (TBN features).
- To use only the cross-modality loss, specify `--tt-weight 0 --vv-weight 0`.
- To use only the action-level embedding space, specify `--noun-weight 0 --verb-weight 0`.
- To use a GPU, specify `--gpu True`.
- More options are documented in `src/parsing/__init__.py` and in the `src/train/train_{mmen,jpose}_tripletRelBased.py` files.

To train on YouCook2, specify `--dataset youcook2`. Similar options are available for the MMEN baseline. A complete example command is sketched below.
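Putting these options together, a full invocation might look like the following sketch (the flag combination is illustrative, not a prescribed configuration; all flags are taken from the list above):

```
# JPoSE with the relevance margin on EPIC-Kitchens-100, RGB-only features, on GPU
python -m train.train_jpose_tripletRelBased --rel-margin --all-noun-classes --rgb --gpu True

# same setup on YouCook2 (assuming the same relevance-margin flags apply there)
python -m train.train_jpose_tripletRelBased --rel-margin --all-noun-classes --dataset youcook2 --gpu True
```

If the parsing module is argparse-based (an assumption), running the training script with `--help` should print the full list of options.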
To test a specific checkpoint:

```
python -m train.test_jpose_triplet checkpoint
```

Use the same options used during training (e.g., if training was performed with RGB-only features, specify `--rgb`); see the example below.
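For example, a checkpoint from the RGB-only run above could be evaluated as follows (the checkpoint path is a placeholder; the flags mirror the training example):

```
# evaluate with the same options used at training time (RGB-only example)
python -m train.test_jpose_triplet /path/to/checkpoint.pth --rel-margin --all-noun-classes --rgb --gpu True
```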
Pretrained models and results on EPIC-Kitchens-100:
- MMEN
- JPoSE
  - Baseline reproduced (~53.5 nDCG, ~44.0 mAP); with Relevance Margin (56.2 nDCG, 45.8 mAP)
  - Only cross-modality loss, action-level (53.1 nDCG, 43.4 mAP); with Relevance Margin (54.7 nDCG, 44.5 mAP)
  - Only cross-modality loss, both action- and PoS-level (53.4 nDCG, 43.7 mAP); with Relevance Margin (56.2 nDCG, 45.6 mAP)
  - Only RGB features (36.8 nDCG, 28.8 mAP); with Relevance Margin (38.4 nDCG, 30.4 mAP)
  - RGB+Flow features (49.6 nDCG, 41.0 mAP); with Relevance Margin (52.5 nDCG, 42.8 mAP)
On YouCook2:
- MMEN
- JPoSE
We thank the authors of Chen et al. (CVPR 2020), Wray et al. (ICCV 2019), and Wray et al. (CVPR 2021) for the release of their codebases. We also thank Damen et al. (IJCV 2021) and Li et al. (NeurIPS Track on Datasets and Benchmarks 2021) for the release of the EPIC-Kitchens-100 and YouCook2 features.
If you use this code as part of any published research, we'd really appreciate it if you could cite the following paper:
```
@inproceedings{falcon2022relevance,
  title={Relevance-based Margin for Contrastively-trained Video Retrieval Models},
  author={Falcon, Alex and Sudhakaran, Swathikiran and Serra, Giuseppe and Escalera, Sergio and Lanz, Oswald},
  booktitle={Proceedings of the 2022 International Conference on Multimedia Retrieval},
  year={2022}
}
```
MIT License