Eshika Khandelwal¹, Junyu Xie², Tengda Han², Max Bain², Arsha Nagrani², Andrew Zisserman², Gül Varol³, Makarand Tapaswi¹

¹ CVIT, IIIT Hyderabad
² Visual Geometry Group, University of Oxford
³ LIGM, École des Ponts ParisTech
Our framework is evaluated on several AD datasets; the commands below use CMD-AD (`--dataset="cmdad"`) as the running example.
First, run the VLM (VideoLLaMA2) over the annotated intervals:

```bash
# Example values:
#   {dataset}:       "cmdad"
#   {anno_path}:     "resources/annotations/cmdad_anno_with_face_0.2_0.4.csv"
#   {charbank_path}: "resources/charbanks/cmdad_charbank.json"
python vlm/main.py \
    --dataset={dataset} \
    --video_dir={video_dir} \
    --anno_path={anno_path} \
    --charbank_path={charbank_path} \
    --model_path={videollama2_ckpt_path} \
    --output_dir={output_dir}
```
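To run this stage over more than one dataset, a minimal driver like the following works; everything below other than the CLI flags shown above (dataset entries, paths) is a placeholder for your local setup:

```python
# Minimal sketch: run vlm/main.py over several datasets via subprocess.
# All paths are placeholders; only the CLI flags come from the command above.
import subprocess

DATASETS = {
    "cmdad": {
        "video_dir": "/path/to/cmdad/videos",  # placeholder
        "anno_path": "resources/annotations/cmdad_anno_with_face_0.2_0.4.csv",
        "charbank_path": "resources/charbanks/cmdad_charbank.json",
    },
    # Add further datasets supported by --dataset here.
}

for name, cfg in DATASETS.items():
    subprocess.run(
        [
            "python", "vlm/main.py",
            f"--dataset={name}",
            f"--video_dir={cfg['video_dir']}",
            f"--anno_path={cfg['anno_path']}",
            f"--charbank_path={cfg['charbank_path']}",
            "--model_path=/path/to/VideoLLaMA2",  # placeholder checkpoint path
            f"--output_dir=outputs/vlm/{name}",
        ],
        check=True,
    )
```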
Next, run the LLM stage in two passes: the first pass (`--prompt_idx=0`) takes the VLM results and produces the summarised results, which the second pass (`--prompt_idx=1`) then consumes:

```bash
python llm/main.py \
    --path={vlm_result_path} \
    --prompt_idx=0

python llm/main.py \
    --path={summarised_result_path} \
    --prompt_idx=1
```
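The two passes can be chained in a small script; the paths below are placeholders (not the repository's actual output naming), so point the intermediate path at whatever file the first pass writes in your setup:

```python
# Sketch: chain the two LLM passes. Both paths are placeholders.
import subprocess

vlm_result_path = "outputs/vlm/cmdad/results.json"            # placeholder
summarised_result_path = "outputs/llm/cmdad_summarised.json"  # placeholder

subprocess.run(
    ["python", "llm/main.py", f"--path={vlm_result_path}", "--prompt_idx=0"],
    check=True,
)
subprocess.run(
    ["python", "llm/main.py", f"--path={summarised_result_path}", "--prompt_idx=1"],
    check=True,
)
```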
Criteria that evaluate each candidate independently, i.e. without needing context from previous intervals, are scored separately; the two such criteria are `ad` and `counts` (a toy sketch of this per-candidate scoring follows the commands):

```bash
python candidate_scorer/independent_scoring/main.py \
    --path={multiple_candidates_result_path} \
    --criterion="ad"

python candidate_scorer/independent_scoring/main.py \
    --path={multiple_candidates_result_path} \
    --criterion="counts"
```
Next, the remaining context-dependent criteria are evaluated while recursively selecting the best candidate per interval (a toy sketch of this selection loop follows the command):

```bash
python candidate_scorer/main.py \
    --path={multiple_candidates_result_path} \
    --story --redundancy --ad --action --other --char
```
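As rough intuition for the recursive selection (again a toy sketch, not the repository's implementation): intervals are processed in order, each candidate is scored against the ADs already chosen, the winner is kept, and it becomes context for the next interval. `score_with_context` is a hypothetical stand-in for the combined context-dependent criteria:

```python
# Toy sketch of recursive (autoregressive) candidate selection.
from typing import Callable

def select_sequence(
    candidates_per_interval: list[list[str]],
    score_with_context: Callable[[str, list[str]], float],  # hypothetical scorer
) -> list[str]:
    selected: list[str] = []
    for candidates in candidates_per_interval:
        # Score each candidate given the sequence selected so far.
        best = max(candidates, key=lambda c: score_with_context(c, selected))
        selected.append(best)  # the winner becomes context for later intervals
    return selected
```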
Note: an additional criterion, `--salience`, is available in the code; it is not part of the original paper.
Finally, evaluate the generated sequences with the StoryRecall and repetition metrics:

```bash
python metrics/storyrecall.py \
    --path={path_to_evaluate}

python metrics/repeat.py \
    --path={path_to_evaluate}
```
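For intuition only (the scripts in `metrics/` are the reference implementation), a repetition-style measure can be thought of as word overlap between consecutive ADs:

```python
# Rough proxy for repetitiveness across a sequence of ADs; NOT metrics/repeat.py.
def consecutive_overlap(ads: list[str]) -> float:
    """Mean Jaccard word overlap between consecutive ADs (higher = more repetitive)."""
    if len(ads) < 2:
        return 0.0
    overlaps = []
    for prev, curr in zip(ads, ads[1:]):
        a, b = set(prev.lower().split()), set(curr.lower().split())
        overlaps.append(len(a & b) / max(len(a | b), 1))
    return sum(overlaps) / len(overlaps)
```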
If you find this repository helpful, please consider citing our work:
```bibtex
@article{khandelwal2025coherentad,
  title={More than a Moment: Towards Coherent Sequences of Audio Descriptions},
  author={Eshika Khandelwal and Junyu Xie and Tengda Han and Max Bain and Arsha Nagrani and Andrew Zisserman and G\"ul Varol and Makarand Tapaswi},
  journal={arXiv preprint arXiv:2510.25440},
  year={2025},
  url={https://arxiv.org/abs/2510.25440}
}
```
If you run into issues or have questions while using this repository, please feel free to reach out.
