Eshika Khandelwal¹, Junyu Xie², Tengda Han², Max Bain², Arsha Nagrani², Andrew Zisserman², Gül Varol³, Makarand Tapaswi¹

¹ CVIT, IIIT Hyderabad
² Visual Geometry Group, University of Oxford
³ LIGM, École des Ponts ParisTech
Our framework is evaluated on several AD datasets; the commands below use CMD-AD (`--dataset="cmdad"`) as the running example.
First, run the VLM (VideoLLaMA2) over the annotated intervals:

```bash
# Example values:
#   {dataset}:       "cmdad"
#   {anno_path}:     "resources/annotations/cmdad_anno_with_face_0.2_0.4.csv"
#   {charbank_path}: "resources/charbanks/cmdad_charbank.json"
python vlm/main.py \
    --dataset={dataset} \
    --video_dir={video_dir} \
    --anno_path={anno_path} \
    --charbank_path={charbank_path} \
    --model_path={videollama2_ckpt_path} \
    --output_dir={output_dir}
```
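To run this stage over more than one dataset, a minimal driver like the following works; everything below other than the CLI flags shown above (dataset entries, paths) is a placeholder for your local setup:

```python
# Minimal sketch: run vlm/main.py over several datasets via subprocess.
# All paths are placeholders; only the CLI flags come from the command above.
import subprocess

DATASETS = {
    "cmdad": {
        "video_dir": "/path/to/cmdad/videos",  # placeholder
        "anno_path": "resources/annotations/cmdad_anno_with_face_0.2_0.4.csv",
        "charbank_path": "resources/charbanks/cmdad_charbank.json",
    },
    # Add further datasets supported by --dataset here.
}

for name, cfg in DATASETS.items():
    subprocess.run(
        [
            "python", "vlm/main.py",
            f"--dataset={name}",
            f"--video_dir={cfg['video_dir']}",
            f"--anno_path={cfg['anno_path']}",
            f"--charbank_path={cfg['charbank_path']}",
            "--model_path=/path/to/VideoLLaMA2",  # placeholder checkpoint path
            f"--output_dir=outputs/vlm/{name}",
        ],
        check=True,
    )
```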
Next, run the LLM stage in two passes: the first pass (`--prompt_idx=0`) takes the VLM results and produces the summarised results, which the second pass (`--prompt_idx=1`) then consumes:

```bash
python llm/main.py \
    --path={vlm_result_path} \
    --prompt_idx=0

python llm/main.py \
    --path={summarised_result_path} \
    --prompt_idx=1
```
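The two passes can be chained in a small script; the paths below are placeholders (not the repository's actual output naming), so point the intermediate path at whatever file the first pass writes in your setup:

```python
# Sketch: chain the two LLM passes. Both paths are placeholders.
import subprocess

vlm_result_path = "outputs/vlm/cmdad/results.json"            # placeholder
summarised_result_path = "outputs/llm/cmdad_summarised.json"  # placeholder

subprocess.run(
    ["python", "llm/main.py", f"--path={vlm_result_path}", "--prompt_idx=0"],
    check=True,
)
subprocess.run(
    ["python", "llm/main.py", f"--path={summarised_result_path}", "--prompt_idx=1"],
    check=True,
)
```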
Criteria that evaluate each candidate independently, i.e. without needing context from previous intervals, are scored separately; the two such criteria are `ad` and `counts` (a toy sketch of this per-candidate scoring follows the commands):

```bash
python candidate_scorer/independent_scoring/main.py \
    --path={multiple_candidates_result_path} \
    --criterion="ad"

python candidate_scorer/independent_scoring/main.py \
    --path={multiple_candidates_result_path} \
    --criterion="counts"
```
Next, the remaining context-dependent criteria are evaluated while recursively selecting the best candidate per interval (a toy sketch of this selection loop follows the command):

```bash
python candidate_scorer/main.py \
    --path={multiple_candidates_result_path} \
    --story --redundancy --ad --action --other --char
```
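As rough intuition for the recursive selection (again a toy sketch, not the repository's implementation): intervals are processed in order, each candidate is scored against the ADs already chosen, the winner is kept, and it becomes context for the next interval. `score_with_context` is a hypothetical stand-in for the combined context-dependent criteria:

```python
# Toy sketch of recursive (autoregressive) candidate selection.
from typing import Callable

def select_sequence(
    candidates_per_interval: list[list[str]],
    score_with_context: Callable[[str, list[str]], float],  # hypothetical scorer
) -> list[str]:
    selected: list[str] = []
    for candidates in candidates_per_interval:
        # Score each candidate given the sequence selected so far.
        best = max(candidates, key=lambda c: score_with_context(c, selected))
        selected.append(best)  # the winner becomes context for later intervals
    return selected
```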
Note: an additional criterion, `--salience`, is available in the code; it is not part of the original paper.
Finally, evaluate the generated sequences with the StoryRecall and repetition metrics:

```bash
python metrics/storyrecall.py \
    --path={path_to_evaluate}

python metrics/repeat.py \
    --path={path_to_evaluate}
```
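For intuition only (the scripts in `metrics/` are the reference implementation), a repetition-style measure can be thought of as word overlap between consecutive ADs:

```python
# Rough proxy for repetitiveness across a sequence of ADs; NOT metrics/repeat.py.
def consecutive_overlap(ads: list[str]) -> float:
    """Mean Jaccard word overlap between consecutive ADs (higher = more repetitive)."""
    if len(ads) < 2:
        return 0.0
    overlaps = []
    for prev, curr in zip(ads, ads[1:]):
        a, b = set(prev.lower().split()), set(curr.lower().split())
        overlaps.append(len(a & b) / max(len(a | b), 1))
    return sum(overlaps) / len(overlaps)
```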
If you find this repository helpful, please consider citing our work:
```bibtex
@article{khandelwal2025coherentad,
  title={More than a Moment: Towards Coherent Sequences of Audio Descriptions},
  author={Eshika Khandelwal and Junyu Xie and Tengda Han and Max Bain and Arsha Nagrani and Andrew Zisserman and G\"ul Varol and Makarand Tapaswi},
  journal={arXiv preprint arXiv:2510.25440},
  year={2025},
  url={https://arxiv.org/abs/2510.25440}
}
```
If you run into issues or have questions while using this repository, please feel free to reach out.
