Dingyi Yang, Chunru Zhan, Ziheng Wang, Biao Wang, Tiezheng Ge, Bo Zheng, Qin Jin
Video storytelling is engaging multimedia content that combines video with accompanying narration to share a story and attract an audience; a key challenge is creating narrations for recorded visual scenes. Previous studies on dense video captioning and video story generation have made some progress. However, practical applications typically require synchronized narrations for ongoing visual scenes.
In this work, we introduce the new task of Synchronized Video Storytelling, which aims to generate synchronous and informative narrations for videos. These narrations, one per video clip, should relate to the visual content, integrate relevant knowledge, and have a word count appropriate to the clip's duration. Moreover, a structured storyline is beneficial for guiding the generation process, ensuring coherence and integrity. To support exploration of this task, we introduce a new benchmark dataset, E-SyncVidStory, with rich annotations. Since existing multimodal LLMs are not effective at this task in one-shot or few-shot settings, we propose a framework named VideoNarrator, which can generate a storyline for the input video and simultaneously generate narrations guided by the generated or a predefined storyline. We further introduce a set of evaluation metrics to thoroughly assess the generation. Both automatic and human evaluations validate the effectiveness of our approach.
- 2024/07/17: Our code and dataset annotations are released. Video features will be available soon.
- Clone this repository
git clone https://github.com/alibaba/alimama-video-narrator
cd alimama-video-narrator
- Install Package
pip install --upgrade pip
conda env create -f environment.yml
- File
Our annotations can be found at "/data/all_video_data.json".
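For a quick look at the annotation format, the snippet below simply loads the JSON and prints one sample entry (run from the repository root; the exact field names are whatever the released file contains):

```python
# Inspect the released annotations; the schema (field names, per-clip
# structure) follows data/all_video_data.json itself.
import json

with open("data/all_video_data.json", encoding="utf-8") as f:
    data = json.load(f)

print(type(data), len(data))  # number of annotated entries
sample = data[0] if isinstance(data, list) else next(iter(data.values()))
print(sample)                 # one sample entry, to see the available fields
```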
- Data Process
Due to copyright considerations, we will release extracted features of the original videos rather than the raw videos (coming soon).
If you want to extract features from raw videos yourself, download all videos and store them in "/data_process/all_videos/", then extract the video features:
cd data_process/
python process_video.py
python get_blip_fea.py
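If you just want to prototype feature extraction outside the provided scripts, here is a minimal sketch (not process_video.py / get_blip_fea.py): it uniformly samples frames with OpenCV and encodes them with a HuggingFace BLIP checkpoint. The checkpoint name, frame count, and video path are placeholders; match them to your setup.

```python
# Hedged sketch of per-frame BLIP feature extraction (illustrative only).
import cv2
import numpy as np
import torch
from transformers import BlipProcessor, BlipModel

CKPT = "Salesforce/blip-image-captioning-base"  # placeholder checkpoint
processor = BlipProcessor.from_pretrained(CKPT)
model = BlipModel.from_pretrained(CKPT).eval()

def sample_frames(video_path, num_frames=8):
    """Uniformly sample num_frames RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

@torch.no_grad()
def extract_features(video_path):
    frames = sample_frames(video_path)
    inputs = processor(images=frames, return_tensors="pt")
    # (num_frames, feature_dim) image embeddings
    return model.get_image_features(pixel_values=inputs["pixel_values"])

feats = extract_features("all_videos/example.mp4")  # placeholder path
print(feats.shape)
```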
Get the training data:
cd data_process/
# Visual Compression & Memory Consolidation
python get_training_data.py ./blip_fea/video_cuts/ ../data/all_video_data.json
cp training_data.json ../data/split/
cd ../data/split/
python split.py training_data.json ../all_video_data.json
python get_cut_data.py train.json train_shots.json
We build on the pretrained model firefly-baichuan-7b; details are available at: https://github.com/yangjianxin1/Firefly
You can directly use the Baichuan-7B model, downloaded from: https://huggingface.co/baichuan-inc/Baichuan-7B
Run the following shell script to train your model:
bash train.sh
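train.sh wraps the Firefly training pipeline. If you only want a rough picture of how a LoRA adapter is attached to a Baichuan-style model, here is an illustrative sketch with peft; the hyperparameters and target modules below are placeholders, not the settings used in train.sh.

```python
# Illustrative LoRA setup for a Baichuan-7B-style causal LM (not train.sh).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL = "baichuan-inc/Baichuan-7B"  # or a local firefly-baichuan-7b checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, trust_remote_code=True
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # placeholder hyperparameters
    target_modules=["W_pack"],               # Baichuan's fused q/k/v projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()           # only the adapter weights are trained
```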
Run the following shell script for inference. Set 'offered_label' to False to generate narrations based on the model-generated storyline, or to True to use the ground-truth (user-provided) storyline.
bash infer.sh
- Standard metrics such as BLEU and CIDEr
python tokenize_output.py $chk_path/output.json
cd metrics/evaluator_for_caption/
python evaluate_ads.py $chk_path/out_tokens.json
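evaluate_ads.py handles our data format; the core scoring is standard COCO-caption evaluation. A minimal example with pycocoevalcap is shown below (the ids and sentences are made up; inputs must be whitespace-tokenized, as produced by tokenize_output.py):

```python
# Minimal BLEU / CIDEr scoring with pycocoevalcap (illustrative inputs).
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# Both dicts map a sample id to a list of whitespace-tokenized sentences.
gts = {"video_0#clip_0": ["这 款 面霜 质地 轻盈"]}        # reference narrations
res = {"video_0#clip_0": ["这 款 面霜 的 质地 很 轻盈"]}  # generated narration

bleu, _ = Bleu(4).compute_score(gts, res)   # list of BLEU-1..BLEU-4 scores
cider, _ = Cider().compute_score(gts, res)
print(bleu, cider)
```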
- Visual Relevance (EMScore & EMScore_ref)
Download the Chinese-CLIP model from: https://huggingface.co/OFA-Sys/chinese-clip-vit-base-patch16
cd metrics/EMScore/
python eval_ad_with_ref.py --inpath $chk_path/output.json
"EMScore(X,V) -> full_F" refers to EMScore;"EMScore(X,V,X*) -> full_F" refers to EMScore_ref 3. Knowledge Relevance
- Knowledge Relevance
Download the chinese-roberta-wwm-ext-large model from: https://huggingface.co/hfl/chinese-roberta-wwm-ext-large
cd metrics/roberta_based/
# info_sim
python info_sim.py ../data/all_video_data.json $chk_path/output.json idf_with_all_ref.json
# info_diverse
python info_diverse.py $chk_path/output.json idf_with_all_ref.json
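Both scripts rely on RoBERTa-based sentence representations together with the IDF weights in idf_with_all_ref.json. As a rough picture of the underlying similarity, here is a mean-pooled embedding comparison with chinese-roberta-wwm-ext-large (IDF weighting and the scripts' own aggregation are omitted; the example sentences are placeholders):

```python
# Mean-pooled RoBERTa sentence similarity (simplified; no IDF weighting).
import torch
from transformers import AutoModel, AutoTokenizer

CKPT = "hfl/chinese-roberta-wwm-ext-large"
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModel.from_pretrained(CKPT).eval()

@torch.no_grad()
def embed(texts):
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    hidden = model(**inputs).last_hidden_state     # (batch, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)  # ignore padding tokens
    emb = (hidden * mask).sum(1) / mask.sum(1)     # mean pooling
    return emb / emb.norm(dim=-1, keepdim=True)

narration = embed(["这款精华含有烟酰胺成分"])   # placeholder generated narration
knowledge = embed(["烟酰胺有助于提亮肤色"])     # placeholder knowledge snippet
print(float(narration @ knowledge.T))           # cosine similarity
```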
- Fluency (intra-story repetition)
cd metrics/roberta_based/
python count_intra_repeat.py $chk_path/output.json
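count_intra_repeat.py measures how much each narration repeats earlier narrations of the same story. A simplified character n-gram version looks roughly like this (the exact definition used in the script may differ):

```python
# Simplified intra-story repetition: 4-gram overlap with earlier narrations.
def ngrams(text, n=4):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def intra_repetition(narrations, n=4):
    scores = []
    for i in range(1, len(narrations)):
        cur = ngrams(narrations[i], n)
        prev = set().union(*(ngrams(s, n) for s in narrations[:i]))
        if cur:
            scores.append(len(cur & prev) / len(cur))  # fraction of repeated n-grams
    return sum(scores) / len(scores) if scores else 0.0

print(intra_repetition(["这款面霜质地轻盈", "面霜质地轻盈不油腻", "适合日常通勤使用"]))
```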
If you find our work useful for your research and applications, please cite using this BibTeX:
@misc{yang2024synchronizedvideostorytellinggenerating,
title={Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline},
author={Dingyi Yang and Chunru Zhan and Ziheng Wang and Biao Wang and Tiezheng Ge and Bo Zheng and Qin Jin},
year={2024},
eprint={2405.14040},
archivePrefix={arXiv},
primaryClass={cs.MM},
url={https://arxiv.org/abs/2405.14040},
}