LSTP-Chat: Language-guided Spatial-Temporal Prompt Learning for Video Chat

PyTorch Lightning | Config: Hydra

Paper

Updates

  • (2024.02.27) Paper released; check it out on arXiv.
  • (2024.02.26) Initial Release (´▽`ʃ♡ƪ)

Overview

This is a chat agent based on our work LSTP: Language-guided Spatial-Temporal Prompt Learning for Long-form Video-Text Understanding. The model is fine-tuned on the video-instruction data from Video-ChatGPT and the image-instruction data from LLaVA.

We support two architectural paradigms: an encoder-decoder architecture (BLIP2-Flan-T5-xl) and a decoder-only architecture (InstructBLIP-Vicuna-7B). We also provide code to tune the LLM with LoRA.

Installation

# clone project
git clone https://github.com/bigai-nlco/LSTP-Chat
cd LSTP-Chat

# create conda environment (Python 3.10 is an assumed default; adjust as needed)
conda create -n LSTP python=3.10
conda activate LSTP

# install requirements
pip install -r requirements.txt
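
After installation, a quick sanity check can confirm the environment; this is a minimal sketch that only assumes PyTorch is pulled in by requirements.txt:

# verify that PyTorch imports and that a GPU is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"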

Data Preparation

You can download all the instruction data and evaluation data from Video-LLaVA/DATA

inputs/ivinstruct
├── llava_image_tune
└── videochatgpt_tune
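
A minimal sketch for preparing this layout (directory names follow the tree above; the data itself is downloaded from Video-LLaVA/DATA):

# create the expected instruction-data layout
mkdir -p inputs/ivinstruct/llava_image_tune inputs/ivinstruct/videochatgpt_tune
# then move the downloaded tuning data into the matching subdirectories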

How to run

Our training framework offers scripts for both local runs and Slurm cluster jobs.

Train model

# run on local
python src/train.py experiment=LSTP_SF_blip2flant5xl_videoinstruct # blip2-flan-t5-xl + video-instruct
python src/train.py experiment=LSTP_SF_instructblipvicuna7b_videoinstruct # instructblip-vicuna-7b + video-instruct

# run on cluster
sbatch scripts/videoinstruct_train.slurm # blip2-flan-t5-xl + video-instruct
sbatch scripts/videoinstruct_vicuna_train.slurm # instructblip-vicuna-7b + video-instruct

For those with limited GPU resources, we also provide a pipeline that shortens the training procedure:

# step 1: generate the pseudo labels from the base model, and extract the optical flow in advance

# step 2: train the temporal sampler
python src/train.py experiment=LSTP_TG_blip2flant5xl_videoinstruct

# step 3: train LSTP with fixed temporal sampler
python src/train.py experiment=LSTP_blip2flant5xl_ivinstruct # blip2-flan-t5-xl + video-instruct + image-instruct
python src/train.py experiment=LSTP_instructblipvicuna7b_ivinstruct # instructblip-vicuna-7b + video-instruct + image-instruct
python src/train.py experiment=LSTP_blip2flant5xl_ivtinstruct # blip2-flan-t5-xl (LoRA) + video-instruct + image-instruct + text-instruct
python src/train.py experiment=LSTP_instructblipvicuna7b_ivtinstruct # instructblip-vicuna-7b (LoRA) + video-instruct + image-instruct + text-instruct

Evaluate model

# run inference for LSTP-Vicuna-7B
bash eval/scripts/run_qa_msvd_vicuna.sh
bash eval/scripts/run_qa_msrvtt_vicuna.sh
bash eval/scripts/run_qa_activitynet_vicuna.sh

# run inference for LSTP-Flan-T5-xl
bash eval/scripts/run_qa_msvd.sh
bash eval/scripts/run_qa_msrvtt.sh
bash eval/scripts/run_qa_activitynet.sh

# run evaluation
bash eval/scripts/eval_qa_msvd.sh
bash eval/scripts/eval_qa_msrvtt.sh
bash eval/scripts/eval_qa_activitynet.sh
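
Inference and scoring for a single benchmark can also be chained; a minimal sketch for MSVD with the Flan-T5-xl variant (script names as listed above):

# run MSVD inference, then score the predictions
bash eval/scripts/run_qa_msvd.sh && bash eval/scripts/eval_qa_msvd.sh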

Configuration

data:
  - text_dir
  - video_dir
  - processor_name
  - sampler_processor_name
  - nframe # final sampled frames
  - target_size # image size
  - batch_size
model:
  - model_name_or_path
  - sampler_name_or_path
  - of_extractor_name_or_path
  - optimizer
  - scheduler
  - generate_configs
path:
  - data_dir
  - video_dir
  - text_dir
  - output_dir
trainer: 
  - strategy
  - accelerator
  - devices
  - num_nodes
  - precision
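
Because the project is configured with Hydra, these fields can be overridden from the command line. A minimal sketch, assuming the override paths mirror the groups listed above (the experiment name is taken from the training section):

# override selected config fields at launch time
python src/train.py experiment=LSTP_SF_blip2flant5xl_videoinstruct \
  data.batch_size=4 data.nframe=8 trainer.devices=2 trainer.precision=16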

Evaluation Results

Metrics: Accuracy/Score

| Methods | LLM size | MSVD-QA | MSRVTT-QA | ActivityNet-QA |
|---------------|----------|----------|-----------|----------------|
| FrozenBiLM | 1B | 32.2/- | 16.8/- | 24.7/- |
| VideoChat | 7B | 56.4/2.8 | 45.0/2.5 | -/2.2 |
| LLaMA-Adapter | 7B | 54.9/3.1 | 43.8/2.7 | 34.2/2.7 |
| Video-LLaMA | 7B | 51.6/2.5 | 29.6/1.8 | 12.4/1.1 |
| Video-ChatGPT | 7B | 64.9/3.3 | 49.3/2.8 | 35.2/2.7 |
| Video-LLaVA | 7B | 70.7/3.9 | 59.2/3.5 | 45.3/3.3 |
| LSTP-7B | 7B | 71.3/3.9 | 57.3/3.3 | 43.9/3.3 |

Demo

We provide a chat demo built with Gradio. We also provide checkpoints; download them and place them in ckpts/LSTP-Chat/, then launch the demo:

python -m demo.demo

Model Zoo

| Model | Base Model | Training Data | Strategy for LLM | Download Link |
|---------|------------------------|----------------------|------------------|---------------|
| LSTP-7B | InstructBlip-Vicuna-7B | Video-ChatGPT, LLaVA | fixed | Huggingface |
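
To fetch a released checkpoint into the directory the demo expects, one option is the huggingface_hub CLI; this is a sketch, and <repo-id> stands for the repository linked in the table above:

# download the checkpoint into ckpts/LSTP-Chat/
huggingface-cli download <repo-id> --local-dir ckpts/LSTP-Chat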

Acknowledgement

Citation

@misc{wang2024lstp,
    title={LSTP: Language-guided Spatial-Temporal Prompt Learning for Long-form Video-Text Understanding},
    author={Yuxuan Wang and Yueqian Wang and Pengfei Wu and Jianxin Liang and Dongyan Zhao and Zilong Zheng},
    year={2024},
    eprint={2402.16050},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}