The official implementation of *StimuVAR: Spatiotemporal Stimuli-aware Video Affective Reasoning with Multimodal Large Language Models*, accepted to IJCV.
- Clone this repository and navigate to the StimuVAR folder:

```shell
git clone https://github.com/EthanG97/StimuVAR.git
cd StimuVAR
```

- Install packages:

```shell
conda create -n stimuvar python=3.9 -y
conda activate stimuvar
pip install --upgrade pip
pip install -r requirements.txt
```

In this work, we conduct experiments on four datasets: VCE, VE-8, YF-6, and EmoSet.
The following examples use the VCE dataset for demonstration purposes.
We provide a preprocessing script to extract motion-salient frames from video clips using dense optical flow analysis. This is particularly useful when preparing frame-level stimuli that capture rapid, key events—often corresponding to dramatic changes in a video's visual dynamics.
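The idea behind motion-salient frame selection can be sketched as follows. This is a simplified stand-in that scores frames by raw frame differencing rather than the dense optical flow the script actually computes, and the function names are illustrative, not the repo's API:

```python
import numpy as np

# NOTE: illustrative sketch only. The repo's extract_frames.py uses dense
# optical flow; here plain frame differencing stands in for flow magnitude.
def motion_saliency_scores(frames):
    """Score each frame by mean absolute change from its predecessor."""
    scores = [0.0]  # the first frame has no predecessor
    for prev, cur in zip(frames, frames[1:]):
        diff = np.abs(cur.astype(np.float32) - prev.astype(np.float32))
        scores.append(float(diff.mean()))
    return scores

def select_salient_frames(frames, total_frames=6):
    """Return indices of the most motion-salient frames, in temporal order."""
    scores = motion_saliency_scores(frames)
    top = np.argsort(scores)[::-1][:total_frames]
    return sorted(int(i) for i in top)
```

Frames with abrupt visual change receive the highest scores, so the selected subset concentrates on the video's key events.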
```shell
python helpers/extract_frames.py \
    --input_json helpers/alltrain.json \
    --video_root /path/to/videos \
    --output_dir /path/to/output_frames \
    --total_frames 6
```

The training pipeline for StimuVAR consists of two sequential stages:
- Stage 1: Visual feature alignment
- Stage 2: Emotion reasoning based on aligned features
You can download the necessary resources from Google Drive:
```shell
# Stage 1: Train for visual feature alignment
torchrun --nproc_per_node=1 train.py --config config/Stage1.yaml

# Stage 2: Train for emotion reasoning
# (Use the Stage 1 model as the base model, specified in the Stage2 config)
torchrun --nproc_per_node=1 train.py --config config/Stage2.yaml
```

To run inference with a trained StimuVAR model:
- Update the configuration: open `config/Inference.yaml` and set `model_path` to the checkpoint of your trained Stage 2 model.
- Run the inference script:

```shell
torchrun --nproc_per_node=1 inference.py --config config/Inference.yaml
```
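For reference, the configuration edit might look like the fragment below; `model_path` is the only key this README confirms, and the value shown simply reuses the Stage 2 checkpoint path from the demo example:

```yaml
# config/Inference.yaml (only model_path is confirmed by this README;
# any other keys in the file are repo-specific)
model_path: checkpoints/stage2/checkpoint-150000
```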
To run the demo on a single video, fill in the model path, config file, and video name:

```shell
python demo.py \
    --video assets/sample_video.mp4 \
    --model_path checkpoints/stage2/checkpoint-150000 \
    --config config/Inference.yaml
```

To compute CLIP scores between model responses and the extracted test-set frames:

```shell
python Metrics/Clip_score/clip_score.py \
    --response_path /path/to/model_responses.jsonl \
    --img_dir /path/to/extracted_test_set_frames/
```
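As background, the standard CLIPScore metric (Hessel et al., 2021) is the cosine similarity between CLIP image and text embeddings, clipped at zero and rescaled by 2.5. A minimal numpy sketch, assuming precomputed embedding vectors (whether the repo's script applies the same rescaling is not confirmed here):

```python
import numpy as np

def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore-style similarity: w * max(cos(image, text), 0).

    image_emb / text_emb are 1-D vectors; in practice they come from a
    CLIP image encoder and text encoder (placeholders here).
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return w * max(float(image_emb @ text_emb), 0.0)
```

Clipping at zero keeps the score non-negative, since negative cosine similarity carries little signal for caption quality.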
To average the resulting CLIP scores:

```shell
python Metrics/Clip_score/ave.py \
    --filename /path/to/_clipscore.json
```

To evaluate responses with an LLM judge:

```shell
python Metrics/LLM_judge/LLMjudge.py </path/to/model_responses.jsonl> <output_file>
```

To run the Doubly evaluation:

```shell
python Metrics/Doubly/gpt_predict.py --response_path /path/to/model_responses.jsonl

# Emotion prediction based on reason using GPT
python Metrics/Doubly/emo_align.py --response_path /path/to/model_responses.jsonl
```
Special thanks to Valley for providing high-quality code, which served as the foundation for our implementation.
If you find this project helpful in your research, please consider citing our paper:
@article{guo2025stimuvar,
title = {StimuVAR: Spatiotemporal Stimuli-aware Video Affective Reasoning with Multimodal Large Language Models},
author = {Guo, Yuxiang and Siddiqui, Faizan and Zhao, Yang and Chellappa, Rama and Lo, Shao-Yuan},
journal = {International Journal of Computer Vision},
pages = {1--17},
year = {2025},
publisher = {Springer}
}