[CVPR2026 HIGHLIGHT] EgoAVU, [ICASSP Oral] Exploring Audio Hallucination in Egocentric Video Understanding
Official Implementation of EgoAVU: Egocentric Audio-Visual Understanding and Exploring Audio Hallucination in Egocentric Video Understanding
[Paper (CVPR)] [Paper (ICASSP)] [Project Page] [Huggingface Dataset]
We introduce EgoAVU, a scalable, automated data engine for egocentric audio-visual understanding. EgoAVU enriches existing egocentric narrations by integrating human actions with environmental context, explicitly linking visible objects to the sounds produced by interactions and by the surrounding scene. Leveraging this pipeline, we construct EgoAVU-Instruct (3M QA pairs) and EgoAVU-Bench (3K verified QA pairs), enabling systematic training and evaluation of Multimodal Large Language Models (MLLMs). Models finetuned on EgoAVU-Instruct exhibit strong audio-visual grounding in egocentric settings.
If you find our code useful for your research, please consider citing:
```bibtex
@article{seth2026egoavu,
  title={EgoAVU: Egocentric Audio-Visual Understanding},
  author={Seth, Ashish and Mei, Xinhao and Zhao, Changsheng and Nagaraja, Varun and Chang, Ernie and Meyer, Gregory P and Lan, Gael Le and Xiong, Yunyang and Chandra, Vikas and Shi, Yangyang and others},
  journal={arXiv preprint arXiv:2602.06139},
  year={2026}
}

@inproceedings{11460380,
  author={Seth, Ashish and Mei, Xinhao and Zhao, Changsheng and Nagaraja, Varun and Chang, Ernie and Meyer, Gregory P. and Le Lan, Gael and Xiong, Yunyang and Chandra, Vikas and Shi, Yangyang and Manocha, Dinesh and Cai, Zhipeng},
  booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Exploring Audio Hallucination in Egocentric Video Understanding},
  year={2026},
  pages={22527-22531},
  doi={10.1109/ICASSP55912.2026.11460380}
}
```
The EgoAVU data engine pipeline consists of five stages that transform raw egocentric videos into rich audio-visual QA pairs. Use the `run_pipeline.sh` script to execute the complete pipeline or individual stages.
| Stage | Description | Output |
|---|---|---|
| 1. Split Videos | Split video+audio into 10-second segments | Split video files + CSV with paths |
| 2. Generate Captions | Generate video, sound, and object descriptions using Qwen2.5-Omni and Qwen2.5-VL | JSONL with multimodal captions |
| 3. Generate MCG | Create Multimodal Context Graph linking objects and sounds | JSONL with scene graphs |
| 4. Generate AV Narration | Combine MCG with captions to create unified narration | JSONL with combined narrations |
| 5. Generate QA Pairs | Generate QA pairs for 7 different task types | JSONL files per QA type |
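
For reference, the segmentation in stage 1 can be approximated with ffmpeg's segment muxer. This is an illustrative sketch only, not the splitter that `run_pipeline.sh --split` actually invokes; the `.mp4` extension and output naming scheme are assumptions:

```python
import subprocess
from pathlib import Path

def split_into_segments(video_path: str, out_dir: str, chunk_duration: float = 10.0) -> None:
    """Split a video (keeping its audio track) into fixed-length segments with ffmpeg."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    stem = Path(video_path).stem
    subprocess.run(
        [
            "ffmpeg", "-i", video_path,
            "-map", "0",                     # keep all streams, including audio
            "-c", "copy",                    # no re-encoding; cuts snap to keyframes
            "-f", "segment",
            "-segment_time", str(chunk_duration),
            "-reset_timestamps", "1",
            str(Path(out_dir) / f"{stem}_%04d.mp4"),  # hypothetical naming scheme
        ],
        check=True,
    )

split_into_segments("./media/input/video_uuid_1.mp4", "./media/split")
```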
Prepare a CSV file with video information (see `utils/sample_vid.csv` for reference):

```csv
id,start_time,end_time,split
video_uuid_1,0,360,train
video_uuid_2,360,720,train
```

- `id`: Video filename (without extension)
- `start_time`: Start timestamp in seconds
- `end_time`: End timestamp in seconds
- `split`: Data split (train/val/test)
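
Before launching the pipeline, the CSV can be sanity-checked with a short standard-library script. This is a convenience sketch, not part of the repo; the only assumption beyond the format above is that each source video is stored as `<id>.<extension>` under `INPUT_VIDEO_DIR`:

```python
import csv
from pathlib import Path

REQUIRED_COLUMNS = {"id", "start_time", "end_time", "split"}
VALID_SPLITS = {"train", "val", "test"}

def validate_video_csv(csv_path: str, video_dir: str) -> None:
    """Check column names, timestamps, and that each listed video exists on disk."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"CSV is missing columns: {sorted(missing)}")
        for line_no, row in enumerate(reader, start=2):
            if float(row["end_time"]) <= float(row["start_time"]):
                raise ValueError(f"Row {line_no}: end_time must be greater than start_time")
            if row["split"] not in VALID_SPLITS:
                raise ValueError(f"Row {line_no}: unknown split '{row['split']}'")
            # Assumes videos are stored as <id>.<extension> under INPUT_VIDEO_DIR.
            if not list(Path(video_dir).glob(f"{row['id']}.*")):
                print(f"Warning (row {line_no}): no file matching id '{row['id']}' in {video_dir}")

validate_video_csv("./utils/sample_vid.csv", "./media/input")
```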
Run the complete pipeline:

```bash
./run_pipeline.sh --all
```

Or run individual stages:

```bash
# Stage 1: Split videos into segments
./run_pipeline.sh --split

# Stage 2: Generate captions using Qwen2.5-Omni
./run_pipeline.sh --caption

# Stage 3: Generate Multimodal Context Graph (MCG)
./run_pipeline.sh --mcg

# Stage 4: Generate combined audio-visual narration
./run_pipeline.sh --av-narration

# Stage 5: Generate QA pairs for all task types
./run_pipeline.sh --qa
```

To resume from an intermediate stage:

```bash
# Resume from MCG generation (runs stages 3, 4, 5)
./run_pipeline.sh --from-mcg

# Resume from AV narration (runs stages 4, 5)
./run_pipeline.sh --from-av
```

Configure the pipeline using environment variables:

| Variable | Default | Description |
|---|---|---|
| `CSV_FILE` | `./utils/sample_vid.csv` | Input CSV with video metadata |
| `INPUT_VIDEO_DIR` | `./media/input` | Directory containing source videos |
| `SPLIT_OUTPUT_DIR` | `./media/split` | Output directory for split segments |
| `CAPTION_OUTPUT_DIR` | `./outputs/captions` | Output directory for captions |
| `MCG_OUTPUT_DIR` | `./outputs/mcg` | Output directory for MCG |
| `AV_NARRATION_OUTPUT_DIR` | `./outputs/av_narration` | Output directory for AV narrations |
| `QA_OUTPUT_DIR` | `./outputs/qa` | Output directory for QA pairs |

| Variable | Default | Description |
|---|---|---|
| `QWEN_MODEL_PATH` | `Qwen/Qwen2.5-Omni-7B` | Path to Qwen model for caption generation |
| `LLM_MODEL_ID` | `meta-llama/Meta-Llama-3-70B` | LLM for MCG, narration, and QA generation |

| Variable | Default | Description |
|---|---|---|
| `NUM_GPUS` | `4` | Number of GPUs for distributed processing |
| `CHUNK_DURATION` | `10.0` | Duration of video segments in seconds |
| `BATCH_SIZE` | `64` | Batch size for caption generation |
| `SPLIT_WORKERS` | `4` | Parallel workers for video splitting |
| `MAX_NEW_TOKENS` | `512` | Maximum tokens for LLM generation |
| `TEMPERATURE` | `0.7` | Sampling temperature for LLM |
Example with a custom configuration:

```bash
CSV_FILE=./my_data/videos.csv \
INPUT_VIDEO_DIR=./my_data/raw_videos \
NUM_GPUS=8 \
LLM_MODEL_ID=meta-llama/Meta-Llama-3-70B \
./run_pipeline.sh --all
```

The pipeline generates QA pairs for seven audio-visual understanding tasks. To replicate the results reported in our ICASSP paper, generate and evaluate only the AVH tasks (Action, Object, and Sound):

| Task | Prompt File | Description |
|---|---|---|
| AVDN | `prompt_avdn.txt` | Dense narration summarizing the entire video |
| AVH-Action | `prompt_avh_action.txt` | Hallucination detection for actions |
| AVH-Object | `prompt_avh_object.txt` | Hallucination detection for objects |
| AVH-Sound | `prompt_avh_sound.txt` | Hallucination detection for sounds |
| SSA | `prompt_ssa.txt` | Sound-source association reasoning |
| TR-Before/After | `prompt_tr_before_after.txt` | Temporal reasoning about event order |
| TR-Event Ordering | `prompt_tr_event_ordering.txt` | Temporal reasoning about event sequences |
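
For convenience, the stage-5 outputs can be loaded and filtered down to the three AVH tasks with a short script. The sketch below assumes each QA type is written to a JSONL file named after the task under `QA_OUTPUT_DIR`; adjust the glob and file stems if your layout differs:

```python
import json
from pathlib import Path

QA_OUTPUT_DIR = Path("./outputs/qa")
# Tasks used for the ICASSP audio-hallucination results (hypothetical file stems).
AVH_TASKS = {"avh_action", "avh_object", "avh_sound"}

def load_qa_files(qa_dir: Path) -> dict:
    """Load every per-task JSONL file, keyed by the task name taken from the filename."""
    qa_by_task = {}
    for jsonl_file in sorted(qa_dir.glob("*.jsonl")):
        with open(jsonl_file) as f:
            qa_by_task[jsonl_file.stem] = [json.loads(line) for line in f if line.strip()]
    return qa_by_task

qa_by_task = load_qa_files(QA_OUTPUT_DIR)
for task, pairs in qa_by_task.items():
    print(f"{task}: {len(pairs)} QA pairs")

# Keep only the hallucination-detection tasks when replicating the ICASSP numbers.
avh_only = {task: pairs for task, pairs in qa_by_task.items() if task in AVH_TASKS}
```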
We use LlamaFactory for all training experiments.
- Install LlamaFactory following their official installation guide
- Use the provided configuration files for training:
  - LoRA fine-tuning: `.train/qwen2_5omni_lora_sft.yml`
  - Full fine-tuning: `.train/qwen2_5omni_full_sft.yml`
Then launch training with:

```bash
# LoRA fine-tuning
llamafactory-cli train .train/qwen2_5omni_lora_sft.yml

# Full fine-tuning
llamafactory-cli train .train/qwen2_5omni_full_sft.yml
```
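
Note that the EgoAVU-Instruct QA pairs must first be exposed to LlamaFactory as a registered dataset. The sketch below converts a QA JSONL into an alpaca-style JSON with a `videos` field; the input field names (`question`, `answer`, `video`) are assumptions about the QA schema, and the exact multimodal format plus the `dataset_info.json` registration should be checked against the LlamaFactory documentation:

```python
import json
from pathlib import Path

def qa_jsonl_to_llamafactory(jsonl_path: str, out_json: str) -> None:
    """Convert EgoAVU QA pairs into an alpaca-style list that LlamaFactory can ingest."""
    records = []
    with open(jsonl_path) as f:
        for line in f:
            qa = json.loads(line)  # field names "question"/"answer"/"video" are assumptions
            records.append({
                "instruction": "<video>" + qa["question"],  # <video> marks where the clip is injected
                "input": "",
                "output": qa["answer"],
                "videos": [qa["video"]],
            })
    Path(out_json).parent.mkdir(parents=True, exist_ok=True)
    Path(out_json).write_text(json.dumps(records, indent=2))

qa_jsonl_to_llamafactory("./outputs/qa/avdn.jsonl", "./data/egoavu_avdn.json")
```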
To compute standard captioning metrics such as METEOR and ROUGE-L, run:

```bash
python evaluation/captioning_eval.py \
    --json_dir /path/to/jsons \
    --output_csv /path/to/results.csv \
    --categories avsn avdn
```
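
For reference, both metrics can be computed per sample with the `nltk` and `rouge-score` packages. This is a standalone sketch of the underlying metric calls, not a substitute for `evaluation/captioning_eval.py`:

```python
# pip install nltk rouge-score
import nltk
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

nltk.download("wordnet", quiet=True)  # METEOR needs WordNet for synonym matching

def caption_metrics(reference: str, prediction: str) -> dict:
    """Compute METEOR and ROUGE-L F1 for a single reference/prediction pair."""
    meteor = meteor_score([reference.split()], prediction.split())
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = scorer.score(reference, prediction)["rougeL"].fmeasure
    return {"METEOR": meteor, "ROUGE-L": rouge_l}

print(caption_metrics(
    "the person chops vegetables while water runs in the sink",
    "a person is chopping vegetables as the sink runs",
))
```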
We use Qwen3-235B-A22B-Instruct-2507 for LLM-as-judge evaluation. Run:

```bash
python evaluation/llm_as_judge.py \
    --input_dir /path/to/jsons \
    --output_csv /path/to/results.csv \
    --temperature 0.0 \
    --max_new_tokens 512
```

EgoAVU is FAIR CC-BY-NC licensed, as found in the LICENSE file.




