Official repository for FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries.
FLARE is a benchmark designed to evaluate long-video audiovisual retrieval with realistic user-style queries. The benchmark contains 399 long videos (225.4 h) and 87,697 fine-grained clips, each annotated with vision, audio, and unified captions, together with 274,933 user-style queries (vision-only, audio-only, and cross-modal with a hard bimodal constraint).
The repository provides:
- Clip segmentation pipeline
- Multimodal caption generation framework
- User-style query generation and two-stage filtering
- Unified evaluation harness covering 15 representative retrievers
The FLARE dataset is constructed using a three-stage pipeline:
- Clip Segmentation: segment long videos into fine-grained clips based on visual scene changes, ASR-driven semantics, and spectral novelty of the audio track.
- Caption Generation: generate clip-level vision, audio, and unified captions, followed by hierarchical merging into video-level captions.
- Query Generation: generate user-style queries for each modality and filter them by relevance / non-copy rules and the hard bimodal constraint.
The implementation of these steps is provided in the flare/data_construction/ directory. Evaluation of 15 baselines is provided in flare/evaluation/.
```
FLARE/
│
├── flare/
│   ├── data_construction/
│   │   ├── segmentation/
│   │   │   ├── pyscene_seg.py
│   │   │   ├── asr.py
│   │   │   ├── semantic_seg.py
│   │   │   └── spectral_seg.py
│   │   ├── caption_generation/
│   │   │   ├── sglang_gen.py
│   │   │   ├── sglang_gen_omni.py
│   │   │   ├── llm_judge.py
│   │   │   ├── sglang_vora.py
│   │   │   ├── sglang_unify.py
│   │   │   └── sglang_longmerge.py
│   │   └── query_generation/
│   │       ├── sglang_query.py
│   │       ├── sglang_query_unified.py
│   │       ├── query_filter_stage1.py
│   │       ├── query_filter_stage2.py
│   │       ├── query_unified_filter_stage1.py
│   │       └── query_unified_filter_stage2.py
│   └── evaluation/
│       ├── complete_retrieval_pipeline.py
│       ├── retrieval_adapters.py
│       ├── eval_run.sh
│       └── <model>_config.sh
│
└── README.md
```
The dataset generation pipeline consists of three stages. Please execute them in the following order.
Clip Segmentation → Caption Generation → Query Generation
For scripts that call Qwen models through SGLang, start an OpenAI-compatible SGLang server first, for example:
```bash
python -m sglang.launch_server \
    --model-path /path/to/Qwen3-VL-235B-A22B-Instruct \
    --host 0.0.0.0 \
    --port 30000 \
    --tp 8 \
    --trust-remote-code

export SGLANG_SERVER_URL="http://127.0.0.1:30000/v1"
export SGLANG_CHAT_COMPLETIONS_URL="http://127.0.0.1:30000/v1/chat/completions"
```

Replace --model-path with the model required by the specific script. Our experiments were conducted on a machine with 8 H20 GPUs, each with 96GB memory.
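For reference, a minimal sketch of how a script could talk to the server through its OpenAI-compatible API (the model name and prompt below are placeholders, not the exact ones the scripts use):

```python
import os
from openai import OpenAI

# Point an OpenAI-compatible client at the local SGLang server.
client = OpenAI(base_url=os.environ["SGLANG_SERVER_URL"], api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen3-VL-235B-A22B-Instruct",  # placeholder; pass the served model's name
    messages=[{"role": "user", "content": "Describe the main events in this clip."}],
    temperature=0.2,
)
print(resp.choices[0].message.content)
```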
All scripts configure their I/O through top-level constants at the head of each file. Edit these placeholders ("your input path for ..." / "your output path for ...") before running.
Located at:
flare/data_construction/segmentation/
Run the four scripts below in order. The first performs visual segmentation; the remaining three handle clips that are still too long after visual segmentation by transcribing them, splitting the transcripts into topically coherent spans, and adding spectral-novelty boundaries.
PySceneDetect ContentDetector segmentation. Thresholds and workers are set at the bottom of the script:
| Constant | Description |
|---|---|
| `SOURCE` | Directory containing original long videos |
| `DEST` | Output directory for segmented clips (`<DEST>/<video_id>/Scene-*.mp4`) |
| `max_parallel_workers` | Number of parallel workers |
```bash
cd flare/data_construction/segmentation
python pyscene_seg.py
```

Transcribes clips using Qwen3-ASR-1.7B with a forced aligner, and saves per-file transcript.txt + timestamps.json.
| Constant | Description |
|---|---|
| `INPUT_LIST_FILE` | Text file listing audio paths (one per line) |
| `OUTPUT_ROOT` | Output root; each audio writes to `<OUTPUT_ROOT>/<file_stem>/` |
| `ASR_MODEL_PATH` | Path to the Qwen3-ASR checkpoint |
| `FORCED_ALIGNER_PATH` | Path to the forced aligner checkpoint |
| `DEVICE` | Torch device (default `cuda`) |
Supports LOCAL_RANK / WORLD_SIZE environment variables for multi-GPU sharding.
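The sharding pattern is the usual rank-strided split; a minimal sketch (variable and file names are illustrative, not necessarily those in asr.py):

```python
import os

# Rank-strided sharding: worker i handles every WORLD_SIZE-th file.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

with open("audio_list.txt") as f:              # illustrative INPUT_LIST_FILE
    audio_paths = [line.strip() for line in f if line.strip()]

my_shard = audio_paths[local_rank::world_size]
print(f"rank {local_rank}/{world_size}: {len(my_shard)} files")
```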
```bash
cd flare/data_construction/segmentation

# single GPU
python asr.py

# multi-GPU sharding (8 GPUs)
for i in $(seq 0 7); do
    CUDA_VISIBLE_DEVICES=$i LOCAL_RANK=$i WORLD_SIZE=8 python asr.py &
done
wait
```

Uses Qwen3-235B-A22B-Instruct to segment the ASR transcript into topically coherent paragraphs, then maps them back to temporal spans using the word-level timestamps produced by asr.py. Segments shorter than MIN_DURATION seconds are merged.
| Constant | Description |
|---|---|
| `ASR_OUTPUT_ROOT` | Root directory written by asr.py |
| `AUDIO_LIST_FILE` | Audio list file (must match asr.py) |
| `MAX_WORKERS` | Number of parallel request workers |
Segmented wavs are written as <ASR_OUTPUT_ROOT>/<file_stem>/segment_XXX.wav.
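Conceptually, the paragraph-to-span mapping walks the word-level timestamps in order; a simplified sketch, assuming the LLM returns the transcript words verbatim (the actual alignment in semantic_seg.py may be more robust):

```python
def paragraph_spans(paragraphs, words):
    """paragraphs: list of paragraph strings from the LLM.
    words: list of {"word": str, "start": float, "end": float} from asr.py.
    Assigns each paragraph the time span of the words it consumes, in order."""
    spans, idx = [], 0
    for para in paragraphs:
        n = len(para.split())                 # words consumed by this paragraph
        chunk = words[idx:idx + n]
        spans.append((chunk[0]["start"], chunk[-1]["end"]))
        idx += n
    return spans
```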
```bash
cd flare/data_construction/segmentation
python semantic_seg.py
```

Spectral-flux + KL-divergence novelty detection on the mel-spectrogram (with a robust MAD-based z-score), placing boundaries at non-speech acoustic events (music shifts, applause, ambient changes).
| Constant | Description |
|---|---|
| `TXT_FILE` | Text file listing audio paths to re-segment |
| `NUM_WORKERS` | Number of parallel workers |
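For intuition, a spectral-flux novelty curve with a MAD-based z-score can be computed roughly as follows (a sketch using librosa; the KL-divergence term is omitted and the values shown are illustrative):

```python
import numpy as np
import librosa

y, sr = librosa.load("segment.wav", sr=16000)        # illustrative input
logS = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr))

# Spectral flux: positive frame-to-frame change, summed over mel bands.
flux = np.maximum(np.diff(logS, axis=1), 0).sum(axis=0)

# Robust z-score via the median absolute deviation (MAD).
med = np.median(flux)
mad = np.median(np.abs(flux - med)) + 1e-9
z = (flux - med) / (1.4826 * mad)

# Candidate boundaries where novelty spikes (threshold is illustrative).
boundary_frames = np.where(z > 3.0)[0]
```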
```bash
cd flare/data_construction/segmentation
python spectral_seg.py
```

Located at:
flare/data_construction/caption_generation/
Run the six scripts below in order: generate vision and audio captions in parallel, run the LLM-based quality inspector, re-score audio-heavy clips, fuse the two modalities into a unified caption, and finally merge clip-level unified captions into video-level captions.
Produces per-clip visual captions. Supports resume: existing video_paths in the output JSONL are skipped.
| Constant | Description |
|---|---|
| `VIDEO_DIRECTORY` | Directory of segmented clips (walked recursively) |
| `OUTPUT_JSONL_PATH` | Output JSONL (fields: video_path, num_frames_used, caption) |
| `NUM_WORKERS` | Number of parallel request workers |
Requires SGLANG_SERVER_URL.
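The resume behaviour amounts to a skip-existing set; a minimal sketch (paths are illustrative, not the literal implementation):

```python
import json
import os

OUTPUT_JSONL_PATH = "vision_captions.jsonl"   # illustrative value

# Collect video_paths that already have captions in the output JSONL.
done = set()
if os.path.exists(OUTPUT_JSONL_PATH):
    with open(OUTPUT_JSONL_PATH) as f:
        done = {json.loads(line)["video_path"] for line in f if line.strip()}

# Walk VIDEO_DIRECTORY and keep only unprocessed clips.
clip_paths = [os.path.join(root, name)
              for root, _, names in os.walk("clips/")   # illustrative VIDEO_DIRECTORY
              for name in names if name.endswith(".mp4")]
todo = [p for p in clip_paths if p not in done]
```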
```bash
cd flare/data_construction/caption_generation
python sglang_gen.py
```

Produces per-clip audio captions. Input directory is expected to contain .wav files.
| Constant | Description |
|---|---|
| `VIDEO_DIRECTORY` | Directory of audio files (walked recursively) |
| `OUTPUT_JSONL_PATH` | Output JSONL (fields: video_path, audio_from_video, audio_caption) |
| `NUM_WORKERS` | Number of parallel request workers |
Requires SGLANG_SERVER_URL.
```bash
cd flare/data_construction/caption_generation
python sglang_gen_omni.py
```

LLM-based inspector that flags captions with degenerate repetition or semantic collapse. In our caption quality-assurance pipeline, this check is used together with EVQAScore for visual captions and BRACEScore for audio captions. Captions that fall below the calibrated quality thresholds (EVQAScore < 0.2 for visual captions; BRACEScore < 0.1 for audio captions), or are flagged by the LLM inspector, are manually reviewed and corrected before unified caption generation. The script outputs both the flagged records and a full debug JSONL.
| Constant | Description |
|---|---|
| `INPUT_JSONL_PATH` | Caption JSONL to inspect |
| `OUTPUT_BAD_JSONL_PATH` | JSONL collecting records judged as BAD |
| `OUTPUT_DEBUG_JSONL_PATH` | JSONL with per-record LLM verdict and evidence |
| `CAPTION_FIELD` | Caption field name (caption or audio_caption) |
| `VIDEO_PATH_FIELD` | Path field name (default video_path) |
| `MAX_WORKERS` | Number of parallel request workers |
Requires SGLANG_CHAT_COMPLETIONS_URL.
```bash
cd flare/data_construction/caption_generation
python llm_judge.py
```

Classifies each clip as audio-driven or visual-driven and assigns an audio_importance_score in [0, 10], used to route clips to audio-driven segmentation.
| Constant | Description |
|---|---|
| `INPUT_TXT` | Text file listing video paths to score |
| `OUTPUT_JSONL` | Output JSONL (fields: video_path, llm_output) |
| `MAX_WORKERS` | Number of parallel request workers |
Requires SGLANG_CHAT_COMPLETIONS_URL.
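Downstream routing on the score could look like this (the llm_output layout and the cutoff are assumptions for illustration, not fixed by the script):

```python
import json

audio_driven = []
with open("vora_scores.jsonl") as f:               # illustrative path for OUTPUT_JSONL
    for line in f:
        rec = json.loads(line)
        result = json.loads(rec["llm_output"])      # assumed: llm_output is a JSON string
        if result["audio_importance_score"] >= 5:   # illustrative routing cutoff
            audio_driven.append(rec["video_path"])
```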
```bash
cd flare/data_construction/caption_generation
python sglang_vora.py
```

Fuses the quality-assured vision and audio captions into a single unified caption per clip. Vision-caption and audio-caption JSONLs are joined on video_path.
| Constant | Description |
|---|---|
| `VISION_JSONL` | Vision caption JSONL (field caption) |
| `AUDIO_JSONL` | Audio caption JSONL (field audio_caption) |
| `OUTPUT_JSONL_PATH` | Output JSONL (fields: video_path, unified_caption) |
| `EXCLUDE_JSONL` | Optional JSONL whose video_paths should be skipped |
| `NUM_WORKERS` | Number of parallel request workers |
Requires SGLANG_SERVER_URL.
```bash
cd flare/data_construction/caption_generation
python sglang_unify.py
```

Sorts clip-level unified captions by scene order, partitions them into clusters of MERGE_BATCH_SIZE, smooths the transitions between adjacent clusters using the LLM, and produces one video-level caption per video_id. Resume-friendly: existing valid records in the output file are preserved.
| Constant | Description |
|---|---|
| `INPUT_JSONL` | Clip-level unified caption JSONL |
| `OUTPUT_JSONL` | Output JSONL (fields: video_id, num_clips, video_level_caption) |
| `MERGE_BATCH_SIZE` | Cluster size (default 10) |
| `VIDEO_MAX_WORKERS` | Per-video parallelism |
| `CLIP_MERGE_WORKERS` | Per-cluster parallelism within each video |
Requires SGLANG_CHAT_COMPLETIONS_URL.
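The clustering step is a fixed-size partition of the scene-ordered captions; a minimal sketch (the LLM merging and smoothing calls are elided):

```python
def partition(captions, batch_size):
    """Split scene-ordered clip captions into consecutive clusters."""
    return [captions[i:i + batch_size] for i in range(0, len(captions), batch_size)]

clip_captions = ["caption for Scene-001", "caption for Scene-002"]  # illustrative
clusters = partition(clip_captions, 10)  # 10 = MERGE_BATCH_SIZE
# Each cluster is merged by the LLM, adjacent merges are smoothed,
# and the results are concatenated into one video-level caption.
```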
```bash
cd flare/data_construction/caption_generation
python sglang_longmerge.py
```

Located at:
flare/data_construction/query_generation/
This stage produces user-style queries from the three caption spaces (vision / audio / unified) and filters them in two stages: (i) relevance-vs-non-copy filtering against the source caption; (ii) retrieval-based validation against the full benchmark gallery, with the hard bimodal constraint applied to unified queries.
Generates 3–5 user-style queries per vision or audio caption.
| Constant | Description |
|---|---|
| `INPUT_JSONL_PATH` | Caption JSONL (vision or audio) |
| `OUTPUT_JSONL_PATH` | Output JSONL (fields: sample_id, modality, caption, queries, ...) |
| `DEFAULT_MODALITY` | "vision" or "audio" |
| `CAPTION_ID` | Caption field name in the input (caption / audio_caption) |
| `NUM_WORKERS` | Number of parallel request workers |
Requires SGLANG_SERVER_URL.
```bash
cd flare/data_construction/query_generation
python sglang_query.py
```

Generates 3–5 combined audiovisual queries per unified caption, each exposed as {combined_query, vision_part, audio_part} so that the hard bimodal constraint can be tested downstream.
| Constant | Description |
|---|---|
| `INPUT_JSONL_PATH` | Unified caption JSONL (field unified_caption) |
| `OUTPUT_JSONL_PATH` | Output JSONL (fields: sample_id, unified_caption, queries = list of {combined_query, vision_part, audio_part}) |
| `NUM_WORKERS` | Number of parallel request workers |
Requires SGLANG_SERVER_URL.
```bash
cd flare/data_construction/query_generation
python sglang_query_unified.py
```

Keeps queries with embedding similarity ≥ EMBEDDING_THRESHOLD against the source caption and ROUGE-L F1 ≤ ROUGE_L_THRESHOLD. Embeddings are produced by a SentenceTransformer encoder (BGE-Multilingual-Gemma2 in our setup) with multi-GPU support.
| Constant | Description |
|---|---|
| `INPUT_JSONL` | Input query JSONL with caption, queries, and sample_id or video_path |
| `OUTPUT_JSONL` | Output JSONL with appended query_scores and kept_queries |
| `MODEL_NAME` | Local path or HuggingFace ID of the embedding model |
| `CAPTION_DB_CACHE_DIR` | Pre-built caption embedding cache (caption_embeddings.npy + caption_meta.json) |
| `QUERY_CACHE_DIR` | Cache directory for encoded query embeddings (queries.npy + queries.meta.json) |
| `FORCE_REBUILD_QUERY_CACHE` | Recompute query embeddings even when a valid cache exists |
| `EMBEDDING_THRESHOLD` | Minimum caption-query embedding similarity for keeping a query |
| `ROUGE_L_THRESHOLD` | Maximum ROUGE-L F1 allowed to avoid near-copy queries |
| `BATCH_SIZE` | Embedding batch size |
| `GPU_IDS` / `USE_MULTI_PROCESS` | Multi-GPU encoding settings |
| `NORMALIZE_EMBEDDINGS` | Whether to normalize embeddings; must match the caption DB cache |
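The keep rule combines both thresholds; a minimal sketch of the decision (the exact encoder invocation and ROUGE implementation in the script may differ):

```python
from sentence_transformers import SentenceTransformer, util
from rouge_score import rouge_scorer

model = SentenceTransformer("BAAI/bge-multilingual-gemma2")  # MODEL_NAME in our setup
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def keep(query: str, caption: str, emb_thr: float, rouge_thr: float) -> bool:
    emb = model.encode([query, caption], normalize_embeddings=True)
    sim = float(util.cos_sim(emb[0], emb[1]))                  # relevance: must be high enough
    overlap = scorer.score(caption, query)["rougeL"].fmeasure  # near-copy: must be low enough
    return sim >= emb_thr and overlap <= rouge_thr
```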
```bash
cd flare/data_construction/query_generation
python query_filter_stage1.py
```

For each kept query from stage 1, searches the full caption gallery via a FAISS inner-product index and checks whether the target clip is retrieved at rank 1. Queries hitting the target at rank 1 are kept; all other queries are dropped.
| Constant | Description |
|---|---|
| `QUERY_JSONL_PATH` | Stage-1 output JSONL |
| `CAPTION_DB_JSONL_PATH` | Full caption gallery |
| `CAPTION_KEY` | Caption field in the gallery (caption or audio_caption) |
| `OUTPUT_JSONL_PATH` | Per-item diagnostics + rule_kept_queries / rule_dropped_queries |
| `OUTPUT_SUMMARY_PATH` | Aggregate summary JSON |
| `MODEL_NAME` | Embedding model (must match stage 1 and the caption cache) |
| `CAPTION_CACHE_DIR` | Caption embedding cache + caption_faiss_ip.index |
| `QUERY_CACHE_DIR` | Cache directory for query embeddings |
| `TOP_K` | Top-k depth for retrieval |
| `QUERY_BATCH_SIZE` | Embedding batch size |
| `GPU_IDS` / `USE_MULTI_PROCESS` | Multi-GPU encoding |
| `NORMALIZE_EMBEDDINGS` | Must match the caption cache |
| `SAVE_FULL_TOPK_CAPTION` | Also save top-k captions in the diagnostics |
| `FORCE_REBUILD_QUERY_CACHE` | Recompute query embeddings even if a cache exists |
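A minimal version of the rank-1 check with FAISS (assuming L2-normalized embeddings, so inner product equals cosine similarity):

```python
import faiss
import numpy as np

index = faiss.read_index("caption_faiss_ip.index")  # inner-product index over the gallery

def hits_at_rank1(query_emb: np.ndarray, target_idx: int) -> bool:
    """query_emb: shape (1, d), L2-normalized. Keep the query only if the
    target caption is the top-1 gallery entry."""
    _, ids = index.search(query_emb.astype(np.float32), 1)
    return int(ids[0, 0]) == target_idx
```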
```bash
cd flare/data_construction/query_generation
python query_filter_stage2.py
```

Keeps cross-modal queries with embedding similarity ≥ EMBEDDING_THRESHOLD against the source unified caption and ROUGE-L F1 ≤ ROUGE_L_THRESHOLD. Embeddings are produced by a SentenceTransformer encoder (BGE-Multilingual-Gemma2 in our setup) with multi-GPU support.
| Constant | Description |
|---|---|
| `INPUT_JSONL` | Cross-modal query JSONL with queries |
| `OUTPUT_JSONL` | Output JSONL with query_scores and kept_queries |
| `MODEL_NAME` | Local path or HuggingFace ID of the embedding model |
| `CAPTION_DB_CACHE_DIR` | Pre-built unified caption embedding cache (caption_embeddings.npy + caption_meta.json) |
| `QUERY_CACHE_DIR` | Cache directory for combined_query embeddings |
| `FORCE_REBUILD_QUERY_CACHE` | Recompute query embeddings even if a cache exists |
| `EMBEDDING_THRESHOLD` | Minimum embedding similarity for keeping a query |
| `ROUGE_L_THRESHOLD` | Maximum ROUGE-L F1 allowed to avoid near-copy queries |
| `BATCH_SIZE` | Embedding batch size |
| `GPU_IDS` / `USE_MULTI_PROCESS` / `NORMALIZE_EMBEDDINGS` | Multi-GPU encoding and normalization settings |
```bash
cd flare/data_construction/query_generation
python query_unified_filter_stage1.py
```

For each cross-modal query, runs three independent searches: (i) vision_part against the vision gallery, (ii) audio_part against the audio gallery, (iii) combined_query against the unified gallery. Stage 2 uses kept_queries from stage 1 when present and falls back to queries otherwise. A query is kept as hard bimodal only if the vision-only and audio-only searches both fail (target rank > VISION_FAIL_TOPK / AUDIO_FAIL_TOPK) and the joint search succeeds (target rank ≤ JOINT_PASS_TOPK); see the sketch after the table below.
| Constant | Description |
|---|---|
| `QUERY_JSONL_PATH` | Stage-1 unified output JSONL |
| `VISION_DB_JSONL_PATH` / `AUDIO_DB_JSONL_PATH` / `UNIFIED_DB_JSONL_PATH` | Three modality-specific caption galleries |
| `VISION_CACHE_DIR` / `AUDIO_CACHE_DIR` / `UNIFIED_CACHE_DIR` | Corresponding caption embedding caches and FAISS indexes |
| `QUERY_CACHE_DIR` | Cache for query embeddings |
| `OUTPUT_JSONL_PATH` | Per-item results with hard_bimodal_queries, dropped_queries, diagnostics |
| `OUTPUT_SUMMARY_PATH` | Aggregate summary JSON |
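The decision rule referenced above reduces to three rank comparisons; a minimal sketch (the rank arguments are the target's positions in the three searches):

```python
def is_hard_bimodal(vision_rank, audio_rank, joint_rank,
                    vision_fail_topk, audio_fail_topk, joint_pass_topk):
    """Keep a query only if each single modality fails on its own
    while the combined audiovisual query succeeds."""
    vision_fails = vision_rank > vision_fail_topk
    audio_fails = audio_rank > audio_fail_topk
    joint_passes = joint_rank <= joint_pass_topk
    return vision_fails and audio_fails and joint_passes
```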
```bash
cd flare/data_construction/query_generation
python query_unified_filter_stage2.py
```

The evaluation pipeline is located in:
flare/evaluation/
It supports evaluation for:
- Text-to-Clip Retrieval
- Clip-to-Text Retrieval
- Text-to-Video Retrieval
- Video-to-Text Retrieval
under two query regimes (caption-based and query-based) and three modality scopes (vision, audio, unified audiovisual).
| Modality scope | MODEL value |
|---|---|
| Vision–Text | clip, metaclip, siglip2, videoclip_xl_v2, qwen3_vl_embedding |
| Audio–Text | ms_clap_2022, ms_clap_2023, laion_clap, m2d_clap, dasheng_glap, aurola |
| Audiovisual–Text | imagebind_av, languagebind_av, perception_av, wave |
| JSONL | Key fields |
|---|---|
| `vision_clip.jsonl` | video_path, caption |
| `audio_clip.jsonl` | video_path, audio_caption |
| `unified_clip.jsonl` | video_path, unified_caption |
| `vision_query.jsonl` | video_path, caption |
| `audio_query.jsonl` | video_path, audio_caption |
| `unified_query.jsonl` | video_path, unified_caption |
| `video_caption.jsonl` | video_id, video_level_caption |
video_path is expected to follow <video_id>/<clip_id>.mp4. Full-length media are located through FULL_VIDEO_DIR / FULL_AUDIO_DIR and referenced by video_id.
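Given that convention, recovering video_id from a clip path is straightforward; a small sketch:

```python
from pathlib import Path

def video_id_of(video_path: str) -> str:
    """'<video_id>/<clip_id>.mp4' -> '<video_id>'."""
    return Path(video_path).parts[0]

assert video_id_of("abc123/Scene-001.mp4") == "abc123"
```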
```
inspect_data → encode_text_chunk → merge_text ─┐
                                               ├→ calc_metrics
encode_gallery → merge_gallery ────────────────┘
```
Each stage of complete_retrieval_pipeline.py is triggered by its own flag (--inspect_data, --encode_text_chunk, --merge_text, --encode_gallery, --merge_gallery, --calc_metrics). Chunked stages accept --num_chunks / --chunk_idx and can be parallelised across GPUs. eval_run.sh glues them together.
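For example, the text-encoding stage could be spread across 8 GPUs along these lines (an illustrative invocation; eval_run.sh automates this, and any config arguments it passes are omitted here):

```bash
for i in $(seq 0 7); do
    CUDA_VISIBLE_DEVICES=$i python complete_retrieval_pipeline.py \
        --encode_text_chunk --num_chunks 8 --chunk_idx $i &
done
wait
python complete_retrieval_pipeline.py --merge_text
```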
Each baseline has a config stub under flare/evaluation/<model>_config.sh. Fill in the "your input path for ..." placeholders with your local paths, then run:
```bash
cd flare/evaluation
bash eval_run.sh clip_config.sh
```

| Variable | Description |
|---|---|
| `MODEL` | One of the baselines listed above |
| `QUERY_MODE` | vision / audio / unified |
| `MEDIA_MODE` | Only for audiovisual baselines: vision / audio / unified |
| `RUN_NAME` | Used to derive the output directory `${EXP_DIR}/${RUN_NAME}/` |
| `EXP_DIR` | Root directory for cached features, chunks, and metrics.json |
| `GPU_COUNT` | Number of GPUs used by eval_run.sh (default 8) |
| `VISION_CLIP_JSONL` / `AUDIO_CLIP_JSONL` / `UNIFIED_CLIP_JSONL` | Clip-level caption JSONLs |
| `VISION_QUERY_JSONL` / `AUDIO_QUERY_JSONL` / `UNIFIED_QUERY_JSONL` | Clip-level query JSONLs; the matching full `*_CLIP_JSONL` must also be provided |
| `VIDEO_CAPTION_JSONL` | Video-level caption JSONL |
| `FULL_VIDEO_DIR` | Directory and extension of full-length videos |
| `FULL_AUDIO_DIR` | Directory and extension of full-length audio |
| `TOPK` | Comma-separated Recall@K list (default 1,5,10) |
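A filled-in config stub might look like the following (all paths and values are placeholders for illustration):

```bash
# clip_config.sh — illustrative values only
MODEL=clip
QUERY_MODE=vision
RUN_NAME=clip_vision_run1
EXP_DIR=/data/flare_exp
GPU_COUNT=8
VISION_CLIP_JSONL=/data/flare/vision_clip.jsonl
VISION_QUERY_JSONL=/data/flare/vision_query.jsonl
VIDEO_CAPTION_JSONL=/data/flare/video_caption.jsonl
FULL_VIDEO_DIR=/data/flare/full_videos
TOPK=1,5,10
```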
The code in this repository is released under the MIT License. The FLARE dataset is released under CC BY 4.0.