Official repository for FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries.
FLARE is a benchmark designed to evaluate long-video audiovisual retrieval with realistic user-style queries. The benchmark contains 399 long videos (225.4 h) and 87,697 fine-grained clips, each annotated with vision, audio, and unified captions, together with 274,933 user-style queries (vision-only, audio-only, and cross-modal with a hard bimodal constraint).
The repository provides:
- Clip segmentation pipeline
- Multimodal caption generation framework
- User-style query generation and two-stage filtering
- Unified evaluation harness covering 15 representative retrievers
The FLARE dataset is constructed using a three-stage pipeline:
- Clip Segmentation: segment long videos into fine-grained clips based on visual scene changes, ASR-driven semantics, and spectral novelty of the audio track.
- Caption Generation: generate clip-level vision, audio, and unified captions, followed by hierarchical merging into video-level captions.
- Query Generation: generate user-style queries for each modality and filter them by relevance / non-copy rules and the hard bimodal constraint.
The implementation of these steps is provided in the flare/data_construction/ directory. Evaluation of 15 baselines is provided in flare/evaluation/.
```
FLARE/
│
├── flare/
│   ├── data_construction/
│   │   ├── segmentation/
│   │   │   ├── pyscene_seg.py
│   │   │   ├── asr.py
│   │   │   ├── semantic_seg.py
│   │   │   └── spectral_seg.py
│   │   ├── caption_generation/
│   │   │   ├── sglang_gen.py
│   │   │   ├── sglang_gen_omni.py
│   │   │   ├── llm_judge.py
│   │   │   ├── sglang_vora.py
│   │   │   ├── sglang_unify.py
│   │   │   └── sglang_longmerge.py
│   │   └── query_generation/
│   │       ├── sglang_query.py
│   │       ├── sglang_query_unified.py
│   │       ├── query_filter_stage1.py
│   │       ├── query_filter_stage2.py
│   │       ├── query_unified_filter_stage1.py
│   │       └── query_unified_filter_stage2.py
│   └── evaluation/
│       ├── complete_retrieval_pipeline.py
│       ├── retrieval_adapters.py
│       ├── eval_run.sh
│       └── <model>_config.sh
│
└── README.md
```
The dataset generation pipeline consists of three stages. Please execute them in the following order.
Clip Segmentation → Caption Generation → Query Generation
For scripts that call Qwen models through SGLang, start an OpenAI-compatible SGLang server first, for example:
```bash
python -m sglang.launch_server \
    --model-path /path/to/Qwen3-VL-235B-A22B-Instruct \
    --host 0.0.0.0 \
    --port 30000 \
    --tp 8 \
    --trust-remote-code

export SGLANG_SERVER_URL="http://127.0.0.1:30000/v1"
export SGLANG_CHAT_COMPLETIONS_URL="http://127.0.0.1:30000/v1/chat/completions"
```

Replace --model-path with the model required by the specific script. Our experiments were conducted on a machine with 8 H20 GPUs, each with 96GB memory.
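For reference, a minimal sketch of how a script could talk to the server through its OpenAI-compatible API (the model name and prompt below are placeholders, not the exact ones the scripts use):

```python
import os
from openai import OpenAI

# Point an OpenAI-compatible client at the local SGLang server.
client = OpenAI(base_url=os.environ["SGLANG_SERVER_URL"], api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen3-VL-235B-A22B-Instruct",  # placeholder; pass the served model's name
    messages=[{"role": "user", "content": "Describe the main events in this clip."}],
    temperature=0.2,
)
print(resp.choices[0].message.content)
```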
All scripts configure their I/O through top-level constants at the head of each file. Edit these placeholders ("your input path for ..." / "your output path for ...") before running.
Located at:
flare/data_construction/segmentation/
Run the four scripts below in order. The first performs visual segmentation; the remaining three handle clips that are still too long after visual segmentation by transcribing them, splitting the transcripts into topically coherent spans, and adding spectral-novelty boundaries.
PySceneDetect ContentDetector segmentation. Thresholds and workers are set at the bottom of the script:
| Constant | Description |
|---|---|
| `SOURCE` | Directory containing original long videos |
| `DEST` | Output directory for segmented clips (`<DEST>/<video_id>/Scene-*.mp4`) |
| `max_parallel_workers` | Number of parallel workers |
```bash
cd flare/data_construction/segmentation
python pyscene_seg.py
```

Transcribes clips using Qwen3-ASR-1.7B with a forced aligner, and saves per-file transcript.txt + timestamps.json.
| Constant | Description |
|---|---|
| `INPUT_LIST_FILE` | Text file listing audio paths (one per line) |
| `OUTPUT_ROOT` | Output root; each audio writes to `<OUTPUT_ROOT>/<file_stem>/` |
| `ASR_MODEL_PATH` | Path to the Qwen3-ASR checkpoint |
| `FORCED_ALIGNER_PATH` | Path to the forced aligner checkpoint |
| `DEVICE` | Torch device (default `cuda`) |
Supports LOCAL_RANK / WORLD_SIZE environment variables for multi-GPU sharding.
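The sharding pattern is the usual rank-strided split; a minimal sketch (variable and file names are illustrative, not necessarily those in asr.py):

```python
import os

# Rank-strided sharding: worker i handles every WORLD_SIZE-th file.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

with open("audio_list.txt") as f:              # illustrative INPUT_LIST_FILE
    audio_paths = [line.strip() for line in f if line.strip()]

my_shard = audio_paths[local_rank::world_size]
print(f"rank {local_rank}/{world_size}: {len(my_shard)} files")
```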
```bash
cd flare/data_construction/segmentation

# single GPU
python asr.py

# multi-GPU sharding (8 GPUs)
for i in $(seq 0 7); do
    CUDA_VISIBLE_DEVICES=$i LOCAL_RANK=$i WORLD_SIZE=8 python asr.py &
done
wait
```

Uses Qwen3-235B-A22B-Instruct to segment the ASR transcript into topically coherent paragraphs, then maps them back to temporal spans using the word-level timestamps produced by asr.py. Segments shorter than MIN_DURATION seconds are merged.
| Constant | Description |
|---|---|
| `ASR_OUTPUT_ROOT` | Root directory written by asr.py |
| `AUDIO_LIST_FILE` | Audio list file (must match asr.py) |
| `MAX_WORKERS` | Number of parallel request workers |
Segmented wavs are written as <ASR_OUTPUT_ROOT>/<file_stem>/segment_XXX.wav.
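Conceptually, the paragraph-to-span mapping walks the word-level timestamps in order; a simplified sketch, assuming the LLM returns the transcript words verbatim (the actual alignment in semantic_seg.py may be more robust):

```python
def paragraph_spans(paragraphs, words):
    """paragraphs: list of paragraph strings from the LLM.
    words: list of {"word": str, "start": float, "end": float} from asr.py.
    Assigns each paragraph the time span of the words it consumes, in order."""
    spans, idx = [], 0
    for para in paragraphs:
        n = len(para.split())                 # words consumed by this paragraph
        chunk = words[idx:idx + n]
        spans.append((chunk[0]["start"], chunk[-1]["end"]))
        idx += n
    return spans
```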
```bash
cd flare/data_construction/segmentation
python semantic_seg.py
```

Spectral-flux + KL-divergence novelty detection on the mel-spectrogram (with a robust MAD-based z-score), placing boundaries at non-speech acoustic events (music shifts, applause, ambient changes).
| Constant | Description |
|---|---|
| `TXT_FILE` | Text file listing audio paths to re-segment |
| `NUM_WORKERS` | Number of parallel workers |
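For intuition, a spectral-flux novelty curve with a MAD-based z-score can be computed roughly as follows (a sketch using librosa; the KL-divergence term is omitted and the values shown are illustrative):

```python
import numpy as np
import librosa

y, sr = librosa.load("segment.wav", sr=16000)        # illustrative input
logS = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr))

# Spectral flux: positive frame-to-frame change, summed over mel bands.
flux = np.maximum(np.diff(logS, axis=1), 0).sum(axis=0)

# Robust z-score via the median absolute deviation (MAD).
med = np.median(flux)
mad = np.median(np.abs(flux - med)) + 1e-9
z = (flux - med) / (1.4826 * mad)

# Candidate boundaries where novelty spikes (threshold is illustrative).
boundary_frames = np.where(z > 3.0)[0]
```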
```bash
cd flare/data_construction/segmentation
python spectral_seg.py
```

Located at:
flare/data_construction/caption_generation/
Run the six scripts below in order: generate vision and audio captions in parallel, run the LLM-based quality inspector, re-score audio-heavy clips, fuse the two modalities into a unified caption, and finally merge clip-level unified captions into video-level captions.
Produces per-clip visual captions. Supports resume: existing video_paths in the output JSONL are skipped.
| Constant | Description |
|---|---|
| `VIDEO_DIRECTORY` | Directory of segmented clips (walked recursively) |
| `OUTPUT_JSONL_PATH` | Output JSONL (fields: video_path, num_frames_used, caption) |
| `NUM_WORKERS` | Number of parallel request workers |
Requires SGLANG_SERVER_URL.
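The resume behaviour amounts to a skip-existing set; a minimal sketch (paths are illustrative, not the literal implementation):

```python
import json
import os

OUTPUT_JSONL_PATH = "vision_captions.jsonl"   # illustrative value

# Collect video_paths that already have captions in the output JSONL.
done = set()
if os.path.exists(OUTPUT_JSONL_PATH):
    with open(OUTPUT_JSONL_PATH) as f:
        done = {json.loads(line)["video_path"] for line in f if line.strip()}

# Walk VIDEO_DIRECTORY and keep only unprocessed clips.
clip_paths = [os.path.join(root, name)
              for root, _, names in os.walk("clips/")   # illustrative VIDEO_DIRECTORY
              for name in names if name.endswith(".mp4")]
todo = [p for p in clip_paths if p not in done]
```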
```bash
cd flare/data_construction/caption_generation
python sglang_gen.py
```

Produces per-clip audio captions. Input directory is expected to contain .wav files.
| Constant | Description |
|---|---|
| `VIDEO_DIRECTORY` | Directory of audio files (walked recursively) |
| `OUTPUT_JSONL_PATH` | Output JSONL (fields: video_path, audio_from_video, audio_caption) |
| `NUM_WORKERS` | Number of parallel request workers |
Requires SGLANG_SERVER_URL.
```bash
cd flare/data_construction/caption_generation
python sglang_gen_omni.py
```

LLM-based inspector that flags captions with degenerate repetition or semantic collapse. In our caption quality-assurance pipeline, this check is used together with EVQAScore for visual captions and BRACEScore for audio captions. Captions that fall below the calibrated quality thresholds (EVQAScore < 0.2 for visual captions; BRACEScore < 0.1 for audio captions), or are flagged by the LLM inspector, are manually reviewed and corrected before unified caption generation. The script outputs both the flagged records and a full debug JSONL.
| Constant | Description |
|---|---|
| `INPUT_JSONL_PATH` | Caption JSONL to inspect |
| `OUTPUT_BAD_JSONL_PATH` | JSONL collecting records judged as BAD |
| `OUTPUT_DEBUG_JSONL_PATH` | JSONL with per-record LLM verdict and evidence |
| `CAPTION_FIELD` | Caption field name (caption or audio_caption) |
| `VIDEO_PATH_FIELD` | Path field name (default video_path) |
| `MAX_WORKERS` | Number of parallel request workers |
Requires SGLANG_CHAT_COMPLETIONS_URL.
```bash
cd flare/data_construction/caption_generation
python llm_judge.py
```

Classifies each clip as audio-driven or visual-driven and assigns an audio_importance_score in [0, 10], used to route clips to audio-driven segmentation.
| Constant | Description |
|---|---|
| `INPUT_TXT` | Text file listing video paths to score |
| `OUTPUT_JSONL` | Output JSONL (fields: video_path, llm_output) |
| `MAX_WORKERS` | Number of parallel request workers |
Requires SGLANG_CHAT_COMPLETIONS_URL.
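Downstream routing on the score could look like this (the llm_output layout and the cutoff are assumptions for illustration, not fixed by the script):

```python
import json

audio_driven = []
with open("vora_scores.jsonl") as f:               # illustrative path for OUTPUT_JSONL
    for line in f:
        rec = json.loads(line)
        result = json.loads(rec["llm_output"])      # assumed: llm_output is a JSON string
        if result["audio_importance_score"] >= 5:   # illustrative routing cutoff
            audio_driven.append(rec["video_path"])
```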
```bash
cd flare/data_construction/caption_generation
python sglang_vora.py
```

Fuses the quality-assured vision and audio captions into a single unified caption per clip. Vision-caption and audio-caption JSONLs are joined on video_path.
| Constant | Description |
|---|---|
| `VISION_JSONL` | Vision caption JSONL (field caption) |
| `AUDIO_JSONL` | Audio caption JSONL (field audio_caption) |
| `OUTPUT_JSONL_PATH` | Output JSONL (fields: video_path, unified_caption) |
| `EXCLUDE_JSONL` | Optional JSONL whose video_paths should be skipped |
| `NUM_WORKERS` | Number of parallel request workers |
Requires SGLANG_SERVER_URL.
```bash
cd flare/data_construction/caption_generation
python sglang_unify.py
```

Sorts clip-level unified captions by scene order, partitions them into clusters of MERGE_BATCH_SIZE, smooths the transitions between adjacent clusters using the LLM, and produces one video-level caption per video_id. Resume-friendly: existing valid records in the output file are preserved.
| Constant | Description |
|---|---|
| `INPUT_JSONL` | Clip-level unified caption JSONL |
| `OUTPUT_JSONL` | Output JSONL (fields: video_id, num_clips, video_level_caption) |
| `MERGE_BATCH_SIZE` | Cluster size (default 10) |
| `VIDEO_MAX_WORKERS` | Per-video parallelism |
| `CLIP_MERGE_WORKERS` | Per-cluster parallelism within each video |
Requires SGLANG_CHAT_COMPLETIONS_URL.
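The clustering step is a fixed-size partition of the scene-ordered captions; a minimal sketch (the LLM merging and smoothing calls are elided):

```python
def partition(captions, batch_size):
    """Split scene-ordered clip captions into consecutive clusters."""
    return [captions[i:i + batch_size] for i in range(0, len(captions), batch_size)]

clip_captions = ["caption for Scene-001", "caption for Scene-002"]  # illustrative
clusters = partition(clip_captions, 10)  # 10 = MERGE_BATCH_SIZE
# Each cluster is merged by the LLM, adjacent merges are smoothed,
# and the results are concatenated into one video-level caption.
```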
```bash
cd flare/data_construction/caption_generation
python sglang_longmerge.py
```

Located at:
flare/data_construction/query_generation/
This stage produces user-style queries from the three caption spaces (vision / audio / unified) and filters them in two stages: (i) relevance-vs-non-copy filtering against the source caption; (ii) retrieval-based validation against the full benchmark gallery, with the hard bimodal constraint applied to unified queries.
Generates 3–5 user-style queries per vision or audio caption.
| Constant | Description |
|---|---|
| `INPUT_JSONL_PATH` | Caption JSONL (vision or audio) |
| `OUTPUT_JSONL_PATH` | Output JSONL (fields: sample_id, modality, caption, queries, ...) |
| `DEFAULT_MODALITY` | "vision" or "audio" |
| `CAPTION_ID` | Caption field name in the input (caption / audio_caption) |
| `NUM_WORKERS` | Number of parallel request workers |
Requires SGLANG_SERVER_URL.
```bash
cd flare/data_construction/query_generation
python sglang_query.py
```

Generates 3–5 combined audiovisual queries per unified caption, each exposed as {combined_query, vision_part, audio_part} so that the hard bimodal constraint can be tested downstream.
| Constant | Description |
|---|---|
| `INPUT_JSONL_PATH` | Unified caption JSONL (field unified_caption) |
| `OUTPUT_JSONL_PATH` | Output JSONL (fields: sample_id, unified_caption, queries = list of {combined_query, vision_part, audio_part}) |
| `NUM_WORKERS` | Number of parallel request workers |
Requires SGLANG_SERVER_URL.
```bash
cd flare/data_construction/query_generation
python sglang_query_unified.py
```

Keeps queries with embedding similarity ≥ EMBEDDING_THRESHOLD against the source caption and ROUGE-L F1 ≤ ROUGE_L_THRESHOLD. Embeddings are produced by a SentenceTransformer encoder (BGE-Multilingual-Gemma2 in our setup) with multi-GPU support.
| Constant | Description |
|---|---|
| `INPUT_JSONL` | Input query JSONL with caption, queries, and sample_id or video_path |
| `OUTPUT_JSONL` | Output JSONL with appended query_scores and kept_queries |
| `MODEL_NAME` | Local path or HuggingFace ID of the embedding model |
| `CAPTION_DB_CACHE_DIR` | Pre-built caption embedding cache (caption_embeddings.npy + caption_meta.json) |
| `QUERY_CACHE_DIR` | Cache directory for encoded query embeddings (queries.npy + queries.meta.json) |
| `FORCE_REBUILD_QUERY_CACHE` | Recompute query embeddings even when a valid cache exists |
| `EMBEDDING_THRESHOLD` | Minimum caption-query embedding similarity for keeping a query |
| `ROUGE_L_THRESHOLD` | Maximum ROUGE-L F1 allowed to avoid near-copy queries |
| `BATCH_SIZE` | Embedding batch size |
| `GPU_IDS` / `USE_MULTI_PROCESS` | Multi-GPU encoding settings |
| `NORMALIZE_EMBEDDINGS` | Whether to normalize embeddings; must match the caption DB cache |
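The keep rule combines both thresholds; a minimal sketch of the decision (the exact encoder invocation and ROUGE implementation in the script may differ):

```python
from sentence_transformers import SentenceTransformer, util
from rouge_score import rouge_scorer

model = SentenceTransformer("BAAI/bge-multilingual-gemma2")  # MODEL_NAME in our setup
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def keep(query: str, caption: str, emb_thr: float, rouge_thr: float) -> bool:
    emb = model.encode([query, caption], normalize_embeddings=True)
    sim = float(util.cos_sim(emb[0], emb[1]))                  # relevance: must be high enough
    overlap = scorer.score(caption, query)["rougeL"].fmeasure  # near-copy: must be low enough
    return sim >= emb_thr and overlap <= rouge_thr
```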
```bash
cd flare/data_construction/query_generation
python query_filter_stage1.py
```

For each kept query from stage 1, searches the full caption gallery via a FAISS inner-product index and checks whether the target clip is retrieved at rank 1. Queries hitting the target at rank 1 are kept; all other queries are dropped.
| Constant | Description |
|---|---|
| `QUERY_JSONL_PATH` | Stage-1 output JSONL |
| `CAPTION_DB_JSONL_PATH` | Full caption gallery |
| `CAPTION_KEY` | Caption field in the gallery (caption or audio_caption) |
| `OUTPUT_JSONL_PATH` | Per-item diagnostics + rule_kept_queries / rule_dropped_queries |
| `OUTPUT_SUMMARY_PATH` | Aggregate summary JSON |
| `MODEL_NAME` | Embedding model (must match stage 1 and the caption cache) |
| `CAPTION_CACHE_DIR` | Caption embedding cache + caption_faiss_ip.index |
| `QUERY_CACHE_DIR` | Cache directory for query embeddings |
| `TOP_K` | Top-k depth for retrieval |
| `QUERY_BATCH_SIZE` | Embedding batch size |
| `GPU_IDS` / `USE_MULTI_PROCESS` | Multi-GPU encoding |
| `NORMALIZE_EMBEDDINGS` | Must match the caption cache |
| `SAVE_FULL_TOPK_CAPTION` | Also save top-k captions in the diagnostics |
| `FORCE_REBUILD_QUERY_CACHE` | Recompute query embeddings even if a cache exists |
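A minimal version of the rank-1 check with FAISS (assuming L2-normalized embeddings, so inner product equals cosine similarity):

```python
import faiss
import numpy as np

index = faiss.read_index("caption_faiss_ip.index")  # inner-product index over the gallery

def hits_at_rank1(query_emb: np.ndarray, target_idx: int) -> bool:
    """query_emb: shape (1, d), L2-normalized. Keep the query only if the
    target caption is the top-1 gallery entry."""
    _, ids = index.search(query_emb.astype(np.float32), 1)
    return int(ids[0, 0]) == target_idx
```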
```bash
cd flare/data_construction/query_generation
python query_filter_stage2.py
```

Keeps cross-modal queries with embedding similarity ≥ EMBEDDING_THRESHOLD against the source unified caption and ROUGE-L F1 ≤ ROUGE_L_THRESHOLD. Embeddings are produced by a SentenceTransformer encoder (BGE-Multilingual-Gemma2 in our setup) with multi-GPU support.
| Constant | Description |
|---|---|
| `INPUT_JSONL` | Cross-modal query JSONL with queries |
| `OUTPUT_JSONL` | Output JSONL with query_scores and kept_queries |
| `MODEL_NAME` | Local path or HuggingFace ID of the embedding model |
| `CAPTION_DB_CACHE_DIR` | Pre-built unified caption embedding cache (caption_embeddings.npy + caption_meta.json) |
| `QUERY_CACHE_DIR` | Cache directory for combined_query embeddings |
| `FORCE_REBUILD_QUERY_CACHE` | Recompute query embeddings even if a cache exists |
| `EMBEDDING_THRESHOLD` | Minimum embedding similarity for keeping a query |
| `ROUGE_L_THRESHOLD` | Maximum ROUGE-L F1 allowed to avoid near-copy queries |
| `BATCH_SIZE` | Embedding batch size |
| `GPU_IDS` / `USE_MULTI_PROCESS` / `NORMALIZE_EMBEDDINGS` | Multi-GPU encoding and normalization settings |
```bash
cd flare/data_construction/query_generation
python query_unified_filter_stage1.py
```

For each cross-modal query, runs three independent searches: (i) vision_part against the vision gallery, (ii) audio_part against the audio gallery, (iii) combined_query against the unified gallery. Stage 2 uses kept_queries from stage 1 when present and falls back to queries otherwise. A query is kept as hard bimodal only if the vision-only and audio-only searches both fail (target rank > VISION_FAIL_TOPK / AUDIO_FAIL_TOPK) and the joint search succeeds (target rank ≤ JOINT_PASS_TOPK); see the sketch after the table below.
| Constant | Description |
|---|---|
| `QUERY_JSONL_PATH` | Stage-1 unified output JSONL |
| `VISION_DB_JSONL_PATH` / `AUDIO_DB_JSONL_PATH` / `UNIFIED_DB_JSONL_PATH` | Three modality-specific caption galleries |
| `VISION_CACHE_DIR` / `AUDIO_CACHE_DIR` / `UNIFIED_CACHE_DIR` | Corresponding caption embedding caches and FAISS indexes |
| `QUERY_CACHE_DIR` | Cache for query embeddings |
| `OUTPUT_JSONL_PATH` | Per-item results with hard_bimodal_queries, dropped_queries, diagnostics |
| `OUTPUT_SUMMARY_PATH` | Aggregate summary JSON |
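The decision rule referenced above reduces to three rank comparisons; a minimal sketch (the rank arguments are the target's positions in the three searches):

```python
def is_hard_bimodal(vision_rank, audio_rank, joint_rank,
                    vision_fail_topk, audio_fail_topk, joint_pass_topk):
    """Keep a query only if each single modality fails on its own
    while the combined audiovisual query succeeds."""
    vision_fails = vision_rank > vision_fail_topk
    audio_fails = audio_rank > audio_fail_topk
    joint_passes = joint_rank <= joint_pass_topk
    return vision_fails and audio_fails and joint_passes
```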
```bash
cd flare/data_construction/query_generation
python query_unified_filter_stage2.py
```

The evaluation pipeline is located in:
flare/evaluation/
It supports evaluation for:
- Text-to-Clip Retrieval
- Clip-to-Text Retrieval
- Text-to-Video Retrieval
- Video-to-Text Retrieval
under two query regimes (caption-based and query-based) and three modality scopes (vision, audio, unified audiovisual).
| Modality scope | MODEL value |
|---|---|
| Vision–Text | clip, metaclip, siglip2, videoclip_xl_v2, qwen3_vl_embedding |
| Audio–Text | ms_clap_2022, ms_clap_2023, laion_clap, m2d_clap, dasheng_glap, aurola |
| Audiovisual–Text | imagebind_av, languagebind_av, perception_av, wave |
| JSONL | Key fields |
|---|---|
| `vision_clip.jsonl` | video_path, caption |
| `audio_clip.jsonl` | video_path, audio_caption |
| `unified_clip.jsonl` | video_path, unified_caption |
| `vision_query.jsonl` | video_path, caption |
| `audio_query.jsonl` | video_path, audio_caption |
| `unified_query.jsonl` | video_path, unified_caption |
| `video_caption.jsonl` | video_id, video_level_caption |
video_path is expected to follow <video_id>/<clip_id>.mp4. Full-length media are located through FULL_VIDEO_DIR / FULL_AUDIO_DIR and referenced by video_id.
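Given that convention, recovering video_id from a clip path is straightforward; a small sketch:

```python
from pathlib import Path

def video_id_of(video_path: str) -> str:
    """'<video_id>/<clip_id>.mp4' -> '<video_id>'."""
    return Path(video_path).parts[0]

assert video_id_of("abc123/Scene-001.mp4") == "abc123"
```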
```
inspect_data → encode_text_chunk → merge_text ─┐
                                               ├→ calc_metrics
encode_gallery → merge_gallery ────────────────┘
```
Each stage of complete_retrieval_pipeline.py is triggered by its own flag (--inspect_data, --encode_text_chunk, --merge_text, --encode_gallery, --merge_gallery, --calc_metrics). Chunked stages accept --num_chunks / --chunk_idx and can be parallelised across GPUs. eval_run.sh glues them together.
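For example, the text-encoding stage could be spread across 8 GPUs along these lines (an illustrative invocation; eval_run.sh automates this, and any config arguments it passes are omitted here):

```bash
for i in $(seq 0 7); do
    CUDA_VISIBLE_DEVICES=$i python complete_retrieval_pipeline.py \
        --encode_text_chunk --num_chunks 8 --chunk_idx $i &
done
wait
python complete_retrieval_pipeline.py --merge_text
```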
Each baseline has a config stub under flare/evaluation/<model>_config.sh. Fill in the "your input path for ..." placeholders with your local paths, then run:
```bash
cd flare/evaluation
bash eval_run.sh clip_config.sh
```

| Variable | Description |
|---|---|
| `MODEL` | One of the baselines listed above |
| `QUERY_MODE` | vision / audio / unified |
| `MEDIA_MODE` | Only for audiovisual baselines: vision / audio / unified |
| `RUN_NAME` | Used to derive the output directory `${EXP_DIR}/${RUN_NAME}/` |
| `EXP_DIR` | Root directory for cached features, chunks, and metrics.json |
| `GPU_COUNT` | Number of GPUs used by eval_run.sh (default 8) |
| `VISION_CLIP_JSONL` / `AUDIO_CLIP_JSONL` / `UNIFIED_CLIP_JSONL` | Clip-level caption JSONLs |
| `VISION_QUERY_JSONL` / `AUDIO_QUERY_JSONL` / `UNIFIED_QUERY_JSONL` | Clip-level query JSONLs; the matching full `*_CLIP_JSONL` must also be provided |
| `VIDEO_CAPTION_JSONL` | Video-level caption JSONL |
| `FULL_VIDEO_DIR` | Directory and extension of full-length videos |
| `FULL_AUDIO_DIR` | Directory and extension of full-length audio |
| `TOPK` | Comma-separated Recall@K list (default 1,5,10) |
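A filled-in config stub might look like the following (all paths and values are placeholders for illustration):

```bash
# clip_config.sh — illustrative values only
MODEL=clip
QUERY_MODE=vision
RUN_NAME=clip_vision_run1
EXP_DIR=/data/flare_exp
GPU_COUNT=8
VISION_CLIP_JSONL=/data/flare/vision_clip.jsonl
VISION_QUERY_JSONL=/data/flare/vision_query.jsonl
VIDEO_CAPTION_JSONL=/data/flare/video_caption.jsonl
FULL_VIDEO_DIR=/data/flare/full_videos
TOPK=1,5,10
```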
The code in this repository is released under the MIT License. The FLARE dataset is released under CC BY 4.0.