FLARE Benchmark

Official repository for FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries.

FLARE is a benchmark designed to evaluate long-video audiovisual retrieval with realistic user-style queries. The benchmark contains 399 long videos (225.4 h) and 87,697 fine-grained clips, each annotated with vision, audio, and unified captions, together with 274,933 user-style queries (vision-only, audio-only, and cross-modal with a hard bimodal constraint).

The repository provides:

  • Clip segmentation pipeline
  • Multimodal caption generation framework
  • User-style query generation and two-stage filtering
  • Unified evaluation harness covering 15 representative retrievers

Pipeline Overview

The FLARE dataset is constructed using a three-stage pipeline:

  1. Clip Segmentation: Segment long videos into fine-grained clips based on visual scene changes, ASR-driven semantics, and spectral novelty of the audio track.

  2. Caption Generation: Generate clip-level vision, audio, and unified captions, followed by hierarchical merging into video-level captions.

  3. Query Generation: Generate user-style queries for each modality and filter them by relevance / non-copy rules and the hard bimodal constraint.

The implementation of these steps is provided in the flare/data_construction/ directory. Evaluation of 15 baselines is provided in flare/evaluation/.


Project Structure

FLARE/
│
├── flare/
│   ├── data_construction/
│   │   ├── segmentation/
│   │   │   ├── pyscene_seg.py
│   │   │   ├── asr.py
│   │   │   ├── semantic_seg.py
│   │   │   └── spectral_seg.py
│   │   ├── caption_generation/
│   │   │   ├── sglang_gen.py
│   │   │   ├── sglang_gen_omni.py
│   │   │   ├── llm_judge.py
│   │   │   ├── sglang_vora.py
│   │   │   ├── sglang_unify.py
│   │   │   └── sglang_longmerge.py
│   │   └── query_generation/
│   │       ├── sglang_query.py
│   │       ├── sglang_query_unified.py
│   │       ├── query_filter_stage1.py
│   │       ├── query_filter_stage2.py
│   │       ├── query_unified_filter_stage1.py
│   │       └── query_unified_filter_stage2.py
│   └── evaluation/
│       ├── complete_retrieval_pipeline.py
│       ├── retrieval_adapters.py
│       ├── eval_run.sh
│       └── <model>_config.sh
│
└── README.md

Dataset Generation

The dataset generation pipeline consists of three stages. Please execute them in the following order.

Clip Segmentation  →  Caption Generation  →  Query Generation

For scripts that call Qwen models through SGLang, start an OpenAI-compatible SGLang server first, for example:

python -m sglang.launch_server \
  --model-path /path/to/Qwen3-VL-235B-A22B-Instruct \
  --host 0.0.0.0 \
  --port 30000 \
  --tp 8 \
  --trust-remote-code

export SGLANG_SERVER_URL="http://127.0.0.1:30000/v1"
export SGLANG_CHAT_COMPLETIONS_URL="http://127.0.0.1:30000/v1/chat/completions"

Replace --model-path with the model required by the specific script. Our experiments were conducted on a machine with 8 H20 GPUs, each with 96GB memory.
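
Once the server is running, a quick way to confirm it is reachable is an OpenAI-compatible chat request. The snippet below is a minimal sketch using the openai Python client; the model name is a placeholder and should match whatever checkpoint you launched.

import os
from openai import OpenAI

# Point the client at the SGLang server started above; SGLang accepts any API key.
client = OpenAI(
    base_url=os.environ.get("SGLANG_SERVER_URL", "http://127.0.0.1:30000/v1"),
    api_key="EMPTY",
)

resp = client.chat.completions.create(
    model="Qwen3-VL-235B-A22B-Instruct",  # placeholder; match your --model-path
    messages=[{"role": "user", "content": "Reply with 'ok' if you can read this."}],
    max_tokens=8,
)
print(resp.choices[0].message.content)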

All scripts configure their I/O through top-level constants at the head of each file. Edit these placeholders ("your input path for ..." / "your output path for ...") before running.
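
For instance, after editing, the head of a segmentation script might look roughly like this (constant names are taken from the lists below; the values are illustrative only):

# Illustrative placeholder values; see each script for its actual constants.
SOURCE = "/data/flare/raw_videos"          # replaces "your input path for ..."
DEST = "/data/flare/segmented_clips"       # replaces "your output path for ..."
max_parallel_workers = 8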


1. Clip Segmentation

Located at:

flare/data_construction/segmentation/

Run the four scripts below in order. The first performs visual segmentation; the remaining three handle clips that are still too long after visual segmentation, by transcribing them, splitting the transcripts into topically coherent spans, and adding spectral-novelty boundaries.

1.1 pyscene_seg.py — Visual scene segmentation

PySceneDetect ContentDetector segmentation. Thresholds and workers are set at the bottom of the script:

  • SOURCE: Directory containing original long videos
  • DEST: Output directory for segmented clips (<DEST>/<video_id>/Scene-*.mp4)
  • max_parallel_workers: Number of parallel workers

Example

cd flare/data_construction/segmentation
python pyscene_seg.py

1.2 asr.py — Audio transcription with word-level timestamps

Transcribes clips using Qwen3-ASR-1.7B with a forced aligner, and saves per-file transcript.txt + timestamps.json.

  • INPUT_LIST_FILE: Text file listing audio paths (one per line)
  • OUTPUT_ROOT: Output root; each audio file writes to <OUTPUT_ROOT>/<file_stem>/
  • ASR_MODEL_PATH: Path to the Qwen3-ASR checkpoint
  • FORCED_ALIGNER_PATH: Path to the forced aligner checkpoint
  • DEVICE: Torch device (default cuda)

Supports LOCAL_RANK / WORLD_SIZE environment variables for multi-GPU sharding.

Example

cd flare/data_construction/segmentation
# single GPU
python asr.py
# multi-GPU sharding (8 GPUs)
for i in $(seq 0 7); do
    CUDA_VISIBLE_DEVICES=$i LOCAL_RANK=$i WORLD_SIZE=8 python asr.py &
done
wait

1.3 semantic_seg.py — LLM-based semantic splitting

Uses Qwen3-235B-A22B-Instruct to segment the ASR transcript into topically coherent paragraphs, then maps them back to temporal spans using the word-level timestamps produced by asr.py. Segments shorter than MIN_DURATION seconds are merged.
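
Conceptually, the mapping from LLM paragraphs back to time spans works as sketched below. This is an illustration only, assuming word-level timestamps from asr.py; MIN_DURATION and the field layout are assumptions, not the script's exact code.

MIN_DURATION = 10.0  # assumed value; check semantic_seg.py for the real default

def paragraphs_to_spans(words, paragraph_lengths):
    """words: list of (token, start_sec, end_sec); paragraph_lengths: token count per paragraph."""
    spans, idx = [], 0
    for n in paragraph_lengths:
        chunk = words[idx:idx + n]
        spans.append((chunk[0][1], chunk[-1][2]))  # span = first word start .. last word end
        idx += n
    # Merge spans shorter than MIN_DURATION into the previous span.
    merged = []
    for start, end in spans:
        if merged and (end - start) < MIN_DURATION:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged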

  • ASR_OUTPUT_ROOT: Root directory written by asr.py
  • AUDIO_LIST_FILE: Audio list file (must match asr.py)
  • MAX_WORKERS: Number of parallel request workers

Segmented wavs are written as <ASR_OUTPUT_ROOT>/<file_stem>/segment_XXX.wav.

Example

cd flare/data_construction/segmentation
python semantic_seg.py

1.4 spectral_seg.py — Spectral novelty segmentation

Detects boundaries at non-speech acoustic events (music shifts, applause, ambient changes) using spectral-flux and KL-divergence novelty on the mel-spectrogram, scored with a robust MAD-based z-score.
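
The idea can be sketched with librosa as follows. This illustrates only the spectral-flux / MAD-z-score part; the KL-divergence term and the actual thresholds live in spectral_seg.py, and all settings here are illustrative.

import librosa
import numpy as np
from scipy.signal import find_peaks

y, sr = librosa.load("segment_000.wav", sr=None)          # path is a placeholder
mel_db = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr))

# Spectral-flux style novelty curve computed over the mel-spectrogram.
novelty = librosa.onset.onset_strength(S=mel_db, sr=sr)

# Robust z-score via the median absolute deviation (MAD).
med = np.median(novelty)
mad = np.median(np.abs(novelty - med)) + 1e-9
z = (novelty - med) / (1.4826 * mad)

# Pick well-separated peaks as candidate boundaries (illustrative settings).
peaks, _ = find_peaks(z, height=3.0, distance=50)
print(librosa.frames_to_time(peaks, sr=sr))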

  • TXT_FILE: Text file listing audio paths to re-segment
  • NUM_WORKERS: Number of parallel workers

Example

cd flare/data_construction/segmentation
python spectral_seg.py

2. Caption Generation

Located at:

flare/data_construction/caption_generation/

Run the six scripts below in order: generate vision and audio captions in parallel, run the LLM-based quality inspector, re-score audio-heavy clips, fuse the two modalities into a unified caption, and finally merge clip-level unified captions into video-level captions.

2.1 sglang_gen.py — Vision caption generation (Qwen3-VL-235B-A22B-Instruct)

Produces per-clip visual captions. Supports resume: existing video_paths in the output JSONL are skipped.
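
The resume behaviour amounts to the pattern below (a sketch; field names follow the output schema listed here):

import json, os

def load_done(output_jsonl_path):
    """Collect video_paths already present in the output JSONL so they can be skipped."""
    done = set()
    if os.path.exists(output_jsonl_path):
        with open(output_jsonl_path) as f:
            for line in f:
                try:
                    done.add(json.loads(line)["video_path"])
                except (json.JSONDecodeError, KeyError):
                    continue  # ignore truncated or malformed lines
    return done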

  • VIDEO_DIRECTORY: Directory of segmented clips (walked recursively)
  • OUTPUT_JSONL_PATH: Output JSONL (fields: video_path, num_frames_used, caption)
  • NUM_WORKERS: Number of parallel request workers

Requires SGLANG_SERVER_URL.

Example

cd flare/data_construction/caption_generation
python sglang_gen.py

2.2 sglang_gen_omni.py — Audio caption generation (Qwen3-Omni-30B-A3B-Instruct)

Produces per-clip audio captions. Input directory is expected to contain .wav files.

  • VIDEO_DIRECTORY: Directory of audio files (walked recursively)
  • OUTPUT_JSONL_PATH: Output JSONL (fields: video_path, audio_from_video, audio_caption)
  • NUM_WORKERS: Number of parallel request workers

Requires SGLANG_SERVER_URL.

Example

cd flare/data_construction/caption_generation
python sglang_gen_omni.py

2.3 llm_judge.py — Caption-quality inspection

LLM-based inspector that flags captions with degenerate repetition or semantic collapse. In our caption quality-assurance pipeline, this check is used together with EVQAScore for visual captions and BRACEScore for audio captions. Captions that fall below the calibrated quality thresholds (EVQAScore < 0.2 for visual captions; BRACEScore < 0.1 for audio captions), or are flagged by the LLM inspector, are manually reviewed and corrected before unified caption generation. The script outputs both the flagged records and a full debug JSONL.

  • INPUT_JSONL_PATH: Caption JSONL to inspect
  • OUTPUT_BAD_JSONL_PATH: JSONL collecting records judged as BAD
  • OUTPUT_DEBUG_JSONL_PATH: JSONL with per-record LLM verdict and evidence
  • CAPTION_FIELD: Caption field name (caption or audio_caption)
  • VIDEO_PATH_FIELD: Path field name (default video_path)
  • MAX_WORKERS: Number of parallel request workers

Requires SGLANG_CHAT_COMPLETIONS_URL.

Example

cd flare/data_construction/caption_generation
python llm_judge.py

2.4 sglang_vora.py — Audio-importance scoring

Classifies each clip as audio-driven or visual-driven and assigns an audio_importance_score in [0, 10], used to route clips to audio-driven segmentation.

  • INPUT_TXT: Text file listing video paths to score
  • OUTPUT_JSONL: Output JSONL (fields: video_path, llm_output)
  • MAX_WORKERS: Number of parallel request workers

Requires SGLANG_CHAT_COMPLETIONS_URL.

Example

cd flare/data_construction/caption_generation
python sglang_vora.py

2.5 sglang_unify.py — Unified caption generation (Qwen3-Omni-30B-A3B-Instruct)

Fuses the quality-assured vision and audio captions into a single unified caption per clip. Vision-caption and audio-caption JSONLs are joined on video_path.
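
The join itself is straightforward, as the sketch below shows (file paths are placeholders; field names follow the constants listed here):

import json

def load_jsonl(path, field):
    with open(path) as f:
        return {r["video_path"]: r[field] for r in map(json.loads, f)}

vision = load_jsonl("vision_captions.jsonl", "caption")       # placeholder path
audio = load_jsonl("audio_captions.jsonl", "audio_caption")   # placeholder path
# Only clips present in both caption files are fused into a unified caption.
pairs = {vp: (vision[vp], audio[vp]) for vp in vision.keys() & audio.keys()}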

  • VISION_JSONL: Vision caption JSONL (field caption)
  • AUDIO_JSONL: Audio caption JSONL (field audio_caption)
  • OUTPUT_JSONL_PATH: Output JSONL (fields: video_path, unified_caption)
  • EXCLUDE_JSONL: Optional JSONL whose video_paths should be skipped
  • NUM_WORKERS: Number of parallel request workers

Requires SGLANG_SERVER_URL.

Example

cd flare/data_construction/caption_generation
python sglang_unify.py

2.6 sglang_longmerge.py — Hierarchical video-level merging

Sorts clip-level unified captions by scene order, partitions them into clusters of MERGE_BATCH_SIZE, smooths the transition between adjacent clusters using the LLM, and produces one video-level caption per video_id. Resume-friendly: existing valid records in the output file are preserved.
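
The clustering step is essentially the following sketch; it assumes that scene ordering can be recovered from the zero-padded Scene-* naming produced by pyscene_seg.py, which may differ from the script's actual ordering logic.

MERGE_BATCH_SIZE = 10  # cluster size k

def make_clusters(clip_records):
    """clip_records: dicts with 'video_path' and 'unified_caption' for one video_id."""
    ordered = sorted(clip_records, key=lambda r: r["video_path"])  # Scene-001, Scene-002, ...
    return [ordered[i:i + MERGE_BATCH_SIZE] for i in range(0, len(ordered), MERGE_BATCH_SIZE)]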

  • INPUT_JSONL: Clip-level unified caption JSONL
  • OUTPUT_JSONL: Output JSONL (fields: video_id, num_clips, video_level_caption)
  • MERGE_BATCH_SIZE: Cluster size k (default 10)
  • VIDEO_MAX_WORKERS: Per-video parallelism
  • CLIP_MERGE_WORKERS: Per-cluster parallelism within each video

Requires SGLANG_CHAT_COMPLETIONS_URL.

Example

cd flare/data_construction/caption_generation
python sglang_longmerge.py

3. Query Generation

Located at:

flare/data_construction/query_generation/

This stage produces user-style queries from the three caption spaces (vision / audio / unified) and filters them in two stages: (i) relevance-vs-non-copy filtering against the source caption; (ii) retrieval-based validation against the full benchmark gallery, with the hard bimodal constraint applied to unified queries.

3.1 sglang_query.py — Single-modality query generation

Generates 3–5 user-style queries per vision or audio caption.

  • INPUT_JSONL_PATH: Caption JSONL (vision or audio)
  • OUTPUT_JSONL_PATH: Output JSONL (fields: sample_id, modality, caption, queries, ...)
  • DEFAULT_MODALITY: "vision" or "audio"
  • CAPTION_ID: Caption field name in the input (caption / audio_caption)
  • NUM_WORKERS: Number of parallel request workers

Requires SGLANG_SERVER_URL.

Example

cd flare/data_construction/query_generation
python sglang_query.py

3.2 sglang_query_unified.py — Cross-modal query generation

Generates 3–5 combined audiovisual queries per unified caption, each exposed as {combined_query, vision_part, audio_part} so that the hard bimodal constraint can be tested downstream.

  • INPUT_JSONL_PATH: Unified caption JSONL (field unified_caption)
  • OUTPUT_JSONL_PATH: Output JSONL (fields: sample_id, unified_caption, queries = list of {combined_query, vision_part, audio_part})
  • NUM_WORKERS: Number of parallel request workers

Requires SGLANG_SERVER_URL.

Example

cd flare/data_construction/query_generation
python sglang_query_unified.py

3.3 query_filter_stage1.py — Relevance + non-copy filter (single-modal)

Keeps queries with embedding similarity ≥ EMBEDDING_THRESHOLD against the source caption and ROUGE-L F1 ≤ ROUGE_L_THRESHOLD. Embeddings are produced by a SentenceTransformer encoder (BGE-Multilingual-Gemma2 in our setup) with multi-GPU support.
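
The keep rule reduces to the check below. This is an illustrative version only; the model identifier and threshold values are examples, not the script's exact settings.

from sentence_transformers import SentenceTransformer
from rouge_score import rouge_scorer

model = SentenceTransformer("BAAI/bge-multilingual-gemma2")   # or a local path
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

EMBEDDING_THRESHOLD = 0.6  # illustrative
ROUGE_L_THRESHOLD = 0.5    # illustrative

def keep_query(caption, query):
    cap_emb, q_emb = model.encode([caption, query], normalize_embeddings=True)
    similarity = float(cap_emb @ q_emb)
    rouge_l = scorer.score(caption, query)["rougeL"].fmeasure
    # High similarity to the caption, but not a near-copy of it.
    return similarity >= EMBEDDING_THRESHOLD and rouge_l <= ROUGE_L_THRESHOLD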

  • INPUT_JSONL: Input query JSONL with caption, queries, and sample_id or video_path
  • OUTPUT_JSONL: Output JSONL with appended query_scores and kept_queries
  • MODEL_NAME: Local path or HuggingFace ID of the embedding model
  • CAPTION_DB_CACHE_DIR: Pre-built caption embedding cache (caption_embeddings.npy + caption_meta.json)
  • QUERY_CACHE_DIR: Cache directory for encoded query embeddings (queries.npy + queries.meta.json)
  • FORCE_REBUILD_QUERY_CACHE: Recompute query embeddings even when a valid cache exists
  • EMBEDDING_THRESHOLD: Minimum caption-query embedding similarity for keeping a query
  • ROUGE_L_THRESHOLD: Maximum ROUGE-L F1 allowed to avoid near-copy queries
  • BATCH_SIZE: Embedding batch size
  • GPU_IDS / USE_MULTI_PROCESS: Multi-GPU encoding settings
  • NORMALIZE_EMBEDDINGS: Whether to normalize embeddings; must match the caption DB cache

Example

cd flare/data_construction/query_generation
python query_filter_stage1.py

3.4 query_filter_stage2.py — Retrieval-based validation (single-modal)

For each kept query from stage 1, searches the full caption gallery via a FAISS inner-product index and checks whether the target clip is retrieved at rank 1. Queries hitting the target at rank 1 are kept; all other queries are dropped.
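
In essence, the check looks like this (a sketch assuming an (N, d) matrix of normalized caption embeddings; variable names are illustrative):

import faiss
import numpy as np

# Build an inner-product index over the caption gallery once.
caption_embs = np.load("caption_embeddings.npy").astype("float32")  # normalized, shape (N, d)
index = faiss.IndexFlatIP(caption_embs.shape[1])
index.add(caption_embs)

def target_at_rank1(query_emb, target_idx, top_k=10):
    """query_emb: normalized (d,) vector; returns True if the target clip is ranked first."""
    _, ids = index.search(query_emb[np.newaxis, :].astype("float32"), top_k)
    return int(ids[0][0]) == target_idx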

  • QUERY_JSONL_PATH: Stage-1 output JSONL
  • CAPTION_DB_JSONL_PATH: Full caption gallery
  • CAPTION_KEY: Caption field in the gallery (caption or audio_caption)
  • OUTPUT_JSONL_PATH: Per-item diagnostics + rule_kept_queries / rule_dropped_queries
  • OUTPUT_SUMMARY_PATH: Aggregate summary JSON
  • MODEL_NAME: Embedding model (must match stage 1 and the caption cache)
  • CAPTION_CACHE_DIR: Caption embedding cache + caption_faiss_ip.index
  • QUERY_CACHE_DIR: Cache directory for query embeddings
  • TOP_K: Top-k depth for retrieval
  • QUERY_BATCH_SIZE: Embedding batch size
  • GPU_IDS / USE_MULTI_PROCESS: Multi-GPU encoding
  • NORMALIZE_EMBEDDINGS: Must match the caption cache
  • SAVE_FULL_TOPK_CAPTION: Also save top-k captions in the diagnostics
  • FORCE_REBUILD_QUERY_CACHE: Recompute query embeddings even if a cache exists

Example

cd flare/data_construction/query_generation
python query_filter_stage2.py

3.5 query_unified_filter_stage1.py — Relevance + non-copy filter (cross-modal)

Keeps queries with embedding similarity ≥ EMBEDDING_THRESHOLD against the source caption and ROUGE-L F1 ≤ ROUGE_L_THRESHOLD. Embeddings are produced by a SentenceTransformer encoder (BGE-Multilingual-Gemma2 in our setup) with multi-GPU support.

  • INPUT_JSONL: Cross-modal query JSONL with queries
  • OUTPUT_JSONL: Output JSONL with query_scores and kept_queries
  • MODEL_NAME: Local path or HuggingFace ID of the embedding model
  • CAPTION_DB_CACHE_DIR: Pre-built unified caption embedding cache (caption_embeddings.npy + caption_meta.json)
  • QUERY_CACHE_DIR: Cache directory for combined_query embeddings
  • FORCE_REBUILD_QUERY_CACHE: Recompute query embeddings even if a cache exists
  • EMBEDDING_THRESHOLD: Minimum embedding similarity for keeping a query
  • ROUGE_L_THRESHOLD: Maximum ROUGE-L F1 allowed to avoid near-copy queries
  • BATCH_SIZE: Embedding batch size
  • GPU_IDS / USE_MULTI_PROCESS / NORMALIZE_EMBEDDINGS: Multi-GPU encoding and normalization settings

Example

cd flare/data_construction/query_generation
python query_unified_filter_stage1.py

3.6 query_unified_filter_stage2.py — Hard bimodal constraint

For each cross-modal query, runs three independent searches: (i) vision_part against the vision gallery, (ii) audio_part against the audio gallery, (iii) combined_query against the unified gallery. Stage 2 uses kept_queries from stage 1 when present and falls back to queries otherwise. A query is kept as hard bimodal only if the vision-only and audio-only searches both fail (target rank > VISION_FAIL_TOPK / AUDIO_FAIL_TOPK) and the joint search succeeds (target rank ≤ JOINT_PASS_TOPK).
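
The keep decision can be written down directly, as in the sketch below (the actual top-k thresholds are set by the constants listed next; ranks are assumed to be 1-based):

def is_hard_bimodal(vision_rank, audio_rank, joint_rank,
                    vision_fail_topk, audio_fail_topk, joint_pass_topk):
    """Ranks of the target clip in each search; None means the target was not retrieved at all."""
    vision_fails = vision_rank is None or vision_rank > vision_fail_topk
    audio_fails = audio_rank is None or audio_rank > audio_fail_topk
    joint_passes = joint_rank is not None and joint_rank <= joint_pass_topk
    return vision_fails and audio_fails and joint_passes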

  • QUERY_JSONL_PATH: Stage-1 unified output JSONL
  • VISION_DB_JSONL_PATH / AUDIO_DB_JSONL_PATH / UNIFIED_DB_JSONL_PATH: Three modality-specific caption galleries
  • VISION_CACHE_DIR / AUDIO_CACHE_DIR / UNIFIED_CACHE_DIR: Corresponding caption embedding caches and FAISS indexes
  • QUERY_CACHE_DIR: Cache for query embeddings
  • OUTPUT_JSONL_PATH: Per-item results with hard_bimodal_queries, dropped_queries, diagnostics
  • OUTPUT_SUMMARY_PATH: Aggregate summary JSON

Example

cd flare/data_construction/query_generation
python query_unified_filter_stage2.py

Evaluation

The evaluation pipeline is located in:

flare/evaluation/

It supports evaluation for:

  • Text-to-Clip Retrieval
  • Clip-to-Text Retrieval
  • Text-to-Video Retrieval
  • Video-to-Text Retrieval

under two query regimes (caption-based and query-based) and three modality scopes (vision, audio, unified audiovisual).

Supported baselines

MODEL values by modality scope:

  • Vision–Text: clip, metaclip, siglip2, videoclip_xl_v2, qwen3_vl_embedding
  • Audio–Text: ms_clap_2022, ms_clap_2023, laion_clap, m2d_clap, dasheng_glap, aurola
  • Audiovisual–Text: imagebind_av, languagebind_av, perception_av, wave

Input JSONL format

Key fields per JSONL:

  • vision_clip.jsonl: video_path, caption
  • audio_clip.jsonl: video_path, audio_caption
  • unified_clip.jsonl: video_path, unified_caption
  • vision_query.jsonl: video_path, caption
  • audio_query.jsonl: video_path, audio_caption
  • unified_query.jsonl: video_path, unified_caption
  • video_caption.jsonl: video_id, video_level_caption

video_path is expected to follow <video_id>/<clip_id>.mp4. Full-length media are located through FULL_VIDEO_DIR / FULL_AUDIO_DIR and referenced by video_id.

Pipeline stages

inspect_data → encode_text_chunk → merge_text   ┐
                                                 ├→ calc_metrics
               encode_gallery    → merge_gallery ┘

Each stage of complete_retrieval_pipeline.py is triggered by its own flag (--inspect_data, --encode_text_chunk, --merge_text, --encode_gallery, --merge_gallery, --calc_metrics). Chunked stages accept --num_chunks / --chunk_idx and can be parallelised across GPUs. eval_run.sh glues them together.

Running a baseline

Each baseline has a config stub under flare/evaluation/<model>_config.sh. Fill in the "your input path for ..." placeholders with your local paths, then run:

cd flare/evaluation
bash eval_run.sh clip_config.sh

Common config variables

  • MODEL: One of the baselines listed above
  • QUERY_MODE: vision / audio / unified
  • MEDIA_MODE: vision / audio / unified (only for audiovisual baselines)
  • RUN_NAME: Used to derive the output directory ${EXP_DIR}/${RUN_NAME}/
  • EXP_DIR: Root directory for cached features, chunks, and metrics.json
  • GPU_COUNT: Number of GPUs used by eval_run.sh (default 8)
  • VISION_CLIP_JSONL / AUDIO_CLIP_JSONL / UNIFIED_CLIP_JSONL: Clip-level caption JSONLs
  • VISION_QUERY_JSONL / AUDIO_QUERY_JSONL / UNIFIED_QUERY_JSONL: Clip-level query JSONLs; the matching full *_CLIP_JSONL must also be provided
  • VIDEO_CAPTION_JSONL: Video-level caption JSONL
  • FULL_VIDEO_DIR: Directory and extension of full-length videos
  • FULL_AUDIO_DIR: Directory and extension of full-length audio
  • TOPK: Comma-separated Recall@K list (default 1,5,10)
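
For reference, Recall@K here is the fraction of queries whose ground-truth item appears among the top K retrieved results. A generic sketch (not the exact code in calc_metrics):

import numpy as np

def recall_at_k(ranked_ids, gt_ids, k):
    """ranked_ids: (num_queries, depth) retrieved item ids; gt_ids: (num_queries,) ground truth."""
    hits = (np.asarray(ranked_ids)[:, :k] == np.asarray(gt_ids)[:, None]).any(axis=1)
    return float(hits.mean())

# e.g. for k in (1, 5, 10): print(k, recall_at_k(ranked_ids, gt_ids, k))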

License

The code in this repository is released under the MIT License. The FLARE dataset is released under CC BY 4.0.

