HiMu: Hierarchical Multimodal Frame Selection

Training-free, plug-and-play frame selection for long-form video question answering.

HiMu selects the K most question-relevant frames from any video for your downstream Vision-Language Model (VLM). No fine-tuning required.

How It Works

HiMu is a 5-stage pipeline:

1. Query Parsing - An LLM decomposes the question into a hierarchical logic tree. Leaf nodes are (expert, query) pairs; internal nodes are fuzzy operators (AND, OR, SEQ, RIGHT_AFTER).

2. Expert Signal Extraction - Five experts across two modalities ground atomic predicates into per-frame scores:

Expert	Modality	Backbone	Detects
CLIP	Visual	DFN5B ViT-H/14	Scenes, actions, concepts
OVD	Visual	YOLO-World v2	Objects, people
OCR	Visual	docTR	On-screen text
ASR	Audio	faster-whisper turbo	Speech content
CLAP	Audio	LAION CLAP	Environmental sounds

3. Normalization + Smoothing - Robust median/MAD normalization with sigmoid mapping, then modality-specific Gaussian smoothing (visual: 0.5 s, speech: 1.5 s, audio: 2.0 s).

4. Fuzzy Logic Composition - Bottom-up tree evaluation with continuous operators:
- AND: product t-norm
- OR: probabilistic sum
- SEQ: temporal ordering with backward/forward gates
- RIGHT_AFTER: exponential decay for tight adjacency

5. Frame Selection - Select the top-K frames from the resulting satisfaction curve.
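
The normalization, smoothing, and frame-selection stages above can be sketched in a few lines of NumPy. This is a minimal illustration, not HiMu's internal code: the function names, the 1.4826 MAD scaling constant, the 3-sigma kernel radius, and the toy scores are all assumptions made for this example.

```python
import numpy as np


def robust_sigmoid(raw, eps=1e-6):
    """Median/MAD normalization followed by a sigmoid, mapping scores into (0, 1)."""
    raw = np.asarray(raw, dtype=float)
    med = np.median(raw)
    mad = np.median(np.abs(raw - med))
    # 1.4826 rescales the MAD to be comparable to a standard deviation (assumed choice).
    z = (raw - med) / (1.4826 * mad + eps)
    return 1.0 / (1.0 + np.exp(-z))


def gaussian_smooth(scores, sigma_sec, fps):
    """Smooth a per-frame curve with a Gaussian whose width is given in seconds."""
    sigma = sigma_sec * fps  # convert seconds to frames
    if sigma <= 0:
        return np.array(scores, dtype=float)
    radius = max(1, int(round(3 * sigma)))
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(scores, radius, mode="edge")  # avoid darkening the borders
    return np.convolve(padded, kernel, mode="valid")


def top_k_frames(curve, k):
    """Indices of the k highest-scoring frames, returned in chronological order."""
    k = min(k, len(curve))
    best = np.argsort(curve)[::-1][:k]
    return np.sort(best)


# Toy raw scores for one visual expert, sampled at 2 FPS (made-up values).
fps = 2.0
raw = np.array([0.1, 0.1, 0.2, 0.9, 1.0, 0.8, 0.2, 0.1, 0.1, 0.1])
curve = gaussian_smooth(robust_sigmoid(raw), sigma_sec=0.5, fps=fps)
selected = top_k_frames(curve, k=3)
print(selected)
```

Median/MAD statistics keep a few very high detections from compressing the rest of the curve, which is presumably why a robust normalization is used before the sigmoid.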

Example: Questions and Logic Trees
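
As a hypothetical illustration, the question "What is the chef doing when the timer goes off?" might decompose into the tree below. The dictionary schema, the chosen experts and queries, and the per-frame scores are assumptions made for this sketch, not HiMu's actual tree format or output. Only AND (product t-norm) and OR (probabilistic sum) are evaluated; SEQ and RIGHT_AFTER are left out because their exact formulas are not given here.

```python
import numpy as np

# Hypothetical tree: leaves are (expert, query) pairs, internal nodes are operators.
tree = {
    "op": "AND",
    "children": [
        {"expert": "OVD", "query": "chef"},
        {
            "op": "OR",
            "children": [
                {"expert": "CLAP", "query": "timer beeping"},
                {"expert": "ASR", "query": "time is up"},
            ],
        },
    ],
}

# Made-up, already-normalized per-frame scores for each leaf (4 frames).
leaf_scores = {
    ("OVD", "chef"): np.array([0.9, 0.8, 0.1, 0.9]),
    ("CLAP", "timer beeping"): np.array([0.1, 0.5, 0.9, 0.2]),
    ("ASR", "time is up"): np.array([0.0, 0.5, 0.0, 0.8]),
}


def evaluate(node, scores):
    """Bottom-up evaluation of a logic tree into a per-frame satisfaction curve."""
    if "op" not in node:  # leaf: look up the expert's curve
        return scores[(node["expert"], node["query"])]
    children = [evaluate(child, scores) for child in node["children"]]
    if node["op"] == "AND":  # product t-norm
        return np.prod(children, axis=0)
    if node["op"] == "OR":  # probabilistic sum: 1 - prod(1 - x)
        return 1.0 - np.prod([1.0 - c for c in children], axis=0)
    raise NotImplementedError(f"operator {node['op']} is not covered by this sketch")


truth_curve = evaluate(tree, leaf_scores)
print(truth_curve)                  # per-frame satisfaction
print(int(np.argmax(truth_curve)))  # best frame index
```

Because both operators are continuous, a frame with moderate evidence from several leaves can still outrank a frame with one strong detection, which a hard Boolean AND/OR would not allow.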

Installation

pip install himu              # core (requires OpenAI API key for tree parsing)
pip install "himu[all]"       # all expert backends

Install individual expert backends:

pip install "himu[clip]"      # CLIP (OpenCLIP)
pip install "himu[ovd]"       # YOLO-World (object detection)
pip install "himu[ocr]"       # docTR (on-screen text)
pip install "himu[asr]"       # faster-whisper (speech)
pip install "himu[clap]"      # CLAP (environmental audio)

Running the Examples

The examples in examples/ use a sample video from the Video-MME benchmark. Before running them, download the video:

python assets/videomme/download_video.py

This downloads a ~4 min YouTube video about space debris. The subtitle file and sample questions are already included.

Quick Start

from himu import HiMuSelector

selector = HiMuSelector()

result = selector.select_frames(
    video_path="video.mp4",
    question="What happens after the narrator mentions the chemical reaction?",
    num_frames=16,
)

print(result.frame_indices)   # [12, 14, 15, 17, ...]
print(result.best_timestamp)  # 14.0
print(result.truth_curve)     # per-frame satisfaction scores

With MCQ Candidates

result = selector.select_frames(
    video_path="video.mp4",
    question="What is the chef wearing?",
    candidates=["Blue apron", "Red jacket", "Green hat"],
    num_frames=16,
)

With a VLM (GPT-4o example)

import base64

import cv2
from openai import OpenAI
from himu import HiMuSelector

question = "What is the chef wearing?"

selector = HiMuSelector()
result = selector.select_frames("video.mp4", question, num_frames=16)

# Extract the selected frames in chronological order
cap = cv2.VideoCapture("video.mp4")
native_fps = cap.get(cv2.CAP_PROP_FPS)
images = []
for idx in sorted(result.frame_indices):
    # frame_indices refer to frames sampled at result.fps; map to the native frame index
    cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx * native_fps / result.fps))
    ret, frame = cap.read()
    if ret:
        _, buf = cv2.imencode(".jpg", frame)
        images.append(base64.b64encode(buf).decode())
cap.release()

# Send the question and frames to GPT-4o
client = OpenAI()
content = [{"type": "text", "text": question}]
for b64 in images:
    content.append({"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)

Configuration

Presets

from himu import create_himu_selector

selector = create_himu_selector("default")       # all experts
selector = create_himu_selector("visual_only")   # no audio experts
selector = create_himu_selector("fast")          # CLIP-only, no smoothing

Custom Configuration

from himu import HiMuSelector, HiMuConfig
from himu.config import ExpertConfig, SmoothingConfig

config = HiMuConfig(
    fps=2.0,                                      # sample at 2 FPS
    clip_preset="dfn5b-clip",                     # CLIP model preset
    asr=ExpertConfig(enabled=False),              # disable ASR
    smoothing=SmoothingConfig(visual_sigma=1.0),  # wider visual smoothing
)

selector = HiMuSelector(config=config, device="cuda:0")

LLM Backends

from himu import HiMuSelector
from himu.llm import create_llm

# OpenAI (default)
llm = create_llm("openai", model="gpt-4o")

# Google Gemini
llm = create_llm("gemini", model="gemini-2.0-flash")

# Local Qwen3 (no API key needed)
llm = create_llm("qwen3", model="Qwen/Qwen3-8B", device="cuda:1")

selector = HiMuSelector(llm=llm)

Feature Caching

Cache query-independent features for fast multi-query processing:

selector = HiMuSelector(cache_dir="/tmp/himu_cache")

# Extract features once
selector.cache_features("video.mp4")

# Fast per-query selection (only OVD re-runs)
for q in questions:
    result = selector.select_frames("video.mp4", q)
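
The idea behind caching query-independent features can be sketched as a small on-disk cache. This is an assumption-laden illustration rather than HiMu's implementation: `cache_key`, `cached_features`, `fake_clip_embed`, and the `.npy` layout are hypothetical names and choices made for this example.

```python
import hashlib
import tempfile
from pathlib import Path

import numpy as np


def cache_key(video_path, expert):
    """Key on path, size, and mtime so a modified video invalidates its entry."""
    path = Path(video_path).resolve()
    stat = path.stat()
    raw = f"{path}:{stat.st_size}:{stat.st_mtime_ns}:{expert}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]


def cached_features(video_path, expert, extract_fn, cache_dir):
    """Return features for (video, expert), running extract_fn only on a cache miss."""
    cache_file = Path(cache_dir) / f"{cache_key(video_path, expert)}.npy"
    if cache_file.exists():
        return np.load(cache_file)
    features = np.asarray(extract_fn(video_path))
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    np.save(cache_file, features)
    return features


# Stand-in for an expensive, query-independent extractor (hypothetical).
calls = []

def fake_clip_embed(path):
    calls.append(path)
    return np.ones((4, 8), dtype=np.float32)


with tempfile.TemporaryDirectory() as tmp:
    video = Path(tmp) / "video.mp4"
    video.write_bytes(b"placeholder bytes, not a real video")
    cache = Path(tmp) / "cache"
    first = cached_features(str(video), "clip", fake_clip_embed, cache)
    second = cached_features(str(video), "clip", fake_clip_embed, cache)

print(len(calls))  # the extractor runs once; the second call is served from disk
```

Features that depend on the question, such as open-vocabulary detections for query-specific classes, cannot be reused this way, which is consistent with OVD re-running per query.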

API Reference

`HiMuSelector`

HiMuSelector(
    config: HiMuConfig = None,     # Pipeline configuration
    llm: BaseLLM = None,           # LLM for query parsing
    device: str = None,            # GPU device (default: "cuda")
    cache_dir: str = None,         # Feature cache directory
    verbose: bool = False,         # Detailed logging
)

`select_frames()`

result = selector.select_frames(
    video_path: str,               # Path to video file
    question: str,                 # Natural language question
    candidates: List[str] = None,  # MCQ answer options
    num_frames: int = 16,          # Number of frames to select
    fps: float = None,             # Frame rate (default: config.fps)
)

`FrameSelectionResult`

Field	Type	Description
`frame_indices`	`np.ndarray`	Selected frame indices (sorted by score)
`timestamps`	`np.ndarray`	Timestamps of selected frames
`scores`	`np.ndarray`	Satisfaction scores of selected frames
`truth_curve`	`np.ndarray`	Per-frame satisfaction curve
`tree`	`dict`	Logic tree from LLM
`best_frame_idx`	`int`	Index of highest-scoring frame
`best_timestamp`	`float`	Timestamp of best frame
`best_score`	`float`	Score of best frame
`num_frames`	`int`	Total extracted frames
`fps`	`float`	Frame extraction rate

Requirements

- Python >= 3.10
- GPU with >= 8 GB VRAM (for all experts)
- OpenAI API key (for default LLM tree parsing)

Main Results

Accuracy (%) on Video-MME (Short, Medium, Long, and Overall columns), LongVideoBench_val (LVB_val), and HERBench-Lite. All methods select K=16 frames. HiMu is a plug-and-play enhancer: it requires no model-specific tuning.

Model	Method	Short	Medium	Long	Overall	LVB_val	HERBench-Lite
Qwen3-VL-8B	Uniform	76.34	66.31	55.58	66.36	55.74	41.70
Qwen3-VL-8B	HiMu	78.55	71.00	69.90	73.22	64.19	43.22

LLaVA-OV-1.5-8B	Uniform	72.26	62.33	54.85	63.55	54.33	35.75
LLaVA-OV-1.5-8B	HiMu	71.87	66.99	63.94	67.65	57.85	35.87

InternVL-3.5-8B	Uniform	75.41	67.39	56.55	66.63	59.23	38.30
InternVL-3.5-8B	HiMu	76.92	70.40	66.50	71.35	64.11	38.32

Qwen2.5-VL-7B	Uniform	72.49	61.01	53.82	62.57	54.58	34.05
Qwen2.5-VL-7B	HiMu	73.08	65.10	62.86	67.09	57.51	35.17

Gemma-3-12B	Uniform	73.31	60.29	55.83	62.99	47.59	31.20
Gemma-3-12B	HiMu	71.56	65.70	67.48	68.28	53.92	31.47

Gemini-2.5-Flash*	Uniform	78.43	67.11	61.22	68.95	56.33	37.27
Gemini-2.5-Flash*	HiMu	78.92	75.00	74.49	76.11	70.13	37.68

GPT-4o*	Uniform	76.96	73.03	71.43	73.81	55.58	37.47
GPT-4o*	HiMu	80.88	77.19	76.53	78.18	65.10	40.68

*Proprietary LVLMs evaluated on a stratified random 25% subset of each benchmark.

HiMu at 16 frames outperforms uniform sampling at 64 frames, a 4x reduction in frame budget.
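
To read the table as gains over uniform sampling, the Video-MME Overall column can be differenced per model. The snippet below is only arithmetic on the numbers reported above; note that the starred proprietary models were evaluated on a 25% subset, so their gains are not directly comparable to the others.

```python
# (Uniform, HiMu) Video-MME Overall accuracy, copied from the results table.
overall = {
    "Qwen3-VL-8B": (66.36, 73.22),
    "LLaVA-OV-1.5-8B": (63.55, 67.65),
    "InternVL-3.5-8B": (66.63, 71.35),
    "Qwen2.5-VL-7B": (62.57, 67.09),
    "Gemma-3-12B": (62.99, 68.28),
    "Gemini-2.5-Flash": (68.95, 76.11),
    "GPT-4o": (73.81, 78.18),
}

# Absolute gain in accuracy points for each model.
gains = {model: round(himu - uniform, 2) for model, (uniform, himu) in overall.items()}
largest = max(gains, key=gains.get)
smallest = min(gains, key=gains.get)

for model, gain in gains.items():
    print(f"{model:18s} +{gain:.2f}")
print("largest gain:", largest, "| smallest gain:", smallest)
```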

License

MIT
