Multimodal Perception Atomic Skill
Quick Start · Examples · Models · Agent Integration · 中文文档
One command. Files in, text or JSON out.
QSense is not an app — it's the lowest-level perception primitive for skills, agents, and scripts. It does one thing: send multimodal input to an LLM, get text or structured JSON back. Video splitting, audio segmentation, batch processing, rerun loops, and workflow logic still belong to the caller.
┌──────────────────────────────────────────────────────┐
│ Skills / Agents / Scripts │
│ video review, meeting notes, OCR pipeline, ... │
├──────────────────────────────────────────────────────┤
│ QSense ← you are here │
│ image / audio / video → LLM → text │
├──────────────────────────────────────────────────────┤
│ OpenAI-compatible API (Gemini, Claude, GPT, Grok…) │
└──────────────────────────────────────────────────────┘
| Feature | Detail | |
|---|---|---|
| 🖼 | Image | Auto-resize & encode local files; passthrough remote URLs |
| 🎙 | Audio | Streaming download & base64 encode (OpenAI input_audio format) |
| 🎬 | Video | Direct encode (default) or ffmpeg frame extraction + audio track |
| 🤖 | Multi-model | Gemini / Claude / GPT / Grok / Kimi / Gemma — YAML registry |
| ⚡ | Auto-adapt | Stream/non-stream fallback, model capability matching |
| 🔌 | Agent-ready | Plain text or JSON stdout, [qsense] stderr errors, exit 0/1, zero side effects |
Recommended (global, no activation needed):
pipx install qsense-cli
qsense --prompt "Describe this image" --image photo.png
# First run will interactively guide API key setupFor development:
bash setup.sh && source .venv/bin/activate
# or: uv venv --python 3.12 && source .venv/bin/activate && uv pip install -e .For agents / CI:
pipx install qsense-cli
QSENSE_API_KEY=sk-xxx qsense init --api-key $QSENSE_API_KEY# ── Image ──────────────────────────────────────────
qsense --prompt "What's in this image?" --image screenshot.png
qsense --prompt "Compare these" --image a.png --image b.png
qsense --prompt "Describe" --image https://example.com/photo.jpg
# ── Role-aware review ──────────────────────────────
qsense --prompt "Review the target against the references." --target page.png --reference ref.png --spec review.md
# ── Structured output ──────────────────────────────
qsense --prompt "Extract review findings." --target screenshot.png --schema review.schema.json --output json
# ── Higher visual detail ───────────────────────────
qsense --prompt "Read the dense chart." --target chart.png --vision-fidelity max
# ── Audio ──────────────────────────────────────────
qsense --prompt "Transcribe this" --audio recording.wav
qsense --prompt "What genre?" --audio https://example.com/song.mp3
# ── Video (direct passthrough) ─────────────────────
qsense --prompt "Summarize this video" --video clip.mp4
# ── Video (frame extraction) ──────────────────────
qsense --prompt "Describe" --video clip.mp4 --video-extract --fps 2
# ── Mixed modalities ──────────────────────────────
qsense --prompt "Analyze" --image frame.png --audio narration.wav
# ── Model override ────────────────────────────────
qsense --model anthropic/claude-opus-4-6 --prompt "Analyze" --image photo.pngqsense models # List all models
qsense models --detail # Show detailed limits| Model | Vision | Audio | Video | Context |
|---|---|---|---|---|
google/gemini-3-flash-preview |
✅ | ✅ | native | 1M |
google/gemini-3.1-pro-preview |
✅ | ✅ | native | 1M |
gemma-4-31B-it |
✅ | — | extract | 256K |
anthropic/claude-opus-4-6 |
✅ | — | — | 1M |
anthropic/claude-sonnet-4-6 |
✅ | — | — | 1M |
gpt-5.4 |
✅ | — | — | — |
grok-4.20-beta |
✅ | — | — | 256K |
Kimi-K2.5 |
✅ | — | native* | 256K |
* experimental
qsense [OPTIONS] [COMMAND]
Options:
--prompt TEXT Text prompt (required for inference)
--image TEXT Image path or URL (repeatable)
--target TEXT Main artifact under review (repeatable, max 1)
--reference TEXT Comparison asset, previous version, or style reference
--context TEXT Supporting context file or media input
--spec TEXT Text or media review criteria / requirements
--audio TEXT Audio file path or URL (repeatable)
--video TEXT Video file path or URL (repeatable)
--video-extract Use ffmpeg frame extraction
--fps FLOAT Extraction frame rate (default: 1)
--max-frames INT Max extracted frames (default: 30)
--model TEXT Override default model
--system TEXT Optional system prompt
--output [text|json] Output mode (default: text)
--schema TEXT Optional JSON schema file for validating response text
--vision-fidelity [low|standard|max]
Provider-neutral image detail control
--timeout INT Request timeout in seconds
--max-size INT Max image longest side in px (default: 2048)
Commands:
init Initialize configuration
config Show or update configuration
models List available models
Priority: CLI flags > environment variables > ~/.qsense/.env
# Show current config
qsense config
# Update
qsense config --model google/gemini-3.1-pro-preview
qsense config --api-key sk-xxx
qsense config --base-url https://api.openai.com/v1
# Environment variables
export QSENSE_API_KEY=sk-xxx
export QSENSE_BASE_URL=https://api.openai.com/v1
export QSENSE_MODEL=google/gemini-3-flash-previewQSense is an atomic skill — the smallest indivisible unit of perception.
What QSense does — send files to a model, return text or a structured response envelope. That's it.
What QSense does NOT do — batch iteration, rerun loops, conversation management, workflow orchestration, or domain-specific review rubrics. All left to the caller.
# Compose with higher-level skills
ffmpeg -i long.mp4 -segment_time 60 -f segment chunk_%03d.mp4
for f in chunk_*.mp4; do
qsense --prompt "Summarize this minute" --video "$f" >> result.txt
doneStay atomic, stay composable. See docs/design-rationale.md for the full story.
QSense is a Skill + CLI project: the CLI is the execution layer, the Skill is the knowledge layer that teaches AI agents how to use it effectively.
GitHub: https://github.com/hezi-ywt/qsense
Copy the following to your agent — it knows how to install skills for its own platform:
Install the qsense multimodal perception skill from https://github.com/hezi-ywt/qsense
The skill follows the Agent Skills standard (https://agentskills.io).
Install it using your platform's skill installation method.
For example: npx skills add hezi-ywt/qsense
skills/qsense/
├── SKILL.md # Stable facts
│ # Command syntax, output contract, error guide
│
└── references/
├── models.md # Model knowledge
│ # Capabilities, limits, video/audio strategy
│ # Syncs with `qsense models --detail`
│
└── user-notes.md # Living memory
# Agent-maintained: preferences, patterns, lessons
| File | Changes | Maintained by |
|---|---|---|
SKILL.md |
Rarely — only when CLI changes | Developer |
models.md |
When models are added/updated | Developer + Agent sync |
user-notes.md |
Continuously during use | Agent automatically |
The agent reads user-notes.md before each use and updates it when it learns something — a model preference, a failed command's fix, a recurring workflow. The more you use it, the better it gets.
⚠️ Experimental — This meta-skill is in early stages and has not been validated across multiple real-world skill creation cycles. Design principles are distilled from building qsense, but may evolve significantly.
This repo also includes skills/skill-craft/, a meta-skill that teaches agents how to design and evaluate Agent Skills. It covers:
- Structure — Three-file layering by change frequency, progressive context loading
- 6 Design Principles — Greedy-but-dense descriptions, explain why not MUST, no parroting, skill memory, give URLs not commands, atomic scripts
- Evaluation System — Subagent-based parallel testing, grading, blind A/B comparison, automated description optimization with train/test split
- CLI Design — Separated into
references/cli-design.md(not every skill needs a CLI)
skills/skill-craft/
├── SKILL.md # Design principles & structure guide
├── agents/ # Subagent instructions (grader, comparator, analyzer)
├── references/ # Evaluation workflow, CLI design, JSON schemas, examples
└── scripts/ # Automated eval: trigger testing, description optimization
src/qsense/
cli.py Click CLI entry point
client.py OpenAI-compatible API client
config.py Three-tier config: CLI > env > file
contracts.py Role-aware request contract helpers
image.py Image validation, resize, encoding
audio.py Audio validation, download, encoding
response.py Structured response envelope
schema.py Optional JSON schema validation
video.py Video passthrough and frame extraction
models.py Model registry loader
registry.yaml Curated model capabilities database
- Python >= 3.10
- ffmpeg (only for
--video-extractmode)
MIT