You changed three things at once. The output got worse. Now you can't tell which change broke it.
hunch_kit is a local-first experiment framework for creative workflows where you need to isolate variables, track lineage, and score outputs that require human eyes — slide decks, generated images, prompt compositions, PDFs, design systems.
One hunch. One variable. One comparison. Repeat until you actually know what works.
```bash
# Scaffold a campaign
hunch init-project ./my_campaign --name "slide_styles"
cd my_campaign

# Create your baseline experiment
hunch init --id ex_001_baseline \
  --hypothesis "Default styling produces acceptable slides" \
  --variable "style_template" --value "default"

# Drop your input file into the experiment directory, then run it
hunch run ex_001_baseline

# Branch: test a hunch
hunch init --id ex_002_dark_mode \
  --hypothesis "Dark mode improves visual hierarchy" \
  --variable "style_template" --value "dark_professional" \
  --baseline ex_001_baseline
hunch run ex_002_dark_mode

# Open the evaluation UI — compare side-by-side, score with your rubric
hunch eval

# See the family tree
hunch lineage
# root
# └── ex_001_baseline
#     └── ex_002_dark_mode
```

Existing experiment tracking tools — Promptfoo, Langfuse, Agenta, MLflow — are built for programmatic LLM API pipelines. They assume your outputs are text strings you can diff, and that your evaluation is automated scoring against assertions.
That doesn't work when:
- Your output is visual — a PDF, a slide deck, an image, a design composition
- Your evaluation is subjective — "does this feel better?" requires human eyes
- Your backend isn't an LLM API — it's a rendering pipeline, a style system, a build tool
hunch_kit fills that gap: structured experimentation with human-in-the-loop evaluation, for any input→output workflow.
```bash
pip install -e .
```

With MCP server support (for AI-assisted experiment planning):

```bash
pip install -e ".[mcp]"
```

Every experiment declares one hypothesis, one variable being changed, and its baseline (parent). The manifest schema enforces single-variable isolation — no more "I changed the font and the colours and the spacing."
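The single-variable constraint is easiest to see in the manifest itself. A hypothetical `experiment.yaml` sketch (field names are illustrative, not the exact hunch_kit schema):

```yaml
# experiments/ex_002_dark_mode/experiment.yaml
# Illustrative sketch; keys are assumptions, not the exact schema.
id: ex_002_dark_mode
hypothesis: "Dark mode improves visual hierarchy"
variable: style_template      # exactly one variable per experiment
value: dark_professional
baseline: ex_001_baseline     # parent experiment; this builds the lineage
status: pending
```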
A provider is anything that takes an input and produces an output. Text generators, slide deck builders, image pipelines, style evaluators — implement `run(input, config) → result` and you're done.
A local web UI presents outputs side-by-side for scoring against rubric dimensions. Configurable scales with calibration anchors ("1 = Unusable, 5 = Acceptable, 10 = Exceptional"). Scores write directly back to the experiment manifest.
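A rubric pairs scoring dimensions with those calibration anchors. A hypothetical sketch of a rubric file (keys are illustrative, not the exact hunch_kit schema):

```yaml
# rubrics/slide_quality.yaml
# Illustrative sketch; keys are assumptions, not the exact schema.
name: slide_quality
scale: [1, 10]
dimensions:
  - name: visual_hierarchy
    description: "Does the eye land on the right thing first?"
    anchors:
      1: "Unusable"
      5: "Acceptable"
      10: "Exceptional"
  - name: readability
    description: "Is the text legible at presentation distance?"
    anchors:
      1: "Unusable"
      5: "Acceptable"
      10: "Exceptional"
```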
Every experiment knows its parent. Over time this builds a genealogy — a traceable path from initial hunch to validated output. No more "wait, which version was the good one?"
No cloud. No accounts. No data leaves your machine. Everything is YAML files on disk.
Work is organised in campaigns — a directory containing hunch_project.yaml at the root:
```
my_campaign/
├── hunch_project.yaml        # Campaign config (defaults, provider, rubric)
├── experiments/
│   ├── ex_001_baseline/
│   │   ├── experiment.yaml   # ← the manifest (single source of truth)
│   │   ├── input.txt         # your input
│   │   └── output/           # provider output
│   └── ex_002_dark_mode/
│       ├── experiment.yaml
│       └── output/
├── rubrics/
│   └── slide_quality.yaml    # scoring dimensions + anchors
└── providers/                # optional campaign-local providers
    └── my_backend.py         # auto-discovered, overrides built-ins
```
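The campaign config at the root could look something like this (a sketch based on the description above; key names are assumptions, not the exact hunch_kit schema):

```yaml
# hunch_project.yaml
# Illustrative sketch; keys are assumptions, not the exact schema.
name: slide_styles
provider: my_backend       # default provider for new experiments
rubric: slide_quality      # default rubric for evaluation
defaults:
  variable: style_template
```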
Campaign resolution: the CLI and MCP server find your campaign root by:

1. `--root /path/to/campaign` (explicit)
2. `HUNCH_KIT_WORKSPACE` environment variable
3. Walking upward from `cwd` looking for `hunch_project.yaml`
4. Falling back to the current working directory
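The resolution order above can be sketched as a small helper (hypothetical; the real implementation lives inside hunch_kit):

```python
import os
from pathlib import Path


def resolve_campaign_root(explicit_root=None):
    """Resolve the campaign root in the documented precedence order."""
    # 1. An explicit --root flag wins.
    if explicit_root:
        return Path(explicit_root)
    # 2. The HUNCH_KIT_WORKSPACE environment variable.
    env_root = os.environ.get("HUNCH_KIT_WORKSPACE")
    if env_root:
        return Path(env_root)
    # 3. Walk upward from cwd looking for hunch_project.yaml.
    cwd = Path.cwd()
    for candidate in (cwd, *cwd.parents):
        if (candidate / "hunch_project.yaml").is_file():
            return candidate
    # 4. Fall back to the current working directory.
    return cwd
```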
A reference layout lives in `examples/campaign/`.
```python
from hunch_kit.providers.base import BaseProvider, ProviderResult


class MyProvider(BaseProvider):
    name = "my_backend"

    def run(self, input_text, config=None):
        output = your_pipeline(input_text, config)
        return ProviderResult(
            output=output,
            status="success",
            metadata={"model": "v2"},
        )
```

Campaign-local: drop the file in `<campaign>/providers/`. It's discovered by name at runtime and overrides any built-in with the same name.
Programmatic: register directly in application code:
```python
from hunch_kit.runner import register_provider

register_provider(MyProvider)
```

hunch_kit ships an MCP server for AI-assisted experiment management. It exposes three capability types:
| Type | Description |
|---|---|
| Tools | `init_experiment`, `create_rubric`, `run_experiment`, `score_experiment` |
| Resources | `experiment://list`, `experiment://lineage`, `rubric://list`, `rubric://{name}` |
| Prompts | `experiment_planning`, `rubric_construction`, `experiment_review` |
```json
{
  "mcpServers": {
    "hunch-kit": {
      "command": "python",
      "args": ["-m", "hunch_kit_mcp"],
      "env": {
        "HUNCH_KIT_WORKSPACE": "/path/to/your/campaign"
      }
    }
  }
}
```

```
┌─────────────────────────────────────────────────────────┐
│ CLI (hunch init / run / eval / list / lineage)          │
├─────────────────────────────────────────────────────────┤
│ MCP Server (tools / resources / prompts)                │
├─────────────────────────────────────────────────────────┤
│ Core Library                                            │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────────┐    │
│ │ manifest │ │ rubric   │ │ runner   │ │ providers │    │
│ └──────────┘ └──────────┘ └──────────┘ └───────────┘    │
├─────────────────────────────────────────────────────────┤
│ Evaluation                                              │
│ ┌──────────────────────┐ ┌───────────────────────────┐  │
│ │ Web UI (FastAPI)     │ │ LLM Judge (optional)      │  │
│ │ Side-by-side scoring │ │ Pluggable, bring your key │  │
│ └──────────────────────┘ └───────────────────────────┘  │
└─────────────────────────────────────────────────────────┘
```
- No external evaluation dependency — the evaluation UI and runner are built natively in Python. No Node.js, no Promptfoo, no managed service.
- Provider-agnostic — hunch_kit doesn't assume LLMs. Any input→output workflow can be a provider.
- Human scoring is first-class — automated scoring is optional. The web UI is a core feature, not an afterthought.
- Local-first — everything runs on your machine. No cloud, no accounts, no telemetry.
- Manifest-driven — `experiment.yaml` is the single source of truth for each experiment.
- Campaign-scoped — rubrics, experiments, and providers live under one root. Multiple studies never collide.
MIT — early prototype.