# Cobalt — Unit testing for AI Agents

Cobalt lets you write deterministic, repeatable tests for your LLM-powered agents and pipelines — the same way you'd write unit tests for regular code.

This is the Python port; the original TypeScript SDK lives at basalt-ai/cobalt.

## Features

- Dataset loaders — JSON, JSONL, CSV, remote URL, Langfuse, Langsmith, Braintrust, Basalt
- Three evaluator types — LLM-judge, custom function, semantic similarity
- Async-native runner — configurable concurrency + per-item timeout
- SQLite history — compare runs over time with `cobalt history` / `cobalt compare`
- Local dashboard — `cobalt ui` spins up a web UI with score charts, item drill-down, and run comparison
- CI-ready — declare score thresholds, get exit code 1 on regression
- Rich CLI — `cobalt run`, `cobalt init`, `cobalt history`, `cobalt compare`, `cobalt ui`, `cobalt clean`
- MCP server — `cobalt mcp` exposes 4 tools, 3 resources, and 3 prompts to Claude and other MCP clients (see the sketch after this list)
- Full docs — `docs/` matches the TypeScript SDK's structure and coverage
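
The MCP server can be wired into any MCP client. For Claude Desktop, for example, you would register the `cobalt mcp` command in the client's config file. A minimal sketch (the `cobalt mcp` command comes from this README; the `mcpServers` JSON shape is Claude Desktop's standard format for stdio servers):

```json
{
  "mcpServers": {
    "cobalt": {
      "command": "cobalt",
      "args": ["mcp"]
    }
  }
}
```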

## Installation

```bash
pip install cobalt-ai
```

For development / from source:

```bash
git clone https://github.com/basalt-ai/cobalt-python
cd cobalt-python
pip install -e ".[dev]"
```

## Quick start

```python
# my_agent.cobalt.py
import asyncio
from cobalt import Dataset, Evaluator, EvalContext, EvalResult, ExperimentResult, experiment

async def my_agent(question: str) -> str:
    # Replace with your real LLM call
    return "The answer is 42"

dataset = Dataset.from_items([
    {"input": "What is 6 × 7?", "expected_output": "42"},
    {"input": "What is the capital of France?", "expected_output": "Paris"},
])

def exact_match(ctx: EvalContext) -> EvalResult:
    expected = str(ctx.item.get("expected_output", ""))
    score = 1.0 if expected in str(ctx.output) else 0.0
    return EvalResult(score=score, reason=f"Expected: {expected}")

async def main():
    async def runner(ctx):
        # Call the agent, then wrap its output for the evaluators
        out = await my_agent(ctx.item["input"])
        return ExperimentResult(output=out)

    await experiment(
        "my-agent",
        dataset,
        runner=runner,
        evaluators=[
            Evaluator(name="exact-match", type="function", fn=exact_match),
        ],
    )

asyncio.run(main())
```

Run it:

```bash
cobalt run --file my_agent.cobalt.py
```

## Evaluators

### Custom function

```python
def my_check(ctx: EvalContext) -> EvalResult:
    return EvalResult(score=1.0 if "yes" in ctx.output.lower() else 0.0)

Evaluator(name="contains-yes", type="function", fn=my_check)
```
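
Function evaluators receive the whole `EvalContext`, so they can compare `ctx.output` against any field of the dataset item. A minimal sketch using only the `ctx.item` / `ctx.output` / `EvalResult` surface shown above (`numeric_match` and its regex logic are illustrative, not part of the SDK):

```python
import re

def numeric_match(ctx: EvalContext) -> EvalResult:
    # Pass when the first number in the output equals expected_output
    expected = str(ctx.item.get("expected_output", "")).strip()
    match = re.search(r"-?\d+(?:\.\d+)?", str(ctx.output))
    found = match.group() if match else None
    return EvalResult(
        score=1.0 if found == expected else 0.0,
        reason=f"expected {expected!r}, found {found!r}",
    )

Evaluator(name="numeric-match", type="function", fn=numeric_match)
```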
name="helpfulness",
type="llm-judge",
model="gpt-4o-mini", # or claude-3-5-haiku, etc.
scoring="boolean", # "boolean" (PASS/FAIL) or "scale" (0–1)
chain_of_thought=True,
prompt="""
You are evaluating an AI assistant's response.
Question: {{input}}
Response: {{output}}
Is the response helpful and accurate? Reply PASS or FAIL.
""",
)
```

The `{{input}}` and `{{output}}` placeholders are filled in from the dataset item and the runner's output by the `{{variable}}` template renderer (see `utils/template.py` in the project layout below).

### Semantic similarity

```python
Evaluator(
    name="semantic-similarity",
    type="similarity",
    field="expected_output",  # dataset field to compare against
    threshold=0.7,            # score = 1.0 if similarity >= threshold
)
```
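
The Python port scores similarity with TF-IDF cosine similarity (see `evaluators/similarity.py` in the project layout). Conceptually the metric behaves like this scikit-learn sketch; it is an outside illustration of the same idea, not cobalt's actual implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_sim(a: str, b: str) -> float:
    # Vectorize both texts over a shared vocabulary, then compare the rows
    m = TfidfVectorizer().fit_transform([a, b])
    return float(cosine_similarity(m[0], m[1])[0, 0])

# The evaluator would map a similarity >= 0.7 to score 1.0
tfidf_sim("Paris", "The capital of France is Paris")
```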

## Datasets

```python
# From Python
ds = Dataset.from_items([{"input": "hello", "expected": "world"}])

# From files
ds = Dataset.from_file("data.csv")  # csv / json / jsonl — auto-detected
ds = Dataset.from_jsonl("data.jsonl")
ds = Dataset.from_json("data.json")

# Remote
ds = await Dataset.from_remote("https://example.com/data.jsonl")

# Platforms
ds = await Dataset.from_langfuse("my-dataset")
ds = await Dataset.from_langsmith("my-dataset")
ds = await Dataset.from_braintrust("my-project", "my-dataset")
ds = await Dataset.from_basalt("dataset-id")

# Transformations (chainable)
ds = ds.filter(lambda item, i: item["score"] > 0.5)
ds = ds.map(lambda item, i: {**item, "idx": i})
ds = ds.sample(100)
ds = ds.slice(0, 50)
```
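
Each transformation returns a new `Dataset`, so the calls chain. A sketch under that assumption (the `data.csv` path and `score` field are placeholders):

```python
ds = (
    Dataset.from_file("data.csv")
    .filter(lambda item, i: item.get("score", 0) > 0.5)
    .sample(100)
)
```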

## Configuration

Create `cobalt.toml` in your project root (or run `cobalt init`):

```toml
[judge]
model = "gpt-4o-mini"
provider = "openai"
# api_key = "sk-..." # or set OPENAI_API_KEY env var

[experiment]
concurrency = 5
timeout = 30
test_dir = "./experiments"
```
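
The judge needs credentials for the configured provider. Rather than putting `api_key` in `cobalt.toml`, you can export the environment variable named in the comment above (the key value here is a placeholder):

```bash
export OPENAI_API_KEY="sk-..."  # used when cobalt.toml omits api_key
```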

## Dashboard

```bash
pip install 'cobalt-ai[dashboard]'
cobalt ui
# Opens http://localhost:4000
```

The local dashboard provides:

- Run history with colour-coded score pills
- Per-run score distribution chart (avg / p95 / min per evaluator)
- Item-level drill-down — input, output, evaluator reasons
- Side-by-side run comparison

## CLI

```bash
# Scaffold config + example experiment
cobalt init

# Run all *.cobalt.py files
cobalt run

# Run a specific file
cobalt run --file experiments/my-agent.cobalt.py

# CI mode — exit 1 if thresholds violated
cobalt run --ci

# List recent runs
cobalt history --limit 20

# Compare two runs
cobalt compare <run-id-1> <run-id-2>

# Local web dashboard
cobalt ui --port 4000

# Delete all stored results
cobalt clean
```

## CI thresholds

```python
from cobalt.types import ThresholdConfig, ThresholdMetric

thresholds = ThresholdConfig(
    evaluators={
        "exact-match": ThresholdMetric(avg=0.9, p95=0.7),
        "helpfulness": ThresholdMetric(avg=0.8),
    }
)

report = await experiment(
"my-agent", dataset, runner,
evaluators=[...],
thresholds=thresholds,
)
```
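
To make the gate concrete: `ThresholdMetric(avg=0.9)` on `exact-match` means the mean of that evaluator's scores across all items must stay at or above 0.9, or `cobalt run --ci` exits with code 1. An illustration of the semantics, assuming `avg` is a plain mean (this is not cobalt's actual code):

```python
scores = [1.0, 1.0, 0.0, 1.0]    # per-item exact-match scores
avg = sum(scores) / len(scores)  # 0.75
ci_fails = avg < 0.9             # True -> non-zero exit in --ci mode
```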

In CI, run with `--ci` so threshold violations fail the build:

```yaml
# .github/workflows/eval.yml
- name: Run evaluations
  run: cobalt run --ci
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

## Project structure

```
src/cobalt/
├── __init__.py        # Public API surface
├── types.py           # All dataclasses
├── config.py          # cobalt.toml loader
├── dataset.py         # Dataset class
├── evaluator.py       # Evaluator + registry
├── experiment.py      # Core runner
├── evaluators/
│   ├── function.py    # Custom function evaluator
│   ├── llm_judge.py   # LLM-judge evaluator
│   └── similarity.py  # TF-IDF cosine similarity
├── storage/
│   ├── db.py          # SQLite history
│   └── results.py     # JSON result files
├── utils/
│   ├── stats.py       # Descriptive statistics
│   ├── template.py    # {{variable}} rendering
│   └── cost.py        # Token cost estimation
└── cli/
    └── main.py        # cobalt CLI (Typer)
```

## Development

```bash
# Install
pip install -e ".[dev]"

# Test
pytest tests/ -v

# Lint
ruff check src/ tests/
```

## Feature parity with the TypeScript SDK

| Feature | TypeScript | Python |
|---|---|---|
| Dataset loaders | ✅ | ✅ |
| LLM judge | ✅ | ✅ |
| Function evaluator | ✅ | ✅ |
| Similarity | ✅ | ✅ (TF-IDF) |
| CLI | ✅ | ✅ |
| History / compare | ✅ | ✅ |
| SQLite storage | ✅ | ✅ |
| CI thresholds | ✅ | ✅ |
| Local dashboard | ✅ | ✅ (cobalt ui) |
| MCP integration | ✅ | ✅ (cobalt mcp) |
| Platform integrations | Langfuse, Langsmith, Braintrust, Basalt | ✅ same |

Python conventions used throughout: `async`/`await`, dataclasses, `asyncio.Semaphore`, `typer`, and `rich`.

## License

MIT — see LICENSE.

Built by Basalt AI.