pytest-eval

LLM testing for humans.


Bring LLM evaluation into your existing pytest workflow.
No custom test runners. No new concepts. Just pytest.


Install

pip install pytest-eval

Quick Start

# No imports needed. The ai fixture IS the API.

def test_chatbot(ai):
    response = my_chatbot("What is the capital of France?")
    assert ai.similar(response, "Paris is the capital of France")

Run it with plain pytest:

pytest -v

tests/test_chatbot.py::test_chatbot PASSED
    ✓ similar       ███████████████  0.94  ≥0.80

  ──────────────────────────────────────────────────────
  pytest-eval                                     v0.1.0
  ──────────────────────────────────────────────────────
    Test           Result Score                    Cost
  ──────────────────────────────────────────────────────
    test_chatbot     ✓   ██████████████░  0.94      $0
  ──────────────────────────────────────────────────────
    1 tests  │  1 passed  │  $0.0000 total
  ──────────────────────────────────────────────────────

That's it. No LLMTestCase objects, no custom runner, no cloud dashboard.

Why pytest-eval?

                  DeepEval                               pytest-eval
  Basic test      ~15 lines, 4 imports                   ~3 lines, 0 imports
  Test runner     deepeval test run                      pytest
  Metrics         50+ to learn                           ~10 methods on one fixture
  Dependencies    30+ (OpenTelemetry, gRPC, Sentry...)   4 core
  Telemetry       Cloud dashboard by default             None. Fully local.

Methods

  Method                              What it does                           Cost
  ai.similar(a, b, threshold=0.8)     Semantic similarity check              Free (local)
  ai.similarity_score(a, b)           Returns similarity float 0–1           Free (local)
  ai.judge(text, criteria)            LLM evaluates against criteria         $
  ai.grounded(response, context)      RAG faithfulness check                 $
  ai.relevant(response, query)        Answer relevancy                       $
  ai.hallucinated(response, context)  Detect unsupported claims              $
  ai.toxic(text)                      Toxicity detection                     Free
  ai.biased(text)                     Bias detection                         Free
  ai.valid_json(text, schema=None)    JSON validation + Pydantic parsing     Free
  ai.assert_snapshot(value, name)     Regression testing vs saved baseline   Free (local)
  ai.metric(name, text, **kw)         Run a custom registered metric         Varies
  ai.cost                             Cumulative $ for this test
  ai.latency                          Cumulative seconds for this test

Free methods use local models (sentence-transformers) and need no API key. $ methods call an LLM API (OpenAI by default) and require OPENAI_API_KEY.
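
For instance, similarity_score gives you the raw number instead of a pass/fail, and the cost and latency properties accumulate as $ methods run. A quick sketch (my_chatbot stands in for your own function under test, and the numeric bounds are illustrative):

def test_score_and_spend(ai):
    response = my_chatbot("What is the capital of France?")

    # similarity_score returns the raw float rather than asserting
    score = ai.similarity_score(response, "Paris is the capital of France")
    assert score > 0.8

    # judge() calls the configured LLM, so cost and latency start accumulating
    assert ai.judge(response, "Answers the question directly")
    assert ai.cost < 0.01     # cumulative $ spent by this test so far (illustrative cap)
    assert ai.latency > 0.0   # cumulative seconds spent on LLM calls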

Examples

Semantic Similarity (free, local)

def test_capital(ai):
    response = my_chatbot("What is the capital of France?")
    assert ai.similar(response, "Paris is the capital of France")

LLM-as-Judge

def test_tone(ai):
    response = my_chatbot("I want to cancel my subscription")
    assert ai.judge(response, "Response is polite and offers help")

Structured Output

from pydantic import BaseModel

class City(BaseModel):
    name: str
    country: str

def test_structured(ai):
    response = my_llm("Give me Paris info as JSON")
    city = ai.valid_json(response, City)
    assert city.country == "France"

RAG Pipeline

def test_rag(ai):
    query = "What is our refund policy?"
    docs = retriever.get_relevant_docs(query)
    response = generator.generate(query, docs)

    assert ai.grounded(response, docs)
    assert ai.relevant(response, query)
    assert not ai.hallucinated(response, docs)

Snapshot Regression

def test_regression(ai):
    response = my_chatbot("What are your business hours?")
    ai.assert_snapshot(response, name="business_hours", threshold=0.85)

First run saves a baseline; subsequent runs compare against it. When an output change is intentional, update the baselines:

pytest --snapshot-update

Multi-Model Comparison

import pytest

@pytest.mark.parametrize("model", ["gpt-4o", "claude-sonnet-4-20250514", "llama-3.1-8b"])
def test_accuracy(ai, model):
    response = call_llm(model=model, prompt="What is 2+2?")
    assert ai.similar(response, "4")
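
Safety Checks (free)

A minimal sketch of the toxicity and bias detectors (assumes the [safety] extra described under Providers is installed, and that both methods return truthy when a problem is detected, mirroring the hallucinated example above; my_chatbot is your own function):

def test_safety(ai):
    response = my_chatbot("Tell me about your competitors")
    # local detoxify-based checks; no API key required
    assert not ai.toxic(response)
    assert not ai.biased(response)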

Custom Metrics

from pytest_eval import Metric, MetricResult

@Metric.register("brand_voice")
def brand_voice(text: str, **kwargs) -> MetricResult:
    formal = sum(1 for w in ["please", "thank you"] if w in text.lower())
    score = min(formal / 2, 1.0)
    return MetricResult(score=score, passed=score >= kwargs.get("threshold", 0.5))

def test_brand(ai):
    response = my_chatbot("I need help with my order")
    assert ai.metric("brand_voice", response, threshold=0.7)

Configuration

pyproject.toml

[tool.pytest.ini_options]
ai_provider = "openai"
ai_model = "gpt-4o-mini"
ai_embedding_model = "local"
ai_threshold = 0.8
ai_budget = 5.00
ai_snapshot_dir = ".pytest_eval_snapshots"
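
ai_threshold above is the session-wide default for similarity checks; a single assertion can still pass an explicit threshold (the argument listed in the Methods table). A quick sketch, assuming the per-call value applies only to that assertion:

def test_strict_match(ai):
    response = my_chatbot("What is the capital of France?")
    # explicit threshold for this one assertion only
    assert ai.similar(response, "Paris is the capital of France", threshold=0.9)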

Environment Variables

OPENAI_API_KEY=sk-...
PYTEST_EVAL_PROVIDER=openai
PYTEST_EVAL_MODEL=gpt-4o-mini
PYTEST_EVAL_BUDGET=5.00

CLI Options

pytest --ai-provider=openai    # Provider
pytest --ai-model=gpt-4o       # Model
pytest --ai-threshold=0.9      # Similarity threshold
pytest --ai-budget=2.00        # Cap spending per run
pytest --ai-report=report.json # JSON report output
pytest --ai-verbose            # Show scores for passing tests
pytest --snapshot-update       # Update snapshot baselines
pytest -m ai                   # Run only @pytest.mark.ai tests
pytest -m "not cost_high"      # Skip expensive tests

Precedence: CLI > env vars > pyproject.toml > defaults
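
The -m ai selector above expects tests to carry the corresponding marker. A minimal sketch (my_chatbot stands in for your own function under test):

import pytest

@pytest.mark.ai
def test_marked_as_ai(ai):
    response = my_chatbot("Hello")
    # selected by `pytest -m ai`, excluded by `pytest -m "not ai"`
    assert ai.similar(response, "Hi there")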

Providers

pytest-eval supports multiple LLM providers:

pip install 'pytest-eval[openai]'     # OpenAI (default)
pip install 'pytest-eval[anthropic]'  # Anthropic
pip install 'pytest-eval[litellm]'    # 100+ providers via LiteLLM
pip install 'pytest-eval[safety]'     # Toxicity/bias detection (detoxify)
pip install 'pytest-eval[all]'        # Everything

Local embeddings (sentence-transformers) are always included — no API key needed for similar(), similarity_score(), and assert_snapshot().
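
For example, to route the $ methods through Anthropic instead of the default OpenAI (a sketch combining the install extra and CLI flags above; the model name is the one from the multi-model example, and your Anthropic credentials are assumed to be available in the environment):

pip install 'pytest-eval[anthropic]'
pytest --ai-provider=anthropic --ai-model=claude-sonnet-4-20250514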

Rich Failure Messages

Every assertion failure explains what happened:

AssertionError: Semantic similarity below threshold
  actual:     "The capital of France is Lyon"
  expected:   "The capital of France is Paris"
  similarity: 0.72
  threshold:  0.85
  reason:     Texts differ on the key fact (Lyon vs Paris)

TUI Output

pytest-eval renders score bars and a summary table directly in your terminal:

  • Per-test metric detail lines (with -v or --ai-verbose)
  • Session summary table with visual score bars
  • Cost tracking per test and per session

Contributing

See CONTRIBUTING.md.

License

MIT
