emartai/evalflow
> evalflow

pytest for LLMs

You changed one prompt.
Summarization improved.
Classification silently broke.
Nobody noticed for 4 days.

evalflow catches this in CI before it ships.

Install

pip install evalflow

Quick Start

evalflow init
evalflow eval

What you get on day one:

  • local prompt and dataset files
  • SQLite-backed run history in .evalflow/
  • CI-friendly exit codes
  • offline cache support for repeatable checks
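
As an illustration, a version-controlled prompt entry might look like this. The file name and field names below are hypothetical, shown only to convey the shape of YAML-versioned prompts; run `evalflow init` to see the actual scaffold:

```yaml
# prompts/summarize.yaml — hypothetical example, not the exact scaffold
id: summarize_short_article
version: 2
template: |
  Summarize the following article in two sentences:
  {article}
```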

Example Output

> evalflow eval

Running 5 test cases against gpt-4o-mini...

✓ summarize_short_article    0.91
✓ classify_sentiment         1.00
✓ extract_entities           0.87
✗ answer_with_context        0.61
✓ rewrite_formal             0.93

Quality Gate: FAIL
Failures: 1
Run ID: 20240315-a3f9c2d81b4e

Why evalflow

Traditional unit tests do not tell you when a prompt tweak quietly degrades a task. evalflow gives you a small local quality gate for prompt, model, and dataset changes.

Use it when you need to:

  • catch regressions before merge
  • compare runs locally
  • keep prompt versions in YAML
  • run the same gate in CI and on a laptop

GitHub Actions Workflow

# .github/workflows/evalflow.yml
name: LLM Quality Gate

on:
  pull_request:
    paths:
      - "prompts/**"
      - "evals/**"
      - "**.py"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install evalflow
      - run: evalflow eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Features

  • pytest-style exit codes: 0=pass, 1=fail, 2=error
  • exact match, embedding, consistency, and LLM judge methods
  • baseline snapshots catch regressions, not just low scores
  • prompt registry keeps prompts versioned in YAML
  • works with OpenAI, Anthropic, Groq, Gemini, and Ollama
  • local SQLite storage, no account needed
  • offline cache for repeated and CI-safe checks
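
Because the exit codes mirror pytest, a CI wrapper can branch on them directly. A minimal sketch — the helper functions below are hypothetical, only the 0/1/2 convention comes from evalflow:

```python
import subprocess

# evalflow's pytest-style exit codes mapped to human-readable statuses.
EXIT_STATUS = {0: "pass", 1: "eval failures", 2: "runner error"}

def interpret_exit(code: int) -> str:
    """Translate an evalflow exit code into a status string."""
    return EXIT_STATUS.get(code, "unknown")

def run_gate() -> int:
    """Run `evalflow eval` and report the gate outcome."""
    result = subprocess.run(["evalflow", "eval"])
    print(f"quality gate: {interpret_exit(result.returncode)}")
    return result.returncode

# In a CI script you would exit with the same code, e.g.:
#     sys.exit(run_gate())
```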

Command Surface

evalflow init
evalflow eval
evalflow doctor
evalflow runs
evalflow compare RUN_A RUN_B
evalflow prompt list

Documentation

Security

  • evalflow reads API keys from environment variables, never config files
  • evalflow.yaml stores env var names, not secret values
  • keep .env and .evalflow/ out of git
  • see docs/dev-doc/security.md for the full security model
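
The env-var indirection can be mirrored in your own tooling: the config names the variable, and the secret itself is resolved only at runtime. A sketch — the function and config key below are hypothetical, not part of evalflow's API:

```python
import os

def resolve_api_key(config: dict) -> str:
    """Look up the API key named in config.

    The config holds only the env var *name* (e.g. "OPENAI_API_KEY"),
    never the secret value itself.
    """
    var_name = config["api_key_env"]
    key = os.environ.get(var_name)
    if key is None:
        raise RuntimeError(f"environment variable {var_name} is not set")
    return key

# Demo only: the config says which variable to read, not the key itself.
os.environ["OPENAI_API_KEY"] = "sk-test-placeholder"
print(resolve_api_key({"api_key_env": "OPENAI_API_KEY"}))
```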

Reporting Security Issues

Please do not open public GitHub issues for security vulnerabilities. Open a private GitHub Security Advisory.

Examples

Development

See CONTRIBUTING.md for local setup, tests, smoke checks, and performance baselines.

License

MIT
