reval

reval correlates your Langfuse eval sessions with your git history and uses a multi-agent LLM pipeline to pinpoint which code changes caused which metric regressions. It produces a report with explanations, evidence, and suggested fixes.

Installation

From PyPI:

pip install reval-cli

From source:

git clone https://github.com/calebevans/reval.git
cd reval
pip install .

For development (includes pytest, mypy, ruff, pre-commit):

pip install ".[dev]"

Requires Python 3.10+.

Quick Start

  1. Generate a starter config:

     reval init

  2. Set your Langfuse credentials (or add them to reval.yaml):

     export LANGFUSE_BASE_URL="https://cloud.langfuse.com"
     export LANGFUSE_PUBLIC_KEY="pk-..."
     export LANGFUSE_SECRET_KEY="sk-..."

  3. Run an analysis against a Langfuse eval session:

     reval analyze --eval-results <session-id>

  4. To compare two sessions (current vs. baseline) and correlate regressions with code changes:

     reval analyze \
       --eval-results <current-session-id> \
       --eval-baseline <baseline-session-id> \
       --base main

Configuration

reval is configured through a reval.yaml file in your project root. Every field has a sensible default, so the file is optional for simple use cases.

langfuse:
  api_url: https://cloud.langfuse.com
  public_key: pk-...
  secret_key: sk-...
  project_id: ""                  # auto-detected if omitted
  current_session_id: ""          # or use --eval-results
  baseline_session_id: ""         # or use --eval-baseline
  publish: false                  # post results back to Langfuse

metrics:
  - name: answer_relevancy
    threshold: 0.05               # flag if score drops by more than this
  - name: faithfulness
    threshold: 0.05

relevance:
  include_patterns: []            # empty = include all non-ignored files
  ignore_patterns:
    - "**/tests/**"
    - "**/__pycache__/**"
    - "*.md"
    - "*.lock"
  category_mappings:
    prompt:
      - "**/prompts/**"
      - "**/*.prompt"
    model_config:
      - "**/config/model*"
      - "**/*llm_config*"
    retrieval:
      - "**/retrieval/**"
      - "**/rag/**"
    tool_definition:
      - "**/tools/**"
      - "**/functions/**"
    output_parsing:
      - "**/parsers/**"
      - "**/schema*"
    eval_config:
      - "**/eval*"

llm:
  model: openai/gpt-4o            # any LiteLLM model identifier
  temperature: 0.2
  max_tokens: 4096
  context_window: null             # override the model's default context window
  diff_model: null                 # use a different model for diff analysis
  eval_model: null                 # use a different model for eval analysis
  synthesis_model: null            # use a different model for synthesis

git:
  base: HEAD                       # base commit ref
  head: working                    # "working" = uncommitted changes

Configuration Sections

langfuse - Connection settings for your Langfuse instance. Credentials can also be set through environment variables (see below). Set publish: true to write analysis results back to Langfuse as comments.

metrics - List of metric names and their regression thresholds. A metric is flagged as regressed when current_score - baseline_score falls below -threshold, i.e. when the score drops by more than the threshold. Each threshold defaults to 0.05 if not specified.
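The threshold check above can be sketched in a few lines of Python (a minimal illustration of the documented rule, not reval's actual implementation; the function name is hypothetical):

```python
def is_regressed(current_score: float, baseline_score: float,
                 threshold: float = 0.05) -> bool:
    """Flag a metric as regressed when its score drops by more than the threshold."""
    return (current_score - baseline_score) < -threshold

# A 0.06 drop exceeds the default 0.05 threshold:
print(is_regressed(0.78, 0.84))  # True
# A 0.03 drop does not:
print(is_regressed(0.81, 0.84))  # False
```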

relevance - Controls which files from the git diff are included in analysis. Files matching ignore_patterns are excluded. If include_patterns is non-empty, only files matching at least one include pattern (and no ignore pattern) are kept. The category_mappings section maps glob patterns to semantic categories (prompt, model_config, retrieval, etc.) so the analysis agents understand the role of each changed file.
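The include/ignore logic described above can be approximated as follows (an illustrative sketch using Python's fnmatch, whose `*` semantics differ slightly from full gitignore-style globbing; the function name is hypothetical):

```python
from fnmatch import fnmatch

def is_relevant(path: str, include_patterns: list[str],
                ignore_patterns: list[str]) -> bool:
    """Ignore patterns always exclude; a non-empty include list
    additionally requires at least one matching include pattern."""
    if any(fnmatch(path, p) for p in ignore_patterns):
        return False
    if include_patterns:
        return any(fnmatch(path, p) for p in include_patterns)
    return True  # empty include list = include all non-ignored files

ignore = ["**/tests/**", "**/__pycache__/**", "*.md", "*.lock"]
print(is_relevant("src/prompts/system.prompt", [], ignore))  # True
print(is_relevant("src/tests/test_rag.py", [], ignore))      # False
print(is_relevant("README.md", [], ignore))                  # False
```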

llm - Model configuration. The model field accepts any LiteLLM model identifier (e.g. openai/gpt-4o, anthropic/claude-sonnet-4-20250514, vertex_ai/gemini-2.0-flash). You can assign different models to each analysis agent using diff_model, eval_model, and synthesis_model.

git - The commit refs to diff. Set head to working to diff uncommitted changes against base, or set both to commit SHAs/branch names.
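As a rough sketch of how these two refs translate into a diff invocation (purely illustrative; the actual command reval constructs may differ):

```python
def diff_args(base: str, head: str) -> list[str]:
    """Build git diff arguments: head="working" diffs uncommitted
    changes against base; otherwise diff the two refs."""
    if head == "working":
        return ["git", "diff", base]
    return ["git", "diff", f"{base}..{head}"]

print(diff_args("HEAD", "working"))  # ['git', 'diff', 'HEAD']
print(diff_args("main", "feature"))  # ['git', 'diff', 'main..feature']
```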

Environment Variables

Langfuse credentials can be provided through environment variables instead of (or in addition to) reval.yaml. Environment variables take precedence when the corresponding config field is left empty.

| Variable | Config equivalent | Description |
| --- | --- | --- |
| LANGFUSE_BASE_URL | langfuse.api_url | Langfuse API URL |
| LANGFUSE_PUBLIC_KEY | langfuse.public_key | Langfuse public key |
| LANGFUSE_SECRET_KEY | langfuse.secret_key | Langfuse secret key |
| LANGFUSE_PROJECT_ID | langfuse.project_id | Langfuse project ID (auto-detected if omitted) |
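The documented precedence (environment variable used only when the config field is empty) amounts to a simple fallback, sketched here with a hypothetical helper:

```python
import os

def resolve(config_value: str, env_var: str) -> str:
    """A non-empty config value wins; otherwise fall back to the environment."""
    return config_value or os.environ.get(env_var, "")

os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-from-env"
print(resolve("", "LANGFUSE_PUBLIC_KEY"))              # pk-from-env
print(resolve("pk-from-yaml", "LANGFUSE_PUBLIC_KEY"))  # pk-from-yaml
```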

CLI Reference

reval init

Generate a starter reval.yaml with interactive prompts.

reval init [--output PATH]

| Option | Default | Description |
| --- | --- | --- |
| --output | reval.yaml | Path for the generated config file |

reval analyze

Run the analysis pipeline. This is the main command.

reval analyze [OPTIONS]

| Option | Default | Description |
| --- | --- | --- |
| --eval-results | (required) | Langfuse session ID for the current eval run |
| --eval-baseline | (none) | Langfuse session ID for the baseline run (omit for single-session mode) |
| --base | From config or HEAD | Base commit ref |
| --head | From config or working | Head ref (working for uncommitted changes) |
| --config | reval.yaml | Path to config file |
| --output | terminal | Output format: terminal, json, or markdown |
| --output-file | (stdout) | Write the report to a file instead of stdout |
| --threshold | 0.05 | Global regression threshold (overrides per-metric config) |
| --model | From config | LLM model to use (overrides config) |
| --publish / --no-publish | From config | Publish results back to Langfuse |
| --verbose | false | Show debug information |

reval report

Re-render a previously saved JSON report in a different format.

reval report REPORT_FILE [OPTIONS]

| Option | Default | Description |
| --- | --- | --- |
| --output | terminal | Output format: terminal, json, or markdown |
| --output-file | (stdout) | Write the report to a file instead of stdout |

Example: save a JSON report, then render it as markdown later:

reval analyze --eval-results sess-123 --output json --output-file report.json
reval report report.json --output markdown

Analysis Modes

Compare mode

Activated when you provide both --eval-results and --eval-baseline. reval fetches both sessions from Langfuse, diffs the git history between --base and --head, and runs three agents:

  1. Diff agent examines code changes in isolation and forms hypotheses about their potential eval impact.
  2. Eval agent investigates each regressed test case by comparing outputs, scores, and evaluator reasoning between current and baseline runs.
  3. Synthesis agent correlates the diff and eval findings into a final report with explanations and suggested fixes.

Single-session mode

Activated when you omit --eval-baseline. reval analyzes a single eval session without a baseline comparison. It loads source files matching your relevance patterns, runs the eval agent on any test cases that fall below threshold, and produces findings about what may be going wrong.

Output Formats

| Format | Flag | Description |
| --- | --- | --- |
| Terminal | --output terminal | Rich tables and panels with color-coded diffs (default) |
| JSON | --output json | Machine-readable output; can be re-rendered with reval report |
| Markdown | --output markdown | Tables and fenced diff blocks, suitable for PRs or documentation |

All formats can be written to a file with --output-file PATH.

Publishing to Langfuse

When --publish is passed (or langfuse.publish is set to true in config), reval posts its analysis results back to Langfuse:

  • A session comment with the full markdown report is added to the current session.
  • A trace comment with relevant findings is added to each failed trace.

This makes it easy to review reval's analysis directly in the Langfuse UI alongside your eval results.

About

re-evaluate eval regressions
