Skip to content

decimozs/regtrace

Regtrace Banner

CI Latest Release License: MIT

LLM quality degrades silently. A prompt change, a model update, a new feature — any of it can quietly break outputs that used to work.

Regtrace gives you golden sets, multi-dimensional scoring, and baseline comparison so you catch drift before your users do.

Quick Start

# Install
curl -L https://github.com/decimozs/regtrace/releases/latest/download/regtrace-linux-x64 -o regtrace
chmod +x regtrace
sudo mv regtrace /usr/local/bin/

# Create project, fill outputs, run
regtrace init
regtrace run --generate

Deterministic metrics (format checks, word overlap) work without an API key. LLM-judged metrics (factuality deep, tone) need one — set via .env or env vars.

Platform Install
Linux x64 curl -LO .../regtrace-linux-x64 && chmod +x regtrace && sudo mv regtrace /usr/local/bin/
macOS ARM64 curl -LO .../regtrace-darwin-arm64 && chmod +x regtrace && sudo mv regtrace /usr/local/bin/
Windows curl -LO .../regtrace-windows-x64.exe — move to %PATH%

Already installed? regtrace upgrade

Why Regtrace?

Tool Interface Deployment Regression Config
Promptfoo CLI + Web UI Node.js + cloud Manual diff JS/TS, YAML
Braintrust Web UI + SDK Cloud/SaaS Experiment tracking Python SDK
LangSmith Monitoring, traces, eval Cloud/SaaS Platform-level Python/JS SDK
DeepEval Library Python lib Pytest plugin Python decorators
RAGAS RAG-specific eval Python lib No built-in Python API
Regtrace CLI-first Standalone binary Automatic, always-on, gates CI Declarative YAML

CLI-first. Evaluation should be a version-controlled pipeline step, not a dashboard you log into.

Zero deps. Standalone binary — no Python, Node, or Docker.

Always-on regression. Every run compared to baseline. CI gates fail on drift.

No vendor lock-in. Pluggable judge providers: Anthropic, OpenAI, Groq, Gemini, Ollama.

CI Integration

Add to .github/workflows/regtrace.yml:

name: LLM Quality Gate
on: [pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download regtrace
        run: |
          curl -L https://github.com/decimozs/regtrace/releases/latest/download/regtrace-linux-x64 -o /usr/local/bin/regtrace
          chmod +x /usr/local/bin/regtrace
      - name: Evaluate
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          regtrace run --format json --output results.json
          # 0 = pass, 1 = gate failure, 2 = config error

Key Features

  • 4 quality gates — suite score, max failures, regression status, NFR (latency / cost / coverage). AND-composed: all must pass.
  • Factuality — LLM-as-judge (shallow heuristic or deep LLM). Auto-detects JSON output and runs structural comparison.
  • Format — deterministic checks: JSON schema, required fields, regex, length, markdown, forbidden content.
  • Tone — profile across formality, sentiment, assertiveness, persona, verbosity. Deterministic fallback.
  • Regression detection — compare against baselines with per-metric tolerance. Stale baseline warnings.
  • Branch-aware baselines — per-branch baselines with fallback chain.
  • NFR enforcement — gate on max_latency_ms, max_cost_usd, min_coverage.
  • Generate mode--generate auto-fills outputs from LLM, skipping manual golden set authoring.
  • Parallel evaluation — configurable concurrency (default 4).
  • Fallback judge — secondary provider with exponential backoff + jitter.
  • Watch moderegtrace watch re-runs on file changes.

Commands

Command Description
regtrace init Scaffold a new project
regtrace run Evaluate all golden sets
regtrace run --generate Auto-fill null outputs from LLM, then evaluate
regtrace run --dry-run Validate config without spending tokens
regtrace list List recent run history
regtrace history --run-id <id> Show detailed run results
regtrace history --diff <a> [b] Compare two runs
regtrace baseline pin <run-id> Pin regression baseline
regtrace scaffold Create golden sets from existing run records or output files
regtrace watch Re-run on golden set changes
regtrace upgrade Update to the latest release

All commands support --config for custom config paths.

Security & Privacy

  • No telemetry. Regtrace never phones home.
  • Data stays local. LLM-judged metrics send response text to your configured provider. All other metrics run locally.
  • API keys from .env or env vars — never stored in config or golden set files.
  • Self-contained binary. No runtime dependencies, no npm install.

Roadmap

See ROADMAP.md for planned features: cost tracking, offline evaluation, SARIF output, Homebrew/npm/apt distribution.

Agent Skill

skills/regtrace/ teaches AI coding agents how to use regtrace — golden sets, evaluation, regression, CI integration.

Contributing

See CONTRIBUTING.md for setup, build, test, and release process.

License

MIT.

About

A regression-first evaluation framework for LLM outputs. Define golden sets, score outputs across multiple dimensions, and catch quality degradation before it ships.

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages