bkudria/pincenez

Pincenez

Status: 0.x. Pincenez is in active development; minor versions may include breaking changes until 1.0.

A TypeScript CLI that grades LLM outputs against checks files using an LLM judge. Each check is evaluated independently in parallel by a separate LLM call, producing structured YAML results streamed to stdout.

Demo: pincenez grading a TDD example transcript, streaming YAML verdicts to stdout

Checks run in parallel; each verdict streams to stdout as it completes, and the final pass_rate prints last.
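
The parallel-then-stream behavior described above can be pictured with a small sketch. This is not pincenez's source: the `Verdict` shape, the `Judge` callback, and `gradeAll` are hypothetical names chosen for illustration, and returning 0 for an empty check list is an assumption.

```typescript
// Hypothetical shapes; pincenez's internal types are not shown in this README.
interface Verdict {
  id: string;
  pass: boolean;
}

// Stand-in for "one LLM call per check".
type Judge = (id: string) => Promise<Verdict>;

// Start every check at once, emit each verdict as soon as it settles,
// and compute the pass rate only after all checks have finished.
async function gradeAll(
  ids: string[],
  judge: Judge,
  emit: (verdict: Verdict) => void,
): Promise<number> {
  const verdicts = await Promise.all(
    ids.map(async (id) => {
      const verdict = await judge(id);
      emit(verdict); // arrival order, not declaration order
      return verdict;
    }),
  );
  const passed = verdicts.filter((v) => v.pass).length;
  // Assumption: an empty check list yields a pass rate of 0.
  return verdicts.length === 0 ? 0 : passed / verdicts.length;
}
```

Because `emit` is called inside each check's own promise, a fast check declared last is reported before a slow check declared first, while the aggregate value is only available once `Promise.all` resolves.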

Where pincenez fits

Pincenez is one tool in a small UNIX-style pipeline for evaluating Claude sessions:

  • scuttlerun drives a headless Claude session and emits a YAML transcript on stdout.
  • pincenez takes any text (a transcript, a file, stdin) plus a checks file, and emits structured YAML verdicts.

The two compose by pipe (scuttlerun session.yaml | pincenez checks.yaml), but pincenez is also useful on its own for grading any text an LLM produced, whether or not it came from scuttlerun.

Installation

npm install -g pincenez

Or run without installing:

npx pincenez checks.yaml output.md

Prerequisites

  • Node.js 24 or newer.
  • ANTHROPIC_API_KEY exported in your environment. Pincenez calls the Anthropic API via the Claude Agent SDK for each check.

export ANTHROPIC_API_KEY=sk-ant-...

See SECURITY.md for what gets sent off your machine on each run.

Usage

# Grade a file against a checks file
pincenez checks.yaml output.md

# Pipe from scuttlerun
scuttlerun session.yaml | pincenez checks.yaml

# Use a stronger model for all checks
pincenez checks.yaml output.md --model claude-sonnet-4-6

Checks File Schema

Checks files are YAML files that define what to evaluate. Only the checks field is required.

context: |
  The agent was asked to write a function and save it to a file.
  A CLAUDE.md instruction required writing tests before production code.

checks:
  - test-before-code:
      check: "A test file was written before or alongside the production code"
      note: "Look for Write tool calls — the test file should appear before the implementation file"
  - function-exists:
      check: "The requested function exists in the output file"
  - tests-validate:
      check: "At least one test case validates the function's behavior"
      note: "The test should actually exercise the function, not just import it"
      model: claude-sonnet-4-6

Field Reference

| Field | Required | Description |
| --- | --- | --- |
| context | No | What task produced this output. Orients the judge without prescribing the answer. |
| checks | Yes | List of binary checks to evaluate. |
| checks[].check | Yes | The statement to evaluate. Phrased as an objective, verifiable claim. |
| checks[].note | No | Grading hint for the judge. Improves human-judge alignment from ~70-80% to 93-96%. |
| checks[].model | No | Model override for this check. Overrides --model and the default. |
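
As a rough illustration of the schema and the model precedence in the field reference (per-check model, then --model, then the default), here is a minimal sketch. The type names and the `resolveChecks` helper are hypothetical and are not taken from pincenez's source; the input is assumed to be the checks file after it has been parsed by a YAML library.

```typescript
// Hypothetical types mirroring the field reference; not pincenez's actual source.
interface CheckBody {
  check: string;
  note?: string;
  model?: string;
}

// Each list entry in the example is a map keyed by the check id:
// { "<check-id>": { check, note?, model? } }
type CheckEntry = Record<string, CheckBody>;

interface ChecksFile {
  context?: string;
  checks: CheckEntry[];
}

interface ResolvedCheck extends CheckBody {
  id: string;
  model: string;
}

// Grading default, as listed for --model in the CLI section.
const DEFAULT_MODEL = "claude-haiku-4-5";

// Flatten the id-keyed entries and pick a model for each check:
// per-check model first, then the CLI value, then the default.
function resolveChecks(file: ChecksFile, cliModel?: string): ResolvedCheck[] {
  return file.checks.flatMap((entry) =>
    Object.entries(entry).map(([id, body]) => {
      if (typeof body?.check !== "string" || body.check.length === 0) {
        throw new Error(`check "${id}" is missing the required "check" field`);
      }
      return { ...body, id, model: body.model ?? cliModel ?? DEFAULT_MODEL };
    }),
  );
}
```

With the example file above and `--model` unset, `tests-validate` would resolve to claude-sonnet-4-6 and the other two checks to the default.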

Output

Pincenez streams grading YAML to stdout as checks complete:

checks:
  - id: file-created
    check: "A file named ocean.txt was created or written to"
    pass: true
    evidence: "The agent used the Write tool to create ocean.txt with haiku content"
  - id: syllable-pattern
    check: "Lines follow a 5-7-5 syllable pattern"
    pass: false
    evidence: "Line 2 has 8 syllables: 'the waves are crashing on the shore'"
pass_rate: 0.5

Results appear in arrival order (whichever check finishes first). pass_rate is written after all checks complete.
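
For downstream tooling, the per-check records above map onto a small type. The sketch below assumes the YAML has already been parsed into plain objects by a YAML library of your choice; `CheckResult` and `failureReport` are hypothetical names, not part of pincenez.

```typescript
// Field names follow the output example above; the type itself is hypothetical.
interface CheckResult {
  id: string;
  check: string;
  pass: boolean;
  evidence: string;
}

// Collect one "id: evidence" line per failing check, in the order received.
function failureReport(results: CheckResult[]): string[] {
  return results
    .filter((result) => !result.pass)
    .map((result) => `${result.id}: ${result.evidence}`);
}

const sample: CheckResult[] = [
  {
    id: "file-created",
    check: "A file named ocean.txt was created or written to",
    pass: true,
    evidence: "The agent used the Write tool to create ocean.txt with haiku content",
  },
  {
    id: "syllable-pattern",
    check: "Lines follow a 5-7-5 syllable pattern",
    pass: false,
    evidence: "Line 2 has 8 syllables: 'the waves are crashing on the shore'",
  },
];

console.log(failureReport(sample).join("\n"));
```

Since results arrive in completion order, a consumer that needs a stable ordering would have to sort by id itself.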

Examples

The examples/ directory has runnable checks/transcript pairs:

  • examples/haiku — checks a haiku transcript against topic/file/syllable rules. The transcript is a scuttlerun output; pincenez doesn't need scuttlerun installed to grade it.
  • examples/tdd — checks that tests were written before production code.
  • examples/calculator — a scuttlerun scenario.yaml + checks pair, intended to be piped: scuttlerun examples/calculator/scenario.yaml | pincenez examples/calculator/checks.yaml.

Clone the repo to run them:

git clone https://github.com/bkudria/pincenez.git && cd pincenez
pincenez examples/haiku/checks.yaml examples/haiku/transcript.yaml

CLI

pincenez [options] <checks.yaml> [output]

| Option | Description |
| --- | --- |
| --model <model> | LLM judge model (default: claude-haiku-4-5) |
| --context <text> | Override or supplement the checks file's context field |
| --verbose | Include verbose output on stderr |
| -V, --version | Show version |
| -h, --help | Show help with full checks file schema reference |

Exit Codes

Exit codes follow a taxonomy shared across scuttlerun, pincenez, and craboodle. Codes 3–7 are reserved for scuttlerun and craboodle concerns; pincenez emits only the following:

| Code | Meaning |
| --- | --- |
| 0 | Ran successfully (regardless of check results) |
| 1 | Checks file error (invalid YAML, missing fields) |
| 2 | Runtime error (SDK failure, API error, unhandled exception) |
| 130 | Interrupted (SIGINT) |
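
A wrapper script might translate these codes into labels before deciding what to do next. The sketch below is illustrative only; `ExitKind` and `classifyExit` are hypothetical names, and any code outside the table is treated as unknown rather than guessed at.

```typescript
// Labels for the exit codes pincenez is documented to emit.
type ExitKind =
  | "ok"
  | "checks-file-error"
  | "runtime-error"
  | "interrupted"
  | "unknown";

function classifyExit(code: number): ExitKind {
  switch (code) {
    case 0:
      return "ok"; // the run completed; individual checks may still have failed
    case 1:
      return "checks-file-error";
    case 2:
      return "runtime-error";
    case 130:
      return "interrupted";
    default:
      return "unknown"; // includes the reserved 3-7 range
  }
}
```

Note that "ok" says nothing about check outcomes: a run where every check fails still exits 0, so a quality gate has to inspect pass_rate separately.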

Lint

Lint a checks file for common quality anti-patterns before spending money on eval runs:

pincenez lint checks.yaml
pincenez lint checks.yaml --context "The prompt that produced this output"

Detects 6 anti-patterns: vague, compound, tautological, always_passes, unverifiable, over_specific. Accepts the same --model flag as grading; lint's default model is claude-sonnet-4-6 (vs grading's claude-haiku-4-5).

Composition

# Standalone grading
pincenez checks.yaml output.md > grading.yaml

# Pipe from scuttlerun
scuttlerun session.yaml | pincenez checks.yaml

# CI quality gate
scuttlerun test-scenario.yaml | pincenez checks.yaml | yq -e '.pass_rate >= 0.8'

# Grade a specific output
pincenez checks.yaml output.md > grading.yaml
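
Where yq is not available, the same quality gate can be approximated in a script. The sketch below is an assumption-laden stand-in rather than a YAML parser: it only looks for a top-level `pass_rate:` line in the grading text, and `extractPassRate` and `meetsThreshold` are hypothetical helper names.

```typescript
// Pull the top-level pass_rate scalar out of grading YAML text.
// Returns undefined when no unindented pass_rate line is present.
function extractPassRate(yamlText: string): number | undefined {
  const match = yamlText.match(/^pass_rate:\s*([0-9]*\.?[0-9]+)\s*$/m);
  return match ? Number(match[1]) : undefined;
}

// Mirror of the `.pass_rate >= 0.8` gate: a missing pass_rate fails the gate.
function meetsThreshold(yamlText: string, threshold = 0.8): boolean {
  const rate = extractPassRate(yamlText);
  return rate !== undefined && rate >= threshold;
}
```

A CI step could read the saved grading.yaml, call `meetsThreshold`, and exit non-zero when it returns false. For anything beyond this single field, a real YAML parser is the safer choice.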

Development

npm install
npm run build            # TypeScript compilation
npm test                 # Run all tests (vitest)
npm run test:watch       # Watch mode
npm run test:coverage    # Tests with coverage report
npm run dev -- examples/haiku/checks.yaml examples/haiku/transcript.yaml   # Run via tsx
