AI Model Benchmark & Evaluation System with multi-evaluator cross-validation.
- Unified CLI - Single entry point for all operations
- Multi-Model Benchmarking - Test 13+ models across multiple providers
- Cross-Validation - Evaluate with multiple evaluators (GPT-5.1, Claude Sonnet 4.5, Gemini 3.0)
- Dynamic Configuration - Auto-discovery of prompts and configurable model lists
- Detailed Reports - Markdown reports with rankings, analysis, and insights
```bash
# Install dependencies
pnpm install

# Configure environment
cp .env.example .env
# Edit .env and add your OPENROUTER_API_KEY

# Run the full pipeline
plinius benchmark   # Run benchmarks
plinius evaluate    # Evaluate results
plinius compare     # Generate reports
```

```
plinius <command> [options]

Commands:
  benchmark    Run benchmark prompts against models
  evaluate     Evaluate benchmark results with multiple evaluators
  compare      Compare evaluations across evaluators
  clean        Remove benchmark artifacts

Options:
  -h, --help       Show help message
  -v, --version    Show version number

Examples:
  plinius benchmark          # Run all benchmarks
  plinius evaluate           # Evaluate results with all evaluators
  plinius compare            # Generate comparison report
  plinius clean              # Remove all artifacts
  plinius clean benchmark    # Remove only benchmark results
  plinius clean evaluate     # Remove only evaluation data
  plinius clean reports      # Remove only reports
```

Prompts are automatically discovered from benchmark/prompt/ (a discovery sketch follows the prompt list below).
Note: The included prompts are samples. You can modify or replace them with your own prompts.
- A1: Abstract Market Generation Model Estimation
- A2: Constrained Alpha Construction
- A3: Portfolio Risk Decomposition
- B1: Monad Laws Proof Structure Design
- B2: F* Dijkstra Monad for Non-Interference
- B3: Type-Level Secret Key Logging Prevention
- C1: Decision Tree Under Uncertainty
- C2: Causal Modeling of Business Metrics
- C3: Strategic Analysis Through Abstraction
See docs/prompts.md for prompt creation guidelines.
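For illustration, discovery can be as simple as globbing benchmark/prompt/ for Markdown files. The sketch below is an assumption about how src/benchmark/loader.ts might work, not its actual code; the PromptFile shape and discoverPrompts name are made up for the example:

```typescript
import { readdir, readFile } from "node:fs/promises";
import { basename, join } from "node:path";

// Hypothetical shape; the real loader.ts may expose something different.
interface PromptFile {
  id: string;      // e.g. "A1", taken from the filename
  path: string;    // repo-relative path to the prompt file
  content: string; // raw Markdown body sent to each model
}

// Discover every *.md file under benchmark/prompt/ (assumed layout).
export async function discoverPrompts(dir = "benchmark/prompt"): Promise<PromptFile[]> {
  const entries = await readdir(dir);
  const markdown = entries.filter((name) => name.endsWith(".md")).sort();
  return Promise.all(
    markdown.map(async (name) => ({
      id: basename(name, ".md"),
      path: join(dir, name),
      content: await readFile(join(dir, name), "utf8"),
    })),
  );
}
```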
The multi-evaluator system uses three evaluators for cross-validation:
- GPT-5.1 (OpenAI)
- Claude Sonnet 4.5 (Anthropic)
- Gemini 3.0 Preview (Google)
Each response is evaluated on 5 dimensions (0-5 points each, 25 points total):
- Structure - Organization and logical flow
- Depth of Reasoning - Insights, edge cases, and alternatives
- Consistency & Coherence - Internal consistency and correctness
- Creativity & Concreteness - Actionable proposals and specificity
- Domain-Specific Correctness - Appropriate domain expertise
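As a rough illustration, a single evaluator's score can be modeled as one number per dimension plus the 25-point total. This is only a sketch with assumed field names, not the actual src/evaluation/rubric.ts:

```typescript
// Hypothetical score shape; the real rubric.ts may use different names.
interface EvaluationScore {
  structure: number;              // 0-5
  depthOfReasoning: number;       // 0-5
  consistencyCoherence: number;   // 0-5
  creativityConcreteness: number; // 0-5
  domainCorrectness: number;      // 0-5
}

// Sum the five dimensions into the 25-point total described above.
export function totalScore(s: EvaluationScore): number {
  return (
    s.structure +
    s.depthOfReasoning +
    s.consistencyCoherence +
    s.creativityConcreteness +
    s.domainCorrectness
  );
}
```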
```bash
# 1. Run benchmarks (models × prompts)
plinius benchmark

# 2. Evaluate with all evaluators (evaluators × results)
plinius evaluate

# 3. Generate comparison report
plinius compare
```

Artifacts are written to:

- Benchmark results: benchmark/artifacts/result/{prompt}_{model}_{timestamp}.md
- Evaluations: benchmark/artifacts/evaluation/{prompt}_{model}_{evaluator}_evaluation_{timestamp}.json
- Reports: benchmark/artifacts/reports/
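For illustration only, a result path matching the pattern above could be assembled like this; the sanitization and timestamp format are assumptions, not the runner's actual behavior:

```typescript
// Build a result path shaped like {prompt}_{model}_{timestamp}.md (sketch only).
function resultPath(prompt: string, model: string, when = new Date()): string {
  // OpenRouter model IDs contain "/", which is awkward in filenames;
  // how the real runner sanitizes them may differ from this guess.
  const safeModel = model.replace(/[^A-Za-z0-9._-]/g, "-");
  const timestamp = when.toISOString().replace(/[:.]/g, "-");
  return `benchmark/artifacts/result/${prompt}_${safeModel}_${timestamp}.md`;
}

// e.g. resultPath("A1", "vendor/some-model")
//   -> "benchmark/artifacts/result/A1_vendor-some-model_<ISO timestamp>.md"
```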
Edit src/config.ts to customize:
```typescript
// Models to benchmark
export const BENCHMARK_MODELS: OpenRouterModel[] = [
  OpenRouterModels.GPT_5_1,
  OpenRouterModels.CLAUDE_4_5_HAIKU,
  OpenRouterModels.GEMINI_3_0_PREVIEW,
  // ... add or remove models
];

// Models to use as evaluators
export const EVALUATOR_MODELS: OpenRouterModel[] = [
  OpenRouterModels.GPT_5_1,
  OpenRouterModels.CLAUDE_4_5_SONNET,
  OpenRouterModels.GEMINI_3_0_PREVIEW,
];
```

See docs/configuration.md for detailed configuration options.
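With three evaluators configured, the compare step can cross-validate by aggregating each evaluator's 25-point total per model. The sketch below averages totals under assumed names; the real compare.ts may rank or weight results differently:

```typescript
// One evaluator's total (out of 25) for one model on one prompt (assumed shape).
interface EvaluationRecord {
  prompt: string;
  model: string;
  evaluator: string;
  total: number; // 0-25
}

// Average each model's totals across all evaluators and prompts (sketch only).
function averageByModel(records: EvaluationRecord[]): Map<string, number> {
  const sums = new Map<string, { sum: number; count: number }>();
  for (const r of records) {
    const entry = sums.get(r.model) ?? { sum: 0, count: 0 };
    entry.sum += r.total;
    entry.count += 1;
    sums.set(r.model, entry);
  }
  return new Map(
    [...sums].map(([model, { sum, count }]) => [model, sum / count]),
  );
}
```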
```
.
├── benchmark/
│   ├── prompt/              # Benchmark prompts (A1.md - C3.md)
│   └── artifacts/
│       ├── result/          # Benchmark results
│       ├── evaluation/      # Evaluation results
│       └── reports/         # Generated reports
├── docs/
│   ├── prompts.md           # Prompt creation guide
│   ├── cli.md               # CLI reference
│   └── configuration.md     # Configuration guide
├── src/
│   ├── benchmark/           # Benchmark system
│   │   ├── loader.ts        # Load and discover prompts
│   │   └── runner.ts        # Run benchmarks
│   ├── commands/            # CLI commands
│   │   ├── benchmark.ts     # Benchmark command
│   │   ├── evaluate.ts      # Evaluate command
│   │   ├── compare.ts       # Compare command
│   │   └── clean.ts         # Clean command
│   ├── evaluation/          # Evaluation system
│   │   ├── evaluator.ts     # Evaluation logic
│   │   ├── parser.ts        # Parse result files
│   │   ├── progress.ts      # Progress tracking
│   │   └── rubric.ts        # Evaluation rubric
│   ├── types/               # TypeScript definitions
│   ├── cli.ts               # CLI entry point
│   ├── config.ts            # Central configuration
│   └── env.ts               # Environment setup
├── flake.nix                # Nix development environment
└── package.json
```
The project supports 13+ models via OpenRouter:
| Provider | Models |
|---|---|
| OpenAI | GPT-5.1 |
| Anthropic | Claude 4.5 Haiku, Claude 4.5 Sonnet |
| Google | Gemini 3.0 Preview, Gemini 2.5 Pro |
| Meta | Llama 4 Maverick |
| Mistral | Mistral Medium 3.1 |
| DeepSeek | DeepSeek R1-0528 |
| xAI | Grok 4 |
| Moonshot AI | Kimi K2 Thinking |
| Qwen | Qwen3 Max |
| MiniMax | MiniMax M2 |
| Microsoft | Phi-4 Reasoning Plus, MAI-DS-R1 |
```bash
pnpm run build      # Build TypeScript
pnpm run typecheck  # Type check only
pnpm run start      # Run compiled CLI
```

This project uses Nix for reproducible development:

```bash
nix develop   # Enter development shell
```

Or use direnv:

```bash
direnv allow
```