A lightweight, open-source evaluation harness for prompts, LLMs, and agent workflows.
PromptLens runs golden test sets against multiple models, scores outputs using LLM-as-judge, tracks cost and latency, and generates beautiful visual reports—all locally, with no cloud dependencies.
- Multi-Provider Support - Test Anthropic (Claude), OpenAI (GPT), Google (Gemini), You.com, and local models (Ollama, LM Studio)
- Tool/Function Calling Evaluation - Test tool usage with automatic + LLM judge scoring across 5 criteria
- LLM-as-Judge Scoring - Automated evaluation using another LLM with configurable criteria
- Cost & Latency Tracking - Monitor per-query costs and response times across models
- Beautiful Reports - Interactive HTML reports with charts, comparisons, and detailed results
- Multiple Export Formats - HTML, JSON, CSV, and Markdown outputs
- Parallel Execution - Async execution with configurable concurrency and retry logic
- Portable & Local - No cloud backend, all data stays on your machine
- Easy to Extend - Plugin architecture for custom providers, judges, and exporters
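Per-query cost tracking comes down to token-count arithmetic. A minimal sketch, with placeholder per-million-token rates (these are illustrative, not the price table PromptLens ships with):

```python
# Illustrative per-query cost arithmetic; the rates below are assumed
# example values, not PromptLens's built-in pricing.
PROMPT_PRICE_PER_MTOK = 3.00       # USD per million prompt tokens (assumed)
COMPLETION_PRICE_PER_MTOK = 15.00  # USD per million completion tokens (assumed)


def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Return the estimated USD cost of a single query."""
    return (prompt_tokens * PROMPT_PRICE_PER_MTOK
            + completion_tokens * COMPLETION_PRICE_PER_MTOK) / 1_000_000


print(estimate_cost(1200, 400))  # 0.0096
```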
```bash
pip install promptlens
```

Or install from source:

```bash
git clone https://github.com/sparker/promptlens.git
cd promptlens
pip install -e .

# Or with Poetry
poetry install
```

Requirements:

- Python 3.9+
- API keys for the providers you want to use (Anthropic, OpenAI, Google, You.com)
```bash
# Copy the example environment file
cp .env.example .env

# Edit .env and add your API keys
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...
export GOOGLE_API_KEY=...
export YOU_API_KEY=...
```

```bash
# Run the basic customer support evaluation
promptlens run examples/configs/basic_config.yaml

# Open the HTML report (path shown in CLI output)
open promptlens_results/latest/report.html
```

That's it! You've just evaluated an LLM against a golden test set.
→ Read the Complete Getting Started Guide for detailed workflows and use cases.
Golden sets are test cases in YAML or JSON format:

```yaml
name: "My Test Set"
description: "Testing customer support responses"
version: "1.0"

test_cases:
  - id: "test-001"
    query: "How do I reset my password?"
    expected_behavior: "Provide clear step-by-step instructions"
    category: "account_management"
    tags: ["password", "account"]

  - id: "test-002"
    query: "What's your refund policy?"
    expected_behavior: "Explain the 30-day refund policy clearly"
    category: "policy"
    tags: ["refund", "billing"]
```

Save as `my_tests.yaml`.
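The same golden set can also be written as JSON. The sketch below parses an equivalent set with the standard library and runs the kind of sanity checks a loader might apply (field names are taken from the YAML example above):

```python
import json

# JSON equivalent of the YAML golden set above.
golden_set = json.loads("""
{
  "name": "My Test Set",
  "description": "Testing customer support responses",
  "version": "1.0",
  "test_cases": [
    {
      "id": "test-001",
      "query": "How do I reset my password?",
      "expected_behavior": "Provide clear step-by-step instructions",
      "category": "account_management",
      "tags": ["password", "account"]
    }
  ]
}
""")

# Sanity checks: every case needs an id and a query.
assert all("id" in tc and "query" in tc for tc in golden_set["test_cases"])
print(golden_set["name"], len(golden_set["test_cases"]))  # My Test Set 1
```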
```yaml
golden_set: ./my_tests.yaml

models:
  - name: "Claude 3.5 Sonnet"
    provider: anthropic
    model: claude-3-5-sonnet-20241022
    temperature: 0.7
    max_tokens: 1024

  - name: "GPT-4 Turbo"
    provider: openai
    model: gpt-4-turbo-preview
    temperature: 0.7
    max_tokens: 1024

judge:
  provider: anthropic
  model: claude-3-5-sonnet-20241022
  temperature: 0.3

execution:
  parallel_requests: 3
  retry_attempts: 3

output:
  directory: ./promptlens_results
  formats: [html, json, csv, md]
  run_name: "My Evaluation"
```

Save as `my_config.yaml`.
```bash
promptlens run my_config.yaml
```

```bash
# Run an evaluation
promptlens run <config.yaml>

# Validate a golden set
promptlens validate <golden_set.yaml>

# List past runs
promptlens list-runs

# Export a run to a different format
promptlens export <run_id> --format html

# Get help
promptlens --help
```

```yaml
models:
  - name: "Claude 3.5 Sonnet"
    provider: anthropic
    model: claude-3-5-sonnet-20241022
    temperature: 0.7
    max_tokens: 1024
```

Supported models:

- `claude-3-5-sonnet-20241022`
- `claude-3-opus-20240229`
- `claude-3-haiku-20240307`
```yaml
models:
  - name: "GPT-4 Turbo"
    provider: openai
    model: gpt-4-turbo-preview
    temperature: 0.7
    max_tokens: 1024
```

Supported models:

- `gpt-4-turbo-preview`, `gpt-4`, `gpt-3.5-turbo`
- `o1-preview`, `o1-mini`
```yaml
models:
  - name: "Gemini Pro"
    provider: google
    model: gemini-1.5-pro
    temperature: 0.7
    max_tokens: 1024
```

Supported models:

- `gemini-1.5-pro`, `gemini-1.5-flash`, `gemini-pro`
```yaml
models:
  - name: "You.com GPT-4"
    provider: you
    model: gpt-4
    temperature: 0.7
    max_tokens: 1024
```

Supported models:

- `gpt-4`, `claude-3-5-sonnet`, `llama-3-70b`
- Any model available through You.com's unified API

Setup:

- Get an API key from https://api.you.com/
- Set the `YOU_API_KEY` environment variable
- Use model names as specified in You.com's docs
```yaml
models:
  - name: "Local Llama"
    provider: http
    model: llama3.1:8b
    temperature: 0.7
    max_tokens: 1024
    additional_params:
      endpoint: "http://localhost:11434/api/generate"
```

Setup:

- Install Ollama
- Pull a model: `ollama pull llama3.1:8b`
- Use the `http` provider with the endpoint URL
Model configuration:

```yaml
models:
  - name: "Display Name"       # Human-readable name
    provider: anthropic        # anthropic, openai, google, http
    model: model-identifier    # Model ID
    temperature: 0.7           # 0.0-1.0
    max_tokens: 1024           # Maximum output tokens
    additional_params:         # Provider-specific params
      endpoint: "http://..."   # For HTTP provider
```

Judge configuration:

```yaml
judge:
  provider: anthropic                # Provider for the judge model
  model: claude-3-5-sonnet-20241022  # Judge model (typically Claude or GPT-4)
  temperature: 0.3                   # Lower for consistent scoring
  criteria:                          # Evaluation criteria
    - accuracy
    - helpfulness
    - safety
```

Execution configuration:

```yaml
execution:
  parallel_requests: 3   # Concurrent API calls
  retry_attempts: 3      # Retries for failed requests
  timeout_seconds: 60    # Request timeout
```

Output configuration:

```yaml
output:
  directory: ./promptlens_results  # Output directory
  formats:                         # Export formats
    - html                         # Interactive report
    - json                         # Raw JSON data
    - csv                          # Flattened spreadsheet
    - md                           # Markdown summary
  run_name: "My Evaluation"        # Display name
```

```bash
promptlens run examples/configs/basic_config.yaml
promptlens run examples/configs/multi_model.yaml
promptlens run examples/configs/local_model.yaml
```

See `examples/README.md` for more details.
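The `parallel_requests` and `retry_attempts` settings map onto a standard asyncio pattern: a semaphore caps concurrency while each call retries on transient errors. A minimal sketch (the provider call is a stand-in, not PromptLens internals):

```python
import asyncio


async def call_model(case_id: int, attempt: int) -> str:
    # Stand-in for a real provider call; fails once for even ids
    # to exercise the retry path.
    if case_id % 2 == 0 and attempt == 1:
        raise RuntimeError("transient error")
    return f"result-{case_id}"


async def run_with_retry(case_id: int, retry_attempts: int = 3) -> str:
    for attempt in range(1, retry_attempts + 1):
        try:
            return await call_model(case_id, attempt)
        except RuntimeError:
            await asyncio.sleep(0)  # backoff elided for brevity
    return f"failed-{case_id}"


async def run_all(n_cases: int, parallel_requests: int = 3) -> list:
    sem = asyncio.Semaphore(parallel_requests)  # cap concurrent API calls

    async def bounded(i: int) -> str:
        async with sem:
            return await run_with_retry(i)

    return await asyncio.gather(*(bounded(i) for i in range(n_cases)))


results = asyncio.run(run_all(4))
print(results)  # ['result-0', 'result-1', 'result-2', 'result-3']
```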
The HTML report includes:
- Summary Dashboard - Total cost, time, test cases, and models
- Model Comparison Cards - Side-by-side metrics for each model
- Score Distribution Charts - Visual breakdown of scores (1-5)
- Detailed Test Results - Expandable cards for each test case with:
- Original query and expected behavior
- Model responses
- Judge scores and explanations
- Cost and latency per response
- Dark Theme - Easy on the eyes with accent colors for data
- Responsive Design - Works on desktop and mobile
Test different prompt versions to find the best performer:
- Create test cases for your use case
- Update your prompt
- Run evaluation
- Compare scores with previous run
- Iterate
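Step 4 (comparing with the previous run) can be as simple as diffing mean judge scores per test case. A sketch over hypothetical scores (the real run JSON layout is not shown here):

```python
# Hypothetical judge scores keyed by test-case id, from two runs;
# real data would come from the exported run JSON.
before = {"test-001": 3, "test-002": 4, "test-003": 2}
after = {"test-001": 4, "test-002": 4, "test-003": 4}

delta = sum(after.values()) / len(after) - sum(before.values()) / len(before)
regressions = [k for k in before if after[k] < before[k]]

print(delta, regressions)  # 1.0 []
```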
Compare models before committing to one:
- Add multiple models to config
- Run the same test set against all models
- Compare costs, latency, and quality scores
- Make a data-driven decision
Ensure prompt changes don't break existing behavior:
- Maintain a golden set of important test cases
- Run before and after making changes
- Compare results to catch regressions
- Integrate into CI/CD pipeline
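For the CI/CD step, one option is to gate the build on the JSON export. The summary layout below is hypothetical (check your actual export before relying on field names):

```python
import json

# Hypothetical summary extracted from a PromptLens JSON export;
# the real field names may differ.
summary = json.loads("""
{"results": [{"id": "test-001", "score": 4},
             {"id": "test-002", "score": 5}]}
""")

mean_score = sum(r["score"] for r in summary["results"]) / len(summary["results"])
exit_code = 0 if mean_score >= 4.0 else 1  # fail the pipeline on regressions
print(mean_score, exit_code)  # 4.5 0
```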
Evaluate multi-step agent workflows:
- Create test cases for agent tasks
- Implement agent logic
- Evaluate with PromptLens
- Iterate on tools and prompting
Test how well models use tools and functions:
- Define tools with JSON schema
- Specify expected tool calls
- Evaluate parameter correctness, tool selection, and efficiency
- Get multi-criteria scores with detailed feedback
Example test case:

```yaml
- id: "tool-001"
  query: "What's the weather in San Francisco?"
  expected_behavior: "Call get_weather with location='San Francisco'"
  evaluation_mode: "tool_and_answer"
  tools:
    - name: "get_weather"
      description: "Get current weather"
      parameters:
        location:
          type: "string"
          required: true
  expected_tool_calls:
    - name: "get_weather"
      arguments:
        location: "San Francisco"
```

Evaluation includes:
- Automatic comparison (expected vs actual tool calls)
- Parameter correctness scoring (1-5)
- Tool selection accuracy (1-5)
- Tool usage efficiency (1-5)
- Final answer quality (1-5)
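The automatic-comparison step above can be pictured as exact matching on the tool name plus per-argument credit. A sketch of one plausible scoring rule (not PromptLens's exact rubric):

```python
def score_tool_call(expected: dict, actual: dict) -> int:
    """Score an actual tool call against the expected one on a 1-5 scale.

    Illustrative rubric, not PromptLens's implementation:
    wrong tool -> 1, exact arguments -> 5, otherwise partial credit.
    """
    if expected["name"] != actual.get("name"):
        return 1
    exp_args = expected.get("arguments", {})
    act_args = actual.get("arguments", {})
    if exp_args == act_args:
        return 5
    matched = sum(1 for k, v in exp_args.items() if act_args.get(k) == v)
    return 2 + round(2 * matched / max(len(exp_args), 1))


expected = {"name": "get_weather", "arguments": {"location": "San Francisco"}}
print(score_tool_call(expected, expected))                             # 5
print(score_tool_call(expected, {"name": "search", "arguments": {}}))  # 1
```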
Supported providers: Anthropic Claude, OpenAI GPT (other providers will warn gracefully)
Try it:

```bash
promptlens run examples/configs/tool_evaluation.yaml
```

See `examples/golden_sets/tool_calling.yaml` for complete examples.
Custom judge prompt:

```yaml
judge:
  provider: anthropic
  model: claude-3-5-sonnet-20241022
  custom_prompt: |
    You are evaluating a coding assistant's response.

    Query: {query}
    Expected: {expected_behavior}
    Response: {response}

    Rate 1-5 based on code correctness, efficiency, and style.

    SCORE: [1-5]
    EXPLANATION: [Your reasoning]
```

Provider-specific parameters:

```yaml
models:
  - name: "GPT-4 with JSON mode"
    provider: openai
    model: gpt-4-turbo-preview
    additional_params:
      response_format: {"type": "json_object"}
```

Execution tuning:

```yaml
execution:
  parallel_requests: 10   # Higher for faster execution
  retry_attempts: 5       # More retries for flaky APIs
  timeout_seconds: 120    # Longer timeout for slow models
```

Project layout:

```
promptlens/
├── models/      # Pydantic data models
├── providers/   # LLM provider implementations
├── loaders/     # Golden set loaders (JSON/YAML)
├── runners/     # Orchestration and execution
├── judges/      # LLM-as-judge scoring
├── exporters/   # Report generators
├── utils/       # Utilities (cost, retry, timing)
└── templates/   # HTML report templates
```
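The `SCORE:` / `EXPLANATION:` format that the custom judge prompt asks for can be pulled out with two regexes. A sketch (PromptLens's actual parser may be stricter):

```python
import re


def parse_judge_output(text: str):
    """Extract (score, explanation) from judge output in SCORE/EXPLANATION form."""
    score_match = re.search(r"SCORE:\s*([1-5])", text)
    expl_match = re.search(r"EXPLANATION:\s*(.+)", text, re.DOTALL)
    if score_match is None:
        raise ValueError("judge output missing SCORE line")
    explanation = expl_match.group(1).strip() if expl_match else ""
    return int(score_match.group(1)), explanation


score, explanation = parse_judge_output(
    "SCORE: 4\nEXPLANATION: Correct but could be more concise."
)
print(score, explanation)  # 4 Correct but could be more concise.
```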
Key Design Principles:
- Plugin Architecture - Easy to add new providers, judges, exporters
- Async-First - Parallel execution for speed
- Type-Safe - Pydantic models throughout
- Modular - Each component is independent and testable
Custom provider:

```python
from promptlens.providers.base import BaseProvider
from promptlens.models.result import ModelResponse


class MyProvider(BaseProvider):
    async def generate(self, prompt: str, **kwargs) -> ModelResponse:
        # Your implementation
        pass

    def estimate_cost(self, prompt_tokens: int, completion_tokens: int) -> float:
        return 0.0

    @property
    def provider_name(self) -> str:
        return "my_provider"


# Register it
from promptlens.providers.factory import register_provider

register_provider("my_provider", MyProvider)
```

Custom judge:

```python
from promptlens.judges.base import BaseJudge
from promptlens.models.result import JudgeScore


class RuleBasedJudge(BaseJudge):
    async def evaluate(self, test_case, model_response) -> JudgeScore:
        # Your scoring logic
        score = self.calculate_score(model_response.content)
        return JudgeScore(
            score=score,
            explanation="Rule-based evaluation",
            judge_model="rule-based",
            judge_provider="custom",
        )
```

- Multi-provider support (Anthropic, OpenAI, Google, HTTP)
- LLM-as-judge scoring
- HTML reports with charts
- JSON/CSV/Markdown export
- Parallel execution with retry logic
- Multi-judge consensus scoring
- Synthetic test case generation
- Cross-run comparison and tracking
- GitHub Action for CI/CD
- Web UI for report exploration
- Embedding-based similarity scoring
- Custom plugin marketplace
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
See CONTRIBUTING.md for detailed guidelines.
MIT License - see LICENSE for details.
- Inspired by the need for simple, local LLM evaluation tools
- Built with Anthropic, OpenAI, and Google AI APIs
- Uses Rich for beautiful CLI output
- Charts powered by Chart.js
- Issues: https://github.com/sparker/promptlens/issues
- Discussions: https://github.com/sparker/promptlens/discussions
- Email: sparker@example.com
Made with ❤️ for the LLM developer community