An MCP (Model Context Protocol) server that uses OpenAI-compatible LLMs to evaluate outputs from other LLMs. It provides specialized evaluation tools for code review, architecture design, UI/UX design, test cases, and custom scenarios.
## Features

- **6 Evaluation Tools:**
  - `evaluate_code_review` - Evaluate code for quality, bugs, security, and performance
  - `evaluate_architecture` - Evaluate system and software architecture designs
  - `evaluate_uiux` - Evaluate UI/UX designs for usability and accessibility
  - `evaluate_test_cases` - Evaluate test case coverage and quality
  - `evaluate_custom` - Evaluate against custom, user-defined criteria
  - `list_evaluation_criteria` - List the predefined evaluation criteria
- **Flexible Input:** Accepts content as direct text or as file paths
- **Structured Output:** JSON results with scores, issues, strengths, and improvements
- **OpenAI Compatible:** Works with any OpenAI-compatible API (OpenAI, Azure, local LLMs)
## Installation

1. Clone or download this repository
2. Install dependencies: `npm install`
3. Build the TypeScript code: `npm run build`

## Configuration

The server is configured through the following environment variables:
| Variable | Description | Required | Default |
|---|---|---|---|
| `LLM_API_KEY` | API key for the LLM service | Yes | - |
| `LLM_API_BASE_URL` | OpenAI-compatible API base URL | No | `https://api.openai.com/v1` |
| `LLM_MODEL` | Model name to use | No | `gpt-5.2` |
| `LLM_MAX_TOKENS` | Maximum tokens for the response | No | `4096` |
| `LLM_TEMPERATURE` | Sampling temperature for generation | No | `0.3` |
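Once built, the server can also be launched by hand to verify the configuration. This is a quick sketch; an MCP server of this kind typically communicates over stdio, so when run directly it will simply wait for a client to connect:

```shell
# Only LLM_API_KEY is required; the other variables fall back to the
# defaults listed in the table above.
export LLM_API_KEY="your-openai-api-key"
export LLM_MODEL="gpt-5.2"   # optional override

node build/index.js
```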
## MCP Client Setup

Add this server to your MCP client's settings file.

Edit `~/Library/Application Support/Code/User/globalStorage/kilocode.kilo-code/settings/mcp_settings.json`:

```json
{
  "mcpServers": {
    "codex-evaluator": {
      "command": "node",
      "args": ["/path/to/codex-mcp/build/index.js"],
      "env": {
        "LLM_API_KEY": "your-openai-api-key",
        "LLM_API_BASE_URL": "https://api.openai.com/v1",
        "LLM_MODEL": "gpt-5.2",
        "LLM_MAX_TOKENS": "4096",
        "LLM_TEMPERATURE": "0.3"
      },
      "disabled": false,
      "alwaysAllow": [],
      "disabledTools": []
    }
  }
}
```

Edit `~/Library/Application Support/Claude/claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "codex-evaluator": {
      "command": "node",
      "args": ["/path/to/codex-mcp/build/index.js"],
      "env": {
        "LLM_API_KEY": "your-openai-api-key",
        "LLM_MODEL": "gpt-5.2"
      }
    }
  }
}
```

## Usage Examples

Use `evaluate_code_review` to review this Python function:

```python
def calculate_discount(price, discount):
    return price - (price * discount / 100)
```
Use `evaluate_architecture` to evaluate this microservices design:

```
Service A -> Message Queue -> Service B -> Database
Service A -> Cache -> Service C -> External API
```
Use `evaluate_uiux` to evaluate this login page design:
- Header with logo
- Email input field
- Password input field
- "Remember me" checkbox
- "Forgot password" link
- Login button
- "Sign up" link at bottom
Use `evaluate_test_cases` to evaluate these unit tests:

```javascript
describe('Calculator', () => {
  it('should add two numbers', () => {
    expect(add(2, 3)).toBe(5);
  });
});
```
Use `evaluate_custom` with the criteria "Check for SQL injection vulnerabilities and XSS attacks" on this code:

```javascript
const query = "SELECT * FROM users WHERE name = '" + userName + "'";
```
Use `list_evaluation_criteria` with scenario `"code_review"` to see which criteria are used.
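A client consuming these tools might triage the structured response documented in the next section. This is an illustrative sketch, not part of the server; the `EvaluationResult` and `Issue` types are hypothetical names that mirror the response shape shown in this README:

```typescript
// Minimal types mirroring the documented response shape.
interface Issue {
  severity: "low" | "medium" | "high";
  category: string;
  location: string;
  description: string;
  suggestion: string;
}

interface EvaluationResult {
  summary: string;
  score: { overall: number; categories: Record<string, number> };
  issues: Issue[];
  strengths: string[];
  improvements: { priority: string; description: string; rationale: string }[];
}

// Pull out only the issues severe enough to block a merge.
function blockingIssues(result: EvaluationResult): Issue[] {
  return result.issues.filter((i) => i.severity === "high");
}

// Example, using the sample response from this README:
const sample: EvaluationResult = {
  summary: "Brief overall assessment",
  score: { overall: 7.5, categories: { correctness: 8, security: 7 } },
  issues: [
    {
      severity: "high",
      category: "security",
      location: "line 5",
      description: "SQL injection vulnerability",
      suggestion: "Use parameterized queries",
    },
  ],
  strengths: ["Clear function naming"],
  improvements: [],
};

console.log(blockingIssues(sample).length); // 1
```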
## Output Format

All evaluation tools return a structured JSON response:
```json
{
  "summary": "Brief overall assessment",
  "score": {
    "overall": 7.5,
    "categories": {
      "correctness": 8,
      "security": 7,
      "performance": 8,
      "maintainability": 7,
      "best_practices": 7
    }
  },
  "issues": [
    {
      "severity": "high",
      "category": "security",
      "location": "line 5",
      "description": "SQL injection vulnerability",
      "suggestion": "Use parameterized queries"
    }
  ],
  "strengths": [
    "Clear function naming",
    "Good code organization"
  ],
  "improvements": [
    {
      "priority": "high",
      "description": "Add input validation",
      "rationale": "Prevents invalid data from causing errors"
    }
  ],
  "metadata": {
    "evaluator_model": "gpt-5.2",
    "scenario": "code_review",
    "timestamp": "2024-01-15T10:30:00Z",
    "input_type": "text",
    "processing_time_ms": 2500
  }
}
```

## Development

- `npm run build` - Build the TypeScript code
- `npm run dev` - Development mode
- `npm start` - Start the server

## Evaluation Criteria

### Code Review (`code_review`)

- Correctness (25%): Logic errors, edge cases, type safety
- Security (20%): Vulnerabilities, input validation, data exposure
- Performance (15%): Efficiency, memory usage, async patterns
- Maintainability (20%): Organization, naming, complexity
- Best Practices (20%): Design patterns, SOLID, DRY/KISS
### Architecture

- Scalability (20%): Scaling capabilities, bottlenecks
- Reliability (20%): Fault tolerance, redundancy
- Maintainability (20%): Modularity, coupling, cohesion
- Security (15%): Authentication, data protection
- Cost Efficiency (10%): Resource utilization
- Performance (15%): Latency, throughput
### UI/UX

- Usability (25%): Navigation, efficiency, error prevention
- Accessibility (20%): WCAG, screen readers, contrast
- Visual Design (15%): Consistency, hierarchy, typography
- User Flow (20%): Task completion, feedback
- Responsiveness (10%): Device adaptation
- Content (10%): Messaging clarity, help text
### Test Cases

- Coverage (25%): Code/branch/path coverage
- Edge Cases (20%): Boundaries, errors, null inputs
- Assertions (20%): Meaningful assertions, clarity
- Structure (20%): Organization, naming, isolation
- Maintainability (15%): Independence, data management
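The per-category percentages above suggest that the overall score is a weighted average of the category scores. That formula isn't stated in this README, so the sketch below is an assumption, shown with the code-review weights:

```typescript
// Hypothetical: combine category scores (0-10) into an overall score
// using the code-review weights listed above. The weight values are
// taken from this README; the averaging formula itself is assumed.
const codeReviewWeights: Record<string, number> = {
  correctness: 0.25,
  security: 0.2,
  performance: 0.15,
  maintainability: 0.2,
  best_practices: 0.2,
};

function weightedOverall(
  categories: Record<string, number>,
  weights: Record<string, number>
): number {
  let total = 0;
  for (const [name, weight] of Object.entries(weights)) {
    total += (categories[name] ?? 0) * weight;
  }
  // Round to one decimal place, matching the sample response.
  return Math.round(total * 10) / 10;
}

// Category scores from the sample response in this README:
const scores = {
  correctness: 8,
  security: 7,
  performance: 8,
  maintainability: 7,
  best_practices: 7,
};
console.log(weightedOverall(scores, codeReviewWeights)); // 7.4
```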
## License

MIT