Production-grade LLM evaluation and improvement system. Generates responses, evaluates them via LLM-as-judge scoring, and iteratively refines them until they meet a quality threshold. · FastAPI backend · Pydantic validation · Google Gemini integration.
Building scalable Generative AI applications requires robust evaluation pipelines, and static testing alone cannot handle the non-deterministic nature of LLMs. Evalynx addresses this with an autonomous "LLM-as-judge" pipeline: it captures an initial response, critiques it across three distinct axes (Correctness, Clarity, and Reasoning), and autonomously rewrites any response whose average score falls below the target confidence threshold, so that only high-quality answers are returned.
- Intelligent Orchestration: Fully automated dual-LLM generation and evaluation pipeline.
- Strict Scoring Mechanism: Enforces strict JSON-bound scoring for `correctness`, `clarity`, and `reasoning` (scored 1-10).
- Auto-Refinement Loop: Automatically injects negative feedback into an `ImproverService` if the average score falls below a defined threshold (7.0).
- Trace Logging: Complete structured logs to track API latencies and specific trace iterations via `loguru`.
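The scoring contract could be modeled with Pydantic roughly like this (a sketch only: field names are illustrative, and the real models live in `models/schemas.py`):

```python
# Hedged sketch of an evaluation-score model; field names are assumptions,
# not the actual contents of models/schemas.py.
from pydantic import BaseModel, Field

class EvaluationScores(BaseModel):
    correctness: float = Field(ge=1, le=10)  # each axis bounded to 1-10
    clarity: float = Field(ge=1, le=10)
    reasoning: float = Field(ge=1, le=10)
    feedback: str                            # judge's written critique

    @property
    def average(self) -> float:
        # Mean of the three axes, compared against the 7.0 threshold.
        return (self.correctness + self.clarity + self.reasoning) / 3

scores = EvaluationScores(correctness=10, clarity=9, reasoning=10,
                          feedback="Accurate and clear.")
print(round(scores.average, 3))  # 9.667
```

Because the fields are bounded with `Field(ge=1, le=10)`, an out-of-range score from the judge fails validation instead of silently skewing the average.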
┌─────────────────────────────────────────────────────────────────────┐
│ USER REQUEST (FastAPI) │
│ POST /evaluate ──► query: "Explain relativity" │
└────────────────────────────────┬────────────────────────────────────┘
│
┌────────────────────────────────▼────────────────────────────────────┐
│ EVALYNX ORCHESTRATOR LAYER │
│ │
│ 1. GeneratorService ──► Generates Initial Response │
│ │ │
│ 2. EvaluatorService ──► LLM-as-Judge (Structured JSON Score) │
│ │ │
│ [ Average Score < 7.0 ] │
│ Yes ───┴─── No │
│ │ │ │
│ ┌─────────────────────▼─┐ ┌──▼───────────────────────────┐ │
│ │ 3. ImproverService │ │ │ │
│ │ Rewrites using ├───►──┤ FINAL RESPONSE │ │
│ │ Evaluation Feedback │ │ │ │
│ └───────────────────────┘ └──────────────────────────────┘ │
└────────────────────────────────┬────────────────────────────────────┘
│ HTTP JSON
┌────────────────────────────────▼────────────────────────────────────┐
│ CLIENT DASHBOARD / API │
│ { "initial_response": "...", "final_score": 9.6, "status": ...} │
└─────────────────────────────────────────────────────────────────────┘
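The decision flow in the diagram can be sketched as a plain control loop. This is a hedged sketch with stubbed service callables standing in for the real LLM calls; status strings other than `"Success - Initial Pass"` are illustrative, not taken from the actual orchestrator:

```python
# Sketch of the orchestration loop from the diagram above.
THRESHOLD = 7.0    # EVAL_SCORE_THRESHOLD
MAX_RETRIES = 2    # MAX_IMPROVEMENT_RETRIES

def _avg(scores: dict) -> float:
    return sum(scores[k] for k in ("correctness", "clarity", "reasoning")) / 3

def orchestrate(query, generate, evaluate, improve):
    """Generate -> judge -> (optionally) improve, mirroring the diagram."""
    initial = response = generate(query)
    scores = evaluate(query, response)
    avg = _avg(scores)
    iterations, attempts = [], 0
    while avg < THRESHOLD and attempts < MAX_RETRIES:
        # Inject the judge's feedback into the rewrite step.
        response = improve(query, response, scores["feedback"])
        scores = evaluate(query, response)
        avg = _avg(scores)
        attempts += 1
        iterations.append({"attempt": attempts, "score": avg})
    # "Success - Improved" is a placeholder label for the refinement path.
    status = "Success - Initial Pass" if not iterations else "Success - Improved"
    return {
        "initial_response": initial,
        "improved_response": response if iterations else None,
        "final_score": round(avg, 3),
        "iterations": iterations,
        "status": status,
    }
```

A response that clears the threshold on the first pass short-circuits the loop, which is why `improved_response` is `null` and `iterations` is empty in the example response below.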
# Create a virtual environment (recommended)
python -m venv venv
.\venv\Scripts\activate # Windows
source venv/bin/activate # macOS / Linux
pip install -r requirements.txt

Copy the configuration setup and add your API keys:
# Update the .env file in the root
GEMINI_API_KEY=your_gemini_api_key_here
EVAL_SCORE_THRESHOLD=7.0
MAX_IMPROVEMENT_RETRIES=2

Start the server:

python run.py
# Server runs on: http://127.0.0.1:8000
# API docs: http://localhost:8000/docs

POST /evaluate: Submit a query for LLM generation, scoring, and potential autonomous refinement. Request body:
{
"query": "Can you explain the theory of relativity briefly?",
"context": "Focus on general relativity"
}

Response (Example of an Initial Pass):
{
"initial_response": "General Relativity is...",
"initial_scores": {
"correctness": 10.0,
"clarity": 9.0,
"reasoning": 10.0,
"feedback": "The response provides an exceptionally accurate explanation..."
},
"improved_response": null,
"final_score": 9.666,
"iterations": [],
"status": "Success - Initial Pass"
}

Check the health status of the Evaluator.
{
"status": "Evalynx API is running"
}

Evalynx/
├── core/
│ ├── config.py # Pydantic BaseSettings environment loader
│ └── logger.py # Loguru structured tracing
├── models/
│ └── schemas.py # Pydantic v2 validation constraints mapping
├── services/
│ ├── llm_client.py # Google Gemini SDK wrapper interface
│ ├── generator.py # Initial LLM target answer generation
│ ├── evaluator.py # LLM-as-judge strict JSON evaluator
│ ├── improver.py # Auto-correction and instruction rewriting
│ └── orchestrator.py # State machine controlling retry loops
├── .env # Local Environment secrets
├── .gitignore # Secrets/Caching ignore tracking
├── main.py # FastAPI Application Factory
├── requirements.txt # Standardized Dependencies
├── run.py # Default Uvicorn server launcher
└── test_api.py # Sandbox client for rapid testing
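A minimal sandbox client in the spirit of `test_api.py` might look like the following. This is a stdlib-only sketch under assumptions: the endpoint path and payload shape come from the examples above, and the `evaluate` helper name is illustrative.

```python
# Hedged sketch of a sandbox client for the /evaluate endpoint.
import json
import urllib.request
from typing import Optional

BASE_URL = "http://127.0.0.1:8000"

def build_payload(query: str, context: Optional[str] = None) -> dict:
    """Assemble the request body shown in the API example above."""
    payload = {"query": query}
    if context:
        payload["context"] = context
    return payload

def evaluate(query: str, context: Optional[str] = None) -> dict:
    """POST the query to /evaluate; requires the server to be running."""
    req = urllib.request.Request(
        f"{BASE_URL}/evaluate",
        data=json.dumps(build_payload(query, context)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    result = evaluate("Can you explain the theory of relativity briefly?",
                      "Focus on general relativity")
    print(result["final_score"], result["status"])
```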
- Why Google Gemini? Gemini 2.5 Flash natively supports enforcing structured output against a Pydantic schema (`response_schema`), so the LLM-as-judge returns well-formed JSON floats and strings without manual regex parsing.
- Why Pydantic Settings? Allows strict type definitions for `.env` files, preventing strings from creeping into `float` operations (like the threshold scoring limit).
- Why separated modular services? The Generative AI ecosystem moves quickly. Because SDK calls are isolated in `llm_client.py`, migrating from Google Gemini to OpenAI structured outputs requires editing a single wrapper file instead of rebuilding the pipeline logic.
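The wrapper idea can be illustrated with an abstract interface. The names below are hypothetical, not the actual `llm_client.py` API; the point is that generator, evaluator, and improver depend on the interface, so only the concrete client changes when the provider does:

```python
# Hedged sketch of a provider-agnostic client interface; method names
# are illustrative, not the real llm_client.py surface.
from abc import ABC, abstractmethod

class LLMClient(ABC):
    @abstractmethod
    def complete(self, prompt: str) -> str:
        """Return raw text for a prompt."""

    @abstractmethod
    def complete_structured(self, prompt: str, schema: type) -> dict:
        """Return JSON conforming to a schema (e.g. Gemini response_schema)."""

class FakeClient(LLMClient):
    """Offline stand-in used here instead of a real Gemini/OpenAI client."""
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"

    def complete_structured(self, prompt: str, schema: type) -> dict:
        return {"correctness": 8.0, "clarity": 8.0, "reasoning": 8.0,
                "feedback": "stub"}
```

Swapping providers then means writing one new `LLMClient` subclass; the orchestrator, evaluator, and improver are untouched.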
- Fork the repository
- Create a feature branch: `git checkout -b feature/your-feature`
- Commit your changes: `git commit -m 'feat: add your feature'`
- Push and open a Pull Request
MIT License.

Built as an AI application demonstrating production-grade evaluation pipelines.