Evalynx

Production-grade LLM evaluation and improvement system. Generates responses, evaluates them with LLM-as-judge scoring, and iteratively refines them for accuracy. · FastAPI backend · Pydantic validation · Google Gemini integration.

Why This Project Matters

Building scalable Generative AI applications requires robust evaluation pipelines; static testing alone cannot handle the non-deterministic nature of LLMs. Evalynx addresses this with an autonomous "LLM-as-judge" pipeline: it captures an initial response, critiques it across three distinct axes—Correctness, Clarity, and Reasoning—and returns only answers that meet a target confidence threshold, autonomously rewriting responses that fall below it.

✨ Key Features

  • Intelligent Orchestration: Fully automated dual-LLM generation and evaluation pipeline.
  • Strict Scoring Mechanism: Enforces JSON-bound scoring for correctness, clarity, and reasoning on a 1–10 scale.
  • Auto-Refinement Loop: Automatically injects negative feedback into an ImproverService if the average score falls below a defined threshold (7.0).
  • Trace Logging: Structured logs via loguru track API latencies and individual refinement iterations.
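The scoring contract behind these features can be sketched without Pydantic as a small dataclass that enforces the 1–10 bounds and the 7.0 pass threshold (names here are illustrative, not the repo's actual `schemas.py` classes):

```python
from dataclasses import dataclass

PASS_THRESHOLD = 7.0  # mirrors EVAL_SCORE_THRESHOLD in .env


@dataclass(frozen=True)
class EvaluationScore:
    """One LLM-as-judge verdict across the three axes (each 1-10)."""
    correctness: float
    clarity: float
    reasoning: float
    feedback: str = ""

    def __post_init__(self) -> None:
        # Reject out-of-range scores up front, before any averaging.
        for axis in ("correctness", "clarity", "reasoning"):
            value = getattr(self, axis)
            if not 1.0 <= value <= 10.0:
                raise ValueError(f"{axis} must be in [1, 10], got {value}")

    @property
    def average(self) -> float:
        return (self.correctness + self.clarity + self.reasoning) / 3

    def needs_improvement(self) -> bool:
        """True when the average falls below the refinement threshold."""
        return self.average < PASS_THRESHOLD


score = EvaluationScore(correctness=10.0, clarity=9.0, reasoning=10.0)
print(round(score.average, 3))    # 9.667
print(score.needs_improvement())  # False
```

In the real service the same bounds live in the Pydantic v2 models, which additionally validate the judge's raw JSON at parse time.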

🏗 Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                       USER REQUEST (FastAPI)                        │
│   POST /evaluate  ──►  query: "Explain relativity"                  │
└────────────────────────────────┬────────────────────────────────────┘
                                 │
┌────────────────────────────────▼────────────────────────────────────┐
│                    EVALYNX ORCHESTRATOR LAYER                       │
│                                                                     │
│  1. GeneratorService ──► Generates Initial Response                 │
│                                │                                    │
│  2. EvaluatorService ──► LLM-as-Judge (Structured JSON Score)       │
│                                │                                    │
│                       [ Average Score < 7.0 ]                       │
│                        Yes ───┴─── No                               │
│                         │           │                               │
│   ┌─────────────────────▼─┐      ┌──▼───────────────────────────┐   │
│   │ 3. ImproverService    │      │                              │   │
│   │ Rewrites using        ├───►──┤       FINAL RESPONSE         │   │
│   │ Evaluation Feedback   │      │                              │   │
│   └───────────────────────┘      └──────────────────────────────┘   │
└────────────────────────────────┬────────────────────────────────────┘
                                 │ HTTP JSON
┌────────────────────────────────▼────────────────────────────────────┐
│                       CLIENT DASHBOARD / API                        │
│   { "initial_response": "...", "final_score": 9.6, "status": ...}   │
└─────────────────────────────────────────────────────────────────────┘
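The diagram's control flow boils down to a retry loop. A minimal sketch, with `generate`, `evaluate`, and `improve` standing in for the real services (this is an illustration, not the repo's `orchestrator.py`):

```python
from typing import Callable

THRESHOLD = 7.0    # mirrors EVAL_SCORE_THRESHOLD
MAX_RETRIES = 2    # mirrors MAX_IMPROVEMENT_RETRIES


def _average(scores: dict) -> float:
    return sum(scores[k] for k in ("correctness", "clarity", "reasoning")) / 3


def orchestrate(query: str,
                generate: Callable[[str], str],
                evaluate: Callable[[str, str], dict],
                improve: Callable[[str, str, dict], str]) -> dict:
    """Generate -> judge -> (optionally) improve, as in the diagram above."""
    response = generate(query)
    scores = evaluate(query, response)
    avg = _average(scores)
    iterations = []
    retries = 0
    while avg < THRESHOLD and retries < MAX_RETRIES:
        # Feed the judge's feedback back into a rewrite attempt.
        response = improve(query, response, scores)
        scores = evaluate(query, response)
        avg = _average(scores)
        iterations.append({"attempt": retries + 1, "score": avg})
        retries += 1
    status = "Success - Initial Pass" if not iterations else "Success - Improved"
    return {"final_response": response, "final_score": round(avg, 3),
            "iterations": iterations, "status": status}
```

In production the three callables wrap GeneratorService, EvaluatorService, and ImproverService, each of which calls Gemini through the shared `llm_client.py` wrapper.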

1 — Install dependencies

# Create a virtual environment (recommended)
python -m venv venv
.\venv\Scripts\activate      # Windows
source venv/bin/activate     # macOS / Linux

pip install -r requirements.txt

2 — Configure Environment

Create a .env file in the project root and add your API key:

# Update the .env file in the root
GEMINI_API_KEY=your_gemini_api_key_here
EVAL_SCORE_THRESHOLD=7.0
MAX_IMPROVEMENT_RETRIES=2
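The repo loads these through Pydantic BaseSettings (`core/config.py`); the point is typed coercion, so the threshold arrives as a float rather than the string "7.0". A stdlib-only sketch of that behavior, assuming the variable names above:

```python
import os


def load_settings(environ=os.environ) -> dict:
    """Read .env-style variables with explicit type coercion, the way
    BaseSettings does it declaratively."""
    return {
        "gemini_api_key": environ["GEMINI_API_KEY"],  # required, no default
        "eval_score_threshold": float(environ.get("EVAL_SCORE_THRESHOLD", "7.0")),
        "max_improvement_retries": int(environ.get("MAX_IMPROVEMENT_RETRIES", "2")),
    }
```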

3 — Start the API

python run.py 
# Server runs on: http://127.0.0.1:8000
# API docs: http://localhost:8000/docs

📡 API Endpoints

POST /evaluate

Submit a query for LLM generation, scoring, and potential autonomous refinement. Request body:

{
  "query": "Can you explain the theory of relativity briefly?",
  "context": "Focus on general relativity"
}
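An equivalent client call using only the standard library (the URL assumes the default port from run.py):

```python
import json
import urllib.request

payload = {
    "query": "Can you explain the theory of relativity briefly?",
    "context": "Focus on general relativity",
}


def post_evaluate(payload: dict, base_url: str = "http://127.0.0.1:8000") -> dict:
    """POST the query to /evaluate and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/evaluate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# result = post_evaluate(payload)       # requires the server from step 3 to be running
# print(result["final_score"])
```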

Response (Example of an Initial Pass):

{
  "initial_response": "General Relativity is...",
  "initial_scores": {
    "correctness": 10.0,
    "clarity": 9.0,
    "reasoning": 10.0,
    "feedback": "The response provides an exceptionally accurate explanation..."
  },
  "improved_response": null,
  "final_score": 9.666,
  "iterations": [],
  "status": "Success - Initial Pass"
}
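Assuming `final_score` is the unweighted mean of the three axes (the README's 9.666 looks truncated rather than rounded), the example is reproducible by hand:

```python
scores = {"correctness": 10.0, "clarity": 9.0, "reasoning": 10.0}
final_score = sum(scores.values()) / len(scores)  # 29 / 3
print(f"{final_score:.3f}")  # 9.667
```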

GET /

Health check; confirms the API is running.

{
  "status": "Evalynx API is running"
}

🗂 Project Structure

Evalynx/
├── core/
│   ├── config.py                 # Pydantic BaseSettings environment loader
│   └── logger.py                 # Loguru structured tracing
├── models/
│   └── schemas.py                # Pydantic v2 validation constraints mapping
├── services/
│   ├── llm_client.py             # Google Gemini SDK wrapper interface
│   ├── generator.py              # Initial LLM target answer generation
│   ├── evaluator.py              # LLM-as-judge strict JSON evaluator
│   ├── improver.py               # Auto-correction and instruction rewriting
│   └── orchestrator.py           # State machine controlling retry loops
├── .env                          # Local Environment secrets
├── .gitignore                    # Secrets/Caching ignore tracking
├── main.py                       # FastAPI Application Factory
├── requirements.txt              # Standardized Dependencies
├── run.py                        # Default Uvicorn server launcher
└── test_api.py                   # Sandbox client for rapid testing

📐 Design Decisions

  • Why Google Gemini? Gemini 2.5 Flash natively supports constrained structured output (response_schema), so the LLM-as-judge returns well-formed JSON floats and strings without manual regex parsing.
  • Why Pydantic Settings? Strict type definitions for .env values prevent strings from leaking into float operations such as the scoring-threshold comparison.
  • Why separated modular services? The Generative AI ecosystem moves quickly. Because the SDK is isolated in llm_client.py, migrating from Google Gemini to, say, OpenAI structured outputs means editing a single wrapper file instead of rebuilding the pipeline logic.
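Even with response_schema enforcing structure at generation time, the evaluator still validates the judge's reply on receipt. A stdlib-only sketch of that parse-and-validate step (field names illustrative):

```python
import json

# Expected fields of a judge verdict and their target types.
REQUIRED = {"correctness": float, "clarity": float, "reasoning": float,
            "feedback": str}


def parse_judge_reply(raw: str) -> dict:
    """Parse the judge's JSON and re-check the contract that
    response_schema already enforced; no regex extraction needed."""
    data = json.loads(raw)
    for key, typ in REQUIRED.items():
        if key not in data:
            raise ValueError(f"missing field: {key}")
        if typ is float:
            data[key] = float(data[key])
            if not 1.0 <= data[key] <= 10.0:
                raise ValueError(f"{key} out of range: {data[key]}")
    return data


reply = '{"correctness": 10, "clarity": 9, "reasoning": 10, "feedback": "Accurate."}'
print(parse_judge_reply(reply)["clarity"])  # 9.0
```

In the repo this role is played by Pydantic v2 model validation, which handles the same coercion and range checks declaratively.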

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature
  3. Commit your changes: git commit -m 'feat: add your feature'
  4. Push and open a Pull Request

📄 License

MIT License. Built as an AI application demonstrating production-grade evaluation pipelines.
