Production-grade LLM evaluation and improvement system. Generates responses, evaluates them via LLM-as-judge scoring, and iteratively refines them until they meet a quality threshold. · FastAPI backend · Pydantic validation · Google Gemini integration.
Building scalable Generative AI applications requires robust evaluation pipelines, and static testing alone cannot handle the non-deterministic nature of LLMs. Evalynx addresses this with an autonomous "LLM-as-judge" pipeline: it captures an initial response, critiques it across three distinct axes (Correctness, Clarity, and Reasoning), and autonomously rewrites any response whose average score falls below the target confidence threshold, so that only high-quality answers are returned.
- Intelligent Orchestration: Fully automated dual-LLM generation and evaluation pipeline.
- Strict Scoring Mechanism: Enforces strict JSON-bound scoring for `correctness`, `clarity`, and `reasoning` (scored 1-10).
- Auto-Refinement Loop: Automatically injects negative feedback into an `ImproverService` if the average score falls below a defined threshold (7.0).
- Trace Logging: Complete structured logs to track API latencies and specific trace iterations via `loguru`.
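The scoring contract could be modeled with Pydantic roughly like this (a sketch only: field names are illustrative, and the real models live in `models/schemas.py`):

```python
# Hedged sketch of an evaluation-score model; field names are assumptions,
# not the actual contents of models/schemas.py.
from pydantic import BaseModel, Field

class EvaluationScores(BaseModel):
    correctness: float = Field(ge=1, le=10)  # each axis bounded to 1-10
    clarity: float = Field(ge=1, le=10)
    reasoning: float = Field(ge=1, le=10)
    feedback: str                            # judge's written critique

    @property
    def average(self) -> float:
        # Mean of the three axes, compared against the 7.0 threshold.
        return (self.correctness + self.clarity + self.reasoning) / 3

scores = EvaluationScores(correctness=10, clarity=9, reasoning=10,
                          feedback="Accurate and clear.")
print(round(scores.average, 3))  # 9.667
```

Because the fields are bounded with `Field(ge=1, le=10)`, an out-of-range score from the judge fails validation instead of silently skewing the average.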
┌─────────────────────────────────────────────────────────────────────┐
│ USER REQUEST (FastAPI) │
│ POST /evaluate ──► query: "Explain relativity" │
└────────────────────────────────┬────────────────────────────────────┘
│
┌────────────────────────────────▼────────────────────────────────────┐
│ EVALYNX ORCHESTRATOR LAYER │
│ │
│ 1. GeneratorService ──► Generates Initial Response │
│ │ │
│ 2. EvaluatorService ──► LLM-as-Judge (Structured JSON Score) │
│ │ │
│ [ Average Score < 7.0 ] │
│ Yes ───┴─── No │
│ │ │ │
│ ┌─────────────────────▼─┐ ┌──▼───────────────────────────┐ │
│ │ 3. ImproverService │ │ │ │
│ │ Rewrites using ├───►──┤ FINAL RESPONSE │ │
│ │ Evaluation Feedback │ │ │ │
│ └───────────────────────┘ └──────────────────────────────┘ │
└────────────────────────────────┬────────────────────────────────────┘
│ HTTP JSON
┌────────────────────────────────▼────────────────────────────────────┐
│ CLIENT DASHBOARD / API │
│ { "initial_response": "...", "final_score": 9.6, "status": ...} │
└─────────────────────────────────────────────────────────────────────┘
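The decision flow in the diagram can be sketched as a plain control loop. This is a hedged sketch with stubbed service callables standing in for the real LLM calls; status strings other than `"Success - Initial Pass"` are illustrative, not taken from the actual orchestrator:

```python
# Sketch of the orchestration loop from the diagram above.
THRESHOLD = 7.0    # EVAL_SCORE_THRESHOLD
MAX_RETRIES = 2    # MAX_IMPROVEMENT_RETRIES

def _avg(scores: dict) -> float:
    return sum(scores[k] for k in ("correctness", "clarity", "reasoning")) / 3

def orchestrate(query, generate, evaluate, improve):
    """Generate -> judge -> (optionally) improve, mirroring the diagram."""
    initial = response = generate(query)
    scores = evaluate(query, response)
    avg = _avg(scores)
    iterations, attempts = [], 0
    while avg < THRESHOLD and attempts < MAX_RETRIES:
        # Inject the judge's feedback into the rewrite step.
        response = improve(query, response, scores["feedback"])
        scores = evaluate(query, response)
        avg = _avg(scores)
        attempts += 1
        iterations.append({"attempt": attempts, "score": avg})
    # "Success - Improved" is a placeholder label for the refinement path.
    status = "Success - Initial Pass" if not iterations else "Success - Improved"
    return {
        "initial_response": initial,
        "improved_response": response if iterations else None,
        "final_score": round(avg, 3),
        "iterations": iterations,
        "status": status,
    }
```

A response that clears the threshold on the first pass short-circuits the loop, which is why `improved_response` is `null` and `iterations` is empty in the example response below.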
# Create a virtual environment (recommended)
python -m venv venv
.\venv\Scripts\activate # Windows
source venv/bin/activate # macOS / Linux
pip install -r requirements.txt

Copy the configuration setup and add your API keys:
# Update the .env file in the root
GEMINI_API_KEY=your_gemini_api_key_here
EVAL_SCORE_THRESHOLD=7.0
MAX_IMPROVEMENT_RETRIES=2

Start the server:

python run.py
# Server runs on: http://127.0.0.1:8000
# API docs: http://localhost:8000/docs

POST /evaluate: Submit a query for LLM generation, scoring, and potential autonomous refinement. Request body:
{
"query": "Can you explain the theory of relativity briefly?",
"context": "Focus on general relativity"
}

Response (Example of an Initial Pass):
{
"initial_response": "General Relativity is...",
"initial_scores": {
"correctness": 10.0,
"clarity": 9.0,
"reasoning": 10.0,
"feedback": "The response provides an exceptionally accurate explanation..."
},
"improved_response": null,
"final_score": 9.666,
"iterations": [],
"status": "Success - Initial Pass"
}

Check the health status of the Evaluator.
{
"status": "Evalynx API is running"
}

Evalynx/
├── core/
│ ├── config.py # Pydantic BaseSettings environment loader
│ └── logger.py # Loguru structured tracing
├── models/
│ └── schemas.py # Pydantic v2 validation constraints mapping
├── services/
│ ├── llm_client.py # Google Gemini SDK wrapper interface
│ ├── generator.py # Initial LLM target answer generation
│ ├── evaluator.py # LLM-as-judge strict JSON evaluator
│ ├── improver.py # Auto-correction and instruction rewriting
│ └── orchestrator.py # State machine controlling retry loops
├── .env # Local Environment secrets
├── .gitignore # Secrets/Caching ignore tracking
├── main.py # FastAPI Application Factory
├── requirements.txt # Standardized Dependencies
├── run.py # Default Uvicorn server launcher
└── test_api.py # Sandbox client for rapid testing
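A minimal sandbox client in the spirit of `test_api.py` might look like the following. This is a stdlib-only sketch under assumptions: the endpoint path and payload shape come from the examples above, and the `evaluate` helper name is illustrative.

```python
# Hedged sketch of a sandbox client for the /evaluate endpoint.
import json
import urllib.request
from typing import Optional

BASE_URL = "http://127.0.0.1:8000"

def build_payload(query: str, context: Optional[str] = None) -> dict:
    """Assemble the request body shown in the API example above."""
    payload = {"query": query}
    if context:
        payload["context"] = context
    return payload

def evaluate(query: str, context: Optional[str] = None) -> dict:
    """POST the query to /evaluate; requires the server to be running."""
    req = urllib.request.Request(
        f"{BASE_URL}/evaluate",
        data=json.dumps(build_payload(query, context)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    result = evaluate("Can you explain the theory of relativity briefly?",
                      "Focus on general relativity")
    print(result["final_score"], result["status"])
```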
- Why Google Gemini? Gemini 2.5 Flash natively supports enforcing structured output against a Pydantic schema (`response_schema`), so the LLM-as-judge returns well-formed JSON floats and strings without manual regex parsing.
- Why Pydantic Settings? Allows strict type definitions for `.env` files, preventing strings from creeping into `float` operations (like the threshold scoring limit).
- Why separated modular services? The Generative AI ecosystem moves quickly. Because SDK calls are isolated in `llm_client.py`, migrating from Google Gemini to OpenAI structured outputs requires editing a single wrapper file instead of rebuilding the pipeline logic.
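The wrapper idea can be illustrated with an abstract interface. The names below are hypothetical, not the actual `llm_client.py` API; the point is that generator, evaluator, and improver depend on the interface, so only the concrete client changes when the provider does:

```python
# Hedged sketch of a provider-agnostic client interface; method names
# are illustrative, not the real llm_client.py surface.
from abc import ABC, abstractmethod

class LLMClient(ABC):
    @abstractmethod
    def complete(self, prompt: str) -> str:
        """Return raw text for a prompt."""

    @abstractmethod
    def complete_structured(self, prompt: str, schema: type) -> dict:
        """Return JSON conforming to a schema (e.g. Gemini response_schema)."""

class FakeClient(LLMClient):
    """Offline stand-in used here instead of a real Gemini/OpenAI client."""
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"

    def complete_structured(self, prompt: str, schema: type) -> dict:
        return {"correctness": 8.0, "clarity": 8.0, "reasoning": 8.0,
                "feedback": "stub"}
```

Swapping providers then means writing one new `LLMClient` subclass; the orchestrator, evaluator, and improver are untouched.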
- Fork the repository
- Create a feature branch: `git checkout -b feature/your-feature`
- Commit your changes: `git commit -m 'feat: add your feature'`
- Push and open a Pull Request
MIT License.

Built as an AI application demonstrating production-grade evaluation pipelines.