Learn from LLM mistakes and automatically generate context-aware prompts that inject relevant error reminders, improving response quality without retraining.
This repository contains three integrated systems for improving LLM performance through dynamic prompt engineering:
- autoeval - Evaluates an LLM against golden references and identifies weaknesses
- optimizer - Extracts error patterns and builds a searchable pattern database
- router - Production API that tailors dynamic prompts per question
Core Idea: Learn from past mistakes → Store error patterns → Dynamically inject relevant reminders into prompts → LLM avoids repeating the same errors.
| Metric | Baseline (v1.0) | Router-Optimized | Improvement |
|---|---|---|---|
| Overall Score | 4.13/5.0 | ~4.4-4.5/5.0 | +6-9% ✨ |
| Completeness | 3.86/5.0 | ~4.3/5.0 | +11% (biggest gain) |
| Accuracy | 4.21/5.0 | ~4.5/5.0 | +7% |
| Error Rate | 41 errors / 14 Q&A pairs | ~25-30 errors / 14 Q&A pairs | -27% to -39% |
How Router Improves Quality:
- 534 learned error patterns, with the most relevant ones retrieved automatically per question
- Tiered routing: Weakness matching → Pattern retrieval → Category rules
- Runtime cost: +1-2% (pattern retrieval is a fast FAISS vector search)
- ROI: 3-4.5x improvement per unit cost
Key Insight: Completeness (missing information) is the #1 issue. Router specifically targets this by injecting reminders about commonly-missed details from past errors.
```
┌───────────────────────────────────────────────────────────┐
│ 1. autoeval/                                              │
│   • Samples entities from golden medical references       │
│   • Generates test questions (OpenAI)                     │
│   • Gets answers from target LLM (e.g., DeepSeek)         │
│   • Evaluates quality (LLM judge + golden-ref lookup)     │
│   • Outputs: evaluation reports with scores & errors      │
└────────────────┬──────────────────────────────────────────┘
                 │
                 ▼
┌───────────────────────────────────────────────────────────┐
│ 2. optimizer/                                             │
│   • Extracts error patterns from evaluation reports       │
│   • Generates improved prompts (v1.0 → v1.1 → v1.2...)    │
│   • Retrieves relevant patterns per question dynamically  │
│   • Outputs: versioned prompts, pattern vector store      │
└────────────────┬──────────────────────────────────────────┘
                 │
                 ▼
┌───────────────────────────────────────────────────────────┐
│ 3. router/                                                │
│   • Serves OpenAI-compatible API                          │
│   • Matches questions to known weakness patterns          │
│   • Injects relevant error reminders into prompts         │
│   • Routes to DeepSeek with enhanced system prompt        │
│   • Outputs: streaming responses + routing metadata       │
└───────────────────────────────────────────────────────────┘
```
```
.
├── autoeval/           # Auto-evaluation system
│   ├── config/         # Settings, prompts, presets
│   ├── core/           # Data models, loading, sampling
│   ├── scripts/        # evaluate.py (main entry point)
│   ├── services/       # Question/answer/evaluation services
│   └── utils/          # JSON parser, reporting
│
├── optimizer/          # Prompt optimization system
│   ├── core/           # Pattern analysis, storage, optimization
│   ├── pattern_db/     # Vector database for pattern storage & retrieval
│   └── scripts/        # optimize.py (main entry point)
│
├── router/             # Smart routing system
│   ├── api/            # FastAPI application
│   ├── core/           # Decision engine, weakness matcher
│   ├── scripts/        # serve_router.py, testing scripts
│   └── services/       # LLM client integrations
│
├── tools/              # Standalone utilities
│   ├── analyze_patterns.py
│   ├── build_weakness_patterns.py
│   ├── monitor_performance.py
│   ├── optimize_threshold.py
│   ├── list_reports.py
│   └── cleanup_repo.sh
│
├── refs/               # Golden reference data (medical CSVs)
│   └── golden-refs/    # 5,730+ medical entities
│
└── outputs/            # Generated reports, prompts, cache
```
```bash
# Install dependencies
pip install -r autoeval/requirements.txt
pip install -r optimizer/requirements.txt
pip install -r router/requirements.txt

# Configure API keys
cp .env.example .env
# Edit .env and add:
#   OPENAI_API_KEY=...    # For question generation & evaluation
#   DEEPSEEK_API_KEY=...  # For target LLM testing
```
Purpose: Evaluate your LLM against golden medical references and identify weaknesses.
```bash
# Run evaluation (10 entities, 3 questions each = 30 Q&A pairs)
python autoeval/scripts/evaluate.py --sample-size=10

# Custom configuration
python autoeval/scripts/evaluate.py \
    --sample-size=50 \
    --questions-per-entity=5

# Compare baseline vs optimized
python autoeval/scripts/evaluate.py --compare-mode
```
What happens:
- Samples 10 medical entities from `refs/golden-refs/`
- Generates 30 questions using the OpenAI API
- Gets 30 answers from DeepSeek
- Evaluates each answer by direct lookup against the golden reference
- Identifies error patterns and knowledge gaps
Outputs:
- `outputs/reports/{eval_id}/report.json` - Detailed evaluation
- `outputs/reports/{eval_id}/report.md` - Human-readable report
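If you want to post-process results yourself, the JSON report loads directly. A minimal sketch, assuming report directories sort chronologically by name (no assumptions are made about the report schema itself):

```python
# Minimal sketch: open the most recent evaluation report and list its fields.
import json
from pathlib import Path

report_dirs = sorted(p for p in Path("outputs/reports").iterdir() if p.is_dir())
latest = report_dirs[-1] / "report.json"  # assumes name order == time order
data = json.loads(latest.read_text(encoding="utf-8"))

print(f"Report: {latest}")
print("Top-level fields:", list(data))
```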
Purpose: Extract error patterns and build a searchable database for dynamic prompt assembly.
```bash
# Optimize using the latest evaluation
python optimizer/scripts/optimize.py

# Optimize using a specific evaluation
python optimizer/scripts/optimize.py --eval-report=eval_20251228_001

# Show statistics
python optimizer/scripts/optimize.py --stats

# List all stored patterns
python optimizer/scripts/optimize.py --list-patterns
```
What happens:
- Loads evaluation report from autoeval
- Analyzes error patterns (PatternAnalyzer)
- Stores patterns in vector database (FAISS + embeddings)
- Generates improved prompt version (v1.0 → v1.1)
- Enables retrieval of relevant patterns per question at runtime
Outputs:
- `outputs/prompts/deepseek_system_v1.1.yaml` - Improved prompt
- `outputs/cache/error_patterns/` - Pattern storage (JSON + FAISS index)
Example Output:
```text
Optimization Summary:
  Previous version: 1.0
  New version: 1.1
  Total patterns in RAG: 23
  Pattern categories: 4

Top Improvements:
  1. When answering about chronic diseases, include prevention measures
  2. For dietary questions, provide specific food examples
  3. Explain testing procedures step-by-step
```
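Under the hood, pattern storage and retrieval is a standard embed-and-search loop. A minimal sketch of the mechanism, assuming sentence-transformers embeddings and a flat inner-product FAISS index; the repo's actual model and index settings may differ:

```python
# Minimal sketch of pattern storage + retrieval. The embedding model
# ("all-MiniLM-L6-v2") and flat index are assumptions, not the repo's settings.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

patterns = [
    "When answering about chronic diseases, include prevention measures",
    "For dietary questions, provide specific food examples",
    "Explain testing procedures step-by-step",
]

# Embed and L2-normalize so inner product equals cosine similarity.
vecs = model.encode(patterns, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(vecs.shape[1])
index.add(vecs)

# At runtime, retrieve the patterns most relevant to an incoming question.
query = model.encode(["What should diabetics eat?"],
                     normalize_embeddings=True).astype("float32")
scores, ids = index.search(query, 2)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.2f}  {patterns[i]}")
```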
Key Innovation: Dynamic Prompt Engineering (Not Traditional RAG)
- Not RAG: Doesn't retrieve external knowledge documents
- Not Fine-tuning: Doesn't retrain model weights
- Dynamic Prompts: Selects & combines relevant error reminders per question
- How: Base prompt + retrieved error patterns = customized prompt per request (see the sketch after this list)
- Benefit: Scales to 1000s of learned patterns without prompt bloat
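To make the assembly step concrete, here is a minimal sketch with hypothetical names (the real logic lives in optimizer/ and router/):

```python
# Minimal sketch of per-request prompt assembly; function and parameter
# names are illustrative, not the repo's actual API.
def build_system_prompt(base_prompt: str,
                        retrieved_patterns: list[str],
                        max_patterns: int = 3) -> str:
    """Combine the base optimized prompt with relevant error reminders."""
    if not retrieved_patterns:
        return base_prompt  # no matches: fall back to the base prompt
    reminders = "\n".join(f"- {p}" for p in retrieved_patterns[:max_patterns])
    return (f"{base_prompt}\n\n"
            f"Based on past errors on similar questions, also remember to:\n"
            f"{reminders}")
```

Because only the top few reminders are ever injected, prompts stay small even as the pattern store grows to thousands of entries.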
Purpose: Production API that dynamically customizes prompts per question using learned error patterns.
```bash
# Start router API
python router/scripts/serve_router.py --port=8000

# Test and compare
python router/scripts/test_router_llm_api.py
python router/scripts/compare_baseline_vs_router.py
python router/scripts/ab_test_extended.py
```
What happens:
- Loads optimized prompts and the weakness pattern database
- Serves an OpenAI-compatible API at `http://localhost:8000`
- For each incoming question:
  - Searches for similar past error patterns
  - Injects relevant reminders into the system prompt
  - Forwards the enhanced request to the DeepSeek API
  - Returns the response with routing metadata
Example Usage:
```python
from openai import OpenAI

# Just change the base_url - that's it!
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-llm-api-key",
)

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "user", "content": "What is diabetes?"}
    ],
)
print(response.choices[0].message.content)

# Router automatically:
# - Searched: "What is diabetes?" → found past errors on chronic diseases
# - Injected: "Remember to include prevention measures" into the system prompt
# - Result: More complete answer than baseline DeepSeek
```
How it works (a code sketch follows the steps below):
For each question:
1. Search pattern database for similar past errors
2. If matches found → inject error reminders into system prompt
3. If no matches → use base optimized prompt
4. Forward to DeepSeek with enhanced prompt
5. Return response with metadata (patterns used, routing decision)
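A minimal sketch of those five steps; `pattern_db.search`, the threshold value, and all field names are illustrative assumptions, not the router's real internals:

```python
# Minimal sketch of the routing decision for one incoming question.
from dataclasses import dataclass

@dataclass
class Match:
    text: str     # the stored error reminder
    score: float  # similarity to the incoming question

def route(question: str, pattern_db, base_prompt: str,
          threshold: float = 0.75) -> tuple[str, dict]:
    matches: list[Match] = pattern_db.search(question, k=3)   # step 1
    used = [m for m in matches if m.score >= threshold]
    if used:                                                  # step 2
        reminders = "\n".join(f"- {m.text}" for m in used)
        prompt = f"{base_prompt}\n\nAvoid these past mistakes:\n{reminders}"
    else:                                                     # step 3
        prompt = base_prompt
    metadata = {                                              # step 5
        "patterns_used": [m.text for m in used],
        "routing_decision": "pattern_match" if used else "base_prompt",
    }
    return prompt, metadata  # prompt is then forwarded to DeepSeek (step 4)
```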
```bash
# Step 1: Run initial evaluation
python autoeval/scripts/evaluate.py --sample-size=100
# Output: report shows errors and weaknesses

# Step 2: Generate optimized prompt from errors
python optimizer/scripts/optimize.py
# Output: deepseek_system_v1.1.yaml with pattern database

# Step 3: Test improvement with A/B comparison
python router/scripts/compare_baseline_vs_router.py
# Compares baseline vs router-enhanced responses

# Step 4: Deploy production API
python router/scripts/serve_router.py
# Router serves requests with pattern-based enhancements

# Step 5: Continue learning (iterate steps 1-2)
python autoeval/scripts/evaluate.py --sample-size=50
python optimizer/scripts/optimize.py
# Accumulates more patterns over time
```
Additional tools for monitoring and analysis:
```bash
# Analyze pattern quality and find duplicates
python tools/analyze_patterns.py

# Build entity-specific weakness mappings
python tools/build_weakness_patterns.py

# Monitor performance and track metrics
python tools/monitor_performance.py

# Optimize retrieval thresholds
python tools/optimize_threshold.py

# List all evaluation reports
python tools/list_reports.py

# Clean up repository
bash tools/cleanup_repo.sh
```
The repository includes golden reference data:
- `疾病.csv` - 5,351 diseases
- `检查.csv` - 215 examinations
- `手术操作.csv` - 140 surgical procedures
- `疫苗.csv` - 24 vaccines
Total: 5,730 medical entities with detailed information (symptoms, causes, treatments, prevention, etc.)
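A quick way to verify these counts locally; only the file names above are taken from the repo, and no column schema is assumed:

```python
# Minimal sketch: count entities per golden-reference file with pandas.
import pandas as pd

for name in ["疾病", "检查", "手术操作", "疫苗"]:
    df = pd.read_csv(f"refs/golden-refs/{name}.csv")
    print(f"{name}.csv: {len(df)} rows, columns: {list(df.columns)}")
```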