A complete, working SQL Agent built without any agent framework — to understand exactly what every framework is hiding from you.
A natural language → SQL → answer system that takes a plain English question, figures out what SQL to write, executes it against a real database, self-corrects on failures, assesses its own confidence, and flags uncertain answers for human review.
The twist: built entirely in raw Python. No LangChain. No LlamaIndex. No openai-agents SDK. Every component — the loop, the tool dispatcher, the retry logic, the tracer, the eval harness — is written from scratch. This is intentional: once you build it yourself, you understand what every framework is actually doing.
This project is complete through Phase 5, including a Streamlit web UI.
If you can explain every file in this repo, you understand:
Agentic AI systems
- How the ReAct (Reason + Act) pattern works — not as an abstraction, but as actual Python code
- How tool calling works with the OpenAI Responses API: JSON schemas, function dispatch, `previous_response_id` conversation chaining
- How an agent maintains state across multiple LLM turns (AgentState dataclass)
- How self-correction works: error → feed error to LLM → retry loop → error budget exhaustion
- Why more tools = more complexity, and how to resist over-engineering
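The ReAct loop with an error budget can be sketched in a few lines. This is a minimal illustration, not the repo's actual `agent/loop.py`: `call_llm`, `run_sql`, and the tool-call dict shapes are hypothetical stand-ins for the real Responses API plumbing.

```python
# Minimal sketch of a ReAct loop with self-correction and an error budget.
# `call_llm` returns a tool call; `run_sql` returns {"ok", "rows"/"error"}.
MAX_SQL_ERRORS = 3

def react_loop(question, call_llm, run_sql):
    errors = 0
    messages = [{"role": "user", "content": question}]
    while True:
        action = call_llm(messages)
        if action["tool"] == "submit_answer":
            return {"status": "SUCCESS", "answer": action["args"]["answer"]}
        result = run_sql(action["args"]["sql"])
        if result["ok"]:
            # Success: feed rows back so the model can reason over them
            messages.append({"role": "tool", "content": str(result["rows"])})
        else:
            errors += 1
            if errors >= MAX_SQL_ERRORS:  # error budget exhausted
                return {"status": "FAILED", "answer": None}
            # Self-correction: feed the error text back and let the model retry
            messages.append({"role": "tool", "content": f"SQL error: {result['error']}"})
```

The whole pattern is just this while-loop: every framework's "agent executor" reduces to it.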
Safety and guardrails
- Defense in depth: two independent guardrail layers (natural language check + SQL check) plus a third OS-level read-only DB connection
- Why regex is the right tool for guardrails (instant, free) vs. when an LLM guardrail adds value
- How to block destructive intent in natural language vs. in generated SQL
- Why "belt and suspenders" — never rely on a single safety check
Evaluations
- The difference between execution accuracy (did SQL run?) and answer accuracy (was it right?)
- How to design a golden dataset with difficulty tiers and query categories
- Why evals must use the real LLM (mocking it only tests your plumbing, not your prompts)
- How to detect regressions: compare current run to last run, flag drops > 3%
- What self-correction rate tells you about base prompt quality
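The regression check described above is simple to state precisely. A minimal sketch, assuming metrics are stored as fractions in a dict (the real report format in `evals/runner.py` may differ):

```python
# Flag any metric that drops more than 3 percentage points vs. the previous run.
REGRESSION_THRESHOLD = 0.03

def detect_regressions(current: dict, previous: dict) -> list[str]:
    flagged = []
    for metric, value in current.items():
        prev = previous.get(metric)
        if prev is not None and prev - value > REGRESSION_THRESHOLD:
            flagged.append(metric)
    return flagged
```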
Observability
- The difference between a Span (one step) and a Trace (a full run) — the core OpenTelemetry concepts
- Why structured JSON logs beat print statements for production debugging
- How to compute latency percentiles (p50/p90/p95/p99) and what each tells you
- How to track token cost per run and project it to monthly spend
- Why slow query detection matters and how to implement it
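Percentiles are what make slow-query detection meaningful: a mean hides the one 2.4-second outlier that p95 exposes. A sketch using the nearest-rank method (the dashboard's exact interpolation may differ):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value at position ceil(p/100 * n), 1-indexed."""
    s = sorted(samples)
    k = max(1, math.ceil(p / 100 * len(s)))
    return s[k - 1]

# Ten illustrative latencies in ms: p50 is typical, p95 catches the outlier.
latencies = [120, 95, 340, 110, 2400, 130, 105, 99, 115, 125]
p50 = percentile(latencies, 50)   # 115
p95 = percentile(latencies, 95)   # 2400
```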
Production patterns
- Cache key design: `MD5(normalize(query) + schema_hash)` — automatic invalidation on schema change
- Why TTL should be checked at read time, not write time (config changes apply retroactively)
- Confidence self-assessment: why a separate LLM call gives better signal than asking the same model in-conversation
- Human-in-the-loop: when and how to escalate uncertain answers to a human reviewer
- JSONL as an append-only log format: O(1) writes, human-readable, no schema migrations
- Async Python: `asyncio`, `aiosqlite`, `asyncio.gather` for concurrent batch queries
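The cache-key and read-time-TTL patterns together fit in a dozen lines. A sketch, where `normalize` is a simplified stand-in for whatever normalization the real `cache.py` applies:

```python
import hashlib
import time

def normalize(query: str) -> str:
    # Simplified: lowercase and collapse whitespace so trivially
    # different phrasings hit the same cache entry.
    return " ".join(query.lower().split())

def cache_key(query: str, schema_hash: str) -> str:
    # Hashing the schema into the key means any schema change produces
    # new keys, so stale answers are never served: automatic invalidation.
    return hashlib.md5((normalize(query) + schema_hash).encode()).hexdigest()

def get(store: dict, key: str, ttl_seconds: int):
    # TTL checked at READ time: lowering the TTL in config immediately
    # expires old entries, no rewrite of the cache needed.
    entry = store.get(key)
    if entry is None or time.time() - entry["written_at"] > ttl_seconds:
        return None
    return entry["answer"]
```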
Software engineering
- Why dataclasses beat dicts for state (typos become `AttributeError`, not silent new keys)
- Never-raise executor contracts: always return a result object, let callers decide what to do
- Configuration as code: `pydantic-settings` with typed fields that fail loudly on missing keys
- Mock strategy: mock the LLM in integration tests, use real LLM in evals — different purposes
- When `@st.cache_data` and `st.session_state` matter in Streamlit
User question (natural language)
│
▼
┌──────────────────────────────────┐
│ Layer 1 Guardrail │ regex on raw English — zero tokens, instant
│ check_query(user_query) │ blocks: "delete all users", "drop the table"
└──────────────────────────────────┘
│ allowed=True
▼
┌──────────────────────────────────┐
│ SchemaInspector │ reads live DB via PRAGMA
│ format_for_prompt() │ formats as CREATE TABLE syntax for LLM
└──────────────────────────────────┘
│
▼
┌──────────────────────────────────┐
│ QueryCache.get() │ MD5(normalize(query) + schema_hash)
│ │ HIT → return cached answer (no LLM call)
└──────────────────────────────────┘
│ MISS
▼
┌──────────────────────────────────────────────────────────┐
│ ReAct Loop │
│ │
│ LLM call (Responses API, tool_choice enabled) │
│ ↓ │
│ Model calls execute_sql(sql=...) or submit_answer(...) │
│ ↓ │
│ execute_sql path: │
│ Layer 2 Guardrail: check_sql() — catches non-SELECT │
│ SQLExecutor.run() — read-only OS-level connection │
│ success → feed rows back to LLM (next turn) │
│ error → record error, feed error back (self-correct)│
│ ↓ │
│ submit_answer path: │
│ state.record_success() → loop stops │
│ │
│ Retry budget: max 3 SQL errors before FAILED │
│ Conversation chain: previous_response_id per turn │
└──────────────────────────────────────────────────────────┘
│ SUCCESS
▼
┌──────────────────────────────────┐
│ assess_confidence() │ fresh LLM call (no conversation history)
│ │ returns score 1-10 + reason
└──────────────────────────────────┘
│
▼
┌──────────────────────────────────┐
│ maybe_flag_for_review() │ score < threshold → append to JSONL queue
│ QueryCache.set() │ store answer for future identical queries
└──────────────────────────────────┘
│
▼
┌──────────────────────────────────┐
│ Trace.finish() + Trace.save() │ write structured JSON log to logs/
└──────────────────────────────────┘
│
▼
AgentState returned to caller
(final_answer, confidence_score, flagged_for_review, total_tokens_used)
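Two details of the tail end of this flow deserve code: the confidence score is parsed defensively, and a failed assessment yields `None` rather than a low number. A sketch of that parsing, assuming the fresh LLM call returns JSON like `{"score": 7, "reason": "..."}` (the real `confidence.py` prompt and shape may differ):

```python
import json

def parse_confidence(raw: str):
    """Parse a self-assessment response; clamp score to 1-10.

    Returns (None, "") on any parse failure: a failed assessment is
    NOT the same thing as low confidence, so we never default to a number.
    """
    try:
        data = json.loads(raw)
        score = int(data["score"])
        return max(1, min(10, score)), data.get("reason", "")
    except (ValueError, KeyError, TypeError):
        return None, ""
```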
sql-agent/
│
├── agent/
│ ├── loop.py # ReAct agent loop — the heart of the system
│ ├── loop_async.py # async version + run_batch_async() for concurrent queries
│ ├── tools.py # TOOL_DEFINITIONS (JSON schemas) + handle_tool_call()
│ ├── guardrails.py # check_query() + check_sql() — two-layer safety
│ ├── state.py # AgentState dataclass + AgentStatus enum
│ ├── confidence.py # assess_confidence() — self-rating LLM call
│ ├── review_queue.py # JSONL queue for low-confidence answers
│ └── cache.py # SQLite-backed query result cache with TTL
│
├── llm/
│ └── client.py # Azure AI Foundry wrapper — sync + async generate()
│
├── db/
│ ├── schema.py # SchemaInspector — reads live DB, formats for LLM
│ ├── executor.py # SQLExecutor — read-only, never raises, QueryResult contract
│ ├── executor_async.py # AsyncSQLExecutor — aiosqlite version
│ └── seed.py # generates realistic e-commerce SQLite DB (5 tables, ~500 rows)
│
├── evals/
│ ├── dataset/
│ │ └── golden.jsonl # 40+ hand-written Q&A pairs with difficulty + category tags
│ ├── runner.py # runs all eval cases against real LLM, saves timestamped report
│ ├── metrics.py # execution accuracy, answer accuracy, F1, self-correction rate
│ └── reports/ # timestamped JSON reports of every eval run (git-ignored)
│
├── observability/
│ ├── tracer.py # Span + Trace dataclasses, start_span/end_span, save to JSON
│ └── dashboard.py # reads logs/, computes p50/p90/p99, success rate, cost
│
├── tests/
│ ├── unit/ # 238 unit tests — each module tested in isolation
│ │ ├── test_guardrails.py
│ │ ├── test_executor.py
│ │ ├── test_tools.py
│ │ ├── test_state.py
│ │ ├── test_cache.py # 41 tests — cache key, TTL, hits, invalidation
│ │ ├── test_confidence.py # 22 tests — JSON parsing, score clamping, defaults
│ │ ├── test_review_queue.py # 25 tests — JSONL append, load, update_status
│ │ └── test_metrics.py
│ └── integration/
│ └── test_agent_loop.py # 45 tests — full loop with mocked LLM, real SQLite
│
├── data/
│ ├── ecommerce.db # SQLite database (git-ignored, regenerated by seed.py)
│ └── cache.db # query result cache (git-ignored)
│
├── logs/ # one JSON file per agent run (git-ignored)
├── app.py # Streamlit web UI — Ask / Dashboard / Review Queue tabs
├── config.py # pydantic-settings — all env vars typed + validated
├── pyproject.toml # uv-managed dependencies
├── .env.example # template for required env vars
└── .gitignore
A realistic e-commerce SQLite database with 5 tables and ~500 rows generated by db/seed.py:
| Table | Rows | Description |
|---|---|---|
| `users` | ~100 | id, name, email, signup_date, active |
| `products` | ~50 | id, name, category, price, stock_qty |
| `orders` | ~200 | id, user_id, order_date, status, total_amount |
| `order_items` | ~500 | id, order_id, product_id, quantity, unit_price |
| `reviews` | ~150 | id, user_id, product_id, rating, review_date |
The schema is automatically injected into the LLM's system prompt before every run via SchemaInspector.format_for_prompt().
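Schema injection via `PRAGMA` is a few lines of SQLite introspection. A sketch of the idea; the real `SchemaInspector` may format columns, keys, and types differently:

```python
import sqlite3

def format_for_prompt(db_path: str) -> str:
    """Read the live schema and render it as CREATE TABLE text for the LLM."""
    conn = sqlite3.connect(db_path)
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    out = []
    for t in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        cols = conn.execute(f"PRAGMA table_info({t})").fetchall()
        col_lines = ",\n".join(f"  {c[1]} {c[2]}" for c in cols)
        out.append(f"CREATE TABLE {t} (\n{col_lines}\n);")
    conn.close()
    return "\n\n".join(out)
```

Reading the live DB (rather than hardcoding the schema in the prompt) is what makes the `schema_hash` cache invalidation trick work.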
- Python 3.12+
- uv installed
- Microsoft Azure AI Foundry project with a model deployed
```
cd sql-agent
uv sync
cp .env.example .env
# Edit .env with your credentials
```

Required `.env` variables:

```
FOUNDRY_ENDPOINT=https://<resource>.services.ai.azure.com/api/projects/<project>
FOUNDRY_API_KEY=<your-key>
MODEL_NAME=gpt-5.2-chat
MAX_RETRIES=3
CONFIDENCE_THRESHOLD=6
DB_PATH=data/ecommerce.db
```

Seed the database:

```
uv run python db/seed.py
```

Launch the web UI:

```
uv run streamlit run app.py
```

Opens at http://localhost:8501. Three tabs:
- Ask — type a question, get an answer with SQL, confidence score, and full execution trace
- Dashboard — aggregate stats across all logged runs (success rate, latency, cost)
- Review Queue — approve or reject low-confidence answers flagged for human review
Ask a question from the command line:

```
uv run python -c "from agent.loop import run_agent; s = run_agent('how many users?'); print(s.final_answer)"
```

Run the evals:

```
uv run python evals/runner.py                    # all 40+ questions
uv run python evals/runner.py --limit 10         # first 10 only
uv run python evals/runner.py --difficulty hard
```

Saves a timestamped report to `evals/reports/` and compares against the previous run to detect regressions.

View the observability dashboard:

```
uv run python observability/dashboard.py                 # all runs
uv run python observability/dashboard.py --last 20
uv run python observability/dashboard.py --status failed
```

Run the tests:

```
uv run pytest tests/ -v
```

283 tests, all passing.
| Decision | Choice | Why |
|---|---|---|
| Agent framework | None — raw Python | Forces understanding of what frameworks abstract away |
| LLM conversation | Responses API + `previous_response_id` | Stateful conversation; model sees full history |
| Tool count | 2 (`execute_sql`, `submit_answer`) | Complexity grows with tools; minimal set is debuggable |
| DB safety | 3 independent layers | No single point of failure |
| State | Dataclass, not dict | Attribute typos raise `AttributeError` immediately |
| Errors | Return dataclass, never raise | Consistent contract; no scattered try/except |
| Logs | JSON files, not database | Portable, deletable, reprocessable, zero infra |
| Cache key | `MD5(normalized_query + schema_hash)` | Auto-invalidates on schema change |
| Confidence default | `None` (not a number) | Failed assessment ≠ low confidence |
| Tests: loop | LLM mocked, real SQLite | Tests loop logic, not whether GPT generates good SQL |
| Tests: evals | Real LLM, no mocks | Only real calls catch prompt regressions |
| Layer | Tool |
|---|---|
| LLM | GPT-5.2 via Microsoft Azure AI Foundry |
| LLM API | OpenAI Responses API (openai Python SDK) |
| Database | SQLite (sqlite3 stdlib + aiosqlite for async) |
| Agent framework | Raw Python — no LangChain, no openai-agents |
| Config | pydantic-settings v2 |
| Web UI | Streamlit |
| Package manager | uv |
| Tests | pytest |
Last updated: February 2026