A complete, working SQL Agent built without any agent framework — to understand exactly what every framework is hiding from you.
A natural language → SQL → answer system that takes a plain English question, figures out what SQL to write, executes it against a real database, self-corrects on failures, assesses its own confidence, and flags uncertain answers for human review.
The twist: built entirely in raw Python. No LangChain. No LlamaIndex. No openai-agents SDK. Every component — the loop, the tool dispatcher, the retry logic, the tracer, the eval harness — is written from scratch. This is intentional: once you build it yourself, you understand what every framework is actually doing.
This project is complete through Phase 5, including a Streamlit web UI.
If you can explain every file in this repo, you understand:
Agentic AI systems
- How the ReAct (Reason + Act) pattern works — not as an abstraction, but as actual Python code
- How tool calling works with the OpenAI Responses API: JSON schemas, function dispatch, `previous_response_id` conversation chaining
- How an agent maintains state across multiple LLM turns (AgentState dataclass)
- How self-correction works: error → feed error to LLM → retry loop → error budget exhaustion
- Why more tools = more complexity, and how to resist over-engineering
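The ReAct loop with an error budget can be sketched in a few lines. This is a minimal illustration, not the repo's actual `agent/loop.py`: `call_llm`, `run_sql`, and the tool-call dict shapes are hypothetical stand-ins for the real Responses API plumbing.

```python
# Minimal sketch of a ReAct loop with self-correction and an error budget.
# `call_llm` returns a tool call; `run_sql` returns {"ok", "rows"/"error"}.
MAX_SQL_ERRORS = 3

def react_loop(question, call_llm, run_sql):
    errors = 0
    messages = [{"role": "user", "content": question}]
    while True:
        action = call_llm(messages)
        if action["tool"] == "submit_answer":
            return {"status": "SUCCESS", "answer": action["args"]["answer"]}
        result = run_sql(action["args"]["sql"])
        if result["ok"]:
            # Success: feed rows back so the model can reason over them
            messages.append({"role": "tool", "content": str(result["rows"])})
        else:
            errors += 1
            if errors >= MAX_SQL_ERRORS:  # error budget exhausted
                return {"status": "FAILED", "answer": None}
            # Self-correction: feed the error text back and let the model retry
            messages.append({"role": "tool", "content": f"SQL error: {result['error']}"})
```

The whole pattern is just this while-loop: every framework's "agent executor" reduces to it.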
Safety and guardrails
- Defense in depth: two independent guardrail layers (natural language check + SQL check) plus a third OS-level read-only DB connection
- Why regex is the right tool for guardrails (instant, free) vs. when an LLM guardrail adds value
- How to block destructive intent in natural language vs. in generated SQL
- Why "belt and suspenders" — never rely on a single safety check
Evaluations
- The difference between execution accuracy (did SQL run?) and answer accuracy (was it right?)
- How to design a golden dataset with difficulty tiers and query categories
- Why evals must use the real LLM (mocking it only tests your plumbing, not your prompts)
- How to detect regressions: compare current run to last run, flag drops > 3%
- What self-correction rate tells you about base prompt quality
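The regression check described above is simple to state precisely. A minimal sketch, assuming metrics are stored as fractions in a dict (the real report format in `evals/runner.py` may differ):

```python
# Flag any metric that drops more than 3 percentage points vs. the previous run.
REGRESSION_THRESHOLD = 0.03

def detect_regressions(current: dict, previous: dict) -> list[str]:
    flagged = []
    for metric, value in current.items():
        prev = previous.get(metric)
        if prev is not None and prev - value > REGRESSION_THRESHOLD:
            flagged.append(metric)
    return flagged
```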
Observability
- The difference between a Span (one step) and a Trace (a full run) — the core OpenTelemetry concepts
- Why structured JSON logs beat print statements for production debugging
- How to compute latency percentiles (p50/p90/p95/p99) and what each tells you
- How to track token cost per run and project it to monthly spend
- Why slow query detection matters and how to implement it
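Percentiles are what make slow-query detection meaningful: a mean hides the one 2.4-second outlier that p95 exposes. A sketch using the nearest-rank method (the dashboard's exact interpolation may differ):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value at position ceil(p/100 * n), 1-indexed."""
    s = sorted(samples)
    k = max(1, math.ceil(p / 100 * len(s)))
    return s[k - 1]

# Ten illustrative latencies in ms: p50 is typical, p95 catches the outlier.
latencies = [120, 95, 340, 110, 2400, 130, 105, 99, 115, 125]
p50 = percentile(latencies, 50)   # 115
p95 = percentile(latencies, 95)   # 2400
```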
Production patterns
- Cache key design: `MD5(normalize(query) + schema_hash)` — automatic invalidation on schema change
- Why TTL should be checked at read time, not write time (config changes apply retroactively)
- Confidence self-assessment: why a separate LLM call gives better signal than asking the same model in-conversation
- Human-in-the-loop: when and how to escalate uncertain answers to a human reviewer
- JSONL as an append-only log format: O(1) writes, human-readable, no schema migrations
- Async Python: `asyncio`, `aiosqlite`, `asyncio.gather` for concurrent batch queries
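The cache-key and read-time-TTL patterns together fit in a dozen lines. A sketch, where `normalize` is a simplified stand-in for whatever normalization the real `cache.py` applies:

```python
import hashlib
import time

def normalize(query: str) -> str:
    # Simplified: lowercase and collapse whitespace so trivially
    # different phrasings hit the same cache entry.
    return " ".join(query.lower().split())

def cache_key(query: str, schema_hash: str) -> str:
    # Hashing the schema into the key means any schema change produces
    # new keys, so stale answers are never served: automatic invalidation.
    return hashlib.md5((normalize(query) + schema_hash).encode()).hexdigest()

def get(store: dict, key: str, ttl_seconds: int):
    # TTL checked at READ time: lowering the TTL in config immediately
    # expires old entries, no rewrite of the cache needed.
    entry = store.get(key)
    if entry is None or time.time() - entry["written_at"] > ttl_seconds:
        return None
    return entry["answer"]
```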
Software engineering
- Why dataclasses beat dicts for state (typos become `AttributeError`, not silent new keys)
- Never-raise executor contracts: always return a result object, let callers decide what to do
- Configuration as code: `pydantic-settings` with typed fields that fail loudly on missing keys
- Mock strategy: mock the LLM in integration tests, use real LLM in evals — different purposes
- When `@st.cache_data` and `st.session_state` matter in Streamlit
User question (natural language)
│
▼
┌──────────────────────────────────┐
│ Layer 1 Guardrail │ regex on raw English — zero tokens, instant
│ check_query(user_query) │ blocks: "delete all users", "drop the table"
└──────────────────────────────────┘
│ allowed=True
▼
┌──────────────────────────────────┐
│ SchemaInspector │ reads live DB via PRAGMA
│ format_for_prompt() │ formats as CREATE TABLE syntax for LLM
└──────────────────────────────────┘
│
▼
┌──────────────────────────────────┐
│ QueryCache.get() │ MD5(normalize(query) + schema_hash)
│ │ HIT → return cached answer (no LLM call)
└──────────────────────────────────┘
│ MISS
▼
┌──────────────────────────────────────────────────────────┐
│ ReAct Loop │
│ │
│ LLM call (Responses API, tool_choice enabled) │
│ ↓ │
│ Model calls execute_sql(sql=...) or submit_answer(...) │
│ ↓ │
│ execute_sql path: │
│ Layer 2 Guardrail: check_sql() — catches non-SELECT │
│ SQLExecutor.run() — read-only OS-level connection │
│ success → feed rows back to LLM (next turn) │
│ error → record error, feed error back (self-correct)│
│ ↓ │
│ submit_answer path: │
│ state.record_success() → loop stops │
│ │
│ Retry budget: max 3 SQL errors before FAILED │
│ Conversation chain: previous_response_id per turn │
└──────────────────────────────────────────────────────────┘
│ SUCCESS
▼
┌──────────────────────────────────┐
│ assess_confidence() │ fresh LLM call (no conversation history)
│ │ returns score 1-10 + reason
└──────────────────────────────────┘
│
▼
┌──────────────────────────────────┐
│ maybe_flag_for_review() │ score < threshold → append to JSONL queue
│ QueryCache.set() │ store answer for future identical queries
└──────────────────────────────────┘
│
▼
┌──────────────────────────────────┐
│ Trace.finish() + Trace.save() │ write structured JSON log to logs/
└──────────────────────────────────┘
│
▼
AgentState returned to caller
(final_answer, confidence_score, flagged_for_review, total_tokens_used)
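Two details of the tail end of this flow deserve code: the confidence score is parsed defensively, and a failed assessment yields `None` rather than a low number. A sketch of that parsing, assuming the fresh LLM call returns JSON like `{"score": 7, "reason": "..."}` (the real `confidence.py` prompt and shape may differ):

```python
import json

def parse_confidence(raw: str):
    """Parse a self-assessment response; clamp score to 1-10.

    Returns (None, "") on any parse failure: a failed assessment is
    NOT the same thing as low confidence, so we never default to a number.
    """
    try:
        data = json.loads(raw)
        score = int(data["score"])
        return max(1, min(10, score)), data.get("reason", "")
    except (ValueError, KeyError, TypeError):
        return None, ""
```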
sql-agent/
│
├── agent/
│ ├── loop.py # ReAct agent loop — the heart of the system
│ ├── loop_async.py # async version + run_batch_async() for concurrent queries
│ ├── tools.py # TOOL_DEFINITIONS (JSON schemas) + handle_tool_call()
│ ├── guardrails.py # check_query() + check_sql() — two-layer safety
│ ├── state.py # AgentState dataclass + AgentStatus enum
│ ├── confidence.py # assess_confidence() — self-rating LLM call
│ ├── review_queue.py # JSONL queue for low-confidence answers
│ └── cache.py # SQLite-backed query result cache with TTL
│
├── llm/
│ └── client.py # Azure AI Foundry wrapper — sync + async generate()
│
├── db/
│ ├── schema.py # SchemaInspector — reads live DB, formats for LLM
│ ├── executor.py # SQLExecutor — read-only, never raises, QueryResult contract
│ ├── executor_async.py # AsyncSQLExecutor — aiosqlite version
│ └── seed.py # generates realistic e-commerce SQLite DB (5 tables, ~500 rows)
│
├── evals/
│ ├── dataset/
│ │ └── golden.jsonl # 40+ hand-written Q&A pairs with difficulty + category tags
│ ├── runner.py # runs all eval cases against real LLM, saves timestamped report
│ ├── metrics.py # execution accuracy, answer accuracy, F1, self-correction rate
│ └── reports/ # timestamped JSON reports of every eval run (git-ignored)
│
├── observability/
│ ├── tracer.py # Span + Trace dataclasses, start_span/end_span, save to JSON
│ └── dashboard.py # reads logs/, computes p50/p90/p99, success rate, cost
│
├── tests/
│ ├── unit/ # 238 unit tests — each module tested in isolation
│ │ ├── test_guardrails.py
│ │ ├── test_executor.py
│ │ ├── test_tools.py
│ │ ├── test_state.py
│ │ ├── test_cache.py # 41 tests — cache key, TTL, hits, invalidation
│ │ ├── test_confidence.py # 22 tests — JSON parsing, score clamping, defaults
│ │ ├── test_review_queue.py # 25 tests — JSONL append, load, update_status
│ │ └── test_metrics.py
│ └── integration/
│ └── test_agent_loop.py # 45 tests — full loop with mocked LLM, real SQLite
│
├── data/
│ ├── ecommerce.db # SQLite database (git-ignored, regenerated by seed.py)
│ └── cache.db # query result cache (git-ignored)
│
├── logs/ # one JSON file per agent run (git-ignored)
├── app.py # Streamlit web UI — Ask / Dashboard / Review Queue tabs
├── config.py # pydantic-settings — all env vars typed + validated
├── pyproject.toml # uv-managed dependencies
├── .env.example # template for required env vars
└── .gitignore
A realistic e-commerce SQLite database with 5 tables and ~500 rows generated by db/seed.py:
| Table | Rows | Description |
|---|---|---|
| `users` | ~100 | id, name, email, signup_date, active |
| `products` | ~50 | id, name, category, price, stock_qty |
| `orders` | ~200 | id, user_id, order_date, status, total_amount |
| `order_items` | ~500 | id, order_id, product_id, quantity, unit_price |
| `reviews` | ~150 | id, user_id, product_id, rating, review_date |
The schema is automatically injected into the LLM's system prompt before every run via SchemaInspector.format_for_prompt().
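Schema injection via `PRAGMA` is a few lines of SQLite introspection. A sketch of the idea; the real `SchemaInspector` may format columns, keys, and types differently:

```python
import sqlite3

def format_for_prompt(db_path: str) -> str:
    """Read the live schema and render it as CREATE TABLE text for the LLM."""
    conn = sqlite3.connect(db_path)
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    out = []
    for t in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        cols = conn.execute(f"PRAGMA table_info({t})").fetchall()
        col_lines = ",\n".join(f"  {c[1]} {c[2]}" for c in cols)
        out.append(f"CREATE TABLE {t} (\n{col_lines}\n);")
    conn.close()
    return "\n\n".join(out)
```

Reading the live DB (rather than hardcoding the schema in the prompt) is what makes the `schema_hash` cache invalidation trick work.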
- Python 3.12+
- uv installed
- Microsoft Azure AI Foundry project with a model deployed
```
cd sql-agent
uv sync
cp .env.example .env
# Edit .env with your credentials
```

Required `.env` variables:

```
FOUNDRY_ENDPOINT=https://<resource>.services.ai.azure.com/api/projects/<project>
FOUNDRY_API_KEY=<your-key>
MODEL_NAME=gpt-5.2-chat
MAX_RETRIES=3
CONFIDENCE_THRESHOLD=6
DB_PATH=data/ecommerce.db
```

Seed the database:

```
uv run python db/seed.py
```

Launch the web UI:

```
uv run streamlit run app.py
```

Opens at http://localhost:8501. Three tabs:
- Ask — type a question, get an answer with SQL, confidence score, and full execution trace
- Dashboard — aggregate stats across all logged runs (success rate, latency, cost)
- Review Queue — approve or reject low-confidence answers flagged for human review
Ask a question from the command line:

```
uv run python -c "from agent.loop import run_agent; s = run_agent('how many users?'); print(s.final_answer)"
```

Run the evals:

```
uv run python evals/runner.py                    # all 40+ questions
uv run python evals/runner.py --limit 10         # first 10 only
uv run python evals/runner.py --difficulty hard
```

Saves a timestamped report to `evals/reports/` and compares against the previous run to detect regressions.

View the observability dashboard:

```
uv run python observability/dashboard.py                 # all runs
uv run python observability/dashboard.py --last 20
uv run python observability/dashboard.py --status failed
```

Run the tests:

```
uv run pytest tests/ -v
```

283 tests, all passing.
| Decision | Choice | Why |
|---|---|---|
| Agent framework | None — raw Python | Forces understanding of what frameworks abstract away |
| LLM conversation | Responses API + `previous_response_id` | Stateful conversation; model sees full history |
| Tool count | 2 (`execute_sql`, `submit_answer`) | Complexity grows with tools; minimal set is debuggable |
| DB safety | 3 independent layers | No single point of failure |
| State | Dataclass, not dict | Attribute typos raise `AttributeError` immediately |
| Errors | Return dataclass, never raise | Consistent contract; no scattered try/except |
| Logs | JSON files, not database | Portable, deletable, reprocessable, zero infra |
| Cache key | `MD5(normalized_query + schema_hash)` | Auto-invalidates on schema change |
| Confidence default | `None` (not a number) | Failed assessment ≠ low confidence |
| Tests: loop | LLM mocked, real SQLite | Tests loop logic, not whether GPT generates good SQL |
| Tests: evals | Real LLM, no mocks | Only real calls catch prompt regressions |
| Layer | Tool |
|---|---|
| LLM | GPT-5.2 via Microsoft Azure AI Foundry |
| LLM API | OpenAI Responses API (openai Python SDK) |
| Database | SQLite (sqlite3 stdlib + aiosqlite for async) |
| Agent framework | Raw Python — no LangChain, no openai-agents |
| Config | pydantic-settings v2 |
| Web UI | Streamlit |
| Package manager | uv |
| Tests | pytest |
Last updated: February 2026