ashishreddy2411/sql-agent
Production SQL Agent — From Scratch, in Raw Python

A complete, working SQL Agent built without any agent framework — to understand exactly what every framework is hiding from you.


What This Project Is

A natural language → SQL → answer system that takes a plain English question, figures out what SQL to write, executes it against a real database, self-corrects on failures, assesses its own confidence, and flags uncertain answers for human review.

The twist: built entirely in raw Python. No LangChain. No LlamaIndex. No openai-agents SDK. Every component — the loop, the tool dispatcher, the retry logic, the tracer, the eval harness — is written from scratch. This is intentional: once you build it yourself, you understand what every framework is actually doing.

This project is complete through Phase 5, including a Streamlit web UI.


What You Learn From This Project

If you can explain every file in this repo, you understand:

Agentic AI systems

  • How the ReAct (Reason + Act) pattern works — not as an abstraction, but as actual Python code
  • How tool calling works with the OpenAI Responses API: JSON schemas, function dispatch, previous_response_id conversation chaining
  • How an agent maintains state across multiple LLM turns (AgentState dataclass)
  • How self-correction works: error → feed error to LLM → retry loop → error budget exhaustion
  • Why more tools = more complexity, and how to resist over-engineering
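The bullet points above compress into a surprisingly small loop. A minimal sketch, where `call_llm` and `execute_sql` are hypothetical stand-ins for the real Responses API call and tool dispatcher (not the repo's actual code):

```python
def run_react_loop(question, call_llm, execute_sql, max_errors=3):
    """Minimal ReAct loop: reason (LLM call) then act (tool call), repeat."""
    errors = 0
    message = question
    while True:
        action = call_llm(message)           # model picks a tool each turn
        if action["tool"] == "submit_answer":
            return action["args"]["answer"]  # final answer ends the loop
        result = execute_sql(action["args"]["sql"])
        if result["ok"]:
            message = f"rows: {result['rows']}"    # feed rows back as the next turn
        else:
            errors += 1
            if errors >= max_errors:               # error budget exhausted
                return None
            message = f"error: {result['error']}"  # feed the error back to self-correct
```

The real loop additionally threads `previous_response_id` between turns and records everything in an `AgentState`, but the control flow is this shape.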

Safety and guardrails

  • Defense in depth: two independent guardrail layers (natural language check + SQL check) plus a third OS-level read-only DB connection
  • Why regex is the right tool for guardrails (instant, free) vs. when an LLM guardrail adds value
  • How to block destructive intent in natural language vs. in generated SQL
  • Why "belt and suspenders" — never rely on a single safety check
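A regex guardrail really is this cheap. A sketch of the two layers, with illustrative patterns (the repo's actual lists in guardrails.py are more thorough):

```python
import re

# Layer 1: destructive intent in raw English. Layer 2: non-SELECT SQL.
# Both patterns are illustrative examples, not the repo's real lists.
DESTRUCTIVE_ENGLISH = re.compile(r"\b(delete|drop|truncate|wipe)\b", re.IGNORECASE)
SELECT_ONLY = re.compile(r"^\s*SELECT\b", re.IGNORECASE)

def check_query(user_query: str) -> bool:
    """Layer 1: block destructive intent before spending any tokens."""
    return not DESTRUCTIVE_ENGLISH.search(user_query)

def check_sql(sql: str) -> bool:
    """Layer 2: only SELECT statements may reach the executor."""
    return bool(SELECT_ONLY.match(sql))
```

Even if both layers miss, the third layer (the read-only OS-level connection) refuses writes at the database itself.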

Evaluations

  • The difference between execution accuracy (did SQL run?) and answer accuracy (was it right?)
  • How to design a golden dataset with difficulty tiers and query categories
  • Why evals must use the real LLM (mocking it only tests your plumbing, not your prompts)
  • How to detect regressions: compare current run to last run, flag drops > 3%
  • What self-correction rate tells you about base prompt quality
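The regression check described above needs nothing fancier than comparing two metric dicts. A sketch, with the 3% tolerance as a parameter (hypothetical helper, not the repo's runner.py):

```python
def detect_regressions(current: dict, previous: dict, tolerance: float = 0.03):
    """Flag any metric that dropped more than `tolerance` versus the last run."""
    regressions = {}
    for metric, value in current.items():
        prev = previous.get(metric)
        if prev is not None and prev - value > tolerance:
            regressions[metric] = (prev, value)  # (was, now)
    return regressions
```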

Observability

  • The difference between a Span (one step) and a Trace (a full run) — the core OpenTelemetry concepts
  • Why structured JSON logs beat print statements for production debugging
  • How to compute latency percentiles (p50/p90/p95/p99) and what each tells you
  • How to track token cost per run and project it to monthly spend
  • Why slow query detection matters and how to implement it
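Percentiles over logged latencies are a one-liner once you pick a method. A nearest-rank sketch (one common definition; dashboard.py may compute it differently):

```python
def percentile(latencies_ms: list, p: int):
    """Nearest-rank percentile: sort, then take the ceil(n * p / 100)-th value."""
    ordered = sorted(latencies_ms)
    rank = max(1, -(-len(ordered) * p // 100))  # ceil via negated floor division
    return ordered[rank - 1]
```

p50 is the typical run, p95/p99 are the slow tail your users actually notice, and a widening gap between them is the usual first sign of slow queries.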

Production patterns

  • Cache key design: MD5(normalize(query) + schema_hash) — automatic invalidation on schema change
  • Why TTL should be checked at read time, not write time (config changes apply retroactively)
  • Confidence self-assessment: why a separate LLM call gives better signal than asking the same model in-conversation
  • Human-in-the-loop: when and how to escalate uncertain answers to a human reviewer
  • JSONL as an append-only log format: O(1) writes, human-readable, no schema migrations
  • Async Python: asyncio, aiosqlite, asyncio.gather for concurrent batch queries
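The first two bullets can be shown in a few lines. A sketch of the cache-key scheme and the read-time TTL check (hypothetical helpers in the spirit of cache.py, not its actual code):

```python
import hashlib
import time

def cache_key(query: str, schema_hash: str) -> str:
    """MD5(normalize(query) + schema_hash): a schema change flips every key,
    so stale entries are simply never hit again."""
    normalized = " ".join(query.lower().split())  # case- and whitespace-insensitive
    return hashlib.md5((normalized + schema_hash).encode()).hexdigest()

def cache_get(store: dict, key: str, ttl_seconds: int):
    """TTL checked at read time: shrinking the TTL in config immediately
    expires older entries, with no rewrite of the store."""
    entry = store.get(key)
    if entry is None:
        return None
    answer, written_at = entry
    if time.time() - written_at > ttl_seconds:
        return None  # expired under the *current* TTL
    return answer
```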

Software engineering

  • Why dataclasses beat dicts for state (typos become AttributeError, not silent new keys)
  • Never-raise executor contracts: always return a result object, let callers decide what to do
  • Configuration as code: pydantic-settings with typed fields that fail loudly on missing keys
  • Mock strategy: mock the LLM in integration tests, use real LLM in evals — different purposes
  • When @st.cache_data and st.session_state matter in Streamlit

Architecture

User question (natural language)
        │
        ▼
┌──────────────────────────────────┐
│  Layer 1 Guardrail               │  regex on raw English — zero tokens, instant
│  check_query(user_query)         │  blocks: "delete all users", "drop the table"
└──────────────────────────────────┘
        │ allowed=True
        ▼
┌──────────────────────────────────┐
│  SchemaInspector                 │  reads live DB via PRAGMA
│  format_for_prompt()             │  formats as CREATE TABLE syntax for LLM
└──────────────────────────────────┘
        │
        ▼
┌──────────────────────────────────┐
│  QueryCache.get()                │  MD5(normalize(query) + schema_hash)
│                                  │  HIT → return cached answer (no LLM call)
└──────────────────────────────────┘
        │ MISS
        ▼
┌──────────────────────────────────────────────────────────┐
│                    ReAct Loop                            │
│                                                          │
│  LLM call (Responses API, tool_choice enabled)           │
│      ↓                                                   │
│  Model calls execute_sql(sql=...) or submit_answer(...)  │
│      ↓                                                   │
│  execute_sql path:                                       │
│    Layer 2 Guardrail: check_sql() — catches non-SELECT   │
│    SQLExecutor.run() — read-only OS-level connection     │
│    success → feed rows back to LLM (next turn)           │
│    error   → record error, feed error back (self-correct)│
│      ↓                                                   │
│  submit_answer path:                                     │
│    state.record_success() → loop stops                   │
│                                                          │
│  Retry budget: max 3 SQL errors before FAILED            │
│  Conversation chain: previous_response_id per turn       │
└──────────────────────────────────────────────────────────┘
        │ SUCCESS
        ▼
┌──────────────────────────────────┐
│  assess_confidence()             │  fresh LLM call (no conversation history)
│                                  │  returns score 1-10 + reason
└──────────────────────────────────┘
        │
        ▼
┌──────────────────────────────────┐
│  maybe_flag_for_review()         │  score < threshold → append to JSONL queue
│  QueryCache.set()                │  store answer for future identical queries
└──────────────────────────────────┘
        │
        ▼
┌──────────────────────────────────┐
│  Trace.finish() + Trace.save()   │  write structured JSON log to logs/
└──────────────────────────────────┘
        │
        ▼
    AgentState returned to caller
    (final_answer, confidence_score, flagged_for_review, total_tokens_used)
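One step in the diagram deserves a closer look: assess_confidence() must parse a freeform LLM reply into a 1-10 score. A defensive parser in the spirit of what test_confidence.py covers (JSON parsing, score clamping, None defaults); `parse_confidence` is a hypothetical helper, not the repo's actual function:

```python
import json

def parse_confidence(raw: str, lo: int = 1, hi: int = 10):
    """Parse the assessor's JSON reply. Clamp the score into [lo, hi];
    fall back to None (not a number) when the reply is unparseable,
    because a failed assessment is not the same as low confidence."""
    try:
        data = json.loads(raw)
        score = int(data["score"])
    except (ValueError, KeyError, TypeError):
        return None, "unparseable response"
    return max(lo, min(hi, score)), data.get("reason", "")
```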

Project Structure

sql-agent/
│
├── agent/
│   ├── loop.py              # ReAct agent loop — the heart of the system
│   ├── loop_async.py        # async version + run_batch_async() for concurrent queries
│   ├── tools.py             # TOOL_DEFINITIONS (JSON schemas) + handle_tool_call()
│   ├── guardrails.py        # check_query() + check_sql() — two-layer safety
│   ├── state.py             # AgentState dataclass + AgentStatus enum
│   ├── confidence.py        # assess_confidence() — self-rating LLM call
│   ├── review_queue.py      # JSONL queue for low-confidence answers
│   └── cache.py             # SQLite-backed query result cache with TTL
│
├── llm/
│   └── client.py            # Azure AI Foundry wrapper — sync + async generate()
│
├── db/
│   ├── schema.py            # SchemaInspector — reads live DB, formats for LLM
│   ├── executor.py          # SQLExecutor — read-only, never raises, QueryResult contract
│   ├── executor_async.py    # AsyncSQLExecutor — aiosqlite version
│   └── seed.py              # generates realistic e-commerce SQLite DB (5 tables, ~500 rows)
│
├── evals/
│   ├── dataset/
│   │   └── golden.jsonl     # 40+ hand-written Q&A pairs with difficulty + category tags
│   ├── runner.py            # runs all eval cases against real LLM, saves timestamped report
│   ├── metrics.py           # execution accuracy, answer accuracy, F1, self-correction rate
│   └── reports/             # timestamped JSON reports of every eval run (git-ignored)
│
├── observability/
│   ├── tracer.py            # Span + Trace dataclasses, start_span/end_span, save to JSON
│   └── dashboard.py         # reads logs/, computes p50/p90/p99, success rate, cost
│
├── tests/
│   ├── unit/                # 238 unit tests — each module tested in isolation
│   │   ├── test_guardrails.py
│   │   ├── test_executor.py
│   │   ├── test_tools.py
│   │   ├── test_state.py
│   │   ├── test_cache.py        # 41 tests — cache key, TTL, hits, invalidation
│   │   ├── test_confidence.py   # 22 tests — JSON parsing, score clamping, defaults
│   │   ├── test_review_queue.py # 25 tests — JSONL append, load, update_status
│   │   └── test_metrics.py
│   └── integration/
│       └── test_agent_loop.py   # 45 tests — full loop with mocked LLM, real SQLite
│
├── data/
│   ├── ecommerce.db         # SQLite database (git-ignored, regenerated by seed.py)
│   └── cache.db             # query result cache (git-ignored)
│
├── logs/                    # one JSON file per agent run (git-ignored)
├── app.py                   # Streamlit web UI — Ask / Dashboard / Review Queue tabs
├── config.py                # pydantic-settings — all env vars typed + validated
├── pyproject.toml           # uv-managed dependencies
├── .env.example             # template for required env vars
└── .gitignore

The Database

A realistic e-commerce SQLite database with 5 tables and ~500 rows generated by db/seed.py:

| Table       | Rows | Columns                                        |
|-------------|------|------------------------------------------------|
| users       | ~100 | id, name, email, signup_date, active           |
| products    | ~50  | id, name, category, price, stock_qty           |
| orders      | ~200 | id, user_id, order_date, status, total_amount  |
| order_items | ~500 | id, order_id, product_id, quantity, unit_price |
| reviews     | ~150 | id, user_id, product_id, rating, review_date   |

The schema is automatically injected into the LLM's system prompt before every run via SchemaInspector.format_for_prompt().


Setup

Prerequisites

  • Python 3.12+
  • uv installed
  • Microsoft Azure AI Foundry project with a model deployed

Install

cd sql-agent
uv sync

Configure

cp .env.example .env
# Edit .env with your credentials

Required .env variables:

FOUNDRY_ENDPOINT=https://<resource>.services.ai.azure.com/api/projects/<project>
FOUNDRY_API_KEY=<your-key>
MODEL_NAME=gpt-5.2-chat
MAX_RETRIES=3
CONFIDENCE_THRESHOLD=6
DB_PATH=data/ecommerce.db

Seed the database

uv run python db/seed.py

How to Run

Web UI (Streamlit)

uv run streamlit run app.py

Opens at http://localhost:8501. Three tabs:

  • Ask — type a question, get an answer with SQL, confidence score, and full execution trace
  • Dashboard — aggregate stats across all logged runs (success rate, latency, cost)
  • Review Queue — approve or reject low-confidence answers flagged for human review

CLI (single query)

uv run python -c "from agent.loop import run_agent; s = run_agent('how many users?'); print(s.final_answer)"

Evals (against real LLM)

uv run python evals/runner.py              # all 40+ questions
uv run python evals/runner.py --limit 10  # first 10 only
uv run python evals/runner.py --difficulty hard

Saves a timestamped report to evals/reports/ and compares against the previous run to detect regressions.

Observability dashboard (CLI)

uv run python observability/dashboard.py           # all runs
uv run python observability/dashboard.py --last 20
uv run python observability/dashboard.py --status failed

Tests

uv run pytest tests/ -v

283 tests, all passing.


Key Design Decisions

| Decision           | Choice                                | Why                                                      |
|--------------------|---------------------------------------|----------------------------------------------------------|
| Agent framework    | None — raw Python                     | Forces understanding of what frameworks abstract away    |
| LLM conversation   | Responses API + previous_response_id  | Stateful conversation; model sees full history           |
| Tool count         | 2 (execute_sql, submit_answer)        | Complexity grows with tools; minimal set is debuggable   |
| DB safety          | 3 independent layers                  | No single point of failure                               |
| State              | Dataclass, not dict                   | Attribute typos raise AttributeError immediately         |
| Errors             | Return dataclass, never raise         | Consistent contract; no scattered try/except             |
| Logs               | JSON files, not database              | Portable, deletable, reprocessable, zero infra           |
| Cache key          | MD5(normalized_query + schema_hash)   | Auto-invalidates on schema change                        |
| Confidence default | None (not a number)                   | Failed assessment ≠ low confidence                       |
| Tests: loop        | LLM mocked, real SQLite               | Tests loop logic, not whether GPT generates good SQL     |
| Tests: evals       | Real LLM, no mocks                    | Only real calls catch prompt regressions                 |

Tech Stack

| Layer           | Tool                                        |
|-----------------|---------------------------------------------|
| LLM             | GPT-5.2 via Microsoft Azure AI Foundry      |
| LLM API         | OpenAI Responses API (openai Python SDK)    |
| Database        | SQLite (sqlite3 stdlib + aiosqlite for async) |
| Agent framework | Raw Python — no LangChain, no openai-agents |
| Config          | pydantic-settings v2                        |
| Web UI          | Streamlit                                   |
| Package manager | uv                                          |
| Tests           | pytest                                      |

Last updated: February 2026

About

Production SQL Agent built from scratch in raw Python — ReAct loop, guardrails, evals, observability, caching, confidence scoring, Streamlit UI
