VenusDataAI/text-to-sql-rag


Text-to-SQL RAG

A production-ready Text-to-SQL system that uses Retrieval-Augmented Generation (RAG) over a dbt catalog to answer business questions with validated, executable SQL.

Architecture

flowchart LR
    Q([User Question]):::input --> R[Retriever\nVector Search]
    R --> CB[Context Builder\nTable + FK docs]
    CB --> PB[Prompt Builder\nSystem + user msg]
    PB --> CL[Claude\nSQL generation]
    CL --> V[SQL Validator\nsqlglot]
    V --> DB[DuckDB Executor\nParquet files]
    DB --> UI([Results + UI]):::output

    subgraph RAG Layer
        IDX[(ChromaDB\nVector Store)] --> R
    end

    subgraph Catalog
        CAT[catalog.json\nmanifest.json\nschema.yml] --> IDX
    end

    classDef input  fill:#161b22,stroke:#58a6ff,color:#c9d1d9
    classDef output fill:#161b22,stroke:#3fb950,color:#c9d1d9
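The flow above can be sketched end-to-end in Python. Every function below is a stand-in stub, not the project's actual API; the real implementations live in rag/, sql_generator/, and integrations/.

```python
# Hypothetical sketch of the pipeline: retrieve -> context -> generate -> answer.
# All names and return values are illustrative stubs, not the project's code.

def retrieve_tables(question: str, top_k: int = 4) -> list[str]:
    # Real system: semantic search over the ChromaDB vector index.
    return ["silver_orders"]

def build_context(tables: list[str]) -> str:
    # Real system: formats table docs + FK relationships for the prompt.
    return "Table silver_orders(order_id, country_code, total_amount, ...)"

def generate_sql(question: str, context: str) -> str:
    # Real system: sends system + user messages to Claude; stubbed here.
    return "SELECT country_code, SUM(total_amount) FROM silver_orders GROUP BY 1"

def answer(question: str) -> dict:
    tables = retrieve_tables(question)
    context = build_context(tables)
    sql = generate_sql(question, context)
    # Real system: validates the SQL with sqlglot, then executes it via DuckDB.
    return {"sql": sql, "retrieved_tables": tables}
```

The validator and executor stages are deliberately left out of the stub; they sit between generation and the final response in the real pipeline.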

Quickstart (4 commands)

# 1. Install
pip install -e ".[dev]"

# 2. Generate sample data
python sample_catalog/generate_parquet.py

# 3. Set your Anthropic API key
export ANTHROPIC_API_KEY=sk-ant-...

# 4. Run
uvicorn api.main:app --reload
# → open http://localhost:8000/ui

Project Structure

text-to-sql-rag/
├── api/                        FastAPI app + routes
│   ├── main.py                 App factory, lifespan (startup index)
│   └── routes/
│       ├── query.py            POST /api/v1/query, /query/explain
│       ├── catalog.py          GET/POST /api/v1/catalog/*
│       └── health.py           GET /health
├── rag/
│   ├── vector_store.py         ChromaDB + sentence-transformers
│   ├── indexer.py              Builds the vector index from CatalogIndex
│   ├── retriever.py            Semantic table retrieval
│   └── context_builder.py      Formats retrieved metadata for the prompt
├── sql_generator/
│   ├── generator.py            Orchestrates retrieve → prompt → Claude → parse
│   ├── prompt_builder.py       System prompt + user message construction
│   └── validator.py            sqlglot-based SQL validation
├── catalog/
│   ├── dbt_catalog_parser.py   Merges catalog.json + manifest.json + schema.yml
│   ├── schema_loader.py        YAML loader helper
│   └── relationship_extractor.py  FK graph between retrieved tables
├── integrations/
│   ├── anthropic_client.py     Claude API wrapper
│   └── duckdb_executor.py      Parquet → DuckDB views, SQL execution
├── models/
│   ├── catalog_models.py       ColumnMeta, TableMeta, CatalogIndex
│   └── query_models.py         Request/response Pydantic models
├── sample_catalog/
│   ├── catalog.json            dbt catalog (6 e-commerce tables)
│   ├── manifest.json           dbt manifest (descriptions, tags, owners)
│   ├── schema.yml              Column tests → PK/FK inference
│   ├── generate_parquet.py     Seed script (100-row realistic Parquet files)
│   └── data/                   *.parquet files (auto-generated)
├── ui/index.html               Self-contained dark-theme SPA
├── tests/                      pytest suite (85 %+ coverage target)
├── config.yaml                 All runtime settings
└── pyproject.toml

API Reference

| Method | Path | Description |
|--------|------|-------------|
| POST | /api/v1/query | Translate question → SQL → execute |
| POST | /api/v1/query/explain | Plain-English explanation of SQL |
| GET | /api/v1/catalog/tables | List all indexed tables |
| GET | /api/v1/catalog/tables/{name} | Full TableMeta for one table |
| POST | /api/v1/catalog/rebuild | Re-index from source catalog files |
| GET | /health | Liveness check |

POST /api/v1/query

// Request
{ "question": "What was total revenue by country last month?", "execute": true, "max_results": 100 }

// Response
{
  "sql": "WITH paid AS (...) SELECT ...",
  "validation": { "is_valid": true, "errors": [], "normalized_sql": "..." },
  "result": { "columns": ["country_code","revenue"], "rows": [...], "row_count": 10, "truncated": false },
  "cannot_answer": false,
  "retrieved_tables": ["silver_orders", "gold_revenue_daily"],
  "metadata": { "tokens_used": 312, "latency_ms": 1840 }
}
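The endpoint can be called with nothing beyond the standard library. A minimal client sketch, assuming the Quickstart server is running on localhost:8000:

```python
# Minimal stdlib-only client for POST /api/v1/query.
import json
import urllib.request

def build_query_request(question: str, execute: bool = True,
                        max_results: int = 100) -> urllib.request.Request:
    # Mirrors the request body shown above.
    payload = {"question": question, "execute": execute, "max_results": max_results}
    return urllib.request.Request(
        "http://localhost:8000/api/v1/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    req = build_query_request("What was total revenue by country last month?")
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    print(body["sql"])
```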

Example Q&A Pairs

1. Revenue by country last month

Q: What was total revenue by country last month?

WITH paid_orders AS (
    SELECT o.country_code, o.total_amount
    FROM silver_orders AS o
    WHERE o.is_paid = TRUE
      AND DATE_TRUNC('month', o.created_at) = DATE_TRUNC('month', DATEADD(month, -1, CURRENT_DATE))
)
SELECT p.country_code, SUM(p.total_amount) AS total_revenue
FROM paid_orders AS p
GROUP BY p.country_code
ORDER BY total_revenue DESC

2. Top 10 products by quantity sold

Q: What are the top 10 products by total quantity sold?

SELECT p.name, p.category, SUM(oi.quantity) AS total_qty
FROM silver_order_items AS oi
JOIN silver_products AS p ON p.product_id = oi.product_id
GROUP BY p.product_id, p.name, p.category
ORDER BY total_qty DESC
LIMIT 10

3. Daily active users (mobile vs desktop)

Q: How many daily active users are there by device type this week?

SELECT DATE_TRUNC('day', s.started_at) AS day,
       s.device_type,
       COUNT(DISTINCT s.user_id) AS active_users
FROM silver_sessions AS s
WHERE s.started_at >= DATE_TRUNC('week', CURRENT_DATE)
GROUP BY 1, 2
ORDER BY 1, 2

4. Average order value by product category

Q: What is the average order value for each product category?

SELECT p.category,
       AVG(o.total_amount) AS avg_order_value,
       COUNT(DISTINCT o.order_id) AS order_count
FROM silver_orders AS o
JOIN silver_order_items AS oi ON oi.order_id = o.order_id
JOIN silver_products AS p ON p.product_id = oi.product_id
WHERE o.is_paid = TRUE
GROUP BY p.category
ORDER BY avg_order_value DESC

5. User cohort retention (new users vs active)

Q: How many users created in the last 30 days are still active?

SELECT COUNT(u.user_id) AS new_users,
       SUM(CASE WHEN u.is_active THEN 1 ELSE 0 END) AS active_new_users
FROM silver_users AS u
WHERE u.created_at >= DATEADD(day, -30, CURRENT_DATE)

Using Your Own dbt Catalog

  1. Export from dbt:

    dbt docs generate   # produces target/catalog.json + target/manifest.json
  2. Point config.yaml at your files:

    catalog:
      catalog_json: "path/to/target/catalog.json"
      manifest_json: "path/to/target/manifest.json"
      schema_ymls:
        - "path/to/models/schema.yml"
        - "path/to/models/staging/schema.yml"
  3. Rebuild the index via the API:

    curl -X POST http://localhost:8000/api/v1/catalog/rebuild

    Or restart the server (index is rebuilt on startup).

  4. (Optional) Drop your own Parquet/CSV files into sample_catalog/data/ and the DuckDB executor will register them automatically (filename stem = table name).
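The "filename stem = table name" convention from step 4 amounts to one CREATE VIEW per data file. A sketch of the SQL this produces (illustrative helper, not the executor's actual code; read_parquet is DuckDB's standard Parquet reader):

```python
# Build one DuckDB view per Parquet file; the view name is the filename stem,
# e.g. silver_orders.parquet -> view silver_orders. Illustrative only.
from pathlib import Path

def view_statements(data_dir: str) -> list[str]:
    stmts = []
    for path in sorted(Path(data_dir).glob("*.parquet")):
        table = path.stem
        stmts.append(
            f"CREATE OR REPLACE VIEW {table} AS SELECT * FROM read_parquet('{path}')"
        )
    return stmts
```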

Configuration (config.yaml)

| Key | Default | Description |
|-----|---------|-------------|
| anthropic.model | claude-sonnet-4-6 | Claude model for generation |
| anthropic.max_tokens | 2048 | Max output tokens |
| vector_store.persist_dir | .chroma_db | ChromaDB storage path |
| vector_store.embedding_model | all-MiniLM-L6-v2 | sentence-transformers model |
| retriever.top_k_tables | 4 | Tables retrieved per query |
| duckdb.data_dir | sample_catalog/data | Parquet files directory |
| duckdb.max_rows | 500 | Max rows returned per query |
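Put together, the keys above correspond to a config.yaml shaped roughly like this (nesting inferred from the dotted key paths; the shipped config.yaml is authoritative):

```yaml
anthropic:
  model: claude-sonnet-4-6
  max_tokens: 2048
vector_store:
  persist_dir: .chroma_db
  embedding_model: all-MiniLM-L6-v2
retriever:
  top_k_tables: 4
duckdb:
  data_dir: sample_catalog/data
  max_rows: 500
```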

Known Limitations

| Limitation | Detail |
|------------|--------|
| Redshift ↔ DuckDB dialect gaps | DATEADD and some Redshift-specific window functions may not execute correctly in DuckDB; results are best-effort for demo data |
| NVL translation | NVL(x, y) is rewritten to COALESCE(x, y) at execution time, and GETDATE() / SYSDATE become CURRENT_TIMESTAMP |
| No multi-database joins | The retriever returns tables across schemas, but DuckDB views are flat; cross-database references will fail at execution |
| Embedding cold start | The first request downloads all-MiniLM-L6-v2 (~90 MB) if it is not already cached |
| Context window | Very wide tables (100+ columns) may push the prompt close to Claude's context limit; consider reducing top_k_tables |
| No streaming | The /query endpoint waits for the full Claude response before returning |
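The NVL and GETDATE/SYSDATE rewrites can be approximated with a couple of regex substitutions. The project itself performs this translation via sqlglot, so the stdlib-only version below is only an illustration of the mapping, not the validator's code:

```python
import re

# Illustrative Redshift -> DuckDB rewrites (the real validator uses sqlglot):
#   NVL(x, y)           -> COALESCE(x, y)
#   GETDATE() / SYSDATE -> CURRENT_TIMESTAMP
def rewrite_for_duckdb(sql: str) -> str:
    sql = re.sub(r"\bNVL\s*\(", "COALESCE(", sql, flags=re.IGNORECASE)
    sql = re.sub(r"\bGETDATE\s*\(\s*\)|\bSYSDATE\b", "CURRENT_TIMESTAMP",
                 sql, flags=re.IGNORECASE)
    return sql

print(rewrite_for_duckdb("SELECT NVL(discount, 0), GETDATE() FROM silver_orders"))
# → SELECT COALESCE(discount, 0), CURRENT_TIMESTAMP FROM silver_orders
```

A regex pass like this is fragile (it would also rewrite matches inside string literals), which is exactly why the project parses the SQL with sqlglot instead.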

Running Tests

pytest -v --cov=. --cov-report=term-missing

Target: ≥85% coverage across all modules.
