A production-ready Text-to-SQL system that uses Retrieval-Augmented Generation (RAG) over a dbt catalog to answer business questions with validated, executable SQL.
```mermaid
flowchart LR
    Q([User Question]):::input --> R[Retriever\nVector Search]
    R --> CB[Context Builder\nTable + FK docs]
    CB --> PB[Prompt Builder\nSystem + user msg]
    PB --> CL[Claude\nSQL generation]
    CL --> V[SQL Validator\nsqlglot]
    V --> DB[DuckDB Executor\nParquet files]
    DB --> UI([Results + UI]):::output
    subgraph RAG Layer
        IDX[(ChromaDB\nVector Store)] --> R
    end
    subgraph Catalog
        CAT[catalog.json\nmanifest.json\nschema.yml] --> IDX
    end
    classDef input fill:#161b22,stroke:#58a6ff,color:#c9d1d9
    classDef output fill:#161b22,stroke:#3fb950,color:#c9d1d9
```
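The flow in the diagram reduces to a short chain of calls. The sketch below is illustrative only — every function here is a hypothetical stand-in (the real pipeline lives in `sql_generator/generator.py` and the `rag/` modules, and calls Claude and DuckDB rather than returning canned values):

```python
# Illustrative sketch of the request flow from the diagram above.
# All function bodies are stubs; names are hypothetical, not the repo's API.

def retrieve_tables(question: str) -> list[str]:
    # Vector search over the ChromaDB index (stubbed here)
    return ["silver_orders", "silver_users"]

def build_prompt(question: str, tables: list[str]) -> str:
    # Context builder + prompt builder: table docs plus the user question
    docs = "\n".join(f"-- table: {t}" for t in tables)
    return f"{docs}\n-- question: {question}"

def generate_sql(prompt: str) -> str:
    # Claude call (stubbed): returns candidate SQL for validation
    return "SELECT COUNT(*) FROM silver_orders"

def answer(question: str) -> dict:
    tables = retrieve_tables(question)
    sql = generate_sql(build_prompt(question, tables))
    # sqlglot validation and DuckDB execution would run here before returning
    return {"sql": sql, "retrieved_tables": tables}
```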
```bash
# 1. Install
pip install -e ".[dev]"

# 2. Generate sample data
python sample_catalog/generate_parquet.py

# 3. Set your Anthropic API key
export ANTHROPIC_API_KEY=sk-ant-...

# 4. Run
uvicorn api.main:app --reload
# → open http://localhost:8000/ui
```

```text
text-to-sql-rag/
├── api/                           FastAPI app + routes
│   ├── main.py                    App factory, lifespan (startup index)
│   └── routes/
│       ├── query.py               POST /api/v1/query, /query/explain
│       ├── catalog.py             GET/POST /api/v1/catalog/*
│       └── health.py              GET /health
├── rag/
│   ├── vector_store.py            ChromaDB + sentence-transformers
│   ├── indexer.py                 Builds the vector index from CatalogIndex
│   ├── retriever.py               Semantic table retrieval
│   └── context_builder.py         Formats retrieved metadata for the prompt
├── sql_generator/
│   ├── generator.py               Orchestrates retrieve → prompt → Claude → parse
│   ├── prompt_builder.py          System prompt + user message construction
│   └── validator.py               sqlglot-based SQL validation
├── catalog/
│   ├── dbt_catalog_parser.py      Merges catalog.json + manifest.json + schema.yml
│   ├── schema_loader.py           YAML loader helper
│   └── relationship_extractor.py  FK graph between retrieved tables
├── integrations/
│   ├── anthropic_client.py        Claude API wrapper
│   └── duckdb_executor.py         Parquet → DuckDB views, SQL execution
├── models/
│   ├── catalog_models.py          ColumnMeta, TableMeta, CatalogIndex
│   └── query_models.py            Request/response Pydantic models
├── sample_catalog/
│   ├── catalog.json               dbt catalog (6 e-commerce tables)
│   ├── manifest.json              dbt manifest (descriptions, tags, owners)
│   ├── schema.yml                 Column tests → PK/FK inference
│   ├── generate_parquet.py        Seed script (100-row realistic Parquet files)
│   └── data/                      *.parquet files (auto-generated)
├── ui/index.html                  Self-contained dark-theme SPA
├── tests/                         pytest suite (85%+ coverage target)
├── config.yaml                    All runtime settings
└── pyproject.toml
```
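At its core, `retriever.py`'s semantic table retrieval is nearest-neighbour search over embeddings. A stdlib-only sketch of the ranking step — the real code uses sentence-transformers embeddings and a ChromaDB query, and the two-dimensional toy vectors below are made up for illustration:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k_tables(query_vec: list[float], table_vecs: dict, k: int = 4) -> list[str]:
    # Rank tables by similarity to the question embedding, keep the top k
    ranked = sorted(table_vecs, key=lambda t: cosine(query_vec, table_vecs[t]),
                    reverse=True)
    return ranked[:k]
```

In the real system `k` comes from `retriever.top_k_tables` in `config.yaml` (default 4).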
| Method | Path | Description |
|---|---|---|
| POST | /api/v1/query | Translate question → SQL → execute |
| POST | /api/v1/query/explain | Plain-English explanation of SQL |
| GET | /api/v1/catalog/tables | List all indexed tables |
| GET | /api/v1/catalog/tables/{name} | Full TableMeta for one table |
| POST | /api/v1/catalog/rebuild | Re-index from source catalog files |
| GET | /health | Liveness check |
```jsonc
// Request
{ "question": "What was total revenue by country last month?", "execute": true, "max_results": 100 }

// Response
{
  "sql": "WITH paid AS (...) SELECT ...",
  "validation": { "is_valid": true, "errors": [], "normalized_sql": "..." },
  "result": { "columns": ["country_code", "revenue"], "rows": [...], "row_count": 10, "truncated": false },
  "cannot_answer": false,
  "retrieved_tables": ["silver_orders", "gold_revenue_daily"],
  "metadata": { "tokens_used": 312, "latency_ms": 1840 }
}
```

**Q: What was total revenue by country last month?**

```sql
WITH paid_orders AS (
    SELECT o.country_code, o.total_amount
    FROM silver_orders AS o
    WHERE o.is_paid = TRUE
      AND DATE_TRUNC('month', o.created_at) = DATE_TRUNC('month', DATEADD(month, -1, CURRENT_DATE))
)
SELECT p.country_code, SUM(p.total_amount) AS total_revenue
FROM paid_orders AS p
GROUP BY p.country_code
ORDER BY total_revenue DESC
```

**Q: What are the top 10 products by total quantity sold?**

```sql
SELECT p.name, p.category, SUM(oi.quantity) AS total_qty
FROM silver_order_items AS oi
JOIN silver_products AS p ON p.product_id = oi.product_id
GROUP BY p.product_id, p.name, p.category
ORDER BY total_qty DESC
LIMIT 10
```

**Q: How many daily active users are there by device type this week?**

```sql
SELECT DATE_TRUNC('day', s.started_at) AS day,
       s.device_type,
       COUNT(DISTINCT s.user_id) AS active_users
FROM silver_sessions AS s
WHERE s.started_at >= DATE_TRUNC('week', CURRENT_DATE)
GROUP BY 1, 2
ORDER BY 1, 2
```

**Q: What is the average order value for each product category?**

```sql
SELECT p.category,
       AVG(o.total_amount) AS avg_order_value,
       COUNT(DISTINCT o.order_id) AS order_count
FROM silver_orders AS o
JOIN silver_order_items AS oi ON oi.order_id = o.order_id
JOIN silver_products AS p ON p.product_id = oi.product_id
WHERE o.is_paid = TRUE
GROUP BY p.category
ORDER BY avg_order_value DESC
```

**Q: How many users created in the last 30 days are still active?**

```sql
SELECT COUNT(u.user_id) AS new_users,
       SUM(CASE WHEN u.is_active THEN 1 ELSE 0 END) AS active_new_users
FROM silver_users AS u
WHERE u.created_at >= DATEADD(day, -30, CURRENT_DATE)
```
1. Export from dbt:

   ```bash
   dbt docs generate   # produces target/catalog.json + target/manifest.json
   ```

2. Point `config.yaml` at your files:

   ```yaml
   catalog:
     catalog_json: "path/to/target/catalog.json"
     manifest_json: "path/to/target/manifest.json"
     schema_ymls:
       - "path/to/models/schema.yml"
       - "path/to/models/staging/schema.yml"
   ```

3. Rebuild the index via the API:

   ```bash
   curl -X POST http://localhost:8000/api/v1/catalog/rebuild
   ```

   Or restart the server (the index is rebuilt on startup).

4. (Optional) Drop your own Parquet/CSV files into `sample_catalog/data/` and the DuckDB executor will register them automatically (filename stem = table name).
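The "filename stem = table name" convention amounts to one `CREATE VIEW` per file. A stdlib-only sketch of the statements the executor would issue — `view_statements` is a hypothetical helper, and the real `duckdb_executor.py` additionally runs these through a DuckDB connection:

```python
from pathlib import Path

def view_statements(data_dir: str) -> dict[str, str]:
    """Map each Parquet file's stem to the CREATE VIEW statement that would
    register it as a table (sketch only; the real executor runs these in DuckDB)."""
    stmts = {}
    for f in sorted(Path(data_dir).glob("*.parquet")):
        stmts[f.stem] = (
            f"CREATE OR REPLACE VIEW {f.stem} AS "
            f"SELECT * FROM read_parquet('{f.as_posix()}')"
        )
    return stmts
```

So `sample_catalog/data/silver_orders.parquet` becomes a view named `silver_orders`, queryable directly in generated SQL.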
| Key | Default | Description |
|---|---|---|
| `anthropic.model` | `claude-sonnet-4-6` | Claude model for generation |
| `anthropic.max_tokens` | `2048` | Max output tokens |
| `vector_store.persist_dir` | `.chroma_db` | ChromaDB storage path |
| `vector_store.embedding_model` | `all-MiniLM-L6-v2` | sentence-transformers model |
| `retriever.top_k_tables` | `4` | Tables retrieved per query |
| `duckdb.data_dir` | `sample_catalog/data` | Parquet files directory |
| `duckdb.max_rows` | `500` | Max rows returned per query |
| Limitation | Detail |
|---|---|
| Redshift ↔ DuckDB dialect gaps | DATEADD and some Redshift-specific window functions may not execute correctly in DuckDB; results are best-effort for demo data |
| NVL translation | NVL(x, y) is translated to COALESCE(x, y) at execution time, and GETDATE() / SYSDATE become CURRENT_TIMESTAMP |
| No multi-database joins | The retriever returns tables across schemas but DuckDB views are flat; cross-database references will fail at execution |
| Embedding cold start | First request downloads all-MiniLM-L6-v2 (~90 MB) if not cached |
| Context window | Very wide tables (100+ columns) may push the prompt close to Claude's context limit; consider reducing top_k_tables |
| No streaming | The /query endpoint waits for the full Claude response before returning |
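The NVL / GETDATE() / SYSDATE rewrites in the table above can be illustrated with a simplified stand-in. The real executor relies on sqlglot for dialect translation; this regex sketch covers only the substitutions named in the table and is not the project's actual rewriting logic:

```python
import re

def redshift_to_duckdb(sql: str) -> str:
    """Simplified stand-in for the Redshift→DuckDB rewriting described above.
    Covers only NVL → COALESCE and GETDATE()/SYSDATE → CURRENT_TIMESTAMP."""
    sql = re.sub(r"\bNVL\s*\(", "COALESCE(", sql, flags=re.IGNORECASE)
    sql = re.sub(r"\bGETDATE\s*\(\s*\)", "CURRENT_TIMESTAMP", sql, flags=re.IGNORECASE)
    sql = re.sub(r"\bSYSDATE\b", "CURRENT_TIMESTAMP", sql, flags=re.IGNORECASE)
    return sql
```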
```bash
pytest -v --cov=. --cov-report=term-missing
```

Target: ≥ 85% coverage across all modules.