VenusDataAI/text-to-sql-rag


Text-to-SQL RAG

A production-ready Text-to-SQL system that uses Retrieval-Augmented Generation (RAG) over a dbt catalog to answer business questions with validated, executable SQL.

Architecture

flowchart LR
    Q([User Question]):::input --> R[Retriever\nVector Search]
    R --> CB[Context Builder\nTable + FK docs]
    CB --> PB[Prompt Builder\nSystem + user msg]
    PB --> CL[Claude\nSQL generation]
    CL --> V[SQL Validator\nsqlglot]
    V --> DB[DuckDB Executor\nParquet files]
    DB --> UI([Results + UI]):::output

    subgraph RAG Layer
        IDX[(ChromaDB\nVector Store)] --> R
    end

    subgraph Catalog
        CAT[catalog.json\nmanifest.json\nschema.yml] --> IDX
    end

    classDef input  fill:#161b22,stroke:#58a6ff,color:#c9d1d9
    classDef output fill:#161b22,stroke:#3fb950,color:#c9d1d9
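The flow above can be sketched end-to-end in Python. Every function below is a stand-in stub, not the project's actual API; the real implementations live in rag/, sql_generator/, and integrations/.

```python
# Hypothetical sketch of the pipeline: retrieve -> context -> generate -> answer.
# All names and return values are illustrative stubs, not the project's code.

def retrieve_tables(question: str, top_k: int = 4) -> list[str]:
    # Real system: semantic search over the ChromaDB vector index.
    return ["silver_orders"]

def build_context(tables: list[str]) -> str:
    # Real system: formats table docs + FK relationships for the prompt.
    return "Table silver_orders(order_id, country_code, total_amount, ...)"

def generate_sql(question: str, context: str) -> str:
    # Real system: sends system + user messages to Claude; stubbed here.
    return "SELECT country_code, SUM(total_amount) FROM silver_orders GROUP BY 1"

def answer(question: str) -> dict:
    tables = retrieve_tables(question)
    context = build_context(tables)
    sql = generate_sql(question, context)
    # Real system: validates the SQL with sqlglot, then executes it via DuckDB.
    return {"sql": sql, "retrieved_tables": tables}
```

The validator and executor stages are deliberately left out of the stub; they sit between generation and the final response in the real pipeline.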

Quickstart (4 commands)

# 1. Install
pip install -e ".[dev]"

# 2. Generate sample data
python sample_catalog/generate_parquet.py

# 3. Set your Anthropic API key
export ANTHROPIC_API_KEY=sk-ant-...

# 4. Run
uvicorn api.main:app --reload
# → open http://localhost:8000/ui

Project Structure

text-to-sql-rag/
├── api/                        FastAPI app + routes
│   ├── main.py                 App factory, lifespan (startup index)
│   └── routes/
│       ├── query.py            POST /api/v1/query, /query/explain
│       ├── catalog.py          GET/POST /api/v1/catalog/*
│       └── health.py           GET /health
├── rag/
│   ├── vector_store.py         ChromaDB + sentence-transformers
│   ├── indexer.py              Builds the vector index from CatalogIndex
│   ├── retriever.py            Semantic table retrieval
│   └── context_builder.py      Formats retrieved metadata for the prompt
├── sql_generator/
│   ├── generator.py            Orchestrates retrieve → prompt → Claude → parse
│   ├── prompt_builder.py       System prompt + user message construction
│   └── validator.py            sqlglot-based SQL validation
├── catalog/
│   ├── dbt_catalog_parser.py   Merges catalog.json + manifest.json + schema.yml
│   ├── schema_loader.py        YAML loader helper
│   └── relationship_extractor.py  FK graph between retrieved tables
├── integrations/
│   ├── anthropic_client.py     Claude API wrapper
│   └── duckdb_executor.py      Parquet → DuckDB views, SQL execution
├── models/
│   ├── catalog_models.py       ColumnMeta, TableMeta, CatalogIndex
│   └── query_models.py         Request/response Pydantic models
├── sample_catalog/
│   ├── catalog.json            dbt catalog (6 e-commerce tables)
│   ├── manifest.json           dbt manifest (descriptions, tags, owners)
│   ├── schema.yml              Column tests → PK/FK inference
│   ├── generate_parquet.py     Seed script (100-row realistic Parquet files)
│   └── data/                   *.parquet files (auto-generated)
├── ui/index.html               Self-contained dark-theme SPA
├── tests/                      pytest suite (85 %+ coverage target)
├── config.yaml                 All runtime settings
└── pyproject.toml

API Reference

| Method | Path | Description |
|--------|------|-------------|
| POST | /api/v1/query | Translate question → SQL → execute |
| POST | /api/v1/query/explain | Plain-English explanation of SQL |
| GET | /api/v1/catalog/tables | List all indexed tables |
| GET | /api/v1/catalog/tables/{name} | Full TableMeta for one table |
| POST | /api/v1/catalog/rebuild | Re-index from source catalog files |
| GET | /health | Liveness check |

POST /api/v1/query

// Request
{ "question": "What was total revenue by country last month?", "execute": true, "max_results": 100 }

// Response
{
  "sql": "WITH paid AS (...) SELECT ...",
  "validation": { "is_valid": true, "errors": [], "normalized_sql": "..." },
  "result": { "columns": ["country_code","revenue"], "rows": [...], "row_count": 10, "truncated": false },
  "cannot_answer": false,
  "retrieved_tables": ["silver_orders", "gold_revenue_daily"],
  "metadata": { "tokens_used": 312, "latency_ms": 1840 }
}
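The endpoint can be called with nothing beyond the standard library. A minimal client sketch, assuming the Quickstart server is running on localhost:8000:

```python
# Minimal stdlib-only client for POST /api/v1/query.
import json
import urllib.request

def build_query_request(question: str, execute: bool = True,
                        max_results: int = 100) -> urllib.request.Request:
    # Mirrors the request body shown above.
    payload = {"question": question, "execute": execute, "max_results": max_results}
    return urllib.request.Request(
        "http://localhost:8000/api/v1/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    req = build_query_request("What was total revenue by country last month?")
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    print(body["sql"])
```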

Example Q&A Pairs

1. Revenue by country last month

Q: What was total revenue by country last month?

WITH paid_orders AS (
    SELECT o.country_code, o.total_amount
    FROM silver_orders AS o
    WHERE o.is_paid = TRUE
      AND DATE_TRUNC('month', o.created_at) = DATE_TRUNC('month', DATEADD(month, -1, CURRENT_DATE))
)
SELECT p.country_code, SUM(p.total_amount) AS total_revenue
FROM paid_orders AS p
GROUP BY p.country_code
ORDER BY total_revenue DESC

2. Top 10 products by quantity sold

Q: What are the top 10 products by total quantity sold?

SELECT p.name, p.category, SUM(oi.quantity) AS total_qty
FROM silver_order_items AS oi
JOIN silver_products AS p ON p.product_id = oi.product_id
GROUP BY p.product_id, p.name, p.category
ORDER BY total_qty DESC
LIMIT 10

3. Daily active users (mobile vs desktop)

Q: How many daily active users are there by device type this week?

SELECT DATE_TRUNC('day', s.started_at) AS day,
       s.device_type,
       COUNT(DISTINCT s.user_id) AS active_users
FROM silver_sessions AS s
WHERE s.started_at >= DATE_TRUNC('week', CURRENT_DATE)
GROUP BY 1, 2
ORDER BY 1, 2

4. Average order value by product category

Q: What is the average order value for each product category?

SELECT p.category,
       AVG(o.total_amount) AS avg_order_value,
       COUNT(DISTINCT o.order_id) AS order_count
FROM silver_orders AS o
JOIN silver_order_items AS oi ON oi.order_id = o.order_id
JOIN silver_products AS p ON p.product_id = oi.product_id
WHERE o.is_paid = TRUE
GROUP BY p.category
ORDER BY avg_order_value DESC

5. User cohort retention (new users vs active)

Q: How many users created in the last 30 days are still active?

SELECT COUNT(u.user_id) AS new_users,
       SUM(CASE WHEN u.is_active THEN 1 ELSE 0 END) AS active_new_users
FROM silver_users AS u
WHERE u.created_at >= DATEADD(day, -30, CURRENT_DATE)

Using Your Own dbt Catalog

  1. Export from dbt:

    dbt docs generate   # produces target/catalog.json + target/manifest.json
  2. Point config.yaml at your files:

    catalog:
      catalog_json: "path/to/target/catalog.json"
      manifest_json: "path/to/target/manifest.json"
      schema_ymls:
        - "path/to/models/schema.yml"
        - "path/to/models/staging/schema.yml"
  3. Rebuild the index via the API:

    curl -X POST http://localhost:8000/api/v1/catalog/rebuild

    Or restart the server (index is rebuilt on startup).

  4. (Optional) Drop your own Parquet/CSV files into sample_catalog/data/ and the DuckDB executor will register them automatically (filename stem = table name).
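The "filename stem = table name" convention from step 4 amounts to one CREATE VIEW per data file. A sketch of the SQL this produces (illustrative helper, not the executor's actual code; read_parquet is DuckDB's standard Parquet reader):

```python
# Build one DuckDB view per Parquet file; the view name is the filename stem,
# e.g. silver_orders.parquet -> view silver_orders. Illustrative only.
from pathlib import Path

def view_statements(data_dir: str) -> list[str]:
    stmts = []
    for path in sorted(Path(data_dir).glob("*.parquet")):
        table = path.stem
        stmts.append(
            f"CREATE OR REPLACE VIEW {table} AS SELECT * FROM read_parquet('{path}')"
        )
    return stmts
```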

Configuration (config.yaml)

| Key | Default | Description |
|-----|---------|-------------|
| anthropic.model | claude-sonnet-4-6 | Claude model for generation |
| anthropic.max_tokens | 2048 | Max output tokens |
| vector_store.persist_dir | .chroma_db | ChromaDB storage path |
| vector_store.embedding_model | all-MiniLM-L6-v2 | sentence-transformers model |
| retriever.top_k_tables | 4 | Tables retrieved per query |
| duckdb.data_dir | sample_catalog/data | Parquet files directory |
| duckdb.max_rows | 500 | Max rows returned per query |
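Put together, the keys above correspond to a config.yaml shaped roughly like this (nesting inferred from the dotted key paths; the shipped config.yaml is authoritative):

```yaml
anthropic:
  model: claude-sonnet-4-6
  max_tokens: 2048
vector_store:
  persist_dir: .chroma_db
  embedding_model: all-MiniLM-L6-v2
retriever:
  top_k_tables: 4
duckdb:
  data_dir: sample_catalog/data
  max_rows: 500
```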

Known Limitations

| Limitation | Detail |
|------------|--------|
| Redshift ↔ DuckDB dialect gaps | DATEADD and some Redshift-specific window functions may not execute correctly in DuckDB; results are best-effort for demo data |
| NVL translation | NVL(x, y) is rewritten to COALESCE(x, y) at execution time, and GETDATE() / SYSDATE become CURRENT_TIMESTAMP |
| No multi-database joins | The retriever returns tables across schemas, but DuckDB views are flat; cross-database references will fail at execution |
| Embedding cold start | The first request downloads all-MiniLM-L6-v2 (~90 MB) if it is not already cached |
| Context window | Very wide tables (100+ columns) may push the prompt close to Claude's context limit; consider reducing top_k_tables |
| No streaming | The /query endpoint waits for the full Claude response before returning |
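The NVL and GETDATE/SYSDATE rewrites can be approximated with a couple of regex substitutions. The project itself performs this translation via sqlglot, so the stdlib-only version below is only an illustration of the mapping, not the validator's code:

```python
import re

# Illustrative Redshift -> DuckDB rewrites (the real validator uses sqlglot):
#   NVL(x, y)           -> COALESCE(x, y)
#   GETDATE() / SYSDATE -> CURRENT_TIMESTAMP
def rewrite_for_duckdb(sql: str) -> str:
    sql = re.sub(r"\bNVL\s*\(", "COALESCE(", sql, flags=re.IGNORECASE)
    sql = re.sub(r"\bGETDATE\s*\(\s*\)|\bSYSDATE\b", "CURRENT_TIMESTAMP",
                 sql, flags=re.IGNORECASE)
    return sql

print(rewrite_for_duckdb("SELECT NVL(discount, 0), GETDATE() FROM silver_orders"))
# → SELECT COALESCE(discount, 0), CURRENT_TIMESTAMP FROM silver_orders
```

A regex pass like this is fragile (it would also rewrite matches inside string literals), which is exactly why the project parses the SQL with sqlglot instead.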

Running Tests

pytest -v --cov=. --cov-report=term-missing

Target: ≥85% coverage across all modules.
