This repository contains a natural language agent that connects to a SQL database, learns schema + domain context, and answers questions using a ReAct SQL loop.
It can:
- Connect to any SQL database supported by SQLAlchemy
- Introspect table/column schema automatically
- Bootstrap a domain YAML from natural language description
- Route tables/metrics and reason iteratively with grounded SQL evidence
- Enforce read-only SQL validation before query execution
| File | Role |
|---|---|
| src/nl2sql/agent.py | CLI entry point — interactive + single-question modes |
| src/nl2sql/db.py | DB connection, schema introspection, safe query execution |
| src/nl2sql/router.py | Two-stage routing — tables then metrics |
| src/nl2sql/react_engine.py | ReAct loop — hypothesis → SQL → observe → repeat |
| src/nl2sql/domain_bootstrap.py | Auto-generate domain YAML for unknown domains |
| domains/trading.yaml | Domain config for your trading DB |
- Install dependencies
uv pip install --python .venv/bin/python -r requirements.txt- Configure environment (
.env)
# LLM
GEMINI_API_KEY=your_key
# Preferred: one URL works for any SQLAlchemy-supported database
DATABASE_URL=postgresql+psycopg2://user:password@localhost:5432/dbname
# Backward-compatible PostgreSQL vars (used if DATABASE_URL is missing)
DATABASE_HOST=localhost
DATABASE_PORT=5432
DATABASE_NAME=dbName
DATABASE_USER=dbUser
DATABASE_PASSWORD=dbPasswordExamples for DATABASE_URL:
- PostgreSQL:
postgresql+psycopg2://dbUser:dbPassword@host:5432/dbName - MySQL:
mysql+pymysql://dbUser:dbPassword@host:3306/dbName - SQLite:
sqlite:///local.db - SQL Server:
mssql+pyodbc://dbUser:dbPassword@host/dbName?driver=ODBC+Driver+18+for+SQL+Server
PYTHONPATH=src python -m nl2sql.agentPYTHONPATH=src python -m nl2sql.agent -q "Why is my portfolio down?" --tracePYTHONPATH=src python -m nl2sql.agent --bootstrap "This is a hospital patient records database" --save-domain domains/generated.yamlPYTHONPATH=src .venv/bin/python -m uvicorn nl2sql.api:app --host 0.0.0.0 --port 8000 --reloadHealth check:
curl http://127.0.0.1:8000/healthzQuery endpoint:
curl -X POST http://127.0.0.1:8000/v1/query \
-H "Content-Type: application/json" \
-d '{
"question": "Why is my portfolio down?",
"domain_path": "domains/trading.yaml",
"include_trace": true
}'Bootstrap endpoint:
curl -X POST http://127.0.0.1:8000/v1/bootstrap \
-H "Content-Type: application/json" \
-d '{
"description": "This is a stock portfolio monitoring database",
"save_path": "domains/generated.yaml"
}'Flow:
- User asks a natural language question.
- Router chooses relevant tables and domain metrics.
- ReAct engine generates one SQL step at a time.
- SQL safety layer validates query as read-only.
- Database executes query and returns evidence.
- ReAct engine synthesizes grounded answer.
User Question
|
v
router.py (table + metric routing)
|
v
react_engine.py (ReAct loop)
|
v
db.py SQL guard (single-statement read-only)
|
v
Database via SQLAlchemy
|
v
Grounded Answer + Trace
- Set
DATABASE_URLand verify connectivity with interactive mode. - Bootstrap domain config and save it:
PYTHONPATH=src python -m nl2sql.agent --bootstrap "<your business domain>" --save-domain domains/generated.yaml - Start asking questions with trace mode enabled for verification.
- Promote/refine generated domain YAML into a curated domain file for your use case.
One-command live demo:
bash scripts/demo_recording.shDefault domain description can come from .env:
DATABASE_DOMAIN_DESC=Stock Portfolio MonitoringOptional domain override:
bash scripts/demo_recording.sh "This is a healthcare operations database"Precedence for demo domain description:
- First CLI argument to
scripts/demo_recording.sh DATABASE_DOMAIN_DESCfrom environment or.env- Built-in fallback (
This is a logistics operations database)
Optional interactive handoff at the end:
DEMO_INTERACTIVE=true bash scripts/demo_recording.sh- Bootstrap a domain for an unfamiliar database:
PYTHONPATH=src python -m nl2sql.agent --bootstrap "This is a logistics operations database" --save-domain domains/generated.yaml - Ask a root-cause question with trace:
PYTHONPATH=src python -m nl2sql.agent -q "Why did fulfillment delays spike this week?" --trace --domain domains/generated.yaml - Run interactive mode:
PYTHONPATH=src python -m nl2sql.agent --domain domains/generated.yaml - Ask broad + specific prompts:
What changed in the last 7 days?Which segment contributes most to the decline?What is the strongest hypothesis and supporting evidence?
Install dev dependencies:
uv pip install --python .venv/bin/python -e .[dev]Run tests:
.venv/bin/python -m pytest -q- Only single-statement read-only SQL is executed.
- DDL/DML statements (INSERT/UPDATE/DELETE/DROP/ALTER...) are blocked.
- Result rows are capped per query for stable CLI outputs.
A grounded NL-to-SQL agent that inspects your database schema, builds a lightweight domain model, and answers analytical questions using iterative, read-only SQL.
This project is built for cases where users want to ask business questions in natural language — including drill-down and root-cause style questions such as:
- Why is my portfolio down?
- What changed in the last 7 days?
- Which segment contributes most to the decline?
- What evidence supports that conclusion?
Rather than relying on one-shot text-to-SQL generation, this agent uses a routing + reasoning loop:
- understand the question
- identify likely tables and metrics
- generate one SQL step at a time
- validate the query as read-only
- inspect results
- continue until it has enough evidence to answer
This repository is a reasoning-based analytics agent for SQL databases. It combines:
- schema introspection to understand tables and columns
- domain bootstrapping to generate an initial semantic/domain YAML from a business description
- table and metric routing to narrow the search space
- iterative SQL reasoning to investigate questions step by step
- read-only SQL enforcement before execution
It is designed to help bridge the gap between raw database structure and business-facing analytical questions.
This project does not claim to create a full enterprise ontology or a complete knowledge graph automatically.
What it generates is better described as a lightweight semantic domain model — useful for grounding NL-to-SQL and reasoning, but still meant to be reviewed and refined for production use.
For serious enterprise usage, you should expect to add:
- curated metric definitions
- business-approved relationships and entity mappings
- role-aware access controls
- validation and observability around generated query plans
- governance over domain YAML evolution
- Connect to any SQL database supported by SQLAlchemy
- Introspect table and column schema automatically
- Bootstrap a domain YAML from a natural language description
- Route relevant tables and metrics for a given question
- Run a ReAct-style SQL loop with grounded evidence
- Enforce read-only SQL validation before execution
- Return a final answer with optional reasoning trace
| File | Responsibility |
|---|---|
src/nl2sql/agent.py |
CLI entry point for interactive and single-question execution |
src/nl2sql/db.py |
Database connection, schema introspection, and safe query execution |
src/nl2sql/router.py |
Two-stage routing for tables and metrics |
src/nl2sql/react_engine.py |
Iterative reasoning loop: hypothesis → SQL → observe → repeat |
src/nl2sql/domain_bootstrap.py |
Bootstraps a domain YAML for unfamiliar databases |
src/nl2sql/api.py |
FastAPI server exposing health, query, and bootstrap endpoints |
domains/trading.yaml |
Example domain config for a trading/portfolio database |
The agent reads schema metadata from the target SQL database using SQLAlchemy.
For known use cases, it can use a curated domain YAML. For unknown databases, it can bootstrap an initial domain YAML from a business description and schema structure.
The router narrows the search space by selecting likely tables and relevant metrics before SQL generation starts.
Instead of generating one large SQL query immediately, the agent works step by step:
- form a hypothesis
- run a read-only SQL query
- inspect returned evidence
- refine the next step
- synthesize an answer when confidence is sufficient
Every SQL statement is validated before execution. Only single-statement read-only SQL is allowed.
User Question
|
v
router.py (table + metric routing)
|
v
react_engine.py (iterative reasoning loop)
|
v
db.py SQL guard (single-statement read-only validation)
|
v
Database via SQLAlchemy
|
v
Grounded Answer + Optional Trace
uv pip install --python .venv/bin/python -r requirements.txtCreate a .env file:
# LLM
GEMINI_API_KEY=your_key
# Preferred: one URL works for any SQLAlchemy-supported database
DATABASE_URL=postgresql+psycopg2://user:password@localhost:5432/dbname
# Backward-compatible PostgreSQL vars (used if DATABASE_URL is missing)
DATABASE_HOST=localhost
DATABASE_PORT=5432
DATABASE_NAME=dbName
DATABASE_USER=dbUser
DATABASE_PASSWORD=dbPasswordExamples for DATABASE_URL:
- PostgreSQL:
postgresql+psycopg2://dbUser:dbPassword@host:5432/dbName - MySQL:
mysql+pymysql://dbUser:dbPassword@host:3306/dbName - SQLite:
sqlite:///local.db - SQL Server:
mssql+pyodbc://dbUser:dbPassword@host/dbName?driver=ODBC+Driver+18+for+SQL+Server
PYTHONPATH=src python -m nl2sql.agentPYTHONPATH=src python -m nl2sql.agent -q "Why is my portfolio down?" --tracePYTHONPATH=src python -m nl2sql.agent --bootstrap "This is a hospital patient records database" --save-domain domains/generated.yamlPYTHONPATH=src .venv/bin/python -m uvicorn nl2sql.api:app --host 0.0.0.0 --port 8000 --reloadcurl http://127.0.0.1:8000/healthzcurl -X POST http://127.0.0.1:8000/v1/query \
-H "Content-Type: application/json" \
-d '{
"question": "Why is my portfolio down?",
"domain_path": "domains/trading.yaml",
"include_trace": true
}'curl -X POST http://127.0.0.1:8000/v1/bootstrap \
-H "Content-Type: application/json" \
-d '{
"description": "This is a stock portfolio monitoring database",
"save_path": "domains/generated.yaml"
}'- Set
DATABASE_URLand verify connectivity in interactive mode. - Bootstrap an initial domain YAML:
PYTHONPATH=src python -m nl2sql.agent --bootstrap "<your business domain>" --save-domain domains/generated.yaml - Ask questions with
--traceenabled to inspect reasoning and evidence. - Refine the generated YAML into a curated domain file for your real use case.
- Reuse the curated domain file for more consistent routing and answers.
- Bootstrap a domain for an unfamiliar database:
PYTHONPATH=src python -m nl2sql.agent --bootstrap "This is a logistics operations database" --save-domain domains/generated.yaml - Ask a root-cause question:
PYTHONPATH=src python -m nl2sql.agent -q "Why did fulfillment delays spike this week?" --trace --domain domains/generated.yaml - Continue in interactive mode:
PYTHONPATH=src python -m nl2sql.agent --domain domains/generated.yaml
- Try follow-up prompts such as:
What changed in the last 7 days?Which segment contributes most to the decline?What is the strongest hypothesis and supporting evidence?
One-command live demo:
bash scripts/demo_recording.shDefault domain description can come from .env:
DATABASE_DOMAIN_DESC=Stock Portfolio MonitoringOptional override:
bash scripts/demo_recording.sh "This is a healthcare operations database"Precedence for the demo domain description:
- First CLI argument to
scripts/demo_recording.sh DATABASE_DOMAIN_DESCfrom environment or.env- Built-in fallback:
This is a logistics operations database
Optional interactive handoff at the end:
DEMO_INTERACTIVE=true bash scripts/demo_recording.shInstall dev dependencies:
uv pip install --python .venv/bin/python -e .[dev]Run tests:
.venv/bin/python -m pytest -q- Only single-statement read-only SQL is executed
- DDL/DML statements such as
INSERT,UPDATE,DELETE,DROP, andALTERare blocked - Result rows are capped per query for stable outputs and safer inspection
Most NL-to-SQL demos stop at query generation. Real analytical questioning needs more than that.
Users do not just ask:
- “Show me table X”
They ask:
- “Why did this metric move?”
- “What changed recently?”
- “Which factor contributed most?”
- “What evidence supports this answer?”
That requires iterative investigation, not just SQL autocomplete.
This repo is an attempt to move from text-to-query toward grounded analytical reasoning over structured data.
A practical way to describe this repo:
A guarded NL-to-SQL reasoning agent that learns schema and domain context, then answers analytical questions using iterative, evidence-backed SQL.
A blunter version:
Not magic database intelligence — a controlled analytics agent that turns business questions into grounded SQL investigation steps.