Skip to content

bharatr21/text2sql

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

4 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Text2SQL

Python 3.11+ FastAPI LangChain uv

A modern, production-ready Text-to-SQL system built with LangChain, LangGraph, and FastAPI. Designed for social science research with NORP (https://norpanel.org/) datasets including crime, homelessness, population, and economic data. Convert natural language questions into SQL queries with conversational context and session management.

✨ Features

  • πŸ€– LangGraph-powered SQL Agent - Intelligent workflow for reliable SQL generation
  • πŸš€ Async-First Architecture - Built for performance with async/await throughout
  • πŸ—ƒοΈ SQLite Integration - Lightweight, fast database operations with aiosqlite
  • πŸ“Š NORP Social Science Data - Pre-loaded with US shootings, NYC crime, homelessness, economic, and population datasets
  • πŸ’¬ Conversational Interface - Maintains context across questions with Redis sessions
  • πŸ”Œ Multi-LLM Support - Works with OpenAI, Anthropic, and Together AI
  • πŸ›‘οΈ Type Safety - Full type hints and Pydantic validation
  • πŸ“Š Health Monitoring - Built-in health checks and structured logging
  • πŸ§ͺ Comprehensive Testing - Unit and integration tests with pytest
  • πŸ“š REST API - Complete FastAPI application with automatic docs

πŸš€ Quick Start

Prerequisites

  • Python 3.11 or higher
  • uv package manager
  • Redis server (optional, for session management)

Installation

# Clone the repository
git clone https://github.com/bharatr21/text2sql.git
cd text2sql

# Install dependencies with uv
uv sync

# Activate the virtual environment
source .venv/bin/activate  # Linux/macOS
# or
.venv\Scripts\activate  # Windows

Configuration

  1. Copy the example environment file:
cp .env.example .env
  1. Edit .env with your settings:
# Database
DB_URL=sqlite:///data/sample.db

# LLM Configuration
LLM_PROVIDER=openai
LLM_MODEL=gpt-4
OPENAI_API_KEY=your-api-key-here

# Optional: Redis for sessions
REDIS_HOST=localhost
REDIS_PORT=6379

Create NORP Sample Database

# Create a sample SQLite database with NORP social science data
uv run text2sql create-db --db-path data/sample.db

# View available sample questions
uv run text2sql sample-questions

Start the Server

# Development mode with auto-reload
uv run text2sql serve --reload

# Production mode
uv run text2sql serve --host 0.0.0.0 --port 8000 --workers 4

The API will be available at http://localhost:8000 with interactive docs at http://localhost:8000/docs.

πŸ’» Usage

CLI Interface

# Send a query via CLI
uv run text2sql query --question "How many shooting incidents occurred in New York?"

# View sample questions
uv run text2sql sample-questions --limit 5

# Check system health
uv run text2sql health

# List active sessions
uv run text2sql sessions

# Get session details
uv run text2sql sessions --session-id your-session-id

API Examples

Query with curl

curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{
    "question": "Which state had the highest number of victims killed in shootings?",
    "session_id": "user-123",
    "message_type": "human"
  }'

Query with Python

import httpx

async with httpx.AsyncClient() as client:
    response = await client.post(
        "http://localhost:8000/query",
        json={
            "question": "What is the number of individuals experiencing homelessness in California?",
            "session_id": "user-123",
            "message_type": "human"
        }
    )
    print(response.json())

Sample Questions

Try these questions with the NORP sample database:

Population & Demographics:

  • "What was the population of California in 2020?"
  • "Which state had the highest population in 2020?"
  • "What is the population of Los Angeles County?"

Crime & Public Safety:

  • "How many shooting incidents occurred in New York?"
  • "Which state had the highest number of victims killed in shootings?"
  • "Get all criminal records from NYC where the crime classification is 'Felony'"
  • "How many incidents of Assault occurred in Manhattan?"

Social Issues:

  • "What is the number of individuals experiencing homelessness in California?"
  • "Which counties have the highest poverty rates?"
  • "List the top 5 locations with the highest number of homeless individuals"

Correlational Analysis:

  • "Compare the population of New York and Texas over the last three census years"
  • "Get the number of shooting incidents per 1 million residents in each state"
  • "Which areas with high homelessness rates also have high crime rates?"

πŸ—οΈ Architecture

Project Structure

src/text2sql/
β”œβ”€β”€ core/           # App configuration, logging, FastAPI setup
β”œβ”€β”€ agents/         # LangGraph SQL agent
β”œβ”€β”€ services/       # Database, LLM, Redis, Session services
β”œβ”€β”€ models/         # Pydantic schemas
β”œβ”€β”€ utils/          # Database utilities and helpers
└── cli.py          # Command-line interface

tests/
β”œβ”€β”€ unit/           # Unit tests
β”œβ”€β”€ integration/    # Integration tests
└── conftest.py     # Test configuration

Key Components

  • SQLAgent: LangGraph state machine for SQL generation workflow
  • DatabaseService: Async SQLAlchemy 2.0 database operations
  • LLMService: Multi-provider LLM interface (OpenAI, Anthropic, Together)
  • SessionService: Redis-based conversation management
  • FastAPI App: Modern async REST API with dependency injection

πŸ§ͺ Testing

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=src/text2sql --cov-report=html

# Run only unit tests
uv run pytest tests/unit/ -m unit

# Run only integration tests
uv run pytest tests/integration/ -m integration

πŸ”§ Development

Code Quality

# Linting
uv run ruff check src/ tests/

# Auto-fix issues
uv run ruff check --fix src/ tests/

# Type checking
uv run mypy src/

# Format code
uv run ruff format src/ tests/

Adding New LLM Providers

  1. Install the provider's LangChain integration
  2. Add provider configuration to LLMSettings
  3. Implement provider initialization in LLMService._create_*_llm()
  4. Update tests and documentation

NORP Dataset Information

The system includes comprehensive social science datasets from NORP:

Available Tables:

  • us_shootings - Gun violence incidents across the US with casualty data
  • nyc_crime_data - New York City crime incidents with classifications and locations
  • homelessness_demographics - Homelessness counts by state, year, and age group
  • economic_income_and_benefits - Household income data by zipcode and year
  • us_population - State population data from US Census
  • us_population_county - County-level population data
  • food_access - Food security and access metrics by census tract

Database Support

While optimized for SQLite, the system can work with other databases:

  1. Install appropriate async driver (e.g., asyncpg for PostgreSQL)
  2. Update DB_URL in configuration
  3. Modify DatabaseService.async_engine for database-specific settings

πŸ“š API Documentation

Once the server is running, visit:

  • Interactive API Docs: http://localhost:8000/docs
  • ReDoc Documentation: http://localhost:8000/redoc
  • OpenAPI Schema: http://localhost:8000/openapi.json

Main Endpoints

  • POST /query - Execute natural language query
  • GET /health - System health check
  • GET /sessions - List active sessions
  • GET /sessions/{id}/history - Get conversation history
  • GET /database/tables - List database tables
  • GET /database/schema - Get database schema

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Run tests (uv run pytest)
  5. Run code quality checks (uv run ruff check src/)
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • LangChain for the excellent LLM framework
  • LangGraph for agent workflow orchestration
  • FastAPI for the modern web framework
  • uv for fast Python package management

πŸ“§ Contact

Bharat Raghunathan - bharatraghunthan9767@gmail.com

Project Link: https://github.com/bharatr21/text2sql

About

Text-to-SQL system to interact with NORP and Metabase: https://norpanel.org/about/ and https://www.metabase.com/

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages