
🧠 SQL AI Engine

Advanced SQL Query Analyzer, Optimizer & Natural Language-to-SQL Engine

Compiler Design Group Project – B.Tech CSE


📁 Project Structure

sql_ai_engine/
│
├── backend/
│   ├── api/
│   │   └── main.py              ← FastAPI REST API (all endpoints)
│   ├── lexer/
│   │   └── tokenizer.py         ← Lexical Analyzer (tokenizer)
│   ├── parser/
│   │   └── sql_parser.py        ← Recursive-Descent Parser → AST
│   ├── semantic/
│   │   └── semantic_analyzer.py ← Semantic Validator
│   ├── optimizer/
│   │   └── query_optimizer.py   ← Rule-Based + Cost-Based Optimizer
│   ├── execution/
│   │   └── executor.py          ← SQLite Execution Engine
│   └── nlp/
│       ├── nl_to_sql.py         ← Natural Language → SQL
│       └── explainer.py         ← SQL → Human Explanation
│
├── frontend/
│   └── app.py                   ← Streamlit UI
│
├── database/
│   ├── schema.sql               ← Table definitions + seed data
│   └── init_db.py               ← DB initializer + schema inspector
│
├── utils/
│   └── helpers.py               ← Shared utilities
│
├── tests/
│   └── test_all.py              ← Full test suite (39 tests)
│
├── requirements.txt
└── README.md                    ← This file

⚙️ Tech Stack

Layer         Technology
-----------   ------------------------------------------
Backend API   FastAPI + Uvicorn
Database      SQLite
Parser        Hand-written Recursive Descent (Python)
NLP           Rule-based + OpenAI API (optional)
Frontend      Streamlit

🚀 Quick Start

1. Install dependencies

pip install -r requirements.txt

2. Initialize the database

python database/init_db.py

3. Start the FastAPI backend

uvicorn backend.api.main:app --reload --port 8000

4. Start the Streamlit frontend (new terminal)

streamlit run frontend/app.py

5. Run tests

python tests/test_all.py

Then open http://localhost:8501 (the Streamlit UI) in your browser.


🔌 API Endpoints

Method   Endpoint         Description
------   --------------   -------------------------------
GET      /health          Health check
GET      /schema          Live database schema
GET      /tables/{name}   Preview table rows
POST     /analyze         Full pipeline (master endpoint)
POST     /tokenize        Lexical analysis only
POST     /parse           Parser → AST
POST     /semantic        Semantic validation
POST     /optimize        Query optimizer
POST     /execute         Execute SQL
POST     /explain         Human-readable explanation
POST     /nl-to-sql       Natural Language → SQL

🧩 Module Explanations

1. Lexer (tokenizer.py)

Converts raw SQL into a stream of typed tokens using regex.

Token types: KEYWORD, IDENTIFIER, OPERATOR, LITERAL_NUM, LITERAL_STR, PUNCTUATION, WILDCARD

Example:

Input:  SELECT name FROM students WHERE marks > 80
Tokens: [KEYWORD:SELECT] [IDENTIFIER:name] [KEYWORD:FROM]
        [IDENTIFIER:students] [KEYWORD:WHERE] [IDENTIFIER:marks]
        [OPERATOR:>] [LITERAL_NUM:80]
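
A minimal version of such a regex tokenizer can be sketched as follows (an illustrative sketch, not the full tokenizer.py; the token names mirror the list above):

import re

# Order matters: keywords must be tried before the generic IDENTIFIER rule.
TOKEN_SPEC = [
    ("KEYWORD",     r"\b(?:SELECT|FROM|WHERE|INSERT|UPDATE|DELETE|JOIN|ON|AND|OR)\b"),
    ("LITERAL_NUM", r"\d+(?:\.\d+)?"),
    ("LITERAL_STR", r"'[^']*'"),
    ("OPERATOR",    r"<=|>=|!=|=|<|>"),
    ("WILDCARD",    r"\*"),
    ("PUNCTUATION", r"[(),;.]"),
    ("IDENTIFIER",  r"[A-Za-z_]\w*"),
    ("SKIP",        r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC), re.IGNORECASE)

def tokenize(sql):
    """Yield (type, text) pairs, skipping whitespace."""
    for m in MASTER.finditer(sql):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokenize("SELECT name FROM students WHERE marks > 80")))
# [('KEYWORD', 'SELECT'), ('IDENTIFIER', 'name'), ('KEYWORD', 'FROM'), ...]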

2. Parser (sql_parser.py)

Implements a Recursive Descent Parser using the token stream. Builds an Abstract Syntax Tree (AST) as a nested Python dict.

Supports: SELECT, INSERT, UPDATE, DELETE, CREATE TABLE
Features: JOINs, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT, subexpressions, functions

Example AST (simplified):

{
  "node": "SELECT",
  "columns": [{"node": "COLUMN", "expr": {"node": "IDENTIFIER", "name": "name"}}],
  "from_clause": {"node": "TABLE_REF", "name": "students"},
  "where": {"node": "COMPARE", "op": ">", "left": {"node": "IDENTIFIER", "name": "marks"}, "right": {"node": "LITERAL", "value": "80"}}
}
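
The recursive-descent pattern behind this AST fits in a few methods. A simplified sketch (single-column SELECT only, upper-case SQL assumed; the real sql_parser.py handles far more):

class Parser:
    """One method per grammar rule; methods call each other recursively."""
    def __init__(self, tokens):
        self.toks, self.i = list(tokens), 0

    def _peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else ("EOF", "")

    def _eat(self, ttype):
        kind, text = self._peek()
        if kind != ttype:
            raise SyntaxError(f"expected {ttype}, got {text!r}")
        self.i += 1
        return text

    def parse_select(self):
        self._eat("KEYWORD")                      # SELECT
        cols = [{"node": "COLUMN", "expr": {"node": "IDENTIFIER", "name": self._eat("IDENTIFIER")}}]
        self._eat("KEYWORD")                      # FROM
        node = {"node": "SELECT", "columns": cols,
                "from_clause": {"node": "TABLE_REF", "name": self._eat("IDENTIFIER")}}
        if self._peek() == ("KEYWORD", "WHERE"):  # optional WHERE clause
            self._eat("KEYWORD")
            node["where"] = {"node": "COMPARE",
                             "left": {"node": "IDENTIFIER", "name": self._eat("IDENTIFIER")},
                             "op": self._eat("OPERATOR"),
                             "right": {"node": "LITERAL", "value": self._eat("LITERAL_NUM")}}
        return node

# Parser(tokenize("SELECT name FROM students WHERE marks > 80")).parse_select()
# → the AST shown above (tokenize is the lexer sketch from section 1)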

3. Semantic Analyzer (semantic_analyzer.py)

Walks the AST and validates it against the live database schema (a minimal sketch follows the list):

  • Table existence check
  • Column existence check (per table, with alias resolution)
  • Warns about SELECT * usage
  • Suggests corrections for errors
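
A stripped-down version of the table and column checks might look like this (the schema dict shape is an assumption for illustration, not the exact get_schema_info() output):

def check_select(ast, schema):
    """schema example (assumed shape): {"students": ["id", "name", "marks", "course", "city"]}"""
    errors, warnings = [], []
    table = ast["from_clause"]["name"]
    if table not in schema:
        errors.append(f"Unknown table '{table}'")
        return errors, warnings                    # no point checking columns
    for col in ast["columns"]:
        name = col["expr"]["name"]
        if name == "*":
            warnings.append("SELECT * used; prefer explicit columns")
        elif name not in schema[table]:
            errors.append(f"Unknown column '{name}' in table '{table}'")
    return errors, warnings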

4. Query Optimizer (query_optimizer.py)

Rule-Based Optimization (RBO):

  • Rule 1: Expand SELECT * → explicit columns
  • Rule 2: Remove tautologies (WHERE 1=1, WHERE TRUE)
  • Rule 3: Confirm predicate pushdown (before JOIN)
  • Rule 4: Warn if no LIMIT on potentially large scans
  • Rule 5: Flag NOT IN → suggest NOT EXISTS
  • Rule 6: Rewrite col=X OR col=Y → col IN (X, Y)
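
Rules 2 and 6, for instance, reduce to small rewrites. A toy string-level sketch (the actual optimizer works on the AST, not on raw SQL):

import re

def remove_tautologies(sql):
    # WHERE 1=1 AND ... → WHERE ... ; a bare trailing WHERE 1=1 is dropped
    sql = re.sub(r"\bWHERE\s+1\s*=\s*1\s+AND\b", "WHERE", sql, flags=re.IGNORECASE)
    return re.sub(r"\s+WHERE\s+1\s*=\s*1\s*(;?)$", r"\1", sql, flags=re.IGNORECASE)

def or_to_in(sql):
    # col = X OR col = Y → col IN (X, Y)
    return re.sub(r"(\w+)\s*=\s*(\S+)\s+OR\s+\1\s*=\s*(\S+)",
                  r"\1 IN (\2, \3)", sql, flags=re.IGNORECASE)

print(remove_tautologies("SELECT name FROM students WHERE 1=1;"))
# SELECT name FROM students;
print(or_to_in("SELECT * FROM students WHERE city = 'Delhi' OR city = 'Pune'"))
# SELECT * FROM students WHERE city IN ('Delhi', 'Pune')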

Cost-Based Optimization (CBO):

  • Estimates full-scan cost = row_count × 1.0
  • Estimates index-scan cost = 5 + row_count × 0.1
  • Recommends cheaper access path
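
The two cost formulas translate directly into code; a sketch with illustrative function and field names:

def estimate_access_path(row_count, has_index):
    full_scan = row_count * 1.0        # must touch every row
    index_scan = 5 + row_count * 0.1   # fixed B-tree descent cost + fraction of rows
    if has_index and index_scan < full_scan:
        return {"plan": "index scan", "cost": index_scan, "saved": full_scan - index_scan}
    return {"plan": "full scan", "cost": full_scan, "saved": 0.0}

print(estimate_access_path(10_000, has_index=True))
# {'plan': 'index scan', 'cost': 1005.0, 'saved': 8995.0}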

Index Recommender:

  • Detects unindexed WHERE / JOIN / ORDER BY columns
  • Generates ready-to-run CREATE INDEX DDL

5. Execution Engine (executor.py)

  • Executes SQL on SQLite safely
  • Measures wall-clock time (ms)
  • Runs EXPLAIN QUERY PLAN for each query
  • Compares original vs optimized execution side-by-side
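
The heart of such an executor is a few lines of sqlite3 plus a timer (a sketch; the database path is illustrative, and error handling and the side-by-side comparison are omitted):

import sqlite3, time

def execute(db_path, sql):
    con = sqlite3.connect(db_path)
    try:
        t0 = time.perf_counter()
        rows = con.execute(sql).fetchall()
        elapsed_ms = (time.perf_counter() - t0) * 1000
        # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail); keep the detail text
        plan = [row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]
        return {"rows": rows, "time_ms": round(elapsed_ms, 3), "plan": plan}
    finally:
        con.close()

# execute("database/sql_ai.db", "SELECT name FROM students WHERE marks > 80")  # hypothetical path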

6. NL → SQL (nl_to_sql.py)

Two-tier approach:

  1. OpenAI GPT (if OPENAI_API_KEY is set) – high accuracy
  2. Rule-based fallback – covers 80% of common academic queries

Detects: table name, column names, WHERE conditions, aggregate functions, ORDER BY, LIMIT.
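
A toy version of the rule-based tier shows the idea (patterns are illustrative; the real nl_to_sql.py covers many more phrasings):

import re

def nl_to_sql(question, tables):
    q = question.lower()
    table = next((t for t in tables if t in q), None)   # table names come from the live schema
    if table is None:
        return None
    m = re.search(r"(\w+)\s+(?:greater|more|above)\s+(?:than\s+)?(\d+)", q)
    if m:
        return f"SELECT * FROM {table} WHERE {m.group(1)} > {m.group(2)};"
    if q.startswith("count"):
        return f"SELECT COUNT(*) FROM {table};"
    return f"SELECT * FROM {table};"

print(nl_to_sql("find students with marks above 80", ["students", "courses"]))
# SELECT * FROM students WHERE marks > 80;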


7. Explainer (explainer.py)

Converts SQL → plain English at two levels:

  • Simple: One-sentence summary
  • Detailed: Clause-by-clause walkthrough with emojis
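
The simple level is essentially a short walk over the AST rendered as one English sentence. A sketch reusing the AST shape from the parser section:

def explain_simple(ast):
    cols = ", ".join(c["expr"]["name"] for c in ast["columns"])
    text = f"Retrieves {cols} from '{ast['from_clause']['name']}'"
    if "where" in ast:
        w = ast["where"]
        text += f" where {w['left']['name']} {w['op']} {w['right']['value']}"
    return text + "."

# explain_simple(parsed_ast)
# → "Retrieves name from 'students' where marks > 80."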

🎤 Viva Questions & Answers

Compiler Design Theory

Q1. What is a compiler? How does this project relate to compiler design?
A: A compiler translates source code (high-level) into target code through phases: Lexical Analysis → Syntax Analysis → Semantic Analysis → Optimization → Code Generation. This project implements exactly those phases for SQL: the tokenizer is the lexer, the recursive-descent parser builds an AST (syntax analysis), the semantic analyzer validates meaning, and the optimizer transforms the query. Execution is analogous to code generation/interpretation.

Q2. What is a token? What are the token types in your lexer?
A: A token is the smallest meaningful unit of a language. Our tokenizer recognizes: KEYWORD (SELECT, WHERE…), IDENTIFIER (table/column names), OPERATOR (=, >, <…), LITERAL_NUM (numbers), LITERAL_STR (quoted strings), PUNCTUATION, and WILDCARD (*).

Q3. What is an Abstract Syntax Tree (AST)?
A: An AST is a tree representation of the syntactic structure of source code. Unlike a parse tree, it omits unnecessary grammar details. Each node represents a language construct (SELECT node, COMPARE node, LITERAL node, etc.). Our parser produces a JSON-serializable dict tree.

Q4. What parsing technique did you use and why?
A: Recursive Descent Parsing. Each grammar rule maps to a Python method (_parse_select, _parse_condition, etc.). It's easy to understand, implement, and debug. It naturally handles operator precedence through the method call hierarchy.

Q5. What is semantic analysis? What does your semantic analyzer check?
A: Semantic analysis checks meaning, not just syntax. Syntactically SELECT xyz FROM abc is valid but semantically wrong if table abc or column xyz don't exist. Our analyzer checks: table existence, column existence per table, alias resolution, and warns about anti-patterns like SELECT *.

Q6. What is predicate pushdown?
A: Moving WHERE filter conditions as close to the data source as possible, ideally before a JOIN. This reduces the number of rows that flow through the JOIN, improving performance. SQLite does this automatically; our optimizer confirms it and documents it.

Q7. What is the difference between Rule-Based and Cost-Based optimization?
A: Rule-Based Optimization (RBO) applies fixed transformation rules regardless of data (e.g., always expand SELECT *). Cost-Based Optimization (CBO) uses statistics about the data (row counts, index availability) to estimate the cost of different execution plans and picks the cheapest. Our project implements both.

Q8. What is an index? Why do indexes speed up queries?
A: An index is a separate data structure (usually a B-tree) that maps column values to row locations. Without an index, a query must scan every row (O(n)). With an index, it can jump directly to matching rows (O(log n)). Our optimizer detects unindexed filter/join columns and generates CREATE INDEX statements.
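
This is easy to demonstrate live with an in-memory SQLite table (self-contained and illustrative; exact plan wording varies by SQLite version):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, marks INTEGER)")
con.executemany("INSERT INTO students (marks) VALUES (?)", [(i % 100,) for i in range(10_000)])

def plan(sql):
    return [row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]

print(plan("SELECT * FROM students WHERE marks > 80"))
# ['SCAN students']  (full table scan)
con.execute("CREATE INDEX idx_students_marks ON students(marks)")
print(plan("SELECT * FROM students WHERE marks > 80"))
# ['SEARCH students USING INDEX idx_students_marks (marks>?)']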

Q9. What does EXPLAIN QUERY PLAN show?
A: It shows how SQLite intends to execute the query: which scan type (full table scan vs index scan), which index (if any) will be used, and how JOINs will be ordered. We surface this in the UI for both original and optimized queries.

Q10. How does your NL-to-SQL work?
A: Two tiers. The rule-based tier uses regex patterns to detect table names (matched against live schema), column names, comparison operators in natural language phrases ("greater than" → >), aggregate intents (count, average), and ordering cues. The OpenAI tier sends the schema and question to GPT-3.5 for higher accuracy.

Q11. What grammar does your parser implement?
A: A subset of SQL grammar:

statement   → select_stmt | insert_stmt | update_stmt | delete_stmt | create_stmt
select_stmt → SELECT [DISTINCT] column_list FROM table_ref [join]* [where] [group_by] [having] [order_by] [limit]
condition   → comparison { (AND|OR) comparison }*
comparison  → expression operator expression | expression IS [NOT] NULL | expression [NOT] IN (list) | expression BETWEEN e AND e
expression  → literal | identifier | table.column | function(args) | *

Q12. What is the role of the symbol table in compilers? What's its equivalent here?
A: In traditional compilers, the symbol table stores variable declarations, types, and scope. In our SQL compiler, get_schema_info() serves as the symbol table — it stores table names, column names, and data types. The semantic analyzer uses it the same way a compiler uses a symbol table: to validate identifiers.

Q13. What optimizations does your optimizer NOT implement that a production optimizer would?
A: Join reordering (choosing which table to scan first), materialized views, partition pruning, query caching, histogram-based selectivity estimation, adaptive query execution, and parallel execution plans.

Q14. Explain the pipeline flow for a Natural Language query.
A:

  1. User enters "find students with marks above 80"
  2. NL→SQL converts it → SELECT * FROM students WHERE marks > 80;
  3. Tokenizer → [SELECT][*][FROM][students][WHERE][marks][>][80]
  4. Parser → AST (SELECT node with WHERE COMPARE node)
  5. Semantic Analyzer → validates 'students' exists, 'marks' exists
  6. Optimizer → expands SELECT *, warns about missing LIMIT, estimates cost
  7. Executor → runs original + optimized, measures time, gets EXPLAIN PLAN
  8. Explainer → "Retrieves all columns from 'students' where marks > 80"
  9. UI displays all of the above in tabs

Q15. What data structures does your AST use?
A: Nested Python dictionaries. Each node is {"node": "TYPE", key: value, ...}. Lists represent multiple children (e.g., column list, join list). This is serializable to JSON, easy to traverse recursively, and human-readable. A production implementation might use typed dataclasses or a dedicated AST node class hierarchy.


📊 Sample Queries to Demo

-- Basic select with filter
SELECT name, marks FROM students WHERE marks > 80 ORDER BY marks DESC;

-- Join query
SELECT s.name, c.course_name FROM students s
JOIN courses c ON s.course = c.course_name WHERE s.marks > 85;

-- Aggregate
SELECT course, COUNT(*) as count, AVG(marks) as avg_marks
FROM students GROUP BY course HAVING AVG(marks) > 75;

-- SELECT * (triggers optimization)
SELECT * FROM students;

-- Tautology (triggers optimization)
SELECT name FROM students WHERE 1=1;

NL Queries:

  • "Show all students"
  • "Find students where marks greater than 80"
  • "Count students in Computer Science"
  • "Average marks of students"
  • "Who has the highest marks?"
  • "List students from Delhi"

👥 Team Contributions

Member     Module
--------   ------------------------------------------
Member 1   Lexer + Parser + Database
Member 2   Semantic Analyzer + AST Visualization
Member 3   Query Optimizer + Index Recommender
Member 4   NL→SQL + Explainer + FastAPI + Streamlit

🔑 Key Compiler Concepts Used

Concept                Where Used
--------------------   ------------------------------------------
Lexical Analysis       tokenizer.py – regex-based tokenizer
Syntax Analysis        sql_parser.py – recursive descent
Abstract Syntax Tree   AST dict returned by parser
Semantic Analysis      semantic_analyzer.py – schema checks
Symbol Table           get_schema_info() in init_db.py
Rule-Based Opt.        RuleBasedOptimizer class
Cost-Based Opt.        CostBasedOptimizer class
Code Generation        executor.py – SQL execution
Error Recovery         Errors + suggestions in semantic module
Intermediate Repr.     JSON AST (equivalent to IR in compilers)
