# Compiler Design Group Project – B.Tech CSE
```
sql_ai_engine/
│
├── backend/
│   ├── api/
│   │   └── main.py              ← FastAPI REST API (all endpoints)
│   ├── lexer/
│   │   └── tokenizer.py         ← Lexical Analyser (tokenizer)
│   ├── parser/
│   │   └── sql_parser.py        ← Recursive-Descent Parser → AST
│   ├── semantic/
│   │   └── semantic_analyzer.py ← Semantic Validator
│   ├── optimizer/
│   │   └── query_optimizer.py   ← Rule-Based + Cost-Based Optimizer
│   ├── execution/
│   │   └── executor.py          ← SQLite Execution Engine
│   └── nlp/
│       ├── nl_to_sql.py         ← Natural Language → SQL
│       └── explainer.py         ← SQL → Human Explanation
│
├── frontend/
│   └── app.py                   ← Streamlit UI
│
├── database/
│   ├── schema.sql               ← Table definitions + seed data
│   └── init_db.py               ← DB initializer + schema inspector
│
├── utils/
│   └── helpers.py               ← Shared utilities
│
├── tests/
│   └── test_all.py              ← Full test suite (39 tests)
│
├── requirements.txt
└── README.md                    ← This file
```
| Layer | Technology |
|---|---|
| Backend API | FastAPI + Uvicorn |
| Database | SQLite |
| Parser | Hand-written Recursive Descent (Python) |
| NLP | Rule-based + OpenAI API (optional) |
| Frontend | Streamlit |
```bash
pip install -r requirements.txt
python database/init_db.py
uvicorn backend.api.main:app --reload --port 8000
streamlit run frontend/app.py
python tests/test_all.py
```

Open http://localhost:8501 in your browser.
| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Health check |
| GET | `/schema` | Live database schema |
| GET | `/tables/{name}` | Preview table rows |
| POST | `/analyze` | Full pipeline (master endpoint) |
| POST | `/tokenize` | Lexical analysis only |
| POST | `/parse` | Parser → AST |
| POST | `/semantic` | Semantic validation |
| POST | `/optimize` | Query optimizer |
| POST | `/execute` | Execute SQL |
| POST | `/explain` | Human-readable explanation |
| POST | `/nl-to-sql` | Natural Language → SQL |
Converts raw SQL into a stream of typed tokens using regex.
Token types: KEYWORD, IDENTIFIER, OPERATOR, LITERAL_NUM, LITERAL_STR, PUNCTUATION, WILDCARD
Example:

```
Input:  SELECT name FROM students WHERE marks > 80
Tokens: [KEYWORD:SELECT] [IDENTIFIER:name] [KEYWORD:FROM]
        [IDENTIFIER:students] [KEYWORD:WHERE] [IDENTIFIER:marks]
        [OPERATOR:>] [LITERAL_NUM:80]
```
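A minimal sketch of how such a regex tokenizer can work (hypothetical; the real `tokenizer.py` may use different patterns and a larger keyword set):

```python
import re

# Toy keyword set for illustration; the project's lexer covers more of SQL.
KEYWORDS = {"SELECT", "FROM", "WHERE", "INSERT", "UPDATE", "DELETE",
            "CREATE", "TABLE", "AND", "OR", "ORDER", "BY", "GROUP", "LIMIT"}

TOKEN_SPEC = [
    ("LITERAL_STR", r"'[^']*'"),
    ("LITERAL_NUM", r"\d+(?:\.\d+)?"),
    ("OPERATOR",    r"<=|>=|<>|!=|=|<|>"),
    ("WILDCARD",    r"\*"),
    ("PUNCTUATION", r"[(),;.]"),
    ("WORD",        r"[A-Za-z_][A-Za-z0-9_]*"),
    ("SKIP",        r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(sql: str):
    tokens = []
    for m in MASTER.finditer(sql):
        kind, text = m.lastgroup, m.group()
        if kind == "SKIP":
            continue
        if kind == "WORD":  # classify words as keywords or identifiers
            kind = "KEYWORD" if text.upper() in KEYWORDS else "IDENTIFIER"
        tokens.append((kind, text))
    return tokens

print(tokenize("SELECT name FROM students WHERE marks > 80"))
```

Ordering matters: longer operators like `<=` must appear before `<` in the alternation so the regex engine prefers them.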
Implements a Recursive Descent Parser using the token stream. Builds an Abstract Syntax Tree (AST) as a nested Python dict.
Supports: SELECT, INSERT, UPDATE, DELETE, CREATE TABLE
Features: JOINs, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT, subexpressions, functions
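A miniature, hypothetical version of the recursive-descent shape — one function per grammar rule, consuming the token stream left to right. The real `sql_parser.py` covers the full feature list above; this toy handles only `SELECT cols FROM table WHERE col op num`:

```python
def parse_select(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else ("EOF", "")

    def eat(kind=None):
        nonlocal pos
        k, text = tokens[pos]
        if kind and k != kind:
            raise SyntaxError(f"expected {kind}, got {k}:{text}")
        pos += 1
        return text

    eat("KEYWORD")                       # SELECT
    cols = [eat("IDENTIFIER")]
    while peek() == ("PUNCTUATION", ","):
        eat(); cols.append(eat("IDENTIFIER"))
    eat("KEYWORD")                       # FROM
    ast = {"node": "SELECT",
           "columns": [{"node": "COLUMN",
                        "expr": {"node": "IDENTIFIER", "name": c}} for c in cols],
           "from_clause": {"node": "TABLE_REF", "name": eat("IDENTIFIER")}}
    if peek() == ("KEYWORD", "WHERE"):
        eat("KEYWORD")
        ast["where"] = {"node": "COMPARE",
                        "left": {"node": "IDENTIFIER", "name": eat("IDENTIFIER")},
                        "op": eat("OPERATOR"),
                        "right": {"node": "LITERAL", "value": eat("LITERAL_NUM")}}
    return ast

tokens = [("KEYWORD", "SELECT"), ("IDENTIFIER", "name"), ("KEYWORD", "FROM"),
          ("IDENTIFIER", "students"), ("KEYWORD", "WHERE"),
          ("IDENTIFIER", "marks"), ("OPERATOR", ">"), ("LITERAL_NUM", "80")]
print(parse_select(tokens))
```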
Example AST (simplified):

```json
{
  "node": "SELECT",
  "columns": [{"node": "COLUMN", "expr": {"node": "IDENTIFIER", "name": "name"}}],
  "from_clause": {"node": "TABLE_REF", "name": "students"},
  "where": {"node": "COMPARE", "op": ">", "left": {"node": "IDENTIFIER", "name": "marks"}, "right": {"node": "LITERAL", "value": "80"}}
}
```

The semantic analyzer walks the AST and validates it against the live database schema:
- Table existence check
- Column existence check (per table, with alias resolution)
- Warns about SELECT * usage
- Suggests corrections for errors
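A hypothetical sketch of these checks: recursively walk the AST dict, comparing identifiers against a schema mapping. The real `semantic_analyzer.py` also resolves aliases and suggests corrections, which are omitted here:

```python
# Assumed toy schema; the real analyzer reads the live database schema.
SCHEMA = {"students": ["id", "name", "marks", "course", "city"]}

def analyze(ast, schema=SCHEMA):
    errors, warnings = [], []
    table = ast["from_clause"]["name"]
    if table not in schema:
        errors.append(f"Unknown table '{table}'")
        return errors, warnings
    cols = schema[table]

    def check(node):
        if isinstance(node, dict):
            name = node.get("name")
            if node.get("node") == "IDENTIFIER":
                if name == "*":
                    warnings.append("SELECT * is discouraged; list columns explicitly")
                elif name not in cols:
                    errors.append(f"Unknown column '{name}' in '{table}'")
            for v in node.values():
                check(v)
        elif isinstance(node, list):
            for v in node:
                check(v)

    check(ast)
    return errors, warnings

# Deliberate typo 'markz' to show an error being reported.
ast = {"node": "SELECT",
       "columns": [{"node": "COLUMN", "expr": {"node": "IDENTIFIER", "name": "name"}}],
       "from_clause": {"node": "TABLE_REF", "name": "students"},
       "where": {"node": "COMPARE", "op": ">",
                 "left": {"node": "IDENTIFIER", "name": "markz"},
                 "right": {"node": "LITERAL", "value": "80"}}}
print(analyze(ast))
```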
Rule-Based Optimization (RBO):
- Rule 1: Expand `SELECT *` → explicit columns
- Rule 2: Remove tautologies (`WHERE 1=1`, `WHERE TRUE`)
- Rule 3: Confirm predicate pushdown (before JOIN)
- Rule 4: Warn if no LIMIT on potentially large scans
- Rule 5: Flag `NOT IN` → suggest `NOT EXISTS`
- Rule 6: Rewrite `col = X OR col = Y` → `col IN (X, Y)`
Cost-Based Optimization (CBO):
- Estimates full-scan cost = `row_count × 1.0`
- Estimates index-scan cost = `5 + row_count × 0.1`
- Recommends the cheaper access path
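The cost model above can be sketched in a few lines; the constants (1.0, 0.1, 5) mirror this README's formulas, not SQLite's real planner:

```python
def estimate_costs(row_count: int, has_index: bool):
    # Full scan touches every row; index scan pays a small fixed lookup cost
    # plus a fraction of the rows (per the formulas above).
    full_scan = row_count * 1.0
    index_scan = 5 + row_count * 0.1 if has_index else float("inf")
    chosen = "index scan" if index_scan < full_scan else "full scan"
    return {"full_scan": full_scan, "index_scan": index_scan, "recommend": chosen}

print(estimate_costs(10_000, has_index=True))
# With 10,000 rows: full scan = 10000.0, index scan = 1005.0 → index scan wins
```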
Index Recommender:
- Detects unindexed WHERE / JOIN / ORDER BY columns
- Generates ready-to-run `CREATE INDEX` DDL
Execution Engine:
- Executes SQL on SQLite safely
- Measures wall-clock time (ms)
- Runs `EXPLAIN QUERY PLAN` for each query
- Compares original vs optimized execution side by side
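The timing and plan-capture steps can be sketched like this (hypothetical, using an in-memory database; the real `executor.py` runs against the project's SQLite file):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER, name TEXT, marks REAL)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)",
                 [(1, "Asha", 91.0), (2, "Ravi", 78.5)])

def run_timed(sql: str):
    # Measure wall-clock execution time in milliseconds.
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Ask SQLite how it intends to execute the query.
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return rows, elapsed_ms, plan

rows, ms, plan = run_timed("SELECT name FROM students WHERE marks > 80")
print(rows, f"{ms:.3f} ms", plan)
```

Running the same function on the original and the optimized query gives the side-by-side comparison shown in the UI.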
Two-tier approach:
- OpenAI GPT (if `OPENAI_API_KEY` is set) – high accuracy
- Rule-based fallback – covers 80% of common academic queries

Detects: table name, column names, WHERE conditions, aggregate functions, ORDER BY, LIMIT.
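A hypothetical sketch of the rule-based tier: regex patterns map natural-language phrases to SQL fragments. The real `nl_to_sql.py` matches table and column names against the live schema; this toy version hard-codes one table:

```python
import re

# Phrase → operator mapping (assumed subset of the real rules).
PHRASE_OPS = {"greater than": ">", "less than": "<", "at least": ">=", "above": ">"}

def nl_to_sql(question: str, table: str = "students") -> str:
    q = question.lower()
    where = ""
    for phrase, op in PHRASE_OPS.items():
        m = re.search(rf"(\w+)\s+{phrase}\s+(\d+)", q)
        if m:  # e.g. "marks greater than 80" → "marks > 80"
            where = f" WHERE {m.group(1)} {op} {m.group(2)}"
            break
    select = "COUNT(*)" if q.startswith("count") else "*"
    return f"SELECT {select} FROM {table}{where};"

print(nl_to_sql("find students with marks greater than 80"))
# SELECT * FROM students WHERE marks > 80;
```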
Converts SQL → plain English at two levels:
- Simple: One-sentence summary
- Detailed: Clause-by-clause walkthrough with emojis
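The "Simple" level can be sketched as a small AST walk (hypothetical; `explainer.py`'s detailed level covers every clause, which is omitted here):

```python
def explain_simple(ast: dict) -> str:
    names = [c["expr"]["name"] for c in ast.get("columns", []) if "expr" in c]
    col_text = ", ".join(names) if names and "*" not in names else "all columns"
    text = f"Retrieves {col_text} from '{ast['from_clause']['name']}'"
    if "where" in ast:
        w = ast["where"]
        text += f" where {w['left']['name']} {w['op']} {w['right']['value']}"
    return text + "."

ast = {"node": "SELECT",
       "columns": [{"node": "COLUMN", "expr": {"node": "IDENTIFIER", "name": "name"}}],
       "from_clause": {"node": "TABLE_REF", "name": "students"},
       "where": {"node": "COMPARE", "op": ">",
                 "left": {"node": "IDENTIFIER", "name": "marks"},
                 "right": {"node": "LITERAL", "value": "80"}}}
print(explain_simple(ast))  # Retrieves name from 'students' where marks > 80.
```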
Q1. What is a compiler? How does this project relate to compiler design?
A: A compiler translates source code (high-level) into target code through phases: Lexical Analysis → Syntax Analysis → Semantic Analysis → Optimization → Code Generation. This project implements exactly those phases for SQL: the tokenizer is the lexer, the recursive-descent parser builds an AST (syntax analysis), the semantic analyzer validates meaning, and the optimizer transforms the query. Execution is analogous to code generation/interpretation.

Q2. What is a token? What are the token types in your lexer?
A: A token is the smallest meaningful unit of a language. Our tokenizer recognises: KEYWORD (SELECT, WHERE…), IDENTIFIER (table/column names), OPERATOR (=, >, <…), LITERAL_NUM (numbers), LITERAL_STR (quoted strings), PUNCTUATION, and WILDCARD (*).

Q3. What is an Abstract Syntax Tree (AST)?
A: An AST is a tree representation of the syntactic structure of source code. Unlike a parse tree, it omits unnecessary grammar details. Each node represents a language construct (SELECT node, COMPARE node, LITERAL node, etc.). Our parser produces a JSON-serializable dict tree.
Q4. What parsing technique did you use and why?
A: Recursive Descent Parsing. Each grammar rule maps to a Python method (_parse_select, _parse_condition, etc.). It's easy to understand, implement, and debug. It naturally handles operator precedence through method call hierarchy.
Q5. What is semantic analysis? What does your semantic analyzer check?
A: Semantic analysis checks meaning, not just syntax. Syntactically SELECT xyz FROM abc is valid but semantically wrong if table abc or column xyz don't exist. Our analyzer checks: table existence, column existence per table, alias resolution, and warns about anti-patterns like SELECT *.
Q6. What is predicate pushdown?
A: Moving WHERE filter conditions as close to the data source as possible, ideally before a JOIN. This reduces the number of rows that flow through the JOIN, improving performance. SQLite does this automatically; our optimizer confirms it and documents it.

Q7. What is the difference between Rule-Based and Cost-Based optimization?
A: Rule-Based Optimization (RBO) applies fixed transformation rules regardless of data (e.g., always expand SELECT *). Cost-Based Optimization (CBO) uses statistics about the data (row counts, index availability) to estimate the cost of different execution plans and picks the cheapest. Our project implements both.

Q8. What is an index? Why do indexes speed up queries?
A: An index is a separate data structure (usually a B-tree) that maps column values to row locations. Without an index, a query must scan every row (O(n)). With an index, it can jump directly to matching rows (O(log n)). Our optimizer detects unindexed filter/join columns and generates CREATE INDEX statements.

Q9. What does EXPLAIN QUERY PLAN show?
A: It shows how SQLite intends to execute the query: which scan type (full table scan vs index scan), which index (if any) will be used, and how JOINs will be ordered. We surface this in the UI for both original and optimized queries.

Q10. How does your NL-to-SQL work?
A: Two tiers. The rule-based tier uses regex patterns to detect table names (matched against the live schema), column names, comparison operators in natural language phrases ("greater than" → >), aggregate intents (count, average), and ordering cues. The OpenAI tier sends the schema and question to GPT-3.5 for higher accuracy.
Q11. What grammar does your parser implement?
A: A subset of SQL grammar:

```
statement   → select_stmt | insert_stmt | update_stmt | delete_stmt | create_stmt
select_stmt → SELECT [DISTINCT] column_list FROM table_ref [join]* [where] [group_by] [having] [order_by] [limit]
condition   → comparison { (AND|OR) comparison }*
comparison  → expression operator expression | expression IS [NOT] NULL | expression [NOT] IN (list) | expression BETWEEN e AND e
expression  → literal | identifier | table.column | function(args) | *
```
Q12. What is the role of the symbol table in compilers? What's its equivalent here?
A: In traditional compilers, the symbol table stores variable declarations, types, and scope. In our SQL compiler, get_schema_info() serves as the symbol table — it stores table names, column names, and data types. The semantic analyzer uses it the same way a compiler uses a symbol table: to validate identifiers.
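A hypothetical sketch of building such a symbol table from SQLite's own metadata; the real `get_schema_info()` in `init_db.py` may use a different return shape:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, marks REAL)")

def get_schema_info(conn):
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
    for (table,) in tables:
        # Each PRAGMA row: (cid, name, type, notnull, default_value, pk)
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        schema[table] = {c[1]: c[2] for c in cols}
    return schema

print(get_schema_info(conn))
# {'students': {'id': 'INTEGER', 'name': 'TEXT', 'marks': 'REAL'}}
```

The semantic analyzer can then look up identifiers in this dict exactly the way a compiler consults its symbol table.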
Q13. What optimizations does your optimizer NOT implement that a production optimizer would?
A: Join reordering (choosing which table to scan first), materialized views, partition pruning, query caching, histogram-based selectivity estimation, adaptive query execution, and parallel execution plans.
Q14. Explain the pipeline flow for a Natural Language query.
A:
1. User enters "find students with marks above 80"
2. NL→SQL converts it → `SELECT * FROM students WHERE marks > 80;`
3. Tokenizer → `[SELECT][*][FROM][students][WHERE][marks][>][80]`
4. Parser → AST (SELECT node with WHERE COMPARE node)
5. Semantic Analyzer → validates 'students' exists, 'marks' exists
6. Optimizer → expands SELECT *, warns about missing LIMIT, estimates cost
7. Executor → runs original + optimized, measures time, gets EXPLAIN PLAN
8. Explainer → "Retrieves all columns from 'students' where marks > 80"
9. UI displays all of the above in tabs
Q15. What data structures does your AST use?
A: Nested Python dictionaries. Each node is {"node": "TYPE", key: value, ...}. Lists represent multiple children (e.g., column list, join list). This is serializable to JSON, easy to traverse recursively, and human-readable. A production implementation might use typed dataclasses or a dedicated AST node class hierarchy.
```sql
-- Basic select with filter
SELECT name, marks FROM students WHERE marks > 80 ORDER BY marks DESC;

-- Join query
SELECT s.name, c.course_name FROM students s
JOIN courses c ON s.course = c.course_name WHERE s.marks > 85;

-- Aggregate
SELECT course, COUNT(*) AS count, AVG(marks) AS avg_marks
FROM students GROUP BY course HAVING AVG(marks) > 75;

-- SELECT * (triggers optimization)
SELECT * FROM students;

-- Tautology (triggers optimization)
SELECT name FROM students WHERE 1=1;
```

NL Queries:
- "Show all students"
- "Find students where marks greater than 80"
- "Count students in Computer Science"
- "Average marks of students"
- "Who has the highest marks?"
- "List students from Delhi"
| Member | Module |
|---|---|
| Member 1 | Lexer + Parser + Database |
| Member 2 | Semantic Analyzer + AST Visualization |
| Member 3 | Query Optimizer + Index Recommender |
| Member 4 | NL→SQL + Explainer + FastAPI + Streamlit |
| Concept | Where Used |
|---|---|
| Lexical Analysis | tokenizer.py – regex-based tokenizer |
| Syntax Analysis | sql_parser.py – recursive descent |
| Abstract Syntax Tree | AST dict returned by parser |
| Semantic Analysis | semantic_analyzer.py – schema checks |
| Symbol Table | get_schema_info() in init_db.py |
| Rule-Based Opt. | RuleBasedOptimizer class |
| Cost-Based Opt. | CostBasedOptimizer class |
| Code Generation | executor.py – SQL execution |
| Error Recovery | Errors + suggestions in semantic module |
| Intermediate Repr. | JSON AST (equivalent to IR in compilers) |