# Compiler Design Group Project – B.Tech CSE
```
sql_ai_engine/
│
├── backend/
│   ├── api/
│   │   └── main.py              ← FastAPI REST API (all endpoints)
│   ├── lexer/
│   │   └── tokenizer.py         ← Lexical Analyser (tokenizer)
│   ├── parser/
│   │   └── sql_parser.py        ← Recursive-Descent Parser → AST
│   ├── semantic/
│   │   └── semantic_analyzer.py ← Semantic Validator
│   ├── optimizer/
│   │   └── query_optimizer.py   ← Rule-Based + Cost-Based Optimizer
│   ├── execution/
│   │   └── executor.py          ← SQLite Execution Engine
│   └── nlp/
│       ├── nl_to_sql.py         ← Natural Language → SQL
│       └── explainer.py         ← SQL → Human Explanation
│
├── frontend/
│   └── app.py                   ← Streamlit UI
│
├── database/
│   ├── schema.sql               ← Table definitions + seed data
│   └── init_db.py               ← DB initializer + schema inspector
│
├── utils/
│   └── helpers.py               ← Shared utilities
│
├── tests/
│   └── test_all.py              ← Full test suite (39 tests)
│
├── requirements.txt
└── README.md                    ← This file
```
| Layer | Technology |
|---|---|
| Backend API | FastAPI + Uvicorn |
| Database | SQLite |
| Parser | Hand-written Recursive Descent (Python) |
| NLP | Rule-based + OpenAI API (optional) |
| Frontend | Streamlit |
```bash
pip install -r requirements.txt
python database/init_db.py
uvicorn backend.api.main:app --reload --port 8000
streamlit run frontend/app.py
python tests/test_all.py
```

Open http://localhost:8501 in your browser.
| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Health check |
| GET | `/schema` | Live database schema |
| GET | `/tables/{name}` | Preview table rows |
| POST | `/analyze` | Full pipeline (master endpoint) |
| POST | `/tokenize` | Lexical analysis only |
| POST | `/parse` | Parser → AST |
| POST | `/semantic` | Semantic validation |
| POST | `/optimize` | Query optimizer |
| POST | `/execute` | Execute SQL |
| POST | `/explain` | Human-readable explanation |
| POST | `/nl-to-sql` | Natural Language → SQL |
Converts raw SQL into a stream of typed tokens using regex.
Token types: KEYWORD, IDENTIFIER, OPERATOR, LITERAL_NUM, LITERAL_STR, PUNCTUATION, WILDCARD
Example:

```
Input:  SELECT name FROM students WHERE marks > 80
Tokens: [KEYWORD:SELECT] [IDENTIFIER:name] [KEYWORD:FROM]
        [IDENTIFIER:students] [KEYWORD:WHERE] [IDENTIFIER:marks]
        [OPERATOR:>] [LITERAL_NUM:80]
```
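A minimal sketch of how such a regex tokenizer can work (hypothetical; the real `tokenizer.py` may use different patterns and a larger keyword set):

```python
import re

# Toy keyword set for illustration; the project's lexer covers more of SQL.
KEYWORDS = {"SELECT", "FROM", "WHERE", "INSERT", "UPDATE", "DELETE",
            "CREATE", "TABLE", "AND", "OR", "ORDER", "BY", "GROUP", "LIMIT"}

TOKEN_SPEC = [
    ("LITERAL_STR", r"'[^']*'"),
    ("LITERAL_NUM", r"\d+(?:\.\d+)?"),
    ("OPERATOR",    r"<=|>=|<>|!=|=|<|>"),
    ("WILDCARD",    r"\*"),
    ("PUNCTUATION", r"[(),;.]"),
    ("WORD",        r"[A-Za-z_][A-Za-z0-9_]*"),
    ("SKIP",        r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(sql: str):
    tokens = []
    for m in MASTER.finditer(sql):
        kind, text = m.lastgroup, m.group()
        if kind == "SKIP":
            continue
        if kind == "WORD":  # classify words as keywords or identifiers
            kind = "KEYWORD" if text.upper() in KEYWORDS else "IDENTIFIER"
        tokens.append((kind, text))
    return tokens

print(tokenize("SELECT name FROM students WHERE marks > 80"))
```

Ordering matters: longer operators like `<=` must appear before `<` in the alternation so the regex engine prefers them.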
Implements a Recursive Descent Parser using the token stream. Builds an Abstract Syntax Tree (AST) as a nested Python dict.
Supports: SELECT, INSERT, UPDATE, DELETE, CREATE TABLE
Features: JOINs, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT, subexpressions, functions
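A miniature, hypothetical version of the recursive-descent shape — one function per grammar rule, consuming the token stream left to right. The real `sql_parser.py` covers the full feature list above; this toy handles only `SELECT cols FROM table WHERE col op num`:

```python
def parse_select(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else ("EOF", "")

    def eat(kind=None):
        nonlocal pos
        k, text = tokens[pos]
        if kind and k != kind:
            raise SyntaxError(f"expected {kind}, got {k}:{text}")
        pos += 1
        return text

    eat("KEYWORD")                       # SELECT
    cols = [eat("IDENTIFIER")]
    while peek() == ("PUNCTUATION", ","):
        eat(); cols.append(eat("IDENTIFIER"))
    eat("KEYWORD")                       # FROM
    ast = {"node": "SELECT",
           "columns": [{"node": "COLUMN",
                        "expr": {"node": "IDENTIFIER", "name": c}} for c in cols],
           "from_clause": {"node": "TABLE_REF", "name": eat("IDENTIFIER")}}
    if peek() == ("KEYWORD", "WHERE"):
        eat("KEYWORD")
        ast["where"] = {"node": "COMPARE",
                        "left": {"node": "IDENTIFIER", "name": eat("IDENTIFIER")},
                        "op": eat("OPERATOR"),
                        "right": {"node": "LITERAL", "value": eat("LITERAL_NUM")}}
    return ast

tokens = [("KEYWORD", "SELECT"), ("IDENTIFIER", "name"), ("KEYWORD", "FROM"),
          ("IDENTIFIER", "students"), ("KEYWORD", "WHERE"),
          ("IDENTIFIER", "marks"), ("OPERATOR", ">"), ("LITERAL_NUM", "80")]
print(parse_select(tokens))
```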
Example AST (simplified):

```json
{
  "node": "SELECT",
  "columns": [{"node": "COLUMN", "expr": {"node": "IDENTIFIER", "name": "name"}}],
  "from_clause": {"node": "TABLE_REF", "name": "students"},
  "where": {"node": "COMPARE", "op": ">", "left": {"node": "IDENTIFIER", "name": "marks"}, "right": {"node": "LITERAL", "value": "80"}}
}
```

The semantic analyzer walks the AST and validates it against the live database schema:
- Table existence check
- Column existence check (per table, with alias resolution)
- Warns about SELECT * usage
- Suggests corrections for errors
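A hypothetical sketch of these checks: recursively walk the AST dict, comparing identifiers against a schema mapping. The real `semantic_analyzer.py` also resolves aliases and suggests corrections, which are omitted here:

```python
# Assumed toy schema; the real analyzer reads the live database schema.
SCHEMA = {"students": ["id", "name", "marks", "course", "city"]}

def analyze(ast, schema=SCHEMA):
    errors, warnings = [], []
    table = ast["from_clause"]["name"]
    if table not in schema:
        errors.append(f"Unknown table '{table}'")
        return errors, warnings
    cols = schema[table]

    def check(node):
        if isinstance(node, dict):
            name = node.get("name")
            if node.get("node") == "IDENTIFIER":
                if name == "*":
                    warnings.append("SELECT * is discouraged; list columns explicitly")
                elif name not in cols:
                    errors.append(f"Unknown column '{name}' in '{table}'")
            for v in node.values():
                check(v)
        elif isinstance(node, list):
            for v in node:
                check(v)

    check(ast)
    return errors, warnings

# Deliberate typo 'markz' to show an error being reported.
ast = {"node": "SELECT",
       "columns": [{"node": "COLUMN", "expr": {"node": "IDENTIFIER", "name": "name"}}],
       "from_clause": {"node": "TABLE_REF", "name": "students"},
       "where": {"node": "COMPARE", "op": ">",
                 "left": {"node": "IDENTIFIER", "name": "markz"},
                 "right": {"node": "LITERAL", "value": "80"}}}
print(analyze(ast))
```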
Rule-Based Optimization (RBO):
- Rule 1: Expand `SELECT *` → explicit columns
- Rule 2: Remove tautologies (`WHERE 1=1`, `WHERE TRUE`)
- Rule 3: Confirm predicate pushdown (before JOIN)
- Rule 4: Warn if no LIMIT on potentially large scans
- Rule 5: Flag `NOT IN` → suggest `NOT EXISTS`
- Rule 6: Rewrite `col = X OR col = Y` → `col IN (X, Y)`
Cost-Based Optimization (CBO):
- Estimates full-scan cost = `row_count × 1.0`
- Estimates index-scan cost = `5 + row_count × 0.1`
- Recommends the cheaper access path
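The cost model above can be sketched in a few lines; the constants (1.0, 0.1, 5) mirror this README's formulas, not SQLite's real planner:

```python
def estimate_costs(row_count: int, has_index: bool):
    # Full scan touches every row; index scan pays a small fixed lookup cost
    # plus a fraction of the rows (per the formulas above).
    full_scan = row_count * 1.0
    index_scan = 5 + row_count * 0.1 if has_index else float("inf")
    chosen = "index scan" if index_scan < full_scan else "full scan"
    return {"full_scan": full_scan, "index_scan": index_scan, "recommend": chosen}

print(estimate_costs(10_000, has_index=True))
# With 10,000 rows: full scan = 10000.0, index scan = 1005.0 → index scan wins
```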
Index Recommender:
- Detects unindexed WHERE / JOIN / ORDER BY columns
- Generates ready-to-run `CREATE INDEX` DDL
Execution Engine:
- Executes SQL on SQLite safely
- Measures wall-clock time (ms)
- Runs `EXPLAIN QUERY PLAN` for each query
- Compares original vs optimized execution side by side
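The timing and plan-capture steps can be sketched like this (hypothetical, using an in-memory database; the real `executor.py` runs against the project's SQLite file):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER, name TEXT, marks REAL)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)",
                 [(1, "Asha", 91.0), (2, "Ravi", 78.5)])

def run_timed(sql: str):
    # Measure wall-clock execution time in milliseconds.
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Ask SQLite how it intends to execute the query.
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return rows, elapsed_ms, plan

rows, ms, plan = run_timed("SELECT name FROM students WHERE marks > 80")
print(rows, f"{ms:.3f} ms", plan)
```

Running the same function on the original and the optimized query gives the side-by-side comparison shown in the UI.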
Two-tier approach:
- OpenAI GPT (if `OPENAI_API_KEY` is set) – high accuracy
- Rule-based fallback – covers 80% of common academic queries

Detects: table name, column names, WHERE conditions, aggregate functions, ORDER BY, LIMIT.
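A hypothetical sketch of the rule-based tier: regex patterns map natural-language phrases to SQL fragments. The real `nl_to_sql.py` matches table and column names against the live schema; this toy version hard-codes one table:

```python
import re

# Phrase → operator mapping (assumed subset of the real rules).
PHRASE_OPS = {"greater than": ">", "less than": "<", "at least": ">=", "above": ">"}

def nl_to_sql(question: str, table: str = "students") -> str:
    q = question.lower()
    where = ""
    for phrase, op in PHRASE_OPS.items():
        m = re.search(rf"(\w+)\s+{phrase}\s+(\d+)", q)
        if m:  # e.g. "marks greater than 80" → "marks > 80"
            where = f" WHERE {m.group(1)} {op} {m.group(2)}"
            break
    select = "COUNT(*)" if q.startswith("count") else "*"
    return f"SELECT {select} FROM {table}{where};"

print(nl_to_sql("find students with marks greater than 80"))
# SELECT * FROM students WHERE marks > 80;
```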
Converts SQL → plain English at two levels:
- Simple: One-sentence summary
- Detailed: Clause-by-clause walkthrough with emojis
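The "Simple" level can be sketched as a small AST walk (hypothetical; `explainer.py`'s detailed level covers every clause, which is omitted here):

```python
def explain_simple(ast: dict) -> str:
    names = [c["expr"]["name"] for c in ast.get("columns", []) if "expr" in c]
    col_text = ", ".join(names) if names and "*" not in names else "all columns"
    text = f"Retrieves {col_text} from '{ast['from_clause']['name']}'"
    if "where" in ast:
        w = ast["where"]
        text += f" where {w['left']['name']} {w['op']} {w['right']['value']}"
    return text + "."

ast = {"node": "SELECT",
       "columns": [{"node": "COLUMN", "expr": {"node": "IDENTIFIER", "name": "name"}}],
       "from_clause": {"node": "TABLE_REF", "name": "students"},
       "where": {"node": "COMPARE", "op": ">",
                 "left": {"node": "IDENTIFIER", "name": "marks"},
                 "right": {"node": "LITERAL", "value": "80"}}}
print(explain_simple(ast))  # Retrieves name from 'students' where marks > 80.
```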
Q1. What is a compiler? How does this project relate to compiler design?
A: A compiler translates source code (high-level) into target code through phases: Lexical Analysis → Syntax Analysis → Semantic Analysis → Optimization → Code Generation. This project implements exactly those phases for SQL: the tokenizer is the lexer, the recursive-descent parser builds an AST (syntax analysis), the semantic analyzer validates meaning, and the optimizer transforms the query. Execution is analogous to code generation/interpretation.

Q2. What is a token? What are the token types in your lexer?
A: A token is the smallest meaningful unit of a language. Our tokenizer recognises: KEYWORD (SELECT, WHERE…), IDENTIFIER (table/column names), OPERATOR (=, >, <…), LITERAL_NUM (numbers), LITERAL_STR (quoted strings), PUNCTUATION, and WILDCARD (*).

Q3. What is an Abstract Syntax Tree (AST)?
A: An AST is a tree representation of the syntactic structure of source code. Unlike a parse tree, it omits unnecessary grammar details. Each node represents a language construct (SELECT node, COMPARE node, LITERAL node, etc.). Our parser produces a JSON-serializable dict tree.
Q4. What parsing technique did you use and why?
A: Recursive Descent Parsing. Each grammar rule maps to a Python method (_parse_select, _parse_condition, etc.). It's easy to understand, implement, and debug. It naturally handles operator precedence through method call hierarchy.
Q5. What is semantic analysis? What does your semantic analyzer check?
A: Semantic analysis checks meaning, not just syntax. Syntactically SELECT xyz FROM abc is valid but semantically wrong if table abc or column xyz don't exist. Our analyzer checks: table existence, column existence per table, alias resolution, and warns about anti-patterns like SELECT *.
Q6. What is predicate pushdown?
A: Moving WHERE filter conditions as close to the data source as possible, ideally before a JOIN. This reduces the number of rows that flow through the JOIN, improving performance. SQLite does this automatically; our optimizer confirms it and documents it.

Q7. What is the difference between Rule-Based and Cost-Based optimization?
A: Rule-Based Optimization (RBO) applies fixed transformation rules regardless of data (e.g., always expand SELECT *). Cost-Based Optimization (CBO) uses statistics about the data (row counts, index availability) to estimate the cost of different execution plans and picks the cheapest. Our project implements both.

Q8. What is an index? Why do indexes speed up queries?
A: An index is a separate data structure (usually a B-tree) that maps column values to row locations. Without an index, a query must scan every row (O(n)). With an index, it can jump directly to matching rows (O(log n)). Our optimizer detects unindexed filter/join columns and generates CREATE INDEX statements.

Q9. What does EXPLAIN QUERY PLAN show?
A: It shows how SQLite intends to execute the query: which scan type (full table scan vs index scan), which index (if any) will be used, and how JOINs will be ordered. We surface this in the UI for both original and optimized queries.

Q10. How does your NL-to-SQL work?
A: Two tiers. The rule-based tier uses regex patterns to detect table names (matched against the live schema), column names, comparison operators in natural language phrases ("greater than" → >), aggregate intents (count, average), and ordering cues. The OpenAI tier sends the schema and question to GPT-3.5 for higher accuracy.
Q11. What grammar does your parser implement?
A: A subset of SQL grammar:

```
statement   → select_stmt | insert_stmt | update_stmt | delete_stmt | create_stmt
select_stmt → SELECT [DISTINCT] column_list FROM table_ref [join]* [where] [group_by] [having] [order_by] [limit]
condition   → comparison { (AND|OR) comparison }*
comparison  → expression operator expression | expression IS [NOT] NULL | expression [NOT] IN (list) | expression BETWEEN e AND e
expression  → literal | identifier | table.column | function(args) | *
```
Q12. What is the role of the symbol table in compilers? What's its equivalent here?
A: In traditional compilers, the symbol table stores variable declarations, types, and scope. In our SQL compiler, get_schema_info() serves as the symbol table — it stores table names, column names, and data types. The semantic analyzer uses it the same way a compiler uses a symbol table: to validate identifiers.
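A hypothetical sketch of building such a symbol table from SQLite's own metadata; the real `get_schema_info()` in `init_db.py` may use a different return shape:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, marks REAL)")

def get_schema_info(conn):
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
    for (table,) in tables:
        # Each PRAGMA row: (cid, name, type, notnull, default_value, pk)
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        schema[table] = {c[1]: c[2] for c in cols}
    return schema

print(get_schema_info(conn))
# {'students': {'id': 'INTEGER', 'name': 'TEXT', 'marks': 'REAL'}}
```

The semantic analyzer can then look up identifiers in this dict exactly the way a compiler consults its symbol table.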
Q13. What optimizations does your optimizer NOT implement that a production optimizer would?
A: Join reordering (choosing which table to scan first), materialized views, partition pruning, query caching, histogram-based selectivity estimation, adaptive query execution, and parallel execution plans.
Q14. Explain the pipeline flow for a Natural Language query.
A:
1. User enters "find students with marks above 80"
2. NL→SQL converts it → `SELECT * FROM students WHERE marks > 80;`
3. Tokenizer → `[SELECT][*][FROM][students][WHERE][marks][>][80]`
4. Parser → AST (SELECT node with WHERE COMPARE node)
5. Semantic Analyzer → validates 'students' exists, 'marks' exists
6. Optimizer → expands SELECT *, warns about missing LIMIT, estimates cost
7. Executor → runs original + optimized, measures time, gets EXPLAIN PLAN
8. Explainer → "Retrieves all columns from 'students' where marks > 80"
9. UI displays all of the above in tabs
Q15. What data structures does your AST use?
A: Nested Python dictionaries. Each node is {"node": "TYPE", key: value, ...}. Lists represent multiple children (e.g., column list, join list). This is serializable to JSON, easy to traverse recursively, and human-readable. A production implementation might use typed dataclasses or a dedicated AST node class hierarchy.
```sql
-- Basic select with filter
SELECT name, marks FROM students WHERE marks > 80 ORDER BY marks DESC;

-- Join query
SELECT s.name, c.course_name FROM students s
JOIN courses c ON s.course = c.course_name WHERE s.marks > 85;

-- Aggregate
SELECT course, COUNT(*) AS count, AVG(marks) AS avg_marks
FROM students GROUP BY course HAVING AVG(marks) > 75;

-- SELECT * (triggers optimization)
SELECT * FROM students;

-- Tautology (triggers optimization)
SELECT name FROM students WHERE 1=1;
```

NL Queries:
- "Show all students"
- "Find students where marks greater than 80"
- "Count students in Computer Science"
- "Average marks of students"
- "Who has the highest marks?"
- "List students from Delhi"
| Member | Module |
|---|---|
| Member 1 | Lexer + Parser + Database |
| Member 2 | Semantic Analyzer + AST Visualization |
| Member 3 | Query Optimizer + Index Recommender |
| Member 4 | NL→SQL + Explainer + FastAPI + Streamlit |
| Concept | Where Used |
|---|---|
| Lexical Analysis | tokenizer.py – regex-based tokenizer |
| Syntax Analysis | sql_parser.py – recursive descent |
| Abstract Syntax Tree | AST dict returned by parser |
| Semantic Analysis | semantic_analyzer.py – schema checks |
| Symbol Table | get_schema_info() in init_db.py |
| Rule-Based Opt. | RuleBasedOptimizer class |
| Cost-Based Opt. | CostBasedOptimizer class |
| Code Generation | executor.py – SQL execution |
| Error Recovery | Errors + suggestions in semantic module |
| Intermediate Repr. | JSON AST (equivalent to IR in compilers) |