Live app: https://cs5542-analytics-copilot.streamlit.app
Video demo: https://vimeo.com/1180190697
- Ben Blake (GenAI & Backend Lead) - @ben-blake
- Tina Nguyen (Data & Frontend Lead) - @tinana2k
CS 5542 Big Data Analytics & Applications · University of Missouri–Kansas City
Non-technical users depend on data analysts to write SQL queries, creating bottlenecks that slow decision-making. Analytics Copilot removes that barrier: type a business question in plain English, and the system translates it to Snowflake SQL, executes it, and returns a result table with an auto-selected chart — in about 7 seconds.
The core is a three-agent pipeline running on Snowflake Cortex:
- Schema Linker — semantic RAG (Cortex Search) identifies the relevant tables among the 9, with automatic FK-partner supplementation so joins are never missing
- SQL Generator — sends the filtered schema to `llama3.1-70b` via a 15-rule structured prompt and a few-shot example bank
- Validator — runs `EXPLAIN` on the output and feeds errors back to the generator for up to 3 self-correction retries
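The self-correction loop described above can be sketched in Python. The function names (`link_schema`, `generate_sql`, `validate`) are hypothetical stand-ins for the three agents, not the project's actual API:

```python
def run_pipeline(question, link_schema, generate_sql, validate, max_retries=3):
    """Sketch of the three-agent loop: link the schema once, then generate
    and validate SQL, feeding validation errors back for up to max_retries
    retries. All three callables are illustrative placeholders."""
    schema = link_schema(question)                   # Schema Linker (RAG + FK partners)
    error = None
    for _ in range(max_retries + 1):
        sql = generate_sql(question, schema, error)  # SQL Generator (structured prompt)
        error = validate(sql)                        # Validator: EXPLAIN; None if valid
        if error is None:
            return sql
    raise RuntimeError(f"no valid SQL after {max_retries} retries: {error}")
```

The key design point is that the validator's error message becomes an input to the next generation attempt, so retries are targeted rather than blind resampling.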
A LoRA-adapted CodeLlama-7B (82 Olist-specific training examples, QLoRA 4-bit) provides an offline-deployable alternative inference path.
No setup required — open the live app:
https://cs5542-analytics-copilot.streamlit.app
The app runs in demo mode without Snowflake credentials — all UI components are visible. To run live queries, configure credentials (see Setup).
Example questions to try:
- "What is the average delivery time by customer state?"
- "Which product categories generate the most revenue?"
- "Show me the top 10 sellers by total order volume."
- "What percentage of orders were delivered late in 2017?"
Each response shows the generated SQL (collapsible), a result table, an auto-selected chart, and a Pipeline Trace panel with per-agent timing.
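Chart auto-selection of this kind typically keys off the result's column types. The heuristic below is a hypothetical sketch for illustration, not the actual logic in `viz.py`:

```python
def pick_chart(columns):
    """Choose a chart type from (name, kind) pairs, where kind is one of
    'numeric', 'temporal', or 'categorical'. Illustrative heuristic only."""
    kinds = [kind for _, kind in columns]
    if "temporal" in kinds and "numeric" in kinds:
        return "line"      # a metric over time
    if kinds.count("numeric") >= 2:
        return "scatter"   # two measures against each other
    if "categorical" in kinds and "numeric" in kinds:
        return "bar"       # a metric per category
    return "table"         # fall back to a plain table
```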
```
Kaggle CSVs (9 tables)
  → Snowflake RAW Schema (~100k rows)
  → METADATA Schema (LLM-generated column descriptions + Cortex Search index)
  → Schema Linker Agent (RAG + FK-partner supplementation)
  → SQL Generator Agent (structured prompt + few-shot examples)
  → Validator Agent (EXPLAIN + self-correction loop, up to 3 retries)
  → Streamlit Application (Chat tab + Monitor tab)
```
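The FK-partner supplementation step can be sketched as a one-hop expansion over the schema's foreign-key edges. This is an illustrative reading of the technique, with a hypothetical edge-list representation:

```python
def supplement_fk_partners(selected, fk_edges):
    """Add the direct foreign-key partners of each RAG-selected table so a
    generated JOIN never references a missing table (one-hop expansion).
    fk_edges is a list of (table_a, table_b) FK relationships."""
    expanded = set(selected)
    for a, b in fk_edges:
        if a in selected:
            expanded.add(b)
        if b in selected:
            expanded.add(a)
    return expanded
```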
| Component | Technology |
|---|---|
| Data warehouse | Snowflake (ANALYTICS_COPILOT.RAW) |
| LLM inference | Snowflake Cortex (llama3.1-70b) |
| Semantic search | Snowflake Cortex Search |
| Domain-adapted model | CodeLlama-7B-Instruct + LoRA/QLoRA |
| Application | Streamlit (Streamlit Community Cloud) |
| Orchestration | Python + Snowflake Snowpark |
Brazilian E-Commerce Public Dataset by Olist — Kaggle (CC BY-NC-SA 4.0)
~100,000 orders across 9 relational tables, 2016–2018, ingested into ANALYTICS_COPILOT.RAW:
| Table | Rows | Description |
|---|---|---|
| `ORDERS` | 99,441 | Order lifecycle — anchor table |
| `CUSTOMERS` | 99,441 | Demographics and location |
| `ORDER_ITEMS` | 112,650 | Line items and pricing |
| `ORDER_PAYMENTS` | 103,886 | Payment method and value |
| `ORDER_REVIEWS` | 99,224 | Star ratings and comments |
| `PRODUCTS` | 32,951 | Category and dimensions |
| `SELLERS` | 3,095 | Seller location |
| `GEOLOCATION` | 1,000,163 | Lat/lon by zip code |
| `CATEGORY_TRANSLATION` | 71 | English category names |
A semantic metadata layer (METADATA.TABLE_DESCRIPTIONS) stores LLM-generated column descriptions, synonyms, and sample values for every column, indexed by Cortex Search for retrieval.
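To make the retrieval concrete, one metadata document per column might look like the following. The field names and values are illustrative, not the actual `TABLE_DESCRIPTIONS` layout; the fields are concatenated into the text that Cortex Search indexes:

```python
# Hypothetical shape of one per-column metadata document
doc = {
    "table_name": "ORDERS",
    "column_name": "ORDER_DELIVERED_CUSTOMER_DATE",
    "description": "Timestamp when the order reached the customer",
    "synonyms": ["delivery date", "delivered at", "arrival time"],
    "sample_values": ["2017-05-12 16:04:24", "2018-01-03 09:15:00"],
}

# Concatenate the fields into one searchable string per column
search_text = " | ".join([
    f"{doc['table_name']}.{doc['column_name']}",
    doc["description"],
    ", ".join(doc["synonyms"]),
])
```

A question like "average delivery time by state" can then match on the synonym "delivery date" even though no column is literally named that.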
```
analytics-copilot/
├── src/
│   ├── agents/
│   │   ├── schema_linker.py          # Cortex Search RAG + FK supplementation
│   │   ├── sql_generator.py          # Structured prompt + few-shot examples
│   │   └── validator.py              # EXPLAIN-based self-correction loop
│   ├── utils/
│   │   ├── snowflake_conn.py         # Snowpark connection (RSA / st.secrets)
│   │   ├── config.py                 # Config loader (config.yaml)
│   │   ├── trace.py                  # PipelineTrace: per-agent timing + status
│   │   ├── logger.py                 # Dual file/console structured logging
│   │   └── viz.py                    # Auto-visualization (bar, line, scatter)
│   └── app.py                        # Streamlit app — Chat + Monitor tabs
├── scripts/
│   ├── ingest_data.py                # CSV → Snowflake RAW via Snowpark
│   ├── build_metadata.py             # LLM-generated semantic metadata
│   ├── generate_golden.py            # Golden query benchmark generation
│   ├── evaluate.py                   # 50-query evaluation harness
│   ├── evaluate_adaptation.py        # Baseline vs. LoRA comparison
│   ├── create_instruction_dataset.py # 82-example Alpaca instruction dataset
│   ├── fine_tune.py                  # LoRA/QLoRA fine-tuning script
│   └── api_server.py                 # FastAPI server for fine-tuned model
├── snowflake/
│   ├── 01_setup.sql                  # Database, schemas, warehouse, file format
│   ├── 02_olist_tables.sql           # 9 RAW tables with PK/FK constraints
│   ├── 04_metadata.sql               # TABLE_DESCRIPTIONS table
│   └── 05_cortex_search.sql          # Cortex Search service
├── notebooks/
│   └── fine_tune_colab.ipynb         # LoRA fine-tuning notebook (Colab T4 GPU)
├── data/
│   ├── olist/                        # Olist CSVs (download from Kaggle)
│   ├── golden_queries.json           # 50 golden queries (20 / 20 / 10 easy/med/hard)
│   ├── instruction_dataset.json      # 82-example fine-tuning dataset
│   ├── instruction_train.json        # Training split (74 examples)
│   └── instruction_val.json          # Validation split (8 examples)
├── tests/
│   └── test_smoke.py                 # 13 offline smoke tests (no Snowflake required)
├── reproducibility/
│   ├── reproduce.sh                  # Single-command full pipeline script
│   └── README.md                     # Setup guide and troubleshooting
├── docs/
│   ├── architecture.png              # System architecture diagram
│   ├── architecture.mmd              # Mermaid source
│   └── reports/                      # Phase reports, poster, presentation
├── .streamlit/
│   ├── config.toml                   # Streamlit theme and server config
│   └── secrets.toml.example          # Credential template for Streamlit Cloud
├── config.yaml                       # All runtime parameters (centralized)
├── requirements.txt                  # Pinned Python dependencies
└── .env.example                      # Credential template for local development
```
```bash
git clone https://github.com/ben-blake/analytics-copilot.git
cd analytics-copilot
cp .env.example .env   # fill in Snowflake credentials
bash reproducibility/reproduce.sh
```

| Mode | Command | What it does |
|---|---|---|
| Full pipeline | `reproduce.sh` | Installs deps, ingests data, builds metadata, runs evaluation |
| Offline only | `reproduce.sh --smoke` | Validates imports and config — no Snowflake required |
| Tests only | `reproduce.sh --test` | Runs the 13-test smoke suite |
| Eval only | `reproduce.sh --eval` | Runs the 50-query benchmark against an existing Snowflake setup |
1. Clone and install

```bash
git clone https://github.com/ben-blake/analytics-copilot.git
cd analytics-copilot
python -m venv venv && source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -r requirements.txt
```

2. Configure credentials

```bash
cp .env.example .env
```

Edit `.env` with your Snowflake credentials:

```bash
SNOWFLAKE_ACCOUNT=your_account_identifier
SNOWFLAKE_USER=your_username
SNOWFLAKE_PRIVATE_KEY_PATH=./rsa_key.p8
SNOWFLAKE_ROLE=TRAINING_ROLE
SNOWFLAKE_WAREHOUSE=COPILOT_WH
SNOWFLAKE_DATABASE=ANALYTICS_COPILOT
SNOWFLAKE_SCHEMA=RAW
```

3. Download the dataset

Download the Olist dataset from Kaggle and unzip it into `data/olist/`.

4. Initialize Snowflake

```bash
snowsql -f snowflake/01_setup.sql
snowsql -f snowflake/02_olist_tables.sql
snowsql -f snowflake/04_metadata.sql
snowsql -f snowflake/05_cortex_search.sql
```

5. Ingest data and build metadata

```bash
python scripts/ingest_data.py
python scripts/build_metadata.py
```

6. Launch the application

```bash
streamlit run src/app.py
```

Open http://localhost:8501.
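As a sketch of how the connection layer might fail fast on an incomplete `.env`, the helper below validates the variables from step 2. The function name is hypothetical; the real logic lives in `src/utils/snowflake_conn.py`:

```python
import os

# The Snowflake settings expected in .env (from step 2 above)
REQUIRED = [
    "SNOWFLAKE_ACCOUNT", "SNOWFLAKE_USER", "SNOWFLAKE_PRIVATE_KEY_PATH",
    "SNOWFLAKE_ROLE", "SNOWFLAKE_WAREHOUSE", "SNOWFLAKE_DATABASE",
    "SNOWFLAKE_SCHEMA",
]

def load_credentials(env=None):
    """Collect the Snowflake settings, naming every missing key at once
    instead of failing one variable at a time."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED if not env.get(key)]
    if missing:
        raise KeyError(f"missing Snowflake settings: {missing}")
    return {key: env[key] for key in REQUIRED}
```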
```bash
streamlit run src/app.py
```

- Chat tab — type a natural-language question, receive SQL + result table + auto-chart + Pipeline Trace
- Monitor tab — session-level success rate, average latency, per-query history, agent step breakdown chart
```bash
# 50-query golden benchmark (easy / medium / hard)
python scripts/evaluate.py

# Baseline vs. LoRA fine-tuned model side-by-side
# (requires api_server.py running — see Domain Adaptation)
python scripts/evaluate_adaptation.py
```

Offline smoke tests:

```bash
python -m pytest tests/test_smoke.py -v
```

Validates imports, configuration contracts, and agent initialization entirely offline.
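A common way such harnesses score execution accuracy is an order-insensitive comparison of result sets. The sketch below is one plausible check, not necessarily the exact metric `evaluate.py` implements:

```python
def results_match(expected_rows, actual_rows):
    """Compare two result sets ignoring row order, coercing values to
    strings so the container types returned by the cursor don't matter."""
    def normalize(rows):
        return sorted(tuple(str(value) for value in row) for row in rows)
    return normalize(expected_rows) == normalize(actual_rows)
```

Sorting normalized rows makes the check robust to the nondeterministic row ordering of SQL queries without an `ORDER BY`.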
All runtime parameters are centralized in config.yaml — no magic numbers in code:
```yaml
seed: 42
llm:
  model: llama3.1-70b
  temperature: 0.0
schema_linker:
  limit: 5
sql_generator:
  max_retries: 3
```

| Control | Mechanism |
|---|---|
| Pinned dependencies | `requirements.txt` with `==` versions |
| Deterministic LLM | `temperature: 0.0`, `seed: 42` |
| Config-driven | All parameters in `config.yaml` — no magic numbers in code |
| Structured logging | `logs/pipeline.log` + console output via `tee` |
| Offline validation | 13 smoke tests run with no Snowflake connection |
| Single-command | `reproducibility/reproduce.sh` orchestrates the full pipeline end-to-end |
See reproducibility/README.md for the full audit checklist and troubleshooting guide.
- Open https://cs5542-analytics-copilot.streamlit.app
- The app loads in demo mode without credentials — the UI is fully functional, but live queries require a Snowflake connection
- To enable live queries: add Snowflake credentials via Streamlit Cloud → Settings → Secrets (use `.streamlit/secrets.toml.example` as a template)

Local run:

```bash
streamlit run src/app.py
```

Without a `.env` or `st.secrets`, the app starts with a "Disconnected" banner. The Monitor tab and all UI components still render.
- Type: "Which product categories generate the most revenue?"
- Schema Linker retrieves `ORDER_ITEMS`, `PRODUCTS`, `CATEGORY_TRANSLATION`
- SQL Generator writes a `GROUP BY` + `SUM(PRICE)` query with a 3-table `JOIN`
- Validator runs `EXPLAIN`, confirms the query is valid, executes it
- Results appear as a table + bar chart; expand Pipeline Trace to see per-agent timing
| Phase | Queries | Accuracy | Avg Latency | Notes |
|---|---|---|---|---|
| Project 2 baseline | 50 | ~75% | 5–8 s | Prompt engineering only |
| Lab 7 hardened | 50 | 100% | 17.5 s | FK supplementation + isolation filter |
| Lab 8 LoRA | 15 | 100% | 7.4 s | Domain-adapted CodeLlama-7B |
| Lab 8 Baseline | 15 | 93% | 20.2 s | Zero-shot CodeLlama-7B |
A LoRA-adapted CodeLlama-7B-Instruct model is available as an alternative inference backend.
Start the model server (GPU required; tested on a Colab T4):

```bash
python scripts/api_server.py
```

Fine-tune from scratch (optional — a pre-trained adapter is included in `artifacts/`):

Open `notebooks/fine_tune_colab.ipynb` in Google Colab and run all cells.
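For orientation, a QLoRA setup of this kind typically pairs 4-bit base-model loading with a LoRA adapter configuration along these lines. The hyperparameter values below are illustrative defaults, not the settings used in the notebook:

```python
from peft import LoraConfig

# Illustrative LoRA adapter hyperparameters (assumed values; see the
# Colab notebook for the project's actual configuration)
lora_config = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```

Training only these low-rank adapters (rather than all 7B parameters) is what makes fine-tuning feasible on a single T4 GPU.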
The fine-tuned model vs. zero-shot baseline:
| Metric | Baseline | LoRA Fine-tuned |
|---|---|---|
| Valid SQL generated | 93% | 100% |
| Qualified table names | 87% | 100% |
| Avg inference latency | 20.2 s | 7.4 s |
The application is deployed on Streamlit Community Cloud.
To deploy your own instance:
- Fork the repository and connect it to Streamlit Cloud
- Copy `.streamlit/secrets.toml.example` → `.streamlit/secrets.toml` and fill in credentials
- Add secrets in the Streamlit Cloud dashboard: Settings → Secrets
The app starts in demo mode if no credentials are configured, rather than crashing.
- X-SQL: Expert Schema Linking and Understanding of Text-to-SQL with Multi-LLMs (NeurIPS 2025) — arXiv:2509.05899 — validates multi-agent decomposition; FK-partner supplementation and structured error feedback adopted directly
- Chatting With Your Data: LLM-Enabled Data Transformation for Enterprise Text to SQL (NeurIPS 2025) — motivates the Semantic Metadata Layer approach
- OmniSQL: Synthesizing High-quality Text-to-SQL Data at Scale (NeurIPS 2025) — arXiv:2503.02240 — informs synthetic golden query generation
- Anthropic Claude Code (`claude-opus-4-6`) — code generation, pipeline debugging, `reproduce.sh` scaffolding, instruction dataset drafting. All generated code was reviewed and validated by the team.
