DataAnalyst Auto 🤖📊

Automated data analysis system powered by fine-tuned LLMs, implementing a Natural Language → SQL → Visualization pipeline

🎯 Overview

DataAnalyst Auto is an end-to-end system that automates data analysis workflows using fine-tuned Large Language Models (LLMs). Three specialized models work together: the first translates a natural language question into a SQL query, which is then executed; the second generates visualization code from the query results; and the third writes natural language commentary on those results.

Key Features

🗣️ Natural Language to SQL: Convert questions in plain English to SQL queries
📈 SQL to Visualization: Automatically generate Python visualization code from SQL results
💬 Table to Commentary: Generate insightful natural language summaries of data results
🎓 Fine-tuned Models: Specialized Qwen 2.5 3B models trained on domain-specific data
🐳 Containerized Deployment: Ready-to-deploy Docker setup with multi-LoRA serving
☁️ Cloud-Ready: Optimized for deployment on Google Cloud Run with GPU support

🏗️ Architecture

The system implements a three-stage pipeline:

Natural Language Question
         ↓
    [NL → SQL Model]
         ↓
    SQL Query Execution
         ↓
    [SQL → Visualization Model]
         ↓
    Visualization Code + Plot
         ↓
    [Table → Commentary Model]
         ↓
    Natural Language Insights
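
The stages above can be sketched as a chain of plain Python callables. This is only an illustration of the data flow, not the repository's implementation: run_pipeline, PipelineResult, and the stub lambdas are hypothetical names, and the table name transactions is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Rows = List[Tuple]


@dataclass
class PipelineResult:
    sql: str
    rows: Rows
    plot_code: str
    commentary: str


def run_pipeline(
    question: str,
    nl_to_sql: Callable[[str], str],
    execute_sql: Callable[[str], Rows],
    sql_to_vis: Callable[[str, Rows], str],
    table_to_commentary: Callable[[Rows], str],
) -> PipelineResult:
    """Chain the three model stages around a SQL execution step."""
    sql = nl_to_sql(question)                # stage 1: NL -> SQL
    rows = execute_sql(sql)                  # run the query against the data
    plot_code = sql_to_vis(sql, rows)        # stage 2: SQL + table -> plot code
    commentary = table_to_commentary(rows)   # stage 3: table -> commentary
    return PipelineResult(sql, rows, plot_code, commentary)


# Stubs stand in for the fine-tuned models and the database (hypothetical).
result = run_pipeline(
    "What is the total revenue per material?",
    nl_to_sql=lambda q: (
        "SELECT Material, SUM(TransactionAmount) "
        "FROM transactions GROUP BY Material"
    ),
    execute_sql=lambda sql: [("Clay", 120.0), ("Porcelain", 80.0)],
    sql_to_vis=lambda sql, rows: "import matplotlib.pyplot as plt",
    table_to_commentary=lambda rows: f"{len(rows)} materials returned.",
)
print(result.commentary)
```

Passing each stage in as a callable keeps the sketch independent of how the models are served, whether through the HuggingFace Inference API or a local vLLM container.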

📦 Project Structure

dataanalyst_auto/
├── src/                               # Source code
│   ├── data_prep/                     # Data preparation modules
│   │   ├── nlsql_data_prep.py         # NL→SQL dataset preparation
│   │   ├── sqlvis_data_prep.py        # SQL→Vis dataset preparation
│   │   └── tabletocomm_data_prep.py   # Table→Commentary dataset preparation
│   ├── training/                      # Training modules
│   │   ├── nlsql_sft.py               # NL→SQL supervised fine-tuning
│   │   ├── sqlvis_sft.py              # SQL→Vis supervised fine-tuning
│   │   ├── tabletocomm_sft.py         # Table→Commentary supervised fine-tuning
│   │   └── sql_table_commentary.py    # Alternative commentary training
│   └── inference/                     # Inference & execution modules
│       ├── run_qwen_inference.py      # Run inference with HuggingFace API
│       ├── execute_queries_and_plot.py # Execute SQL and generate plots
│       └── plot_functions.py          # Visualization utilities
│
├── data/                              # Data files
│   ├── raw/                           # Raw CSV datasets
│   │   ├── pottery_transactions.csv   # Sample pottery sales dataset
│   │   └── linked_nl_sql_visualizations.csv
│   └── processed/                     # Training & evaluation data
│       ├── nl_to_sql_training_data.json
│       ├── nl_to_sql_eval_data.json
│       ├── sql_to_vis_training_data.json
│       ├── sql_to_vis_eval_data.json
│       ├── table_to_commentary_training_data.json
│       ├── table_to_commentary_eval_data.json
│       └── sql_table_commentary_training_data.json
│
├── outputs/                           # Generated outputs
│   ├── commentaries/                  # 60 generated table commentaries
│   └── visualizations/                # 39 generated plots
│
├── Dockerfile                         # Multi-LoRA vLLM deployment
├── pyproject.toml                     # Project dependencies & config
├── README.md                          # This file
├── INFERENCE_USAGE.md                 # Inference guide
└── .gitignore                         # Git ignore rules

🚀 Quick Start

Prerequisites

Python 3.9 or higher
uv package manager (recommended) or pip
HuggingFace account (for inference API access)

Installation

# Clone the repository
git clone https://github.com/axeld5/dataanalyst_auto.git
cd dataanalyst_auto

# Install dependencies with uv (recommended)
uv sync

# Or with pip
pip install -e .

Environment Setup

Create a .env file in the project root:

HF_TOKEN=your_huggingface_token_here

Get your free HuggingFace token at: https://huggingface.co/settings/tokens
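
As a rough sketch of how a script could pick up this token, the snippet below parses KEY=VALUE lines using only the standard library. The helper names parse_env_text and load_env_file are hypothetical; the repository's scripts may load the file differently (for example with a dotenv package).

```python
import os


def parse_env_text(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path: str = ".env") -> None:
    """Copy values from a .env file into os.environ without overwriting."""
    if not os.path.exists(path):
        return
    with open(path, encoding="utf-8") as fh:
        for key, value in parse_env_text(fh.read()).items():
            os.environ.setdefault(key, value)


# Demonstration on an in-memory string rather than a real file.
parsed = parse_env_text("# HuggingFace credentials\nHF_TOKEN=your_huggingface_token_here\n")
print(parsed["HF_TOKEN"])
```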

📖 Usage

1. Quick Inference Test

Test the fine-tuned models using the HuggingFace Inference API:

# Test NL→SQL conversion
python src/inference/run_qwen_inference.py --prefix nl_to_sql --file eval --max_samples 5

# Test SQL→Visualization
python src/inference/run_qwen_inference.py --prefix sql_to_vis --file eval --max_samples 5

# Test Table→Commentary
python src/inference/run_qwen_inference.py --prefix table_to_commentary --file eval --max_samples 5

For detailed inference options, see INFERENCE_USAGE.md.

2. Data Preparation

Prepare training datasets from your own data:

# Prepare NL→SQL dataset
python src/data_prep/nlsql_data_prep.py

# Prepare SQL→Visualization dataset
python src/data_prep/sqlvis_data_prep.py

# Prepare Table→Commentary dataset
python src/data_prep/tabletocomm_data_prep.py

3. Fine-tuning Models

Fine-tune the models on your prepared datasets:

# Fine-tune NL→SQL model
python src/training/nlsql_sft.py

# Fine-tune SQL→Visualization model
python src/training/sqlvis_sft.py

# Fine-tune Table→Commentary model
python src/training/tabletocomm_sft.py

Each script saves its model in the corresponding *_sft/merged/ directory.

4. Execute Full Pipeline

Run SQL queries and generate visualizations:

python src/inference/execute_queries_and_plot.py

🎯 Example Dataset

The project includes data/raw/pottery_transactions.csv, a sample dataset containing:

ID: Transaction identifier
Date: Transaction date
Material: Pottery material (Clay, Porcelain, Ceramic)
Container: Type of pottery item (Mug, Bowl, Vase, etc.)
Size: Item size (Small, Medium, Large)
TransactionAmount: Transaction amount in USD
Sex: Customer gender (M/F)
Age: Customer age

This dataset demonstrates the system's capabilities across various analytical tasks including aggregations, time series analysis, segmentation, and complex multi-dimensional queries.
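
To make the column layout concrete, here is a minimal sketch that loads rows with these columns into an in-memory SQLite table and runs one aggregation. The sample rows, the date format, and the table name transactions are invented for illustration; they are not taken from the real CSV or from the repository's code.

```python
import csv
import io
import sqlite3

# Invented rows for illustration only; they are not from the real dataset.
SAMPLE_CSV = """ID,Date,Material,Container,Size,TransactionAmount,Sex,Age
1,2024-01-05,Clay,Mug,Small,12.50,F,34
2,2024-01-06,Porcelain,Vase,Large,80.00,M,51
3,2024-01-06,Clay,Bowl,Medium,20.00,F,27
"""


def load_transactions(csv_file) -> sqlite3.Connection:
    """Load CSV rows into an in-memory SQLite table named 'transactions'."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions ("
        "ID INTEGER, Date TEXT, Material TEXT, Container TEXT, "
        "Size TEXT, TransactionAmount REAL, Sex TEXT, Age INTEGER)"
    )
    rows = [
        (int(r["ID"]), r["Date"], r["Material"], r["Container"], r["Size"],
         float(r["TransactionAmount"]), r["Sex"], int(r["Age"]))
        for r in csv.DictReader(csv_file)
    ]
    conn.executemany(
        "INSERT INTO transactions VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows
    )
    conn.commit()
    return conn


conn = load_transactions(io.StringIO(SAMPLE_CSV))
totals = conn.execute(
    "SELECT Material, SUM(TransactionAmount) FROM transactions "
    "GROUP BY Material ORDER BY Material"
).fetchall()
print(totals)
```

To try this on the real file, open data/raw/pottery_transactions.csv with newline="" and pass the file object to load_transactions, assuming its header row matches the column names listed above.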

🤖 Pre-trained Models

Fine-tuned models are available on HuggingFace:

NL→SQL: axel-darmouni/qwen-2.5-3b-it-nltosql-sft
SQL→Vis: axel-darmouni/qwen-2.5-3b-it-sqltovis-sft
Table→Commentary: axel-darmouni/qwen-2.5-3b-it-tabletocomm-sft

All models are based on Qwen 2.5 3B Instruct and optimized for low-resource deployment.

🐳 Docker Deployment

Deploy the system with multi-LoRA serving using vLLM:

# Build the Docker image
docker build --build-arg HF_TOKEN=$HF_TOKEN -t dataanalyst-auto:latest .

# Run locally
docker run -p 8080:8080 --gpus all dataanalyst-auto:latest

The container serves all three models simultaneously using LoRA adapters, enabling efficient multi-task inference on a single GPU.
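
Once the container is running, a client could select an adapter per request. The sketch below assumes the image exposes vLLM's OpenAI-compatible chat endpoint on port 8080 and that an adapter is registered under the name nl_to_sql; that adapter name is hypothetical, so check the Dockerfile for the names actually registered.

```python
import json
import urllib.request


def build_chat_request(adapter: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build a chat-completions payload that targets one LoRA adapter."""
    return {
        "model": adapter,  # with multi-LoRA serving, the adapter name goes here
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }


def query_server(payload: dict, base_url: str = "http://localhost:8080") -> str:
    """POST the payload to the server and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# "nl_to_sql" is a placeholder adapter name, not confirmed by the repository.
payload = build_chat_request("nl_to_sql", "What is the total revenue per material?")
print(json.dumps(payload, indent=2))
# With the container running: print(query_server(payload))
```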

Google Cloud Run Deployment

# Configure your project
PROJECT=your-project-id
REGION=europe-west1
REPO=dataanalyst-auto
IMAGE=$REGION-docker.pkg.dev/$PROJECT/$REPO/vllm-multilora:latest

# Create artifact repository
gcloud artifacts repositories create $REPO \
  --repository-format=docker \
  --location=$REGION

# Build and push
docker build --build-arg HF_TOKEN=$HF_TOKEN -t $IMAGE .
docker push $IMAGE

# Deploy to Cloud Run with GPU
gcloud run deploy dataanalyst-auto \
  --image $IMAGE \
  --region $REGION \
  --gpu 1 \
  --cpu 8 \
  --memory 24Gi \
  --allow-unauthenticated \
  --execution-environment gen2

📊 Sample Outputs

The repository includes 60 generated table commentaries in outputs/commentaries/ and 39 visualizations in outputs/visualizations/, demonstrating the system's capabilities across various analytical scenarios.

🔧 Development

Running Tests

# Install dev dependencies
uv sync --all-extras

# Run tests
pytest

# Run with coverage
pytest --cov=. --cov-report=html

Code Formatting

# Format code
black .

# Sort imports
isort .

# Type checking
mypy .

📝 Training Data Format

NL→SQL Format

{
  "messages": [
    {
      "role": "user",
      "content": "System prompt + natural language question"
    },
    {
      "role": "assistant",
      "content": "```sql\nSELECT ...\n```"
    }
  ]
}
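
A short sketch of how a record in this shape could be built, and how the SQL could be pulled back out of a fenced reply. The helpers make_nl_to_sql_record and extract_sql are hypothetical, and joining the system prompt and question with a blank line is an assumption about the prompt layout.

```python
import re

# Built programmatically so this example contains no literal code fence.
FENCE = "`" * 3
SQL_BLOCK = re.compile(FENCE + r"sql\s*\n(.*?)\n?" + FENCE, re.DOTALL)


def make_nl_to_sql_record(system_prompt: str, question: str, sql: str) -> dict:
    """Build one training record in the chat 'messages' format."""
    return {
        "messages": [
            {"role": "user", "content": f"{system_prompt}\n\n{question}"},
            {"role": "assistant", "content": f"{FENCE}sql\n{sql}\n{FENCE}"},
        ]
    }


def extract_sql(reply: str) -> str:
    """Return the SQL inside a fenced sql block, or the stripped reply if none."""
    match = SQL_BLOCK.search(reply)
    return match.group(1).strip() if match else reply.strip()


record = make_nl_to_sql_record(
    "You translate questions into SQL.",
    "How many transactions are there?",
    "SELECT COUNT(*) FROM transactions",
)
print(extract_sql(record["messages"][1]["content"]))
```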

SQL→Visualization Format

{
  "messages": [
    {
      "role": "user",
      "content": "System prompt + SQL query + table results"
    },
    {
      "role": "assistant",
      "content": "```python\nimport matplotlib.pyplot as plt\n...\n```"
    }
  ]
}

Table→Commentary Format

{
  "messages": [
    {
      "role": "user",
      "content": "System prompt + table data"
    },
    {
      "role": "assistant",
      "content": "Natural language analysis and insights"
    }
  ]
}
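
All three formats share the same two-message shape, so a single check can catch malformed records before training. The validator below is a minimal sketch; validate_record is a hypothetical helper, not part of the repository.

```python
def validate_record(record: dict) -> list:
    """Return a list of problems found in one record (empty if it is valid)."""
    messages = record.get("messages")
    if not isinstance(messages, list) or len(messages) != 2:
        return ["'messages' must be a list with exactly two entries"]

    errors = []
    for index, (message, expected_role) in enumerate(
        zip(messages, ("user", "assistant"))
    ):
        if not isinstance(message, dict):
            errors.append(f"message {index} is not an object")
            continue
        if message.get("role") != expected_role:
            errors.append(f"message {index} should have role '{expected_role}'")
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            errors.append(f"message {index} has empty or non-string content")
    return errors


good = {
    "messages": [
        {"role": "user", "content": "System prompt + table data"},
        {"role": "assistant", "content": "Natural language analysis and insights"},
    ]
}
bad = {"messages": [{"role": "assistant", "content": ""}, {"role": "user", "content": "x"}]}

print(validate_record(good))
print(validate_record(bad))
```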

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built with Qwen 2.5 3B Instruct by Alibaba Cloud
Fine-tuning powered by Unsloth
Inference serving with vLLM
Training framework: TRL

📧 Contact

Axel Darmouni - axeldarmouni@gmail.com

Project Link: https://github.com/axeld5/dataanalyst_auto

Note: This project demonstrates an end-to-end approach to automated data analysis. The models are optimized for the pottery transactions dataset but can be adapted to other domains by preparing custom training data.
