DataAnalyst Auto 🤖📊

An automated data analysis system powered by fine-tuned LLMs, built around a Natural Language → SQL → Visualization pipeline.

Python 3.9+ | License: MIT

🎯 Overview

DataAnalyst Auto is an end-to-end system that automates data analysis workflows using fine-tuned Large Language Models (LLMs). Three specialized models work together: the first translates a natural language question into a SQL query, the second generates visualization code from the query and its results, and the third writes natural language commentary on the resulting table. The SQL query is executed between the first and second stages.

Key Features

  • 🗣️ Natural Language to SQL: Convert questions in plain English to SQL queries
  • 📈 SQL to Visualization: Automatically generate Python visualization code from SQL results
  • 💬 Table to Commentary: Generate insightful natural language summaries of data results
  • 🎓 Fine-tuned Models: Specialized Qwen 2.5 3B models trained on domain-specific data
  • 🐳 Containerized Deployment: Ready-to-deploy Docker setup with multi-LoRA serving
  • ☁️ Cloud-Ready: Optimized for deployment on Google Cloud Run with GPU support

🏗️ Architecture

The system implements a three-stage pipeline:

Natural Language Question
         ↓
    [NL → SQL Model]
         ↓
    SQL Query Execution
         ↓
    [SQL → Visualization Model]
         ↓
    Visualization Code + Plot
         ↓
    [Table → Commentary Model]
         ↓
    Natural Language Insights
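
The flow above can be sketched as a chain of three callables around a SQL execution step. This is a minimal illustration, not the repository's implementation: `run_pipeline`, the `fake_*` stand-ins, and the `pottery` table name are all hypothetical, and the stand-ins replace the fine-tuned models so the sketch runs without any LLM.

```python
import sqlite3
from typing import Callable, List, Tuple

Rows = List[Tuple]


def run_pipeline(
    question: str,
    conn: sqlite3.Connection,
    nl_to_sql: Callable[[str], str],
    sql_to_vis: Callable[[str, Rows], str],
    table_to_commentary: Callable[[Rows], str],
) -> dict:
    """Chain the three stages; each model is passed in as a plain callable."""
    sql = nl_to_sql(question)               # stage 1: question -> SQL
    rows = conn.execute(sql).fetchall()     # execute the generated SQL
    plot_code = sql_to_vis(sql, rows)       # stage 2: SQL + rows -> plotting code
    commentary = table_to_commentary(rows)  # stage 3: rows -> commentary
    return {"sql": sql, "rows": rows, "plot_code": plot_code, "commentary": commentary}


# Hypothetical stand-ins for the three fine-tuned models.
def fake_nl_to_sql(question: str) -> str:
    return (
        "SELECT Material, SUM(TransactionAmount) FROM pottery "
        "GROUP BY Material ORDER BY Material"
    )


def fake_sql_to_vis(sql: str, rows: Rows) -> str:
    labels = [r[0] for r in rows]
    values = [r[1] for r in rows]
    return f"import matplotlib.pyplot as plt\nplt.bar({labels!r}, {values!r})\nplt.show()"


def fake_commentary(rows: Rows) -> str:
    top = max(rows, key=lambda r: r[1])
    return f"{top[0]} has the highest total ({top[1]:.2f})."


# Tiny in-memory table with invented rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pottery (Material TEXT, TransactionAmount REAL)")
conn.executemany(
    "INSERT INTO pottery VALUES (?, ?)",
    [("Clay", 20.0), ("Porcelain", 55.5), ("Clay", 12.5)],
)

result = run_pipeline(
    "Total sales per material?", conn, fake_nl_to_sql, fake_sql_to_vis, fake_commentary
)
print(result["commentary"])
```

Passing the models in as callables keeps the execution step independent of how each model is served (HuggingFace API, local weights, or a vLLM endpoint).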

📦 Project Structure

dataanalyst_auto/
├── src/                               # Source code
│   ├── data_prep/                     # Data preparation modules
│   │   ├── nlsql_data_prep.py         # NL→SQL dataset preparation
│   │   ├── sqlvis_data_prep.py        # SQL→Vis dataset preparation
│   │   └── tabletocomm_data_prep.py   # Table→Commentary dataset preparation
│   ├── training/                      # Training modules
│   │   ├── nlsql_sft.py               # NL→SQL supervised fine-tuning
│   │   ├── sqlvis_sft.py              # SQL→Vis supervised fine-tuning
│   │   ├── tabletocomm_sft.py         # Table→Commentary supervised fine-tuning
│   │   └── sql_table_commentary.py    # Alternative commentary training
│   └── inference/                     # Inference & execution modules
│       ├── run_qwen_inference.py      # Run inference with HuggingFace API
│       ├── execute_queries_and_plot.py # Execute SQL and generate plots
│       └── plot_functions.py          # Visualization utilities
│
├── data/                              # Data files
│   ├── raw/                           # Raw CSV datasets
│   │   ├── pottery_transactions.csv   # Sample pottery sales dataset
│   │   └── linked_nl_sql_visualizations.csv
│   └── processed/                     # Training & evaluation data
│       ├── nl_to_sql_training_data.json
│       ├── nl_to_sql_eval_data.json
│       ├── sql_to_vis_training_data.json
│       ├── sql_to_vis_eval_data.json
│       ├── table_to_commentary_training_data.json
│       ├── table_to_commentary_eval_data.json
│       └── sql_table_commentary_training_data.json
│
├── outputs/                           # Generated outputs
│   ├── commentaries/                  # 60 generated table commentaries
│   └── visualizations/                # 39 generated plots
│
├── Dockerfile                         # Multi-LoRA vLLM deployment
├── pyproject.toml                     # Project dependencies & config
├── README.md                          # This file
├── INFERENCE_USAGE.md                 # Inference guide
└── .gitignore                         # Git ignore rules

🚀 Quick Start

Prerequisites

  • Python 3.9 or higher
  • uv package manager (recommended) or pip
  • HuggingFace account (for inference API access)

Installation

# Clone the repository
git clone https://github.com/axeld5/dataanalyst_auto.git
cd dataanalyst_auto

# Install dependencies with uv (recommended)
uv sync

# Or with pip
pip install -e .

Environment Setup

Create a .env file in the project root:

HF_TOKEN=your_huggingface_token_here

Get your free HuggingFace token at: https://huggingface.co/settings/tokens
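
For scripts of your own that need the token, one possible way to read it is sketched below using only the standard library. The helper names `load_env_file` and `get_hf_token` are hypothetical; how the repository's scripts actually load `.env` is not shown here.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse KEY=VALUE lines from a .env file, skipping blanks and comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_hf_token(path: str = ".env") -> str:
    """Prefer an exported HF_TOKEN, then fall back to the .env file."""
    token = os.environ.get("HF_TOKEN") or load_env_file(path).get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; add it to .env or export it")
    return token
```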

📖 Usage

1. Quick Inference Test

Test the fine-tuned models using the HuggingFace Inference API:

# Test NL→SQL conversion
python src/inference/run_qwen_inference.py --prefix nl_to_sql --file eval --max_samples 5

# Test SQL→Visualization
python src/inference/run_qwen_inference.py --prefix sql_to_vis --file eval --max_samples 5

# Test Table→Commentary
python src/inference/run_qwen_inference.py --prefix table_to_commentary --file eval --max_samples 5

For detailed inference options, see INFERENCE_USAGE.md.
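
To inspect the same evaluation samples by hand, a sketch like the following may help. It assumes that `--prefix` and `--file` map to `data/processed/<prefix>_<file>_data.json` (which matches the file names listed in the project structure) and that each file holds a JSON array of chat-style records; `load_samples` and `split_record` are hypothetical helpers, not functions from the repository.

```python
import json
from pathlib import Path


def load_samples(prefix, file="eval", max_samples=5, data_dir="data/processed"):
    """Load the first max_samples records, e.g. nl_to_sql_eval_data.json."""
    path = Path(data_dir) / f"{prefix}_{file}_data.json"
    with path.open(encoding="utf-8") as fh:
        records = json.load(fh)  # assumed: a JSON array of records
    return records[:max_samples]


def split_record(record):
    """Return (prompt, reference) from a {"messages": [...]} record."""
    messages = record["messages"]
    prompt = next(m["content"] for m in messages if m["role"] == "user")
    reference = next(m["content"] for m in messages if m["role"] == "assistant")
    return prompt, reference


if __name__ == "__main__":
    for record in load_samples("nl_to_sql", "eval", max_samples=5):
        prompt, reference = split_record(record)
        print(prompt[:80], "->", reference[:80])
```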

2. Data Preparation

Prepare training datasets from your own data:

# Prepare NL→SQL dataset
python src/data_prep/nlsql_data_prep.py

# Prepare SQL→Visualization dataset
python src/data_prep/sqlvis_data_prep.py

# Prepare Table→Commentary dataset
python src/data_prep/tabletocomm_data_prep.py

3. Fine-tuning Models

Fine-tune the models on your prepared datasets:

# Fine-tune NL→SQL model
python src/training/nlsql_sft.py

# Fine-tune SQL→Visualization model
python src/training/sqlvis_sft.py

# Fine-tune Table→Commentary model
python src/training/tabletocomm_sft.py

Models will be saved in *_sft/merged/ directories.

4. Execute Full Pipeline

Run SQL queries and generate visualizations:

python src/inference/execute_queries_and_plot.py

🎯 Example Dataset

The project includes data/raw/pottery_transactions.csv, a sample dataset containing:

  • ID: Transaction identifier
  • Date: Transaction date
  • Material: Pottery material (Clay, Porcelain, Ceramic)
  • Container: Type of pottery item (Mug, Bowl, Vase, etc.)
  • Size: Item size (Small, Medium, Large)
  • TransactionAmount: Transaction amount in USD
  • Sex: Customer gender (M/F)
  • Age: Customer age

This dataset demonstrates the system's capabilities across various analytical tasks including aggregations, time series analysis, segmentation, and complex multi-dimensional queries.
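
As an illustration of the kind of query this schema supports, the sketch below loads a few rows into an in-memory SQLite database and computes monthly totals. The four rows are invented for the example and are not taken from `pottery_transactions.csv`; the `transactions` table name and the ISO date format are also assumptions.

```python
import csv
import io
import sqlite3

# Invented rows that follow the column list above (not real data).
SAMPLE_CSV = """ID,Date,Material,Container,Size,TransactionAmount,Sex,Age
1,2024-01-05,Clay,Mug,Small,18.00,F,34
2,2024-01-09,Porcelain,Vase,Large,120.00,M,51
3,2024-02-02,Ceramic,Bowl,Medium,42.50,F,27
4,2024-02-11,Clay,Bowl,Medium,30.00,M,45
"""


def load_transactions(csv_text: str, conn: sqlite3.Connection) -> int:
    """Create the table and insert every CSV row; return the row count."""
    conn.execute(
        "CREATE TABLE transactions ("
        "ID INTEGER, Date TEXT, Material TEXT, Container TEXT, "
        "Size TEXT, TransactionAmount REAL, Sex TEXT, Age INTEGER)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        (int(r["ID"]), r["Date"], r["Material"], r["Container"],
         r["Size"], float(r["TransactionAmount"]), r["Sex"], int(r["Age"]))
        for r in reader
    ]
    conn.executemany("INSERT INTO transactions VALUES (?,?,?,?,?,?,?,?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
row_count = load_transactions(SAMPLE_CSV, conn)

# Time series aggregation: total transaction amount per month.
monthly = conn.execute(
    "SELECT substr(Date, 1, 7) AS month, SUM(TransactionAmount) "
    "FROM transactions GROUP BY month ORDER BY month"
).fetchall()
print(monthly)
```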

🤖 Pre-trained Models

Fine-tuned models are available on HuggingFace. All models are based on Qwen 2.5 3B Instruct and optimized for low-resource deployment.

🐳 Docker Deployment

Deploy the system with multi-LoRA serving using vLLM:

# Build the Docker image
docker build --build-arg HF_TOKEN=$HF_TOKEN -t dataanalyst-auto:latest .

# Run locally
docker run -p 8080:8080 --gpus all dataanalyst-auto:latest

The container serves all three models simultaneously using LoRA adapters, enabling efficient multi-task inference on a single GPU.
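
Once the container is running, a client could select a task by naming the corresponding adapter in the request. The sketch below assumes the container exposes vLLM's OpenAI-compatible chat endpoint on port 8080 and that the adapters are registered under names such as `nl_to_sql`; both the adapter name and the helper functions are hypothetical, so check the Dockerfile for the names actually used.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # port taken from the docker run command above


def build_chat_request(adapter: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat payload; 'model' selects the LoRA adapter."""
    return {
        "model": adapter,  # hypothetical adapter name, e.g. "nl_to_sql"
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }


def chat(adapter: str, prompt: str, base_url: str = BASE_URL) -> str:
    """POST the payload and return the text of the first choice."""
    payload = json.dumps(build_chat_request(adapter, prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Requires the container to be running locally.
    print(chat("nl_to_sql", "What is the total transaction amount per material?"))
```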

Google Cloud Run Deployment

# Configure your project
PROJECT=your-project-id
REGION=europe-west1
REPO=dataanalyst-auto
IMAGE=$REGION-docker.pkg.dev/$PROJECT/$REPO/vllm-multilora:latest

# Create artifact repository
gcloud artifacts repositories create $REPO \
  --repository-format=docker \
  --location=$REGION

# Build and push
docker build --build-arg HF_TOKEN=$HF_TOKEN -t $IMAGE .
docker push $IMAGE

# Deploy to Cloud Run with GPU
gcloud run deploy dataanalyst-auto \
  --image $IMAGE \
  --region $REGION \
  --gpu 1 \
  --cpu 8 \
  --memory 24Gi \
  --allow-unauthenticated \
  --execution-environment gen2

📊 Sample Outputs

The repository includes 60 generated table commentaries in outputs/commentaries/ and 39 visualizations in outputs/visualizations/, demonstrating the system's capabilities across various analytical scenarios.

🔧 Development

Running Tests

# Install dev dependencies
uv sync --all-extras

# Run tests
pytest

# Run with coverage
pytest --cov=. --cov-report=html

Code Formatting

# Format code
black .

# Sort imports
isort .

# Type checking
mypy .

📝 Training Data Format

NL→SQL Format

{
  "messages": [
    {
      "role": "user",
      "content": "System prompt + natural language question"
    },
    {
      "role": "assistant",
      "content": "```sql\nSELECT ...\n```"
    }
  ]
}
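
A record in this shape could be built, and the SQL recovered from a model completion, roughly as follows. The function names are hypothetical, and joining the system prompt and the question with a blank line is an assumption about how the two are combined.

```python
import re

FENCE = "`" * 3  # triple backtick, built indirectly to keep this example readable


def make_nl_sql_record(system_prompt: str, question: str, sql: str) -> dict:
    """Build one training record with the SQL wrapped in a sql code fence."""
    return {
        "messages": [
            # Assumed separator between system prompt and question.
            {"role": "user", "content": f"{system_prompt}\n\n{question}"},
            {"role": "assistant", "content": f"{FENCE}sql\n{sql.strip()}\n{FENCE}"},
        ]
    }


def extract_sql(completion: str) -> str:
    """Return the SQL inside the first sql fence, or the raw text if none."""
    pattern = re.compile(FENCE + r"sql\s*\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(completion)
    return match.group(1).strip() if match else completion.strip()


record = make_nl_sql_record(
    "You translate questions into SQL.",
    "How many transactions are there?",
    "SELECT COUNT(*) FROM transactions",
)
print(extract_sql(record["messages"][1]["content"]))
```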

SQL→Visualization Format

{
  "messages": [
    {
      "role": "user",
      "content": "System prompt + SQL query + table results"
    },
    {
      "role": "assistant",
      "content": "```python\nimport matplotlib.pyplot as plt\n...\n```"
    }
  ]
}

Table→Commentary Format

{
  "messages": [
    {
      "role": "user",
      "content": "System prompt + table data"
    },
    {
      "role": "assistant",
      "content": "Natural language analysis and insights"
    }
  ]
}
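
Since all three formats share the same two-message shape, a small check like the one below can catch malformed records before training. `validate_record` is a hypothetical helper written for this example, not part of the repository.

```python
def validate_record(record) -> list:
    """Return a list of problems; an empty list means the shape is valid."""
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or len(messages) != 2:
        return ["'messages' must be a list of exactly two entries"]

    problems = []
    for index, expected_role in enumerate(("user", "assistant")):
        message = messages[index]
        if not isinstance(message, dict):
            problems.append(f"message {index} is not an object")
            continue
        if message.get("role") != expected_role:
            problems.append(f"message {index} should have role '{expected_role}'")
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {index} has empty or non-string content")
    return problems


good = {
    "messages": [
        {"role": "user", "content": "System prompt + table data"},
        {"role": "assistant", "content": "Natural language analysis and insights"},
    ]
}
bad = {"messages": [{"role": "assistant", "content": ""}, {"role": "assistant", "content": "ok"}]}

print(validate_record(good))
print(validate_record(bad))
```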

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

📧 Contact

Axel Darmouni - axeldarmouni@gmail.com

Project Link: https://github.com/axeld5/dataanalyst_auto


Note: This project demonstrates an end-to-end approach to automated data analysis. The models are optimized for the pottery transactions dataset but can be adapted to other domains by preparing custom training data.
