Automated data analysis system powered by fine-tuned LLMs, built around a Natural Language → SQL → Visualization pipeline
DataAnalyst Auto is an end-to-end system that automates data analysis workflows using fine-tuned Large Language Models (LLMs). The system consists of three specialized models that work together to transform natural language questions into SQL queries, execute them, generate visualizations, and provide natural language commentary on the results.
- 🗣️ Natural Language to SQL: Convert questions in plain English to SQL queries
- 📈 SQL to Visualization: Automatically generate Python visualization code from SQL results
- 💬 Table to Commentary: Generate insightful natural language summaries of data results
- 🎓 Fine-tuned Models: Specialized Qwen 2.5 3B models trained on domain-specific data
- 🐳 Containerized Deployment: Ready-to-deploy Docker setup with multi-LoRA serving
- ☁️ Cloud-Ready: Optimized for deployment on Google Cloud Run with GPU support
The system implements a three-stage pipeline:
Natural Language Question
↓
[NL → SQL Model]
↓
SQL Query Execution
↓
[SQL → Visualization Model]
↓
Visualization Code + Plot
↓
[Table → Commentary Model]
↓
Natural Language Insights
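The flow above can be sketched as a small orchestration function. This is only an illustration of the stage order, not the repository's actual code: `run_pipeline`, `PipelineResult`, the `pottery` table name, and the stub stages with their made-up return values are all hypothetical, and each model is passed in as a plain callable so the sketch runs without any model or database.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Rows = List[Tuple[Any, ...]]


@dataclass
class PipelineResult:
    sql: str
    rows: Rows
    plot_code: str
    commentary: str


def run_pipeline(
    question: str,
    nl_to_sql: Callable[[str], str],
    execute_sql: Callable[[str], Rows],
    sql_to_vis: Callable[[str, Rows], str],
    table_to_commentary: Callable[[Rows], str],
) -> PipelineResult:
    """Chain the three model stages around a SQL execution step."""
    sql = nl_to_sql(question)                # stage 1: question -> SQL
    rows = execute_sql(sql)                  # run the generated query
    plot_code = sql_to_vis(sql, rows)        # stage 2: SQL + results -> plotting code
    commentary = table_to_commentary(rows)   # stage 3: results -> commentary
    return PipelineResult(sql, rows, plot_code, commentary)


# Stub stages with invented values, standing in for the fine-tuned models.
result = run_pipeline(
    "What is the total revenue per material?",
    nl_to_sql=lambda q: (
        "SELECT Material, SUM(TransactionAmount) FROM pottery GROUP BY Material"
    ),
    execute_sql=lambda sql: [("Clay", 120.0), ("Porcelain", 80.0)],
    sql_to_vis=lambda sql, rows: "plt.bar(...)",
    table_to_commentary=lambda rows: f"{len(rows)} materials returned.",
)
print(result.commentary)
```

Passing the stages as callables keeps the ordering logic separate from how each model is served (HuggingFace Inference API or a local vLLM container).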
dataanalyst_auto/
├── src/  # Source code
│   ├── data_prep/  # Data preparation modules
│   │   ├── nlsql_data_prep.py  # NL→SQL dataset preparation
│   │   ├── sqlvis_data_prep.py  # SQL→Vis dataset preparation
│   │   └── tabletocomm_data_prep.py  # Table→Commentary dataset preparation
│   ├── training/  # Training modules
│   │   ├── nlsql_sft.py  # NL→SQL supervised fine-tuning
│   │   ├── sqlvis_sft.py  # SQL→Vis supervised fine-tuning
│   │   ├── tabletocomm_sft.py  # Table→Commentary supervised fine-tuning
│   │   └── sql_table_commentary.py  # Alternative commentary training
│   └── inference/  # Inference & execution modules
│       ├── run_qwen_inference.py  # Run inference with HuggingFace API
│       ├── execute_queries_and_plot.py  # Execute SQL and generate plots
│       └── plot_functions.py  # Visualization utilities
│
├── data/  # Data files
│   ├── raw/  # Raw CSV datasets
│   │   ├── pottery_transactions.csv  # Sample pottery sales dataset
│   │   └── linked_nl_sql_visualizations.csv
│   └── processed/  # Training & evaluation data
│       ├── nl_to_sql_training_data.json
│       ├── nl_to_sql_eval_data.json
│       ├── sql_to_vis_training_data.json
│       ├── sql_to_vis_eval_data.json
│       ├── table_to_commentary_training_data.json
│       ├── table_to_commentary_eval_data.json
│       └── sql_table_commentary_training_data.json
│
├── outputs/  # Generated outputs
│   ├── commentaries/  # 60 generated table commentaries
│   └── visualizations/  # 39 generated plots
│
├── Dockerfile  # Multi-LoRA vLLM deployment
├── pyproject.toml  # Project dependencies & config
├── README.md  # This file
├── INFERENCE_USAGE.md  # Inference guide
└── .gitignore  # Git ignore rules
- Python 3.9 or higher
- uv package manager (recommended) or pip
- HuggingFace account (for inference API access)
# Clone the repository
git clone https://github.com/axeld5/dataanalyst_auto.git
cd dataanalyst_auto
# Install dependencies with uv (recommended)
uv sync
# Or with pip
pip install -e .
Create a .env file in the project root:
HF_TOKEN=your_huggingface_token_here
Get your free HuggingFace token at: https://huggingface.co/settings/tokens
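For reference, here is one way a script could pick up `HF_TOKEN`, checking the process environment first and falling back to the `.env` file. This is a stdlib-only sketch: `parse_env` and `load_hf_token` are hypothetical helpers, and the repository's own scripts may load the token differently (for example through a dotenv library).

```python
import os
from pathlib import Path
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_hf_token(env_path: str = ".env") -> str:
    """Return HF_TOKEN from the environment, else from the .env file."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        path = Path(env_path)
        if path.is_file():
            token = parse_env(path.read_text(encoding="utf-8")).get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set in the environment or in .env")
    return token


print(parse_env("# local secrets\nHF_TOKEN=your_huggingface_token_here\n"))
```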
Test the fine-tuned models using the HuggingFace Inference API:
# Test NL→SQL conversion
python src/inference/run_qwen_inference.py --prefix nl_to_sql --file eval --max_samples 5
# Test SQL→Visualization
python src/inference/run_qwen_inference.py --prefix sql_to_vis --file eval --max_samples 5
# Test Table→Commentary
python src/inference/run_qwen_inference.py --prefix table_to_commentary --file eval --max_samples 5
For detailed inference options, see INFERENCE_USAGE.md.
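The assistant messages in the example records in this README wrap SQL and Python in fenced blocks, so a caller typically needs to strip the fence before executing anything. The helper below is a minimal sketch of that step. `extract_fenced` is a hypothetical name, the `pottery` table in the usage line is invented, and the sketch assumes model replies follow the same fenced layout as the example records.

```python
import re
from typing import Optional

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

# Optional language tag, a newline, then the shortest body up to the closing fence.
_FENCED_BLOCK = re.compile(FENCE + r"(\w+)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_fenced(text: str, language: str) -> Optional[str]:
    """Return the body of the first fenced block tagged with `language`."""
    for match in _FENCED_BLOCK.finditer(text):
        tag = (match.group(1) or "").lower()
        if tag == language.lower():
            return match.group(2).strip()
    return None


reply = FENCE + "sql\nSELECT Material, COUNT(*) FROM pottery GROUP BY Material\n" + FENCE
print(extract_fenced(reply, "sql"))
```

Returning `None` when no matching fence is found lets the caller decide whether to retry the model or treat the whole reply as raw SQL.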
Prepare training datasets from your own data:
# Prepare NL→SQL dataset
python src/data_prep/nlsql_data_prep.py
# Prepare SQL→Visualization dataset
python src/data_prep/sqlvis_data_prep.py
# Prepare Table→Commentary dataset
python src/data_prep/tabletocomm_data_prep.py
Fine-tune the models on your prepared datasets:
# Fine-tune NL→SQL model
python src/training/nlsql_sft.py
# Fine-tune SQL→Visualization model
python src/training/sqlvis_sft.py
# Fine-tune Table→Commentary model
python src/training/tabletocomm_sft.py
Models will be saved in *_sft/merged/ directories.
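Before launching a fine-tuning run on custom data, it can be worth checking that every record follows the user/assistant `messages` layout shown in this README. The sketch below does that check. `validate_record` and `validate_file` are hypothetical helpers, not part of the repository, and `validate_file` assumes each processed file holds a single JSON array of records, which the README does not state explicitly.

```python
import json
from typing import Any, List


def validate_record(record: Any) -> List[str]:
    """Return a list of problems; an empty list means the record looks valid."""
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or len(messages) != 2:
        return ["expected a 'messages' list with exactly two entries"]
    problems: List[str] = []
    for message, expected_role in zip(messages, ("user", "assistant")):
        if not isinstance(message, dict):
            problems.append(f"{expected_role} message is not an object")
            continue
        if message.get("role") != expected_role:
            problems.append(f"expected role '{expected_role}', got {message.get('role')!r}")
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"{expected_role} content is empty or not a string")
    return problems


def validate_file(path: str) -> int:
    """Count invalid records in a file assumed to contain a JSON array."""
    with open(path, encoding="utf-8") as handle:
        records = json.load(handle)
    invalid = 0
    for index, record in enumerate(records):
        problems = validate_record(record)
        if problems:
            invalid += 1
            print(f"record {index}: {'; '.join(problems)}")
    return invalid


example = {
    "messages": [
        {"role": "user", "content": "System prompt + natural language question"},
        {"role": "assistant", "content": "SELECT ..."},
    ]
}
print(validate_record(example))
```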
Run SQL queries and generate visualizations:
python src/inference/execute_queries_and_plot.py
The project includes data/raw/pottery_transactions.csv, a sample dataset containing:
- ID: Transaction identifier
- Date: Transaction date
- Material: Pottery material (Clay, Porcelain, Ceramic)
- Container: Type of pottery item (Mug, Bowl, Vase, etc.)
- Size: Item size (Small, Medium, Large)
- TransactionAmount: Transaction amount in USD
- Sex: Customer gender (M/F)
- Age: Customer age
This dataset demonstrates the system's capabilities across various analytical tasks including aggregations, time series analysis, segmentation, and complex multi-dimensional queries.
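To make the schema concrete, the sketch below loads a few rows with these columns into an in-memory SQLite database and runs a simple aggregation. The three rows are invented for illustration and are not taken from the real CSV. The table name `pottery_transactions`, the ISO date format, and the choice of SQLite are also assumptions, since the README does not say how `execute_queries_and_plot.py` runs its queries.

```python
import csv
import io
import sqlite3

# Invented sample rows that follow the documented column names.
SAMPLE_CSV = """ID,Date,Material,Container,Size,TransactionAmount,Sex,Age
1,2024-01-05,Clay,Mug,Small,12.50,F,34
2,2024-01-06,Porcelain,Vase,Large,80.00,M,51
3,2024-01-09,Clay,Bowl,Medium,25.00,F,29
"""


def load_transactions(csv_text: str) -> sqlite3.Connection:
    """Create an in-memory table and fill it from CSV text."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pottery_transactions ("
        "ID INTEGER, Date TEXT, Material TEXT, Container TEXT, "
        "Size TEXT, TransactionAmount REAL, Sex TEXT, Age INTEGER)"
    )
    rows = [
        (
            int(r["ID"]), r["Date"], r["Material"], r["Container"],
            r["Size"], float(r["TransactionAmount"]), r["Sex"], int(r["Age"]),
        )
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.executemany(
        "INSERT INTO pottery_transactions VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows
    )
    return conn


conn = load_transactions(SAMPLE_CSV)
totals = conn.execute(
    "SELECT Material, SUM(TransactionAmount) AS total "
    "FROM pottery_transactions GROUP BY Material ORDER BY total DESC"
).fetchall()
print(totals)
```

The same pattern extends to the time series and segmentation queries mentioned above by grouping on `Date`, `Sex`, or `Age` instead of `Material`.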
Fine-tuned models are available on HuggingFace:
- NL→SQL: axel-darmouni/qwen-2.5-3b-it-nltosql-sft
- SQL→Vis: axel-darmouni/qwen-2.5-3b-it-sqltovis-sft
- Table→Commentary: axel-darmouni/qwen-2.5-3b-it-tabletocomm-sft
All models are based on Qwen 2.5 3B Instruct and optimized for low-resource deployment.
Deploy the system with multi-LoRA serving using vLLM:
# Build the Docker image
docker build --build-arg HF_TOKEN=$HF_TOKEN -t dataanalyst-auto:latest .
# Run locally
docker run -p 8080:8080 --gpus all dataanalyst-auto:latest
The container serves all three models simultaneously using LoRA adapters, enabling efficient multi-task inference on a single GPU.
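Once the container is running, a client could select a task by naming the corresponding adapter in the request. The sketch below assumes the image exposes vLLM's OpenAI-compatible `/v1/chat/completions` endpoint on port 8080 and that the adapter is registered under the name `nl_to_sql`. Both are assumptions, since the Dockerfile contents are not shown here, and `build_chat_request` and `send_chat_request` are hypothetical helpers.

```python
import json
import urllib.request
from typing import Any, Dict

BASE_URL = "http://localhost:8080"  # port taken from the docker run command


def build_chat_request(adapter: str, prompt: str, max_tokens: int = 256) -> Dict[str, Any]:
    """Build an OpenAI-style chat payload; `model` selects the LoRA adapter."""
    return {
        "model": adapter,  # assumed adapter name, check the Dockerfile
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }


def send_chat_request(
    payload: Dict[str, Any], base_url: str = BASE_URL, timeout: float = 60.0
) -> str:
    """POST the payload and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_request("nl_to_sql", "Total sales per material?")
print(json.dumps(payload, indent=2))
# With the container running: print(send_chat_request(payload))
```

The actual adapter names depend on how the Dockerfile registers them with vLLM (typically through its `--lora-modules` option), so they should be read from there rather than from this sketch.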
# Configure your project
PROJECT=your-project-id
REGION=europe-west1
REPO=dataanalyst-auto
IMAGE=$REGION-docker.pkg.dev/$PROJECT/$REPO/vllm-multilora:latest
# Create artifact repository
gcloud artifacts repositories create $REPO \
--repository-format=docker \
--location=$REGION
# Build and push
docker build --build-arg HF_TOKEN=$HF_TOKEN -t $IMAGE .
docker push $IMAGE
# Deploy to Cloud Run with GPU
gcloud run deploy dataanalyst-auto \
--image $IMAGE \
--region $REGION \
--gpu 1 \
--cpu 8 \
--memory 24Gi \
--allow-unauthenticated \
--execution-environment gen2
The repository includes 60 generated table commentaries in outputs/commentaries/ and 39 visualizations in outputs/visualizations/, demonstrating the system's capabilities across various analytical scenarios.
# Install dev dependencies
uv sync --all-extras
# Run tests
pytest
# Run with coverage
pytest --cov=. --cov-report=html
# Format code
black .
# Sort imports
isort .
# Type checking
mypy .
NL→SQL:
{
  "messages": [
    {
      "role": "user",
      "content": "System prompt + natural language question"
    },
    {
      "role": "assistant",
      "content": "```sql\nSELECT ...\n```"
    }
  ]
}
SQL→Visualization:
{
  "messages": [
    {
      "role": "user",
      "content": "System prompt + SQL query + table results"
    },
    {
      "role": "assistant",
      "content": "```python\nimport matplotlib.pyplot as plt\n...\n```"
    }
  ]
}
Table→Commentary:
{
  "messages": [
    {
      "role": "user",
      "content": "System prompt + table data"
    },
    {
      "role": "assistant",
      "content": "Natural language analysis and insights"
    }
  ]
}
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with Qwen 2.5 3B Instruct by Alibaba Cloud
- Fine-tuning powered by Unsloth
- Inference serving with vLLM
- Training framework: TRL
Axel Darmouni - axeldarmouni@gmail.com
Project Link: https://github.com/axeld5/dataanalyst_auto
Note: This project demonstrates an end-to-end approach to automated data analysis. The models are optimized for the pottery transactions dataset but can be adapted to other domains by preparing custom training data.