
FabLab SQL Endpoint Benchmark

Performance benchmark comparing Microsoft Fabric SQL endpoints using TPC-DS-inspired queries.

What this benchmarks

Four endpoints are tested under identical conditions:

Endpoint                Type                    Delta config
----------------------  ----------------------  -----------------------------
lakehouse_default       Lakehouse SQL endpoint  No partitioning, no V-Order
lakehouse_partitioned   Lakehouse SQL endpoint  PARTITION BY ss_sold_date_sk
lakehouse_vorder        Lakehouse SQL endpoint  V-Order enabled at write time
warehouse               Fabric Warehouse        Standard configuration
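
The three Lakehouse variants differ only in how the Delta tables are written. A hypothetical PySpark sketch (it needs a Fabric Spark session; schema/table names, the source path, and the V-Order option spelling are illustrative assumptions — the actual ingestion lives in ingestion/01_lakehouse_ingest.ipynb):

```python
# Illustrative only: requires a Fabric Spark session; schema and table
# names are placeholders, not the repository's actual ingestion code.
df = spark.read.csv("Files/tpcds/store_sales", header=True, inferSchema=True)

# lakehouse_default: plain Delta, no partitioning, no V-Order
df.write.format("delta").mode("overwrite") \
    .saveAsTable("lakehouse_default.store_sales")

# lakehouse_partitioned: partition by the sold-date surrogate key
df.write.format("delta").mode("overwrite") \
    .partitionBy("ss_sold_date_sk") \
    .saveAsTable("lakehouse_partitioned.store_sales")

# lakehouse_vorder: request V-Order at write time
# (option name as documented for Fabric Spark; verify against current docs)
df.write.format("delta").mode("overwrite") \
    .option("parquet.vorder.enabled", "true") \
    .saveAsTable("lakehouse_vorder.store_sales")
```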

Five TPC-DS-inspired queries are run at SF100 (~100 GB) scale in two cache modes:

  • Cold — first run after capacity resume (true cold cache, guaranteed by pause/resume cycle)
  • Warm — 3 repetitions on a hot, active capacity
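
The cold/warm timing logic can be sketched as follows; `execute` and the repetition count are illustrative stand-ins, not the actual benchmark/runner.py interface:

```python
import time
from statistics import median

def time_query(execute, warm_repetitions=3):
    """Time one cold run followed by warm repetitions of a query.

    `execute` is any zero-argument callable that submits the query to an
    endpoint; this is an illustrative harness, not the repository's
    runner.py implementation.
    """
    start = time.perf_counter()
    execute()  # first run after capacity resume = cold cache
    cold = time.perf_counter() - start

    warm = []
    for _ in range(warm_repetitions):
        start = time.perf_counter()
        execute()  # subsequent runs hit a hot, active capacity
        warm.append(time.perf_counter() - start)

    return {"cold_s": cold, "warm_s": warm, "warm_median_s": median(warm)}
```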

Repository layout

fablab-sql-endpoint/
├── provision/          ← One-time Fabric resource setup + capacity manager
├── data_generation/    ← TPC-DS data generation (dsdgen) + OneLake upload
├── ingestion/          ← Spark notebooks to load CSVs → Delta / Warehouse
├── sql/                ← Five benchmark SQL queries (q01–q05)
├── benchmark/          ← Main runner, config, connection and utilities
├── analysis/           ← Results analysis notebook (charts + statistics)
├── results/            ← Output CSV/JSON files (committed to repo)
├── specs/              ← Authoritative project specification (Spanish)
├── .env.example        ← Environment variable template
└── requirements.txt

Documentation by folder

Folder            What its README covers
----------------  -----------------------------------------------------------------------
provision/        Create the Fabric workspace/Lakehouse/Warehouse; capacity pause/resume
data_generation/  Generate TPC-DS CSVs with dsdgen; split, gzip and upload to OneLake
ingestion/        Load CSVs into the 3 Lakehouse schemas and the Warehouse via Spark
benchmark/        Run the benchmark; config reference; output format
results/          Output CSV/JSON files, committed to the repo

End-to-end workflow

1. provision/setup_fabric.py       → create workspace, Lakehouse, Warehouse
2. data_generation/generate_csv.py → generate TPC-DS data with dsdgen
3. data_generation/upload_to_onelake.py → split, gzip, upload to OneLake
4. ingestion/01_lakehouse_ingest.ipynb  → load data into 3 Lakehouse schemas
5. ingestion/02_warehouse_ingest.sql    → load data into Warehouse (CTAS)
6. benchmark/runner.py             → run benchmark (cold + warm blocks)
7. analysis/analyze_results.ipynb  → compare results across endpoints

Prerequisites

# Install Python dependencies
pip install -r requirements.txt

# Authenticate
az login

Copy .env.example to .env and fill in the values printed by provision/setup_fabric.py.
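
Loading the .env values at startup takes only a few lines of standard-library Python. A minimal stdlib-only sketch (the repository may rely on a package such as python-dotenv instead; the variable names come from whatever setup_fabric.py prints):

```python
import os
from pathlib import Path

def load_env(path=".env"):
    """Load KEY=VALUE pairs from a .env file into os.environ.

    Stdlib-only sketch: skips blank lines and comments, and does not
    overwrite variables already set in the environment.
    """
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())
```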


Results

Benchmark output files are committed to results/:

File                            Scale  Description
------------------------------  -----  ------------------------------------------------
benchmark_20260410T160604.csv   SF100  Raw results: all 4 endpoints, Q1–Q5, cold + warm
benchmark_20260410T160604.json  SF100  Same data in JSON format

Open analysis/analyze_results.ipynb to generate comparison charts from any results file.
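
The aggregation behind those charts boils down to grouping durations by endpoint. A stdlib-only sketch, assuming hypothetical column names (endpoint, cache_mode, duration_s) rather than the actual output schema — check the files in results/ for the real one:

```python
import csv
from collections import defaultdict
from statistics import median

def median_warm_by_endpoint(csv_path):
    """Median warm-cache duration per endpoint from a results CSV.

    Column names here are assumptions for illustration; inspect the
    committed results files for the actual schema.
    """
    durations = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["cache_mode"] == "warm":
                durations[row["endpoint"]].append(float(row["duration_s"]))
    return {endpoint: median(values) for endpoint, values in durations.items()}
```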
