A modular data engineering pipeline designed to transform raw Olist e-commerce data into business-ready analytical assets. The project implements a Medallion Architecture to ensure data integrity, traceability, and performance.
- Medallion Data Architecture: Orchestrates data through three distinct layers:
  - Bronze (Raw): SQLite-based relational staging of raw CSV data.
  - Silver (Cleaned): Parquet-backed layer with standardized schemas, deduplicated records, and normalized formatting.
  - Gold (Business): Aggregated analytical reports optimized for executive insights and BI consumption.
- Automated Quality Gating: Integrates Great Expectations to perform rigorous validation on the Bronze layer, catching nulls, schema drift, and relational inconsistencies before cleaning.
- High-Performance ETL: Built with Polars to leverage multi-threaded processing and LazyFrame execution for efficient data manipulation.
- Automated Verification: Includes a post-processing verification suite that audits the existence, volume, and integrity of the generated Parquet files.
- Structured Logging & Auditing: Features a centralized logging system and generates a Markdown-based `data_health_summary.md` after every run to track pipeline health.
- Language: Python 3.x
- Data Processing: Polars
- Quality Assurance: Great Expectations
- Storage: SQLite (Staging) & Apache Parquet (Processing/Output)
- Orchestration: Custom Python-based orchestrator (`main.py`)
- Environment: `pathlib` for robust cross-platform path management
The system follows a strict logical flow to ensure reliability:
- Ingestion: `Database_Manager` loads raw CSVs into a relational SQLite database.
- Validation: `Quality_Guardian` runs pre-defined expectation suites to audit data health.
- Cleaning: `Data_Cleaner` standardizes types (e.g., datetime), handles missing values, and persists to Parquet.
- Modeling: `Data_Modeler` joins disparate tables to create complex reports like `delivery_reliability` and `seller_rankings`.
- Verification: `DataVerifier` performs a final sanity check on all output files.
```
PROJECT/
├── data/
│   ├── olist_raw.db          # Bronze Layer (SQLite)
│   ├── gold/                 # Gold Layer (Aggregated Parquet)
│   ├── processed/            # Silver Layer (Cleaned Parquet)
│   ├── raw/                  # Source CSV files
│   └── reports/              # Quality audit results (MD)
├── scripts/
│   ├── Database_Manager.py   # Ingestion logic
│   ├── Quality_Guardian.py   # Great Expectations integration
│   ├── Data_Cleaner.py       # Silver transformation logic
│   ├── Data_Modeler.py       # Gold aggregation logic
│   ├── Verification.py       # Integrity testing suite
│   └── constants.py          # Configuration and path management
├── main.py                   # Pipeline Orchestrator
└── requirements.txt          # Project dependencies
```
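A `constants.py` along these lines could centralize the paths shown in the tree (a sketch only, assuming the module lives in `scripts/`; the variable names are illustrative, not necessarily the project's actual ones):

```python
from pathlib import Path

# Resolve all paths relative to the project root so the pipeline
# runs correctly from any working directory.
PROJECT_ROOT = Path(__file__).resolve().parent.parent

DATA_DIR = PROJECT_ROOT / "data"
RAW_DIR = DATA_DIR / "raw"            # Source CSV files
PROCESSED_DIR = DATA_DIR / "processed"  # Silver layer
GOLD_DIR = DATA_DIR / "gold"          # Gold layer
REPORTS_DIR = DATA_DIR / "reports"    # Quality audit results
DB_PATH = DATA_DIR / "olist_raw.db"   # Bronze layer (SQLite)
```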
- Clone the repository.
- Install dependencies: `pip install -r requirements.txt`
- Configure environment: Update paths in `constants.py` if necessary.
- Run the pipeline: `python main.py`. The orchestrator will execute all six stages of the pipeline in sequence and output a summary to the console.
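Conceptually, the orchestrator run looks something like the sketch below (hedged: `run_pipeline` and the lambda stand-ins are illustrative, not the actual `main.py` code, which dispatches to the real modules):

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("pipeline")

def run_pipeline(stages):
    """Run each (name, callable) stage in order; an exception halts the run."""
    for name, stage in stages:
        log.info("starting stage: %s", name)
        stage()
        log.info("finished stage: %s", name)

# Hypothetical no-op callables standing in for the real stage modules.
run_pipeline([
    ("ingestion", lambda: None),
    ("validation", lambda: None),
    ("cleaning", lambda: None),
    ("modeling", lambda: None),
    ("verification", lambda: None),
])
```

Letting any stage raise (rather than swallowing errors) is what makes the validation gate effective: a failed Bronze audit stops cleaning from running on bad data.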
- Visualization Suite: Integration of Seaborn/Matplotlib for automated executive dashboards.
- Advanced Polars Optimization: Transitioning all cleaning operations to 100% LazyFrame execution for improved memory efficiency.
- Live Quality Guard: Implementing the Quality Guardian as a persistent decorator-based service.
This project is for educational and portfolio purposes, utilizing the public Olist dataset available on Kaggle.