This project implements a comprehensive data pipeline with two main components:

- **Data Management Backbone**: A robust data pipeline using GitHub Workflows that handles data ingestion, transformation, and storage across multiple zones (landing, formatted, trusted, and exploitation).
- **Data Analysis Backbone**: An advanced analytics layer built on top of the data management backbone that provides machine learning capabilities, including feature engineering, model training, prediction, and data governance with full traceability.

The project includes Make targets for installing dependencies, code formatting, code linting, data discovery, zone operations, model predictions, and environment management.
Before you begin, ensure that you have the following prerequisites installed on your system:

- **Make**: Make sure you have Make installed on your system.
- **Python 3.10.12**: Make sure you have Python installed on your system. I recommend using `pyenv`.
- **Poetry**: This project uses Poetry for dependency management. Try running:

  ```shell
  pip install poetry
  ```
- **Clone the Repository**:

  ```shell
  git clone https://github.com/akossch0/adsdb.git
  cd adsdb
  ```

- **Install Dependencies**:

  ```shell
  make install
  ```

  This command will use Poetry to install the project dependencies.
- **Install Dependencies**:

  ```shell
  make install
  ```

  This target installs project dependencies using Poetry.

- **Format Code**:

  ```shell
  make format
  ```

  This target uses Black to format the code in the `scripts/` directory.

- **Lint Code**:

  ```shell
  make lint
  ```

  This target runs Flake8 for code linting to ensure code quality.
- **Run Data Discovery**:

  ```shell
  make discover
  ```

  This target executes data discovery for the deaths, population, and gini datasets.
- **Run Landing Zone**:

  ```shell
  make run-landing
  ```

  This target executes the landing-zone script for initial data ingestion.

- **Run Formatted Zone**:

  ```shell
  make run-formatted
  ```

  This target executes the formatted-zone script for data standardization and initial cleaning.

- **Run Trusted Zone**:

  ```shell
  make run-trusted
  ```

  This target executes the trusted-zone script for data validation and quality assurance.

- **Run Exploitation Zone**:

  ```shell
  make run-exploitation
  ```

  This target executes the exploitation-zone script for final data preparation and analytics-ready datasets.

- **Run All Zones**:

  ```shell
  make run
  ```

  This target runs the complete data management pipeline: landing, formatted, trusted, and exploitation zones in sequence.
- **Predict on Test Set**:

  ```shell
  make predict
  ```

  This target executes model prediction operations on data from `datasets/predict/input`. The prediction system includes:

  - Model loading from trained pickle files
  - Data governance integration with traceability
  - Automated prediction on the latest input data
  - Timestamped output generation
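The prediction flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual implementation: the input/output directory layout follows the README, but the function name, the assumption of CSV inputs, and the output naming scheme are invented for the example.

```python
import pickle
from datetime import datetime
from pathlib import Path


def predict_latest(input_dir: str, model_path: str, output_dir: str) -> Path:
    """Load a pickled model, find the newest input file, and build a
    timestamped output path (illustrative sketch only)."""
    # Automated selection of the latest input data: pick the most
    # recently modified file (assumes CSV inputs for this sketch).
    latest = max(Path(input_dir).glob("*.csv"), key=lambda p: p.stat().st_mtime)

    # Model loading from a trained pickle file.
    with open(model_path, "rb") as f:
        model = pickle.load(f)

    # ... run the loaded model on the data read from `latest` here ...

    # Timestamped output generation: runs never overwrite each other.
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return Path(output_dir) / f"predictions_{stamp}.csv"
```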
- **Clean Datasets**:

  ```shell
  make clean-datasets
  ```

  This target cleans up datasets in the landing, formatted, trusted, and exploitation zones.
- **Display Help**:

  ```shell
  make help
  ```

  This target displays information about the available targets and their descriptions.

- **Clean Up**:

  ```shell
  make clean
  ```

  This target cleans up the virtual environment and any generated files. Use this when you want to reset the project.
The data management backbone follows a multi-zone architecture:

- **Landing Zone**: Raw data ingestion and initial storage
- **Formatted Zone**: Data standardization and format conversion
- **Trusted Zone**: Data validation, quality checks, and cleansing
- **Exploitation Zone**: Analytics-ready datasets for consumption
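To make the zone progression concrete, here is a toy sketch of one record flowing through the four stages in memory. The field names and validation rule are invented for illustration; the real pipeline persists each stage's output under its own `datasets/` subdirectory instead of chaining function calls.

```python
def landing(raw: str) -> str:
    """Landing: ingest the raw data as-is."""
    return raw


def formatted(record: str) -> dict:
    """Formatted: standardize into a common structure (CSV line -> dict)."""
    year, deaths = record.strip().split(",")
    return {"year": int(year), "deaths": int(deaths)}


def trusted(record: dict) -> dict:
    """Trusted: validate and reject records that fail quality checks."""
    if record["deaths"] < 0:
        raise ValueError("negative death count")
    return record


def exploitation(record: dict) -> dict:
    """Exploitation: derive analytics-ready fields."""
    return {**record, "decade": record["year"] // 10 * 10}


result = exploitation(trusted(formatted(landing("2021,153"))))
```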
Built on top of the data management backbone, the analysis component provides:

- **Analytical Sandboxes**: Subset of exploitation zone data prepared for specific analytical use cases
- **Feature Engineering**: Data preparation including encoding, discretization, null value handling, and feature selection
- **Model Training & Governance**: Comprehensive traceability for model training with feature stores
- **Prediction Engine**: Production-ready model inference with full data lineage tracking
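The feature-engineering steps listed above (encoding, discretization, null handling) can be illustrated on plain Python values. This is a generic sketch, not the project's code; the column name, bin edges, and categories are invented for the example.

```python
def fill_nulls(values, default=0.0):
    """Null value handling: replace missing entries with a default."""
    return [default if v is None else v for v in values]


def discretize(value, edges=(1.0, 2.0)):
    """Discretization: map a numeric value to an ordinal bucket."""
    return sum(value >= e for e in edges)


def one_hot(category, categories=("low", "mid", "high")):
    """Encoding: one-hot encode a categorical value."""
    return [int(category == c) for c in categories]


gini = fill_nulls([0.5, None, 2.3])      # -> [0.5, 0.0, 2.3]
buckets = [discretize(v) for v in gini]  # -> [0, 0, 2]
encoded = one_hot("mid")                 # -> [0, 1, 0]
```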
The analysis backbone includes:

- **Data Governance Database**: SQLite database (`datasets/trace/data-governance.db`) for tracking model training metadata
- **Model Storage**: Pickle-based model persistence in `datasets/predict/model/`
- **Prediction Pipeline**: Automated prediction workflow with timestamped outputs
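A governance database like the one above might record one row per training run. The actual schema of `data-governance.db` is not documented here, so the table and column names below are assumptions made for illustration.

```python
import sqlite3


def record_training_run(db_path, model_name, features, score):
    """Append one training-run record to a SQLite governance database
    (hypothetical schema; the real project's tables may differ)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS training_runs (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               model_name TEXT,
               features TEXT,  -- comma-separated feature list
               score REAL,
               trained_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    conn.execute(
        "INSERT INTO training_runs (model_name, features, score) VALUES (?, ?, ?)",
        (model_name, ",".join(features), score),
    )
    conn.commit()
    conn.close()
```

Keeping one append-only row per run is what makes training traceable: given a prediction, you can look up which features and score the producing model had.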
- **Virtual Environment**: The project uses Poetry to create and manage a virtual environment that isolates the project dependencies.
- **Code Formatting**: The project uses Black for code formatting. You can run `make format` to automatically format the code according to the Black style.
- **Code Linting**: Flake8 is used for code linting to ensure adherence to coding standards and identify potential issues in the code.
- **Data Governance**: The analysis backbone includes comprehensive data governance with model training traceability stored in the trace database.
- **Cleaning Up**: If needed, you can run `make clean` to remove the virtual environment and any generated files, providing a clean slate for the project.