Project Setup and Prerequisites for ADSDB - Data Management & Analysis Backbone Operations

Project Overview

This project implements a comprehensive data pipeline with two main components:

  1. Data Management Backbone: A robust data pipeline using GitHub Workflows that handles data ingestion, transformation, and storage across multiple zones (landing, formatted, trusted, and exploitation).

  2. Data Analysis Backbone: An advanced analytics layer built on top of the data management backbone that provides machine learning capabilities, including feature engineering, model training, prediction, and data governance with full traceability.

The project provides Make targets for installing dependencies, code formatting, code linting, data discovery, zone operations, model prediction, and environment management.

Prerequisites

Before you begin, ensure that you have the following prerequisites installed on your system:

  • Make: Make sure you have Make installed on your system.

  • Python 3.10.12: Make sure you have Python 3.10.12 installed on your system. We recommend managing Python versions with pyenv.

  • Poetry: This project uses Poetry for dependency management. Install it with:

    pip install poetry

Project Setup

  1. Clone the Repository:

    git clone https://github.com/akossch0/adsdb.git
    cd adsdb
  2. Install Dependencies:

    make install

    This command will use Poetry to install the project dependencies.

Available Targets

General Operations

  • Install Dependencies:

    make install

    This target installs project dependencies using Poetry.

  • Format Code:

    make format

    This target uses Black to format the code in the scripts/ directory.

  • Lint Code:

    make lint

    This target runs Flake8 for code linting to ensure code quality.

Data Management Backbone Operations

  • Run Data Discovery:

    make discover

    This target runs data discovery for the deaths, population, and gini datasets.

  • Run Landing Zone:

    make run-landing

    This target executes the landing-zone script for initial data ingestion.

  • Run Formatted Zone:

    make run-formatted

    This target executes the formatted-zone script for data standardization and initial cleaning.

  • Run Trusted Zone:

    make run-trusted

    This target executes the trusted-zone script for data validation and quality assurance.

  • Run Exploitation Zone:

    make run-exploitation

    This target executes the exploitation-zone script for final data preparation and analytics-ready datasets.

  • Run All Zones:

    make run

    This target runs the complete data management pipeline: landing, formatted, trusted, and exploitation zones.

Data Analysis Backbone Operations

  • Predict on Test Set:

    make predict

    This target executes model prediction operations on data from datasets/predict/input. The prediction system includes:

    • Model loading from trained pickle files
    • Data governance integration with traceability
    • Automated prediction on the latest input data
    • Timestamped output generation

Data Management

  • Clean Datasets:

    make clean-datasets

    This target cleans up the datasets in the landing, formatted, trusted, and exploitation zones.

Help and Maintenance

  • Display Help:

    make help

    This target displays information about the available targets and their descriptions.

  • Clean Up:

    make clean

    This target cleans up the virtual environment and any generated files. Use this when you want to reset the project.

Data Pipeline Architecture

Data Management Backbone

The data management backbone follows a multi-zone architecture:

  1. Landing Zone: Raw data ingestion and initial storage
  2. Formatted Zone: Data standardization and format conversion
  3. Trusted Zone: Data validation, quality checks, and cleansing
  4. Exploitation Zone: Analytics-ready datasets for consumption
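Conceptually, the four zones form a linear pipeline in which each stage consumes the previous stage's output. A toy sketch of that flow (all function names and the record format are illustrative, not the project's actual implementation):

```python
def landing(raw: list[str]) -> list[str]:
    """Landing: ingest raw records as-is."""
    return list(raw)


def formatted(records: list[str]) -> list[dict]:
    """Formatted: standardize 'name,value' strings into dicts."""
    out = []
    for rec in records:
        name, value = rec.split(",")
        out.append({"name": name.strip(), "value": value.strip()})
    return out


def trusted(records: list[dict]) -> list[dict]:
    """Trusted: validate, dropping rows whose value is not an integer."""
    return [r for r in records if r["value"].lstrip("-").isdigit()]


def exploitation(records: list[dict]) -> dict:
    """Exploitation: produce an analytics-ready name -> value mapping."""
    return {r["name"]: int(r["value"]) for r in records}


def run_pipeline(raw: list[str]) -> dict:
    """Equivalent of `make run`: all four zones in order."""
    return exploitation(trusted(formatted(landing(raw))))
```

The per-zone `make` targets correspond to running one stage at a time; `make run` chains them in this order.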

Data Analysis Backbone

Built on top of the data management backbone, the analysis component provides:

  1. Analytical Sandboxes: Subset of exploitation zone data prepared for specific analytical use cases
  2. Feature Engineering: Data preparation including encoding, discretization, null value handling, and feature selection
  3. Model Training & Governance: Comprehensive traceability for model training with feature stores
  4. Prediction Engine: Production-ready model inference with full data lineage tracking
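The feature-engineering operations listed above (encoding, discretization, null handling) could look roughly like the hand-rolled helpers below; in practice the project may well rely on pandas or scikit-learn instead, so treat this purely as a sketch of the ideas.

```python
def one_hot(values: list, categories: list) -> list:
    """Encoding: map each categorical value to a one-hot vector."""
    return [[1 if v == c else 0 for c in categories] for v in values]


def discretize(values: list, bin_edges: list) -> list:
    """Discretization: map a numeric value to the index of the first
    bin edge it does not exceed (last bin is open-ended)."""
    def bucket(v):
        for i, edge in enumerate(bin_edges):
            if v <= edge:
                return i
        return len(bin_edges)
    return [bucket(v) for v in values]


def fill_nulls(values: list, default) -> list:
    """Null handling: replace None with a default (e.g. the column mean)."""
    return [default if v is None else v for v in values]
```

Feature selection would then pick a subset of the engineered columns before they reach model training.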

The analysis backbone includes:

  • Data Governance Database: SQLite database (datasets/trace/data-governance.db) for tracking model training metadata
  • Model Storage: Pickle-based model persistence in datasets/predict/model/
  • Prediction Pipeline: Automated prediction workflow with timestamped outputs
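Recording training metadata in a SQLite governance database might look something like this. The table schema and column names are invented for illustration; the actual schema of datasets/trace/data-governance.db may differ.

```python
import sqlite3
import time


def record_training_run(db_path: str, model_name: str, features: list) -> int:
    """Insert one training-run row into the governance database
    and return its row id."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS training_runs (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               model_name TEXT NOT NULL,
               features   TEXT NOT NULL,  -- comma-separated feature list
               trained_at TEXT NOT NULL   -- ISO-8601 timestamp
           )"""
    )
    cur = conn.execute(
        "INSERT INTO training_runs (model_name, features, trained_at)"
        " VALUES (?, ?, ?)",
        (model_name, ",".join(features), time.strftime("%Y-%m-%dT%H:%M:%S")),
    )
    conn.commit()
    run_id = cur.lastrowid
    conn.close()
    return run_id
```

Keeping this metadata alongside the pickled model is what lets a later prediction run be traced back to the exact features and time of training.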

Additional Notes

  • Virtual Environment: The project uses Poetry to manage a virtual environment. The virtual environment is created and managed by Poetry to isolate project dependencies.

  • Code Formatting: The project uses Black for code formatting. You can run make format to automatically format the code according to the Black style.

  • Code Linting: Flake8 is used for code linting to ensure adherence to coding standards and identify potential issues in the code.

  • Data Governance: The analysis backbone includes comprehensive data governance with model training traceability stored in the trace database.

  • Cleaning Up: If needed, you can run make clean to remove the virtual environment and any generated files, providing a clean slate for the project.
