This project implements a comprehensive data pipeline with two main components:

- **Data Management Backbone**: A robust data pipeline using GitHub Workflows that handles data ingestion, transformation, and storage across multiple zones (landing, formatted, trusted, and exploitation).
- **Data Analysis Backbone**: An advanced analytics layer built on top of the data management backbone that provides machine learning capabilities, including feature engineering, model training, prediction, and data governance with full traceability.

The project includes Make targets for installing dependencies, code formatting, code linting, data discovery, zone operations, model predictions, and environment management.
Before you begin, ensure that you have the following prerequisites installed on your system:

- **Make**: Make sure you have Make installed on your system.
- **Python 3.10.12**: Make sure you have Python installed on your system. I recommend using `pyenv`.
- **Poetry**: This project uses Poetry for dependency management. Try running:

  ```shell
  pip install poetry
  ```
- **Clone the Repository**:

  ```shell
  git clone https://github.com/akossch0/adsdb.git
  cd adsdb
  ```

- **Install Dependencies**:

  ```shell
  make install
  ```

  This command will use Poetry to install the project dependencies.
- **Install Dependencies**:

  ```shell
  make install
  ```

  This target installs project dependencies using Poetry.

- **Format Code**:

  ```shell
  make format
  ```

  This target uses Black to format the code in the `scripts/` directory.

- **Lint Code**:

  ```shell
  make lint
  ```

  This target runs Flake8 for code linting to ensure code quality.
- **Run Data Discovery**:

  ```shell
  make discover
  ```

  This target executes data discovery for the deaths, population, and gini datasets.
- **Run Landing Zone**:

  ```shell
  make run-landing
  ```

  This target executes the landing-zone script for initial data ingestion.

- **Run Formatted Zone**:

  ```shell
  make run-formatted
  ```

  This target executes the formatted-zone script for data standardization and initial cleaning.

- **Run Trusted Zone**:

  ```shell
  make run-trusted
  ```

  This target executes the trusted-zone script for data validation and quality assurance.

- **Run Exploitation Zone**:

  ```shell
  make run-exploitation
  ```

  This target executes the exploitation-zone script for final data preparation and analytics-ready datasets.

- **Run All Zones**:

  ```shell
  make run
  ```

  This target runs the complete data management pipeline: landing, formatted, trusted, and exploitation zones in sequence.
- **Predict on Test Set**:

  ```shell
  make predict
  ```

  This target executes model prediction operations on data from `datasets/predict/input`. The prediction system includes:

  - Model loading from trained pickle files
  - Data governance integration with traceability
  - Automated prediction on the latest input data
  - Timestamped output generation
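The prediction flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual implementation: the input/output directory layout follows the README, but the function name, the assumption of CSV inputs, and the output naming scheme are invented for the example.

```python
import pickle
from datetime import datetime
from pathlib import Path


def predict_latest(input_dir: str, model_path: str, output_dir: str) -> Path:
    """Load a pickled model, find the newest input file, and build a
    timestamped output path (illustrative sketch only)."""
    # Automated selection of the latest input data: pick the most
    # recently modified file (assumes CSV inputs for this sketch).
    latest = max(Path(input_dir).glob("*.csv"), key=lambda p: p.stat().st_mtime)

    # Model loading from a trained pickle file.
    with open(model_path, "rb") as f:
        model = pickle.load(f)

    # ... run the loaded model on the data read from `latest` here ...

    # Timestamped output generation: runs never overwrite each other.
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return Path(output_dir) / f"predictions_{stamp}.csv"
```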
- **Clean Datasets**:

  ```shell
  make clean-datasets
  ```

  This target cleans up datasets in the landing, formatted, trusted, and exploitation zones.
- **Display Help**:

  ```shell
  make help
  ```

  This target displays information about the available targets and their descriptions.

- **Clean Up**:

  ```shell
  make clean
  ```

  This target cleans up the virtual environment and any generated files. Use this when you want to reset the project.
The data management backbone follows a multi-zone architecture:

- **Landing Zone**: Raw data ingestion and initial storage
- **Formatted Zone**: Data standardization and format conversion
- **Trusted Zone**: Data validation, quality checks, and cleansing
- **Exploitation Zone**: Analytics-ready datasets for consumption
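To make the zone progression concrete, here is a toy sketch of one record flowing through the four stages in memory. The field names and validation rule are invented for illustration; the real pipeline persists each stage's output under its own `datasets/` subdirectory instead of chaining function calls.

```python
def landing(raw: str) -> str:
    """Landing: ingest the raw data as-is."""
    return raw


def formatted(record: str) -> dict:
    """Formatted: standardize into a common structure (CSV line -> dict)."""
    year, deaths = record.strip().split(",")
    return {"year": int(year), "deaths": int(deaths)}


def trusted(record: dict) -> dict:
    """Trusted: validate and reject records that fail quality checks."""
    if record["deaths"] < 0:
        raise ValueError("negative death count")
    return record


def exploitation(record: dict) -> dict:
    """Exploitation: derive analytics-ready fields."""
    return {**record, "decade": record["year"] // 10 * 10}


result = exploitation(trusted(formatted(landing("2021,153"))))
```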
Built on top of the data management backbone, the analysis component provides:

- **Analytical Sandboxes**: Subset of exploitation zone data prepared for specific analytical use cases
- **Feature Engineering**: Data preparation including encoding, discretization, null value handling, and feature selection
- **Model Training & Governance**: Comprehensive traceability for model training with feature stores
- **Prediction Engine**: Production-ready model inference with full data lineage tracking
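The feature-engineering steps listed above (encoding, discretization, null handling) can be illustrated on plain Python values. This is a generic sketch, not the project's code; the column name, bin edges, and categories are invented for the example.

```python
def fill_nulls(values, default=0.0):
    """Null value handling: replace missing entries with a default."""
    return [default if v is None else v for v in values]


def discretize(value, edges=(1.0, 2.0)):
    """Discretization: map a numeric value to an ordinal bucket."""
    return sum(value >= e for e in edges)


def one_hot(category, categories=("low", "mid", "high")):
    """Encoding: one-hot encode a categorical value."""
    return [int(category == c) for c in categories]


gini = fill_nulls([0.5, None, 2.3])      # -> [0.5, 0.0, 2.3]
buckets = [discretize(v) for v in gini]  # -> [0, 0, 2]
encoded = one_hot("mid")                 # -> [0, 1, 0]
```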
The analysis backbone includes:

- **Data Governance Database**: SQLite database (`datasets/trace/data-governance.db`) for tracking model training metadata
- **Model Storage**: Pickle-based model persistence in `datasets/predict/model/`
- **Prediction Pipeline**: Automated prediction workflow with timestamped outputs
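A governance database like the one above might record one row per training run. The actual schema of `data-governance.db` is not documented here, so the table and column names below are assumptions made for illustration.

```python
import sqlite3


def record_training_run(db_path, model_name, features, score):
    """Append one training-run record to a SQLite governance database
    (hypothetical schema; the real project's tables may differ)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS training_runs (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               model_name TEXT,
               features TEXT,  -- comma-separated feature list
               score REAL,
               trained_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    conn.execute(
        "INSERT INTO training_runs (model_name, features, score) VALUES (?, ?, ?)",
        (model_name, ",".join(features), score),
    )
    conn.commit()
    conn.close()
```

Keeping one append-only row per run is what makes training traceable: given a prediction, you can look up which features and score the producing model had.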
- **Virtual Environment**: The project uses Poetry to create and manage a virtual environment that isolates the project dependencies.
- **Code Formatting**: The project uses Black for code formatting. You can run `make format` to automatically format the code according to the Black style.
- **Code Linting**: Flake8 is used for code linting to ensure adherence to coding standards and identify potential issues in the code.
- **Data Governance**: The analysis backbone includes comprehensive data governance with model training traceability stored in the trace database.
- **Cleaning Up**: If needed, you can run `make clean` to remove the virtual environment and any generated files, providing a clean slate for the project.