
Data Science Monorepo

Code style: black · Imports: isort · Checked with mypy · pre-commit

Monorepo example for Data Science teams.

  • Check out the original article here to understand the underlying principles of the design.
  • Check out the getting started guide for more information on how to use this project.

Getting Started

Requirements

  • Python >= 3.10 (can be installed with pyenv or with asdf)
  • Pipenv == 2022.9.24
  • direnv
  • Docker

Setup

The project is automated using make and the Makefile.

On the first run, use make setup to bootstrap the project locally. This installs the dependencies with pipenv and creates the necessary files. It also runs make all so you can verify that everything works on your machine.

Project Structure

This is an ML monorepo. Multiple modules live in the ./src directory. The current ones are:

src
├── config             # Base configuration objects
├── data_access_layer  # Data Access layer helpers
├── datasets           # Schema for different datasets using pandera
├── feature_store      # Common interfaces for model features
└── models             # Code to generate different models
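
As an illustration of the datasets module, a pandera schema might look like the following sketch (the schema name, columns, and checks are hypothetical, not the project's actual definitions):

    import pandera as pa

    # Hypothetical schema for a diabetes features table;
    # the real schemas live in src/datasets.
    diabetes_features_schema = pa.DataFrameSchema(
        {
            "age": pa.Column(float),
            "bmi": pa.Column(float),
            "target": pa.Column(float, pa.Check.ge(0)),
        },
        strict=True,  # reject unexpected columns
    )

    # Modules validate dataframes at their boundaries:
    # validated_df = diabetes_features_schema.validate(df)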

Run project

Each module can implement different runs, but in general each should be runnable as a module. For example:

python -m models.diabetes.features
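
A minimal sketch of what such a runnable module can look like, assuming plain argparse and the sklearn diabetes dataset (the path, flags, and logging setup are illustrative; the real modules use their own configuration):

    # src/models/diabetes/features.py (illustrative sketch)
    import argparse
    import logging

    from sklearn.datasets import load_diabetes

    logger = logging.getLogger("features")


    def main(dst: str) -> None:
        logger.info("Start | run")
        df = load_diabetes(as_frame=True).frame  # features plus target column
        df.to_parquet(dst)  # requires pyarrow or fastparquet
        logger.info("End | run | (dst=%r)", dst)


    if __name__ == "__main__":
        logging.basicConfig(level=logging.INFO)
        parser = argparse.ArgumentParser()
        parser.add_argument("--dst", required=True, help="output parquet path")
        main(parser.parse_args().dst)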

When running the project as a Docker image, you must specify the module to run as the Docker command. A helper script is provided to run the built Docker image with the latest project files and the AWS credentials set up:

./scripts/docker-run models.diabetes.features

End-to-end example: Diabetes prediction

  1. Install the project dependencies in the pipenv environment.

  2. Run MLflow locally with the model registry:

    $ pipenv run mlflow
    [INFO] Starting gunicorn 20.1.0
    [INFO] Listening at: http://127.0.0.1:5000
  3. Run the feature extraction:

    $ pipenv run python -m models.diabetes.features \
        --dst tmp/data/diabetes_features.parquet
    features.py:main:14 INFO: Start | run
    features.py:main:14 INFO: End | run | (result=sklearn_dataset='diabetes' dst='tmp/data/diabetes_features.parquet')
  4. Run the data preprocessing:

    $ pipenv run python -m models.preprocess \
        --src_features=tmp/data/diabetes_features.parquet \
        --dst_x_train=tmp/data/x_train.parquet \
        --dst_y_train=tmp/data/y_train.parquet \
        --dst_x_test=tmp/data/x_test.parquet \
        --dst_y_test=tmp/data/y_test.parquet
    preprocess.py:<module>:46 INFO: Start | run
    preprocess.py:<module>:46 INFO: End | run | (result=src_features='tmp/data/diabetes_features.parquet' dst_x_train='tmp/data/x_train.parquet' dst_y_train='tmp/data/y_train.parquet' dst_x_test='tmp/data/x_test.parquet' dst_y_test='tmp/data/y_test.parquet')
  5. Run the model training (note the output model URI):

    $ pipenv run python -m models.diabetes.train \
        --src_x_train=tmp/data/x_train.parquet \
        --src_y_train=tmp/data/y_train.parquet \
        --src_x_test=tmp/data/x_test.parquet \
        --src_y_test=tmp/data/y_test.parquet
    train.py:<module>:72 INFO: Start | run
    train.py:<module>:72 INFO: End | run
    train.py:<module>:75 INFO: Model saved to runs:/2099249145894ae3b16b7a37653cec06/model
  6. Run predictions using the previously logged model in MLflow (a sketch of this log-and-load round trip follows this list):

    $ pipenv run python -m models.predict \
        --src_features=tmp/data/x_test.parquet \
        --src_model=runs:/2099249145894ae3b16b7a37653cec06/model \
        --dst_y_hat=tmp/data/y_hat_test.parquet
    predict.py:<module>:126 INFO: src_features='tmp/data/x_test.parquet' src_model='runs:/2099249145894ae3b16b7a37653cec06/model' flavour='sklearn' parallel_backend='threading' n_jobs=-1 batch_predictions=False batch_size=10000 progress_bar=True dst_y_hat='tmp/data/y_hat_test.parquet' mlflow=MLFlowConfig(experiment_name='predict', run_name='run-2023-06-20T18-09-48', tracking_uri=SecretStr('**********'), flavor='sklearn', tags=None) execution_date='2023-06-20T18:09:48Z'
    predict.py:<module>:131 INFO: Reading data from tmp/data/x_test.parquet
    predict.py:<module>:133 INFO: Loading model from runs:/2099249145894ae3b16b7a37653cec06/model
    predict.py:<module>:137 INFO: Writing y_hat to tmp/data/y_hat_test.parquet
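
The train and predict steps above hinge on MLflow runs:/ model URIs. A minimal sketch of that log-and-load round trip, assuming the local tracking server from step 2 (the experiment name and model choice are illustrative):

    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    mlflow.set_tracking_uri("http://127.0.0.1:5000")  # local server from step 2
    mlflow.set_experiment("diabetes")  # illustrative experiment name

    x, y = load_diabetes(return_X_y=True)
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

    with mlflow.start_run() as run:
        model = Ridge().fit(x_train, y_train)
        mlflow.sklearn.log_model(model, artifact_path="model")
        model_uri = f"runs:/{run.info.run_id}/model"
        print(f"Model saved to {model_uri}")

    # The predict step resolves the same URI against the tracking server:
    loaded = mlflow.sklearn.load_model(model_uri)
    y_hat = loaded.predict(x_test)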

Documentation

Documentation uses mkdocs.

To serve the documentation locally, run make mkdocs and open http://localhost:8000.

To expand the documentation, edit the files in the ./docs directory. Any markdown file can be added there and will be rendered in the documentation with navigation and search support.
