🎲 Your Best Bet

MLOps demo with Python models in dbt on the European Soccer Database

About The Project

Welcome to the high-octane world of production ML pipelines! We're thrilled to present an epic demonstration showcasing numerous MLOps concepts packed into a single dbt project. Strap in as we unveil this treasure trove of tools, tailored to empower data teams within organizations, speeding up the journey of ML models to production!

Imagine a scenario of daily (or weekly) sports betting where you're on a quest to outsmart the bookies. This project houses the code for a data warehouse powered by the European Soccer Database. Utilizing team and player statistics, performance metrics, FIFA stats, and bookie odds, we'll hunt down opportunities where our model paints a more accurate picture than at least one bookie. When our odds stack up better against theirs, it's our chance to strike gold! 💰
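To make the core idea concrete: a bookie's decimal odds imply a probability, and whenever our model assigns a higher probability to an outcome than the bookie's implied one, the bet has positive expected value. A minimal Python sketch of this comparison (our own illustration, not code from the repo):

# Minimal illustration (not from this repo): when does a bet beat the bookie?
def implied_probability(decimal_odds: float) -> float:
    """A bookie quoting decimal odds of 2.5 implies P(win) = 1 / 2.5 = 0.4."""
    return 1.0 / decimal_odds

def expected_value(model_prob: float, decimal_odds: float, stake: float = 1.0) -> float:
    """Expected profit of a stake under our model's probability."""
    return model_prob * (decimal_odds - 1.0) * stake - (1.0 - model_prob) * stake

# Our model says the home team wins 45% of the time; the bookie pays 2.5x.
print(implied_probability(2.5))   # 0.4   -> the bookie thinks 40%
print(expected_value(0.45, 2.5))  # 0.125 -> positive EV: worth a bet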

Within our pipeline, you can:

  • Version Your Dataset: run preprocessing to (re)generate your ML dataset
  • Experiment & Store: run and save experiments
  • Model Management: save and compare models
  • Reproducibility: ensure inference pipelines run without train/serving skew (run simulations)
  • Feature Store: house all input features with the available KPIs at that time
  • Prediction Audit: maintain a log of all predictions
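Most of these boil down to append-only, time-stamped tables in the warehouse. As a purely conceptual illustration (the file and column names below are our own invention, not the repo's), a prediction audit log could be appended to like this:

import pandas as pd

# Hypothetical sketch of an append-only prediction log: every inference run
# appends its predictions together with the run date and model version, so
# any past prediction can be audited and reproduced later.
new_predictions = pd.DataFrame({
    "match_id": [101, 102],
    "run_date": pd.to_datetime(["2016-01-08", "2016-01-08"]),
    "model_version": ["logreg_v1", "logreg_v1"],
    "p_home_win": [0.45, 0.62],
})

try:
    log = pd.read_parquet("prediction_log.parquet")
    log = pd.concat([log, new_predictions], ignore_index=True)
except FileNotFoundError:
    log = new_predictions  # first run: start the log

log.to_parquet("prediction_log.parquet", index=False)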

(back to top)

Getting Started

Prerequisites

This thrilling adventure requires:

  • Python
  • Access to a Databricks cluster (e.g., Azure free account)
  • A firm grasp of dbt for seamless execution of these examples

Installation (Azure)

Buckle up for the setup ride:

  1. Install a virtual environment and the dependencies
    virtualenv venv
    source venv/bin/activate
    pip install -r requirements.txt
  2. Download the European Soccer Database from Kaggle -> you need a Kaggle account. Drop the resulting database.sqlite file in the data folder.
  3. Convert the data to parquet and csv files (a sketch of what this conversion might look like follows this list)
    python scripts/convert_data.py
  4. Databricks
    1. Create a SQL warehouse and note its connection details; you will need them for your dbt profile in the next step.
    2. Create a personal access token; keep it safe and use it to connect dbt to your SQL warehouse.
    3. Upload the data (parquet files) to the warehouse, into the default schema in the hive_metastore catalog.
    4. Create a compute cluster.
    5. Check the cluster id (you can find it in the Spark UI) and set it as an environment variable: COMPUTE_CLUSTER_ID=...
  5. dbt
    1. Initialise and install dependencies:
    cd dbt_your_best_bet
    dbt deps
    2. Set up your dbt profile; it should look something like this:
    databricks:
      outputs:
        dev:
          catalog: hive_metastore
          host: xxx.cloud.databricks.com
          http_path: /sql/1.0/warehouses/$SQL_WAREHOUSE_ID
          schema: football
          threads: 4 # max number of parallel processes
          token: $PERSONAL_ACCESS_TOKEN
          type: databricks
      target: dev
  6. riskrover python package, managed with poetry
    1. Build and install the package in your local environment:
    cd riskrover
    poetry build
    pip install dist/riskrover-x.y.z.tar.gz
    2. Install the resulting riskrover whl file on your Databricks compute cluster.
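As referenced in step 3, here is a minimal sketch of what a SQLite-to-parquet conversion could look like. This is our own illustration; the repo's actual scripts/convert_data.py may well differ:

import sqlite3
from pathlib import Path

import pandas as pd

# Rough sketch of a SQLite -> parquet conversion: read every table from the
# Kaggle database and write it out as one parquet file per table.
con = sqlite3.connect("data/database.sqlite")
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type = 'table'", con)

out_dir = Path("data/parquet")
out_dir.mkdir(parents=True, exist_ok=True)

for name in tables["name"]:
    df = pd.read_sql(f"SELECT * FROM {name}", con)
    df.to_parquet(out_dir / f"{name}.parquet", index=False)

con.close()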

You should now be able to run the preprocessing part of the pipeline, i.e. everything that does not require a trained model:

dbt build --selector gold

(back to top)

Usage

Explore and command the powers of our pipeline.

For these examples to work, you need to move to the root dir of the dbt project, i.e. dbt_your_best_bet.

MWE for a simulation

The default variables are stored in dbt_project.yml. We find ourselves on 2016-01-01 in our simulation, with the option to run until 2016-05-25.

cd dbt_your_best_bet

# Preprocessing
dbt build --selector gold

# Experimentation (by default trains a simple logistic regression with cross-validation, on data up to 2015-07-31)
dbt build --selector ml_experiment

# Inference on test set (2015-08-01 -> 2015-12-31)
dbt build --selector ml_predict_run

# Moving forward in time, for example with weekly runs
dbt build --vars '{"run_date": "2016-01-08"}'
dbt build --vars '{"run_date": "2016-01-15"}'
dbt build --vars '{"run_date": "2016-01-22"}'
...
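Instead of typing each weekly run by hand, you could script the loop. A small sketch of such a driver, assuming dbt is on your PATH (our own convenience wrapper, not something the project ships):

import subprocess
from datetime import date, timedelta

# Sketch of a weekly simulation driver: step run_date forward one week at a
# time and invoke dbt for each run, up to the end of the simulation window.
run_date, end = date(2016, 1, 8), date(2016, 5, 25)
while run_date <= end:
    subprocess.run(
        ["dbt", "build", "--vars", f'{{"run_date": "{run_date.isoformat()}"}}'],
        check=True,
    )
    run_date += timedelta(weeks=1)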

Checking the data catalog

cd dbt_your_best_bet

dbt docs generate
dbt docs serve

No models are documented yet (stay tuned!), but we can already check the lineage graph in the generated docs.

(back to top)

Roadmap

Mostly maintenance; no new features planned unless requested.

  • Documentation
  • Tests
  • Extra SQL analysis models

(back to top)

Contributing

All contributions are welcome!

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

License

Distributed under the MIT License.

(back to top)
