Data Pipelines and Workflow Orchestration with Prefect

This is an example multi-class classification machine learning project to showcase tools and best practices in the areas of

  • data science: scikit-learn
  • machine learning operations (MLOps): MLflow, Prefect
  • software development: Sphinx

In the end, this repository will contain and showcase the following aspects of an end-to-end machine learning project:

  • there will be a pipeline to
    • generate data: cluster IDs and Cartesian coordinates
      • 2D
      • 3D
    • transform the data
      • 2D Cartesian -> polar coordinates
      • 3D Cartesian -> spherical coordinates
    • train an ML model: classify the coordinates to the cluster IDs
    • evaluate the model
      • performance metrics
      • global feature importance
        • permutation importance
        • mean Shapley values
      • local feature importance
        • Shapley values
    • advanced features
      • add derived features and run automatic feature selection
        • scikit-learn
        • tsfresh
      • hyperparameter optimisation by means of cross validation
      • probability calibration: scikit-learn CalibratedClassifierCV and calibration_curve
      • multiple (calibrated) classifiers combined with a scikit-learn VotingClassifier
      • configurable alternative: best-model selection from list of specified algorithms
      • try out counterfactuals: mlxtend.evaluate.create_counterfactual
      • probabilistic / conformal predictions
  • the pipeline will be implemented with Prefect
  • experiments will be tracked with MLflow
    • using the SQLite backend store rather than the local filesystem, in order to support model serving
  • best practices
    • use test-driven development
    • add Black and other linting and code formatting tools
    • automatically check test coverage
    • source code documentation built automatically with Sphinx
    • development environments and installation requirements should be handled in a clean and consistent way
    • the code will be built into a package using Poetry
    • everything should run locally, but also in a Docker container
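The coordinate transformations listed above can be sketched in a few lines of pure Python (a minimal illustration using the standard math module; the actual pipeline may well operate on NumPy arrays or Pandas dataframes instead):

```python
import math


def cartesian_to_polar(x: float, y: float) -> tuple[float, float]:
    """2D Cartesian (x, y) -> polar (r, phi)."""
    r = math.hypot(x, y)
    phi = math.atan2(y, x)  # azimuthal angle in (-pi, pi]
    return r, phi


def cartesian_to_spherical(x: float, y: float, z: float) -> tuple[float, float, float]:
    """3D Cartesian (x, y, z) -> spherical (r, theta, phi)."""
    r = math.sqrt(x**2 + y**2 + z**2)
    theta = math.acos(z / r) if r > 0 else 0.0  # polar angle from the z-axis
    phi = math.atan2(y, x)                      # azimuthal angle
    return r, theta, phi
```

Since the classifier is trained on the transformed coordinates, keeping these transformations deterministic and side-effect free makes them easy to wrap as individual pipeline tasks.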

Resources

Questions

  • How to run Prefect-integrated code without Prefect (e.g. in an environment where this is not supported)?
  • Does Prefect have a concept of configuration files to pass parameters to the pipeline or to override default parameters of individual tasks?
  • How best to generate visualisations and dataframe printouts during intermediate steps of the pipeline and transport them outside?
    • I'm not sure all of this diagnostic information should be logged to MLflow
    • Perhaps that's what the artifact mechanism is for

Thoughts and Notes

  • Adding 3D coordinates gives an opportunity to use t-SNE for creating a 2D visualisation
  • New features can be defined in preprocessing steps in the pipeline using Pandas or as part of a feature engineering step in the model itself using scikit-learn. My personal thoughts on this are the following:
    • If all of the feature engineering can be done in scikit-learn, then this is preferable because simply exporting the model (as a scikit-learn pipeline) will include the additional features
    • If there are elements that have to be implemented outside of the scikit-learn model pipeline, then the outer pipeline (i.e. the part that is implemented in Prefect in this demo) has to be deployed anyway. In that case it is preferable to make the pipeline as clean and consistent as possible - which may mean limiting the amount of feature engineering done with scikit-learn.
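The first option - feature engineering inside the scikit-learn model itself - can be done with a FunctionTransformer, so that exporting the fitted pipeline carries the derived features along. A minimal sketch with made-up toy data (the function and variable names are illustrative, not the project's actual code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def add_radius(X):
    # append r = sqrt(x^2 + y^2) as a derived feature column
    r = np.sqrt(X[:, 0] ** 2 + X[:, 1] ** 2)
    return np.column_stack([X, r])


model = Pipeline([
    ("features", FunctionTransformer(add_radius)),
    ("clf", LogisticRegression()),
])

# toy data: two clusters of points at different radii
rng = np.random.default_rng(42)
angles = rng.uniform(0, 2 * np.pi, size=200)
radii = np.where(rng.random(200) < 0.5, 1.0, 3.0)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = (radii > 2.0).astype(int)

model.fit(X, y)  # exporting `model` now includes the feature engineering
```

The design advantage is that serving the exported pipeline requires no separate preprocessing code, which is exactly the trade-off discussed above.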

Getting Started

Prepare the Development Environment

  • In this package, the runtime/deployment dependencies are listed in the requirements.txt file, whereas additional development dependencies are collected in the playground-prefect.yml file.
  • The requirements.txt file, however, is included in the playground-prefect.yml file
  • Therefore, to create the development environment, it is sufficient to run
    $ conda env create -f playground-prefect.yml
    
    or, alternatively, using the faster mamba package manager
    $ mamba env create -f playground-prefect.yml
    
    which will also install the packages listed in requirements.txt into the same environment.
  • The development environment can then be activated with
    $ conda activate playground-prefect
    
  • The advantage of this structure is that a requirements.txt file is provided for packaging, while avoiding the need to maintain two partially overlapping dependency lists.
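The inclusion mechanism described above corresponds to a conda environment file that pulls in requirements.txt via its pip subsection. The exact contents of playground-prefect.yml are an assumption here; only the `-r requirements.txt` inclusion is the point:

```yaml
# playground-prefect.yml (sketch - the actual dependency lists may differ)
name: playground-prefect
channels:
  - conda-forge
dependencies:
  - python=3.10          # assumed version
  - jupytext             # development-only tooling
  - pip
  - pip:
      - -r requirements.txt   # runtime/deployment dependencies
```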

Use Jupytext to Pair Python Scripts With Equivalent Jupyter Notebooks

  • Use jupytext to convert the Python scripts in the root folder (such as 1-main.py) to Jupyter notebooks:

    $ jupytext --set-formats ipynb,py:percent 1-main.py
    
  • After modifying a notebook, sync the .ipynb and the .py files with

    $ jupytext --sync 1-main.ipynb
    

Start Up Prefect

  • Spin up a Prefect server:

    $ prefect server start
    

    This will by default start the web UI at http://127.0.0.1:4200

Prefect Cheat Sheet

Settings

  • Specify or change a setting:
    $ prefect config set PREFECT_TASKS_REFRESH_CACHE='True'
    
  • Reset to the default value:
    $ prefect config unset PREFECT_TASKS_REFRESH_CACHE
    
  • View the currently active settings:
    $ prefect config view
    

Profiles

  • List all available profiles:
    $ prefect profile ls
    
  • View the settings associated with the currently active profile:
    $ prefect profile inspect
    
