This is an example multi-class classification machine learning project showcasing tools and best practices in the following areas:
- data science: scikit-learn
- machine learning operations (MLOps): MLflow, Prefect
- software development: Sphinx
In the end, this repository will contain and showcase the following aspects of an end-to-end machine learning project:
- there will be a pipeline to
  - generate data: cluster IDs and Cartesian coordinates
    - 2D
    - 3D
  - transform the data
    - 2D Cartesian -> polar coordinates
    - 3D Cartesian -> spherical coordinates
  - train an ML model: classify the coordinates to the cluster IDs
  - evaluate the model
    - performance metrics
    - global feature importance
      - permutation importance
      - mean Shapley values
    - local feature importance
      - Shapley values
- advanced features
  - add derived features and run automatic feature selection
    - scikit-learn
    - tsfresh
  - hyperparameter optimisation by means of cross-validation
  - probability calibration: scikit-learn `CalibratedClassifierCV` and `calibration_curve`
  - multiple (calibrated) classifiers combined with a scikit-learn `VotingClassifier`
    - configurable alternative: best-model selection from a list of specified algorithms
  - try out counterfactuals: `mlxtend.evaluate.create_counterfactual`
  - probabilistic / conformal predictions
- the pipeline will be implemented with Prefect
  - use caching of intermediate pipeline results
  - add Prefect/Jupyter integration
  - try out parallelisation
    - need to use `.submit()` and `.result()` as per doc: Tutorials/Configuration and doc: Tutorials/Execution
    - distributed calculation: doc: Guides/Dask & Ray
- experiments will be tracked with MLflow
  - using the SQLite backend store option rather than the local filesystem, in order to support model serving
- best practices
  - use test-driven development
  - add Black and other linting and code-formatting tools
  - automatically check test coverage
  - automatic source code documentation built using Sphinx
  - development environments and installation requirements should be handled in a clean and consistent way
  - the code will be built into a package using Poetry
  - everything should run locally, but also in a Docker container
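As a rough end-to-end sketch of the pipeline idea above (generate clustered 2D data, derive polar coordinates, train a classifier, evaluate with a metric and permutation importance), something like the following could work. The model and parameter choices here are illustrative, not the repository's actual configuration:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Generate data: cluster IDs and 2D Cartesian coordinates
X, y = make_blobs(n_samples=500, centers=4, n_features=2, random_state=0)

# Transform: 2D Cartesian -> polar coordinates (radius, angle)
X_polar = np.column_stack([np.hypot(X[:, 0], X[:, 1]),
                           np.arctan2(X[:, 1], X[:, 0])])

X_train, X_test, y_train, y_test = train_test_split(
    X_polar, y, random_state=0)

# Train an ML model: classify the coordinates to the cluster IDs
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Evaluate: performance metric and global permutation importance
accuracy = clf.score(X_test, y_test)
importances = permutation_importance(clf, X_test, y_test,
                                     random_state=0).importances_mean
```

Mean Shapley values and local explanations would then be layered on top of `clf` in a separate evaluation step.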
- Prefect
  - How to run Prefect-integrated code without Prefect (e.g. in an environment where this is not supported)?
  - Does Prefect have a concept of configuration files to pass parameters to the pipeline or to override default parameters of individual tasks?
  - How best to generate visualisations and dataframe printouts during intermediate steps of the pipeline and transport them outside?
    - I'm not sure all of this diagnostic information should be logged to MLflow
    - Perhaps that's what the MLflow artifact mechanism is for
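One possible answer to the first question (running Prefect-integrated code without Prefect): guard the Prefect imports and fall back to no-op decorators, so the same module stays importable and runnable in environments where Prefect is unavailable. This is a sketch of the pattern with illustrative task names, not something the repository currently does:

```python
# Fall back to no-op decorators when Prefect is unavailable, so the
# pipeline code itself runs with or without the Prefect engine.
try:
    from prefect import flow, task
except ImportError:
    def _passthrough(fn=None, **kwargs):
        # Support both bare usage (@task) and parametrised usage (@task(...))
        if fn is None:
            return lambda f: f
        return fn
    flow = task = _passthrough

@task
def to_polar(x, y):
    import math
    return math.hypot(x, y), math.atan2(y, x)

@flow
def mini_pipeline():
    return to_polar(3.0, 4.0)
```

With Prefect installed the decorators behave normally; without it, `mini_pipeline()` is just a plain function call.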
- Adding 3D coordinates gives an opportunity to use t-SNE for creating a 2D visualisation
- New features can be defined in preprocessing steps in the pipeline using Pandas, or as part of a feature engineering step in the model itself using scikit-learn. My personal thoughts on this are the following:
  - If all of the feature engineering can be done in scikit-learn, then this is preferable, because simply exporting the model (as a scikit-learn pipeline) will include the additional features.
  - If there are elements that have to be implemented outside of the scikit-learn model pipeline, then the outer pipeline (i.e. the part that is implemented in Prefect in this demo) has to be deployed anyway, and it is preferable to make that pipeline as clean and consistent as possible - which may mean limiting the amount of feature engineering done with scikit-learn.
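To illustrate the first option - keeping feature engineering inside the scikit-learn model so that it ships with the exported pipeline - derived polar features can be added with a `FunctionTransformer`. A minimal sketch, not the repository's actual model:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def add_polar_features(X):
    # Append radius and angle as derived features to the 2D Cartesian input
    r = np.hypot(X[:, 0], X[:, 1])
    phi = np.arctan2(X[:, 1], X[:, 0])
    return np.column_stack([X, r, phi])

model = Pipeline([
    ("polar", FunctionTransformer(add_polar_features)),
    ("clf", LogisticRegression(max_iter=1000)),
])

X, y = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)
model.fit(X, y)
# Exporting `model` now carries the feature engineering step along with it.
```

Anything that cannot be expressed this way (e.g. Pandas-based preprocessing) would instead live in the outer Prefect pipeline.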
- In this package, the runtime/deployment dependencies are listed in the `requirements.txt` file, whereas additional development dependencies are collected in the `playground-prefect.yml` file.
- The `requirements.txt` file, however, is included in the `playground-prefect.yml` file.
- Therefore, to create the development environment, it is sufficient to run
$ conda env create -f playground-prefect.yml
or, alternatively, using the faster mamba package manager
$ mamba env create -f playground-prefect.yml
which will install the packages listed in `requirements.txt` into the same environment as well.
- The development environment can then be activated with
$ conda activate playground-prefect
- The advantage of this structure is that a `requirements.txt` file is provided, which can be used for packaging, while at the same time avoiding having to maintain two partially overlapping dependency lists.
- Use `jupytext` to convert the Python scripts in the root folder (such as `1-main.py`) to Jupyter notebooks:
$ jupytext --set-formats ipynb,py:percent 1-main.py
- After modifying a notebook, sync the `.ipynb` and the `.py` files with:
$ jupytext --sync 1-main.ipynb
- Spin up a Prefect server:
$ prefect server start
This will by default start the web UI at http://127.0.0.1:4200
- Specify or change a setting:
$ prefect config set PREFECT_TASKS_REFRESH_CACHE='True'
- Reset to the default value:
$ prefect config unset PREFECT_TASKS_REFRESH_CACHE
- View the currently active settings:
$ prefect config view
- List all available profiles:
$ prefect profile ls
- View the settings associated with the currently active profile:
$ prefect profile inspect