# Continuous Machine Learning

As a data science team, we want to collaborate and integrate our work as often as possible. To ensure that our work does not only run locally (*"but it works on my machine"*) and is compatible with the others' work, we want to run our data and training pipeline as part of our **continuous integration** (CI) pipeline.

In this showcase, we demonstrate how to set up a simple CI pipeline that runs via **GitHub Actions** and trains the model on every push. As a bonus, we will use [cml.dev](https://cml.dev) to post training metrics into a pull request.

### Disclaimer
This showcase is based on the official [cml.dev](https://cml.dev) introductory example :)

## Prerequisites

To follow this showcase, you will need:
- basic familiarity with Git
- a GitHub account

# Set up the repository

1. Create a new GitHub repository via the GitHub UI (top-right corner, "+" symbol, "New repository").

In [None]:
# initialize git repo locally
!git init

In [None]:
# set our author profile
!git config user.email "you@example.com"
!git config user.name "Your Name"

In [None]:
# add the GitHub repo as a remote to the local repository
# (uncomment and replace YOUR_GH_NAME and YOUR_GH_REPO with your own values)
#!git remote add origin https://github.com/YOUR_GH_NAME/YOUR_GH_REPO.git

## Add the code to the repo
We have already prepared a simple data science project. It consists of `get_data.py`, which imports some tabular data, and `train.py`, which trains a `RandomForestClassifier`. The `requirements.txt` file lists the Python packages needed to run our code. In addition, we prepared a GitHub Actions workflow definition at `.github/workflows/pipeline.yaml`. Let's have a look at these files.
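To inspect the files from within the notebook, a small helper along the following lines could be used. This is only a minimal sketch: `preview_files` is a hypothetical helper that is not part of the project, and it assumes nothing beyond the four file names mentioned above.

```python
from pathlib import Path

# The four files named above, relative to the repository root.
PROJECT_FILES = [
    "get_data.py",
    "train.py",
    "requirements.txt",
    ".github/workflows/pipeline.yaml",
]


def preview_files(paths, root=".", max_lines=20):
    """Return a text preview with the first `max_lines` lines of each file.

    Hypothetical helper for illustration; missing files are reported
    in the output instead of raising an error.
    """
    sections = []
    for name in paths:
        path = Path(root) / name
        if not path.is_file():
            sections.append(f"--- {name} (not found) ---")
            continue
        lines = path.read_text(encoding="utf-8").splitlines()
        shown = lines[:max_lines]
        if len(lines) > max_lines:
            # Indicate how much of the file was cut off.
            shown.append(f"... ({len(lines) - max_lines} more lines)")
        sections.append("\n".join([f"--- {name} ---", *shown]))
    return "\n\n".join(sections)


if __name__ == "__main__":
    print(preview_files(PROJECT_FILES))
```

In a notebook cell, calling `print(preview_files(PROJECT_FILES))` from the repository root shows the beginning of each file before it is committed.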

In [None]:
!git add get_data.py train.py requirements.txt .github/workflows/pipeline.yaml

In [None]:
!git commit -m "Add core files"

In [None]:
# we have to open a terminal for this, since it asks for input
# !git push -u origin master

Once the commit is pushed, the pipeline starts automatically. You can watch it run in the Actions tab of your GitHub repo.

## Let's change something
Now that we have established the CI pipeline, we want to change a hyperparameter and see the effect. We will create a new branch, push our changes to it, and then open a pull request.

First, let's change the hyperparameter `depth` in our training file.
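The edit can be made by hand in any editor. As a minimal sketch, it could also be scripted, assuming that `train.py` contains an assignment of the form `depth = <integer>`. That layout is an assumption about the file, and `set_depth` and `update_training_file` are hypothetical helpers. The value 5 matches the commit message used in the cells that follow.

```python
import re
from pathlib import Path

# Matches a line such as "depth = 2" (optionally indented).
# Assumption: train.py defines the hyperparameter this way.
DEPTH_PATTERN = re.compile(r"^([ \t]*depth[ \t]*=[ \t]*)\d+", flags=re.MULTILINE)


def set_depth(source, new_depth):
    """Return `source` with the first `depth = <int>` assignment set to `new_depth`."""
    updated, count = DEPTH_PATTERN.subn(
        lambda match: f"{match.group(1)}{new_depth}", source, count=1
    )
    if count == 0:
        raise ValueError("no 'depth = <int>' assignment found")
    return updated


def update_training_file(path="train.py", new_depth=5):
    """Rewrite the training script in place with the new depth."""
    file = Path(path)
    original = file.read_text(encoding="utf-8")
    file.write_text(set_depth(original, new_depth), encoding="utf-8")
```

Calling `update_training_file()` from the repository root would then leave `train.py` modified and ready to be staged. The helper raises a `ValueError` rather than silently doing nothing if the assumed assignment is not present.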

In [None]:
!git checkout -b changed-depth

In [None]:
!git add train.py

In [None]:
!git commit -m "Try depth of 5"

In [None]:
# we have to open a terminal for this, since it asks for input
# !git push -u origin changed-depth

Open a pull request on GitHub and wait for the pipeline to finish. The training metrics are then posted into the pull request.

## Remarks

Our showcase trained a model using a very basic Python script. Of course, you can also combine this idea with more advanced pipelines and frameworks, such as DVC or MLflow.

As your pipeline becomes more complicated and resource-intensive (e.g. because your training requires GPUs), it may be worth running it remotely in an orchestrator such as Airflow or Kubeflow Pipelines rather than directly on the GitHub runner.