# Install stuff

First, we need to install the required packages and software. This includes:
- python
- pip
- virtual env
- dvc
- mlflow
- docker
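
Before continuing, it can help to check which of these tools are already available. The following is a minimal sketch using only the Python standard library. The command names are assumptions (for example `python3` may be `python` on your system), and `git` is added to the list because it is used further down. Since `dvc` is installed in a later step, it will likely be reported as missing at this point.

```python
import shutil

# Command names are assumptions; adjust them to your platform.
REQUIRED_TOOLS = ["python3", "pip", "git", "dvc", "mlflow", "docker"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found.")
```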

# Get started

We create a new directory; I will call mine MLOps.

Let's first create a new virtual environment in the terminal:

<code>python3 -m venv mlflow-venv
source mlflow-venv/bin/activate</code>

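Next, we install dvc, together with the Python packages that the training script imports later on:

<code>pip install dvc mlflow scikit-learn numpy</code>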

## Init git repo
First, we need to initialize a Git repo:

<code>git init</code>

## Init DVC repo

Once DVC is installed, we initialize a new DVC repo and configure a local directory as the default remote storage:

<code>
dvc init
dvc remote add -d dvc-remote /tmp/dvc-storage
git add .
git commit -m "configure remote storage"
</code>

Next, we copy our dataset CSV into the directory and add it to DVC:

<code>dvc add BostonHousing.csv</code>

Add the generated .dvc file to the Git repo, tag this version of the data and push the data to the DVC remote:

<code>
git add .
git commit -m "Add Boston Housing data to DVC"
git tag -a "v1" -m "original data"
dvc push
</code>
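
To get a feeling for what `dvc add` stores: Git only tracks a small pointer file (`BostonHousing.csv.dvc`) that records a content hash of the data, MD5 by default. The sketch below computes such a hash with the standard library. It illustrates the idea and is not DVC's actual implementation, so the value may not match the one in the .dvc file for every DVC version.

```python
import hashlib
import os


def file_md5(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b""
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    if os.path.exists("BostonHousing.csv"):
        print(file_md5("BostonHousing.csv"))
```

If the file content changes, the hash changes as well, which is how a new data version can be told apart from an old one.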

# Creating a Docker image for the MLflow server

Create a new file within the directory and name it "Dockerfile".

Open the file and add the following:

<code>
# Dockerfile

FROM continuumio/miniconda3

RUN conda install -c conda-forge mlflow
EXPOSE 5000

CMD ["mlflow", "server", "--host", "0.0.0.0", "--port", "5000"]
</code>

Build the Docker image and run it, preferably from another terminal:

<code>sudo docker build -t mlflow-server .</code>

<code>sudo docker run -p 5000:5000 mlflow-server</code>

Next, we can access the mlflow UI at http://localhost:5000

![alt text](mlflow_ui_fresh.png "mlflow UI")
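
If the page does not load, a quick check from Python can tell whether the server answers at all. This is a small sketch with the standard library. The function name `server_is_up` is made up for this example, and the URL assumes the port mapping from the `docker run` command above.

```python
import urllib.error
import urllib.request


def server_is_up(url="http://localhost:5000", timeout=3.0):
    """Return True if an HTTP GET on `url` succeeds, otherwise False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure or an HTTP error status
        return False


if __name__ == "__main__":
    print("MLflow server reachable:", server_is_up())
```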

# Run some experiments

So far, everything is set up and we can start with some model training. Create a new python file <code>train.py</code> and add the following to it:

<code>
import dvc.api
import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
from numpy import genfromtxt


# Set the tracking URI and the experiment where the runs should be logged to
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("Boston Housing Regression")

# Define dataset using DVC; the version is the Git tag created above
version = "v1"

# Adjust the repo path to the location of your own project directory
data_url = dvc.api.get_url(
    path="BostonHousing.csv",
    repo="/home/torsten/projects/MLOps",
    rev=version
)

# Load dataset into numpy array
data = genfromtxt(data_url, delimiter=',')
# The first row is the CSV header; genfromtxt parses the column names as NaN
labels = data[:1, :]
# All remaining rows: features are all columns but the last, target is the last column
X = data[1:, :-1]
y = data[1:, -1]

# Split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Iterate over a set of parameters to simulate different experiments
for nodes in range(10, 31, 10):
    # Define parameters to be logged with mlflow
    params = {
        "layers": 1,
        "number of nodes": nodes,
        "model": "MLPRegressor"
    }

    # Train a Multi Layer Perceptron with the defined number of nodes
    regr = MLPRegressor(max_iter=500, hidden_layer_sizes=(nodes,), random_state=42)
    regr.fit(X_train, y_train)

    # Predict on test set
    y_pred = regr.predict(X_test)

    # Calculate MSE metric
    mse = mean_squared_error(y_test, y_pred)

    # Log to mlflow
    with mlflow.start_run():
        # Set name of the current run
        mlflow.set_tag("mlflow.runName", "boston_MLP_" + str(nodes))
        # Log model
        mlflow.sklearn.log_model(regr, "model_MLP_" + str(nodes))
        # Log the MSE metric
        mlflow.log_metric("MSE", mse)
        # Log hyperparameters
        mlflow.log_params(params)
        # Log data location and version as parameters
        mlflow.log_param('data_url', data_url)
        mlflow.log_param('data_version', version)

print("All experiments done, go grab a coffee...")
</code>

Add and commit everything to git:

<code>
git add .
git commit -m "code file added"
</code>

## Run the Python script

<code>
python3 train.py
</code>

We can see the results in the MLflow web UI:

![alt text](mlflow_ui_runs.png "mlflow UI")

Note that we logged the hyperparameters, the data version, the metrics and even the complete model. The model can be restored from MLflow, and the dataset in the version that was used can be restored from DVC.
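
Restoring both could look roughly like the sketch below. The helper names (`model_uri`, `load_model`, `read_dataset`) are made up for this example, and the run ID has to be copied from the MLflow UI. It assumes the artifact naming from `train.py` and that the `runs:/<run_id>/<artifact_path>` URI scheme is available in your MLflow version. `mlflow` and `dvc` are imported inside the functions so the URI helper works without them.

```python
def model_uri(run_id, nodes):
    """Build the URI of a model logged by train.py.

    The artifact name "model_MLP_<nodes>" matches the log_model call there.
    """
    return f"runs:/{run_id}/model_MLP_{nodes}"


def load_model(run_id, nodes, tracking_uri="http://localhost:5000"):
    """Load a logged scikit-learn model back from the tracking server."""
    import mlflow
    import mlflow.sklearn

    mlflow.set_tracking_uri(tracking_uri)
    return mlflow.sklearn.load_model(model_uri(run_id, nodes))


def read_dataset(version, repo=".", path="BostonHousing.csv"):
    """Return the CSV text of the dataset as it was at the given Git tag."""
    import dvc.api

    return dvc.api.read(path, repo=repo, rev=version)
```

For example, `load_model("<run id from the UI>", 10)` would return the regressor with 10 nodes, and `read_dataset("v1")` the original data.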

# Versioning the dataset

If we make changes to the dataset, like preprocessing or removing datapoints, we need to create a new version of it. Let's assume we removed some outlier datapoints:
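
How the outliers get removed is up to you. As one possible sketch, the function below drops rows whose target value lies more than a given number of standard deviations from the mean. The z-score rule, the threshold of 3 and the function name are assumptions for illustration, not the method that was used for the data in this walkthrough.

```python
import statistics


def remove_outliers(rows, column=-1, max_z=3.0):
    """Drop rows whose value in `column` is more than `max_z`
    standard deviations away from the mean of that column."""
    values = [float(row[column]) for row in rows]
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        # All values are identical, so there is nothing to remove
        return list(rows)
    return [
        row for row, value in zip(rows, values)
        if abs(value - mean) / stdev <= max_z
    ]
```

The rows would be read from and written back to `BostonHousing.csv` (without the header line), after which the commands below create the new version.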

<code>
dvc add BostonHousing.csv
dvc push
git add .
git commit -m "data: removed 20 outliers"
git tag -a "v2" -m "removed 20 outliers"
</code>