# MLFlow Intro 

https://github.com/danpisq/mlflow-workshop.git <br>
Install mlflow: <h3> pip install mlflow </h3>

## ML development challanges
<ul>
    <li><h4>Lots of develpment tools</h4></li>
    <li><h4>Hard to track and reproduce results</h4></li>
</ul>

## MLFlow 
<ul>
    <li><h4>Opensource machine learning platform</h4></li>
    <li><h4>Works with any ML library</h4></li>
    <li><h4>Runs the same everywhere</h4></li>
    <li><h4>Allows for collaboration</h4></li>
</ul>

<img src="https://databricks.com/wp-content/uploads/2018/06/mlflow.png">

<h3>repo : https://github.com/danpisq/mlflow-workshop.git</h3>

# MLFlow Tracking 

<h3>Key components in tracking</h3>
<ul>
    <li><h4>Parameters - input to code</h4></li>
    <li><h4>Metrics - can change over time</h4></li>
    <li><h4>Artifacts - files, inlcuding models</h4></li>
    <li><h4>Source - what produced the run</h4></li>
</ul>


<h4>Working with remote server (OPTIONAL)</h4>
<p>MlFlow allows us to work with one centralized tracking server</p>

<p>to start server use command:</p>
<code>mlflow server -p 5050</code>
<p>than you'll need to set env variable</p>
<code>export MLFLOW_TRACKING_URI=http://127.0.0.1:5050</code>
<p>By doing so all runs will end in one place</p>

# Ex 1 - MLFlow Tracking
<p>Let's build a simple model and querry it using mlflow.</p>
<p>We are going to use <b>Wine Quality Dataset</b> the goal is to predict quality of a wine based on given features, such as: amout of sugar, alcohol etc. </p>
<p>We are going to use ElasticNet model from scikit-learn library</p>


<h4> Imports </h4>


In [39]:
import mlflow
import mlflow.sklearn

import numpy as np
import pandas as pd

from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

## Data
<p>Lets load csv file using pandas</p>

In [40]:
df = pd.read_csv("wine-quality.csv")
df.describe()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0
mean,6.854788,0.278241,0.334192,6.391415,0.045772,35.308085,138.360657,0.994027,3.188267,0.489847,10.514267,5.877909
std,0.843868,0.100795,0.12102,5.072058,0.021848,17.007137,42.498065,0.002991,0.151001,0.114126,1.230621,0.885639
min,3.8,0.08,0.0,0.6,0.009,2.0,9.0,0.98711,2.72,0.22,8.0,3.0
25%,6.3,0.21,0.27,1.7,0.036,23.0,108.0,0.991723,3.09,0.41,9.5,5.0
50%,6.8,0.26,0.32,5.2,0.043,34.0,134.0,0.99374,3.18,0.47,10.4,6.0
75%,7.3,0.32,0.39,9.9,0.05,46.0,167.0,0.9961,3.28,0.55,11.4,6.0
max,14.2,1.1,1.66,65.8,0.346,289.0,440.0,1.03898,3.82,1.08,14.2,9.0


Split the data for training and validation

In [41]:
train, test = train_test_split(df)

X_train = train.drop(['quality'], axis=1)
X_test = test.drop(['quality'], axis=1)
y_train = train['quality']
y_test = test['quality']

We are going to define simple function that will perform the validation

In [3]:
def eval_metrics(actual, pred):
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2

## Hyperparameters

In [42]:
ALPHA = 0.05
L1_RATIO = 0.5

## Main workflow 

In [22]:
with mlflow.start_run(nested=True):
    
    #build and train the model
    model = ElasticNet(alpha=ALPHA, l1_ratio=L1_RATIO)
    model.fit(X_train, y_train)

    #logging parameters
    mlflow.log_param('alpha', ALPHA)
    mlflow.log_param('l1-ratio', L1_RATIO)

    #eval
    y_predicted = model.predict(X_test)
    (rmse, mae, r2) = eval_metrics(y_test, y_predicted)

    #logging metrics
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mae", mae)

    #logging artifacts
    mlflow.log_artifact('wine-quality.csv')

    mlflow.sklearn.log_model(model, "model")

<p>We can see mlflow created new folder <b>mlruns</b>, all data will be stored there.</p>

In [43]:
!ls 

conda.yaml         [34mmlflow[m[m             [34mmlruns[m[m
introduction.ipynb [34mmlflow2[m[m            wine-quality.csv


<p>Let's start mlflow app using command:</p>
<code>mlflow ui</code>
<p>By default it will run on <a href="http://localhost:5000">http://localhost:5000</a> <p>

<h3>Cleanup</h3>
<p>We will remove <b>mlruns</b> foler before next example</p>

In [44]:
!rm -rf mlruns
!ls

conda.yaml         [34mmlflow[m[m             wine-quality.csv
introduction.ipynb [34mmlflow2[m[m


# EX 2 - MLFlow Projects

<p>Projects gives us high level format for reproducing runs on different platforms</p>

<h3>Setup</h3>
<p>Let's clone repository with next example</p>

In [45]:
!git clone https://github.com/greghop/mlflow.git
!ls mlflow

fatal: destination path 'mlflow' already exists and is not an empty directory.
MLproject        conda.yaml       wine-quality.csv
README.md        train.py


<h3>MLproject</h3>
<p>MLproject is a definition of our project. It is written in YAML< /p>
<p> It contains: </p>
    <ul>
        <li>name</li>
        <li>information about conda/docker environment</li>
        <li>information about entrypoints</li>
    </ul>


In [47]:
!cat mlflow/MLproject

name: tutorial

conda_env: conda.yaml

entry_points:
  main:
    parameters:
      alpha: float
      l1_ratio: {type: float, default: 0.1}
    command: "python train.py {alpha} {l1_ratio}"


<b>conda.yaml </b>
<p>specification of our environment</p>

In [48]:
!cat mlflow/conda.yaml

name: tutorial
channels:
  - defaults
dependencies:
  - cloudpickle=0.6.1
  - python=3.6
  - numpy=1.14.3
  - pandas=0.22.0
  - scikit-learn=0.19.1
  - pip:
    - mlflow


<p>Our project is written in python so we can run it with following command, but to do so we would need to look at the source code to know parameters etc.</p>
<p> Also we need to create environment manually.</p> 


In [31]:
!python mlflow/train.py 0.8 0.4

Elasticnet fantastic 3 model (alpha=0.800000, l1_ratio=0.400000):
  RMSE: 0.8415661070696857
  MAE: 0.6369562819100203
  R2: 0.08526279091557487


<p>This is when MLflow Projects comes to rescue.</p>
<p>It gives us command line API to run the code</p>

<b>Running MLflow project</b> <br><br>
mlflow run URI [OPTIONS] <br>

<ul>
<li>-P NAME=VALUE</li>
<li>-e NAME</li>
</ul>

<p>Let's run the project using MLflow, It is going to create conda environment with all dependencies, and execute te source code</p>


In [32]:
!mlflow run mlflow/ -P alpha=0.6 -P l1_ratio=0.7

2019/04/10 16:08:42 INFO mlflow.projects: === Created directory /var/folders/nh/p1tcdzns0ns99399dh81m56r0000gn/T/tmpzcn4z3kh for downloading remote URIs passed to arguments of type 'path' ===
2019/04/10 16:08:42 INFO mlflow.projects: === Running command 'source activate mlflow-b93852916f9be8ee2359db52b5dfab5589743459 && python train.py 0.6 0.7' in run with ID 'a92fd48677524611b214d18760653efe' === 
  env = yaml.load(_conda_header)
Elasticnet fantastic 3 model (alpha=0.600000, l1_ratio=0.700000):
  RMSE: 0.8591606388045732
  MAE: 0.648352939480482
  R2: 0.04661433685958705
2019/04/10 16:08:43 INFO mlflow.projects: === Run (ID 'a92fd48677524611b214d18760653efe') succeeded ===


<p><b>URI</b> doesn't have to be local path, it can also be github repository, as long as it contains <b>MLproject</b> file in the root </p>

In [33]:
!mlflow run https://github.com/greghop/mlflow.git -P alpha=0.1 -P l1_ratio=0.5

2019/04/10 16:09:34 INFO mlflow.projects: === Fetching project from https://github.com/greghop/mlflow.git into /var/folders/nh/p1tcdzns0ns99399dh81m56r0000gn/T/tmpkn9ge01f ===
2019/04/10 16:09:37 INFO mlflow.projects: === Created directory /var/folders/nh/p1tcdzns0ns99399dh81m56r0000gn/T/tmph4dkg9hh for downloading remote URIs passed to arguments of type 'path' ===
2019/04/10 16:09:37 INFO mlflow.projects: === Running command 'source activate mlflow-b93852916f9be8ee2359db52b5dfab5589743459 && python train.py 0.1 0.5' in run with ID 'bb4b30aac8c5464dbf1583f56814742b' === 
  env = yaml.load(_conda_header)
Elasticnet fantastic 3 model (alpha=0.100000, l1_ratio=0.500000):
  RMSE: 0.7845017946547458
  MAE: 0.6150949836730213
  R2: 0.2051086790093516
2019/04/10 16:09:39 INFO mlflow.projects: === Run (ID 'bb4b30aac8c5464dbf1583f56814742b') succeeded ===


http://localhost:5000

Clean

In [50]:
rm -rf mlruns


# Ex 3 -  Multistep workflow
<p>This example will show project with two parts</p>
<ol>
    <li>Simple preprocessing</li>
    <li>Training</li>
</ol>

<h4>To get the source run following command:</h4>

In [51]:
!git clone https://github.com/greghop/mlflow2.git
!ls mlflow2

fatal: destination path 'mlflow2' already exists and is not an empty directory.
MLproject        conda.yaml       main.py          wine-quality.csv
README.md        etl.py           train.py


<h4>MLproject</h4>

In [52]:
!cat mlflow2/MLproject

name: multistep

conda_env: conda.yaml

entry_points:
  etl:
    parameters:
      scaler: {type: int, default: 1}
    command: "python etl.py --scaler {scaler}"
 
  train:
    parameters:
      run-id: string
      alpha: {type: float, default: 0.1}
      l1-ratio: {type: float, default: 0.1}
    command: "python train.py --run-id {run-id} --alpha {alpha} --l1-ratio {l1-ratio}"

  main: 
    parameters:
      alpha: {type: float, default: 0.1}
      l1-ratio: {type: float, default: 0.1}
    command: "python main.py  --alpha {alpha} --l1-ratio {l1-ratio}"

<p>We can see it is more complicated. Our projects contains three endpoints</p>
<ul>
    <li><h5>etl</h5> - reads csv data-file, normalizes it and logs it as MLflow artifact </li>
    <li><h5>train</h5> - trains ElasticNet on data from specified run (run-id) with given hyperparams  </li>
    <li><h5>main</h5> - runs both <b>etl</b> and <b>train</b> in semi inteligent way. Firstly it checks if there was a run in the past with specified parameters and git version. If yes it uses artifact from that run, and feeds it to the next step. If not it runs the step and than takes the artifacts.</li>
    </ul>

Let's run the whole project with default params

In [38]:
!mlflow run mlflow2/ 

2019/04/10 16:14:24 INFO mlflow.projects: === Created directory /var/folders/nh/p1tcdzns0ns99399dh81m56r0000gn/T/tmpnrr6w437 for downloading remote URIs passed to arguments of type 'path' ===
2019/04/10 16:14:24 INFO mlflow.projects: === Running command 'source activate mlflow-df90610eb3183421bcbf1eef16dc332bc5193c11 && python main.py  --alpha 0.1 --l1-ratio 0.1' in run with ID 'd149e9881be049f0902c624871cf1f19' === 
Launching new run for entrypoint=etl and parameters={'scaler': 1}
2019/04/10 16:14:27 INFO mlflow.projects: === Created directory /var/folders/nh/p1tcdzns0ns99399dh81m56r0000gn/T/tmp8jjn4w8r for downloading remote URIs passed to arguments of type 'path' ===
2019/04/10 16:14:27 INFO mlflow.projects: === Running command 'source activate mlflow-df90610eb3183421bcbf1eef16dc332bc5193c11 && python etl.py --scaler 1' in run with ID 'bf42743f0f984a938530b9069779b1c8' === 
2019/04/10 16:14:28 INFO mlflow.projects: === Run (ID 'bf42743f0f984a938530b9069779b1c8') succeeded ===
Launch

Now we can run it again with different hyperparameters for training. It will take data from previous <b>etl</b>, because parameters for this endpoint have not changed

In [55]:
!mlflow run mlflow2/ -P alpha=0.5 -P l1-ratio=0.2

2019/04/10 17:37:49 INFO mlflow.projects: === Created directory /var/folders/nh/p1tcdzns0ns99399dh81m56r0000gn/T/tmpdtm1_jt7 for downloading remote URIs passed to arguments of type 'path' ===
2019/04/10 17:37:49 INFO mlflow.projects: === Running command 'source activate mlflow-df90610eb3183421bcbf1eef16dc332bc5193c11 && python main.py  --alpha 0.5 --l1-ratio 0.2' in run with ID '7f1b1ae3ed5742338ea39ae45449bf21' === 
Launching new run for entrypoint=etl and parameters={'scaler': 1}
2019/04/10 17:37:52 INFO mlflow.projects: === Created directory /var/folders/nh/p1tcdzns0ns99399dh81m56r0000gn/T/tmphk_q8rk_ for downloading remote URIs passed to arguments of type 'path' ===
2019/04/10 17:37:52 INFO mlflow.projects: === Running command 'source activate mlflow-df90610eb3183421bcbf1eef16dc332bc5193c11 && python etl.py --scaler 1' in run with ID '70002da578a4446a8fa4be801961e0e9' === 
2019/04/10 17:37:53 INFO mlflow.projects: === Run (ID '70002da578a4446a8fa4be801961e0e9') succeeded ===
Launch

We can also specify endpoint we want to run:

In [49]:
!mlflow run -e etl -P scaler='standard'

/Users/dpiskors/mlflow-workshop
