## MLflow Diabetes Example (with MLflow Registry)
This is a Quick Start notebook based on [MLflow's tutorial](https://mlflow.org/docs/latest/tutorial.html).  In this tutorial, we’ll:
* Install the MLflow library on a Databricks cluster
* Connect our notebook to an MLflow Tracking Server that is hosted by Databricks
* Log metrics, parameters, models and a .png plot to show how you can record arbitrary outputs from your MLflow job
* View our results on the MLflow tracking UI.

This notebook uses the `diabetes` dataset in scikit-learn and predicts the progression metric (a quantitative measure of disease progression one year after baseline) based on BMI, blood pressure, and other features. It uses the scikit-learn ElasticNet linear regression model, varying the `alpha` and `l1_ratio` parameters for tuning. For more information on ElasticNet, refer to:
  * [Elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization)
  * [Regularization and Variable Selection via the Elastic Net](https://web.stanford.edu/~hastie/TALKS/enet_talk.pdf)
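
For reference, the objective that scikit-learn's `ElasticNet` minimizes, as given in its documentation, can be written in terms of the two tuning parameters. Here $w$ is the coefficient vector and $n$ is the number of training samples; writing $\rho$ for `l1_ratio` is notation introduced for this sketch.

$$
\min_{w} \; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 \;+\; \alpha\,\rho\,\lVert w \rVert_1 \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
$$

With $\rho = 1$ the penalty is a pure L1 (Lasso) penalty, and with $\rho = 0$ it is a pure L2 (ridge) penalty; `alpha` scales the overall strength of both terms.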

A good reference for MLflow in general is [Matei's Spark Summit 2018 Keynote](https://databricks.com/sparkaisummit/north-america/spark-summit-2018-keynotes).

To get started, you will first need to:

1. Be part of the Databricks Hosted MLflow early adopter program and have the MLflow tracking server enabled on your shard.
2. Install the most recent version of MLflow and Python ML and math libraries on your Databricks cluster (see details in the next cell).

#### Write Your ML Code Based on the `train_diabetes.py` Code
This tutorial is based on MLflow's [train_diabetes.py](https://github.com/databricks/mlflow/blob/master/example/tutorial/train_diabetes.py), which uses scikit-learn's built-in diabetes dataset (`sklearn.datasets.load_diabetes`) to predict disease progression based on various factors.
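
As a side note, the same features-plus-label table can be obtained in one step. The sketch below assumes a scikit-learn version that supports the `as_frame` option of `load_diabetes`; renaming the `target` column to `progression` is done only to match the column name used in this notebook.

```python
from sklearn import datasets

# as_frame=True returns pandas objects; `.frame` holds the features plus a "target" column
diabetes = datasets.load_diabetes(as_frame=True)

# Rename the label column to match the name used in this notebook
data = diabetes.frame.rename(columns={"target": "progression"})

print(data.shape)
print(list(data.columns))
```

The cell below builds an equivalent DataFrame by hand with NumPy, which also works on older scikit-learn versions.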

In [3]:
# Import various libraries including matplotlib, sklearn, mlflow
import os
import warnings
import sys

import pandas as pd
import numpy as np
from itertools import cycle
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import lasso_path, enet_path
from sklearn import datasets

# Import mlflow
import mlflow
import mlflow.sklearn

# Configure MLflow Tracking
mlflow.set_tracking_uri("databricks")
databricks_host = 'https://demo.cloud.databricks.com'
databricks_token = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()
os.environ['DATABRICKS_HOST'] = databricks_host
os.environ['DATABRICKS_TOKEN'] = databricks_token

# Load Diabetes datasets
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

# Create pandas DataFrame for sklearn ElasticNet linear_model
Y = np.array([y]).transpose()
d = np.concatenate((X, Y), axis=1)
cols = ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'progression']
data = pd.DataFrame(d, columns=cols)

#### Plot the ElasticNet Descent Path
As an example of recording arbitrary output files in MLflow, we'll plot the [ElasticNet Descent Path](http://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_coordinate_descent_path.html) for the ElasticNet model by *alpha* for the specified *l1_ratio*.

The `plot_enet_descent_path` function below:
* Returns a figure that can be displayed in our Databricks notebook via `display`
* Saves the figure as `ElasticNet-paths.png` on the Databricks cluster's driver node

The saved file is then uploaded to MLflow by the `log_artifact` call within `train_diabetes`.
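
To make the plotted quantities concrete, here is a minimal sketch of what `enet_path` returns, without any plotting. The `l1_ratio=0.5` value is an arbitrary choice for illustration, and the call relies on the function's default number of alphas.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import enet_path

X, y = datasets.load_diabetes(return_X_y=True)

# Compute the coefficient path for one l1_ratio; eps sets the ratio alpha_min / alpha_max
alphas_enet, coefs_enet, _ = enet_path(X, y, eps=5e-3, l1_ratio=0.5)

# One row per feature, one column per alpha on the path
print(coefs_enet.shape)

# alphas are returned from largest (strongest penalty) to smallest
print(alphas_enet[0], alphas_enet[-1])
```

Each row of `coefs_enet` becomes one line in the plot, drawn against `-log10(alpha)`.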

In [5]:
def plot_enet_descent_path(X, y, l1_ratio):
    # Compute paths
    eps = 5e-3  # the smaller it is the longer is the path

    # Reference the global image variable
    global image
    
    print("Computing regularization path using the elastic net.")
    alphas_enet, coefs_enet, _ = enet_path(X, y, eps=eps, l1_ratio=l1_ratio, fit_intercept=False)

    # Display results
    fig = plt.figure(1)
    ax = plt.gca()

    colors = cycle(['b', 'r', 'g', 'c', 'k'])
    neg_log_alphas_enet = -np.log10(alphas_enet)
    for coef_e, c in zip(coefs_enet, colors):
        l1 = plt.plot(neg_log_alphas_enet, coef_e, linestyle='--', c=c)

    plt.xlabel('-Log(alpha)')
    plt.ylabel('coefficients')
    title = 'ElasticNet Path by alpha for l1_ratio = ' + str(l1_ratio)
    plt.title(title)
    plt.axis('tight')

    # Display images
    image = fig
    
    # Save figure
    fig.savefig("ElasticNet-paths.png")

    # Close plot
    plt.close(fig)

    # Return images
    return image    

#### Train the Diabetes Model
The next function trains an ElasticNet linear regression model using the input parameters `alpha` (`in_alpha`) and `l1_ratio` (`in_l1_ratio`).

In addition, this function uses MLflow Tracking to record its
* parameters
* metrics
* model
* arbitrary files, namely the ElasticNet Descent Path plot described above.
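
The metrics logged by the training function are RMSE, MAE, and R2. As a rough sketch of what each one measures, the snippet below computes them by hand with NumPy; the `actual` and `pred` arrays are hypothetical values chosen only to keep the arithmetic simple.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical values, chosen only to make the arithmetic easy to follow
actual = np.array([100.0, 150.0, 200.0, 250.0])
pred = np.array([110.0, 140.0, 210.0, 230.0])

errors = actual - pred

# Root mean squared error: penalizes large errors more heavily
rmse = np.sqrt(np.mean(errors ** 2))

# Mean absolute error: the average size of an error
mae = np.mean(np.abs(errors))

# R2: 1 minus the ratio of residual variation to total variation in the labels
r2 = 1 - np.sum(errors ** 2) / np.sum((actual - actual.mean()) ** 2)

print(rmse, mae, r2)
```

These hand-computed values agree with `mean_squared_error` (after taking the square root), `mean_absolute_error`, and `r2_score` from scikit-learn, which is what `eval_metrics` calls.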

**Tip:** We use `with mlflow.start_run():` in the Python code to create a new MLflow run. This is the recommended way to use MLflow in notebook cells. Whether your code completes or exits with an error, the `with` context makes sure that the MLflow run is closed, so you don't have to call `mlflow.end_run()` later in the code.

In [7]:
# train_diabetes
#   Uses the sklearn Diabetes dataset to predict diabetes progression using ElasticNet
#       The predicted "progression" column is a quantitative measure of disease progression one year after baseline
#       http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html
#
#   Returns: The MLflow RunInfo associated with this training run, see
#            https://mlflow.org/docs/latest/python_api/mlflow.entities.html#mlflow.entities.RunInfo
#            We will use this later in the notebook to demonstrate ways to access the output of this
#            run and do useful things with it!
def train_diabetes(data, in_alpha, in_l1_ratio):
  # Evaluate metrics
  def eval_metrics(actual, pred):
      rmse = np.sqrt(mean_squared_error(actual, pred))
      mae = mean_absolute_error(actual, pred)
      r2 = r2_score(actual, pred)
      return rmse, mae, r2

  warnings.filterwarnings("ignore")
  np.random.seed(40)

  # Split the data into training and test sets. (0.75, 0.25) split.
  train, test = train_test_split(data)

  # The predicted column is "progression" which is a quantitative measure of disease progression one year after baseline
  train_x = train.drop(["progression"], axis=1)
  test_x = test.drop(["progression"], axis=1)
  train_y = train[["progression"]]
  test_y = test[["progression"]]

  # Fall back to default values when a parameter is not supplied
  if in_alpha is None:
    alpha = 0.05
  else:
    alpha = float(in_alpha)

  if in_l1_ratio is None:
    l1_ratio = 0.05
  else:
    l1_ratio = float(in_l1_ratio)
  
  # Start an MLflow run; the "with" keyword ensures we'll close the run even if this cell crashes
  with mlflow.start_run() as run:
    lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
    lr.fit(train_x, train_y)

    predicted_qualities = lr.predict(test_x)

    (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

    # Print out ElasticNet model metrics
    print("Elasticnet model (alpha=%f, l1_ratio=%f):" % (alpha, l1_ratio))
    print("  RMSE: %s" % rmse)
    print("  MAE: %s" % mae)
    print("  R2: %s" % r2)

    # Note: the tracking URI was already configured in an earlier cell

    # Log mlflow attributes for mlflow UI
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mae", mae)
    mlflow.sklearn.log_model(lr, "model")
    
    # Call plot_enet_descent_path
    image = plot_enet_descent_path(X, y, l1_ratio)
    
    # Log artifacts (output files)
    mlflow.log_artifact("ElasticNet-paths.png")
    
    print("Inside MLflow Run with id %s" % run.info.run_uuid)
    
    # Return the run's RunInfo so we can use it when we try out some other APIs later in this notebook.
    return run.info

![](https://docs.databricks.com/_static/images/mlflow/elasticnet-paths-by-alpha-per-l1-ratio.png)

#### Experiment with Different Parameters

Now that we have a `train_diabetes` function that records MLflow runs, we can simply call it with different parameters to explore them. Later, we'll be able to visualize all these runs on our MLflow tracking server.
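
The same exploration can also be written as a loop. The sketch below leaves MLflow out entirely and only shows the sweep itself with scikit-learn; it fixes the split with `random_state=40` (the notebook seeds NumPy instead), so the numbers it prints need not match the logged runs.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(X, y, random_state=40)

# (alpha, l1_ratio) pairs, the same values tried in this notebook
param_grid = [(0.01, 0.01), (0.01, 0.75), (0.01, 1), (0.05, 0.05), (0.01, 0.99998)]

results = []
for alpha, l1_ratio in param_grid:
    lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
    lr.fit(train_x, train_y)
    rmse = np.sqrt(mean_squared_error(test_y, lr.predict(test_x)))
    results.append((alpha, l1_ratio, rmse))

# Sort so that the best (lowest RMSE) combination comes first
for alpha, l1_ratio, rmse in sorted(results, key=lambda r: r[2]):
    print("alpha=%s l1_ratio=%s RMSE=%.3f" % (alpha, l1_ratio, rmse))
```

In the notebook, replacing the loop body with a call to `train_diabetes(data, alpha, l1_ratio)` would log one MLflow run per pair.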

In [10]:
# Start with alpha and l1_ratio values of 0.01, 0.01
run_info_1 = train_diabetes(data, 0.01, 0.01)

In [11]:
display(image)

In [12]:
# Start with alpha and l1_ratio values of 0.01, 0.75
run_info_2 = train_diabetes(data, 0.01, 0.75)

In [13]:
display(image)

In [14]:
# Start with alpha and l1_ratio values of 0.01, 1
run_info_3 = train_diabetes(data, 0.01, 1)

In [15]:
display(image)

In [16]:
# Start with alpha and l1_ratio values of 0.05, 0.05
run_info_4 = train_diabetes(data, 0.05, 0.05)

In [17]:
display(image)

In [18]:
# Finally, try alpha and l1_ratio values of 0.01, 0.99998
run_info_5 = train_diabetes(data, 0.01, 0.99998)

In [19]:
display(image)

## Review the MLflow UI
Visit your tracking server in a web browser by going to `https://your_shard_id.cloud.databricks.com/mlflow`

The MLflow UI should look similar to the animated GIF below. Inside the UI, you can:
* View your experiments and runs
* Review the parameters and metrics on each run
* Click each run for a detailed view to see the model, images, and other artifacts produced.

<img src="https://docs.databricks.com/_static/images/mlflow/mlflow-ui.gif"/>

#### SIDE BAR: Organize MLflow Runs into Experiments

As you start using your MLflow server for more tasks, you may want to separate them out. MLflow allows you to create [experiments](https://mlflow.org/docs/latest/tracking.html#organizing-runs-in-experiments) to organize your runs. To report your run to a specific experiment, pass an `experiment_id` argument to `mlflow.start_run`, as in `mlflow.start_run(experiment_id=1)`.

Note that the experiments we ran above did not specify an `experiment_id` parameter so they defaulted to the "Default Experiment" which has ID 0.

## Load MLflow model back as a Scikit-learn model
Here we use the MLflow API to load a model from the MLflow server that was produced by a given run. To do so, we have to specify the run ID.

Once we load it back in, it is just a scikit-learn model object like any other, and we can explore it or use it.

In [24]:
# Load the model from its artifact path (use one of the run IDs we captured above)
import mlflow.sklearn
model = mlflow.sklearn.load_model("/dbfs/databricks/mlflow/6717803/4559abddb75648959a4f7b5e340db6ac/artifacts/model")
model.coef_

In [25]:
#Get a prediction for a row of the dataset
model.predict(data[0:1].drop(["progression"], axis=1))

## Load MLflow model back as a Scikit-learn model from MLflow Registry

In [27]:
import mlflow.pyfunc
model_name = "MLflow Diabetes Example"
model_production_uri = "models:/{model_name}/production".format(model_name=model_name)
print("Loading registered model version from URI: '{model_uri}'".format(model_uri=model_production_uri))

# Load the model from the Model Registry with the scikit-learn flavor.
# A generic pyfunc model wraps the estimator and does not expose
# scikit-learn attributes such as coef_.
model_production = mlflow.sklearn.load_model(model_production_uri)
model_production.coef_

In [28]:
#Get a prediction for a row of the dataset
model_production.predict(data[0:1].drop(["progression"], axis=1))

## Load Production MLflow model back as a Scikit-learn model from MLflow Registry 
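Before running the next cell, go to the MLflow Model Registry:
* Register the model logged above under the name `MLflow Diabetes Example`
* Transition the registered model version to the Production stage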

In [30]:
import mlflow.pyfunc
model_name = "MLflow Diabetes Example"
model_production_uri = "models:/{model_name}/production".format(model_name=model_name)
print("Loading registered model version from URI: '{model_uri}'".format(model_uri=model_production_uri))

# Load the model from the Model Registry with the scikit-learn flavor.
# A generic pyfunc model wraps the estimator and does not expose
# scikit-learn attributes such as coef_.
model_production = mlflow.sklearn.load_model(model_production_uri)
model_production.coef_

In [31]:
#Get a prediction for a row of the dataset
model_production.predict(data[0:1].drop(["progression"], axis=1))

## Use an MLflow Model for Batch inference
We can also get a PySpark UDF to do some batch inference using one of the models you logged above. For more on this, see https://mlflow.org/docs/latest/models.html#apache-spark

In [33]:
# First let's create a Spark DataFrame out of our original pandas
# DataFrame minus the column we want to predict. We'll use this
# to simulate what this would be like if we had a big data set
# that was regularly getting updated that we were routinely wanting
# to score, e.g. click logs.
dataframe = spark.createDataFrame(data.drop(["progression"], axis=1))

In [34]:
# Next we use the MLflow API to create a PySpark UDF from our registered model.
# See the API docs for this function call here:
# https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#mlflow.pyfunc.spark_udf
# The spark_udf function takes our SparkSession and the URI of the model to load
# (here, the Model Registry URI defined above).
pyfunc_udf = mlflow.pyfunc.spark_udf(spark, model_production_uri)

In [35]:
# Predict Values
predicted_df = dataframe.withColumn("prediction", pyfunc_udf(
  'age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'))

# Original Values
original_df = spark.createDataFrame(data)

# Join original and predicted 
#  label: original_df.progression, predicted value: predicted_df.prediction
joined_df = (predicted_df.join(original_df,
              (predicted_df.age == original_df.age) &
              (predicted_df.sex == original_df.sex) &
              (predicted_df.bmi == original_df.bmi) &
              (predicted_df.bp == original_df.bp) &
              (predicted_df.s1 == original_df.s1) &
              (predicted_df.s2 == original_df.s2) &
              (predicted_df.s3 == original_df.s3) &
              (predicted_df.s4 == original_df.s4) &
              (predicted_df.s5 == original_df.s5) &
              (predicted_df.s6 == original_df.s6)
            ).select(
              predicted_df.age, 
              predicted_df.sex, 
              predicted_df.bmi,
              predicted_df.bp,
              predicted_df.s1,
              predicted_df.s2,
              predicted_df.s3,
              predicted_df.s4,
              predicted_df.s5,
              predicted_df.s6,
              original_df.progression,
              predicted_df.prediction  
            ))

# Show the values
display(joined_df)

age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,progression,prediction
0.0235457526293458,-0.044641636506989,0.0703187031097357,0.0253152256886921,-0.0345918284170385,-0.014466112821379,-0.0323559322397657,-0.0025922619981828,-0.0191970476139445,-0.0093619113301358,288.0,209.2987016076584
0.0744012909436196,-0.044641636506989,0.0315174684500233,0.10105838095089,0.0465893902168282,0.0368902349121043,0.0155053592133662,-0.0025922619981828,0.0336568129023847,0.0444854785627154,296.0,215.3485735386213
-0.0418399394890061,-0.044641636506989,0.128520555099304,0.063186803319791,-0.0332158755588373,-0.0326287236051719,0.0118237214092792,-0.0394933828740919,-0.0159982677581387,-0.0507829804784829,259.0,242.69177718857287
-0.0418399394890061,-0.044641636506989,-0.0493184370910443,-0.0366564467985606,-0.0070727712530158,-0.0226079728279068,0.0854564774910206,-0.0394933828740919,-0.0664881482228354,0.007206516329203,128.0,67.41250472272267
0.0489735217864827,0.0506801187398187,0.088641508365711,0.0872868981759448,0.0355817673512192,0.0215459602844172,-0.0249926566315915,0.0343088588777263,0.0660482061630984,0.131469723774244,310.0,253.9840510196008
0.0344433679824045,0.0506801187398187,-0.0299178197611881,0.0046580015262745,0.0933717873956666,0.0869939887984295,0.0339135482338016,-0.0025922619981828,0.024052583226893,-0.0383566597339788,69.0,108.0064762561344
0.0526060602375023,-0.044641636506989,-0.0212953231701409,-0.0745280244296595,-0.040095639849843,-0.0376390989938044,-0.0065844676111561,-0.0394933828740919,-0.000609254186102297,-0.0549250873933176,131.0,132.60910987793412
-0.0382074010379866,0.0506801187398187,0.0045721666030007,0.0356438377699009,-0.0112006298276192,0.0058885371949406,-0.0470824834561139,0.0343088588777263,0.0163049527999418,-0.0010776975004663,107.0,176.0644573940246
-0.0200447087828888,-0.044641636506989,-0.084886235529114,-0.0263278347173518,-0.0359677812752396,-0.0341944659141195,0.0412768238419757,-0.0516707527631419,-0.0823814832581028,-0.0466408735636482,90.0,52.78422493051308
0.0162806757273067,-0.044641636506989,-0.0288400076873072,-0.0091134812486705,-0.0043208655366135,-0.0097688858945359,0.0449584616460628,-0.0394933828740919,-0.0307512098645563,-0.0424987666488135,179.0,113.40945904790084


## Congrats, you finished this tutorial!