<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

## MLflow

As you might have noticed, you tried several different model architectures throughout the day. But how do you remember which one worked best? That's where [MLflow](https://mlflow.org/) comes into play!

[MLflow](https://mlflow.org/docs/latest/concepts.html) seeks to address these three core issues:

* It’s difficult to keep track of experiments
* It’s difficult to reproduce code
* There’s no standard way to package and deploy models

In this notebook, we will show how to do experiment tracking with MLflow on Azure Databricks! We will start by logging the metrics from the models we created today with the California housing dataset.

### Install MLflow on Your Databricks Cluster

1. Use an existing cluster, or [create a cluster](https://docs.azuredatabricks.net/user-guide/clusters/create.html#cluster-create), with the following settings:
  * **Databricks Runtime Version:** Databricks Runtime 4.1 (ML)
  * **Python Version:** Python 3
2. Add `mlflow` as a PyPI library in Databricks, and install it on your cluster
  * Follow [Upload a Python PyPI package or Python Egg](https://docs.azuredatabricks.net/user-guide/libraries.html#upload-a-python-pypi-package-or-python-egg) to create a library
  * Choose **PyPI** and enter `mlflow==0.5.0` (this notebook was tested with `mlflow` version 0.5.0)

### Set up a Remote MLflow Tracking Server (already done here)

To run a long-lived, shared MLflow tracking server, we'll launch a Linux VM instance to run the [MLflow Tracking server](https://mlflow.org/docs/latest/tracking.html). To do this:

* Create a *Linux VM* instance
  * Open port 5000 for the MLflow server; for an example, see [How to open ports to a virtual machine with the Azure portal](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/nsg-quickstart-portal). Opening port 5000 to the Internet allows anyone to access your server, so we recommend opening the port only within an [Azure Virtual Network](https://azure.microsoft.com/en-us/services/virtual-network/) that your Databricks clusters can access.
  * Install conda onto your Linux instance via [Conda > Installing on Linux](https://conda.io/docs/user-guide/install/linux.html)

* Run your Tracking Server
  * Run the `server` command in MLflow passing it `--host 0.0.0.0`, e.g. `mlflow server --host 0.0.0.0`.
    * For more information, refer to [MLflow > Running a Tracking Server](https://mlflow.org/docs/latest/tracking.html?highlight=server#running-a-tracking-server).
  * To test connectivity of your tracking server:
    * Get the hostname of your instance
    * Go to http://$TRACKING_SERVER$:5000; it should look similar to this [MLflow UI](https://databricks.com/wp-content/uploads/2018/06/mlflow-web-ui.png)

* **NOTE**: Currently, we can only save parameters and metrics to a remote tracking server
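
If you'd rather check connectivity from a notebook cell than from a browser, a minimal sketch using only the Python standard library could look like this. The helper name `tracking_server_reachable` is our own (it is not part of MLflow), and it only confirms that something accepts TCP connections on the port, not that it is actually an MLflow server.

```python
import socket


def tracking_server_reachable(host, port=5000, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration; not part of MLflow.
    """
    try:
        connection = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, timeouts, and hostnames that do not resolve
        return False
    connection.close()
    return True


# Example usage (replace the placeholder with your own server's hostname):
# tracking_server_reachable("<your-tracking-server-hostname>")
```

If this returns `False` from your Databricks cluster, check the port rules on the VM before debugging anything in MLflow itself.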

In [5]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
import numpy as np
np.random.seed(0)
import tensorflow as tf
tf.set_random_seed(42) # For reproducibility

cal_housing = fetch_california_housing()

# split 80/20 train-test
X_train, X_test, y_train, y_test = train_test_split(cal_housing.data,
                                                        cal_housing.target,
                                                        test_size=0.2,
                                                        random_state=1)

print(cal_housing.DESCR)

Build the model architecture as before.

In [7]:
from keras.models import Sequential
from keras.layers import Dense

def build_model():
  return Sequential([Dense(20, input_dim=8, activation='relu'),
                    Dense(20, activation='relu'),
                    Dense(1, activation='linear')]) # Keep the last layer as linear because this is a regression problem

### Start Using MLflow in a Notebook

The first step is to import MLflow and call `mlflow.set_tracking_uri` to point to your server:

In [9]:
# Set this variable to your MLflow server's DNS name or IP address
mlflow_server = '40.118.203.191'

# Tracking URI
mlflow_tracking_URI = 'http://' + mlflow_server + ':5000'
print("MLflow Tracking URI: {}".format(mlflow_tracking_URI))

# Import MLflow and set the Tracking URI
import mlflow
mlflow.set_tracking_uri(mlflow_tracking_URI)

### Track experiments!

In [11]:
# Note issue with **kwargs https://github.com/keras-team/keras/issues/9805

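def trackExperiments(model, compile_kwargs, fit_kwargs, optional_params=None):
  '''
  Wrapper function for tracking experiments with MLflow.
    
  Parameters
  ----------
  model: callable
    Function that returns a new, uncompiled Keras model (e.g. build_model)
    
  compile_kwargs: dict
    Keyword arguments to compile the model with
  
  fit_kwargs: dict
    Keyword arguments to fit the model with

  optional_params: dict, optional
    Extra parameters to log with the run (e.g. preprocessing settings)
  '''
  optional_params = optional_params or {}  # Avoid a mutable default argument

  with mlflow.start_run():
    keras_model = model()  # Build a fresh model so every run starts from new weights
    keras_model.compile(**compile_kwargs)
    history = keras_model.fit(**fit_kwargs)
    
    # Log every setting except the raw data arrays
    for param_key, param_value in {**compile_kwargs, **fit_kwargs, **optional_params}.items():
      if param_key not in ["x", "y", "X_val", "y_val", "validation_data"]:
        mlflow.log_param(param_key, param_value)
    
    # Log one value per epoch for each metric Keras recorded
    for key, values in history.history.items():
      for v in values:
        mlflow.log_metric(key, v)

    # Log the output shape of every layer (the last entry is the output layer)
    for i, layer in enumerate(keras_model.layers):
      mlflow.log_param("hidden_layer_" + str(i) + "_units", layer.output_shape)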
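
To make it easier to see what `trackExperiments` sends to the tracking server, here is a small sketch of its two bookkeeping steps on their own, without Keras or MLflow. The helper names `loggable_params` and `flatten_history` and the example values are made up for illustration; they are not part of either library and are not results from this notebook.

```python
def loggable_params(compile_kwargs, fit_kwargs, optional_params=None,
                    skip_keys=("x", "y", "validation_data")):
    """Merge the keyword dictionaries and drop the entries that hold raw data."""
    merged = {}
    for source in (compile_kwargs, fit_kwargs, optional_params or {}):
        merged.update(source)
    # Convert values to strings here so lists such as ["mse", "mae"]
    # have a predictable text form when they are logged as parameters
    return {key: str(value) for key, value in merged.items()
            if key not in skip_keys}


def flatten_history(history):
    """Turn {'loss': [a, b]} into [('loss', 0, a), ('loss', 1, b)]."""
    return [(name, epoch, value)
            for name, values in sorted(history.items())
            for epoch, value in enumerate(values)]


# Illustrative inputs only (not taken from a real training run)
example_params = loggable_params(
    {"optimizer": "adam", "loss": "mse"},
    {"x": [[0.0]], "y": [0.0], "epochs": 10},
    {"normalize_data": "true"},
)
example_metrics = flatten_history({"loss": [0.9, 0.5]})

print(example_params)
print(example_metrics)
```

Each `(name, epoch, value)` tuple corresponds to one `mlflow.log_metric` call in the wrapper, and each remaining key in the dictionary corresponds to one `mlflow.log_param` call.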

Let's recall what happened when we used SGD.

In [13]:
compile_kwargs = {
  "optimizer": "sgd", 
  "loss": "mse",
  "metrics": ["mse", "mae"],
}

fit_kwargs = {
  "x": X_train, 
  "y": y_train,
  "epochs": 10,
  "verbose": 2
}

trackExperiments(build_model, compile_kwargs, fit_kwargs)

Now let's change the optimizer to Adam.

In [15]:
compile_kwargs["optimizer"] = "adam" 

trackExperiments(build_model, compile_kwargs, fit_kwargs)

Now let's add some data normalization, and hold out 20% of the training data as a validation set.

In [17]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

fit_kwargs["validation_split"] = 0.2
fit_kwargs["x"] = X_train_scaled

optional_params = {
  "normalize_data": "true"
}

trackExperiments(build_model, compile_kwargs, fit_kwargs, optional_params)
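
As a rough picture of what `StandardScaler` does, the sketch below standardizes a toy dataset in plain Python: subtract each column's training mean and divide by its training standard deviation. The function names and toy values are ours, for illustration only. Note that the test row is scaled with the statistics learned from the training rows, which is the role `scaler.transform(X_test)` would play when you evaluate on `X_test`.

```python
from statistics import mean, pstdev


def fit_standardizer(rows):
    """Compute the per-column mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [mean(column) for column in columns]
    # A constant column has zero spread; divide by 1.0 instead to avoid
    # a division by zero (StandardScaler handles this case the same way)
    scales = [pstdev(column) or 1.0 for column in columns]
    return means, scales


def standardize(rows, means, scales):
    """Apply (value - mean) / scale to every column of every row."""
    return [[(value - m) / s for value, m, s in zip(row, means, scales)]
            for row in rows]


# Toy data for illustration (two features, two training rows, one test row)
toy_train = [[1.0, 10.0], [3.0, 30.0]]
toy_test = [[2.0, 40.0]]

means, scales = fit_standardizer(toy_train)          # Learn statistics from training data only
toy_train_scaled = standardize(toy_train, means, scales)
toy_test_scaled = standardize(toy_test, means, scales)  # Reuse the training statistics

print(toy_train_scaled)
print(toy_test_scaled)
```

After scaling, every training column has mean 0 and standard deviation 1, so features measured in very different units contribute on a comparable scale to the gradient updates.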

## Review the MLflow UI
Open the URL of your tracking server in a web browser. In case you forgot it, you can get it from `mlflow.tracking.get_tracking_uri()`:

In [19]:
# Identify the location of the runs
mlflow.tracking.get_tracking_uri()

The MLflow UI should look similar to the animated GIF below. Inside the UI, you can:
* View your experiments and runs
* Review the parameters and metrics on each run
* Click each run for a detailed view to see the model, images, and other artifacts produced.

<img src="https://brookewenig.github.io/img/DL/mlflow-ui-azure.gif"/>

Now, go back and add MLflow to your experiments from the Boston Housing Dataset!

&copy; 2018 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>