<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

## Add MLflow to your experiments on the Boston Housing dataset!

### Set up a Remote MLflow Tracking Server (already done here)

To run a long-lived, shared MLflow tracking server, we'll launch a Linux VM instance to run the [MLflow Tracking server](https://mlflow.org/docs/latest/tracking.html). To do this:

* Create a *Linux VM* instance
  * Open port 5000 for the MLflow server; for an example of how to do this, see [How to open ports to a virtual machine with the Azure portal](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/nsg-quickstart-portal). Opening port 5000 to the Internet allows anyone to access your server, so we recommend opening the port only within an [Azure virtual network](https://azure.microsoft.com/en-us/services/virtual-network/) that your Databricks clusters can access.
  * Install conda onto your Linux instance via [Conda > Installing on Linux](https://conda.io/docs/user-guide/install/linux.html)

* Run your Tracking Server
  * Run the `server` command in MLflow passing it `--host 0.0.0.0`, e.g. `mlflow server --host 0.0.0.0`.
    * For more information, refer to [MLflow > Running a Tracking Server](https://mlflow.org/docs/latest/tracking.html?highlight=server#running-a-tracking-server).
  * To test connectivity of your tracking server:
    * Get the hostname of your instance
    * Go to http://$TRACKING_SERVER$:5000; it should look similar to this [MLflow UI](https://databricks.com/wp-content/uploads/2018/06/mlflow-web-ui.png)

* **NOTE**: Currently, only parameters and metrics can be saved to a remote tracking server.
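
The connectivity check described above can also be scripted. The following is a minimal sketch using only the Python standard library; the helper name `is_port_open` and the 2-second default timeout are illustrative assumptions, not part of MLflow. It only confirms that something accepts TCP connections on the port, not that the listener is an MLflow server.

```python
import socket


def is_port_open(host, port=5000, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Calling `is_port_open` with your instance's hostname should return `True` once the server is running and the port is reachable from the machine running the check.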

In [4]:
from sklearn.model_selection import train_test_split
# Note: load_boston was removed in scikit-learn 1.2, so this cell assumes an older release
from sklearn.datasets import load_boston
import numpy as np
np.random.seed(0)

boston_housing = load_boston()

# Split 80/20 into train and test sets
X_train, X_test, y_train, y_test = train_test_split(boston_housing.data,
                                                    boston_housing.target,
                                                    test_size=0.2,
                                                    random_state=1)

print(boston_housing.DESCR)
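
To make the 80/20 split concrete, here is a small arithmetic sketch of how many of the dataset's 506 rows end up in each part. The helper `split_sizes` is hypothetical, and the rounding rule (round the test share up, give the remainder to training) reflects our understanding of how `train_test_split` handles a fractional `test_size`; treat it as an assumption rather than a specification.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # test share is rounded up
    return n_samples - n_test, n_test


n_train, n_test = split_sizes(506, 0.2)
print(n_train, n_test)  # 404 102
```

Comparing these numbers against `X_train.shape` and `X_test.shape` is a quick way to confirm the split behaved as expected.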

## Build the model
Create a `build_model()` function. Because Keras models are stateful, we want a fresh model each time we try a new experiment.

In [6]:
# ANSWER
import tensorflow as tf
tf.set_random_seed(42) # For reproducibility

from keras.models import Sequential
from keras.layers import Dense

def build_model():
  return Sequential([Dense(50, input_dim=13, activation='relu'),
                    Dense(20, activation='relu'),
                    Dense(1, activation='linear')]) # Keep the last layer as linear because this is a regression problem
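
As a sanity check on the architecture above, the number of trainable parameters in a stack of fully connected layers can be worked out by hand: each layer contributes `inputs * units` weights plus `units` biases. The helper `dense_param_count` below is an illustrative name, not a Keras function.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases for a stack of fully connected layers.

    layer_sizes lists the input width followed by each layer's unit count.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per unit
    return total


# 13 input features, then Dense(50), Dense(20), Dense(1) as in build_model()
print(dense_param_count([13, 50, 20, 1]))  # 1741
```

With Keras available, `build_model().count_params()` should report the same total.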

### Start Using MLflow in a Notebook

The first step is to import `mlflow` and call `mlflow.set_tracking_uri` to point to your server:

In [8]:
# Set this variable to your MLflow server's DNS name or IP address
mlflow_server = '40.118.203.191'

# Tracking URI
mlflow_tracking_URI = 'http://' + mlflow_server + ':5000'
print("MLflow Tracking URI: %s" % (mlflow_tracking_URI))

# Import MLflow and set the tracking URI
import mlflow
mlflow.set_tracking_uri(mlflow_tracking_URI)

### Track experiments!

In [10]:
# ANSWER

def trackExperiments(build_model=build_model, optimizer="adam", loss="mse", metrics=["mse"], epochs=10, batch_size=32, validation_split=0.0, validation_data=None, verbose=2, normalize_data=False, callbacks=None):
  with mlflow.start_run():
    
    model = build_model()
    model.compile(optimizer=optimizer, loss=loss, metrics=metrics)
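    # Pass batch_size through so the value logged below matches what was actually used
    history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, verbose=verbose, validation_split=validation_split, validation_data=validation_data, callbacks=callbacks)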

    mlflow.log_param("loss", loss)
    mlflow.log_param("optimizer", optimizer)
    mlflow.log_param("epochs", epochs)
    mlflow.log_param("batch_size", batch_size)
    mlflow.log_param("validation_split", validation_split)
    
    if normalize_data:
      mlflow.log_param("normalize_data", "true")
    
    # Log every per-epoch value of each metric recorded by Keras
    for key, values in history.history.items():
      for v in values:
        mlflow.log_metric(key, v)

    for i, layer in enumerate(model.layers):
      mlflow.log_param("hidden_layer_" + str(i) + "_units", layer.output_shape)
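
The nested loop above logs one value per epoch for each metric. The sketch below walks the same structure while also keeping the epoch index, which is useful if your MLflow version accepts a `step` argument in `log_metric` (an assumption about your installed version; check its documentation). The helper `metric_points` and the example numbers are hypothetical, and the MLflow call is left as a comment so the snippet runs without a tracking server.

```python
def metric_points(history):
    """Yield (metric_name, step, value) for every epoch of every metric."""
    for name, values in history.items():
        for step, value in enumerate(values):
            yield name, step, float(value)


# Hypothetical history, shaped like Keras' `history.history` dictionary
example_history = {"loss": [24.1, 18.3, 15.0], "val_loss": [26.0, 19.5, 17.2]}

for name, step, value in metric_points(example_history):
    print(name, step, value)
    # With a tracking server configured, the corresponding call would be:
    # mlflow.log_metric(name, value, step=step)
```

Converting each value with `float` avoids passing NumPy scalar types to the logging call.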

In [11]:
# ANSWER
trackExperiments(optimizer="adam", loss='mse', metrics=["mse"], epochs=100, batch_size=32)

In [12]:
# ANSWER
from sklearn.preprocessing import StandardScaler
from keras.callbacks import ModelCheckpoint, EarlyStopping

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

X_train, X_val, y_train, y_val = train_test_split(X_train,
                                                  y_train,
                                                  test_size=0.25,
                                                  random_state=1)

filepath = '/tmp/02KerasLab_checkpoint_weights.hdf5'
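# ModelCheckpoint writes to the driver's local disk, so address the file with the file: scheme
dbutils.fs.rm('file:' + filepath, recurse=True)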
checkpointer = ModelCheckpoint(filepath=filepath, verbose=1, save_best_only=True)
earlyStopping = EarlyStopping(monitor='val_loss', min_delta=0.0001, patience=2, mode='auto')

trackExperiments(metrics=["mae", "mse"], validation_data=(X_val, y_val), epochs=30, batch_size=32, verbose=2, callbacks=[checkpointer, earlyStopping], normalize_data=True)
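
To see what `min_delta=0.0001` and `patience=2` mean in the cell above, here is a simplified pure-Python model of the early-stopping rule for a loss that should decrease. It is a sketch of the idea, not Keras' implementation: the function name `stopping_epoch` and the loss values are made up for illustration, and the real callback has further options not modeled here.

```python
def stopping_epoch(val_losses, min_delta=0.0001, patience=2):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # consecutive epochs without a sufficient improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Validation loss improves for three epochs, then stalls for two
print(stopping_epoch([1.0, 0.8, 0.79, 0.80, 0.795, 0.7]))  # 4
```

In this example the final value of 0.7 is never reached, which illustrates the trade-off of a small `patience`: training can stop just before a later improvement.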

## Review the MLflow UI
Open the URL of your tracking server in a web browser. If you have forgotten it, you can retrieve it with `mlflow.tracking.get_tracking_uri()`:

In [14]:
# Identify the location of the runs
mlflow.tracking.get_tracking_uri()

&copy; 2018 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>