
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 1200px">
</div>

# MLflow

How do you remember which network architecture and hyperparameters performed best? That's where [MLflow](https://mlflow.org/) comes into play!

[MLflow](https://mlflow.org/docs/latest/concepts.html) seeks to address these three core issues:

* It’s difficult to keep track of experiments
* It’s difficult to reproduce code
* There’s no standard way to package and deploy models

In this notebook, we will show how to do experiment tracking with MLflow! We will start with logging the metrics from the models we created with the California housing dataset today.


## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
 - Log experiments with MLflow
 - View MLflow UI
 - Generate a UDF with MLflow and apply to a Spark DataFrame

MLflow is pre-installed on the Databricks Runtime for ML.

In [3]:
%run "./Includes/Classroom-Setup"

In [4]:
from sklearn.datasets import fetch_california_housing  # the sklearn.datasets.california_housing module path was removed in newer scikit-learn
from sklearn.model_selection import train_test_split
import tensorflow as tf
tf.random.set_seed(42)

cal_housing = fetch_california_housing()

# split 80/20 train-test
X_train, X_test, y_train, y_test = train_test_split(cal_housing.data,
                                                    cal_housing.target,
                                                    test_size=0.2,
                                                    random_state=1)

print(cal_housing.DESCR)

Build the model architecture as before.

In [6]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def build_model():
  return Sequential([Dense(20, input_dim=8, activation="relu"),
                     Dense(20, activation="relu"),
                     Dense(1, activation="linear")]) # Keep the last layer as linear because this is a regression problem

# Two hidden Dense layers with 20 units each, followed by a single linear output unit
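
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand. The sketch below assumes standard fully connected layers with a bias term (the Keras `Dense` default); `dense_param_count` is a hypothetical helper, not a Keras function.

```python
# Each Dense layer has (inputs * units) weights plus one bias per unit.
def dense_param_count(n_inputs, units):
  return n_inputs * units + units

layer_units = [20, 20, 1]  # the three Dense layers in build_model()
n_inputs = 8               # matches input_dim=8 (eight housing features)

per_layer = []
for units in layer_units:
  per_layer.append(dense_param_count(n_inputs, units))
  n_inputs = units         # this layer's outputs feed the next layer

total = sum(per_layer)
print(per_layer)  # [180, 420, 21]
print(total)      # 621
```

If the model is built in the notebook, `build_model().summary()` should report a matching total.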

### Start Using MLflow in a Notebook

<div><img src="https://files.training.databricks.com/images/eLearning/ML-Part-4/mlflow-tracking.png" style="height: 300px; margin: 20px"/></div>

Helper method to plot our training loss using matplotlib.

In [9]:
import matplotlib.pyplot as plt

def viewModelLoss(history):
  plt.clf()
  plt.semilogy(history.history["loss"], label="train_loss")
  plt.title("Model Loss")
  plt.ylabel("Loss")
  plt.xlabel("Epoch")
  plt.legend()
  return plt

### Track experiments!

You can use [mlflow.set_experiment()](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.set_experiment) to set an experiment, but if you do not specify an experiment, it will automatically be scoped to this notebook.

In [11]:
from mlflow.keras import log_model
import mlflow

# Note issue with **kwargs https://github.com/keras-team/keras/issues/9805
def trackExperiments(run_name, model, compile_kwargs, fit_kwargs, optional_params=None):
  """
  This is a wrapper function for tracking experiments with MLflow
    
  Parameters
  ----------
  run_name: str
    Name of the MLflow run
    
  model: callable
    Function that returns a new, uncompiled Keras model (e.g. build_model)
    
  compile_kwargs: dict
    Keyword arguments to compile model with
  
  fit_kwargs: dict
    Keyword arguments to fit model with
    
  optional_params: dict, optional
    Additional parameters to log with the run
  """
  # Avoid a mutable default argument
  optional_params = optional_params or {}
  with mlflow.start_run(run_name=run_name) as run:
    model = model()
    model.compile(**compile_kwargs)
    history = model.fit(**fit_kwargs)
    
    for param_key, param_value in {**compile_kwargs, **fit_kwargs, **optional_params}.items():
      if param_key not in ["x", "y"]:
        mlflow.log_param(param_key, param_value)
    
    for key, values in history.history.items():
      for i, v in enumerate(values):
        mlflow.log_metric(key, v, step=i)

    for i, layer in enumerate(model.layers):
      mlflow.log_param(f"hidden_layer_{i}_units", layer.output_shape)
      
    log_model(model, "model")
    
    fig = viewModelLoss(history)
    fig.savefig("train-validation-loss.png")
    mlflow.log_artifact("train-validation-loss.png")
    return run
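
The two logging loops in `trackExperiments` can be tried out without Keras or a tracking server. The sketch below is illustrative only: `loggable_params` and `metric_records` are hypothetical helpers (not MLflow functions), and the history values are made up. It shows which parameters survive the `"x"`/`"y"` filter and how each epoch becomes one `(metric, step, value)` record.

```python
def loggable_params(compile_kwargs, fit_kwargs, optional_params=None, skip=("x", "y")):
  # Merge the dictionaries (later ones win on duplicate keys),
  # then drop the training arrays, which are data rather than parameters.
  merged = {**compile_kwargs, **fit_kwargs, **(optional_params or {})}
  return {key: value for key, value in merged.items() if key not in skip}

def metric_records(history_dict):
  # One record per metric per epoch, mirroring log_metric(key, v, step=i).
  return [(name, step, value)
          for name, values in history_dict.items()
          for step, value in enumerate(values)]

params = loggable_params(
  {"optimizer": "adam", "loss": "mse"},
  {"x": [[0.0]], "y": [0.0], "epochs": 10},
  {"standardize_data": "true"},
)
print(params)
# {'optimizer': 'adam', 'loss': 'mse', 'epochs': 10, 'standardize_data': 'true'}

records = metric_records({"loss": [0.9, 0.5], "mae": [0.7, 0.6]})  # made-up values
print(records)
# [('loss', 0, 0.9), ('loss', 1, 0.5), ('mae', 0, 0.7), ('mae', 1, 0.6)]
```

Logging each epoch with its own `step` is what lets the MLflow UI draw a metric as a curve rather than a single number.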

Let's recall what happened when we used SGD.

In [13]:
compile_kwargs = {
  "optimizer": "sgd", 
  "loss": "mse",
  "metrics": ["mse", "mae"]
}

fit_kwargs = {
  "x": X_train, 
  "y": y_train,
  "epochs": 10,
  "verbose": 2
}

run_name = "SGD"  # Expect NaNs in the loss with the SGD optimizer
run = trackExperiments(run_name, build_model, compile_kwargs, fit_kwargs)

Now let's change the optimizer.

In [15]:
compile_kwargs["optimizer"] = "adam" 

run_name = "ADAM"
run = trackExperiments(run_name, build_model, compile_kwargs, fit_kwargs)

Now let's add some data standardization, as well as a validation dataset.

In [17]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

fit_kwargs["x"] = X_train_scaled
fit_kwargs["validation_split"] = 0.2

optional_params = {
  "standardize_data": "true"
}

run_name = "StandardizedValidation"
run = trackExperiments(run_name, build_model, compile_kwargs, fit_kwargs, optional_params)
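
As a rough illustration of what the standardization step does, the sketch below rescales a single feature column using only the standard library. The helper names and the toy numbers are assumptions for illustration; this is not how `StandardScaler` is implemented internally. The point is that the mean and standard deviation come from the training data only and are then reused on the test data.

```python
from statistics import fmean, pstdev

def fit_standardizer(column):
  # Population standard deviation (divide by n), as StandardScaler uses.
  mean, std = fmean(column), pstdev(column)
  return mean, (std if std > 0 else 1.0)  # guard against constant columns

def standardize(column, mean, std):
  return [(value - mean) / std for value in column]

train_col = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # toy values
test_col = [5.0, 9.0]

mean, std = fit_standardizer(train_col)  # statistics from training data only
print(mean, std)                         # 5.0 2.0
print(standardize(test_col, mean, std))  # [0.0, 2.0]
```

Fitting on the training split and only transforming the test split, as the cell above does with `fit_transform` and `transform`, keeps information about the test set out of training.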

### Querying Past Runs

You can query past runs programmatically to use their data in Python. You do this through an `MlflowClient` object.

In [19]:
from mlflow.tracking import MlflowClient

client = MlflowClient()

client.list_experiments()  # on MLflow 2.0 and later, use client.search_experiments() instead

You can also use [search_runs](https://mlflow.org/docs/latest/search-syntax.html) to find all runs for a given experiment.

In [21]:
runs_df = mlflow.search_runs(run.info.experiment_id)

display(runs_df)

run_id,experiment_id,status,artifact_uri,start_time,end_time,metrics.mae,metrics.val_loss,metrics.loss,metrics.val_mae,metrics.mse,metrics.val_mse,params.validation_split,params.loss,params.optimizer,params.verbose,params.metrics,params.hidden_layer_1_units,params.epochs,params.hidden_layer_2_units,params.hidden_layer_0_units,params.standardize_data,tags.mlflow.user,tags.mlflow.databricks.notebookRevisionID,tags.mlflow.source.name,tags.mlflow.databricks.notebookPath,tags.mlflow.runName,tags.mlflow.source.type,tags.mlflow.log-model.history,tags.mlflow.databricks.notebookID,tags.mlflow.databricks.webappURL
f505de0734824525bdffd69700183040,1391719663531009,FINISHED,dbfs:/databricks/mlflow/1391719663531009/f505de0734824525bdffd69700183040/artifacts,2020-06-22T18:24:37.480+0000,2020-06-22T18:24:50.457+0000,0.4138793647289276,0.3583203852176666,0.3419573307037353,0.4244694411754608,0.3419573307037353,0.3583203852176666,0.2,mse,adam,2,"['mse', 'mae']","(None, 20)",10,"(None, 1)","(None, 20)",True,odl_user_195841@databrickslabs.onmicrosoft.com,1592850290556,/Users/odl_user_195841@databrickslabs.onmicrosoft.com/Lessons/Python/DL 04 - MLflow,/Users/odl_user_195841@databrickslabs.onmicrosoft.com/Lessons/Python/DL 04 - MLflow,StandardizedValidation,NOTEBOOK,"[{""run_id"":""f505de0734824525bdffd69700183040"",""artifact_path"":""model"",""utc_time_created"":""2020-06-22 18:24:49.768497"",""flavors"":{""keras"":{""keras_module"":""tensorflow.keras"",""keras_version"":""2.3.0-tf"",""data"":""data""},""python_function"":{""loader_module"":""mlflow.keras"",""python_version"":""3.7.6"",""data"":""data"",""env"":""conda.yaml""}}}]",1391719663531009,https://australiaeast.azuredatabricks.net
f6a9ceea8f2b4f399229374edeed534a,1391719663531009,FINISHED,dbfs:/databricks/mlflow/1391719663531009/f6a9ceea8f2b4f399229374edeed534a/artifacts,2020-06-22T18:23:39.780+0000,2020-06-22T18:23:50.353+0000,0.7952521443367004,,1.1730527877807615,,1.1730527877807615,,,mse,adam,2,"['mse', 'mae']","(None, 20)",10,"(None, 1)","(None, 20)",,odl_user_195841@databrickslabs.onmicrosoft.com,1592850230451,/Users/odl_user_195841@databrickslabs.onmicrosoft.com/Lessons/Python/DL 04 - MLflow,/Users/odl_user_195841@databrickslabs.onmicrosoft.com/Lessons/Python/DL 04 - MLflow,ADAM,NOTEBOOK,"[{""run_id"":""f6a9ceea8f2b4f399229374edeed534a"",""artifact_path"":""model"",""utc_time_created"":""2020-06-22 18:23:49.672065"",""flavors"":{""keras"":{""keras_module"":""tensorflow.keras"",""keras_version"":""2.3.0-tf"",""data"":""data""},""python_function"":{""loader_module"":""mlflow.keras"",""python_version"":""3.7.6"",""data"":""data"",""env"":""conda.yaml""}}}]",1391719663531009,https://australiaeast.azuredatabricks.net
1186f1a7eb2e452c92fb39c9c2e413da,1391719663531009,FINISHED,dbfs:/databricks/mlflow/1391719663531009/1186f1a7eb2e452c92fb39c9c2e413da/artifacts,2020-06-22T18:20:36.033+0000,2020-06-22T18:20:45.560+0000,,,,,,,,mse,sgd,2,"['mse', 'mae']","(None, 20)",10,"(None, 1)","(None, 20)",,odl_user_195841@databrickslabs.onmicrosoft.com,1592850045653,/Users/odl_user_195841@databrickslabs.onmicrosoft.com/Lessons/Python/DL 04 - MLflow,/Users/odl_user_195841@databrickslabs.onmicrosoft.com/Lessons/Python/DL 04 - MLflow,SGD,NOTEBOOK,"[{""run_id"":""1186f1a7eb2e452c92fb39c9c2e413da"",""artifact_path"":""model"",""utc_time_created"":""2020-06-22 18:20:44.752947"",""flavors"":{""keras"":{""keras_module"":""tensorflow.keras"",""keras_version"":""2.3.0-tf"",""data"":""data""},""python_function"":{""loader_module"":""mlflow.keras"",""python_version"":""3.7.6"",""data"":""data"",""env"":""conda.yaml""}}}]",1391719663531009,https://australiaeast.azuredatabricks.net
ea81f2b739ab422eb17060766689535a,1391719663531009,FINISHED,dbfs:/databricks/mlflow/1391719663531009/ea81f2b739ab422eb17060766689535a/artifacts,2020-06-22T18:20:15.321+0000,2020-06-22T18:20:28.180+0000,,,,,,,,mse,sgd,2,"['mse', 'mae']","(None, 20)",10,"(None, 1)","(None, 20)",,odl_user_195841@databrickslabs.onmicrosoft.com,1592850028272,/Users/odl_user_195841@databrickslabs.onmicrosoft.com/Lessons/Python/DL 04 - MLflow,/Users/odl_user_195841@databrickslabs.onmicrosoft.com/Lessons/Python/DL 04 - MLflow,SGD,NOTEBOOK,"[{""run_id"":""ea81f2b739ab422eb17060766689535a"",""artifact_path"":""model"",""utc_time_created"":""2020-06-22 18:20:27.261045"",""flavors"":{""keras"":{""keras_module"":""tensorflow.keras"",""keras_version"":""2.3.0-tf"",""data"":""data""},""python_function"":{""loader_module"":""mlflow.keras"",""python_version"":""3.7.6"",""data"":""data"",""env"":""conda.yaml""}}}]",1391719663531009,https://australiaeast.azuredatabricks.net


Pull the most recent run and look at its metrics.

In [23]:
runs = client.search_runs(run.info.experiment_id, order_by=["attributes.start_time desc"], max_results=1)
runs[0].data.metrics
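
The same selection logic can be sketched on plain Python records. The run names, start times, and loss values below are copied from the `runs_df` output shown earlier; the blank SGD metrics are represented as NaN, which is an assumption about how they would load. `run_records` is a hypothetical variable, not an MLflow object.

```python
import math

run_records = [
  {"run_name": "StandardizedValidation", "start_time": "2020-06-22T18:24:37.480+0000", "loss": 0.3419573307037353},
  {"run_name": "ADAM", "start_time": "2020-06-22T18:23:39.780+0000", "loss": 1.1730527877807615},
  {"run_name": "SGD", "start_time": "2020-06-22T18:20:36.033+0000", "loss": math.nan},
  {"run_name": "SGD", "start_time": "2020-06-22T18:20:15.321+0000", "loss": math.nan},
]

# Most recent run: these timestamps share one format and time zone,
# so comparing them as strings matches chronological order.
latest = max(run_records, key=lambda r: r["start_time"])

# Best run by training loss: drop NaN losses first, since NaN never compares as smaller.
finite = [r for r in run_records if not math.isnan(r["loss"])]
best = min(finite, key=lambda r: r["loss"])

print(latest["run_name"])  # StandardizedValidation
print(best["run_name"])    # StandardizedValidation
```

Here the most recent run also happens to have the lowest training loss; in general the two can differ, which is why ordering by a metric is often more useful than ordering by time.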

## User Defined Function

Let's now register our Keras model as a Spark UDF to apply to rows in parallel.

In [25]:
import pandas as pd

predict = mlflow.pyfunc.spark_udf(spark, runs[0].info.artifact_uri + "/model")

X_test_DF = spark.createDataFrame(pd.concat([pd.DataFrame(X_test_scaled, columns=cal_housing.feature_names), 
                                             pd.DataFrame(y_test, columns=["label"])], axis=1))

display(X_test_DF.withColumn("prediction", predict(*cal_housing.feature_names)))

MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,label,prediction
-0.3310285766191358,0.8259818066546623,-0.35885057761057,-0.0510288012032661,-0.2814369848467366,-0.1136260574172399,-0.7356935499969531,0.6048708421580643,3.55,3.071263551712036
-1.0032898983325969,0.6670708583760996,-0.1731410006441295,-0.1198538136249555,-0.2511804116872083,-0.0418538939251884,0.5371054944326057,-0.1024706744613222,0.707,0.6674821376800537
0.0724550980676474,1.3821701256296317,-0.3676159382329025,-0.1721008621226848,0.0967701796473668,0.0589248356565718,0.9816492783326336,-1.4175281419790535,2.294,2.3468172550201416
-1.2452109275997243,1.85890297046532,-0.5865626855766726,0.0302566580563388,-1.0903553672588309,-0.0686485042354789,1.0190845443452703,-1.3477902459743252,1.125,1.6191164255142212
0.6890470920236157,0.6670708583760996,-0.013296962547823,-0.1476120269598186,-0.6356168706553323,-0.0502371931492983,-0.8479993480348564,0.7194402427372626,2.254,3.187736749649048
1.8478995748766407,-0.9220386244095272,-0.1664403698254752,-0.2084067625723654,-1.0645482901521743,-0.1464707971114809,-0.6935788757327416,0.704496407879106,2.63,3.830686569213867
0.7522235845879117,-1.319315995105934,0.678675384674832,-0.1027023047950442,-0.2004561566844697,-0.03277028428904,-0.3379438486127185,-0.4312350413407533,2.268,3.2692878246307373
-0.8186770849145876,0.0314270652618488,-0.2716656834394383,0.0314304929144287,-0.2680884966881212,-0.0738055695294335,1.2530549569242333,-1.4474158116953668,1.662,1.4875677824020386
-0.2015035500228807,0.5081599100975368,-0.185408430509539,-0.2899065318148905,-0.6195986848649938,-0.0184139797411736,-0.7263347334937965,0.9635228787538086,1.18,1.4377541542053225
-0.3630133280847418,1.3027146514903505,0.0352837721589479,0.1322448482265668,-0.999585647780246,0.1671427087747675,-0.7450523665001132,0.704496407879106,1.563,1.7743419408798218


Register the Vectorized UDF `predict` into the SQL namespace.

In [27]:
spark.udf.register("predictUDF", predict)
X_test_DF.createOrReplaceGlobalTempView("X_test_DF")

In [28]:
%sql
select *, predictUDF(MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude) as prediction 
from global_temp.X_test_DF

MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,label,prediction
-0.3310285766191358,0.8259818066546623,-0.35885057761057,-0.0510288012032661,-0.2814369848467366,-0.1136260574172399,-0.7356935499969531,0.6048708421580643,3.55,3.071263551712036
-1.0032898983325969,0.6670708583760996,-0.1731410006441295,-0.1198538136249555,-0.2511804116872083,-0.0418538939251884,0.5371054944326057,-0.1024706744613222,0.707,0.6674821376800537
0.0724550980676474,1.3821701256296317,-0.3676159382329025,-0.1721008621226848,0.0967701796473668,0.0589248356565718,0.9816492783326336,-1.4175281419790535,2.294,2.3468172550201416
-1.2452109275997243,1.85890297046532,-0.5865626855766726,0.0302566580563388,-1.0903553672588309,-0.0686485042354789,1.0190845443452703,-1.3477902459743252,1.125,1.6191164255142212
0.6890470920236157,0.6670708583760996,-0.013296962547823,-0.1476120269598186,-0.6356168706553323,-0.0502371931492983,-0.8479993480348564,0.7194402427372626,2.254,3.187736749649048
1.8478995748766407,-0.9220386244095272,-0.1664403698254752,-0.2084067625723654,-1.0645482901521743,-0.1464707971114809,-0.6935788757327416,0.704496407879106,2.63,3.830686569213867
0.7522235845879117,-1.319315995105934,0.678675384674832,-0.1027023047950442,-0.2004561566844697,-0.03277028428904,-0.3379438486127185,-0.4312350413407533,2.268,3.2692878246307373
-0.8186770849145876,0.0314270652618488,-0.2716656834394383,0.0314304929144287,-0.2680884966881212,-0.0738055695294335,1.2530549569242333,-1.4474158116953668,1.662,1.4875677824020386
-0.2015035500228807,0.5081599100975368,-0.185408430509539,-0.2899065318148905,-0.6195986848649938,-0.0184139797411736,-0.7263347334937965,0.9635228787538086,1.18,1.4377541542053225
-0.3630133280847418,1.3027146514903505,0.0352837721589479,0.1322448482265668,-0.999585647780246,0.1671427087747675,-0.7450523665001132,0.704496407879106,1.563,1.7743419408798218


Now, go back and add MLflow to your experiments from the Boston Housing Dataset!

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>