# Chapter 11: Managing, Deploying, and Scaling Machine Learning Pipelines
Christoph Windheuser    
July, 2022   
Python examples of chapter 11 (page 323 ff) in the book *Learning Spark*

## Installing MLflow
To run this code, you first have to install MLflow. Run the following command in a terminal on your local machine:
````
pip install mlflow
````
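
Before running the cells below, it can help to confirm that the package is visible to the interpreter behind the notebook kernel, since `pip` sometimes installs into a different environment. The following is a minimal sketch using only the standard library; the helper name `is_installed` is made up for this illustration.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the current interpreter can locate the package."""
    # find_spec returns None when a top-level package cannot be found
    return importlib.util.find_spec(package_name) is not None


print("mlflow available:", is_installed("mlflow"))
```

If this prints `False`, the install most likely went into another Python environment than the one the notebook uses.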


In [1]:
# Import the required PySpark libraries
import pyspark
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator


In [2]:
# create a SparkSession
# This requires access to the internet. If executed offline, an error is thrown

spark = (SparkSession
         .builder
         .appName("Chapter_11")
         .getOrCreate())


In [3]:
filePath = "../DB_Spark/LearningSparkV2/databricks-datasets/learning-spark-v2/sf-airbnb/sf-airbnb-clean.parquet"
airbnbDF = spark.read.parquet(filePath)

(trainDF, testDF) = airbnbDF.randomSplit([0.8, 0.2], seed=42)


In [4]:
categoricalCols = [field for (field, dataType) in trainDF.dtypes if dataType == "string"]
indexOutputCols = [x + "Index" for x in categoricalCols]
stringIndexer   = StringIndexer(inputCols = categoricalCols,
                                outputCols = indexOutputCols,
                                handleInvalid="skip")
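
To make the indexing step more concrete, here is a rough plain-Python sketch of what `StringIndexer` does to one column, assuming its default ordering (`frequencyDesc`): the most frequent label gets index 0.0, and ties are broken alphabetically. The function `frequency_index` and the sample values are invented for this illustration and are not taken from the dataset. With `handleInvalid="skip"`, rows whose label was not seen during fitting are dropped when the model transforms data.

```python
from collections import Counter


def frequency_index(values):
    """Map each distinct string to an index, most frequent label first.

    Ties in frequency are broken alphabetically.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical sample column (not from the Airbnb data)
room_types = ["Entire home/apt", "Private room", "Entire home/apt",
              "Shared room", "Private room", "Entire home/apt"]

mapping = frequency_index(room_types)
print(mapping)
# {'Entire home/apt': 0.0, 'Private room': 1.0, 'Shared room': 2.0}
```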


In [5]:
numericCols     = [field for (field, dataType) in trainDF.dtypes
                   if dataType == "double" and field != "price"]
assemblerInputs = indexOutputCols + numericCols
vecAssembler    = VectorAssembler(inputCols=assemblerInputs, outputCol="features")


In [6]:
rf = RandomForestRegressor(labelCol="price", maxBins=40, maxDepth=7, numTrees=150, seed=42)


In [7]:
pipeline = Pipeline(stages=[stringIndexer, vecAssembler, rf])

# Run MLflow

In [8]:
import mlflow
import mlflow.spark
import pandas as pd

In [9]:
# Start MLflow:
run = mlflow.start_run(run_name="random_forest")


In [10]:
# Log params:
mlflow.log_param("num_trees", rf.getNumTrees())
mlflow.log_param("max_depth", rf.getMaxDepth())

In [11]:
# Run model
pipelineModel = pipeline.fit(trainDF)

In [12]:
# Log model:
mlflow.spark.log_model(pipelineModel, "model")



ModelInfo(artifact_path='model', flavors={'spark': {'pyspark_version': '3.2.1', 'model_data': 'sparkml', 'code': None}, 'python_function': {'loader_module': 'mlflow.spark', 'python_version': '3.8.8', 'data': 'sparkml', 'env': 'conda.yaml'}}, model_uri='runs:/269124bda55c492c8ab2872196271497/model', model_uuid='82fac650d7234bd4a5e5f0a99d91e519', run_id='269124bda55c492c8ab2872196271497', saved_input_example_info=None, signature_dict=None, utc_time_created='2022-07-26 19:19:54.669919', mlflow_version='1.27.0')

In [13]:
# Log metrics: RMSE and R2
predDF = pipelineModel.transform(testDF)
regressionEvaluator = RegressionEvaluator(predictionCol = "prediction",
                                          labelCol      = "price")
rmse = regressionEvaluator.setMetricName("rmse").evaluate(predDF)
r2   = regressionEvaluator.setMetricName("r2").evaluate(predDF)
mlflow.log_metrics({"rmse": rmse, "r2": r2})
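
For readers who want to see what the two logged metrics measure, the following plain-Python sketch computes RMSE and R2 from their textbook definitions on a tiny made-up sample. It only illustrates the formulas and is not how `RegressionEvaluator` is implemented; the function names and the numbers are assumptions for this example.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error: sqrt of the mean squared residual."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


# Hypothetical prices and predictions (not from the Airbnb data)
prices    = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 210.0, 240.0]

print(rmse(prices, predicted))  # 10.0
print(r2(prices, predicted))    # approximately 0.968
```

RMSE is expressed in the units of the label (here, price), while R2 compares the model against always predicting the mean label, so a value near 0 means the model is barely better than that baseline.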


In [14]:
# Log artifacts: Feature importance scores
rfModel  = pipelineModel.stages[-1]
pandasDF = (pd.DataFrame(list(zip(vecAssembler.getInputCols(),
                                  rfModel.featureImportances)),
                         columns=["feature", "importance"])
            .sort_values(by="importance", ascending=False))


In [15]:
# Write to local file system, then tell MLflow where to find that file
pandasDF.to_csv("feature-importance.csv", index=False)
mlflow.log_artifact("feature-importance.csv")


### Display MLflow in your browser
Run the following command in your terminal:     
````
mlflow ui
````
Then navigate in your browser to http://localhost:5000/ (or http://127.0.0.1:5000/) to see the output of MLflow.

### Query the MLflow tracking server through the API (MlflowClient)

In [16]:
from mlflow.tracking import MlflowClient

client = MlflowClient()
runs   = client.search_runs(run.info.experiment_id,
                            order_by=["attributes.start_time desc"],
                            max_results=1)

run_id = runs[0].info.run_id
runs[0].data.metrics


{'r2': 0.2097921402566152, 'rmse': 213.9814486105752}

# Running experiments in the command line
Experiments from a GitHub repo can be run from the command line. The results are automatically included in MLflow.    
Run the following command in your terminal:
````
mlflow run https://github.com/databricks/LearningSparkV2/#mlflow-project-example \
  -P max_depth=5 -P num_trees=100
````
After the run finishes, its results appear in the MLflow UI in your browser.
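
The same project can probably also be launched from Python rather than the terminal. The sketch below wraps `mlflow.projects.run` in a helper; the function name `run_example_project` is made up for this example, and the call needs MLflow, network access, and whatever environment the project itself requires. MLflow is imported inside the function so that defining it does not fail on a machine without MLflow.

```python
def run_example_project(max_depth=5, num_trees=100):
    """Launch the example MLflow project through the Python API.

    Returns the ID of the resulting MLflow run.
    """
    # Lazy import: only needed when the function is actually called
    import mlflow.projects

    submitted = mlflow.projects.run(
        uri="https://github.com/databricks/LearningSparkV2/#mlflow-project-example",
        parameters={"max_depth": max_depth, "num_trees": num_trees},
    )
    # By default the call blocks until the project run has finished
    return submitted.run_id
```

Calling `run_example_project(max_depth=5, num_trees=100)` should correspond to the terminal command above.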

# Ending the MLflow run

In [17]:
mlflow.end_run()
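
An alternative to pairing `mlflow.start_run()` with an explicit `mlflow.end_run()` is to use the run as a context manager, which ends the run automatically even if a cell raises an error. The following is a minimal sketch, not the notebook's own approach; the helper name `log_run` is hypothetical, and MLflow is imported inside the function so the definition loads without it.

```python
def log_run(params, metrics, run_name="random_forest"):
    """Log params and metrics inside a run that is closed automatically."""
    # Lazy import: only needed when the function is actually called
    import mlflow

    # Leaving the with-block ends the run, so no end_run() call is needed
    with mlflow.start_run(run_name=run_name) as active_run:
        mlflow.log_params(params)
        mlflow.log_metrics(metrics)
        return active_run.info.run_id
```

For example, `log_run({"num_trees": 150, "max_depth": 7}, {"rmse": rmse, "r2": r2})` would record the same parameters and metrics as the cells above in a single call.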

# Model Deployment Options with MLlib
Page 330 ff