# Description

----
https://learning.oreilly.com/library/view/learning-spark-2nd/9781492050032/ch11.html#managingcomma_deployingcomma_and_scaling

# Setup

In [12]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

%load_ext blackcellmagic
# in a cell, type %%black

%load_ext autoreload
# in a cell, type %autoreload


The blackcellmagic extension is already loaded. To reload it, use:
  %reload_ext blackcellmagic
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Imports

In [24]:
import os
import os.path as path

import src.data.utils as uts

# Spark

In [27]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = (SparkSession
         .builder
         .master('local[*]')
         .appName("spark-ml-ch-11")
         .config('spark.ui.showConsoleProgress', 'false')
         .getOrCreate())

# Functions

See `src.data.utils`

# Model Management

* Library versioning
* Data evolution
* Order of execution
* Parallel operations

# MLflow
https://mlflow.org

MLflow is an open source platform that helps developers reproduce and share experiments, manage models, and much more. It provides interfaces in Python, R, and Java/Scala, as well as a REST API. As shown in Figure 11-1, MLflow has four main components:

* Tracking
    * Provides APIs to record parameters, metrics, code versions, models, and artifacts such as plots and text.
* Projects
    * A standardized format to package your data science projects and their dependencies to run on other platforms. It helps you manage the model training process.
* Models
    * A standardized format to package models to deploy to diverse execution environments. It provides a consistent API for loading and applying models, regardless of the algorithm or library used to build the model.
* Registry
    * A repository to keep track of model lineage, model versions, stage transitions, and annotations.


## Let’s examine a few things that can be logged to the tracking server:

* Parameters
    * Key/value inputs to your code—e.g., hyperparameters like num_trees or max_depth in your random forest
* Metrics
    * Numeric values (can update over time)—e.g., RMSE or accuracy values
* Artifacts
    * Files, data, and models—e.g., matplotlib images, or Parquet files
* Metadata
    * Information about the run, such as the source code that executed the run or the version of the code (e.g., the Git commit hash string for the code version)
* Models
    * The model(s) you trained


By default, the tracking server records everything to the filesystem, but you can specify a database instead, which makes querying values such as parameters and metrics faster. Let’s add MLflow tracking to our random forest code from Chapter 10:

In [28]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

filePath = db_fname("sf-airbnb/sf-airbnb-clean.parquet")
airbnbDF = spark.read.parquet(filePath)

(trainDF, testDF) = airbnbDF.randomSplit([0.8, 0.2], seed=42)

In [30]:
import src.models.train_model as tm

In [32]:
categoricalCols = [field for (field, dataType) in trainDF.dtypes if dataType == 'string']

stages_string_indexer = tm.make_string_indexer_list(categoricalCols)

indexOutputCols = [indexer.getOutputCol() for indexer in stages_string_indexer]

In [35]:
numericCols = [field for (field, dataType) in trainDF.dtypes
               if dataType == 'double' and field != 'price']

In [36]:
assemblerInputs = indexOutputCols + numericCols
vecAssembler = VectorAssembler(inputCols=assemblerInputs,
                              outputCol='features')

rf = RandomForestRegressor(labelCol='price', maxBins=40, maxDepth=5, numTrees=100, seed=42)

stages = stages_string_indexer + [vecAssembler, rf]
pipeline = Pipeline(stages=stages)

## Start logging with MLflow
---
* start with `mlflow.start_run()`
* end with `mlflow.end_run()`, or, as in this case, put it in a `with` clause to end automatically
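
The explicit form from the first two bullets might look like the sketch below. It relies only on `mlflow.start_run`, `mlflow.log_params`, and `mlflow.end_run`; the helper name `log_params_explicitly` and its arguments are made up for illustration. The `try`/`finally` does roughly what the `with` clause otherwise handles for you: the run is closed even if logging fails.

```python
def log_params_explicitly(params, run_name="explicit-run"):
    """Open a run, log a dict of parameters, and always close the run."""
    # Imported inside the function so this file loads even where MLflow is absent.
    import mlflow

    mlflow.start_run(run_name=run_name)
    try:
        mlflow.log_params(params)
    finally:
        # Without this call the run would stay active in the current process.
        mlflow.end_run()
```

A call such as `log_params_explicitly({'num_trees': 100, 'max_depth': 5})` would create one run holding just those two parameters.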

In [37]:
import mlflow
import mlflow.spark
import pandas as pd

In [38]:
with mlflow.start_run(run_name='random-forest') as run:
    # Log params: num_trees and max_depth
    mlflow.log_param('num_trees', rf.getNumTrees())
    mlflow.log_param('max_depth', rf.getMaxDepth())
    
    # Log model
    pipelineModel = pipeline.fit(trainDF)
    mlflow.spark.log_model(pipelineModel, 'model')
    
    # Log metrics: RMSE, R2
    predDF = pipelineModel.transform(testDF)
    regressionEvaluator = RegressionEvaluator(predictionCol='prediction',
                                              labelCol='price')
    rmse = regressionEvaluator.setMetricName('rmse').evaluate(predDF)
    r2 = regressionEvaluator.setMetricName('r2').evaluate(predDF)
    mlflow.log_metrics({'rmse': rmse, 'r2': r2})
    
    # Log artifact: feature importance scores
    rfModel = pipelineModel.stages[-1]
    pandasDF = (pd.DataFrame(list(zip(vecAssembler.getInputCols(),
                                      rfModel.featureImportances)),
                            columns=['feature', 'importance'])
               .sort_values(by='importance', ascending=False))
    
    # First write to local filesystem, then tell MLflow where to find that file
    fname_feat_imp = 'feature_importance.csv'
    pandasDF.to_csv(fname_feat_imp, index=False)
    mlflow.log_artifact(fname_feat_imp)

## Seeing the results

* from the command line, in the directory where the notebook is running, run

```
mlflow ui
```

Then, navigate to `localhost:5000`
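
Earlier it was noted that tracking data goes to the filesystem by default but can be stored in a database. A minimal sketch of that switch, assuming a local SQLite file is good enough (the file name `mlflow.db` and both helper names are illustrative, not part of this notebook):

```python
from pathlib import Path


def sqlite_tracking_uri(db_file="mlflow.db"):
    """Build a SQLAlchemy-style SQLite URI for a database file path."""
    return f"sqlite:///{Path(db_file).as_posix()}"


def use_sqlite_tracking(db_file="mlflow.db"):
    """Point MLflow at a SQLite backend store and return the URI used."""
    # Imported lazily so the URI helper above works without MLflow installed.
    import mlflow

    uri = sqlite_tracking_uri(db_file)
    mlflow.set_tracking_uri(uri)
    return uri
```

`use_sqlite_tracking()` would need to run before `mlflow.start_run()`. The UI then has to be pointed at the same store, for example with `mlflow ui --backend-store-uri sqlite:///mlflow.db`.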

In [43]:
from mlflow.tracking import MlflowClient

client = MlflowClient()
runs = client.search_runs(run.info.experiment_id, 
                          order_by=["attributes.start_time desc"], 
                          max_results=1)

run_id = runs[0].info.run_id
runs[0].data.metrics


{'r2': 0.17493315816019928, 'rmse': 282.57071791501807}

In [44]:
mlflow.run(
  "https://github.com/databricks/LearningSparkV2/#mlflow-project-example", 
  parameters={"max_depth": 5, "num_trees": 100})

2020/10/17 21:55:25 INFO mlflow.projects.utils: === Fetching project from https://github.com/databricks/LearningSparkV2/#mlflow-project-example into /var/folders/41/2ty9kyz5511f9rbhn6y5gn680000gq/T/tmpcthu04xb ===


ExecutionException: Could not find Conda executable at conda. Ensure Conda is installed as per the instructions at https://conda.io/projects/conda/en/latest/user-guide/install/index.html. You can also configure MLflow to look for a specific Conda executable by setting the MLFLOW_CONDA_HOME environment variable to the path of the Conda executable
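
The failure above is an environment issue rather than a problem with the project itself: MLflow tried to create the Conda environment declared by the project and could not find a `conda` executable. One possible workaround, assuming the project's dependencies are already installed in the current Python environment, is to ask MLflow to skip environment creation. The sketch assumes a recent MLflow release, where `mlflow.run` accepts `env_manager="local"`; older 1.x releases exposed `use_conda=False` for the same purpose. The helper name `run_project_locally` is hypothetical.

```python
def run_project_locally(uri, parameters):
    """Run an MLflow project in the current environment instead of a Conda one."""
    # Imported lazily so the sketch can be loaded without MLflow installed.
    import mlflow

    # env_manager="local" tells MLflow not to create a Conda/virtualenv environment.
    return mlflow.run(uri, parameters=parameters, env_manager="local")
```

With the project used above, the call would be `run_project_locally("https://github.com/databricks/LearningSparkV2/#mlflow-project-example", {"max_depth": 5, "num_trees": 100})`.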

## BV add cross-validator and run MLflow again

### Set up `ParamGrid`

In [45]:
from pyspark.ml.tuning import ParamGridBuilder
paramGrid = (ParamGridBuilder()
            .addGrid(rf.maxDepth, [2, 4, 6])
            .addGrid(rf.numTrees, [10, 100])
            .build())

from pyspark.ml.tuning import CrossValidator

evaluator = RegressionEvaluator(labelCol='price',
                               predictionCol='prediction',
                               metricName='rmse')
cv = CrossValidator(estimator=rf,
                    evaluator=evaluator,
                    estimatorParamMaps=paramGrid,
                    numFolds=3,
                    parallelism=4,
                    seed=42)

pipeline = Pipeline(stages=stages_string_indexer + [vecAssembler, cv])
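
To get a feel for the cost of this search, the grid and fold counts above can be multiplied out. The sketch below uses only the standard library and mirrors the values passed to `ParamGridBuilder` and `CrossValidator`; the variable names are illustrative.

```python
from itertools import product

max_depths = [2, 4, 6]   # values given to addGrid(rf.maxDepth, ...)
num_trees = [10, 100]    # values given to addGrid(rf.numTrees, ...)
num_folds = 3            # numFolds passed to CrossValidator

# Every combination of the two hyperparameters is one candidate model.
grid = [{"maxDepth": d, "numTrees": n} for d, n in product(max_depths, num_trees)]

cv_fits = len(grid) * num_folds  # each candidate is trained once per fold
total_fits = cv_fits + 1         # plus one refit of the best candidate on all training data

print(len(grid), cv_fits, total_fits)  # 6 18 19
```

With `parallelism=4`, up to four of these fits can run concurrently. Because `cv` wraps only `rf` and sits at the end of the pipeline, the indexers and the assembler are fitted once rather than once per fold.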

In [68]:
from src.models.utils import parse_param_map

In [67]:
with mlflow.start_run(run_name='random-forest') as run:
    # Log model
    pipelineModel = pipeline.fit(trainDF)
    mlflow.spark.log_model(pipelineModel, 'model')

    best_model = pipelineModel.stages[-1].bestModel
    
    parameters = parse_param_map(best_model.extractParamMap())
    
    # Log params: num_trees and max_depth
    mlflow.log_param('num_trees', parameters['numTrees'])
    mlflow.log_param('max_depth', parameters['maxDepth'])

    # Log metrics: RMSE, R2
    predDF = pipelineModel.transform(testDF)
    regressionEvaluator = RegressionEvaluator(predictionCol='prediction',
                                              labelCol='price')
    rmse = regressionEvaluator.setMetricName('rmse').evaluate(predDF)
    r2 = regressionEvaluator.setMetricName('r2').evaluate(predDF)
    mlflow.log_metrics({'rmse': rmse, 'r2': r2})
    
    # Log artifact: feature importance scores
#     rfModel = pipelineModel.stages[-1]
    pandasDF = (pd.DataFrame(list(zip(vecAssembler.getInputCols(),
                                      best_model.featureImportances)),
                            columns=['feature', 'importance'])
               .sort_values(by='importance', ascending=False))
    
    # First write to local filesystem, then tell MLflow where to find that file
    fname_feat_imp = 'feature_importance.csv'
    pandasDF.to_csv(fname_feat_imp, index=False)
    mlflow.log_artifact(fname_feat_imp)
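
`parse_param_map` lives in `src.models.utils` and is not shown in this notebook. A plausible minimal version, assuming it simply turns the `Param`-keyed dictionary returned by `extractParamMap()` into a dictionary keyed by parameter name, could look like the sketch below. The `FakeParam` class is a stand-in so the example runs without Spark (real keys are `pyspark.ml.param.Param` objects, which also expose a `name` attribute), and the values 100 and 6 are placeholders rather than the actual best parameters.

```python
from collections import namedtuple

# Stand-in for pyspark.ml.param.Param, for illustration only.
FakeParam = namedtuple("FakeParam", ["parent", "name", "doc"])


def parse_param_map_sketch(param_map):
    """Map each Param's name to its value, dropping the Param objects themselves."""
    return {param.name: value for param, value in param_map.items()}


example = {
    FakeParam("rf", "numTrees", "number of trees"): 100,
    FakeParam("rf", "maxDepth", "maximum tree depth"): 6,
}
print(parse_param_map_sketch(example))  # {'numTrees': 100, 'maxDepth': 6}
```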

# Batch

In [78]:
from mlflow.tracking import MlflowClient

client = MlflowClient()
runs = client.search_runs(run.info.experiment_id, 
                          order_by=["attributes.start_time desc"], 
                          max_results=1)

run_id = runs[0].info.run_id
runs[0].data.metrics


{'r2': 0.1779329157505093, 'rmse': 282.05656834731315}

In [79]:
# Load saved model with MLflow
import mlflow.spark

pipelineModel = mlflow.spark.load_model(f"runs:/{run_id}/model")

2020/10/17 23:36:49 INFO mlflow.spark: 'runs:/4ad18afeeda8402f9df3d5dba286cd87/model' resolved as 'file:///Users/bartev/dev/github-bv/san-tan/lrn-spark/notebooks/mlruns/0/4ad18afeeda8402f9df3d5dba286cd87/artifacts/model'
2020/10/17 23:36:49 INFO mlflow.spark: File 'file:///Users/bartev/dev/github-bv/san-tan/lrn-spark/notebooks/mlruns/0/4ad18afeeda8402f9df3d5dba286cd87/artifacts/model/sparkml' is already on DFS, copy is not necessary.


In [80]:
# Generate predictions
inputDF = spark.read.parquet(db_fname('sf-airbnb/sf-airbnb-clean.parquet'))

predDF = pipelineModel.transform(inputDF)

In [92]:
from pyspark.sql import functions as F  # avoids shadowing Python's built-in round()

(predDF.select('features', F.round('price', 3).alias('price'), F.round('prediction', 2).alias('pred'))
 .withColumn('error', F.round(F.col('pred') - F.col('price'), 3))
 .show())

+--------------------+-----+------+------+
|            features|price|  pred| error|
+--------------------+-----+------+------+
|(33,[0,1,2,3,7,8,...|170.0|180.36| 10.36|
|(33,[3,7,8,9,10,1...|235.0|220.22|-14.78|
|(33,[3,5,7,8,9,10...| 65.0| 93.19| 28.19|
|(33,[3,5,7,8,9,10...| 65.0| 93.45| 28.45|
|(33,[3,4,7,8,9,10...|785.0| 289.1|-495.9|
|(33,[1,3,7,8,9,10...|255.0| 362.6| 107.6|
|(33,[0,2,4,5,7,8,...|139.0|154.58| 15.58|
|(33,[3,4,5,7,8,9,...|135.0|139.28|  4.28|
|(33,[0,1,7,8,9,10...|265.0|332.36| 67.36|
|(33,[3,7,8,9,10,1...|177.0|220.78| 43.78|
|(33,[3,7,8,9,10,1...|194.0|325.35|131.35|
|(33,[3,7,8,9,10,1...|139.0|161.56| 22.56|
|(33,[3,5,7,8,9,10...| 85.0| 94.05|  9.05|
|(33,[3,5,7,8,9,10...| 85.0| 93.31|  8.31|
|(33,[0,2,3,5,7,8,...| 79.0|128.99| 49.99|
|(33,[0,3,4,7,8,9,...|136.0|220.06| 84.06|
|(33,[1,7,8,9,10,1...|215.0|177.68|-37.32|
|(33,[3,4,7,8,9,10...|450.0| 285.7|-164.3|
|(33,[0,3,7,8,9,10...|107.0|159.81| 52.81|
|(33,[0,3,4,5,7,8,...|110.0|149.01| 39.01|
+--------------------+-----+------+------+
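
The `error` column gives a per-row view of what the RMSE metric summarizes. As a worked illustration, the sketch below recomputes RMSE by hand for the first five rows of the table. It covers those five rows only, so it is not expected to match the RMSE logged to MLflow; the function name `rmse` is illustrative.

```python
import math

# price and pred values copied from the first five rows of the table
prices = [170.0, 235.0, 65.0, 65.0, 785.0]
preds = [180.36, 220.22, 93.19, 93.45, 289.1]


def rmse(actual, predicted):
    """Root mean squared error: sqrt of the mean of squared differences."""
    squared_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


print(f"{rmse(prices, preds):.2f}")  # 222.64
```

The single row priced at 785.0 contributes almost all of that value, which shows how strongly RMSE reacts to a few large misses.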

# Streaming

* when reading in data, use `spark.readStream` instead of `spark.read` (both are properties, not method calls)

In [94]:
pipelineModel = mlflow.spark.load_model(f"runs:/{run_id}/model")

# Set up simulated streaming data
repartitionedPath = db_fname('sf-airbnb/sf-airbnb-clean-100p.parquet')
schema = spark.read.parquet(repartitionedPath).schema

2020/10/17 23:54:22 INFO mlflow.spark: 'runs:/4ad18afeeda8402f9df3d5dba286cd87/model' resolved as 'file:///Users/bartev/dev/github-bv/san-tan/lrn-spark/notebooks/mlruns/0/4ad18afeeda8402f9df3d5dba286cd87/artifacts/model'
2020/10/17 23:54:22 INFO mlflow.spark: File 'file:///Users/bartev/dev/github-bv/san-tan/lrn-spark/notebooks/mlruns/0/4ad18afeeda8402f9df3d5dba286cd87/artifacts/model/sparkml' is already on DFS, copy is not necessary.


In [95]:
streamingData = (spark
                .readStream
                .schema(schema) # can set the schema this way
                .option('maxFilesPerTrigger', 1)
                .parquet(repartitionedPath))

In [96]:
# Generate predictions
streamPred = pipelineModel.transform(streamingData)

In [101]:
streamPred.show()

AnalysisException: 'Queries with streaming sources must be executed with writeStream.start();;\nFileSource[/Users/bartev/dev/github-bv/LearningSparkV2/databricks-datasets/learning-spark-v2/sf-airbnb/sf-airbnb-clean-100p.parquet]'
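
As the exception says, a streaming DataFrame has to be consumed through `writeStream` rather than `show()`. One way to inspect the predictions interactively is the in-memory sink, sketched below under the assumption that the output is small enough to hold on the driver. The helper name `start_memory_sink` and the query name `stream_preds` are illustrative; `writeStream`, `format`, `queryName`, `outputMode`, and `start` are standard Structured Streaming calls.

```python
def start_memory_sink(stream_df, query_name="stream_preds"):
    """Start a streaming query that appends results to an in-memory table."""
    return (stream_df
            .writeStream
            .format("memory")        # debugging sink; rows are kept on the driver
            .queryName(query_name)   # name of the temporary table to query later
            .outputMode("append")    # only newly arrived rows are written
            .start())
```

With the objects from this notebook, that would be `query = start_memory_sink(streamPred)`, then `spark.sql('SELECT price, prediction FROM stream_preds').show()` once a few files have been processed, and `query.stop()` when finished.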

# Model Export Patterns for Real-Time Inference

* MLeap https://mleap-docs.combust.ml
* ONNX https://onnx.ai