# Lab: Adding Pre and Post-Processing Logic

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lab you:<br>
 - Import data and train a random forest model
 - Define pre-processing steps
 - Add post-processing steps
 
## Prerequisites
- Web browser: Chrome
- A cluster configured with **8 cores** and **DBR 7.3 ML**

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the<br/>
start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [0]:
%run "../Includes/Classroom-Setup"

## Import Data and Train Random Forest

Import the Airbnb DataFrame.

In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("/dbfs/mnt/training/airbnb/sf-listings/airbnb-cleaned-mlflow.csv")
# Hold out a test set: `price` is the target, every other column is a feature
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(["price"], axis=1), df["price"].values, random_state=42
)

Train a random forest model.

In [0]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

rf = RandomForestRegressor(n_estimators=100, max_depth=25)
rf.fit(X_train, y_train)
rf_mse = mean_squared_error(y_test, rf.predict(X_test))

rf_mse

## Pre-processing Our Data

We would like to add some pre-processing steps to our data before training a random forest model, with the aim of decreasing the MSE and improving our model's performance.

Take a look at the first 10 rows of our data.

In [0]:
df.iloc[:10]

Notice that the values in the `latitude` and `longitude` columns are all very similar (they agree to the first decimal place) because every Airbnb listing is in San Francisco. Pricing probably does not vary much across latitude and longitude differences of 0.0001, so we can simplify the tree's candidate split points by rounding the `latitude` and `longitude` values to three decimal places (the nearest thousandth) instead of worrying about all 6 digits after the decimal point. We will store the rounded values in new columns called `trunc_lat` and `trunc_long` and drop the original `latitude` and `longitude` columns.
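
As a rough illustration of the rounding idea, the sketch below uses three made-up coordinates (they are not taken from the Airbnb dataset) and shows how rounding to three decimal places collapses nearby listings onto the same value. The DataFrame name `toy` is hypothetical.

```python
import pandas as pd

# Made-up coordinates for illustration only (not from the Airbnb dataset).
toy = pd.DataFrame({
    "latitude": [37.76931, 37.76948, 37.74511],
    "longitude": [-122.43386, -122.43412, -122.42102],
})

# Keep three decimal places and store the result in new columns.
toy["trunc_lat"] = toy["latitude"].round(3)
toy["trunc_long"] = toy["longitude"].round(3)

# Count distinct latitude values before and after rounding.
n_unique_before = toy["latitude"].nunique()
n_unique_after = toy["trunc_lat"].nunique()
print(n_unique_before, n_unique_after)
```

The first two rows differ only beyond the third decimal place, so they share the same rounded latitude and longitude, which leaves the tree fewer distinct values to consider when splitting.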

Additionally, notice that the `review_scores_accuracy`, `review_scores_cleanliness`, `review_scores_checkin`, `review_scores_communication`, `review_scores_location`, and `review_scores_value` columns encode fairly similar information, so we will summarize them in a single new column called `review_scores_sum`, which contains the sum of these 6 columns. Hopefully the tree will be able to make a more informed split given this additional information.
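
One possible way to build such a summary column is to select the six review columns and sum across each row. The sketch below does this on two made-up listings; the names `review_cols` and `toy` are hypothetical and the scores are invented for illustration.

```python
import pandas as pd

# The six review columns named in this lab.
review_cols = [
    "review_scores_accuracy",
    "review_scores_cleanliness",
    "review_scores_checkin",
    "review_scores_communication",
    "review_scores_location",
    "review_scores_value",
]

# Two made-up listings (values are illustrative, not from the dataset).
toy = pd.DataFrame(
    [[10, 9, 10, 10, 9, 8], [7, 8, 9, 10, 10, 9]],
    columns=review_cols,
)

# Row-wise sum of the six review columns.
toy["review_scores_sum"] = toy[review_cols].sum(axis=1)
print(toy["review_scores_sum"].tolist())
```

Note that `sum(axis=1)` skips missing values by default, whereas adding the columns with `+` yields `NaN` for any row that has a missing score; on data without missing scores the two approaches give the same result.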


Fill in the pre-processing lines to create the `X_test_processed` and `X_train_processed` DataFrames. Then we will train a new random forest model on this pre-processed data.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Take a look at Python's built-in `round` function.

In [0]:
# ANSWER
# new random forest model
rf2 = RandomForestRegressor(n_estimators=100, max_depth=25)

cols_to_drop = ["latitude", "longitude"]

X_train_processed = X_train.copy()
X_train_processed["trunc_lat"] = round(X_train["latitude"], 3)
X_train_processed["trunc_long"] = round(X_train["longitude"], 3)
X_train_processed["review_scores_sum"] = (
   X_train['review_scores_accuracy'] + 
   X_train['review_scores_cleanliness']+
   X_train['review_scores_checkin'] + 
   X_train['review_scores_communication'] + 
   X_train['review_scores_location'] + 
   X_train['review_scores_value']
)
X_train_processed = X_train_processed.drop(cols_to_drop, axis=1)


X_test_processed = X_test.copy()
X_test_processed["trunc_lat"] = round(X_test["latitude"], 3)  
X_test_processed["trunc_long"] = round(X_test["longitude"], 3) 
X_test_processed["review_scores_sum"] = (
  X_test['review_scores_accuracy'] +
  X_test['review_scores_cleanliness'] +
  X_test['review_scores_checkin'] + 
  X_test['review_scores_communication'] +
  X_test['review_scores_location'] +
  X_test['review_scores_value']
)
X_test_processed = X_test_processed.drop(cols_to_drop, axis=1)


# fit and evaluate new rf model
rf2.fit(X_train_processed, y_train)
rf2_mse = mean_squared_error(y_test, rf2.predict(X_test_processed))

rf2_mse

After training our new `rf2` model, let's log this run in MLflow so that we can load and reuse the trained model later.

In [0]:
import mlflow.sklearn

with mlflow.start_run(run_name="RF Model Pre-process") as run: 
  mlflow.sklearn.log_model(rf2, "random-forest-model-preprocess")
  mlflow.log_metric("mse", rf2_mse)
  
  experimentID = run.info.experiment_id
  artifactURI = mlflow.get_artifact_uri()

Now let's load the `python_function` flavor of the model so we can apply it to a test set.

In [0]:
import mlflow.pyfunc
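from mlflow.tracking import MlflowClient

client = MlflowClient()
# Pick the most recently started run in the experiment, i.e. the run we just logged
rf2_run = sorted(client.list_run_infos(experimentID), key=lambda r: r.start_time, reverse=True)[0]
# The model was logged under the "random-forest-model-preprocess" artifact path
rf2_path = rf2_run.artifact_uri + "/random-forest-model-preprocess/"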

rf2_pyfunc_model = mlflow.pyfunc.load_model(rf2_path)

Let's try passing the `X_test` DataFrame to our new `rf2_pyfunc_model` to generate predictions.

In [0]:
try:
  rf2_pyfunc_model.predict(X_test)
except ValueError as e:
  print("ERROR: " + str(e))

Why did this fail?

## Adding Pre-Processing Steps

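We trained our `rf2` model on a pre-processed training set that has one more column than the unprocessed `X_train` and `X_test` DataFrames: three columns were added (`trunc_lat`, `trunc_long`, and `review_scores_sum`) and two were dropped (`latitude` and `longitude`). The `rf2` model therefore expects `review_scores_sum` as an input column. Even if `X_test` had the same number of columns as the processed data we trained on, the call above would still not work correctly, because `X_test` lacks our custom `trunc_lat` and `trunc_long` columns; depending on the scikit-learn version, it would either raise an error or silently treat the raw columns as if they were the processed ones.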
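
One way to make this kind of mismatch concrete is to compare the column names the model was trained on with the column names being passed in. The sketch below uses shortened, hand-written column lists (only two of the review columns are included for brevity), and the variable names are hypothetical; in the lab you would compare `X_train_processed.columns` with `X_test.columns` instead.

```python
# Columns of the raw input (shortened for illustration).
raw_columns = [
    "latitude",
    "longitude",
    "review_scores_accuracy",
    "review_scores_value",
]

# Columns the model saw during training (shortened in the same way).
trained_columns = [
    "review_scores_accuracy",
    "review_scores_value",
    "trunc_lat",
    "trunc_long",
    "review_scores_sum",
]

# Columns the model expects but the raw input does not provide.
missing = [c for c in trained_columns if c not in raw_columns]
# Columns the raw input provides but the model was never trained on.
unexpected = [c for c in raw_columns if c not in trained_columns]

print("missing:", missing)
print("unexpected:", unexpected)
```

A check like this can be run before calling `.predict()` to get a clearer message than the error raised by the model itself.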

To fix this, we could manually re-apply the same pre-processing logic to the `X_test` set each time we wish to use our model. 

However, there is a cleaner and more streamlined way to account for our pre-processing steps. We can define a custom model class that automatically pre-processes the raw input it receives before passing that input into the trained model's `.predict()` function. This way, in future applications of our model, we will no longer have to worry about remembering to pre-process every batch of data beforehand.

Complete the `preprocess_input(self, model_input)` helper function of the custom `RF_with_preprocess` class so that the random forest model is always predicting off of a DataFrame with the correct column names and the appropriate number of columns.

In [0]:
# ANSWER
# Define the model class
class RF_with_preprocess(mlflow.pyfunc.PythonModel):

    def __init__(self, trained_rf):
        self.rf = trained_rf

    def preprocess_input(self, model_input):
        '''return pre-processed model_input'''
        model_input["trunc_lat"] = round(model_input["latitude"], 3)
        model_input["trunc_long"] = round(model_input["longitude"], 3)
        model_input["review_scores_sum"] = ( 
          model_input['review_scores_accuracy'] +
          model_input['review_scores_cleanliness'] +
          model_input['review_scores_checkin'] +
          model_input['review_scores_communication'] +
          model_input['review_scores_location'] +
          model_input['review_scores_value']
        )
        model_input = model_input.drop(["latitude", "longitude"], axis=1)
        return model_input
    
    def predict(self, context, model_input):
        processed_model_input = self.preprocess_input(model_input.copy())
        return self.rf.predict(processed_model_input)

Let's save this custom model and then load it back in the `python_function` format.

In [0]:
import shutil

# Construct and save the model
model_path = f"{workingDir}/RF_with_preprocess/".replace("dbfs:", "/dbfs")
shutil.rmtree(model_path, True) # remove folder if already exists

rf_preprocess_model = RF_with_preprocess(trained_rf = rf2)
mlflow.pyfunc.save_model(path=model_path, python_model=rf_preprocess_model)

# Load the model in `python_function` format
loaded_preprocess_model = mlflow.pyfunc.load_model(model_path)

Now we can directly give our loaded model the unmodified `X_test` and have it generate predictions without errors!

In [0]:
# Apply the model
loaded_preprocess_model.predict(X_test)

## Adding Post-Processing Steps

Now suppose we are less interested in a numerical prediction than in a categorical label of `Expensive` or `Not Expensive`, where a listing counts as `Expensive` when its predicted price is above $100. Instead of training an entirely new classification model, we can simply add a post-processing step to our custom model so that it returns the predicted label instead of the numerical price.
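
As a small sketch of the thresholding rule itself, the snippet below labels a few made-up predicted prices with NumPy's `np.where`. This is just one vectorized option; a plain list comprehension works equally well. The array `predicted_prices` is hypothetical.

```python
import numpy as np

# Made-up predicted prices, including the boundary value of exactly 100.
predicted_prices = np.array([85.0, 100.0, 100.01, 240.5])

# Strictly above 100 is "Expensive"; 100 or below is "Not Expensive".
labels = np.where(predicted_prices > 100, "Expensive", "Not Expensive")
print(labels.tolist())
```

Because the comparison is strict, a predicted price of exactly $100 is labeled `Not Expensive`.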

Complete the following model class with **both the previous pre-processing steps and the new `postprocess_result(self, results)`** function so that passing `X_test` to our model returns an `Expensive` or `Not Expensive` label for each row.

In [0]:
# ANSWER
# Define the model class
class RF_with_postprocess(mlflow.pyfunc.PythonModel):

    def __init__(self, trained_rf):
        self.rf = trained_rf

    def preprocess_input(self, model_input):
        '''return pre-processed model_input'''
        model_input["trunc_lat"] = round(model_input["latitude"], 3)
        model_input["trunc_long"] = round(model_input["longitude"], 3)
        model_input["review_scores_sum"] = ( 
          model_input['review_scores_accuracy'] +
          model_input['review_scores_cleanliness'] +
          model_input['review_scores_checkin'] +
          model_input['review_scores_communication'] +
          model_input['review_scores_location'] +
          model_input['review_scores_value']
        )
        model_input = model_input.drop(["latitude", "longitude"], axis=1)
        return model_input
      
    def postprocess_result(self, results):
        '''return post-processed results
        Expensive: predicted price > 100
        Not Expensive: predicted price <= 100'''
        
        return ["Expensive" if result>100 else "Not Expensive" for result in results]
    
    def predict(self, context, model_input):
        processed_model_input = self.preprocess_input(model_input.copy())
        results = self.rf.predict(processed_model_input)
        return self.postprocess_result(results)

Create, save, and apply the model to `X_test`.

In [0]:
# Construct and save the model
model_path = f"{workingDir}/RF_with_postprocess/".replace("dbfs:", "/dbfs")

shutil.rmtree(model_path, True) # remove folder if already exists

rf_postprocess_model = RF_with_postprocess(trained_rf = rf2)
mlflow.pyfunc.save_model(path=model_path, python_model=rf_postprocess_model)

# Load the model in `python_function` format
loaded_postprocess_model = mlflow.pyfunc.load_model(model_path)

# Apply the model
loaded_postprocess_model.predict(X_test)

Given any unmodified raw data, our model can perform the pre-processing steps, apply the trained model, and run the post-processing step, all in one `.predict` function call!

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> See the solutions folder for an example solution to this lab.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [0]:
%run "../Includes/Classroom-Cleanup"

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> All done!</h2>

Thank you for your participation!

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>