
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>




# Pandas UDF Lab

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Learning Objectives:<br>

By the end of this lab, you should be able to:

* Perform model inference at scale using a Pandas UDF created from MLflow

## Lesson Setup

The first thing we're going to do is **run the setup script**. This script defines the required configuration variables, which are scoped to each user.

In [0]:
%run "../Includes/Classroom-Setup"

Python interpreter will be restarted.


Resetting the learning environment:
| No action taken

Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/scalable-machine-learning-with-apache-spark/v02"

Validating the locally installed datasets:
| listing local files...(3 seconds)
| validation completed...(3 seconds total)

Creating & using the schema "charlie_ohara_4mi2_da_sml" in the catalog "hive_metastore"...(0 seconds)

Predefined tables in "charlie_ohara_4mi2_da_sml":
| -none-

Predefined paths variables:
| DA.paths.working_dir: dbfs:/mnt/dbacademy-users/charlie.ohara@standard.ai/scalable-machine-learning-with-apache-spark
| DA.paths.user_db:     dbfs:/mnt/dbacademy-users/charlie.ohara@standard.ai/scalable-machine-learning-with-apache-spark/database.db
| DA.paths.datasets:    dbfs:/mnt/dbacademy-datasets/scalable-machine-learning-with-apache-spark/v02

Setup completed (7 seconds)



## Build Model


In the cell below, we train the same model on the same data set as in the lesson and <a href="https://www.mlflow.org/docs/latest/python_api/mlflow.sklearn.html" target="_blank">autolog</a> metrics, parameters, and models to MLflow.

In [0]:
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Track the training run with MLflow (autologging logs the model; it does not register it)
with mlflow.start_run(run_name="sklearn-random-forest") as run:
    # Enable autologging 
    mlflow.sklearn.autolog(log_input_examples=True, log_model_signatures=True, log_models=True)
    
    # Import the data
    df = pd.read_csv(f"dbfs:/mnt/dbacademy-datasets/scalable-machine-learning-with-apache-spark/v02/airbnb/sf-listings/airbnb-cleaned-mlflow.csv".replace("dbfs:/", "/dbfs/"))

    # Convert integer fields to double: integer columns cannot hold missing values, which would cause a schema error at inference time
    int_cols = df.select_dtypes(include='int64').columns
    df[int_cols] = df[int_cols].astype('float64')

    X_train, X_test, y_train, y_test = train_test_split(df.drop(["price"], axis=1), df[["price"]].values.ravel(), random_state=42)

    # Create model, train it, and create predictions
    rf = RandomForestRegressor(n_estimators=100, max_depth=10)
    rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
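
The integer-to-double conversion above guards against a schema mismatch at inference time. A pandas integer column cannot hold missing values, so the same column arrives as `float64` as soon as one value is missing, and it then no longer matches a model signature inferred from `int64` training data. A minimal sketch with toy data (the `bedrooms` values and the `train`/`scoring` names are made up for illustration and are not part of the lab data):

```python
import numpy as np
import pandas as pd

# Training data without missing values: pandas infers int64.
train = pd.DataFrame({"bedrooms": [1, 2, 3]})

# Scoring data with one missing value: pandas must fall back to float64.
scoring = pd.DataFrame({"bedrooms": [1.0, np.nan, 3.0]})

dtype_before = str(train["bedrooms"].dtype)
dtype_scoring = str(scoring["bedrooms"].dtype)

# Same conversion as in the lab cell: cast every int64 column to float64
# so that training and scoring data share one column type.
int_cols = train.select_dtypes(include="int64").columns
train[int_cols] = train[int_cols].astype("float64")

dtype_after = str(train["bedrooms"].dtype)
print(dtype_before, dtype_scoring, dtype_after)
```

After the cast, the training column type matches what scoring data with gaps would look like, so a signature logged from the training frame describes both.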








Let's convert our Pandas DataFrame to a Spark DataFrame for distributed inference.

In [0]:
spark_df = spark.createDataFrame(X_test)




## MLflow UDF

Here, instead of using **`mlflow.sklearn.load_model(model_path)`**, we would like to use **`mlflow.pyfunc.spark_udf()`**.

This approach can reduce compute and memory overhead because the model is loaded into memory only once per Python process. When we generate predictions for a DataFrame, each Python process reuses its copy of the model instead of loading the same model again for every batch. In some cases this can be even more performant than a Pandas Iterator UDF.
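
The load-once-per-process idea can be illustrated in plain Python. This is only a conceptual sketch, not how MLflow implements **`spark_udf`** internally: `load_model`, `predict_batch`, and the toy "model" are hypothetical stand-ins, and the URI string is a placeholder.

```python
from functools import lru_cache

LOAD_CALLS = 0  # counts how often the (pretend) expensive load actually runs


@lru_cache(maxsize=None)
def load_model(model_uri: str):
    """Hypothetical stand-in for an expensive model load, cached per process."""
    global LOAD_CALLS
    LOAD_CALLS += 1
    # Toy "model": predicts the mean of each row's feature values.
    return lambda rows: [sum(row) / len(row) for row in rows]


def predict_batch(model_uri: str, rows):
    """Score one batch, reusing the cached model after the first call."""
    model = load_model(model_uri)
    return model(rows)


# Two batches scored in the same process trigger only one load.
batches = [[[1.0, 3.0]], [[2.0, 4.0], [10.0, 20.0]]]
results = [predict_batch("runs:/<run_id>/model", batch) for batch in batches]
print(results)     # [[2.0], [3.0, 15.0]]
print(LOAD_CALLS)  # 1
```

The cache is keyed on the model URI, so every batch handled by the same process pays the loading cost at most once.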




In the cell below, fill in the **`model_path`** variable and the **`mlflow.pyfunc.spark_udf`** function. You can refer to this <a href="https://www.mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#mlflow.pyfunc.spark_udf" target="_blank">documentation</a> for help.

In [0]:
import mlflow
from pyspark.sql.functions import struct, col
# Build the model URI from the run started in the training cell above
model_path = f"runs:/{run.info.run_id}/model"

# Load model as a Spark UDF.
predict = mlflow.pyfunc.spark_udf(spark, model_uri=model_path)


2024/03/01 17:33:12 INFO mlflow.models.flavor_backend_registry: Selected backend for flavor 'python_function'






After loading the model using **`mlflow.pyfunc.spark_udf`**, we can now perform model inference at scale.

In the cell below, fill in the blank to use the **`predict`** function you have defined above to predict the price based on the features.

In [0]:
# Apply the UDF to the feature columns and add the result as a "prediction" column

features = X_train.columns
display(spark_df.withColumn("prediction", predict(*features)))

host_total_listings_count,neighbourhood_cleansed,zipcode,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,prediction
1.0,29.0,21.0,37.750853665952526,-122.47896134638864,0.0,0.0,4.0,1.0,0.0,4.0,0.0,2.0,194.0,96.0,10.0,10.0,10.0,10.0,9.0,9.0,136.4791205719536
2.0,12.0,11.0,37.79569442370353,-122.417081972524,1.0,1.0,2.0,1.5,1.0,1.0,0.0,2.0,124.0,99.0,10.0,10.0,10.0,10.0,10.0,10.0,130.2949492754391
2.0,7.0,6.0,37.76393574011793,-122.43001124805248,0.0,1.0,2.0,1.0,1.0,1.0,0.0,5.0,2.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,135.7061457972107
1.0,7.0,0.0,37.76690648031917,-122.43792377044348,1.0,0.0,7.0,2.0,3.0,3.0,0.0,3.0,3.0,93.0,10.0,9.0,10.0,10.0,10.0,10.0,439.3675934317436
1.0,2.0,0.0,37.77491545710221,-122.44027012206556,6.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,21.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,131.79648832122325
39.0,19.0,13.0,37.729883744746296,-122.42672685799468,1.0,1.0,2.0,1.0,1.0,1.0,0.0,30.0,15.0,89.0,8.0,8.0,9.0,9.0,8.0,9.0,45.72749641196101
4.0,30.0,22.0,37.714110738500814,-122.4072828875996,1.0,1.0,2.0,1.0,1.0,1.0,0.0,1.0,20.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,80.87758482183735
54.0,6.0,5.0,37.78663277349695,-122.4085188120046,17.0,1.0,2.0,1.0,0.0,1.0,0.0,1.0,0.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,80.2775773972316
1.0,15.0,4.0,37.78294900804669,-122.38856041539098,0.0,0.0,10.0,2.0,2.0,8.0,0.0,1.0,6.0,93.0,9.0,9.0,8.0,9.0,10.0,8.0,370.07589738288993
2.0,7.0,6.0,37.76852191665309,-122.4278718063526,0.0,1.0,2.0,1.0,1.0,1.0,0.0,2.0,127.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,124.96658663459506



## Classroom Cleanup

Run the following cell to remove lesson-specific assets created during this lesson:

In [0]:
DA.cleanup()

Resetting the learning environment:
| dropping the schema "charlie_ohara_4mi2_da_sml"...(1 seconds)
| removing the working directory "dbfs:/mnt/dbacademy-users/charlie.ohara@standard.ai/scalable-machine-learning-with-apache-spark"...(0 seconds)

Validating the locally installed datasets:
| listing local files...

&copy; 2023 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>