
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>




# Linear Regression: Improving the Model

In this notebook we will add more features to our model and discuss how to handle categorical features.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Learning Objectives:<br>

By the end of this lesson, you should be able to:
* Encode categorical variables using one-hot encoding
* Create a Spark ML Pipeline to fit a model
* Evaluate a model’s performance
* Save and load a model using Spark ML Pipeline

## 📌 Requirements

**Required Databricks Runtime Version:** 
* To run this notebook, you must use the following Databricks Runtime: **12.2.x-cpu-ml-scala2.12**

## Lesson Setup

The first thing we're going to do is **run the setup script**. This script defines the required configuration variables, which are scoped to each user.

In [0]:
%run "./Includes/Classroom-Setup"

Python interpreter will be restarted.


Resetting the learning environment:
| dropping the schema "charlie_ohara_4mi2_da_sml"...(2 seconds)
| removing the working directory "dbfs:/mnt/dbacademy-users/charlie.ohara@standard.ai/scalable-machine-learning-with-apache-spark"...(0 seconds)

Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/scalable-machine-learning-with-apache-spark/v02"

Validating the locally installed datasets:
| listing local files...(3 seconds)
| validation completed...(3 seconds total)

Creating & using the schema "charlie_ohara_4mi2_da_sml" in the catalog "hive_metastore"...(1 seconds)

Predefined tables in "charlie_ohara_4mi2_da_sml":
| -none-

Predefined paths variables:
| DA.paths.working_dir: dbfs:/mnt/dbacademy-users/charlie.ohara@standard.ai/scalable-machine-learning-with-apache-spark
| DA.paths.user_db:     dbfs:/mnt/dbacademy-users/charlie.ohara@standard.ai/scalable-machine-learning-with-apache-spark/database.db
| DA.paths.datasets:    dbfs:/mnt/dbacademy-datasets/scalable-machi

## Load Dataset

In [0]:
print(DA.paths.datasets)

dbfs:/mnt/dbacademy-datasets/scalable-machine-learning-with-apache-spark/v02


In [0]:
file_path = f"{DA.paths.datasets}/airbnb/sf-listings/sf-listings-2019-03-06-clean.delta/"
airbnb_df = spark.read.format("delta").load(file_path)




## Train/Test Split

Let's use the same 80/20 split with the same seed as the previous notebook so that we can compare our results like for like (unless you changed the cluster config!).

In [0]:
train_df, test_df = airbnb_df.randomSplit([.8, .2], seed=42)
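
The caveat about the cluster config can be illustrated with a small simulation. This is a plain-Python sketch, not Spark's implementation: it assumes that rows are sampled partition by partition with a seeded random generator, so a fixed seed only reproduces the same split when the data is partitioned the same way. The helper names `split_partition` and `random_split` are hypothetical.

```python
import random


def split_partition(rows, train_fraction, seed):
    """Assign each row of one partition to train or test using a seeded generator."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


def random_split(rows, num_partitions, train_fraction=0.8, seed=42):
    """Simulated split: rows are dealt round-robin into partitions, then sampled per partition."""
    partitions = [rows[i::num_partitions] for i in range(num_partitions)]
    train, test = [], []
    for idx, partition in enumerate(partitions):
        # Assumption for this sketch: each partition derives its own seed from the base seed.
        part_train, part_test = split_partition(partition, train_fraction, seed + idx)
        train.extend(part_train)
        test.extend(part_test)
    return sorted(train), sorted(test)


rows = list(range(1000))
same_a = random_split(rows, num_partitions=4)
same_b = random_split(rows, num_partitions=4)
other = random_split(rows, num_partitions=8)

print(same_a == same_b)  # True: same seed and same partitioning
print(same_a == other)   # a different partitioning generally yields a different split
```

If the simulation's assumption holds for your cluster, metrics are only comparable across notebooks when both the seed and the partitioning of the data are unchanged.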




## Categorical Variables

There are a few ways to handle categorical features:
* Assign them a numeric value
* Create "dummy" variables (also known as One Hot Encoding)
* Generate embeddings (mainly used for textual data)

### One Hot Encoder
Here, we are going to One Hot Encode (OHE) our categorical variables. Spark doesn't have a **`dummies`** function, and OHE is a two-step process. First, we need to use <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.StringIndexer.html?highlight=stringindexer#pyspark.ml.feature.StringIndexer" target="_blank">StringIndexer</a> to map a string column of labels to an ML column of label indices.

Then, we can apply the <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.OneHotEncoder.html?highlight=onehotencoder#pyspark.ml.feature.OneHotEncoder" target="_blank">OneHotEncoder</a> to the output of the StringIndexer.

In [0]:
from pyspark.ml.feature import StringIndexer

df = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])

indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexed = indexer.fit(df).transform(df)
indexed.show()

+---+--------+-------------+
| id|category|categoryIndex|
+---+--------+-------------+
|  0|       a|          0.0|
|  1|       b|          2.0|
|  2|       c|          1.0|
|  3|       a|          0.0|
|  4|       a|          0.0|
|  5|       c|          1.0|
+---+--------+-------------+
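
To make the two steps concrete, here is a minimal plain-Python sketch that reproduces the index assignment shown above and then one-hot encodes it. It is not Spark code. It assumes the default behaviour of the two Spark classes: StringIndexer gives index 0 to the most frequent label (ties broken alphabetically), and OneHotEncoder drops the last category, so that category is represented by an all-zero vector. The function names `fit_string_index` and `one_hot` are hypothetical.

```python
from collections import Counter


def fit_string_index(labels):
    """Map each label to an index: most frequent first, ties broken alphabetically."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a list of 0s and 1s; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if int(index) < size:
        vector[int(index)] = 1.0
    return vector


categories = ["a", "b", "c", "a", "a", "c"]
mapping = fit_string_index(categories)
print(mapping)  # {'a': 0.0, 'c': 1.0, 'b': 2.0}

for label in categories:
    index = mapping[label]
    print(label, index, one_hot(index, len(mapping)))
```

With three categories and the last one dropped, each row becomes a vector of length two: `a` is `[1.0, 0.0]`, `c` is `[0.0, 1.0]`, and `b` is `[0.0, 0.0]`.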



In [0]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer

categorical_cols = [field for (field, dataType) in train_df.dtypes if dataType == "string"]
index_output_cols = [x + "Index" for x in categorical_cols]
ohe_output_cols = [x + "OHE" for x in categorical_cols]

# handleInvalid="skip" drops rows whose category is invalid (for example, a label not seen when the indexer was fit)
string_indexer = StringIndexer(inputCols=categorical_cols, outputCols=index_output_cols, handleInvalid="skip")
# OneHotEncoder requires numeric input and does not accept strings,
# so the StringIndexer runs first to map each string to a numeric index, which is then one-hot encoded
ohe_encoder = OneHotEncoder(inputCols=index_output_cols, outputCols=ohe_output_cols)




## Vector Assembler

Now we can combine our OHE categorical features with our numeric features.

In [0]:
from pyspark.ml.feature import VectorAssembler

# exclude price, since it is the label our regression model is trying to predict
numeric_cols = [field for (field, dataType) in train_df.dtypes if dataType == "double" and field != "price"]
# names of the columns to feed into the VectorAssembler
assembler_inputs = ohe_output_cols + numeric_cols
print(assembler_inputs)
# VectorAssembler combines the listed columns into a single vector column,
# which is the input format Spark ML estimators expect for features
vec_assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="features")

['host_is_superhostOHE', 'cancellation_policyOHE', 'instant_bookableOHE', 'neighbourhood_cleansedOHE', 'property_typeOHE', 'room_typeOHE', 'bed_typeOHE', 'host_total_listings_count', 'latitude', 'longitude', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'minimum_nights', 'number_of_reviews', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'bedrooms_na', 'bathrooms_na', 'beds_na', 'review_scores_rating_na', 'review_scores_accuracy_na', 'review_scores_cleanliness_na', 'review_scores_checkin_na', 'review_scores_communication_na', 'review_scores_location_na', 'review_scores_value_na']
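
Conceptually, the assembler concatenates the listed columns, in order, into one flat vector per row. The sketch below shows that idea in plain Python. It is not the VectorAssembler implementation, the `assemble` helper is hypothetical, and the row values are made up for illustration.

```python
def assemble(row, input_cols):
    """Concatenate scalar and list-valued columns into one flat feature list."""
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            # A one-hot encoded column contributes one slot per vector element.
            features.extend(float(v) for v in value)
        else:
            features.append(float(value))
    return features


# Made-up example row: one OHE column plus two numeric columns (price is left out).
row = {"room_typeOHE": [0.0, 1.0], "bedrooms": 2.0, "bathrooms": 1.0, "price": 200.0}
features = assemble(row, ["room_typeOHE", "bedrooms", "bathrooms"])
print(features)  # [0.0, 1.0, 2.0, 1.0]
```

Because the OHE columns come first in `assembler_inputs`, their slots occupy the start of the vector and the numeric columns follow.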





## Linear Regression

Now that we have all of our features, let's build a linear regression model.

In [0]:
from pyspark.ml.regression import LinearRegression

# this only defines the model; nothing is trained until fit() is called
# featuresCol is the vector column produced by vec_assembler, and labelCol is the column we want to predict (price)
# predictions are written to a separate column, named "prediction" by default
lr = LinearRegression(labelCol="price", featuresCol="features")




## Pipeline

Let's put all these stages in a Pipeline. A <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.Pipeline.html?highlight=pipeline#pyspark.ml.Pipeline" target="_blank">Pipeline</a> is a way of organizing all of our transformers and estimators.

This way, we don't have to worry about remembering the same ordering of transformations to apply to our test dataset.

In [0]:
from pyspark.ml import Pipeline

# these are the steps needed to create a model, in order:
# 1. StringIndexer converts string columns into numeric indices
# 2. OneHotEncoder turns each index into a sparse vector of 0s and 1s
# 3. VectorAssembler combines the one-hot vectors and the original numeric columns into the features column
# 4. LinearRegression is the estimator; calling fit() on the pipeline trains it on the training set and returns a fitted model
stages = [string_indexer, ohe_encoder, vec_assembler, lr]
pipeline = Pipeline(stages=stages)

pipeline_model = pipeline.fit(train_df)
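
The fit/transform pattern behind a Pipeline can be sketched in plain Python. The toy classes below are hypothetical and are not Spark's implementation. They only illustrate that each stage is fit on the output of the previous stage, and that the fitted stages are later replayed in the same order on new data. The room types and prices are made up.

```python
class ToyIndexer:
    """Toy estimator: learns a label -> index mapping from the training rows."""

    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def fit(self, rows):
        labels = sorted({row[self.input_col] for row in rows})
        mapping = {label: i for i, label in enumerate(labels)}
        return ToyIndexerModel(self.input_col, self.output_col, mapping)


class ToyIndexerModel:
    def __init__(self, input_col, output_col, mapping):
        self.input_col, self.output_col, self.mapping = input_col, output_col, mapping

    def transform(self, rows):
        return [{**row, self.output_col: self.mapping[row[self.input_col]]} for row in rows]


class ToyGroupMean:
    """Toy estimator: predicts the mean training label of each group."""

    def __init__(self, group_col, label_col):
        self.group_col, self.label_col = group_col, label_col

    def fit(self, rows):
        groups = {}
        for row in rows:
            groups.setdefault(row[self.group_col], []).append(row[self.label_col])
        means = {group: sum(vals) / len(vals) for group, vals in groups.items()}
        return ToyGroupMeanModel(self.group_col, means)


class ToyGroupMeanModel:
    def __init__(self, group_col, means):
        self.group_col, self.means = group_col, means

    def transform(self, rows):
        return [{**row, "prediction": self.means[row[self.group_col]]} for row in rows]


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows)
            rows = model.transform(rows)  # the next stage sees the transformed rows
            fitted.append(model)
        return ToyPipelineModel(fitted)


class ToyPipelineModel:
    def __init__(self, models):
        self.models = models

    def transform(self, rows):
        for model in self.models:  # same stages, same order as during fit
            rows = model.transform(rows)
        return rows


train = [
    {"room_type": "Private room", "price": 100.0},
    {"room_type": "Entire home/apt", "price": 200.0},
    {"room_type": "Private room", "price": 120.0},
]
test = [{"room_type": "Private room", "price": 90.0}]

stages = [ToyIndexer("room_type", "room_typeIndex"), ToyGroupMean("room_typeIndex", "price")]
toy_model = ToyPipeline(stages).fit(train)
print(toy_model.transform(test))
```

The fitted toy model carries the mapping learned from the training rows, so the test row is indexed the same way without repeating any of the steps by hand.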




## Saving Models

We can save our models to persistent storage (e.g. DBFS) in case our cluster goes down so we don't have to recompute our results.

In [0]:
# save to the user's working directory, the same location PipelineModel.load reads from below
dbfs_path = DA.paths.working_dir

In [0]:
pipeline_model.write().overwrite().save(dbfs_path)


In [0]:
%fs ls dbfs:/mnt/dbacademy-users/charlie.ohara@standard.ai/scalable-machine-learning-with-apache-spark/metadata/part-00000

path,name,size,modificationTime
dbfs:/mnt/dbacademy-users/charlie.ohara@standard.ai/scalable-machine-learning-with-apache-spark/metadata/part-00000,part-00000,299,1706898457715





## Loading models

When you load in models, you need to know the type of model you are loading back in (was it a linear regression or logistic regression model?).

For this reason, we recommend you always put your transformers/estimators into a Pipeline, so you can always load the generic `PipelineModel` back in.

In [0]:
from pyspark.ml import PipelineModel

saved_pipeline_model = PipelineModel.load(DA.paths.working_dir)




## Apply the Model to Test Set

In [0]:
# now we can use transform to get predictions based on our test data 
pred_df = saved_pipeline_model.transform(test_df)

display(pred_df.select("features", "price", "prediction"))

features,price,prediction
"Map(vectorType -> sparse, length -> 99, indices -> List(0, 3, 6, 12, 43, 68, 69, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 37.73615, -122.41245, 2.0, 1.0, 1.0, 2.0, 1.0, 194.0, 91.0, 9.0, 9.0, 10.0, 10.0, 9.0, 9.0))",86.0,42.87569095191793
"Map(vectorType -> sparse, length -> 99, indices -> List(0, 3, 6, 11, 45, 67, 69, 73, 74, 75, 76, 77, 78, 79, 80, 82, 83, 84, 85, 86, 87, 88, 92, 93, 94, 95, 96, 97, 98), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 37.76702, -122.43518, 2.0, 1.0, 1.0, 1.0, 3.0, 98.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))",190.0,207.7985705134852
"Map(vectorType -> sparse, length -> 99, indices -> List(0, 3, 6, 28, 42, 68, 69, 73, 74, 75, 76, 77, 78, 79, 80, 82, 83, 84, 85, 86, 87, 88, 92, 93, 94, 95, 96, 97, 98), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 37.78424, -122.39925, 2.0, 1.0, 1.0, 1.0, 180.0, 98.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))",100.0,66.44237123080529
"Map(vectorType -> sparse, length -> 99, indices -> List(0, 3, 6, 19, 43, 67, 69, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 37.7787, -122.4554, 4.0, 2.0, 2.0, 2.0, 3.0, 6.0, 100.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0))",325.0,310.55147643214514
"Map(vectorType -> sparse, length -> 99, indices -> List(0, 3, 6, 17, 43, 68, 69, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 37.79256, -122.42135, 1.0, 1.0, 1.0, 1.0, 140.0, 2.0, 60.0, 7.0, 6.0, 8.0, 8.0, 9.0, 7.0))",200.0,14.25784629349073
"Map(vectorType -> sparse, length -> 99, indices -> List(0, 3, 6, 14, 42, 67, 69, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 37.75369, -122.42577, 2.0, 1.0, 1.0, 1.0, 30.0, 2.0, 100.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0))",200.0,159.0941630804955
"Map(vectorType -> sparse, length -> 99, indices -> List(0, 3, 6, 24, 43, 68, 69, 73, 74, 75, 76, 77, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 37.71969, -122.44378, 2.0, 1.0, 2.0, 1.0, 24.0, 86.0, 9.0, 9.0, 10.0, 10.0, 9.0, 9.0))",80.0,-19.64792436422431
"Map(vectorType -> sparse, length -> 99, indices -> List(0, 3, 6, 23, 42, 68, 69, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 90), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 37.79586, -122.43035, 1.0, 1.0, 1.0, 1.0, 30.0, 1.0, 80.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 1.0))",160.0,161.33811082075044
"Map(vectorType -> sparse, length -> 99, indices -> List(0, 3, 6, 9, 42, 67, 69, 73, 74, 75, 76, 77, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 37.7752, -122.43765, 3.0, 1.0, 1.0, 90.0, 6.0, 100.0, 9.0, 9.0, 10.0, 10.0, 10.0, 9.0))",132.0,112.37330010167352
"Map(vectorType -> sparse, length -> 99, indices -> List(0, 3, 6, 9, 44, 68, 69, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 37.77814, -122.44079, 2.0, 1.0, 1.0, 1.0, 3.0, 5.0, 100.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0))",100.0,215.7629068979477


In [0]:
display(pred_df.select("price", "prediction"))

price,prediction
86.0,42.87569095191793
190.0,207.7985705134852
100.0,66.44237123080529
325.0,310.55147643214514
200.0,14.25784629349073
200.0,159.0941630804955
80.0,-19.64792436422431
160.0,161.33811082075044
132.0,112.37330010167352
100.0,215.7629068979477
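
The `features` column displayed earlier is stored as a sparse vector: only the non-zero slots are kept, as parallel lists of `indices` and `values`, and every other slot is implicitly zero. The plain-Python sketch below expands the first displayed row into its dense form of length 99. The `to_dense` helper is hypothetical and is not a Spark API.

```python
def to_dense(length, indices, values):
    """Expand a sparse (indices, values) pair into a dense list of the given length."""
    dense = [0.0] * length
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


# First row of the features output: 23 non-zero slots out of 99.
indices = [0, 3, 6, 12, 43, 68, 69] + list(range(73, 89))
values = [1.0] * 8 + [37.73615, -122.41245, 2.0, 1.0, 1.0, 2.0, 1.0,
                      194.0, 91.0, 9.0, 9.0, 10.0, 10.0, 9.0, 9.0]

dense = to_dense(99, indices, values)
print(len(dense))            # 99
print(dense[74], dense[75])  # 37.73615 -122.41245
```

Slots 74 and 75 hold the listing's latitude and longitude, which matches the column order passed to the assembler: the one-hot columns fill the first slots and the numeric columns follow.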






## Evaluate the Model

![](https://files.training.databricks.com/images/r2d2.jpg) How is our R2 doing?

In [0]:
from pyspark.ml.evaluation import RegressionEvaluator

regression_evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="price", metricName="rmse")

rmse = regression_evaluator.evaluate(pred_df)
r2 = regression_evaluator.setMetricName("r2").evaluate(pred_df)
print(f"RMSE is {rmse}")
# R2: 1 is a perfect prediction, 0 is no better than always predicting the mean price, and worse models can go negative
print(f"R2 is {r2}")

RMSE is 133.4629525945939
R2 is 0.4416148762122888
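
For reference, both metrics can be written out by hand. The sketch below uses the standard definitions, RMSE as the square root of the mean squared error and R2 as one minus the ratio of residual to total sum of squares. It runs on made-up prices, is plain Python, and is not the RegressionEvaluator implementation.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error: typical size of an error, in the units of the label."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(labels))


def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean) ** 2 for y in labels)
    return 1 - ss_res / ss_tot


prices = [100.0, 200.0, 300.0]

print(rmse(prices, [110.0, 190.0, 310.0]))  # 10.0
print(r2(prices, [200.0, 200.0, 200.0]))    # 0.0, always predicting the mean
print(r2(prices, [300.0, 200.0, 100.0]))    # -3.0, worse than predicting the mean
```

Read this way, an RMSE of about 133 means predictions are typically off by roughly that many units of price, and an R2 of about 0.44 means the model explains a bit under half of the variance in price on the test set.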





As you can see, our RMSE decreased when compared to the model without one-hot encoding that we trained in the previous notebook, and the R2 increased as well!


## Classroom Cleanup

Run the following cell to remove lesson-specific assets created during this lesson:

In [0]:
DA.cleanup()

Resetting the learning environment:
| dropping the schema "charlie_ohara_4mi2_da_sml"...(0 seconds)
| removing the working directory "dbfs:/mnt/dbacademy-users/charlie.ohara@standard.ai/scalable-machine-learning-with-apache-spark"...(0 seconds)

Validating the locally installed datasets:
| listing local files...(2 seconds)
| validation completed...(2 seconds total)


&copy; 2023 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>