
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>




# Regression: Predicting Rental Price

In this notebook, we will use the dataset we cleansed in the previous lab to predict Airbnb rental prices in San Francisco.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Learning Objectives:

By the end of this lesson, you should be able to:
* Use Spark ML to build a linear regression model
* Identify the differences between estimators and transformers in Spark ML

## 📌 Requirements

**Required Databricks Runtime Version:** 
* Please note that in order to run this notebook, you must use one of the following Databricks Runtime(s): **12.2.x-cpu-ml-scala2.12**

## Lesson Setup

The first thing we're going to do is **run the setup script**. This script defines the required configuration variables, which are scoped to each user.

In [0]:
%run "./Includes/Classroom-Setup"

Python interpreter will be restarted.


Resetting the learning environment:
| dropping the schema "charlie_ohara_4mi2_da_sml"...(1 seconds)
| removing the working directory "dbfs:/mnt/dbacademy-users/charlie.ohara@standard.ai/scalable-machine-learning-with-apache-spark"...(0 seconds)

Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/scalable-machine-learning-with-apache-spark/v02"

Validating the locally installed datasets:
| listing local files...(3 seconds)
| validation completed...(3 seconds total)

Creating & using the schema "charlie_ohara_4mi2_da_sml" in the catalog "hive_metastore"...(0 seconds)

Predefined tables in "charlie_ohara_4mi2_da_sml":
| -none-

Predefined paths variables:
| DA.paths.working_dir: dbfs:/mnt/dbacademy-users/charlie.ohara@standard.ai/scalable-machine-learning-with-apache-spark
| DA.paths.user_db:     dbfs:/mnt/dbacademy-users/charlie.ohara@standard.ai/scalable-machine-learning-with-apache-spark/database.db
| DA.paths.datasets:    dbfs:/mnt/dbacademy-datasets/scalable-machi

## Load Dataset

In [0]:
file_path = f"{DA.paths.datasets}/airbnb/sf-listings/sf-listings-2019-03-06-clean.delta/"
airbnb_df = spark.read.format("delta").load(file_path)




## Train/Test Split

![](https://files.training.databricks.com/images/301/TrainTestSplit.png)

**Question**: Why is it necessary to set a seed? What happens if I change my cluster configuration?

In [0]:
train_df, test_df = airbnb_df.randomSplit([.8, .2], seed=42)
print(train_df.cache().count())

5786





Let's change the number of partitions (to simulate a different cluster configuration) and see whether we get the same number of data points in our training set.

In [0]:
train_repartition_df, test_repartition_df = (airbnb_df
                                             .repartition(24) # will cause a shuffle of the data
                                             .randomSplit([.8, .2], seed=42))

print(train_repartition_df.count())

5736
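
The two counts differ (5786 vs. 5736) even though the seed is the same. A likely explanation is that the random draws happen partition by partition, so which rows land in the training set depends on how the rows are laid out across partitions as well as on the seed. The sketch below is a simplified pure-Python simulation of that idea, not Spark's actual implementation; `split_by_partition`, the round-robin layout, and the per-partition seeding rule are all assumptions made for illustration.

```python
import random

def split_by_partition(rows, num_partitions, train_fraction=0.8, seed=42):
    """Toy stand-in for a seeded random split over partitioned data.

    Assumption: rows are laid out round-robin, and each partition gets
    its own generator seeded with seed + partition index.
    """
    partitions = [rows[i::num_partitions] for i in range(num_partitions)]
    train = []
    for index, partition in enumerate(partitions):
        rng = random.Random(seed + index)
        for row in partition:
            # Keep the row for training with probability train_fraction
            if rng.random() < train_fraction:
                train.append(row)
    return sorted(train)

rows = list(range(1000))
train_8 = split_by_partition(rows, num_partitions=8)
train_8_again = split_by_partition(rows, num_partitions=8)
train_24 = split_by_partition(rows, num_partitions=24)

print(train_8 == train_8_again)  # same layout and seed: identical split
print(train_8 == train_24)       # different layout, same seed: split changes
print(len(train_8), len(train_24))
```

One way to guard against this is to perform the split once, write the two DataFrames out, and reload them, rather than relying on the seed alone when the cluster configuration may change.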





## Linear Regression

We are going to build a very simple model predicting **`price`** just given the number of **`bedrooms`**.

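**Question**: What are some assumptions of the linear regression model? The main one is that a linear relationship exists between the features and the label; others include independent observations and errors with constant variance.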

In [0]:
display(train_df.select("price", "bedrooms"))

price,bedrooms
85.0,1.0
45.0,1.0
128.0,1.0
100.0,1.0
250.0,1.0
250.0,2.0
125.0,0.0
80.0,1.0
72.0,1.0
150.0,2.0



In [0]:
display(train_df.select("price", "bedrooms").summary()) # large skew with outliers 

summary,price,bedrooms
count,5786.0,5786.0
mean,215.2701348081576,1.3370203940546146
stddev,335.00495198272256,0.9336511382658126
min,10.0,0.0
25%,100.0,1.0
50%,150.0,1.0
75%,235.0,2.0
max,10000.0,14.0


In [0]:
display(train_df)

host_is_superhost,cancellation_policy,instant_bookable,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,bedrooms_na,bathrooms_na,beds_na,review_scores_rating_na,review_scores_accuracy_na,review_scores_cleanliness_na,review_scores_checkin_na,review_scores_communication_na,review_scores_location_na,review_scores_value_na
f,flexible,f,1.0,Bayview,37.72001,-122.39249,House,Entire home/apt,2.0,1.0,1.0,1.0,Real Bed,2.0,128.0,97.0,10.0,10.0,10.0,10.0,9.0,10.0,85.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Bayview,37.7325,-122.39221,House,Private room,1.0,1.0,1.0,1.0,Real Bed,31.0,0.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,45.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
f,flexible,f,1.0,Bernal Heights,37.73905,-122.41269,Apartment,Private room,1.0,1.0,1.0,1.0,Real Bed,30.0,1.0,80.0,10.0,8.0,10.0,10.0,8.0,10.0,128.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Bernal Heights,37.7422,-122.42091,Guest suite,Private room,4.0,1.0,1.0,3.0,Real Bed,3.0,49.0,95.0,10.0,10.0,10.0,10.0,10.0,9.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Bernal Heights,37.74552,-122.41195,Apartment,Entire home/apt,2.0,2.0,1.0,1.0,Real Bed,2.0,4.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,250.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Financial District,37.7842,-122.39925,Apartment,Entire home/apt,4.0,2.0,2.0,2.0,Real Bed,183.0,3.0,74.0,6.0,6.0,4.0,10.0,10.0,8.0,250.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Glen Park,37.74185,-122.42977,Apartment,Entire home/apt,3.0,1.0,0.0,2.0,Real Bed,30.0,0.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,125.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
f,flexible,f,1.0,Haight Ashbury,37.76637,-122.4467,House,Private room,2.0,1.0,1.0,1.0,Real Bed,7.0,50.0,96.0,10.0,10.0,10.0,10.0,10.0,10.0,80.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Haight Ashbury,37.77407,-122.44556,Condominium,Private room,2.0,1.0,1.0,1.0,Real Bed,1.0,2.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,72.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Inner Richmond,37.77777,-122.45531,House,Entire home/apt,4.0,2.0,2.0,2.0,Real Bed,30.0,74.0,96.0,10.0,10.0,10.0,10.0,10.0,10.0,150.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0





There do appear to be some outliers in the price column of our dataset ($10,000 a night??). Keep this in mind when we build our models.
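
To see why a single extreme listing matters, the sketch below takes the ten prices displayed earlier and appends one hypothetical $10,000 listing. The mean jumps while the median barely moves, which is the same pattern as in the summary table (mean about 215, median 150).

```python
from statistics import mean, median

# The ten prices shown in the display(train_df.select("price", "bedrooms")) output
sample_prices = [85.0, 45.0, 128.0, 100.0, 250.0, 250.0, 125.0, 80.0, 72.0, 150.0]

# Hypothetical: add one listing priced like the maximum in the summary table
with_outlier = sample_prices + [10000.0]

print(mean(sample_prices), median(sample_prices))
print(mean(with_outlier), median(with_outlier))
```

Squared-error losses react to outliers in much the same way the mean does, which is one reason to keep them in mind when judging RMSE later.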

We will use <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.LinearRegression.html?highlight=linearregression#pyspark.ml.regression.LinearRegression" target="_blank">LinearRegression</a> to build our first model.

The call to **`.fit()`** below will fail because the Linear Regression estimator expects a vector of values as input. We will fix that with VectorAssembler in the next section.

In [0]:
from pyspark.ml.regression import LinearRegression

# labelCol = target = the value we want to predict
# featuresCol = input = the column the model uses to generate the price prediction
# both are configuration settings passed to the estimator
lr = LinearRegression(featuresCol="bedrooms", labelCol="price")
print(lr.explainParams())

aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
epsilon: The shape parameter to control the amount of robustness. Must be > 1.0. Only valid when loss is huber (default: 1.35)
featuresCol: features column name. (default: features, current: bedrooms)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label, current: price)
loss: The loss function to be optimized. Supported options: squaredError, huber. (default: squaredError)
maxBlockSizeInMB: maximum memory in MB for stacking input data into blocks. Data is stacked within partitions. If more than remaining data size in a partition then it is adjusted to the data size. Default 0.0 represents choosing optimal value, depends on specific algorithm. Must be >= 0. (default: 0.0)
maxIter: max number of itera

In [0]:
# fit = learn how rental prices change with the number of bedrooms, using the train_df data
# lr_model is like the brain after it has been taught
lr_model = lr.fit(train_df)

---------------------------------------------------------------------------
IllegalArgumentException                  Traceback (most recent call last)
File <command-893755963985714>:3
      1 # fit = learn how rental prices change with the number of bedrooms, using the train_df data
      2 # lr_model is like the brain after it has been taught
----> 3 lr_model = lr.fit(train_df)

File /databricks/python_shell/dbruntime/MLWorkloadsInstrumentation/_pyspark.py:30, in _create_patch_function.<locals>.patched_method(self, *args, **kwargs)
     28 call_succeeded = False
     29 try:
---> 30     result = original_method(s




## Vector Assembler

What went wrong? Turns out that the Linear Regression **estimator** (**`.fit()`**) expected a column of Vector type as input.

We can easily get the values from the **`bedrooms`** column into a single vector using <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.VectorAssembler.html?highlight=vectorassembler#pyspark.ml.feature.VectorAssembler" target="_blank">VectorAssembler</a>. VectorAssembler is an example of a **transformer**. Transformers take in a DataFrame and return a new DataFrame with one or more columns appended to it. They do not learn from your data; they apply rule-based transformations.

You can see an example of how to use VectorAssembler in the <a href="https://spark.apache.org/docs/latest/ml-features.html#vectorassembler" target="_blank">ML Programming Guide</a>.
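
The difference between a transformer and an estimator can be sketched in plain Python. The classes below are hypothetical toy stand-ins, not the Spark ML API: a transformer only exposes `transform()`, while an estimator exposes `fit()`, which learns from the data and returns a fitted model that is itself a transformer.

```python
class ToyAssembler:
    """Transformer: applies a fixed rule, learns nothing from the data."""

    def __init__(self, input_cols, output_col):
        self.input_cols = input_cols
        self.output_col = output_col

    def transform(self, rows):
        # Return new rows with one extra column holding the packed values
        return [{**row, self.output_col: [row[c] for c in self.input_cols]}
                for row in rows]


class ToyMeanModel:
    """Fitted model: a transformer holding what the estimator learned."""

    def __init__(self, mean_label):
        self.mean_label = mean_label

    def transform(self, rows):
        # Append the learned value as a "prediction" column
        return [{**row, "prediction": self.mean_label} for row in rows]


class ToyMeanEstimator:
    """Estimator: fit() learns from the data and returns a model."""

    def __init__(self, label_col):
        self.label_col = label_col

    def fit(self, rows):
        labels = [row[self.label_col] for row in rows]
        return ToyMeanModel(sum(labels) / len(labels))


rows = [{"bedrooms": 1.0, "price": 100.0}, {"bedrooms": 2.0, "price": 200.0}]
assembled = ToyAssembler(["bedrooms"], "features").transform(rows)
model = ToyMeanEstimator("price").fit(assembled)
predictions = model.transform(assembled)
print(predictions[0])
```

The toy estimator learns only the mean label, which keeps the example short; the point is the shape of the contract (`fit()` returns something with `transform()`), not the model itself.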

In [0]:
from pyspark.ml.feature import VectorAssembler

# here we pack a single column into a vector, but inputCols can list several columns, which are combined into one vector column
vec_assembler = VectorAssembler(inputCols=["bedrooms"], outputCol="features")

# transform() appends the features column; its type is VectorUDT, a user-defined type (UDT)
vec_train_df = vec_assembler.transform(train_df)
vec_train_df.select("features").show(truncate=False)

+--------+
|features|
+--------+
|[1.0]   |
|[1.0]   |
|[1.0]   |
|[1.0]   |
|[1.0]   |
|[2.0]   |
|[0.0]   |
|[1.0]   |
|[1.0]   |
|[2.0]   |
|[1.0]   |
|[1.0]   |
|[1.0]   |
|[1.0]   |
|[2.0]   |
|[0.0]   |
|[1.0]   |
|[1.0]   |
|[2.0]   |
|[1.0]   |
+--------+
only showing top 20 rows



In [0]:
# here we define our estimator as linear regression, specifying to use the features column to predict the price
lr = LinearRegression(featuresCol="features", labelCol="price")
# we then pass in the training data, which includes a mapping of the feature to the actual price
# the model essentially creates a formula where the input is the feature(s) x and the output is the price y
lr_model = lr.fit(vec_train_df)




## Inspect the Model

In [0]:
m = lr_model.coefficients[0] # slope: predicted price increase per additional bedroom
b = lr_model.intercept # intercept: predicted price for a listing with 0 bedrooms

print(f"The formula for the linear regression line is y = {m:.2f}x + {b:.2f}")

The formula for the linear regression line is y = 123.54x + 50.09
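
For a single feature, the least-squares slope and intercept have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below computes both in plain Python on the ten rows displayed earlier. `fit_simple_linear_regression` is a hypothetical helper, and nothing here claims that Spark's solver works this way internally. Because it only sees ten rows, its result will not match the coefficients printed above.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of co-deviations / sum of squared x deviations
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    # The fitted line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# The ten (bedrooms, price) rows shown in the earlier display output
sample_bedrooms = [1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 0.0, 1.0, 1.0, 2.0]
sample_prices = [85.0, 45.0, 128.0, 100.0, 250.0, 250.0, 125.0, 80.0, 72.0, 150.0]

sample_m, sample_b = fit_simple_linear_regression(sample_bedrooms, sample_prices)
print(f"y = {sample_m:.2f}x + {sample_b:.2f}")
```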


In [0]:
# pValues test, for each term, the null hypothesis that its true coefficient is zero
# low p-value = strong evidence that the number of bedrooms is related to price = statistically significant
# high p-value = the data cannot rule out a zero coefficient, so bedrooms may not help predict price
# the last element is the p-value for the intercept (when fitIntercept=True)
lr_model.summary.pValues

Out[17]: [0.0, 4.495293026707259e-12]




## Apply Model to Test Set

In [0]:
vec_test_df = vec_assembler.transform(test_df)

# lr_model holds the fitted formula
# transform() applies that formula to each row's features vector to generate the predicted price
# and appends the result as a new "prediction" column
pred_df = lr_model.transform(vec_test_df)

# then we can compare the prediction to the actual price
# every listing with the same number of bedrooms gets the same prediction
pred_df.select("bedrooms", "price", "prediction").show()

+--------+-----+------------------+
|bedrooms|price|        prediction|
+--------+-----+------------------+
|     1.0| 86.0|173.63467134146967|
|     1.0|190.0|173.63467134146967|
|     1.0|100.0|173.63467134146967|
|     2.0|325.0| 297.1745644790371|
|     1.0|200.0|173.63467134146967|
|     1.0|200.0|173.63467134146967|
|     0.0| 80.0|50.094778203902216|
|     1.0|160.0|173.63467134146967|
|     0.0|132.0|50.094778203902216|
|     1.0|100.0|173.63467134146967|
|     1.0|165.0|173.63467134146967|
|     1.0| 90.0|173.63467134146967|
|     1.0| 73.0|173.63467134146967|
|     0.0|119.0|50.094778203902216|
|     1.0| 80.0|173.63467134146967|
|     1.0| 84.0|173.63467134146967|
|     1.0|119.0|173.63467134146967|
|     1.0|555.0|173.63467134146967|
|     1.0|200.0|173.63467134146967|
|     0.0| 60.0|50.094778203902216|
+--------+-----+------------------+
only showing top 20 rows
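
Each prediction in the table is the fitted line evaluated at that row's number of bedrooms. As a quick check, the sketch below plugs 0, 1, and 2 bedrooms into the rounded formula printed earlier (y = 123.54x + 50.09). `predict_price` is a hypothetical helper, and because the coefficients are rounded, the results agree with the table only to two decimal places.

```python
# Rounded coefficients from the formula printed in "Inspect the Model"
SLOPE = 123.54
INTERCEPT = 50.09

def predict_price(bedrooms):
    """Evaluate the fitted line y = m * x + b for one listing."""
    return SLOPE * bedrooms + INTERCEPT

for n_bedrooms in (0.0, 1.0, 2.0):
    print(n_bedrooms, round(predict_price(n_bedrooms), 2))
```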



In [0]:
display(pred_df.select("price", "prediction"))

price,prediction
86.0,173.63467134146967
190.0,173.63467134146967
100.0,173.63467134146967
325.0,297.1745644790371
200.0,173.63467134146967
200.0,173.63467134146967
80.0,50.094778203902216
160.0,173.63467134146967
132.0,50.094778203902216
100.0,173.63467134146967






## Evaluate the Model

Let's see how our linear regression model with just one variable does. Does it beat our baseline model?

In [0]:
from pyspark.ml.evaluation import RegressionEvaluator

# configure the evaluator to compare the prediction column against the actual price column
regression_evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="price", metricName="rmse")

rmse = regression_evaluator.evaluate(pred_df)
print(f"RMSE is {rmse}") # ~150

RMSE is 149.54529126695462


In [0]:
print(regression_evaluator.explainParams())

labelCol: label column name. (default: label, current: price)
metricName: metric name in evaluation - one of:
                       rmse - root mean squared error (default)
                       mse - mean squared error
                       r2 - r^2 metric
                       mae - mean absolute error
                       var - explained variance. (default: rmse, current: rmse)
predictionCol: prediction column name. (default: prediction, current: prediction)
throughOrigin: whether the regression is through the origin. (default: False)
weightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0. (undefined)


In [0]:
regression_evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="price", metricName="r2")
r2 = regression_evaluator.evaluate(pred_df)
print(f"r2 is {r2}") # ~0.3 = not great: 1 is perfect, 0 is no better than always predicting the mean price

r2 is 0.29893567394162723
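
Both metrics are simple functions of the residuals. The sketch below spells out the standard definitions in plain Python; the helper names are hypothetical, and this is not a description of how `RegressionEvaluator` is implemented. RMSE is the square root of the mean squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean label.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error: typical size of a miss, in label units."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))

def r2(labels, predictions):
    """R^2 = 1 - SS_res / SS_tot; 0 matches always predicting the mean."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

# First five (price, prediction) pairs from the pred_df output shown earlier
sample_labels = [86.0, 190.0, 100.0, 325.0, 200.0]
sample_predictions = [173.63467134146967, 173.63467134146967, 173.63467134146967,
                      297.1745644790371, 173.63467134146967]

print(rmse(sample_labels, sample_predictions))
print(r2(sample_labels, sample_predictions))
```

Under this definition, a model that always predicts the mean label scores an R² of 0, so the value of about 0.30 reported above means the one-feature model explains roughly 30% of the variance in the test prices.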





Wahoo! Our RMSE is better than our baseline model. However, it's still not that great. Let's see how we can further decrease it in future notebooks.


## Classroom Cleanup

Run the following cell to remove the lesson-specific assets created during this lesson:

In [0]:
DA.cleanup()

Resetting the learning environment:
| dropping the schema "charlie_ohara_4mi2_da_sml"...(1 seconds)
| removing the working directory "dbfs:/mnt/dbacademy-users/charlie.ohara@standard.ai/scalable-machine-learning-with-apache-spark"...(0 seconds)

Validating the locally installed datasets:
| listing local files...(2 seconds)
| validation completed...(2 seconds total)


&copy; 2023 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>