
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>




# Linear Regression Lab

In the previous lesson, we predicted price using just one variable: bedrooms. Now, we want to predict price given a few other features.

Steps:
1. Use the features: **`bedrooms`**, **`bathrooms`**, **`bathrooms_na`**, **`minimum_nights`**, and **`number_of_reviews`** as input to your VectorAssembler.
1. Build a linear regression model.
1. Evaluate the **`RMSE`** and the **`R2`**.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Learning Objectives

By the end of this lab, you should be able to:

* Build a linear regression model with multiple features
* Compute various metrics to evaluate goodness of fit
* Explain Spark ML’s approach to solve distributed linear regression problems

## Lab Setup

First, **run the setup script**. It defines the configuration variables this lab requires, scoped to each user.

In [0]:
%run "../Includes/Classroom-Setup"

Python interpreter will be restarted.


Resetting the learning environment:
| No action taken

Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/scalable-machine-learning-with-apache-spark/v02"

Validating the locally installed datasets:
| listing local files...(3 seconds)
| validation completed...(3 seconds total)

Creating & using the schema "charlie_ohara_4mi2_da_sml" in the catalog "hive_metastore"...(1 seconds)

Predefined tables in "charlie_ohara_4mi2_da_sml":
| -none-

Predefined paths variables:
| DA.paths.working_dir: dbfs:/mnt/dbacademy-users/charlie.ohara@standard.ai/scalable-machine-learning-with-apache-spark
| DA.paths.user_db:     dbfs:/mnt/dbacademy-users/charlie.ohara@standard.ai/scalable-machine-learning-with-apache-spark/database.db
| DA.paths.datasets:    dbfs:/mnt/dbacademy-datasets/scalable-machine-learning-with-apache-spark/v02

Setup completed (14 seconds)



## Load Dataset and Train Model

In [0]:
print(DA.paths.datasets)
file_path = f"{DA.paths.datasets}/airbnb/sf-listings/sf-listings-2019-03-06-clean.delta/"
airbnb_df = spark.read.format("delta").load(file_path)
train_df, test_df = airbnb_df.randomSplit([.8, .2], seed=42)

dbfs:/mnt/dbacademy-datasets/scalable-machine-learning-with-apache-spark/v02


In [0]:
# TODO

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression

# VectorAssembler combines the input columns into a single vector column, which is the format Spark ML estimators expect for features
vec_assembler = VectorAssembler(
  inputCols=["bedrooms", "bathrooms", "bathrooms_na", "minimum_nights", "number_of_reviews"],
  outputCol="features"
)

vtrain_df = vec_assembler.transform(train_df)
vtest_df = vec_assembler.transform(test_df)

# fit() trains the estimator on the training data and returns a fitted model
lr_model = LinearRegression(labelCol="price").fit(vtrain_df)


In [0]:
# transform() returns the test DataFrame with an added "prediction" column
pred_df = lr_model.transform(vtest_df)


In [0]:
regression_evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="price", metricName="rmse")
rmse = regression_evaluator.evaluate(pred_df)

r2 = regression_evaluator.setMetricName("r2").evaluate(pred_df)
print(f"RMSE is {rmse}")
print(f"R2 is {r2}")

RMSE is 146.66557395182022
R2 is 0.3256757861255597
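
To make the two metrics concrete, here is a minimal plain-Python sketch of the standard definitions: RMSE as the square root of the mean squared residual, and R2 as one minus the ratio of residual to total sum of squares. The helper names and the toy prices are hypothetical and exist only for illustration; in the lab itself, **`RegressionEvaluator`** computes these values on **`pred_df`**.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error: square root of the mean squared residual."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1 - ss_res / ss_tot

# Hypothetical prices and predictions, not taken from the Airbnb dataset
toy_labels = [100.0, 150.0, 200.0, 250.0]
toy_predictions = [110.0, 140.0, 210.0, 240.0]

toy_rmse = rmse(toy_labels, toy_predictions)
toy_r2 = r2(toy_labels, toy_predictions)
print(f"RMSE is {toy_rmse}")
print(f"R2 is {toy_r2}")
```

RMSE is in the same units as the label (dollars here), so it reads as a typical prediction error. R2 is unitless and compares the model against always predicting the mean price.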





Examine the coefficients for each of the variables.

In [0]:
# each coefficient is the change in predicted price for a one-unit increase in that feature,
# holding the other features fixed; a negative coefficient lowers the predicted price
# y = w1*x1 + ... + wn*xn + b  (w1..wn are the coefficients, b is the intercept)
for col, coef in zip(vec_assembler.getInputCols(), lr_model.coefficients):
    print(col, coef)
  
print(f"intercept: {lr_model.intercept}")

bedrooms 115.67218110629405
bathrooms 15.32773278579738
bathrooms_na -59.66329665713675
minimum_nights -0.5012697007581
number_of_reviews -0.29570073989207135
intercept: 61.14012549013659
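
As a rough illustration of how these numbers turn into a prediction, the sketch below applies the printed coefficients and intercept by hand. The listing values and the **`predict_price`** helper are hypothetical; in the lab, **`lr_model.transform`** does this arithmetic for every row.

```python
# Coefficients and intercept copied from the output printed above
coefficients = {
    "bedrooms": 115.67218110629405,
    "bathrooms": 15.32773278579738,
    "bathrooms_na": -59.66329665713675,
    "minimum_nights": -0.5012697007581,
    "number_of_reviews": -0.29570073989207135,
}
intercept = 61.14012549013659

# Hypothetical listing, for illustration only
listing = {
    "bedrooms": 2,
    "bathrooms": 1,
    "bathrooms_na": 0,
    "minimum_nights": 3,
    "number_of_reviews": 10,
}

def predict_price(features):
    """Linear model: intercept plus the coefficient-weighted sum of the features."""
    return intercept + sum(coefficients[name] * value for name, value in features.items())

predicted = predict_price(listing)

# Adding one bedroom while holding everything else fixed shifts the
# prediction by exactly the bedrooms coefficient.
extra_bedroom = predict_price({**listing, "bedrooms": 3}) - predicted

print(f"predicted price: {predicted:.2f}")
print(f"effect of one more bedroom: {extra_bedroom:.2f}")
```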





## Distributed Setting

Although we can solve for the parameters quickly when the data is small, the closed-form solution doesn't scale well to large datasets.

Spark uses the following approach to solve a linear regression problem:

* First, Spark tries to solve the linear regression problem using matrix decomposition.
* If that fails, Spark falls back to <a href="https://spark.apache.org/docs/latest/ml-advanced.html#limited-memory-bfgs-l-bfgs" target="_blank">L-BFGS</a> to solve for the parameters. L-BFGS is a limited-memory version of BFGS that is particularly suited to problems with a very large number of variables. The <a href="https://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm" target="_blank">BFGS</a> method belongs to the family of <a href="https://en.wikipedia.org/wiki/Quasi-Newton_method" target="_blank">quasi-Newton methods</a>, which iteratively search for zeroes or for local maxima and minima of functions.
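
The intuition behind the first approach can be sketched in plain Python. Each partition of the data contributes a small partial sum of `XᵀX` and `Xᵀy`, the partial sums are added together, and the resulting small system is solved in one place. This is a simplified sketch under stated assumptions, not Spark's implementation: the function names and the toy data are hypothetical, and the final solve uses Gaussian elimination purely for brevity rather than whatever decomposition Spark applies.

```python
from functools import reduce

def partial_sums(rows):
    """Per-partition work: accumulate X^T X and X^T y for one chunk of rows."""
    d = len(rows[0][0]) + 1  # +1 for the intercept column
    xtx = [[0.0] * d for _ in range(d)]
    xty = [0.0] * d
    for features, label in rows:
        x = [1.0] + list(features)  # prepend 1.0 so the first weight is the intercept
        for i in range(d):
            xty[i] += x[i] * label
            for j in range(d):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty

def combine(a, b):
    """Merge step: element-wise sum of two partial results."""
    (xtx_a, xty_a), (xtx_b, xty_b) = a, b
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(xtx_a, xtx_b)]
    xty = [p + q for p, q in zip(xty_a, xty_b)]
    return xtx, xty

def solve(a, b):
    """Solve the small d x d system a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        w[r] = (m[r][n] - sum(m[r][c] * w[c] for c in range(r + 1, n))) / m[r][r]
    return w

# Hypothetical data in two "partitions", generated from
# price = 50 + 100 * bedrooms + 20 * bathrooms
partitions = [
    [((1.0, 1.0), 170.0), ((2.0, 1.0), 270.0)],
    [((3.0, 2.0), 390.0), ((2.0, 2.0), 290.0), ((4.0, 3.0), 510.0)],
]

total_xtx, total_xty = reduce(combine, map(partial_sums, partitions))
weights = solve(total_xtx, total_xty)  # [intercept, bedrooms, bathrooms]
print(weights)
```

The partial sums have a size that depends only on the number of features, not on the number of rows, which is why this approach works well when there are many rows and few features. It becomes expensive as the feature count grows, and that is where an iterative method such as L-BFGS is the better fit.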

If you are interested in how linear regression is implemented in the distributed setting, and where its bottlenecks are, check out these lecture slides:
* <a href="https://files.training.databricks.com/static/docs/distributed-linear-regression-1.pdf" target="_blank">distributed-linear-regression-1</a>
* <a href="https://files.training.databricks.com/static/docs/distributed-linear-regression-2.pdf" target="_blank">distributed-linear-regression-2</a>




## Next Steps

Yikes! We built a pretty bad model: an R2 of about 0.33 means it explains only about a third of the variance in price. In the next notebook, we will see how we can improve upon it.


## Classroom Cleanup

Run the following cell to remove lesson-specific assets created during this lesson:

In [0]:
DA.cleanup()

Resetting the learning environment:
| dropping the schema "charlie_ohara_4mi2_da_sml"...(1 seconds)
| removing the working directory "dbfs:/mnt/dbacademy-users/charlie.ohara@standard.ai/scalable-machine-learning-with-apache-spark"...(0 seconds)

Validating the locally installed datasets:
| listing local files...(3 seconds)
| validation completed...(3 seconds total)


&copy; 2023 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>