
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Linear Regression II Lab

Alright! We're making progress. Still not a great RMSE or R2, but better than the baseline or just using a single feature.

In this lab, you will see how to improve the model's performance even further.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Learning Objectives:

By the end of this lab, you should be able to:

* Use RFormula to simplify the process of using StringIndexer, OneHotEncoder, and VectorAssembler
* Transform the label to log scale before fitting a model
* Convert log-scale predictions back to the original scale for model evaluation

## Lab Setup


The first thing we're going to do is **run the setup script**. This script defines the required configuration variables, which are scoped to each user.

In [0]:
%run "../Includes/Classroom-Setup"

## Load Dataset and Train Model

In [0]:
file_path = f"{DA.paths.datasets}/airbnb/sf-listings/sf-listings-2019-03-06-clean.delta/"
airbnb_df = spark.read.format("delta").load(file_path)
train_df, test_df = airbnb_df.randomSplit([.8, .2], seed=42)

## RFormula

Instead of manually specifying which columns are categorical to the StringIndexer and OneHotEncoder, <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.RFormula.html?highlight=rformula#pyspark.ml.feature.RFormula" target="_blank">RFormula</a> can do that automatically for you.

With RFormula, any column of type String is treated as a categorical feature: it is string indexed and one-hot encoded for us. Columns of other types are left as they are. RFormula then combines all of the one-hot encoded features and numeric features into a single vector, called **`features`**.

You can see a detailed example of how to use RFormula <a href="https://spark.apache.org/docs/latest/ml-features.html#rformula" target="_blank">here</a>.
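
To make the transformation described above concrete, here is a minimal plain-Python sketch of roughly what happens to one string column and one numeric column. It is an illustration, not Spark's implementation: the toy rows and the helper names `string_index` and `one_hot` are hypothetical, and it assumes the default behavior in which categories are indexed by descending frequency and the last category is dropped during one-hot encoding.

```python
from collections import Counter

# Hypothetical toy rows: (room_type, bedrooms)
rows = [
    ("Entire home", 2.0),
    ("Private room", 1.0),
    ("Entire home", 3.0),
    ("Shared room", 1.0),
    ("Private room", 1.0),
    ("Entire home", 1.0),
]


def string_index(values):
    """Map each category to an index, most frequent category first."""
    counts = Counter(values)
    # Sort by descending frequency, breaking ties alphabetically
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {category: idx for idx, category in enumerate(ordered)}


def one_hot(index, num_categories):
    """One-hot encode an index; the last category becomes all zeros."""
    vec = [0.0] * (num_categories - 1)
    if index < num_categories - 1:
        vec[index] = 1.0
    return vec


mapping = string_index([room_type for room_type, _ in rows])

# Concatenate the encoded categorical column with the numeric column,
# mimicking the single assembled feature vector
features = [
    one_hot(mapping[room_type], len(mapping)) + [bedrooms]
    for room_type, bedrooms in rows
]

for row, vec in zip(rows, features):
    print(row, "->", vec)
```

In Spark the result is a vector column rather than a Python list, but the idea is the same: each string column expands into several 0/1 entries, and numeric columns pass through unchanged.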

In [0]:
# ANSWER
from pyspark.ml import Pipeline
from pyspark.ml.feature import RFormula
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# handleInvalid="skip" filters out rows with invalid (unseen or NULL) values in string columns
r_formula = RFormula(formula="price ~ .", featuresCol="features", labelCol="price", handleInvalid="skip")

lr = LinearRegression(labelCol="price", featuresCol="features")
pipeline = Pipeline(stages=[r_formula, lr])
pipeline_model = pipeline.fit(train_df)
pred_df = pipeline_model.transform(test_df)

regression_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction")
rmse = regression_evaluator.setMetricName("rmse").evaluate(pred_df)
r2 = regression_evaluator.setMetricName("r2").evaluate(pred_df)
print(f"RMSE is {rmse}")
print(f"R2 is {r2}")

## Log Scale

Now that we have verified we get the same result using RFormula as above, we are going to improve upon our model. If you recall, our price dependent variable appears to be log-normally distributed, so we are going to try to predict it on the log scale.

Let's convert our price to the log scale and have the linear regression model predict the log price.
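
As a rough illustration of why the log scale can help, the sketch below uses a handful of made-up prices (not taken from the dataset) and compares a simple skew indicator, (mean - median) / standard deviation, before and after taking the natural log. The sample and the `skew_indicator` helper are hypothetical. Note that the log is only defined for strictly positive prices.

```python
import math
from statistics import mean, median, stdev

# Hypothetical right-skewed prices: most are modest, one is very large
prices = [50.0, 80.0, 100.0, 120.0, 150.0, 200.0, 1000.0]
log_prices = [math.log(p) for p in prices]


def skew_indicator(values):
    """Gap between mean and median, measured in standard deviations."""
    return (mean(values) - median(values)) / stdev(values)


raw_skew = skew_indicator(prices)
log_skew = skew_indicator(log_prices)

print(f"Skew indicator on raw prices: {raw_skew:.3f}")
print(f"Skew indicator on log prices: {log_skew:.3f}")
```

On this sample the indicator is smaller after the transform, meaning the single large price pulls the mean away from the median less strongly on the log scale.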

In [0]:
from pyspark.sql.functions import log

display(train_df.select(log("price")))

In [0]:
# ANSWER
from pyspark.sql.functions import col, log

log_train_df = train_df.withColumn("log_price", log(col("price")))
log_test_df = test_df.withColumn("log_price", log(col("price")))

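# Exclude the raw price from the features so the label does not leak into the model
r_formula = RFormula(formula="log_price ~ . - price", featuresCol="features", labelCol="log_price", handleInvalid="skip")

# Reuse the existing estimator, pointing it at the log label and a separate prediction column
lr.setLabelCol("log_price").setPredictionCol("log_pred")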
pipeline = Pipeline(stages=[r_formula, lr])
pipeline_model = pipeline.fit(log_train_df)
pred_df = pipeline_model.transform(log_test_df)

## Exponentiate

To interpret our RMSE in the original price units and compare it with the previous model, we need to convert our predictions back from the log scale.
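
The sketch below shows, with made-up numbers, why this step matters: an RMSE computed on log values is expressed in log units and cannot be compared directly with an RMSE computed on prices. The prices, the log predictions, and the `rmse` helper are all hypothetical and stand in for the model output.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical true prices and hypothetical log-scale predictions
prices = [100.0, 200.0, 400.0]
log_preds = [math.log(110.0), math.log(180.0), math.log(440.0)]

# RMSE in log units: small, but not comparable with earlier results
log_rmse = rmse([math.log(p) for p in prices], log_preds)

# Exponentiate the predictions first to get RMSE in price units
price_rmse = rmse(prices, [math.exp(lp) for lp in log_preds])

print(f"RMSE on log scale:   {log_rmse:.4f}")
print(f"RMSE on price scale: {price_rmse:.4f}")
```

The two numbers describe the same predictions but differ by orders of magnitude, which is why the evaluation should use the exponentiated predictions against the original `price` column.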

In [0]:
# ANSWER
from pyspark.sql.functions import col, exp

exp_df = pred_df.withColumn("prediction", exp(col("log_pred")))

rmse = regression_evaluator.setMetricName("rmse").evaluate(exp_df)
r2 = regression_evaluator.setMetricName("r2").evaluate(exp_df)
print(f"RMSE is {rmse}")
print(f"R2 is {r2}")

Nice job! You have increased the R2 and dropped the RMSE significantly in comparison to the previous model.

In the next few notebooks, we will see how we can reduce the RMSE even more.

## Classroom Cleanup

Run the following cell to remove lesson-specific assets created during this lesson:

In [0]:
DA.cleanup()

&copy; 2023 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>