# Description

* `spark.ml`
    * newer API based on DataFrames
----    
* `spark.mllib` (DON'T USE ME)
    * original ML API based on RDD API


The data used in this notebook is the San Francisco housing data set from Inside Airbnb.

`dev/github-bv/LearningSparkV2/databricks-datasets/learning-spark-v2/sf-airbnb`

# Setup

In [2]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

# Imports

In [3]:
import os
import os.path as path

# Spark

In [11]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = (SparkSession
         .builder
         .master('local[*]')
         .appName("spark-ml-ch-10b")
         .config('spark.ui.showConsoleProgress', 'false')
         .getOrCreate())

# Functions

In [4]:
def db_fname(fname):
    import os.path as path
    data_dir = '~/dev/github-bv/LearningSparkV2/databricks-datasets/learning-spark-v2/'
    return path.expanduser(path.join(data_dir, fname))

# Designing ML Pipelines

* The Pipeline API provides a high-level API built on top of DataFrames to organize ML workflows
* A pipeline is composed of a series of `transformers` and `estimators`

## Definitions

* Transformer
    * DF -> DF + 1 or more columns appended
    * has `.transform()` method
* Estimator
    * DF -> Model (Transformer)
    * learns ('fits') params
    * has `.fit()` method
* Pipeline
    * organize a series of transformers and estimators into a single model
    * Pipeline is an `estimator`
    * `pipeline.fit()` returns a `PipelineModel`, which is a `transformer`
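
The contract above can be sketched in plain Python with no Spark dependency. The class names (`AddOne`, `MeanEstimator`, `MeanModel`, `ToyPipeline`, `ToyPipelineModel`) are hypothetical, and rows are plain dicts rather than DataFrames. This is only an illustration of the fit/transform pattern, not how `spark.ml` is implemented.

```python
class AddOne:
    """Toy transformer: appends a column and needs no fitting."""
    def transform(self, rows):
        return [{**r, "x_plus_one": r["x"] + 1} for r in rows]


class MeanEstimator:
    """Toy estimator: learns the mean of `x` and returns a fitted model."""
    def fit(self, rows):
        mean = sum(r["x"] for r in rows) / len(rows)
        return MeanModel(mean)


class MeanModel:
    """Fitted model (a transformer): appends the learned mean as `prediction`."""
    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [{**r, "prediction": self.mean} for r in rows]


class ToyPipeline:
    """Estimator that chains stages; fit() returns a ToyPipelineModel."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the data produced so far; transformers are used as-is.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return ToyPipelineModel(fitted)


class ToyPipelineModel:
    """Transformer made of already-fitted stages."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train = [{"x": 1.0}, {"x": 3.0}]
test = [{"x": 10.0}]
model = ToyPipeline([AddOne(), MeanEstimator()]).fit(train)
result = model.transform(test)
print(result)
```

Fitting happens once on the training rows; the returned model only has a `transform()` method, so it can be applied to unseen rows without relearning anything.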

## Data Ingestion and Exploration

They've done some cleansing already. See the Databricks Community Edition notebook.

In [8]:
filePath = db_fname('sf-airbnb/sf-airbnb-clean.parquet')

In [12]:
airbnbDF = spark.read.parquet(filePath)

In [13]:
airbnbDF.select('neighbourhood_cleansed', 'room_type', 'bedrooms', 'bathrooms', 'number_of_reviews', 'price').show(5)

+----------------------+---------------+--------+---------+-----------------+-----+
|neighbourhood_cleansed|      room_type|bedrooms|bathrooms|number_of_reviews|price|
+----------------------+---------------+--------+---------+-----------------+-----+
|      Western Addition|Entire home/apt|     1.0|      1.0|            180.0|170.0|
|        Bernal Heights|Entire home/apt|     2.0|      1.0|            111.0|235.0|
|        Haight Ashbury|   Private room|     1.0|      4.0|             17.0| 65.0|
|        Haight Ashbury|   Private room|     1.0|      4.0|              8.0| 65.0|
|      Western Addition|Entire home/apt|     2.0|      1.5|             27.0|785.0|
+----------------------+---------------+--------+---------+-----------------+-----+
only showing top 5 rows



In [15]:
trainDF, testDF = airbnbDF.randomSplit([0.8, 0.2], seed=42)
print(f"""There are {trainDF.count()} rows in the training set, and {testDF.count()} in the test set""")

There are 5758 rows in the training set, and 1388 in the test set


## Preparing Features with Transformers

Linear regression (like many other algorithms in Spark) requires that **all the input features are contained within a single vector in your DataFrame**. Thus, we need to transform our data.

Use `VectorAssembler` to combine all columns into a single vector. https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.VectorAssembler

In [17]:
from pyspark.ml.feature import VectorAssembler
vecAssembler = VectorAssembler(inputCols = ['bedrooms'], outputCol='features')
vecTrainDF = vecAssembler.transform(trainDF)
vecTrainDF.select('bedrooms','features', 'price').show(10)

+--------+--------+-----+
|bedrooms|features|price|
+--------+--------+-----+
|     1.0|   [1.0]|200.0|
|     1.0|   [1.0]|250.0|
|     3.0|   [3.0]|250.0|
|     1.0|   [1.0]| 45.0|
|     1.0|   [1.0]|115.0|
|     1.0|   [1.0]| 70.0|
|     1.0|   [1.0]|105.0|
|     1.0|   [1.0]| 86.0|
|     1.0|   [1.0]|100.0|
|     2.0|   [2.0]|220.0|
+--------+--------+-----+
only showing top 10 rows



In [20]:
vecAssembler.getInputCols()

['bedrooms']

In [21]:
vecAssembler.getOutputCol()

'features'

## Using Estimators to Build Models

In [26]:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol='features', labelCol='price')
lrModel = lr.fit(vecTrainDF)

In [27]:
type(lrModel)

pyspark.ml.regression.LinearRegressionModel

In [29]:
type(lr)

pyspark.ml.regression.LinearRegression

`lr.fit()` returns a `LinearRegressionModel` (`lrModel`), which is a `transformer`. In other words, the **output** of an estimator's `fit()` method is a `transformer`. Once the estimator has learned the parameters, the transformer can apply these parameters to new data points to generate predictions.

In [31]:
m = round(lrModel.coefficients[0], 2)
b = round(lrModel.intercept,2)

print(f"""The formula for the linear regression line is  price = {m} x bedrooms + {b}""")

The formula for the linear regression line is  price = 119.32 x bedrooms + 54.11


In [32]:
lrModel.coefficients

DenseVector([119.3164])

In [36]:
mp = lrModel.extractParamMap()

In [37]:
mp

{Param(parent='LinearRegression_96b02464c14f', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2)'): 2,
 Param(parent='LinearRegression_96b02464c14f', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty'): 0.0,
 Param(parent='LinearRegression_96b02464c14f', name='epsilon', doc='The shape parameter to control the amount of robustness. Must be > 1.0.'): 1.35,
 Param(parent='LinearRegression_96b02464c14f', name='featuresCol', doc='features column name'): 'features',
 Param(parent='LinearRegression_96b02464c14f', name='fitIntercept', doc='whether to fit an intercept term'): True,
 Param(parent='LinearRegression_96b02464c14f', name='labelCol', doc='label column name'): 'price',
 Param(parent='LinearRegression_96b02464c14f', name='loss', doc='The loss function to be optimized. Supported options: squaredError, huber. (Default squaredError)'): 'squaredError',
 Param(pa

In [42]:
for m in mp:
    print(f'{m.name}: \t{mp[m]}')

aggregationDepth: 	2
elasticNetParam: 	0.0
epsilon: 	1.35
featuresCol: 	features
fitIntercept: 	True
labelCol: 	price
loss: 	squaredError
maxIter: 	100
predictionCol: 	prediction
regParam: 	0.0
solver: 	auto
standardization: 	True
tol: 	1e-06


## Creating a Pipeline

`pipelineModel` is a `transformer`

In [44]:
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[vecAssembler, lr])
pipelineModel = pipeline.fit(trainDF)

Apply it to our test data set

In [45]:
preDF = pipelineModel.transform(testDF)

In [46]:
preDF.columns

['host_is_superhost',
 'cancellation_policy',
 'instant_bookable',
 'host_total_listings_count',
 'neighbourhood_cleansed',
 'latitude',
 'longitude',
 'property_type',
 'room_type',
 'accommodates',
 'bathrooms',
 'bedrooms',
 'beds',
 'bed_type',
 'minimum_nights',
 'number_of_reviews',
 'review_scores_rating',
 'review_scores_accuracy',
 'review_scores_cleanliness',
 'review_scores_checkin',
 'review_scores_communication',
 'review_scores_location',
 'review_scores_value',
 'price',
 'bedrooms_na',
 'bathrooms_na',
 'beds_na',
 'review_scores_rating_na',
 'review_scores_accuracy_na',
 'review_scores_cleanliness_na',
 'review_scores_checkin_na',
 'review_scores_communication_na',
 'review_scores_location_na',
 'review_scores_value_na',
 'features',
 'prediction']

In [47]:
preDF.select('bedrooms', 'features', 'price', 'prediction').show(10)

+--------+--------+-----+------------------+
|bedrooms|features|price|        prediction|
+--------+--------+-----+------------------+
|     1.0|   [1.0]|130.0|173.42588969100558|
|     1.0|   [1.0]| 85.0|173.42588969100558|
|     1.0|   [1.0]| 95.0|173.42588969100558|
|     1.0|   [1.0]|128.0|173.42588969100558|
|     1.0|   [1.0]|250.0|173.42588969100558|
|     1.0|   [1.0]| 95.0|173.42588969100558|
|     1.0|   [1.0]|105.0|173.42588969100558|
|     0.0|   [0.0]|125.0| 54.10946937938496|
|     3.0|   [3.0]|405.0|412.05873031424676|
|     1.0|   [1.0]| 72.0|173.42588969100558|
+--------+--------+-----+------------------+
only showing top 10 rows
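
As a sanity check, the predictions in the table above can be reproduced by hand from the fitted line. The sketch below hard-codes the slope and intercept taken from the outputs above (rounded to four decimals, so the results agree only to about three decimals). `predict_price` is a hypothetical helper, not part of `spark.ml`.

```python
SLOPE = 119.3164      # lrModel.coefficients[0], as printed above
INTERCEPT = 54.1095   # prediction for 0 bedrooms in the table above, rounded


def predict_price(bedrooms):
    """Evaluate price = SLOPE * bedrooms + INTERCEPT."""
    return SLOPE * bedrooms + INTERCEPT


for bedrooms in (0.0, 1.0, 3.0):
    print(bedrooms, round(predict_price(bedrooms), 2))
```

The three values agree with the `prediction` column for listings with 0, 1, and 3 bedrooms.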



## One-hot encoding

Convert categorical values into numeric values.

Spark uses a `SparseVector` when the majority of entries are 0, so OHE does not massively increase consumption of memory or compute resources.

There are multiple ways to one-hot encode data in Spark:

1. Use `StringIndexer` and `OneHotEncoder`

  * apply `StringIndexer` estimator to convert categorical values into category indices (ordered by label frequencies)
  * pass output to `OneHotEncoder` (`OneHotEncoderEstimator` for us, since we're using Spark 2.4.0)

In [64]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, OneHotEncoderEstimator

# categoricalCols = [field for (field, dataType) in trainDF.dtypes if dataType == 'string']
# indexOutputCols = [x + 'Index' for x in categoricalCols]
# oheOutputCols = [x + 'OHE' for x in categoricalCols]

# stringIndexer = StringIndexer(inputCols=categoricalCols,
#                               outputCols=indexOutputCols,
#                               handleInvalid='skip')
# oheEncoder = OneHotEncoder(inputCols=indexOutputCols,
#                            outputCols=oheOutputCols)

In [65]:
cat_fields = [field for (field, dataType) in trainDF.dtypes if dataType == 'string']

In [66]:
cat_fields

['host_is_superhost',
 'cancellation_policy',
 'instant_bookable',
 'neighbourhood_cleansed',
 'property_type',
 'room_type',
 'bed_type']

In [136]:
def make_string_indexer(col_name):
    """Valid values of handleInvalid:
    skip (filter rows)
    error (throw an error)
    keep (put in a special additional bucket)
    
    NOTE: Spark 3.0 accepts multiple columns as input/output
    """
    encoded_col_name = f'{col_name}_Index'
    string_indexer = StringIndexer(inputCol=col_name, 
                                   outputCol=encoded_col_name, 
                                   handleInvalid='keep')
    return string_indexer

def make_one_hot_encoder(col_names):
    """each `*_OHE` column will be a SparseVector after fitting and transformation
    
    Usage:
    ohe_room_type = make_one_hot_encoder(['room_type'])
    encoded_room_type = ohe_room_type.fit(transformed_room_type)

    encoded_room_type.transform(transformed_room_type).show()
    
    +---------------+-----+---------------+-------------+
    |      room_type|price|room_type_Index|room_type_OHE|
    +---------------+-----+---------------+-------------+
    |   Private room|200.0|            1.0|(3,[1],[1.0])|
    |Entire home/apt|250.0|            0.0|(3,[0],[1.0])|
    |Entire home/apt|250.0|            0.0|(3,[0],[1.0])|
    """
    input_col_names = [f'{col_name}_Index' for col_name in col_names]
    output_col_names = [f'{col_name}_OHE' for col_name in col_names]
    estimator = OneHotEncoderEstimator(inputCols=input_col_names,
                                  outputCols=output_col_names)
    return estimator


In [137]:
stages_cat_str_index = [make_string_indexer(c) for c in cat_fields]

In [138]:
oheEncoder = make_one_hot_encoder(cat_fields)
oheOutputCols = oheEncoder.getOutputCols()

In [139]:
oheEncoder.extractParamMap()

{Param(parent='OneHotEncoderEstimator_cad88dda0193', name='handleInvalid', doc="How to handle invalid data during transform(). Options are 'keep' (invalid data presented as an extra categorical feature) or error (throw an error). Note that this Param is only used during transform; during fitting, invalid data will result in an error."): 'error',
 Param(parent='OneHotEncoderEstimator_cad88dda0193', name='dropLast', doc='whether to drop the last category'): True,
 Param(parent='OneHotEncoderEstimator_cad88dda0193', name='inputCols', doc='input column names.'): ['host_is_superhost_Index',
  'cancellation_policy_Index',
  'instant_bookable_Index',
  'neighbourhood_cleansed_Index',
  'property_type_Index',
  'room_type_Index',
  'bed_type_Index'],
 Param(parent='OneHotEncoderEstimator_cad88dda0193', name='outputCols', doc='output column names.'): ['host_is_superhost_OHE',
  'cancellation_policy_OHE',
  'instant_bookable_OHE',
  'neighbourhood_cleansed_OHE',
  'property_type_OHE',
  'room_ty

In [140]:
numericCols = [field for (field, dataType) in trainDF.dtypes if dataType == 'double' and field != 'price']
numericCols

['host_total_listings_count',
 'latitude',
 'longitude',
 'accommodates',
 'bathrooms',
 'bedrooms',
 'beds',
 'minimum_nights',
 'number_of_reviews',
 'review_scores_rating',
 'review_scores_accuracy',
 'review_scores_cleanliness',
 'review_scores_checkin',
 'review_scores_communication',
 'review_scores_location',
 'review_scores_value',
 'bedrooms_na',
 'bathrooms_na',
 'beds_na',
 'review_scores_rating_na',
 'review_scores_accuracy_na',
 'review_scores_cleanliness_na',
 'review_scores_checkin_na',
 'review_scores_communication_na',
 'review_scores_location_na',
 'review_scores_value_na']

In [141]:
assemblerInputs = oheOutputCols + numericCols
vecAssembler = VectorAssembler(inputCols=assemblerInputs, outputCol='features')

In [142]:
assemblerInputs

['host_is_superhost_OHE',
 'cancellation_policy_OHE',
 'instant_bookable_OHE',
 'neighbourhood_cleansed_OHE',
 'property_type_OHE',
 'room_type_OHE',
 'bed_type_OHE',
 'host_total_listings_count',
 'latitude',
 'longitude',
 'accommodates',
 'bathrooms',
 'bedrooms',
 'beds',
 'minimum_nights',
 'number_of_reviews',
 'review_scores_rating',
 'review_scores_accuracy',
 'review_scores_cleanliness',
 'review_scores_checkin',
 'review_scores_communication',
 'review_scores_location',
 'review_scores_value',
 'bedrooms_na',
 'bathrooms_na',
 'beds_na',
 'review_scores_rating_na',
 'review_scores_accuracy_na',
 'review_scores_cleanliness_na',
 'review_scores_checkin_na',
 'review_scores_communication_na',
 'review_scores_location_na',
 'review_scores_value_na']

### What does `StringIndexer` do?

* `StringIndexer` creates an estimator that will convert a column of strings to a column of numbers ordered by frequency
* to use it, `.fit()` your dataframe to get a `model` object out
* then, `.transform()` new data to append the indexed columns
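
The fit/transform behaviour described above can be illustrated in plain Python. `fit_string_index` and `transform_string_index` are hypothetical helpers, the room types are a made-up sample, and `"Hotel room"` is an invented unseen label. The sketch assumes frequency-descending ordering and the `handleInvalid='keep'` behaviour of sending unseen labels to one extra bucket; tie-breaking between equally frequent labels is not modelled.

```python
from collections import Counter


def fit_string_index(values):
    """Return {label: index}, with the most frequent label at index 0."""
    counts = Counter(values)
    ordered = [label for label, _ in counts.most_common()]
    return {label: float(i) for i, label in enumerate(ordered)}


def transform_string_index(mapping, values):
    """Map labels to indices; unseen labels go to one extra bucket ('keep')."""
    unknown = float(len(mapping))
    return [mapping.get(v, unknown) for v in values]


train = ["Private room", "Entire home/apt", "Entire home/apt",
         "Private room", "Entire home/apt", "Shared room"]
mapping = fit_string_index(train)
indexed = transform_string_index(
    mapping, ["Private room", "Entire home/apt", "Hotel room"])
print(mapping)
print(indexed)
```

The mapping is learned once from the training values, which is why the indexer is an estimator: the same label always maps to the same index when new data is transformed.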

#### Make the indexer object (no fitting yet)

In [92]:
se_1 = make_string_indexer('room_type')

#### `.fit()` the indexer object (returns a `model`)

In [81]:
trainDF.select('room_type', 'price').show()

+---------------+-----+
|      room_type|price|
+---------------+-----+
|   Private room|200.0|
|Entire home/apt|250.0|
|Entire home/apt|250.0|
|   Private room| 45.0|
|   Private room|115.0|
|   Private room| 70.0|
|   Private room|105.0|
|   Private room| 86.0|
|Entire home/apt|100.0|
|Entire home/apt|220.0|
|Entire home/apt|110.0|
|   Private room|130.0|
|   Private room|100.0|
|Entire home/apt|350.0|
|   Private room|159.0|
|Entire home/apt|200.0|
|Entire home/apt|250.0|
|Entire home/apt|299.0|
|Entire home/apt|250.0|
|   Private room| 95.0|
+---------------+-----+
only showing top 20 rows



In [90]:
se_1_model = se_1.fit(trainDF.select('room_type', 'price'))

#### `.transform()` some data using the fitted indexer object (the `model`)

In [93]:
transformed_room_type = se_1_model.transform(trainDF.select('room_type', 'price'))
transformed_room_type.show()

+---------------+-----+---------------+
|      room_type|price|room_type_Index|
+---------------+-----+---------------+
|   Private room|200.0|            1.0|
|Entire home/apt|250.0|            0.0|
|Entire home/apt|250.0|            0.0|
|   Private room| 45.0|            1.0|
|   Private room|115.0|            1.0|
|   Private room| 70.0|            1.0|
|   Private room|105.0|            1.0|
|   Private room| 86.0|            1.0|
|Entire home/apt|100.0|            0.0|
|Entire home/apt|220.0|            0.0|
|Entire home/apt|110.0|            0.0|
|   Private room|130.0|            1.0|
|   Private room|100.0|            1.0|
|Entire home/apt|350.0|            0.0|
|   Private room|159.0|            1.0|
|Entire home/apt|200.0|            0.0|
|Entire home/apt|250.0|            0.0|
|Entire home/apt|299.0|            0.0|
|Entire home/apt|250.0|            0.0|
|   Private room| 95.0|            1.0|
+---------------+-----+---------------+
only showing top 20 rows



In [102]:
ohe_room_type = make_one_hot_encoder(['room_type'])
encoded_room_type = ohe_room_type.fit(transformed_room_type)

encoded_room_type.transform(transformed_room_type).show()

+---------------+-----+---------------+-------------+
|      room_type|price|room_type_Index|room_type_OHE|
+---------------+-----+---------------+-------------+
|   Private room|200.0|            1.0|(3,[1],[1.0])|
|Entire home/apt|250.0|            0.0|(3,[0],[1.0])|
|Entire home/apt|250.0|            0.0|(3,[0],[1.0])|
|   Private room| 45.0|            1.0|(3,[1],[1.0])|
|   Private room|115.0|            1.0|(3,[1],[1.0])|
|   Private room| 70.0|            1.0|(3,[1],[1.0])|
|   Private room|105.0|            1.0|(3,[1],[1.0])|
|   Private room| 86.0|            1.0|(3,[1],[1.0])|
|Entire home/apt|100.0|            0.0|(3,[0],[1.0])|
|Entire home/apt|220.0|            0.0|(3,[0],[1.0])|
|Entire home/apt|110.0|            0.0|(3,[0],[1.0])|
|   Private room|130.0|            1.0|(3,[1],[1.0])|
|   Private room|100.0|            1.0|(3,[1],[1.0])|
|Entire home/apt|350.0|            0.0|(3,[0],[1.0])|
|   Private room|159.0|            1.0|(3,[1],[1.0])|
|Entire home/apt|200.0|     

In [101]:
transformed_room_type.select('room_type').distinct().show()

+---------------+
|      room_type|
+---------------+
|    Shared room|
|Entire home/apt|
|   Private room|
+---------------+
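
The `room_type_OHE` values shown earlier, such as `(3,[1],[1.0])`, are sparse vectors written as `(size, [indices], [values])`. The sketch below decodes that notation in plain Python. `sparse_to_dense` and `one_hot` are hypothetical helpers, and the count of 4 categories is an assumption: the three room types above plus the extra bucket added by `handleInvalid='keep'`, which would explain a vector size of 3 once `dropLast=True` removes the last category.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a dense list of floats."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


def one_hot(index, num_categories, drop_last=True):
    """One-hot encode a category index as a (size, indices, values) triple."""
    size = num_categories - 1 if drop_last else num_categories
    i = int(index)
    if i < size:
        return (size, [i], [1.0])
    return (size, [], [])  # the dropped last category is encoded as all zeros


# Assumed: 3 room types + 1 extra bucket from handleInvalid='keep' = 4 categories
private_room = one_hot(1.0, num_categories=4)
print(private_room)
print(sparse_to_dense(*private_room))
```

Only the non-zero position is stored, so the cost of a one-hot column grows with the number of rows rather than with the number of categories.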



### Back to example

## `RFormula`

In [122]:
from pyspark.ml.feature import RFormula

In [123]:
rFormula = RFormula(formula='price ~.',
                    featuresCol='features',
                    labelCol='price',
                    handleInvalid='keep')

In [127]:
rf_transformer = rFormula.fit(trainDF.select('room_type', 'price'))

In [129]:
rf_transformer.transform(trainDF.select('room_type', 'price')).show()

+---------------+-----+-------------+
|      room_type|price|     features|
+---------------+-----+-------------+
|   Private room|200.0|[0.0,1.0,0.0]|
|Entire home/apt|250.0|[1.0,0.0,0.0]|
|Entire home/apt|250.0|[1.0,0.0,0.0]|
|   Private room| 45.0|[0.0,1.0,0.0]|
|   Private room|115.0|[0.0,1.0,0.0]|
|   Private room| 70.0|[0.0,1.0,0.0]|
|   Private room|105.0|[0.0,1.0,0.0]|
|   Private room| 86.0|[0.0,1.0,0.0]|
|Entire home/apt|100.0|[1.0,0.0,0.0]|
|Entire home/apt|220.0|[1.0,0.0,0.0]|
|Entire home/apt|110.0|[1.0,0.0,0.0]|
|   Private room|130.0|[0.0,1.0,0.0]|
|   Private room|100.0|[0.0,1.0,0.0]|
|Entire home/apt|350.0|[1.0,0.0,0.0]|
|   Private room|159.0|[0.0,1.0,0.0]|
|Entire home/apt|200.0|[1.0,0.0,0.0]|
|Entire home/apt|250.0|[1.0,0.0,0.0]|
|Entire home/apt|299.0|[1.0,0.0,0.0]|
|Entire home/apt|250.0|[1.0,0.0,0.0]|
|   Private room| 95.0|[0.0,1.0,0.0]|
+---------------+-----+-------------+
only showing top 20 rows



In [130]:
rf_transformer = rFormula.fit(trainDF)

In [135]:
rf_transformer.transform(trainDF).select('room_type', 'price', 'features').show(truncate=False)

+---------------+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|room_type      |price|features                                                                                                                                                                                                                                 |
+---------------+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Private room   |200.0|(104,[0,4,8,43,46,47,49,74,76,77,78,79,80,85,86,87,88,89,90,91,92,93],[1.0,1.0,1.0,1.0,37.7431,-122.44509,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0])                        

You do not need to one-hot encode categorical features for tree-based methods, and it will often make your tree-based models worse.

## Add LinearRegression model

In [145]:
lr = LinearRegression(labelCol='price', featuresCol='features')
pipeline = Pipeline(stages= stages_cat_str_index + [oheEncoder, vecAssembler, lr])

In [146]:
pipelineModel = pipeline.fit(trainDF)

predDF = pipelineModel.transform(testDF)

predDF

DataFrame[host_is_superhost: string, cancellation_policy: string, instant_bookable: string, host_total_listings_count: double, neighbourhood_cleansed: string, latitude: double, longitude: double, property_type: string, room_type: string, accommodates: double, bathrooms: double, bedrooms: double, beds: double, bed_type: string, minimum_nights: double, number_of_reviews: double, review_scores_rating: double, review_scores_accuracy: double, review_scores_cleanliness: double, review_scores_checkin: double, review_scores_communication: double, review_scores_location: double, review_scores_value: double, price: double, bedrooms_na: double, bathrooms_na: double, beds_na: double, review_scores_rating_na: double, review_scores_accuracy_na: double, review_scores_cleanliness_na: double, review_scores_checkin_na: double, review_scores_communication_na: double, review_scores_location_na: double, review_scores_value_na: double, host_is_superhost_Index: double, cancellation_policy_Index: double, inst

In [150]:
predDF.columns

['host_is_superhost',
 'cancellation_policy',
 'instant_bookable',
 'host_total_listings_count',
 'neighbourhood_cleansed',
 'latitude',
 'longitude',
 'property_type',
 'room_type',
 'accommodates',
 'bathrooms',
 'bedrooms',
 'beds',
 'bed_type',
 'minimum_nights',
 'number_of_reviews',
 'review_scores_rating',
 'review_scores_accuracy',
 'review_scores_cleanliness',
 'review_scores_checkin',
 'review_scores_communication',
 'review_scores_location',
 'review_scores_value',
 'price',
 'bedrooms_na',
 'bathrooms_na',
 'beds_na',
 'review_scores_rating_na',
 'review_scores_accuracy_na',
 'review_scores_cleanliness_na',
 'review_scores_checkin_na',
 'review_scores_communication_na',
 'review_scores_location_na',
 'review_scores_value_na',
 'host_is_superhost_Index',
 'cancellation_policy_Index',
 'instant_bookable_Index',
 'neighbourhood_cleansed_Index',
 'property_type_Index',
 'room_type_Index',
 'bed_type_Index',
 'property_type_OHE',
 'room_type_OHE',
 'instant_bookable_OHE',
 'bed_

In [152]:
predDF.select('price', 'prediction', 'features').show()

+-----+------------------+--------------------+
|price|        prediction|            features|
+-----+------------------+--------------------+
|130.0|-50.08206919874374|(104,[0,4,8,27,45...|
| 85.0| 69.30273500843668|(104,[0,4,8,27,46...|
| 95.0|122.44764330428825|(104,[0,4,8,27,48...|
|128.0|-73.85987476664286|(104,[0,4,8,15,45...|
|250.0| 122.6754680966792|(104,[0,4,8,15,46...|
| 95.0| 200.5074117194099|(104,[0,4,8,35,45...|
|105.0|128.48438720450395|(104,[0,4,8,36,46...|
|125.0|108.70731811393489|(104,[0,4,8,36,45...|
|405.0|444.05924130765334|(104,[0,4,8,16,50...|
| 72.0|187.88334714553457|(104,[0,4,8,16,47...|
|150.0|185.72996926754968|(104,[0,4,8,21,47...|
|450.0|268.26018917100646|(104,[0,4,8,22,47...|
|165.0|357.28925461431845|(104,[0,4,8,10,47...|
| 85.0|200.10202003698168|(104,[0,4,8,10,47...|
|100.0| 187.3679853991889|(104,[0,4,8,19,46...|
|100.0| 52.38409496248187|(104,[0,4,8,30,45...|
| 57.0|203.43439302167099|(104,[0,4,8,29,45...|
| 99.0| 277.2300165159686|(104,[0,4,8,29

# Evaluating Models

`spark.ml` provides classification, regression, clustering, and ranking evaluators (the ranking evaluator was introduced in Spark 3.0).

## RMSE (root mean square error)

Because this is a regression problem, use RMSE and $R^2$ to evaluate the model.

$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}$
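
The formula translates directly into plain Python. The sketch below is independent of Spark; `rmse` is a hypothetical helper and the prices are made-up numbers, not values from the data set.

```python
import math


def rmse(labels, predictions):
    """Root mean square error between two equal-length sequences."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


labels = [100.0, 200.0, 300.0]        # made-up prices
predictions = [110.0, 190.0, 330.0]   # made-up predictions
print(rmse(labels, predictions))
```

Because the errors are squared before averaging, the single 30-unit miss contributes more to the result than the two 10-unit misses combined.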

In [154]:
from pyspark.ml.evaluation import RegressionEvaluator
regressionEvaluator = RegressionEvaluator(
    predictionCol='prediction',
    labelCol='price',
    metricName='rmse')

In [216]:
rmse = regressionEvaluator.setMetricName('rmse').evaluate(predDF)
print(f'RMSE is {rmse:.1f}')

RMSE is 286.2


In [161]:
regressionEvaluator.evaluate(predDF, {regressionEvaluator.metricName: 'rmse'})

286.19910175951

In [162]:
regressionEvaluator.evaluate(predDF, {regressionEvaluator.metricName: 'mae'})

92.50341120810286

### Create a Baseline Model

In [164]:
from pyspark.sql.functions import avg, lit

In [166]:
trainDF.select(avg('price')).show()

+------------------+
|        avg(price)|
+------------------+
|214.60020840569643|
+------------------+



In [169]:
trainDF.select(avg('price')).first()

Row(avg(price)=214.60020840569643)

In [170]:
trainDF.select(avg('price')).first()[0]

214.60020840569643

In [171]:
avgPrice = trainDF.select(avg('price')).first()[0]

Here, we don't need to do a `model.transform(testDF)` to get a prediction.

Instead, we are assigning the average price as the prediction value.

In [172]:
predDF_baseline = testDF.withColumn('avgPrediction', lit(avgPrice))

In [176]:
predDF_baseline.columns

['host_is_superhost',
 'cancellation_policy',
 'instant_bookable',
 'host_total_listings_count',
 'neighbourhood_cleansed',
 'latitude',
 'longitude',
 'property_type',
 'room_type',
 'accommodates',
 'bathrooms',
 'bedrooms',
 'beds',
 'bed_type',
 'minimum_nights',
 'number_of_reviews',
 'review_scores_rating',
 'review_scores_accuracy',
 'review_scores_cleanliness',
 'review_scores_checkin',
 'review_scores_communication',
 'review_scores_location',
 'review_scores_value',
 'price',
 'bedrooms_na',
 'bathrooms_na',
 'beds_na',
 'review_scores_rating_na',
 'review_scores_accuracy_na',
 'review_scores_cleanliness_na',
 'review_scores_checkin_na',
 'review_scores_communication_na',
 'review_scores_location_na',
 'review_scores_value_na',
 'avgPrediction']

In [177]:
from pyspark.ml.evaluation import RegressionEvaluator

In [178]:
regressionMeanEvaluator = RegressionEvaluator(predictionCol='avgPrediction', labelCol='price', metricName='rmse')

In [181]:
rmse_baseline = regressionMeanEvaluator.evaluate(predDF_baseline)
print(f'RMSE for predicting avg price is: {rmse_baseline:.2f}')

RMSE for predicting avg price is: 311.16


In [185]:
print(f'RMSE model: {rmse:.2f}\nRMSE baseline: {rmse_baseline:.2f}')
print('The model beat the baseline.')

RMSE model: 286.20
RMSE baseline: 311.16
The model beat the baseline.


## R2

* $R^2$ values range over $(-\infty, 1]$

\begin{align}
R^2 &= 1 - \frac{SS_{res}}{SS_{tot}} \\
SS_{tot} &= \sum_{i=1}^n (y_i - \bar{y})^2 \\
SS_{res} &= \sum_{i=1}^n (y_i - \hat{y}_i)^2
\end{align}
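
A minimal plain-Python sketch of these definitions follows, using made-up prices. `r2_score` is a hypothetical helper, not the Spark evaluator. It shows the three regimes: perfect predictions give 1, always predicting the mean gives 0, and predictions worse than the mean give a negative value.

```python
def r2_score(labels, predictions):
    """R^2 = 1 - SS_res / SS_tot, following the definitions above."""
    mean = sum(labels) / len(labels)
    ss_tot = sum((y - mean) ** 2 for y in labels)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(labels, predictions))
    return 1.0 - ss_res / ss_tot


labels = [100.0, 200.0, 300.0]  # made-up prices

perfect = r2_score(labels, labels)                   # exact predictions
mean_only = r2_score(labels, [200.0, 200.0, 200.0])  # always predict the mean
worse = r2_score(labels, [300.0, 200.0, 100.0])      # worse than the mean
print(perfect, mean_only, worse)
```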

In [214]:
r2 = regressionEvaluator.setMetricName('r2').evaluate(predDF)

In [193]:
regressionEvaluator.evaluate(predDF, {regressionEvaluator.metricName: 'r2'})

0.15360837656049942

In [189]:
r2

0.15360837656049942

In [192]:
r2 - (1 - (rmse/rmse_baseline)**2)

-0.0003858279473081261
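
The difference above is close to zero because $SS_{res} = n \cdot \text{RMSE}^2$, and when the baseline predicts the mean of the labels being evaluated, $SS_{tot} = n \cdot \text{RMSE}_{baseline}^2$, so $R^2 = 1 - (\text{RMSE}/\text{RMSE}_{baseline})^2$. The baseline in this notebook predicts the training-set average rather than the test-set average, which is presumably why a small residual remains. The sketch below illustrates this with made-up numbers; `gap`, `rmse`, and `r2_score` are hypothetical helpers.

```python
import math


def rmse(labels, predictions):
    """Root mean square error."""
    return math.sqrt(
        sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels))


def r2_score(labels, predictions):
    """R^2 = 1 - SS_res / SS_tot."""
    mean = sum(labels) / len(labels)
    ss_tot = sum((y - mean) ** 2 for y in labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    return 1.0 - ss_res / ss_tot


labels = [100.0, 200.0, 300.0, 400.0]       # made-up test prices
predictions = [120.0, 180.0, 310.0, 370.0]  # made-up model predictions

test_mean = sum(labels) / len(labels)  # mean of the evaluated labels
other_mean = 240.0                     # stand-in for a training-set mean


def gap(baseline_value):
    """R^2 minus (1 - (RMSE / RMSE_baseline)^2) for a constant baseline."""
    baseline = [baseline_value] * len(labels)
    ratio = rmse(labels, predictions) / rmse(labels, baseline)
    return r2_score(labels, predictions) - (1.0 - ratio ** 2)


print(gap(test_mean))   # zero up to floating-point error
print(gap(other_mean))  # small negative residual
```

Any constant other than the mean of the evaluated labels gives a larger baseline RMSE, so the residual is negative, matching the sign of the value computed above.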

## Predict price on log scale

In [200]:
from pyspark.sql.functions import col, log

In [201]:
logTrainDF = trainDF.withColumn('log_price', log(col('price')))
logTestDF = testDF.withColumn('log_price', log(col('price')))

In [207]:
log_lr = LinearRegression(labelCol='log_price', featuresCol='features', predictionCol='log_pred')
log_pipeline = Pipeline(stages = stages_cat_str_index + [oheEncoder, vecAssembler, log_lr])

In [208]:
log_pipeline_model = log_pipeline.fit(logTrainDF)

log_predDF = log_pipeline_model.transform(logTestDF)

log_predDF

DataFrame[host_is_superhost: string, cancellation_policy: string, instant_bookable: string, host_total_listings_count: double, neighbourhood_cleansed: string, latitude: double, longitude: double, property_type: string, room_type: string, accommodates: double, bathrooms: double, bedrooms: double, beds: double, bed_type: string, minimum_nights: double, number_of_reviews: double, review_scores_rating: double, review_scores_accuracy: double, review_scores_cleanliness: double, review_scores_checkin: double, review_scores_communication: double, review_scores_location: double, review_scores_value: double, price: double, bedrooms_na: double, bathrooms_na: double, beds_na: double, review_scores_rating_na: double, review_scores_accuracy_na: double, review_scores_cleanliness_na: double, review_scores_checkin_na: double, review_scores_communication_na: double, review_scores_location_na: double, review_scores_value_na: double, log_price: double, host_is_superhost_Index: double, cancellation_policy_

In [209]:
log_predDF.columns

['host_is_superhost',
 'cancellation_policy',
 'instant_bookable',
 'host_total_listings_count',
 'neighbourhood_cleansed',
 'latitude',
 'longitude',
 'property_type',
 'room_type',
 'accommodates',
 'bathrooms',
 'bedrooms',
 'beds',
 'bed_type',
 'minimum_nights',
 'number_of_reviews',
 'review_scores_rating',
 'review_scores_accuracy',
 'review_scores_cleanliness',
 'review_scores_checkin',
 'review_scores_communication',
 'review_scores_location',
 'review_scores_value',
 'price',
 'bedrooms_na',
 'bathrooms_na',
 'beds_na',
 'review_scores_rating_na',
 'review_scores_accuracy_na',
 'review_scores_cleanliness_na',
 'review_scores_checkin_na',
 'review_scores_communication_na',
 'review_scores_location_na',
 'review_scores_value_na',
 'log_price',
 'host_is_superhost_Index',
 'cancellation_policy_Index',
 'instant_bookable_Index',
 'neighbourhood_cleansed_Index',
 'property_type_Index',
 'room_type_Index',
 'bed_type_Index',
 'property_type_OHE',
 'room_type_OHE',
 'instant_bookabl

### Exponentiate

In [206]:
from pyspark.sql.functions import col, exp
from pyspark.ml.evaluation import RegressionEvaluator

In [210]:
log_predDF.withColumn('prediction', exp(col('log_pred'))).select('price', 'prediction', 'log_pred').show()

+-----+------------------+------------------+
|price|        prediction|          log_pred|
+-----+------------------+------------------+
|130.0| 64.90612595675911| 4.172942009960593|
| 85.0|103.75846825102599| 4.642065777476574|
| 95.0|117.26581005219721| 4.764443238766688|
|128.0| 47.53748348391692|3.8615185258215945|
|250.0|107.07401774424223| 4.673520349929049|
| 95.0|159.17878797836354| 5.070028023190929|
|105.0|110.26373784158828| 4.702875112837148|
|125.0|123.76142608947094|4.8183557292225885|
|405.0| 378.0789841136678|5.9351031264736775|
| 72.0|135.11715474504877| 4.906142215032787|
|150.0| 161.5131508284563| 5.084586568625895|
|450.0| 147.4394921654137| 4.993417869008198|
|165.0|243.79652313459525| 5.496333955807643|
| 85.0|123.53302787706579|4.8165085525173765|
|100.0|119.54310069714214| 4.783676981633761|
|100.0|109.49555077883437| 4.695883916274283|
| 57.0|115.30904518886132| 4.747615873364225|
| 99.0|185.54644286828037| 5.223305216577927|
|165.0| 281.4170711500708| 5.63983

In [211]:
expDF = log_predDF.withColumn('prediction', exp(col('log_pred')))

In [218]:
log_regr_eval = RegressionEvaluator(labelCol='price', predictionCol='prediction')
log_rmse = log_regr_eval.setMetricName('rmse').evaluate(expDF)
log_r2 = log_regr_eval.setMetricName('r2').evaluate(expDF)

print(f'RMSE: {rmse:.2f}')
print(f'r2: {r2:.2f}')
print()
print(f'log RMSE: {log_rmse:.2f}')
print(f'log r2: {log_r2:.2f}')

RMSE: 286.20
r2: 0.15

log RMSE: 278.56
log r2: 0.20


Note: prices are log-normally distributed (the log of the prices follows a normal distribution).

Building a model to predict log prices and then exponentiating to recover actual prices results in a lower RMSE and a higher $R^2$.
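
The exponentiation step can be checked in plain Python with two `log_pred` values copied from the table above. Here `math.exp` stands in for the Spark `exp` column function; this is an illustrative sketch, not part of the pipeline.

```python
import math

# (price, log_pred) pairs copied from the first and ninth rows of the table above
rows = [(130.0, 4.172942009960593), (405.0, 5.9351031264736775)]

# Undo the log transform: prediction = exp(log_pred)
predictions = [math.exp(log_pred) for _, log_pred in rows]

for (price, _), prediction in zip(rows, predictions):
    print(f"price={price:.1f}  prediction={prediction:.2f}")
```

Because `exp` is always positive, a model trained on log prices cannot produce a negative price, unlike the linear-scale model above, which predicted values such as -50.08 for some listings.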

## Save the model

In [219]:
pipelinePath = './lr-pipeline-model'
(pipelineModel
 .write()
 .overwrite()
 .save(pipelinePath))

## Load the model

When loading you need to specify the type of model you are loading (e.g. `LinearRegressionModel` or `LogisticRegressionModel`).

If you always put transformers/estimators in a `Pipeline`, then you'll always load a `PipelineModel`.

In [220]:
from pyspark.ml import PipelineModel

savedPipelineModel = PipelineModel.load(pipelinePath)

In [221]:
pred_df_saved = savedPipelineModel.transform(testDF)

In [225]:
regressionEvaluator.setMetricName('rmse').evaluate(pred_df_saved)

286.19910175951

In [228]:
print(f'rmse: {regressionEvaluator.setMetricName("rmse").evaluate(pred_df_saved)}')
print(f'R2: {regressionEvaluator.setMetricName("r2").evaluate(pred_df_saved)}')

rmse: 286.19910175951
R2: 0.15360837656049942


# Tree Based Models