<a href="https://colab.research.google.com/github/dalgual/aidatasci/blob/main/airbnbPriceRF_ray.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="http://www.calstatela.edu/centers/hipic"><img align="left" src="https://avatars2.githubusercontent.com/u/4156894?v=3&s=100"><image/>
</a>
<img align="right" alt="California State University, Los Angeles" src="http://www.calstatela.edu/sites/default/files/groups/California%20State%20University%2C%20Los%20Angeles/master_logo_full_color_horizontal_centered.svg" style="width: 360px;"/>

#    CIS5560 Term Project Tutorial

------
#### Authors: Samyuktha Muralidharan, Sanjana Boddireddy, Savita Yadav, Farnood Rahbar Far

#### Instructor: [Jongwook Woo](https://www.linkedin.com/in/jongwook-woo-7081a85)

#### Date: 05/23/2021

## Objective
**Airbnb** is an online marketplace that connects people who want to rent out their homes with people who are looking for accommodations in that locale. One challenge that Airbnb hosts face is determining the **optimal nightly rent price**. The amount a host can charge per night is closely linked to the dynamics of the marketplace. The objective of this tutorial is to build a machine learning model that predicts the price of a property from the features of its listing. We use a **Regression model** for price prediction; the code below trains a **Random Forest Regression** model (the original **Gradient Boosted Tree Regression** estimator is kept as a commented-out alternative). We also evaluate the model's performance to determine how well it predicts the Airbnb listing price.

## Import Spark SQL and Spark ML Libraries
Import the Spark SQL and Spark ML libraries listed below. This is necessary to access the functions available in those libraries.

In [None]:
import numpy as np
from time import time
from decimal import Decimal
import pyspark.pandas as ps

# XGBoost on ray is needed to run this example.
# Please refer to https://docs.ray.io/en/latest/xgboost-ray.html to install it.
import ray
from xgboost_ray import RayXGBClassifier, RayDMatrix, train, RayParams, predict
import raydp
from raydp.utils import random_split
from raydp.spark import RayMLDataset
#from ray import tune

# data_process.py at the same directory
#from data_process import nyc_taxi_preprocess, NYC_TRAIN_CSV
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

In [None]:
# Import Spark SQL and Spark ML libraries
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.storagelevel import StorageLevel

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler,StringIndexer, VectorIndexer, MinMaxScaler
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, TrainValidationSplit
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import GBTRegressor, RandomForestRegressionModel, RandomForestRegressor

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

## To run the code in PySpark CLI
Set the following to True:
```
PYSPARK_CLI = True
```
Generate a .py (Python) file from Databricks: File > Export > Source File

Run it on the Hadoop/Spark cluster:
```
$ spark-submit "Random Forest Regression.py"
```

In [None]:
PYSPARK_CLI = False
if PYSPARK_CLI:
    sc = SparkContext.getOrCreate()
    spark = SparkSession(sc)

## Read the csv file from DBFS (Databricks File System)
The file **'airbnb_sample.csv'** contains Airbnb listings and their features. The label column is **'Price'**, the price per night of the Airbnb property. Locate the data file, specify its type, and read it as a PySpark DataFrame.

The URL of the sampled file: **https://www.kaggle.com/samyukthamurali/airbnb-ratings-dataset?select=airbnb_sample.csv**. You can download the sampled file from this URL and upload it to DBFS.

In [None]:
# Auxiliary functions
def equivalent_type(f):
    # Map a pandas/NumPy dtype to the corresponding Spark SQL type
    if f == 'datetime64[ns]': return TimestampType()
    elif f == 'int64': return LongType()
    elif f == 'int32': return IntegerType()
    elif f == 'float64': return DoubleType()  # float64 is double precision
    else: return StringType()



In [None]:
def define_structure(string, format_type):
    # Build a StructField for one column, falling back to StringType on failure
    try: typo = equivalent_type(format_type)
    except Exception: typo = StringType()
    return StructField(string, typo)


In [None]:
# Given a pandas DataFrame, return a Spark DataFrame with a matching schema.
# Performance: 100MB/10 partitions
def pandas_to_spark(pandas_df, num_partition):
    columns = list(pandas_df.columns)
    types = list(pandas_df.dtypes)
    struct_list = []
    for column, typo in zip(columns, types):
        struct_list.append(define_structure(column, typo))
    p_schema = StructType(struct_list)
    return spark.createDataFrame(pandas_df, p_schema).repartition(num_partition)
    #return sqlContext.createDataFrame(pandas_df, p_schema)

In [None]:
# shutdown before connect to ray cluster
# ray.shutdown()
# ray.init(address='auto')

# for the host/master node only
# ray.init(num_cpus=2)

# for the existing cluster
ray.init(address='auto')


2022-04-08 00:48:34,503	INFO worker.py:861 -- Connecting to existing Ray cluster at address: 172.30.0.26:6379


{'node_ip_address': '172.30.0.26',
 'raylet_ip_address': '172.30.0.26',
 'redis_address': None,
 'object_store_address': '/tmp/ray/session_2022-04-07_19-01-43_954995_4123/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2022-04-07_19-01-43_954995_4123/sockets/raylet',
 'webui_url': '127.0.0.1:8265',
 'session_dir': '/tmp/ray/session_2022-04-07_19-01-43_954995_4123',
 'metrics_export_port': 38615,
 'gcs_address': '172.30.0.26:6379',
 'address': '172.30.0.26:6379',
 'node_id': '88951fc54095205b29f9c860414d6511bd041dac8fba944047078d03'}

In [None]:
# After ray.init, you can use the raydp api to get a spark session
#g4dn.2xlarge, 1 GPU, 8 vCPUs, 32 GiB of memory, 225 NVMe SSD, up to 25 Gbps network performance
app_name = "Airbnb Rating with GBT RayDP"
num_executors = 1 #3 #1  3 min_workers, 5 min_workers,
cores_per_executor = 1 #4, 2, 1
memory_per_executor = "2GB" #"6GB" # "2GB" "1GB" "500M"

In [None]:
spark = raydp.init_spark(app_name, num_executors, cores_per_executor, memory_per_executor)


[2m[36m(RayDPSparkMaster pid=18838)[0m 2022-04-08 00:48:37,929 WARN NativeCodeLoader [Thread-2]: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[2m[36m(RayDPSparkMaster pid=18838)[0m 2022-04-08 00:48:37,987 INFO SecurityManager [Thread-2]: Changing view acls to: ubuntu
[2m[36m(RayDPSparkMaster pid=18838)[0m 2022-04-08 00:48:37,987 INFO SecurityManager [Thread-2]: Changing modify acls to: ubuntu
[2m[36m(RayDPSparkMaster pid=18838)[0m 2022-04-08 00:48:37,988 INFO SecurityManager [Thread-2]: Changing view acls groups to: 
[2m[36m(RayDPSparkMaster pid=18838)[0m 2022-04-08 00:48:37,988 INFO SecurityManager [Thread-2]: Changing modify acls groups to: 
[2m[36m(RayDPSparkMaster pid=18838)[0m 2022-04-08 00:48:37,988 INFO SecurityManager [Thread-2]: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(ubuntu); groups with view permissions: Set(); users  with modify permissions: Set(u

In [None]:
start = time()

IS_S3 = False #True False
file_name = 'airbnb_listings.csv' #'airbnb-listings.csv' 'airbnb_US.csv' #'airbnb_sample.csv' 'airbnb-listings.csv'
if IS_S3:
    #aws s3
    file_location = "s3://bigdai-pub/" + file_name
    phrase = "S3 Data Read Time: "
    #file_location = "s3://bigdai-pub/airbnb_US.csv"
    #file_s3 = "s3://bigdai-pub/airbnb-listings.csv"
else:
    phrase = "Local Data Read Time: "
    # local at the same directory of this code
    file_location = "./" + file_name

## Create a temporary view of the dataframe 'df'

In [None]:
import pandas as pd
import pyspark.pandas as ps

if IS_S3:
    '''
    data = ray.data.read_csv(file_location).option("header", "true") \
        .option("inferSchema", "true").option("strings_can_be_null", "true") # (NYC_TRAIN_CSV)
'''
    # [10] col_types: NULL (the default) to infer types from the data.
    #data = ray.data.read_csv(file_location)
    #data = ray.data.read_csv(file_location).option(arrow_csv_args: {"strings_can_be_null": True})
    #data = ray.data.read_csv(file_location).option({"strings_can_be_null": True})

    f1 = file_location
    #file_location = "s3://bigdai-pub/splits"
    #f1=file_location+"/airbnb_US_x00.csv"
    #f2=file_location+"/airbnb_US_x01.csv"
    #f3=file_location+"/airbnb_US_x02.csv"
    #pandasDF = pd.read_csv([f1, f2, f3], on_bad_lines='skip') #, na_filter=False) #, na_values="null") #
    # panda to read
    pandasDF = pd.read_csv(f1, on_bad_lines='skip') #, na_filter=False) #, na_values="null") #
    csv=pandas_to_spark(pandasDF, 10) #spark.createDataFrame(pandasDF)

    # dask to read
    '''ddf = dd.read_csv(f1, on_bad_lines='skip', dtype=dtypes)
    #data = pandas_to_dask(ddf)
    ddf.compute(scheduler='processes')'''

    # Pyspark Read: Hadoop no class error
    '''file_location = "s3a://bigdai-pub/splits"
    f1=file_location+"/airbnb_US_x00.csv"
    data = ps.read_csv(f1, on_bad_lines='skip')'''

    #     # Pyspark Read: Hadoop no class error
    '''data = spark.read.format("csv").option("header", "true") \
        .option("inferSchema", "true") \
        .load(file_location)'''
else: # airbnb-listing.csv # comment if for airbnb_US.csv and airbnb_sample.csv
    csv = spark.read.format("csv").option("header", "true") \
        .option("inferSchema", "true") \
        .option("delimiter", ";") \
        .load(file_location)

end = time()
print('{} takes {} seconds'.format(phrase, (end - start))) #round(end - start, 2)))


[2m[33m(raylet)[0m log4j:WARN No appenders could be found for logger (org.apache.htrace.core.Tracer).
[2m[33m(raylet)[0m log4j:WARN Please initialize the log4j system properly.
[2m[33m(raylet)[0m log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.


Local Data Read Time:  takes 17.53439736366272 seconds


In [None]:
# Set spark timezone for processing datetime
spark.conf.set("spark.sql.session.timeZone", "UTC")

In [None]:
csv.printSchema()

root
 |-- id: string (nullable = true)
 |-- listing_url: string (nullable = true)
 |-- scrape_id: string (nullable = true)
 |-- last_scraped: string (nullable = true)
 |-- name: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- space: string (nullable = true)
 |-- description: string (nullable = true)
 |-- experiences_offered: string (nullable = true)
 |-- neighborhood_overview: string (nullable = true)
 |-- notes: string (nullable = true)
 |-- transit: string (nullable = true)
 |-- access: string (nullable = true)
 |-- interaction: string (nullable = true)
 |-- house_rules: string (nullable = true)
 |-- thumbnail_url: string (nullable = true)
 |-- medium_url: string (nullable = true)
 |-- picture_url: string (nullable = true)
 |-- xl_picture_url: string (nullable = true)
 |-- host_id: string (nullable = true)
 |-- host_url: string (nullable = true)
 |-- host_name: string (nullable = true)
 |-- host_since: string (nullable = true)
 |-- host_location: string (nullable

## Selecting features
In the following step, we are selecting the features that are useful for Price Prediction. We also select the **'Price'** column which will be the label the model will predict.

In [None]:
phrase= "data engineering time: "
start = time()

In [None]:

csv = csv.withColumn("review_scores_rating", when(col("review_scores_rating") >= 80,1).otherwise(0))
csv = csv.withColumn("host_response_rate", csv["host_response_rate"].cast(IntegerType()))
csv = csv.withColumn("host_listings_count", csv["host_listings_count"].cast(IntegerType()))
csv = csv.withColumn("host_total_listings_count", csv["host_total_listings_count"].cast(IntegerType()))
csv = csv.withColumn("price", csv["price"].cast(IntegerType()))
csv = csv.withColumn("weekly_price", csv["weekly_price"].cast(IntegerType()))
csv = csv.withColumn("monthly_price", csv["monthly_price"].cast(IntegerType()))

csv = csv.withColumn("maximum_nights", csv["maximum_nights"].cast(IntegerType()))
csv = csv.withColumn("review_scores_accuracy", csv["review_scores_accuracy"].cast(IntegerType()))
csv = csv.withColumn("review_scores_cleanliness", csv["review_scores_cleanliness"].cast(IntegerType()))
csv = csv.withColumn("review_scores_checkin", csv["review_scores_checkin"].cast(IntegerType()))
csv = csv.withColumn("review_scores_communication", csv["review_scores_communication"].cast(IntegerType()))
csv = csv.withColumn("review_scores_location", csv["review_scores_location"].cast(IntegerType()))

csv = csv.withColumn("review_scores_value", csv["review_scores_value"].cast(IntegerType()))
csv = csv.withColumn("calculated_host_listings_count", csv["calculated_host_listings_count"].cast(IntegerType()))
csv = csv.withColumn("bedrooms", csv["bedrooms"].cast(IntegerType()))
csv = csv.withColumn("bathrooms", csv["bathrooms"].cast(IntegerType()))
csv = csv.withColumn("beds", csv["beds"].cast(IntegerType()))
csv = csv.withColumn("security_deposit", csv["security_deposit"].cast(IntegerType()))

csv = csv.withColumn("host_acceptance_rate", csv["host_acceptance_rate"].cast(IntegerType()))
csv = csv.withColumn("cleaning_fee", csv["cleaning_fee"].cast(IntegerType()))
csv = csv.withColumn("extra_people", csv["extra_people"].cast(IntegerType()))
csv = csv.withColumn("minimum_nights", csv["minimum_nights"].cast(IntegerType()))

csv.show(5)


+--------+--------------------+--------------+------------+--------------------+--------------------+--------------------+--------------------+-------------------+---------------------+-----+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------+--------------------+---------+----------+--------------------+--------------------+------------------+------------------+--------------------+--------------------+--------------------+------------------+-------------------+-------------------------+--------------------+--------------------+-------------+----------------------+----------------------------+-------------+--------------------+-------+----------+--------------------+------------+-------+------------------+------------------+-------------+---------------+------------+---------+--------+----+--------+--------------------+-----------+-----+------------+-------------+-------

In [None]:
# Select features and label
#data =csv.select('Host Listings Count','Host Total Listings Count',"Neighborhood","Property Type","Room Type","Bed Type","Latitude","Longitude","Accomodates","Bathrooms","Bedrooms","Monthly Price","Guests Included","Extra People","Minimum Nights","Review Scores Rating","Review Scores Accuracy","Review Scores Cleanliness","Review Scores Checkin","Review Scores Communication","Review Scores Location","Review Scores Value","Sentiment",col("Price").cast("Int").alias("label"))
# jwoo
#csv = spark.sql("SELECT * FROM airbnb_sample_csv")
#airbnb_sample.csv
#csv=data
# airbnb_US has all the columns, while we need only the following columns
data = csv.select("review_scores_rating", "host_listings_count", "host_total_listings_count", "calculated_host_listings_count", "security_deposit", "cleaning_fee" , "host_response_time","host_response_rate","host_acceptance_rate","property_type","room_type","bed_type", "weekly_price","monthly_price","maximum_nights","review_scores_accuracy","review_scores_cleanliness","review_scores_checkin","review_scores_communication","review_scores_location","review_scores_value","cancellation_policy","bedrooms","bathrooms","beds","extra_people","minimum_nights",col("price").cast("Int").alias("label"))

#csv.show(5)
data.show(5)


+--------------------+-------------------+-------------------------+------------------------------+----------------+------------+------------------+------------------+--------------------+-------------+---------------+--------+------------+-------------+--------------+----------------------+-------------------------+---------------------+---------------------------+----------------------+-------------------+-------------------+--------+---------+----+------------+--------------+-----+
|review_scores_rating|host_listings_count|host_total_listings_count|calculated_host_listings_count|security_deposit|cleaning_fee|host_response_time|host_response_rate|host_acceptance_rate|property_type|      room_type|bed_type|weekly_price|monthly_price|maximum_nights|review_scores_accuracy|review_scores_cleanliness|review_scores_checkin|review_scores_communication|review_scores_location|review_scores_value|cancellation_policy|bedrooms|bathrooms|beds|extra_people|minimum_nights|label|
+-------------------

In [None]:
print(data.count())

914210


In [None]:
# Keep only the rows whose property type is in the list of valid values
property_list = ["Apartment","House","Bed & Breakfast","Condominium","Loft", "Townhouse","Other","Villa", "Guesthouse", "Bungalow", "Dorm", "Boat", "Cabin", "Chalet", "Boutique hotel", "Serviced apartment", "Hostel", "Camper/RV", "Timeshare", "Guest suite", "Tent", "Vacation home", "Castle", "Treehouse", "In-law", "Earth House", "Hut", "Yurt", "Entire Floor", "Tipi", "Nature lodge", "Cave", "Lighthouse", "Casa particular", "Train", "Island", "Igloo", "Parking Space", "Pension (Korea)", "Ryokan (Japan)", "Car", "Heritage hotel (India)", "Plane", "Van"
]

data = data.filter(data.property_type.isin(property_list))

In [None]:
#csv.persist(pyspark.StorageLevel.MEMORY_AND_DISK_2)
#csv.persist(pyspark.StorageLevel.OFF_HEAP)
data.persist(StorageLevel.DISK_ONLY_2)

#jwoo
#data = csv

DataFrame[review_scores_rating: int, host_listings_count: int, host_total_listings_count: int, calculated_host_listings_count: int, security_deposit: int, cleaning_fee: int, host_response_time: string, host_response_rate: int, host_acceptance_rate: int, property_type: string, room_type: string, bed_type: string, weekly_price: int, monthly_price: int, maximum_nights: int, review_scores_accuracy: int, review_scores_cleanliness: int, review_scores_checkin: int, review_scores_communication: int, review_scores_location: int, review_scores_value: int, cancellation_policy: string, bedrooms: int, bathrooms: int, beds: int, extra_people: int, minimum_nights: int, label: int]

## Data Cleaning
Data cleaning is critical to the success of a machine learning model.

**Detecting and Removing Outliers:** We determine the **5th** and **95th** percentile values of each numeric feature, then filter the dataframe to keep only the rows between these values. In this version of the notebook the `approxQuantile()` calls are commented out, so the cell below only displays the data.
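
As a rough illustration of the percentile filter described above, the following plain-Python sketch keeps only the rows whose value lies between the 5th and 95th percentiles. It is not the notebook's code: `percentile` and `trim_outliers` are hypothetical helpers, the prices are made up, and the nearest-rank rule used here is only one of several percentile definitions, so Spark's `approxQuantile()` may pick slightly different cut-offs.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, with q in (0, 1]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

def trim_outliers(rows, column, low=0.05, high=0.95):
    """Keep rows whose `column` value lies within the [low, high] percentiles."""
    values = [row[column] for row in rows]
    lower, upper = percentile(values, low), percentile(values, high)
    return [row for row in rows if lower <= row[column] <= upper]

# Hypothetical nightly prices: one extreme value at each end of the range.
prices = [1] + list(range(40, 131, 5)) + [5000]
rows = [{"label": p} for p in prices]
kept = trim_outliers(rows, "label")
print(len(rows), "rows before,", len(kept), "rows after trimming")
```

In the notebook itself the same idea would be applied per numeric column of `data`, using the cut-offs returned by `approxQuantile()`.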

In [None]:
# approxQuantile() to determine the 5th and 95th percentile values
# outliers = data.stat.approxQuantile(['label','Host Listings Count','Host Total Listings Count',"Accomodates","Bathrooms","Bedrooms","Monthly Price","Guests Included","Extra People","Minimum Nights"],[0.05,0.95],0.0)
# outliers = data.stat.approxQuantile(['label','Host Listings Count','Host Total Listings Count', "Bathrooms","Bedrooms","Monthly Price","Guests Included","Extra People","Minimum Nights"],[0.05,0.95],0.0)



data.show(5)

+--------------------+-------------------+-------------------------+------------------------------+----------------+------------+------------------+------------------+--------------------+-------------+---------------+--------+------------+-------------+--------------+----------------------+-------------------------+---------------------+---------------------------+----------------------+-------------------+-------------------+--------+---------+----+------------+--------------+-----+
|review_scores_rating|host_listings_count|host_total_listings_count|calculated_host_listings_count|security_deposit|cleaning_fee|host_response_time|host_response_rate|host_acceptance_rate|property_type|      room_type|bed_type|weekly_price|monthly_price|maximum_nights|review_scores_accuracy|review_scores_cleanliness|review_scores_checkin|review_scores_communication|review_scores_location|review_scores_value|cancellation_policy|bedrooms|bathrooms|beds|extra_people|minimum_nights|label|
+-------------------

## Data Cleaning
**Handling Missing Values:** Filling the missing values of numeric columns with **'0'** and string columns with **'NA'**
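
The type-dependent filling described above can be pictured with a small plain-Python sketch. It assumes a hypothetical two-column schema and a hypothetical `fill_missing` helper; the point is only that a numeric fill value touches numeric columns and a string fill value touches string columns, which is how the two chained `na.fill()` calls in the next cell divide the work.

```python
def fill_missing(rows, schema, numeric_fill=0, string_fill="NA"):
    """Replace None with a fill value chosen by the column's declared type."""
    filled = []
    for row in rows:
        new_row = {}
        for name, kind in schema.items():
            value = row.get(name)
            if value is None:
                value = numeric_fill if kind == "int" else string_fill
            new_row[name] = value
        filled.append(new_row)
    return filled

# Hypothetical schema and rows, for illustration only.
schema = {"bedrooms": "int", "bed_type": "str"}
rows = [
    {"bedrooms": None, "bed_type": "Real Bed"},
    {"bedrooms": 2, "bed_type": None},
]
cleaned = fill_missing(rows, schema)
print(cleaned)
```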

In [None]:
# Replacing missing values with '0' and 'NA' for numeric columns and string columns respectively
data_clean = data.na.fill(value=0).na.fill("NA")
data_clean.show(5)


+--------------------+-------------------+-------------------------+------------------------------+----------------+------------+------------------+------------------+--------------------+-------------+---------------+--------+------------+-------------+--------------+----------------------+-------------------------+---------------------+---------------------------+----------------------+-------------------+-------------------+--------+---------+----+------------+--------------+-----+
|review_scores_rating|host_listings_count|host_total_listings_count|calculated_host_listings_count|security_deposit|cleaning_fee|host_response_time|host_response_rate|host_acceptance_rate|property_type|      room_type|bed_type|weekly_price|monthly_price|maximum_nights|review_scores_accuracy|review_scores_cleanliness|review_scores_checkin|review_scores_communication|review_scores_location|review_scores_value|cancellation_policy|bedrooms|bathrooms|beds|extra_people|minimum_nights|label|
+-------------------

## Correlation
Correlation determines how one variable changes in relation with the other variable. It gives us an idea about the degree of the relationship of the two variables. Determine the correlation of the label **'price'** with the features of the data indicating the **dependence** between the label and each of the features. We can iteratively try to remove the features with less correlation to improve model performance.
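
For readers who want to see what the correlation number measures, here is a minimal plain-Python sketch of the Pearson coefficient, which is the measure `DataFrameStatFunctions.corr` computes by default. The `pearson` helper and the sample values are hypothetical and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: price rises linearly with bedrooms,
# but has only a weak, negative relation to the review score.
price = [50, 100, 150, 200]
bedrooms = [1, 2, 3, 4]
review_score = [10, 9, 10, 9]

r_bedrooms = pearson(bedrooms, price)
r_review = pearson(review_score, price)
print(r_bedrooms, r_review)
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear dependence between the feature and the label.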

In [None]:
import six

#df_Corr=data_clean.select("Host Listings Count","Host Total Listings Count","Neighborhood",	"Latitude","Longitude","Property Type","Room Type","Bed Type","Accomodates","Bathrooms","Bedrooms",	"Monthly Price","Guests Included","Extra People","Minimum Nights","Review Scores Rating","Review Scores Accuracy","Review Scores Cleanliness","Review Scores Checkin","Review Scores Communication","Review Scores Location","Review Scores Value","Sentiment","label")
#jwoo
df_Corr=data_clean.select("host_listings_count","host_total_listings_count","property_type","room_type", "bed_type", "beds", "bathrooms","bedrooms", "monthly_price", "extra_people","minimum_nights","review_scores_rating","review_scores_accuracy","review_scores_cleanliness","review_scores_checkin","review_scores_communication","review_scores_location","review_scores_value", "label")
# Determining correlation using DataFrameStatFunctions.corr (numeric columns only)
for name, dtype in df_Corr.dtypes:
    if dtype != "string":
        print("Correlation to PRICE for ", name, df_Corr.stat.corr('label', name))

Correlation to PRICE for  host_listings_count 0.04027415372210895
Correlation to PRICE for  host_total_listings_count 0.04027415372210895
Correlation to PRICE for  beds 0.2667896213673431
Correlation to PRICE for  bathrooms 0.2153421395455276
Correlation to PRICE for  bedrooms 0.327902693901976
Correlation to PRICE for  monthly_price 0.13732364833838634
Correlation to PRICE for  extra_people 0.197763106643986
Correlation to PRICE for  minimum_nights 0.0008857971743691394
Correlation to PRICE for  review_scores_rating -0.018744816261029598
Correlation to PRICE for  review_scores_accuracy -0.025789155731662895
Correlation to PRICE for  review_scores_cleanliness -0.021992965383685763
Correlation to PRICE for  review_scores_checkin -0.027763330521278922
Correlation to PRICE for  review_scores_communication -0.02742904891380971
Correlation to PRICE for  review_scores_location -0.021258931791208094
Correlation to PRICE for  review_scores_value -0.02954309263075418
Correlation to PRICE for  l

## Feature Transformation
Convert the string type columns into indices using StringIndexer
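
The indexing step can be sketched in plain Python to show what kind of mapping is produced. This is an illustration under assumptions, not Spark's implementation: `fit_string_index` and `transform` are hypothetical helpers that mimic the default frequency-descending ordering (the most frequent label gets index 0) and the `"keep"` and `"skip"` options for labels that were not seen during fitting.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def transform(values, index, handle_invalid="keep"):
    """Apply the mapping; unseen labels go to an extra bucket or are dropped."""
    out = []
    for value in values:
        if value in index:
            out.append(index[value])
        elif handle_invalid == "keep":
            out.append(float(len(index)))  # extra bucket for unseen labels
        elif handle_invalid == "skip":
            continue  # drop the row
        else:
            raise ValueError(f"unseen label: {value!r}")
    return out

room_types = ["Entire home/apt", "Private room", "Entire home/apt",
              "Shared room", "Entire home/apt", "Private room"]
index = fit_string_index(room_types)
kept = transform(["Private room", "Tent"], index, handle_invalid="keep")
skipped = transform(["Private room", "Tent"], index, handle_invalid="skip")
print(index, kept, skipped)
```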

In [None]:
# Converting the String type columns into indices
data_clean = StringIndexer(inputCol='host_response_time', outputCol='host_response_time_index').setHandleInvalid("skip").fit(data_clean).transform(data_clean)
data_clean = StringIndexer(inputCol='cancellation_policy', outputCol='cancellation_policy_index').setHandleInvalid("skip").fit(data_clean).transform(data_clean)

data_clean = StringIndexer(inputCol='property_type', outputCol='property_type_index').setHandleInvalid("keep").fit(data_clean).transform(data_clean)
data_clean= StringIndexer(inputCol='room_type', outputCol='room_type_index').setHandleInvalid("keep").fit(data_clean).transform(data_clean)
data_clean = StringIndexer(inputCol='bed_type', outputCol='bed_type_index').setHandleInvalid("keep").fit(data_clean).transform(data_clean)

data_clean = StringIndexer(inputCol="review_scores_rating", outputCol='review_scores_rating_index').fit(data_clean).transform(data_clean)

data_clean.show(5)


+--------------------+-------------------+-------------------------+------------------------------+----------------+------------+------------------+------------------+--------------------+-------------+---------------+--------+------------+-------------+--------------+----------------------+-------------------------+---------------------+---------------------------+----------------------+-------------------+-------------------+--------+---------+----+------------+--------------+-----+------------------------+-------------------------+-------------------+---------------+--------------+--------------------------+
|review_scores_rating|host_listings_count|host_total_listings_count|calculated_host_listings_count|security_deposit|cleaning_fee|host_response_time|host_response_rate|host_acceptance_rate|property_type|      room_type|bed_type|weekly_price|monthly_price|maximum_nights|review_scores_accuracy|review_scores_cleanliness|review_scores_checkin|review_scores_communication|review_scores

## Split the Data
It is common practice when building supervised machine learning models to split the source data, using some of it to train the model and reserving some to test the trained model. Here we split the data into train data and test data. We have split the data in the ratio of **70:30**
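
Because the split is random, the row counts and metrics change from run to run unless a seed is fixed; `randomSplit` accepts an optional `seed` argument for this purpose. The plain-Python sketch below shows the idea with a hypothetical `random_split` helper and made-up row ids; it is not how Spark samples rows internally.

```python
import random

def random_split(rows, train_fraction=0.7, seed=42):
    """Assign each row to train or test with a seeded random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))  # hypothetical row ids
train_a, test_a = random_split(rows)
train_b, test_b = random_split(rows)  # same seed, same split
print(len(train_a), len(test_a))
```

As with `randomSplit`, the 70:30 ratio is only approximate: each row is assigned independently, so the exact counts vary around the target.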

In [None]:
# Split the data for gradient boosted tree regression
splits = data_clean.randomSplit([0.7, 0.3])
train = splits[0]
test = splits[1].withColumnRenamed("label", "trueLabel")

print ("Training Rows:", train.count(), " Testing Rows:", test.count())

Training Rows: 250749  Testing Rows: 107620


## Define the Pipeline
Define a pipeline for feature transformation and model training. It creates a feature vector and trains a regression model using the following stages:
1. A **VectorAssembler** that combines the categorical features into a single vector.
2. A **VectorIndexer** that creates indices for the vector of categorical features.
3. A **VectorAssembler** that creates a vector of continuous numeric features.
4. A **MinMaxScaler** that normalizes the continuous numeric features.
5. A **VectorAssembler** that combines the categorical and continuous features into one vector.
6. A **RandomForestRegressor** that trains the regression model (it is assigned to the variable `gbt`; the original **GBTRegressor** is left commented out).
7. A **Pipeline** that chains the stages above.
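
The normalization in step 4 can be illustrated with a short plain-Python sketch of min-max rescaling to the range [0, 1]. The `min_max_scale` helper and the sample values are hypothetical; the constant-column rule (mapping to the middle of the output range) follows the behaviour documented for Spark's `MinMaxScaler`.

```python
def min_max_scale(values, out_min=0.0, out_max=1.0):
    """Linearly rescale values so the smallest maps to out_min and the largest to out_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map it to the middle of the range.
        return [0.5 * (out_min + out_max)] * len(values)
    span = hi - lo
    return [(v - lo) / span * (out_max - out_min) + out_min for v in values]

bedrooms = [0, 1, 2, 4, 8]  # hypothetical values
scaled = min_max_scale(bedrooms)
print(scaled)
```

Rescaling keeps the ordering of the values and puts features with very different ranges, such as `bedrooms` and `monthly_price`, on a common scale.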

In [None]:
# Combine Categorical features into a single vector
#jwoo
catVect = VectorAssembler(inputCols =['host_response_time_index', 'cancellation_policy_index', 'property_type_index', 'room_type_index', 'bed_type_index', 'review_scores_rating_index'], outputCol="catFeatures")
#catVect = VectorAssembler(inputCols =['property_type_index', 'room_type_index', 'bed_type_index'], outputCol="catFeatures")

# Create indices for the vector of categorical features
catIdx = VectorIndexer(inputCol = catVect.getOutputCol(), outputCol = "idxCatFeatures").setHandleInvalid("keep")

#Create a vector of the numeric features
#numVect = VectorAssembler(inputCols = ["Host Listings Count","Host Total Listings Count","Latitude","Longitude","Accomodates","Bathrooms","Bedrooms","Monthly Price","Guests Included","Extra People","Minimum Nights","Review Scores Rating","Review Scores Accuracy","Review Scores Cleanliness","Review Scores Checkin","Review Scores Communication","Review Scores Location","Review Scores Value","Sentiment"], outputCol="numFeatures")
#jwoo
numVect = VectorAssembler(inputCols = ["host_listings_count", "host_total_listings_count","bathrooms","bedrooms","monthly_price","minimum_nights","review_scores_rating","review_scores_accuracy","review_scores_cleanliness","review_scores_checkin","review_scores_communication","review_scores_location","review_scores_value"], outputCol="numFeatures")

# Scale the numeric features
minMax = MinMaxScaler(inputCol = numVect.getOutputCol(), outputCol="normFeatures")

#Create a vector of categorical and numeric features
featVect = VectorAssembler(inputCols=["idxCatFeatures", "normFeatures"],  outputCol="features")

# Gradient Boosted Tree Regression model: but categorical feature 0 has 42 values.
# gbt = GBTRegressor(labelCol="label", featuresCol="features", maxBins=50) # maxBins=40) # maxBins=2904 ) #
gbt = RandomForestRegressor(labelCol="label", featuresCol="features", maxBins=50)

# Process the pipeline with the transformations
pipeline = Pipeline(stages=[catVect,catIdx,numVect, minMax,featVect, gbt])


In [None]:
end=time()
print('{} takes {} seconds'.format(phrase, (end - start))) #round(end - start, 2)))

data engineering time:  takes 32.27800273895264 seconds


In [None]:
#data_clean.select('property_type_index').distinct().show()
data_clean.select('property_type').distinct().show()


+------------------+
|     property_type|
+------------------+
|         Apartment|
|         Townhouse|
|   Bed & Breakfast|
|       Earth House|
|       Guest suite|
|         Timeshare|
|               Hut|
|         Camper/RV|
|    Boutique hotel|
|              Loft|
|        Guesthouse|
|            Hostel|
|        Lighthouse|
|             Villa|
|             Other|
|Serviced apartment|
|            In-law|
|      Nature lodge|
|              Dorm|
|       Condominium|
+------------------+
only showing top 20 rows



In [None]:
data_clean.select("room_type").distinct().show()


+---------------+
|      room_type|
+---------------+
|    Shared room|
|Entire home/apt|
|   Private room|
+---------------+



In [None]:
data_clean.select("bed_type").distinct().show()


+-------------+
|     bed_type|
+-------------+
|       Airbed|
|        Futon|
|Pull-out Sofa|
|        Couch|
|     Real Bed|
+-------------+



In [None]:
from pyspark.sql.functions import col, countDistinct

data_clean.agg(countDistinct(col("property_type_index")).alias("count")).show()


+-----+
|count|
+-----+
|   41|
+-----+



In [None]:
data_clean.agg(countDistinct(col("room_type_index")).alias("count")).show()

+-----+
|count|
+-----+
|    3|
+-----+



In [None]:
data_clean.agg(countDistinct(col("bed_type_index")).alias("count")).show()

+-----+
|count|
+-----+
|    5|
+-----+



### Train a Regression model using Parameter Tuning
The **CrossValidator** class evaluates each combination of parameters defined in a **ParamGridBuilder** grid against multiple folds of the data, split into training and validation datasets, in order to find the best-performing parameters. The code below uses **TrainValidationSplit** instead, which evaluates each combination once on a single split with 70% of the rows used for training (`trainRatio=0.7`); the equivalent `CrossValidator` call with `K = 3` folds is left commented out.
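
To make the cost of the search concrete, the grid can be enumerated in plain Python. The values mirror the `maxDepth` and `minInfoGain` lists in the parameter-grid cell that follows; `fits_needed` is a hypothetical helper, and the counts ignore the final refit of the best parameters on the full training set.

```python
from itertools import product

# Same candidate values as the parameter grid in the notebook.
grid = {"maxDepth": [2, 3, 9], "minInfoGain": [0.0, 0.7]}
names = list(grid)
param_maps = [dict(zip(names, combo))
              for combo in product(*(grid[name] for name in names))]

def fits_needed(num_param_maps, num_folds=None):
    """Model fits during the search: one per map, or one per map and fold."""
    per_map = 1 if num_folds is None else num_folds
    return num_param_maps * per_map

tvs_fits = fits_needed(len(param_maps))              # single train/validation split
cv_fits = fits_needed(len(param_maps), num_folds=3)  # 3-fold cross-validation
print(len(param_maps), tvs_fits, cv_fits)
```

A single train/validation split is cheaper than k-fold cross-validation by a factor of k, at the price of a noisier estimate of each parameter combination.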

In [None]:
# Data Science
start=time()

In [None]:
# Defining the parameter grid
paramGrid = (ParamGridBuilder()
            .addGrid(gbt.maxDepth,[2,3,9]) #max_depth default 9 at XGBoost: 29
            .addGrid(gbt.minInfoGain,[0.0, 0.7])
            .build())

# Number of folds
K = 3
#cv = CrossValidator(estimator=pipeline, evaluator=RegressionEvaluator(), estimatorParamMaps=paramGrid, numFolds=K)
cv = TrainValidationSplit(estimator=pipeline, evaluator=RegressionEvaluator(), estimatorParamMaps=paramGrid, trainRatio=0.7)




In [None]:
# Train the model
model = cv.fit(train)

In [None]:
end=time()
print('{} takes {} seconds'.format("Training Time", (end - start))) #round(end - start, 2)))

Training Time takes 135.26247572898865 seconds


### Test the Pipeline Model
The model produced by the pipeline is a transformer that will apply all of the stages in the pipeline to a specified DataFrame and apply the trained model to generate predictions. In this case, we will transform the **test** DataFrame using the pipeline to generate label predictions.

In [None]:
# Transform the test data and generate predictions by applying the trained model
prediction = model.transform(test)
predicted = prediction.select("normFeatures", "prediction", "trueLabel")
predicted.show(5)

+--------------------+------------------+---------+
|        normFeatures|        prediction|trueLabel|
+--------------------+------------------+---------+
|(13,[2,3,5],[0.12...|206.93462620795268|       50|
|(13,[2,4,5],[0.12...|371.59270867201906|      597|
|[0.0,0.0,0.125,0....| 80.44399190815916|      265|
|[0.0,0.0,0.125,0....| 69.33166822705309|       35|
|(13,[2,3,5],[0.12...| 81.75511663886071|       30|
+--------------------+------------------+---------+
only showing top 5 rows



## Evaluate the model
Metrics used for evaluation are **Root Mean Square Error (RMSE)** and **Coefficient of Determination (r2)**. RMSE is measured in the same units as the predicted and actual values, so in this case it indicates the typical size of the difference between predicted and actual prices, with large errors weighted more heavily than small ones. r2 indicates the proportion of the variance in the actual prices that the predictions explain; values closer to 1 mean a better fit. The **RegressionEvaluator** class is used to determine **RMSE** and **r2**.
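
Both metrics are simple to compute by hand, which helps when interpreting the evaluator's output. The sketch below uses hypothetical actual and predicted prices and two hypothetical helpers, `rmse` and `r2`, that follow the standard definitions of the two metrics.

```python
import math

def rmse(actual, predicted):
    """Root of the mean squared difference between actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r2(actual, predicted):
    """1 minus the ratio of residual variation to total variation."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical prices, for illustration only.
actual = [100, 200, 300]
predicted = [110, 190, 330]
rmse_value = rmse(actual, predicted)
r2_value = r2(actual, predicted)
print(rmse_value, r2_value)
```

An r2 of 0 corresponds to always predicting the mean price, so values only slightly above 0 mean the model explains little of the price variation.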

In [None]:
# Evaluator to determine rmse
evaluator = RegressionEvaluator(labelCol="trueLabel", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(prediction)

# Evaluator to determine r2
evaluator = RegressionEvaluator(labelCol="trueLabel", predictionCol="prediction", metricName="r2")
r2 = evaluator.evaluate(prediction)

print ("Root Mean Square Error (RMSE):", rmse)
print ("Co-efficient of Determination (r2)", r2)

Root Mean Square Error (RMSE): 133.67620181664412
Co-efficient of Determination (r2) 0.23155365366222647


In [None]:
raydp.stop_spark()
ray.shutdown()

### Experimental Result:      
#### min_workers: 2, max_workers: 3

1. Experiment 1: (header: g4dn.2xlarge): num_executors = 1, cores_per_executor = 1

| File Size | Location | Workers | No. of actors | Partitions | No. of executors | Cores/executor | Memory/executor | maxDepth | Train (s) | Read (s) | Data Eng (s) | Test RMSE | Test r2 |
| :---    | :----:  | :----: | :----:   | :----:   | :----:   | :----:   | :----:   | :----:   | :----:   | :----:| :----:| :----:   | ---: |
| 1.930 GB | local | 2- 8 | 3 | NA | 1 | 1 | "2GB" | 2,3,9 | 135.26 | 17.53 | 32.28 | 133.68 | 0.2316 |
