# Big Data Analytics Project - Airbnb Pricing Prediction <br>

### Group 63 <br>
André Lourenço - 20240743 <br>
Fábio Dos Santos - 20240678 <br>
Rafael Borges - 20240497 <br>
Rui Reis - 20240854 <br>
Victor Silva - 20240663 

## Table of Contents
- [1. Import Libraries](#1-import-libraries)
- [2. Data Load](#2-data-load)
- [3. Pipeline](#3-pipeline)
- [4. Model Evaluation](#4-model-evaluation)

## 1. Import Libraries

__`Step 1`__ Import necessary libraries.

In [0]:
from pyspark.sql.functions import col, year, to_date, avg, regexp_replace, current_date, datediff, countDistinct, sum, round, when, lit
from pyspark.sql import functions as F
from functools import reduce

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, Transformer
from pyspark.ml.feature import (
    VectorAssembler,
    StringIndexer,
    OneHotEncoder,
    StandardScaler,
    Imputer,
    RobustScaler
)
from pyspark.ml.regression import RandomForestRegressor, LinearRegression, GBTRegressor
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.param.shared import Param
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable

__`Step 2`__ Run the `utils` and `transformers` notebooks that this notebook depends on.

In [0]:
%run "./utils" 



In [0]:
%run "./transformers" 

## 2. Data Load

__`Step 3`__ Import the initially preprocessed data from the EDA notebook.

In [0]:
df = (
    spark.read
         .option("header", "true")      # keep column names
         .option("inferSchema", "true") # let Spark inspect values & choose types
         .option("samplingRatio", "1")  
         .csv("dbfs:/FileStore/tables/Listings_cleaned")
)

display(df.limit(5))

host_is_superhost,host_total_listings_count,host_has_profile_pic,host_identity_verified,neighbourhood,property_type,room_type,accommodates,bedrooms,price,minimum_nights,review_scores_rating,instant_bookable,host_days_active,host_years_active,amenities_length,living_entertainment,kitchen_dining,bedroom,bathroom,baby_family,laundry_cleaning,safety_security,outdoor_garden,heating_cooling,travel_access,wellness_leisure,workspace_tech,guest_services,misc_essentials
0,1,1,0,Buttes-Montmartre,entire_place,Entire place,2,1,53,2,100,0,3354,9,5,0,1,0,0,0,1,0,0,1,1,0,1,0,0
0,1,1,1,Buttes-Montmartre,entire_place,Entire place,2,1,120,2,100,0,2627,7,8,0,1,0,1,0,2,0,0,1,1,0,1,0,1
0,1,1,0,Elysee,entire_place,Entire place,2,1,89,2,100,0,2383,6,6,1,1,0,0,0,1,0,0,1,1,0,1,0,0
0,1,1,1,Vaugirard,entire_place,Entire place,2,1,58,2,100,0,2609,7,5,1,1,0,0,0,0,0,0,1,1,0,1,0,0
0,1,1,0,Passy,entire_place,Entire place,2,1,60,2,100,0,2247,6,12,1,1,0,1,0,2,0,0,1,1,0,1,0,1


In [0]:
df_shape(df)

63750 rows × 30 columns


__`Step 4`__ Check for null values and respective data types.

In [0]:
summarize_nulls_and_dtype(df, only_missing=True)

column                         |    # nulls | dtype
---------------------------------------------------
bedrooms                       |     13,293 | int
review_scores_rating           |     16,199 | int
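
To put these counts in perspective, the short sketch below converts them into a share of the 63,750 rows. The numbers are copied from the outputs above; the variable names are only illustrative. Both columns are missing in a sizeable share of rows, which is presumably why the pipeline imputes them instead of dropping the rows.

```python
# Counts copied from the cell outputs above.
total_rows = 63_750
null_counts = {"bedrooms": 13_293, "review_scores_rating": 16_199}

# Share of rows with a missing value, per column, in percent.
null_share = {name: 100 * n / total_rows for name, n in null_counts.items()}

for name, pct in null_share.items():
    print(f"{name:<25} {pct:.2f}% missing")
```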


## 3. Pipeline  

In [0]:
df.printSchema()

root
 |-- host_is_superhost: integer (nullable = true)
 |-- host_total_listings_count: integer (nullable = true)
 |-- host_has_profile_pic: integer (nullable = true)
 |-- host_identity_verified: integer (nullable = true)
 |-- neighbourhood: string (nullable = true)
 |-- property_type: string (nullable = true)
 |-- room_type: string (nullable = true)
 |-- accommodates: integer (nullable = true)
 |-- bedrooms: integer (nullable = true)
 |-- price: integer (nullable = true)
 |-- minimum_nights: integer (nullable = true)
 |-- review_scores_rating: integer (nullable = true)
 |-- instant_bookable: integer (nullable = true)
 |-- host_days_active: integer (nullable = true)
 |-- host_years_active: integer (nullable = true)
 |-- amenities_length: integer (nullable = true)
 |-- living_entertainment: integer (nullable = true)
 |-- kitchen_dining: integer (nullable = true)
 |-- bedroom: integer (nullable = true)
 |-- bathroom: integer (nullable = true)
 |-- baby_family: integer (nullable = true)
 |

In [0]:
categorical_cols = [
    'property_type', 
    'room_type', 
    'neighbourhood',
]

numeric_cols = [
    'host_total_listings_count',
    'accommodates', 
    'bedrooms', 
    'minimum_nights', 
    'review_scores_rating', 
    'host_days_active',
    'host_years_active',
    'amenities_length', 
    "living_entertainment", 
    "kitchen_dining", 
    "bedroom", 
    "bathroom", 
    "baby_family", 
    "laundry_cleaning", 
    "safety_security", 
    "outdoor_garden", 
    "heating_cooling", 
    "travel_access", 
    "wellness_leisure", 
    "workspace_tech", 
    "guest_services", 
    "misc_essentials",
]

binary_cols = [
    "host_is_superhost",
    "host_has_profile_pic",
    "host_identity_verified",
    "instant_bookable"
]

label_col = "price"


__`Step 5`__ Create and run the pipeline to explore the models and their parameters.

In [0]:
## RANDOM FOREST REGRESSOR PIPELINE

# 1. Train/test split
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

# 2. Build pipeline stages (applied per CV fold)
stages = []

# 2a. Categorical processing
for cat in categorical_cols:
    indexer = StringIndexer(inputCol=cat, outputCol=f"{cat}_idx")
    ohe = OneHotEncoder(inputCol=f"{cat}_idx", outputCol=f"{cat}_ohe")
    stages += [indexer, ohe]

# 2b. Numeric imputation
current_numeric_cols = list(numeric_cols) # Make a copy to modify
processed_cols_map = {col: col for col in numeric_cols}


imputer_input_actual = ["review_scores_rating", "bedrooms"] 
imputer_output_actual = [f"{c}_imputed" for c in imputer_input_actual]
imputer = Imputer(inputCols=imputer_input_actual, outputCols=imputer_output_actual).setStrategy("median")
stages.append(imputer)

# Update dict
for orig_col, imputed_col in zip(imputer_input_actual, imputer_output_actual):
    processed_cols_map[orig_col] = imputed_col

# 2c. Winsorization

winsor_input_original_names = ["bedrooms", "minimum_nights"] # Original names
winsor_input_current_names = [processed_cols_map[c] for c in winsor_input_original_names]
winsor_output = [f"{processed_cols_map[c]}_wins" for c in winsor_input_original_names]

winsor = Winsorizer(
    inputCols=winsor_input_current_names,
    outputCols=winsor_output,
    lowerQuantile=0.05,
    upperQuantile=0.95
)
stages.append(winsor)


# Update dict 
for orig_col, wins_col in zip(winsor_input_original_names, winsor_output):
    processed_cols_map[orig_col] = wins_col

final_numeric_features = list(processed_cols_map.values())

# 2d. Assemble features
onehot_cols  = [f"{c}_ohe" for c in categorical_cols]


# Final feature input list
feature_inputs = final_numeric_features + onehot_cols + binary_cols
assembler = VectorAssembler(inputCols=feature_inputs, outputCol="assembled_features")
stages.append(assembler)

# 3. Estimator: RandomForestRegressor (or swap LinearRegression)
regressor = RandomForestRegressor(featuresCol="assembled_features", labelCol=label_col, seed=42)
stages.append(regressor)

# 4. Pipeline creation
pipeline = Pipeline(stages=stages)
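
The `Winsorizer` stage comes from the `transformers` notebook, which is not shown here. As a rough illustration of the idea only, the plain-Python sketch below clamps values to the 5th and 95th percentiles. The nearest-rank quantile rule and the sample values are assumptions made for this example; the actual transformer may compute quantiles differently (for instance with an approximate method on the Spark DataFrame).

```python
import math

def quantile_nearest_rank(values, q):
    """Return the q-quantile of values using the nearest-rank method."""
    ordered = sorted(values)
    # Nearest rank: the smallest value with at least q * n observations at or below it.
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def winsorize(values, lower_q=0.05, upper_q=0.95):
    """Clamp every value into the range [lower quantile, upper quantile]."""
    lo = quantile_nearest_rank(values, lower_q)
    hi = quantile_nearest_rank(values, upper_q)
    return [min(max(v, lo), hi) for v in values]

# Hypothetical minimum_nights values: 1..29 plus one extreme stay length.
minimum_nights = list(range(1, 30)) + [365]
clipped = winsorize(minimum_nights)

print(min(clipped), max(clipped))
```

Unlike trimming, winsorization keeps every row and only pulls extreme values back to the chosen bounds, so the number of observations does not change.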

First Run:
- Model: Random Forest Regressor.
- Parameters: numTrees -> [20, 50]; maxDepth -> [5, 10].
- Number of folds: 3

In [0]:
# 5. Hyperparameter grid for Random Forest
paramGrid = ParamGridBuilder() \
    .addGrid(regressor.numTrees, [20, 50]) \
    .addGrid(regressor.maxDepth, [5, 10]) \
    .build()

# 6. Cross-validation setup
evaluator = RegressionEvaluator(labelCol=label_col, predictionCol="prediction", metricName="rmse")
crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=3,
    parallelism=2,
    seed = 42 
)

# 7. Model fitting
cv_model = crossval.fit(train_df)

# 8. Best model & params
best_model = cv_model.bestModel
best_rf = best_model.stages[-1]
print(f"Best numTrees: {best_rf.getNumTrees}")
print(f"Best maxDepth: {best_rf.getOrDefault('maxDepth')}")

# 9. Test set evaluation
preds = best_model.transform(test_df)
rmse = evaluator.evaluate(preds)
r2 = RegressionEvaluator(labelCol=label_col, predictionCol="prediction", metricName="r2").evaluate(preds)
print(f"Test RMSE = {rmse:.4f}, R2 = {r2:.4f}")

Best numTrees: 50
Best maxDepth: 10
Test RMSE = 49.3855, R2 = 0.4997


Results: <br>
- Best number of trees: 50
- Best max depth: 10
- Test RMSE: 49.3855
- R2: 0.4997
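
For reference, the two metrics reported throughout this notebook can be written out in plain Python. This is a minimal sketch of the standard definitions, not the `RegressionEvaluator` implementation, and the prices and predictions below are hypothetical values chosen to keep the arithmetic easy to follow.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical nightly prices and predictions, not taken from the dataset.
prices = [50, 60, 90, 120]
predictions = [55, 65, 85, 115]

print(f"RMSE = {rmse(prices, predictions):.4f}, R2 = {r2(prices, predictions):.4f}")
```

RMSE is expressed in the same unit as `price`, so a test RMSE of about 49 means the typical prediction error is on the order of 49 price units per night.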

Second Run:
- Model: Random Forest Regressor.
- Parameters: minInstancesPerNode -> [1, 5]; numTrees -> [10, 20, 50, 100]; maxDepth -> [3, 5].
- Number of folds: 4

In [0]:
# 5. Hyperparameter grid for Random Forest
paramGrid = ParamGridBuilder() \
    .addGrid(regressor.minInstancesPerNode, [1, 5]) \
    .addGrid(regressor.numTrees, [10, 20, 50, 100]) \
    .addGrid(regressor.maxDepth, [3, 5]) \
    .build()


# 6. Cross-validation setup
evaluator = RegressionEvaluator(labelCol=label_col, predictionCol="prediction", metricName="rmse")
crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=4,
    parallelism=2,
    seed = 42 
)

# 7. Model fitting
cv_model = crossval.fit(train_df)

# 8. Best model & params
best_model = cv_model.bestModel
best_rf = best_model.stages[-1]
print(f"Best numTrees: {best_rf.getNumTrees}")
print(f"Best maxDepth: {best_rf.getOrDefault('maxDepth')}")

# 9. Test set evaluation
preds = best_model.transform(test_df)
rmse = evaluator.evaluate(preds)
r2 = RegressionEvaluator(labelCol=label_col, predictionCol="prediction", metricName="r2").evaluate(preds)
print(f"Test RMSE = {rmse:.4f}, R2 = {r2:.4f}")

Best numTrees: 100
Best maxDepth: 5
Test RMSE = 52.4792, R2 = 0.4351


Results: <br>
- Best number of trees: 100
- Best max depth: 5
- Test RMSE: 52.4792
- R2: 0.4351

Third Run:
- Model: Random Forest Regressor.
- Parameters: minInstancesPerNode -> [1, 5]; numTrees -> [10, 20, 50, 100]; maxDepth -> [5, 10].
- Number of folds: 5

In [0]:

# 5. Hyperparameter grid for Random Forest
paramGrid = ParamGridBuilder() \
    .addGrid(regressor.minInstancesPerNode, [1, 5]) \
    .addGrid(regressor.numTrees, [10, 20, 50, 100]) \
    .addGrid(regressor.maxDepth, [5, 10]) \
    .build()

# 6. Cross-validation setup
evaluator = RegressionEvaluator(labelCol=label_col, predictionCol="prediction", metricName="rmse")
crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=5,
    parallelism=2,
    seed = 42 
)

# 7. Model fitting
cv_model = crossval.fit(train_df)

# 8. Best model & params
best_model = cv_model.bestModel
best_rf = best_model.stages[-1]
print(f"Best numTrees: {best_rf.getNumTrees}")
print(f"Best maxDepth: {best_rf.getOrDefault('maxDepth')}")

# 9. Test set evaluation
preds = best_model.transform(test_df)
rmse = evaluator.evaluate(preds)
r2 = RegressionEvaluator(labelCol=label_col, predictionCol="prediction", metricName="r2").evaluate(preds)
print(f"Test RMSE = {rmse:.4f}, R2 = {r2:.4f}")

Best numTrees: 100
Best maxDepth: 10
Test RMSE = 49.3130, R2 = 0.5012


Results: <br>
- Best number of trees: 100
- Best max depth: 10
- Test RMSE: 49.3130
- R2: 0.5012

Fourth Run:
- Model: Random Forest Regressor.
- Parameters: numTrees -> [100, 125]; maxDepth -> [12, 15].
- Number of folds: 3

In [0]:
# 5. Hyperparameter grid for Random Forest
paramGrid = ParamGridBuilder() \
    .addGrid(regressor.numTrees, [100, 125]) \
    .addGrid(regressor.maxDepth, [12, 15]) \
    .build()

# 6. Cross-validation setup
evaluator = RegressionEvaluator(labelCol=label_col, predictionCol="prediction", metricName="rmse")
crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=3,
    parallelism=2,
    seed = 42 
)

# 7. Model fitting
cv_model = crossval.fit(train_df)

# 8. Best model & params
best_model = cv_model.bestModel
best_rf = best_model.stages[-1]
print(f"Best numTrees: {best_rf.getNumTrees}")
print(f"Best maxDepth: {best_rf.getOrDefault('maxDepth')}")

# 9. Test set evaluation
preds = best_model.transform(test_df)
rmse = evaluator.evaluate(preds)
r2 = RegressionEvaluator(labelCol=label_col, predictionCol="prediction", metricName="r2").evaluate(preds)
print(f"Test RMSE = {rmse:.4f}, R2 = {r2:.4f}")

Best numTrees: 125
Best maxDepth: 15
Test RMSE = 47.5310, R2 = 0.5366


Results: <br>
- Best number of trees: 125
- Best max depth: 15
- Test RMSE: 47.5310
- R2: 0.5366

The next cell is one of five Random Forest Regressor pipeline setups that did not finish running because of Databricks Community Edition limitations: after 1 hour of running the same code cell, the cluster goes down and interrupts the run, showing "Cancelled" as output.

In [0]:
# 5. Hyperparameter grid
paramGrid = ParamGridBuilder() \
    .addGrid(regressor.minInstancesPerNode, [1, 5]) \
    .addGrid(regressor.minInfoGain, [0.0, 0.01]) \
    .addGrid(regressor.numTrees, [10, 20, 50, 100]) \
    .addGrid(regressor.maxDepth, [5, 10, 15]) \
    .build()

# 6. Cross-validation setup
evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="rmse")
crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=5,
    parallelism=2
)

# 7. Model fitting
cv_model = crossval.fit(train_df)

# 8. Best model & params
best_model = cv_model.bestModel
best_rf = best_model.stages[-1]
print(f"Best numTrees: {best_rf.getNumTrees}")
print(f"Best maxDepth: {best_rf.getOrDefault('maxDepth')}")

# 9. Test set evaluation
preds = best_model.transform(test_df)
rmse = evaluator.evaluate(preds)
r2 = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="r2").evaluate(preds)
print(f"Test RMSE = {rmse:.4f}, R2 = {r2:.4f}")

A possible workaround would be to run this pipeline in smaller batches, but we did not want to lose information, so we opted against that approach.
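
A back-of-the-envelope count helps explain why this search is so much heavier than the runs that completed. The sketch below assumes that cross-validation fits the pipeline once per parameter combination per fold and then refits the best combination once on the full training set; the helper name `count_fits` is only illustrative.

```python
from itertools import product

def count_fits(grid, num_folds):
    """Number of pipeline fits: one per combination per fold, plus one final refit."""
    combinations = list(product(*grid.values()))
    return len(combinations) * num_folds + 1

# Grid of the cancelled Random Forest search (values copied from the cell above).
cancelled_grid = {
    "minInstancesPerNode": [1, 5],
    "minInfoGain": [0.0, 0.01],
    "numTrees": [10, 20, 50, 100],
    "maxDepth": [5, 10, 15],
}

# Grid of the Third Run, which did finish.
third_run_grid = {
    "minInstancesPerNode": [1, 5],
    "numTrees": [10, 20, 50, 100],
    "maxDepth": [5, 10],
}

cancelled_fits = count_fits(cancelled_grid, num_folds=5)
third_run_fits = count_fits(third_run_grid, num_folds=5)

print(f"Cancelled search: {cancelled_fits} fits")
print(f"Third Run:        {third_run_fits} fits")
```

The cancelled grid needs roughly three times as many fits as the Third Run, and a third of its combinations use `maxDepth` 15, which makes each of those fits more expensive.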

__`Step 6`__ Adapt pipeline to use Gradient Boosting Trees as regressor.

In [0]:
# 4. Train/test split
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

# 5. Build pipeline stages (applied per CV fold)
stages = []

# 5a. Categorical processing
for cat in categorical_cols:
    indexer = StringIndexer(inputCol=cat, outputCol=f"{cat}_idx")
    ohe = OneHotEncoder(inputCol=f"{cat}_idx", outputCol=f"{cat}_ohe")
    stages += [indexer, ohe]

# 5b. Numeric imputation
current_numeric_cols = list(numeric_cols) # Make a copy to modify
processed_cols_map = {col: col for col in numeric_cols}


imputer_input_actual = ["review_scores_rating", "bedrooms"] 
imputer_output_actual = [f"{c}_imputed" for c in imputer_input_actual]
imputer = Imputer(inputCols=imputer_input_actual, outputCols=imputer_output_actual).setStrategy("median")
stages.append(imputer)

# Update dict
for orig_col, imputed_col in zip(imputer_input_actual, imputer_output_actual):
    processed_cols_map[orig_col] = imputed_col

# 5c. Winsorization

winsor_input_original_names = ["bedrooms", "minimum_nights"] # Original names
winsor_input_current_names = [processed_cols_map[c] for c in winsor_input_original_names]
winsor_output = [f"{processed_cols_map[c]}_wins" for c in winsor_input_original_names]

winsor = Winsorizer(
    inputCols=winsor_input_current_names,
    outputCols=winsor_output,
    lowerQuantile=0.05,
    upperQuantile=0.95
)
stages.append(winsor)

# Update dict 
for orig_col, wins_col in zip(winsor_input_original_names, winsor_output):
    processed_cols_map[orig_col] = wins_col

final_numeric_features = list(processed_cols_map.values())

# 5e. Assemble features
onehot_cols  = [f"{c}_ohe" for c in categorical_cols]

# Final feature input list
feature_inputs = final_numeric_features + onehot_cols + binary_cols
assembler = VectorAssembler(inputCols=feature_inputs, outputCol="assembled_features")
stages.append(assembler)

# 5f. Feature scaling ( Unnecessary for tree based models )
#scaler = RobustScaler(inputCol="assembled_features", outputCol="features")
#stages.append(scaler)

# 6. Estimator: GBTRegressor
regressor = GBTRegressor(featuresCol="assembled_features", labelCol=label_col, seed=42)
stages.append(regressor)

# 7. Pipeline creation
pipeline = Pipeline(stages=stages)
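
The bookkeeping around `processed_cols_map` can be hard to follow inside the full cell. The sketch below replays the same renaming logic on a shortened, illustrative column list (no Spark required) to show which column name ends up in the feature vector after imputation and winsorization.

```python
# Shortened column list for illustration; the notebook uses the full numeric_cols.
numeric_cols = ["bedrooms", "minimum_nights", "review_scores_rating", "accommodates"]

# Start with every column mapped to itself.
processed_cols_map = {c: c for c in numeric_cols}

# Imputation writes to "<name>_imputed".
for c in ["review_scores_rating", "bedrooms"]:
    processed_cols_map[c] = f"{c}_imputed"

# Winsorization reads the current name and writes to "<current name>_wins".
for c in ["bedrooms", "minimum_nights"]:
    processed_cols_map[c] = f"{processed_cols_map[c]}_wins"

final_numeric_features = list(processed_cols_map.values())
print(final_numeric_features)
```

Because the map always holds the latest output name, `bedrooms` is first imputed and then winsorized, while untouched columns such as `accommodates` pass through under their original name.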

Fifth Run:
- Model: GBT Regressor.
- Parameters: maxIter -> [20, 50]; maxDepth -> [3, 5]; stepSize -> [0.1, 0.05].
- Number of folds: 5

In [0]:

# 8. Hyperparameter grid for GBTRegressor
paramGrid = ParamGridBuilder() \
    .addGrid(regressor.maxIter, [20, 50]) \
    .addGrid(regressor.maxDepth, [3, 5]) \
    .addGrid(regressor.stepSize, [0.1, 0.05]) \
    .build()

# 9. Cross-validation setup
evaluator = RegressionEvaluator(labelCol=label_col, predictionCol="prediction", metricName="rmse")
crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=5,
    parallelism=2,
    seed = 42 
)

# 10. Model fitting
cv_model = crossval.fit(train_df)

# 11. Best model & params
best_model = cv_model.bestModel
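best_gbt = best_model.stages[-1]  # fitted GBT model (last pipeline stage)
# For GBT the number of trees typically equals the selected maxIter (one tree per boosting iteration)
print(f"Best numTrees: {best_gbt.getNumTrees}")
print(f"Best maxDepth: {best_gbt.getOrDefault('maxDepth')}")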

# 12. Test set evaluation
preds = best_model.transform(test_df)
rmse = evaluator.evaluate(preds)
r2 = RegressionEvaluator(labelCol=label_col, predictionCol="prediction", metricName="r2").evaluate(preds)
print(f"Test RMSE = {rmse:.4f}, R2 = {r2:.4f}")

Best numTrees: 50
Best maxDepth: 5
Test RMSE = 48.2254, R2 = 0.5230


Results: <br>
- Best number of trees: 50
- Best max depth: 5
- Test RMSE: 48.2254
- R2: 0.5230

The next cell is one of three GBT Regressor pipeline setups that did not finish running because of Databricks Community Edition limitations: after 1 hour of running the same code cell, the cluster goes down and interrupts the run, showing "Cancelled" as output.

In [0]:
# 8. Hyperparameter grid for GBTRegressor
paramGrid = ParamGridBuilder() \
    .addGrid(regressor.maxIter, [20, 50]) \
    .addGrid(regressor.maxDepth, [5, 10, 15]) \
    .addGrid(regressor.stepSize, [0.1, 0.05]) \
    .build()

# 9. Cross-validation setup
evaluator = RegressionEvaluator(labelCol=label_col, predictionCol="prediction", metricName="rmse")
crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=3,
    parallelism=2,
    seed = 42 
)

# 10. Model fitting
cv_model = crossval.fit(train_df)

# 11. Best model & params
best_model = cv_model.bestModel
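best_gbt = best_model.stages[-1]  # fitted GBT model (last pipeline stage)
# For GBT the number of trees typically equals the selected maxIter (one tree per boosting iteration)
print(f"Best numTrees: {best_gbt.getNumTrees}")
print(f"Best maxDepth: {best_gbt.getOrDefault('maxDepth')}")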

# 12. Test set evaluation
preds = best_model.transform(test_df)
rmse = evaluator.evaluate(preds)
r2 = RegressionEvaluator(labelCol=label_col, predictionCol="prediction", metricName="r2").evaluate(preds)
print(f"Test RMSE = {rmse:.4f}, R2 = {r2:.4f}")

A possible workaround would be to run this pipeline in smaller batches, but we did not want to lose information, so we opted against that approach.

## 4. Model Evaluation

The table below summarizes our model setups so that we can assess which one performs best and should therefore be used for our predictions.

| Setup | Model | Best Number of Trees | Best Max Depth | Test RMSE | R2 |
|------------|-------------------------|-----|----|---------|--------|
| First Run | Random Forest Regressor | 50 | 10 | 49.3855 | 0.4997 |
| Second Run | Random Forest Regressor | 100 | 5 | 52.4792 | 0.4351 |
| Third Run | Random Forest Regressor | 100 | 10 | 49.3130 | 0.5012 |
| Fourth Run | Random Forest Regressor | 125 | 15 | 47.5310 | 0.5366 |
| Fifth Run | GBT Regressor | 50 | 5 | 48.2254 | 0.5230 |
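
The comparison can also be made programmatically. The sketch below simply picks the run with the lowest test RMSE, using the values printed by the cells above; the dictionary layout is only illustrative.

```python
# Test RMSE per run, copied from the cell outputs above.
test_rmse = {
    "First Run": 49.3855,
    "Second Run": 52.4792,
    "Third Run": 49.3130,
    "Fourth Run": 47.5310,
    "Fifth Run": 48.2254,
}

# Lower RMSE is better, so the best run is the one with the minimum value.
best_run = min(test_rmse, key=test_rmse.get)

# Gap between the best Random Forest run and the GBT run.
gap_to_gbt = test_rmse["Fifth Run"] - test_rmse["Fourth Run"]

print(f"Best run: {best_run} (RMSE {test_rmse[best_run]:.4f})")
print(f"GBT is behind by {gap_to_gbt:.4f} RMSE")
```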

It is important to highlight once again that the metrics could be better if Databricks Community Edition allowed longer runs, which would let us tune the models more thoroughly.

#### Conclusion:
- Within the Databricks Community Edition limitations, our best model was a Random Forest Regressor with 125 trees, a max depth of 15, a test RMSE of 47.5310 and an R2 of 0.5366.