# Hyperparameter Tuning

---

We tune the hyperparameters for the previously created RandomForestRegressor.

For a more comprehensive tuning process, we use k-fold cross-validation over a grid of tree counts (numTrees) and maximum depths (maxDepth).
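
As a rough illustration of the idea behind k-fold cross-validation, the sketch below deals row indices into k folds in plain Python and uses each fold once as the validation set. The helper name `k_fold_indices` is hypothetical, and this is not how Spark's `CrossValidator` is implemented internally. It only shows the splitting scheme.

```python
import random


def k_fold_indices(n_rows, k, seed=42):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        # Every fold except fold i is used for training
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx


for train_idx, val_idx in k_fold_indices(n_rows=10, k=5):
    print(len(train_idx), len(val_idx))
```

With 10 rows and 5 folds, every iteration trains on 8 rows and validates on 2, and each row is used for validation exactly once.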

---

## Load Spark & Data

In [13]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder \
    .appName("baselinemodel") \
    .getOrCreate()



In [3]:
# REPLACE WITH PROCESSED DATA FILEPATH
DATA_PATH = "../data/itineraries_processed.parquet"

In [4]:
df = spark.read.parquet(DATA_PATH)

                                                                                

In [10]:
# Split into training (~80%) and test (~20%) sets with a fixed seed
train_data, test_data = df.randomSplit([0.8, 0.2], seed=42)

## Create Vector Assembler

In [11]:
# Use every column except the last one, then drop the label column
feature_columns = df.columns[:-1]
feature_columns.remove('totalFare')

# Assemble features into a vector
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
# df_ass = assembler.transform(df)
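
The positional slice above depends on the column order of the parquet file. One possible alternative, sketched here on the assumption that the columns to leave out are known by name, is to exclude them explicitly. The helper `select_feature_columns` and the example column names other than `totalFare` are hypothetical.

```python
def select_feature_columns(columns, exclude):
    """Return the columns that are not in `exclude`, preserving order."""
    excluded = set(exclude)
    return [c for c in columns if c not in excluded]


# Hypothetical column names, for illustration only
columns = ["feature_a", "feature_b", "totalFare", "last_col"]
print(select_feature_columns(columns, exclude=["totalFare", "last_col"]))
```

Unlike `list.remove`, this does not raise an error when an excluded name is absent, which may or may not be the behaviour you want.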

## Run Model

In [14]:
# Instantiate random forest
rf = RandomForestRegressor(featuresCol="features", labelCol="totalFare")

# Create a pipeline
pipeline = Pipeline(stages=[assembler, rf])

# Create ParamGrid for Cross Validation
param_grid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [50, 100, 150, 200]) \
    .addGrid(rf.maxDepth, [5, 10, 15, 20]) \
    .build()

# Define evaluator
evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="totalFare", metricName="rmse")

# Create CrossValidator
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=param_grid,
                    evaluator=evaluator,
                    numFolds=5)

# Run cross-validation
cv_model = cv.fit(train_data)

24/03/28 13:40:21 WARN DAGScheduler: Broadcasting large task binary with size 1067.5 KiB
24/03/28 13:40:21 WARN DAGScheduler: Broadcasting large task binary with size 1647.9 KiB
24/03/28 13:40:21 WARN DAGScheduler: Broadcasting large task binary with size 2.4 MiB
24/03/28 13:40:22 WARN DAGScheduler: Broadcasting large task binary with size 3.3 MiB
24/03/28 13:40:26 WARN DAGScheduler: Broadcasting large task binary with size 1067.5 KiB
24/03/28 13:40:26 WARN DAGScheduler: Broadcasting large task binary with size 1647.9 KiB
24/03/28 13:40:27 WARN DAGScheduler: Broadcasting large task binary with size 2.4 MiB
24/03/28 13:40:27 WARN DAGScheduler: Broadcasting large task binary with size 3.3 MiB
24/03/28 13:40:28 WARN DAGScheduler: Broadcasting large task binary with size 4.5 MiB
24/03/28 13:40:29 WARN DAGScheduler: Broadcasting large task binary with size 5.9 MiB
24/03/28 13:40:30 WARN DAGScheduler: Broadcasting large task binary with size 7.4 MiB
24/03/28 13:40:32 WARN DAGScheduler: Broad
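
To get a feel for the cost of this search, the short calculation below counts how many models the grid above implies. It uses only the values from the cell (four `numTrees` values, four `maxDepth` values, five folds); the variable names are illustrative.

```python
from itertools import product

num_trees_values = [50, 100, 150, 200]
max_depth_values = [5, 10, 15, 20]
num_folds = 5

# Every combination of numTrees and maxDepth is one grid point
grid = list(product(num_trees_values, max_depth_values))
fits_during_cv = len(grid) * num_folds
# CrossValidator refits the best setting once on the full training data
total_fits = fits_during_cv + 1

print(len(grid), fits_during_cv, total_fits)
```

This gives 16 grid points, 80 fits during cross-validation, and 81 fits in total. `CrossValidator` also accepts a `parallelism` parameter that may shorten wall-clock time if the cluster has spare capacity.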

## Investigate the best performing model

In [16]:
# Get the best model
best_model = cv_model.bestModel

# Access the stages of the pipeline
stages = best_model.stages

# Access the parameters of the RandomForestRegressor stage
rf_params = stages[-1].extractParamMap()

# Print the parameters
print("Best Model Parameters:")
for param, value in rf_params.items():
    print(param.name, ":", value)

Best Model Parameters:
bootstrap : True
cacheNodeIds : False
checkpointInterval : 10
featureSubsetStrategy : auto
featuresCol : features
impurity : variance
labelCol : totalFare
leafCol : 
maxBins : 32
maxDepth : 15
maxMemoryInMB : 256
minInfoGain : 0.0
minInstancesPerNode : 1
minWeightFractionPerNode : 0.0
numTrees : 50
predictionCol : prediction
seed : 2768091509843736401
subsamplingRate : 1.0


Cross-validation selects maxDepth = 15 and numTrees = 50 as the best configuration, i.e. the one with the lowest average RMSE across the five folds.
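
The best model alone does not show how close the other grid points were. A minimal sketch for ranking all of them is given below. The helper `rank_grid_results` is hypothetical, and the example numbers are made up rather than taken from this notebook.

```python
def rank_grid_results(param_maps, avg_metrics, larger_is_better=False):
    """Pair each parameter map with its mean CV metric and sort best-first."""
    rows = []
    for param_map, metric in zip(param_maps, avg_metrics):
        # Spark Param objects expose .name; plain string keys pass through unchanged
        named = {getattr(p, "name", p): v for p, v in param_map.items()}
        rows.append((metric, named))
    return sorted(rows, key=lambda row: row[0], reverse=larger_is_better)


# Made-up numbers, for illustration only (not results from this notebook)
example_maps = [
    {"numTrees": 50, "maxDepth": 5},
    {"numTrees": 50, "maxDepth": 15},
]
example_metrics = [2.0, 1.0]
for metric, params in rank_grid_results(example_maps, example_metrics):
    print(metric, params)
```

With the fitted model from the cell above, the call would be `rank_grid_results(cv_model.getEstimatorParamMaps(), cv_model.avgMetrics)`. Since RMSE is an error metric, the default `larger_is_better=False` puts the lowest value first.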

## Evaluation

In [15]:
# Make predictions on test data using the best model
predictions = best_model.transform(test_data)

# Evaluate the model on test data
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data:", rmse)

Root Mean Squared Error (RMSE) on test data: 179.04594487306673
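
For reference, the metric reported above can be written out in a few lines of plain Python. This is a sketch of the standard RMSE definition, not Spark's implementation, and the example values are made up.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up fares and predictions, for illustration only
print(rmse([100.0, 200.0], [103.0, 196.0]))
```

Because the squared errors are averaged before the square root is taken, RMSE is expressed in the same units as the label, here the units of totalFare.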
