# Hyperparameter Tuning

---

We tune the hyperparameters for the LinearRegression model.

For a more comprehensive tuning process, we use k-fold cross-validation over a grid of elastic-net mixing (elasticNetParam) and regularization (regParam) values.
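
To make the k-fold idea concrete, here is a minimal pure-Python sketch of how k folds partition a dataset so that every row is used for validation exactly once. The helper name `k_fold_indices` and the round-robin assignment are assumptions for illustration only; Spark's CrossValidator assigns rows to folds randomly rather than in this fixed pattern.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, val_idx) pairs; each row is validated exactly once."""
    # Hypothetical round-robin fold assignment: row i goes to fold i % k
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        # Training rows are all rows from the other k - 1 folds
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


for train_idx, val_idx in k_fold_indices(n_rows=10, k=5):
    print("train:", sorted(train_idx), "validate:", val_idx)
```

Each candidate parameter setting is trained and scored once per fold, and the scores are averaged before settings are compared.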

---

## Load Spark & Data

In [0]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.sql import SparkSession

In [0]:
spark = SparkSession.builder \
.appName("linear_tune") \
.config("spark.executor.memory", "8g") \
.config("spark.driver.memory", "4g") \
.config("spark.executor.cores", "2") \
.config("spark.executor.instances", "4") \
.getOrCreate()

In [0]:
# REPLACE WITH PROCESSED DATA FILEPATH
DATA_PATH = "/mnt/bigdataproject/itineraries_processed.parquet"

In [0]:
# Assuming df is a DataFrame you previously called .cache() or .persist() on
# df.unpersist()


In [0]:
df = spark.read.parquet(DATA_PATH)

In [0]:
# train test split
train_data, test_data = df.randomSplit([0.8, 0.2], seed=42)
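
The 80/20 weights passed to randomSplit are target proportions, not exact counts. The sketch below is a rough pure-Python analogy, assuming each row is assigned independently at random, which is why the resulting sizes are only approximately 80/20. The function `bernoulli_split` is hypothetical and is not part of the Spark API.

```python
import random


def bernoulli_split(rows, train_weight=0.8, seed=42):
    """Assign each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test


train, test = bernoulli_split(range(10_000))
print(len(train), len(test))  # close to, but not exactly, 8000 and 2000
```

In the notebook itself, train_data.count() and test_data.count() would show the actual split sizes.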

## Create Vector Assembler

In [0]:
# Use every column except the last one and the label as features
feature_columns = [c for c in df.columns[:-1] if c != 'totalFare']

# Assemble features into a vector
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
# df_ass = assembler.transform(df)

## Run Model

In [0]:
# Initialize the LinearRegression estimator
lr = LinearRegression(featuresCol="features", labelCol="totalFare")

# Define the pipeline: assemble the feature vector, then fit the regression
pipeline = Pipeline(stages=[assembler, lr])

# Define evaluator
evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="totalFare", metricName="rmse")

# Build the parameter grid: 3 elasticNetParam values x 3 regParam values
paramGrid = ParamGridBuilder() \
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
    .addGrid(lr.regParam, [0.01, 0.1, 1.0]) \
    .build()

# Create CrossValidator
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=paramGrid,
                    evaluator=evaluator,
                    numFolds=5)

# Run cross-validation on the training split
cvModel = cv.fit(train_data)
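
As a quick sanity check on the cost of this search, the sketch below counts the model fits implied by the grid and fold count above. The variable names are illustrative. The final "+ 1" assumes that CrossValidator refits the best setting once on the full training data to produce bestModel.

```python
from itertools import product

# Values copied from the parameter grid and CrossValidator settings above
elastic_net_values = [0.0, 0.5, 1.0]
reg_values = [0.01, 0.1, 1.0]
num_folds = 5

# Every (elasticNetParam, regParam) pair is one candidate setting
combinations = list(product(elastic_net_values, reg_values))
fits_during_search = len(combinations) * num_folds
total_fits = fits_during_search + 1  # plus one refit of the best setting

print(len(combinations), fits_during_search, total_fits)  # prints 9 45 46
```

If the search is too slow, CrossValidator also accepts a parallelism argument that evaluates several settings concurrently. Whether that helps depends on the cluster resources available.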


## Investigate the best performing model

In [0]:
# Get the best model
best_model = cvModel.bestModel

# Access the stages of the pipeline
stages = best_model.stages

# Extract the parameters of the final stage (the fitted LinearRegression model)
lr_params = stages[-1].extractParamMap()

# Print the parameters
print("Best Model Parameters:")
for param, value in lr_params.items():
    print(param.name, ":", value)

Best Model Parameters:
aggregationDepth : 2
elasticNetParam : 0.0
epsilon : 1.35
featuresCol : features
fitIntercept : True
labelCol : totalFare
loss : squaredError
maxBlockSizeInMB : 0.0
maxIter : 100
predictionCol : prediction
regParam : 0.01
solver : auto
standardization : True
tol : 1e-06
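
The best model alone does not show how close the other settings came. The sketch below pairs each parameter map with its average cross-validation metric and sorts them. The helper `rank_param_maps` and the metric values in the demo are hypothetical. In the notebook it would be called as rank_param_maps(cvModel.getEstimatorParamMaps(), cvModel.avgMetrics).

```python
def rank_param_maps(param_maps, avg_metrics, larger_is_better=False):
    """Return (metric, settings) pairs sorted from best to worst."""
    rows = []
    for pmap, metric in zip(param_maps, avg_metrics):
        # Spark Param objects expose .name; plain string keys are kept as-is
        settings = {getattr(p, "name", str(p)): v for p, v in pmap.items()}
        rows.append((metric, settings))
    # For RMSE, smaller is better, so the default sort is ascending
    return sorted(rows, key=lambda r: r[0], reverse=larger_is_better)


# Hypothetical metrics, for illustration only (not results from this notebook)
demo_maps = [
    {"elasticNetParam": 0.0, "regParam": 0.1},
    {"elasticNetParam": 0.0, "regParam": 0.01},
]
for metric, settings in rank_param_maps(demo_maps, [150.0, 140.0]):
    print(metric, settings)
```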


We obtain the best results using an elasticNetParam of 0.0 and a regParam of 0.01.
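
For reference, the printed parameters above report elasticNetParam 0.0 and regParam 0.01. The two values enter the squared-error objective that the Spark documentation gives for LinearRegression, written here with $\alpha$ for elasticNetParam and $\lambda$ for regParam (the symbols are notation chosen for this note, and the intercept is omitted for brevity):

```latex
\min_{w} \; \frac{1}{2n} \sum_{i=1}^{n} \left( x_i^\top w - y_i \right)^2
  + \lambda \left( \alpha \lVert w \rVert_1 + \frac{1 - \alpha}{2} \lVert w \rVert_2^2 \right)
```

With $\alpha = 0$ the L1 term vanishes, so the selected model is a ridge (pure L2) regression with a light penalty of $\lambda = 0.01$. Since 0.01 is the smallest regParam in the grid, it may be worth testing smaller values as well.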

## Evaluation

In [0]:
# Make predictions on test data using the best model
predictions = best_model.transform(test_data)

# Evaluate the model on test data
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data:", rmse)

Root Mean Squared Error (RMSE) on test data: 139.03635019386934
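
RMSE is expressed in the same units as the label, so the value above is in units of totalFare. As a reminder of what the metric computes, here is a minimal pure-Python sketch on toy numbers. The function `rmse` and the sample values are illustrative and are not taken from the dataset.

```python
import math


def rmse(labels, predictions):
    """Square root of the mean squared difference between labels and predictions."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Errors of -1 and 7 give squared errors 1 and 49, mean 25, root 5
print(rmse([100.0, 200.0], [101.0, 193.0]))  # prints 5.0
```

Because errors are squared before averaging, a few large misses raise RMSE more than many small ones.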


spark.stop()

In [0]:
#spark.stop()

