# Baseline Model

---

We fit a baseline regression model to get a preliminary idea of how well an untuned model performs on our raw features.

---

## Load Spark and Data

In [20]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.linalg import VectorUDT

In [4]:
spark = SparkSession.builder \
    .appName("baselinemodel") \
    .getOrCreate()

In [22]:
# REPLACE WITH PROCESSED DATA FILEPATH
DATA_PATH = "../data/itineraries_processed.parquet"

In [23]:
df = spark.read.parquet(DATA_PATH)

In [25]:
df.printSchema()

root
 |-- isBasicEconomy: integer (nullable = true)
 |-- isRefundable: integer (nullable = true)
 |-- isNonStop: integer (nullable = true)
 |-- totalFare: double (nullable = true)
 |-- days_before_flight: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- startingAirport_encoded: vector (nullable = true)
 |-- destinationAirport_encoded: vector (nullable = true)
 |-- num_legs: integer (nullable = true)
 |-- All_Same: integer (nullable = true)
 |-- airline_name_encoded: vector (nullable = true)



## Assemble into Vectors

In [39]:
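# Use the remaining columns as features, excluding the label (totalFare).
# NOTE: the [:-1] slice also drops the last column (airline_name_encoded).
feature_columns = df.columns[:-1]
feature_columns.remove('totalFare')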

# Assemble features into a vector
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
df_ass = assembler.transform(df)
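
The positional slice in the cell above drops the last schema column, `airline_name_encoded`, before `totalFare` is removed. If that is not intended, selecting features by name is less fragile. The following is a minimal sketch in plain Python, assuming the column list matches the `printSchema()` output above; the names `columns`, `LABEL`, `positional`, `by_name`, and `dropped` are illustrative and not part of the notebook.

```python
# Column names copied from the printSchema() output above.
columns = [
    "isBasicEconomy", "isRefundable", "isNonStop", "totalFare",
    "days_before_flight", "day", "startingAirport_encoded",
    "destinationAirport_encoded", "num_legs", "All_Same",
    "airline_name_encoded",
]
LABEL = "totalFare"  # hypothetical constant for the label column

# Positional selection, as in the cell above: the slice drops the last column.
positional = columns[:-1]
positional.remove(LABEL)

# Name-based selection: keep every column except the label.
by_name = [c for c in columns if c != LABEL]

# Columns kept by the name-based selection but lost by the positional one.
dropped = sorted(set(by_name) - set(positional))
print(dropped)  # ['airline_name_encoded']
```

In the notebook itself, the same list comprehension could be applied to `df.columns` and passed to `VectorAssembler` as `inputCols`.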

## Split Data 

In [40]:
# Training split percentage
t_perc = 0.8

In [41]:
# No seed is passed, so the split and the metrics below change from run to run
(train, test) = df_ass.randomSplit([t_perc, 1-t_perc])

## Create the Model

In [43]:
# Create a RandomForestRegressor model
rf = RandomForestRegressor(featuresCol="features", labelCol="totalFare")

# Train the model
rf_model = rf.fit(train)

In [44]:
# Make predictions on the test data
preds = rf_model.transform(test)

In [48]:
# Evaluate the model
evaluator = RegressionEvaluator(labelCol="totalFare", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(preds)
print("Root Mean Squared Error (RMSE) on test data =", rmse)

Root Mean Squared Error (RMSE) on test data = 226.966981076701


The median price is about 370, so an RMSE of roughly 227 means the typical error is more than half the median fare. The model is not accurate yet. However, there is still a lot of room to include more data and to engineer better features.
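
To make the comparison with the median concrete, the RMSE can be expressed as a fraction of the median fare. This is a small arithmetic sketch using the two numbers quoted above; the variable names are illustrative, and the median of 370 is taken from the text rather than recomputed here.

```python
rmse = 226.966981076701  # RMSE printed by the evaluation cell above
median_fare = 370.0      # median price quoted in the text (not recomputed)

# RMSE as a fraction of the median fare.
relative_error = rmse / median_fare
print(f"RMSE is {relative_error:.0%} of the median fare")  # RMSE is 61% of the median fare
```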

In [49]:
# evaluate the R2
evaluator = RegressionEvaluator(labelCol="totalFare", predictionCol="prediction", metricName="r2")
r2 = evaluator.evaluate(preds)
print("R-squared (R²) =", r2)

R-squared (R²) = 0.1610178640709934


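An R² of about 0.16 means the model explains only around 16% of the variance in `totalFare`, which is poor for a predictive model.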
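
For intuition about what this R² value implies, here is a plain-Python sketch. It assumes the evaluator uses the usual definition R² = 1 - SS_res / SS_tot, under which a model that always predicts the mean label scores 0. The helper `r2_score` and the toy labels are illustrative only. Under the same assumption, the reported RMSE and R² together imply the RMSE that a predict-the-mean model would get on the test set.

```python
import math

def r2_score(labels, predictions):
    """R-squared as 1 - SS_res / SS_tot (assumed definition)."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1 - ss_res / ss_tot

# Toy data: always predicting the mean gives R² = 0.
labels = [100.0, 200.0, 300.0, 400.0]
mean_only = [250.0] * len(labels)
print(r2_score(labels, mean_only))  # 0.0

# Values printed by the evaluation cells above.
rmse = 226.966981076701
r2 = 0.1610178640709934

# Since R² = 1 - RMSE² / Var(label), the predict-the-mean RMSE is:
mean_baseline_rmse = rmse / math.sqrt(1 - r2)
print(round(mean_baseline_rmse, 1))
```

If the assumption holds, the random forest reduces the error only modestly compared with predicting the mean fare for every itinerary.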