# Import Libraries

`data_loading` and `feature_engineering` are Python files that I wrote. `data_loading` contains a function called `load_data`, which you will use to load the dataframe, and `feature_engineering` contains a function called `apply_feature_engineering`, which you will use to apply the feature engineering (so that we all use the same processed data in our ML models).

**Before running this script, make sure you have downloaded 'itineraries_snappy.parquet' and stored it in a folder called 'data'.**
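A quick check along these lines can confirm the file is in place before anything else runs. This is a minimal sketch: it assumes the `data` folder sits in the current working directory, and `parquet_is_present` is a hypothetical helper, not part of `data_loading`.

```python
from pathlib import Path


def parquet_is_present(data_dir="data", filename="itineraries_snappy.parquet"):
    """Return True if the parquet file exists inside data_dir."""
    # Assumption: data_dir is relative to the current working directory
    return (Path(data_dir) / filename).is_file()


if not parquet_is_present():
    print("Could not find data/itineraries_snappy.parquet; download it before continuing.")
```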

You can import these like normal libraries, as seen below:

In [None]:
from feature_engineering import apply_feature_engineering, add_dummies
from data_loading import load_data
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, mean_absolute_percentage_error

# Data Loading
Here is where you will call the `load_data` function from `data_loading`. No parameters are needed.

In [None]:
# Call load_data to get the data as a pandas dataframe
df = load_data()
df.head()

In [None]:
# The data is too large to use in its entirety, so take a sample of 800,000 rows
sample_size = 800000

# Get the first 800,000 rows
df_sample = df.iloc[:sample_size]
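Taking the first rows keeps whatever ordering the file already has. If that ordering matters for your model, a random sample is one alternative. The sketch below uses a small toy dataframe (the values are made up for illustration) to compare the two approaches; it is not what this notebook does.

```python
import pandas as pd

# Toy stand-in for the real dataframe (values are made up for illustration)
toy_df = pd.DataFrame({"totalFare": [float(i) for i in range(100)]})
toy_sample_size = 10

# Approach used in this notebook: the first N rows, in file order
head_sample = toy_df.iloc[:toy_sample_size]

# Alternative: N rows drawn at random, reproducible via random_state
random_sample = toy_df.sample(n=toy_sample_size, random_state=42).reset_index(drop=True)

print(head_sample["totalFare"].tolist())
print(sorted(random_sample["totalFare"].tolist()))
```

Resetting the index after sampling keeps a clean 0..N-1 index, which matches what slicing from the top of the dataframe produces.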

# Feature Engineering
Here is where you will call the `apply_feature_engineering` function from `feature_engineering`. Pass in the dataframe; no other parameters are needed.

In [None]:
# Call the apply_feature_engineering function from feature_engineering to get the data ready for ML Modeling
df_sample = apply_feature_engineering(df_sample)

Starting feature engineering...
Converting date columns...
Date conversion done. Time elapsed: 0.17s
Extracting travel duration...
Travel duration extraction done. Time elapsed: 1.53s
Imputing missing travel distances...
Imputation done. Time elapsed: 1.55s
Processing departure times...
Departure time processing done. Time elapsed: 294.14s
Extracting departure hour and float...
Departure time extraction done. Time elapsed: 294.19s
Processing airline codes...
Airline code processing done. Time elapsed: 295.73s
Processing cabin codes...
Cabin class processing done. Time elapsed: 298.13s
Binning seatsRemaining...
Seats binning done. Time elapsed: 298.15s
Calculating days to departure...
Day of week processing done. Time elapsed: 298.21s
Processing holiday features...
Holiday features processing done. Time elapsed: 298.28s
Dropping columns...
Dropping columns done. Time elapsed: 298.39s
Renaming columns...
Renaming done. Total time elapsed: 298.39s
Adding dummies...
Dummies added. Total ti
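The log above shows the feature engineering taking roughly five minutes. One way to avoid repeating that on every run is to cache the result on disk. The sketch below is only an illustration: `load_or_build` and the cache path in the usage comment are hypothetical and not part of `feature_engineering`.

```python
from pathlib import Path

import pandas as pd


def load_or_build(cache_path, build_fn):
    """Return the cached dataframe if it exists; otherwise build and cache it."""
    cache_path = Path(cache_path)
    if cache_path.is_file():
        # Cache hit: skip the expensive build step entirely
        return pd.read_pickle(cache_path)
    df = build_fn()
    df.to_pickle(cache_path)
    return df


# Hypothetical usage in this notebook (the path is an assumption):
# df_sample = load_or_build("data/df_sample_engineered.pkl",
#                           lambda: apply_feature_engineering(df_sample))
```

If the raw data or the feature engineering code changes, delete the cache file so the dataframe is rebuilt.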

In [None]:
# You should see the following columns and data types
df_sample.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 46 columns):
 #   Column                    Non-Null Count   Dtype   
---  ------                    --------------   -----   
 0   travelDuration            800000 non-null  int64   
 1   isRefundable              800000 non-null  bool    
 2   isNonStop                 800000 non-null  bool    
 3   totalFare                 800000 non-null  float64 
 4   seatsRemaining            800000 non-null  int64   
 5   totalTravelDistance       751962 non-null  float64 
 6   travelDistance            800000 non-null  int64   
 7   departureTimeHour         800000 non-null  int32   
 8   departureTimeFloat        800000 non-null  float64 
 9   binnedSeatsRemaining      798145 non-null  category
 10  daysToDeparture           800000 non-null  int64   
 11  departureDayOfWeek        800000 non-null  int32   
 12  isWeekend                 800000 non-null  bool    
 13  isHoliday                 800
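In the output above, `totalTravelDistance` and `binnedSeatsRemaining` have fewer than 800,000 non-null values, so some rows contain missing data. A minimal sketch for counting missing values per column, shown on a toy dataframe with made-up values:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_sample (values are made up for illustration)
toy_df = pd.DataFrame({
    "totalFare": [248.6, 248.6, 310.0, 125.5],
    "totalTravelDistance": [947.0, np.nan, 1200.0, np.nan],
})

# Count missing values per column, then keep only columns that have any
missing_counts = toy_df.isna().sum()
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)
```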

In [None]:
# The first 5 rows should look like this
df_sample.head()

| | travelDuration | isRefundable | isNonStop | totalFare | seatsRemaining | totalTravelDistance | travelDistance | departureTimeHour | departureTimeFloat | binnedSeatsRemaining | ... | destinationAirport_IAD | destinationAirport_JFK | destinationAirport_LAX | destinationAirport_LGA | destinationAirport_MIA | destinationAirport_OAK | destinationAirport_ORD | destinationAirport_PHL | destinationAirport_SFO | cabinClass_basic economy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 149 | False | True | 248.6 | 9 | 947.0 | 947 | 16 | 16.95 | 2 | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | 150 | False | True | 248.6 | 4 | 947.0 | 947 | 10 | 10.5 | 1 | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | 150 | False | True | 248.6 | 9 | 947.0 | 947 | 15 | 15.583333 | 2 | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | 152 | False | True | 248.6 | 8 | 947.0 | 947 | 17 | 17.983333 | 2 | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | 154 | False | True | 248.6 | 9 | 947.0 | 947 | 13 | 13.983333 | 2 | ... | False | False | False | False | False | False | False | False | False | False |


# Example ML Modeling: Decision Tree
You can now use sklearn as normal, as shown below:

In [None]:
# Instantiate a decision tree regressor (since we are predicting price, not classifying)
dt = DecisionTreeRegressor(random_state=42)

In [None]:
# Our X variables in these models will be all columns other than price
X = df_sample.drop(columns=['totalFare'])

# Our y variable is price, which is called 'totalFare'
y = df_sample['totalFare']

# Split the data into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Fit and predict the data
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
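By default, `DecisionTreeRegressor` has `max_depth=None` and keeps splitting until its leaves are pure, so it can fit the training data very closely. The sketch below uses synthetic data (not the flight data) to compare train and test R2 for an unrestricted tree and a depth-limited one; the depth of 4 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: a signal plus Gaussian noise
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(500, 2))
y_toy = 3 * X_toy[:, 0] + np.sin(X_toy[:, 1]) + rng.normal(0, 1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Fit one unrestricted tree and one depth-limited tree, recording (train R2, test R2)
scores = {}
for name, depth in [("unrestricted", None), ("max_depth=4", 4)]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    scores[name] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"{name}: train R2 = {scores[name][0]:.4f}, test R2 = {scores[name][1]:.4f}")
```

A large gap between the train and test scores suggests the tree is memorizing the training set, in which case limiting the depth is one thing to try.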

In [None]:
# Calculate the error metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mape = mean_absolute_percentage_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print the error metrics using four decimal places
print(f"Mean Absolute Error: {mae:.4f}")
print(f"Mean Squared Error: {mse:.4f}")
print(f"Root Mean Squared Error: {rmse:.4f}")
print(f"Mean Absolute Percentage Error: {mape:.4%}")
print(f"R2: {r2:.4f}")

Mean Absolute Error: 47.3841
Mean Squared Error: 15264.4149
Root Mean Squared Error: 123.5492
Mean Absolute Percentage Error: 14.4713%
R2: 0.7230
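To make these metrics concrete, here is a small worked sketch that computes each one by hand on four made-up prices. It follows the standard definitions; `regression_metrics` is a hypothetical helper, not an sklearn function.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, MAPE and R2 from two equal-length lists."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)
    # MAPE is a fraction here (0.075 means 7.5%); assumes no true value is zero
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n

    # R2 = 1 - (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    return {"mae": mae, "mse": mse, "rmse": rmse, "mape": mape, "r2": r2}


# Made-up prices for illustration only
toy_metrics = regression_metrics([100, 200, 300, 400], [110, 190, 330, 380])
print(toy_metrics)
```

Because MSE squares each error, a few large misses raise RMSE well above MAE, which is the pattern in the decision tree results above (RMSE of about 123.5 against MAE of about 47.4).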
