# Preprocess Data

---

In this notebook, we reduce the data to only the columns we want to keep.

There are two types of variables that we want to keep:
- Those which are practical for the user to input when looking up a price prediction
- Those which would actually contribute to the price of a flight

---

## Data Dictionary Reminder

Below is a reminder of the variables we have. Next to each one, after the `//`, we note what we want to do with that column.

- legId: An identifier for the flight. // Not needed
- searchDate: The date (YYYY-MM-DD) on which this entry was taken from Expedia. // Change search date and flight date into one variable - days before flight
- flightDate: The date (YYYY-MM-DD) of the flight. // Potentially separate this into just the day. Month is meaningless as the data is not even for a full year
- startingAirport: Three-character IATA airport code for the initial location. // One hot encoding
- destinationAirport: Three-character IATA airport code for the arrival location. // One hot encoding
- fareBasisCode: The fare basis code. // Not required
- travelDuration: The travel duration in hours and minutes. // Not required
- elapsedDays: The number of elapsed days (usually 0). // Not required
- isBasicEconomy: Boolean for whether the ticket is for basic economy. // Change into binary
- isRefundable: Boolean for whether the ticket is refundable. // Change into binary 
- isNonStop: Boolean for whether the flight is non-stop. // Change into binary 
- baseFare: The price of the ticket (in USD). // Not required
- totalFare: The price of the ticket (in USD) including taxes and other fees. // No change needed
- seatsRemaining: Integer for the number of seats remaining. // Not required
- totalTravelDistance: The total travel distance in miles. This data is sometimes missing. // Not required
- segmentsDepartureTimeEpochSeconds: String containing the departure time (Unix time) for each leg of the trip. The entries for each of the legs are separated by '||'. // Not required
- segmentsDepartureTimeRaw: String containing the departure time (ISO 8601 format: YYYY-MM-DDThh:mm:ss.000±[hh]:00) for each leg of the trip. The entries for each of the legs are separated by '||'. // Not required
- segmentsArrivalTimeEpochSeconds: String containing the arrival time (Unix time) for each leg of the trip. The entries for each of the legs are separated by '||'. // Not required
- segmentsArrivalTimeRaw: String containing the arrival time (ISO 8601 format: YYYY-MM-DDThh:mm:ss.000±[hh]:00) for each leg of the trip. The entries for each of the legs are separated by '||'. // Not required
- segmentsArrivalAirportCode: String containing the IATA airport code for the arrival location for each leg of the trip. The entries for each of the legs are separated by '||'. // Not required
- segmentsDepartureAirportCode: String containing the IATA airport code for the departure location for each leg of the trip. The entries for each of the legs are separated by '||'. // Not required
- segmentsAirlineName: String containing the name of the airline that services each leg of the trip. The entries for each of the legs are separated by '||'. // Most trips use the same airline the whole way. Drop trips that don't, and one hot encode the airline for trips that do. We can also use this to count how many layovers there are.
- segmentsAirlineCode: String containing the two-letter airline code that services each leg of the trip. The entries for each of the legs are separated by '||'. // Not required
- segmentsEquipmentDescription: String containing the type of airplane used for each leg of the trip (e.g. "Airbus A321" or "Boeing 737-800"). The entries for each of the legs are separated by '||'. // Not required
- segmentsDurationInSeconds: String containing the duration of the flight (in seconds) for each leg of the trip. The entries for each of the legs are separated by '||'. // Not required
- segmentsDistance: String containing the distance traveled (in miles) for each leg of the trip. The entries for each of the legs are separated by '||'. // Not required
- segmentsCabinCode: String containing the cabin for each leg of the trip (e.g. "coach"). The entries for each of the legs are separated by '||'. // Not required
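
Several of the `segments*` columns above share the same '||'-separated layout. As a minimal sketch in plain Python (no Spark needed), the snippet below shows how one such string breaks down into legs and how a layover count follows from it. The helper name `split_segments` and the sample value are illustrative assumptions, not part of the dataset or the notebook.

```python
# Hypothetical helper: split a '||'-delimited segments field into one entry per leg.
def split_segments(raw):
    return raw.split("||")

# Illustrative value, not taken from the dataset.
airlines = split_segments("Delta||Delta||United")

num_legs = len(airlines)      # one entry per leg of the trip
num_layovers = num_legs - 1   # a layover sits between two consecutive legs

print(airlines, num_legs, num_layovers)
```

A non-stop itinerary has a single entry and therefore zero layovers.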

## Load Spark and Data

In [18]:
from itertools import groupby

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, datediff, dayofmonth, when, udf, split, size
from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.ml import Pipeline

from pyspark.sql.types import IntegerType, StringType

In [19]:
spark = SparkSession.builder.appName("flights").getOrCreate()

# REPLACE WITH DATA FILEPATH
DATA_PATH = "../data/itineraries.csv"

df = spark.read.csv(DATA_PATH, header=True, inferSchema=True)

## Drop the Unnecessary Columns

In [20]:
cols_to_drop = [
    "legId", "fareBasisCode", "travelDuration", "elapsedDays", "baseFare", "seatsRemaining", "totalTravelDistance",
    "segmentsDepartureTimeEpochSeconds", "segmentsDepartureTimeRaw", "segmentsArrivalTimeEpochSeconds", "segmentsArrivalTimeRaw",
    "segmentsArrivalAirportCode", "segmentsDepartureAirportCode", "segmentsAirlineCode", "segmentsEquipmentDescription",
    "segmentsDurationInSeconds", "segmentsDistance", "segmentsCabinCode",
]

In [21]:
# drop cols
df = df.drop(*cols_to_drop)

In [22]:
df.show()

+----------+----------+---------------+------------------+--------------+------------+---------+---------+--------------------+
|searchDate|flightDate|startingAirport|destinationAirport|isBasicEconomy|isRefundable|isNonStop|totalFare| segmentsAirlineName|
+----------+----------+---------------+------------------+--------------+------------+---------+---------+--------------------+
|2022-04-16|2022-04-17|            ATL|               BOS|         false|       false|     true|    248.6|               Delta|
|2022-04-16|2022-04-17|            ATL|               BOS|         false|       false|     true|    248.6|               Delta|
|2022-04-16|2022-04-17|            ATL|               BOS|         false|       false|     true|    248.6|               Delta|
|2022-04-16|2022-04-17|            ATL|               BOS|         false|       false|     true|    248.6|               Delta|
|2022-04-16|2022-04-17|            ATL|               BOS|         false|       false|     true|    248.

## Days Before Flight Column

In [23]:
df = df.withColumn("days_before_flight", datediff(col("flightDate"), col("searchDate")))

## Day of Month Column

In [24]:
df = df.withColumn("day", dayofmonth(col("flightDate")))
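
To make the two derived columns concrete, here is a minimal plain-Python sketch of the same arithmetic using the standard `datetime` module. The function names are hypothetical and exist only for illustration. The dates are the ones shown in the first rows of the output above.

```python
from datetime import date

def days_before_flight(search_date, flight_date):
    """Days from the search date to the flight date (flight date minus search date)."""
    return (date.fromisoformat(flight_date) - date.fromisoformat(search_date)).days

def day_of_month(flight_date):
    """Day-of-month component of an ISO date string."""
    return date.fromisoformat(flight_date).day

print(days_before_flight("2022-04-16", "2022-04-17"))  # 1
print(day_of_month("2022-04-17"))                      # 17
```

The argument order matters: `datediff(end, start)` in Spark returns `end - start`, so passing `flightDate` first gives a positive number for searches made before the flight.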

## Drop: searchDate and flightDate

In [25]:
df = df.drop("searchDate", "flightDate")

In [26]:
df.show()

+---------------+------------------+--------------+------------+---------+---------+--------------------+------------------+---+
|startingAirport|destinationAirport|isBasicEconomy|isRefundable|isNonStop|totalFare| segmentsAirlineName|days_before_flight|day|
+---------------+------------------+--------------+------------+---------+---------+--------------------+------------------+---+
|            ATL|               BOS|         false|       false|     true|    248.6|               Delta|                 1| 17|
|            ATL|               BOS|         false|       false|     true|    248.6|               Delta|                 1| 17|
|            ATL|               BOS|         false|       false|     true|    248.6|               Delta|                 1| 17|
|            ATL|               BOS|         false|       false|     true|    248.6|               Delta|                 1| 17|
|            ATL|               BOS|         false|       false|     true|    248.6|             

## One Hot Encode Starting and Destination Airport

In [27]:
# string indexers
string_indexer = StringIndexer(inputCol="startingAirport", outputCol="startingAirport_index")
string_indexer_2 = StringIndexer(inputCol="destinationAirport", outputCol="destinationAirport_index")

# encoders
encoder = OneHotEncoder(inputCol="startingAirport_index", outputCol="startingAirport_encoded")
encoder_2 = OneHotEncoder(inputCol="destinationAirport_index", outputCol="destinationAirport_encoded")

# pipeline
pipeline = Pipeline(stages=[string_indexer, string_indexer_2, encoder, encoder_2])
pipeline_model = pipeline.fit(df)
df = pipeline_model.transform(df)
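
The two pipeline stages can be pictured with a small plain-Python sketch. It assumes the Spark defaults: `StringIndexer` orders labels by descending frequency, and `OneHotEncoder` drops the last category, so `n` labels become vectors of length `n - 1`. The function names, the alphabetical tie-break and the sample airports are illustrative assumptions. The sketch returns dense lists, whereas Spark stores sparse vectors.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to an index, most frequent label first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}

def one_hot_drop_last(index, num_labels):
    """Dense one-hot vector of length num_labels - 1; the last index maps to all zeros."""
    vec = [0.0] * (num_labels - 1)
    if index < num_labels - 1:
        vec[index] = 1.0
    return vec

airports = ["ATL", "BOS", "ATL", "DFW", "BOS", "ATL"]  # illustrative, not from the dataset
index = fit_string_index(airports)
encoded = {label: one_hot_drop_last(i, len(index)) for label, i in index.items()}

print(index)
print(encoded)
```

Under these defaults, a printed value such as `(15,[1],[1.0])` reads as a length-15 vector with a 1.0 at position 1, which would be consistent with 16 distinct airport codes.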

## Drop: startingAirport, destinationAirport, startingAirport_index, destinationAirport_index

In [28]:
df = df.drop(*["startingAirport", "destinationAirport", "startingAirport_index", "destinationAirport_index"])

## Convert Boolean Columns into Binary

In [29]:
bool_columns = ["isBasicEconomy", "isRefundable", "isNonStop"]

In [30]:
for col_name in bool_columns:
    df = df.withColumn(col_name, when(col(col_name), 1).otherwise(0))

In [31]:
df.show()

+--------------+------------+---------+---------+--------------------+------------------+---+-----------------------+--------------------------+
|isBasicEconomy|isRefundable|isNonStop|totalFare| segmentsAirlineName|days_before_flight|day|startingAirport_encoded|destinationAirport_encoded|
+--------------+------------+---------+---------+--------------------+------------------+---+-----------------------+--------------------------+
|             0|           0|        1|    248.6|               Delta|                 1| 17|         (15,[1],[1.0])|            (15,[2],[1.0])|
|             0|           0|        1|    248.6|               Delta|                 1| 17|         (15,[1],[1.0])|            (15,[2],[1.0])|
|             0|           0|        1|    248.6|               Delta|                 1| 17|         (15,[1],[1.0])|            (15,[2],[1.0])|
|             0|           0|        1|    248.6|               Delta|                 1| 17|         (15,[1],[1.0])|            (

## Create Number of Legs and Airline Name
## TODO: FIX THIS

In [32]:
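# These UDFs didn't seem to work and are not used; ignore for now.
# The next cell does the same job with built-in column functions.

def all_equal(iterable):
    # True when every element is equal (an empty iterable also counts as equal)
    g = groupby(iterable)
    return next(g, True) and not next(g, False)

# get airline name
def get_airline(segmentsAirlineName):
    # a missing value would raise AttributeError on .split
    if segmentsAirlineName is None:
        return None
    splits = segmentsAirlineName.split("||")
    if all_equal(splits):
        return splits[0]
    else:
        return "Multi-Airline"

# get number of splits
def get_nsplits(segmentsAirlineName):
    if segmentsAirlineName is None:
        return None
    splits = segmentsAirlineName.split("||")
    return len(splits)

# register as udf, with explicit return types
airline_udf = udf(get_airline, StringType())
nsplit_udf = udf(get_nsplits, IntegerType())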

In [33]:
from pyspark.sql.functions import array_distinct

# convert segments airline name into a list
df = df.withColumn("segmentsAirlineName", split(col("segmentsAirlineName"), r'\|\|'))

# count the number of legs in the trip
df = df.withColumn("num_legs", size(col("segmentsAirlineName")))

# check whether every leg in the list uses the same airline
# (comparing only elements 0 and 1 would miss trips such as A||A||B;
# a single-leg trip has exactly one distinct value, so it counts as the same)
df = df.withColumn("All_Same", size(array_distinct(col("segmentsAirlineName"))) == 1)

# get the name of the airline
df = df.withColumn("airline_name", col("segmentsAirlineName").getItem(0))

# filter out trips that use more than one airline
df = df.filter(col("All_Same"))
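
The same-airline check is easy to get subtly wrong for trips with three or more legs. The plain-Python sketch below contrasts comparing only the first two legs with comparing every leg. Both function names and the sample itinerary are hypothetical and only illustrate the logic.

```python
def first_two_equal(legs):
    """Compare only the first two legs; a single-leg trip counts as same-airline."""
    return len(legs) < 2 or legs[0] == legs[1]

def all_legs_equal(legs):
    """True when every leg is operated by the same airline."""
    return len(set(legs)) == 1

mixed = "Delta||Delta||United".split("||")  # illustrative value, not from the dataset

print(first_two_equal(mixed))  # True: the third leg is never inspected
print(all_legs_equal(mixed))   # False: the trip uses two airlines
```

For one- and two-leg trips the two checks agree. They only diverge when a later leg switches airline.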

In [34]:
# drop segmentsAirlineName
df = df.drop("segmentsAirlineName")

In [35]:
# convert All_Same from boolean into binary (0/1)
# note: after the filter above, every remaining row has All_Same = 1
df = df.withColumn("All_Same", when(col("All_Same"), 1).otherwise(0))

In [36]:
# convert airline_name into one hot encoded
string_indexer_3 = StringIndexer(inputCol="airline_name", outputCol="airline_name_index")
encoder_3 = OneHotEncoder(inputCol="airline_name_index", outputCol="airline_name_encoded")
pipeline_2 = Pipeline(stages=[string_indexer_3, encoder_3])

pipeline_model_2 = pipeline_2.fit(df)
df = pipeline_model_2.transform(df)

In [37]:
# drop airline_name, airline_name_index
df = df.drop(*["airline_name", "airline_name_index"])

In [38]:
df.show()

+--------------+------------+---------+---------+------------------+---+-----------------------+--------------------------+--------+--------+--------------------+
|isBasicEconomy|isRefundable|isNonStop|totalFare|days_before_flight|day|startingAirport_encoded|destinationAirport_encoded|num_legs|All_Same|airline_name_encoded|
+--------------+------------+---------+---------+------------------+---+-----------------------+--------------------------+--------+--------+--------------------+
|             0|           0|        1|    248.6|                 1| 17|         (15,[1],[1.0])|            (15,[2],[1.0])|       1|       1|      (12,[1],[1.0])|
|             0|           0|        1|    248.6|                 1| 17|         (15,[1],[1.0])|            (15,[2],[1.0])|       1|       1|      (12,[1],[1.0])|
|             0|           0|        1|    248.6|                 1| 17|         (15,[1],[1.0])|            (15,[2],[1.0])|       1|       1|      (12,[1],[1.0])|
|             0|      

## Save the Final Data

In [42]:
# Where the file should be saved
TARGET_PATH = "../data/itineraries_processed.csv"
TARGET_PATH_PAR = "../data/itineraries_processed.parquet"

In [40]:
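# toPandas() collects the whole DataFrame onto the driver;
# the vector columns are written to the CSV as their string form
df.toPandas().to_csv(TARGET_PATH, header=True, index=False)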

In [43]:
# writing to parquet keeps the schema (including the vector columns), which makes loading the data easier
df.write.parquet(TARGET_PATH_PAR)

In [41]:
df.printSchema()

root
 |-- isBasicEconomy: integer (nullable = false)
 |-- isRefundable: integer (nullable = false)
 |-- isNonStop: integer (nullable = false)
 |-- totalFare: double (nullable = true)
 |-- days_before_flight: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- startingAirport_encoded: vector (nullable = true)
 |-- destinationAirport_encoded: vector (nullable = true)
 |-- num_legs: integer (nullable = false)
 |-- All_Same: integer (nullable = false)
 |-- airline_name_encoded: vector (nullable = true)

