# Debug Notebook

This notebook is used to debug and test the Spark environment and configurations.

## Setup

In [1]:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

Intitializing Scala interpreter ...

Spark Web UI available at http://DESKTOP-85RDGBL:4041
SparkContext available as 'sc' (version = 3.5.1, master = local[*], app id = local-1751486834139)
SparkSession available as 'spark'


import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession


In [2]:
val spark = SparkSession.builder
  .appName("Debug")
  .getOrCreate()

spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@7e056281


## Data loading, parsing, and cleaning

The structure of the dataset is as follows:
- **legId**: An identifier for the flight.
- **searchDate**: The date (`YYYY-MM-DD`) on which this entry was taken from Expedia.
- **flightDate**: The date (`YYYY-MM-DD`) of the flight.
- **startingAirport**: Three-character IATA airport code for the initial location.
- **destinationAirport**: Three-character IATA airport code for the arrival location.
- **fareBasisCode**: The fare basis code.
- **travelDuration**: The travel duration in hours and minutes, stored as an ISO 8601 duration string.
- **elapsedDays**: The number of elapsed days (usually 0).
- **isBasicEconomy**: Boolean indicating whether the ticket is for basic economy.
- **isRefundable**: Boolean indicating whether the ticket is refundable.
- **isNonStop**: Boolean indicating whether the flight is non-stop.
- **baseFare**: The price of the ticket (in USD).
- **totalFare**: The price of the ticket (in USD) including taxes and other fees.
- **seatsRemaining**: Integer indicating the number of seats remaining.
- **totalTravelDistance**: The total travel distance in miles. This data is sometimes missing.
- **segmentsDepartureTimeEpochSeconds**: String containing the departure time (Unix time) for each leg of the trip, separated by `||`.
- **segmentsDepartureTimeRaw**: String containing the departure time (ISO 8601 format: `YYYY-MM-DDThh:mm:ss.000±[hh]:00`) for each leg of the trip, separated by `||`.
- **segmentsArrivalTimeEpochSeconds**: String containing the arrival time (Unix time) for each leg of the trip, separated by `||`.
- **segmentsArrivalTimeRaw**: String containing the arrival time (ISO 8601 format: `YYYY-MM-DDThh:mm:ss.000±[hh]:00`) for each leg of the trip, separated by `||`.
- **segmentsArrivalAirportCode**: String containing the IATA airport code for the arrival location for each leg of the trip, separated by `||`.
- **segmentsDepartureAirportCode**: String containing the IATA airport code for the departure location for each leg of the trip, separated by `||`.
- **segmentsAirlineName**: String containing the name of the airline that services each leg of the trip, separated by `||`.
- **segmentsAirlineCode**: String containing the two-letter airline code that services each leg of the trip, separated by `||`.
- **segmentsEquipmentDescription**: String containing the type of airplane used for each leg of the trip (e.g., "Airbus A321" or "Boeing 737-800"), separated by `||`.
- **segmentsDurationInSeconds**: String containing the duration of the flight (in seconds) for each leg of the trip, separated by `||`.
- **segmentsDistance**: String containing the distance traveled (in miles) for each leg of the trip, separated by `||`.
- **segmentsCabinCode**: String containing the cabin class for each leg of the trip (e.g., "coach"), separated by `||`.
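
The `segments*` fields above pack one value per leg into a single `||`-separated string. As a minimal sketch of how such a field could be unpacked (the helper name `splitSegments` is hypothetical and not part of this notebook), note that `String.split` takes a regular expression, so the `|` characters have to be quoted:

```scala
import java.util.regex.Pattern

// Hypothetical helper: splits a `||`-separated segment field into one value per leg.
// Pattern.quote is needed because `|` means alternation in a regular expression.
def splitSegments(field: String): List[String] =
  if (field.isEmpty) Nil
  else field.split(Pattern.quote("||"), -1).toList
```

A single-leg itinerary contains no separator, so the result is a one-element list.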


Because Scala 2 tuples hold at most 22 elements and each record has 27 fields, we use a case class `FlightRecord` to represent each flight record.

In [13]:
case class FlightRecord(
  legId: String,
  searchDate: String,
  flightDate: String,
  startingAirport: String,
  destinationAirport: String,
  fareBasisCode: String,
  travelDuration: String,
  elapsedDays: Int,
  isBasicEconomy: Boolean,
  isRefundable: Boolean,
  isNonStop: Boolean,
  baseFare: Double,
  totalFare: Double,
  seatsRemaining: Int,
  totalTravelDistance: Option[Double],
  segmentsDepartureTimeEpochSeconds: String,
  segmentsDepartureTimeRaw: String,
  segmentsArrivalTimeEpochSeconds: String,
  segmentsArrivalTimeRaw: String,
  segmentsArrivalAirportCode: String,
  segmentsDepartureAirportCode: String,
  segmentsAirlineName: String,
  segmentsAirlineCode: String,
  segmentsEquipmentDescription: String,
  segmentsDurationInSeconds: String,
  segmentsDistance: String,
  segmentsCabinCode: String
)

defined class FlightRecord


In [14]:
def parseLine(line: String): FlightRecord = {
  val columns = line.split(",", -1).map(_.trim)

  FlightRecord(
    columns(0),
    columns(1),
    columns(2),
    columns(3),
    columns(4),
    columns(5),
    columns(6),
    columns(7).toInt,
    columns(8).toBoolean,
    columns(9).toBoolean,
    columns(10).toBoolean,
    columns(11).toDouble,
    columns(12).toDouble,
    columns(13).toInt,
    if (columns(14).isEmpty) None else Some(columns(14).toDouble),
    columns(15),
    columns(16),
    columns(17),
    columns(18),
    columns(19),
    columns(20),
    columns(21),
    columns(22),
    columns(23),
    columns(24),
    columns(25),
    columns(26)
  )
}

parseLine: (line: String)FlightRecord


Now we can load the dataset and parse it into an RDD of `FlightRecord` objects.

In [15]:
val path = "../../../../datasets/itineraries_sample.csv"

val rddRaw = spark.sparkContext.textFile(path)
val rddParsed = rddRaw
  .filter(line => !line.startsWith("legId")) // Skip header
  .map(parseLine)

path: String = ../../../../datasets/itineraries_sample.csv
rddRaw: org.apache.spark.rdd.RDD[String] = ../../../../datasets/itineraries_sample.csv MapPartitionsRDD[5] at textFile at <console>:31
rddParsed: org.apache.spark.rdd.RDD[FlightRecord] = MapPartitionsRDD[7] at map at <console>:34


Before proceeding, we will clean the data by removing records with missing or invalid values.
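
One way to drop invalid records, sketched here under assumptions, is to wrap the throwing parser in `Try` and keep only the lines that parse. `MiniRecord`, `parseMini`, and `parseOrNone` are hypothetical names used for illustration; the notebook's own `parseLine` and `FlightRecord` would take their place.

```scala
import scala.util.Try

// Hypothetical reduced record used only for this sketch; FlightRecord has 27 fields.
case class MiniRecord(legId: String, elapsedDays: Int, totalFare: Double)

// Throws on a non-numeric field or on a line with too few columns.
def parseMini(line: String): MiniRecord = {
  val c = line.split(",", -1).map(_.trim)
  MiniRecord(c(0), c(1).toInt, c(2).toDouble)
}

// Wraps a throwing parser so that an invalid line becomes None instead of an exception.
def parseOrNone[A](parse: String => A)(line: String): Option[A] =
  Try(parse(line)).toOption

val lines = List("a1,0,248.60", "a2,zero,103.98", "a3,1")

// flatMap keeps the Some values and silently drops the None values.
val cleaned = lines.flatMap(line => parseOrNone(parseMini)(line))
```

The same `flatMap` pattern applies to an RDD of lines. Dropping records silently hides how many were rejected, so counting the failures first may be worthwhile.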

Because the `travelDuration` values are ISO 8601 duration strings, we will convert them to integer minutes for easier processing.
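
A minimal sketch of that conversion, assuming the values are standard ISO 8601 durations that `java.time.Duration.parse` accepts (the helper name `durationToMinutes` is hypothetical):

```scala
import java.time.Duration
import scala.util.Try

// Hypothetical helper: converts an ISO 8601 duration to whole minutes.
// Returns None when the string is empty or malformed; leftover seconds are truncated.
def durationToMinutes(raw: String): Option[Int] =
  Try(Duration.parse(raw.trim).toMinutes.toInt).toOption
```

Returning an `Option` lets the cleaning step discard records whose duration cannot be parsed.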

The remaining columns need no conversion. Columns that the jobs below do not use are left as strings.

### Time log functions for next jobs
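
The notebook does not show these functions. As a sketch of what a timing helper could look like (the name `timed` and its signature are assumptions), the following runs a block, prints the wall-clock time, and returns both the result and the elapsed milliseconds:

```scala
// Hypothetical helper: evaluates `block` once and measures how long it takes.
def timed[A](label: String)(block: => A): (A, Long) = {
  val start = System.nanoTime()
  val result = block // the by-name argument is evaluated here
  val elapsedMs = (System.nanoTime() - start) / 1000000L
  println(s"[$label] took $elapsedMs ms")
  (result, elapsedMs)
}
```

Spark transformations are lazy, so the timed block needs to end in an action such as `collect()` or `count()` for the measurement to include the actual computation.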

## Jobs

- **Average Flight Price per Route:** Calculate the average ticket price for each departure-destination pair.
- **Direct vs. Connecting Flights Price Comparison:** Compare average prices between non-stop and connecting flights.
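
Both jobs are grouped averages that differ only in the grouping key. The sketch below shows this on a small in-memory collection, not on the RDD. The `Fare` record, the `averageBy` helper, and the sample values are hypothetical, and using `totalFare` as the ticket price (as opposed to `baseFare`) is an assumption.

```scala
// Hypothetical reduced record with only the fields the two jobs need.
case class Fare(
  startingAirport: String,
  destinationAirport: String,
  isNonStop: Boolean,
  totalFare: Double
)

// Averages totalFare per key; the key function decides which job is computed.
def averageBy[K](fares: Seq[Fare])(key: Fare => K): Map[K, Double] =
  fares.groupBy(key).map { case (k, group) =>
    k -> (group.map(_.totalFare).sum / group.size)
  }

// Made-up sample values for illustration only.
val sample = Seq(
  Fare("BOS", "LAX", isNonStop = true, 300.0),
  Fare("BOS", "LAX", isNonStop = false, 200.0),
  Fare("ATL", "JFK", isNonStop = true, 100.0)
)

// Job 1: key by the (departure, destination) pair.
val avgPerRoute = averageBy(sample)(f => (f.startingAirport, f.destinationAirport))

// Job 2: key by the non-stop flag.
val avgByNonStop = averageBy(sample)(_.isNonStop)
```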

### Job implementation: Non-optimized version
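
The notebook leaves this section empty. One plausible baseline for the per-route average, shown here as an assumption and not as the author's implementation, uses `groupByKey`, which shuffles every fare of a route across the cluster before averaging. The function name and the input shape are hypothetical.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical input shape: ((startingAirport, destinationAirport), fare).
def avgFareGroupByKey(
    fares: RDD[((String, String), Double)]
): RDD[((String, String), Double)] =
  fares
    .groupByKey()                               // moves all values of a key to one task
    .mapValues(values => values.sum / values.size) // averages only after the shuffle
```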

### Job implementation: Optimized version
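
This section is also empty in the notebook. A common way to reduce shuffle volume for a grouped average, sketched here as an assumption about what the optimization could be, is to carry a `(sum, count)` pair per key and merge the pairs with `reduceByKey`, which combines values within each partition before shuffling. The function name and the input shape are hypothetical.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical input shape: ((startingAirport, destinationAirport), fare).
def avgFareReduceByKey(
    fares: RDD[((String, String), Double)]
): RDD[((String, String), Double)] =
  fares
    .mapValues(fare => (fare, 1L))                                   // (sum, count) seed
    .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) } // map-side combine
    .mapValues { case (sum, count) => sum / count }
```

Only one `(sum, count)` pair per key and partition crosses the network, whereas `groupByKey` ships every individual fare.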