# Debug Notebook

This notebook is used to debug and test the Spark environment and configurations.

## Setup

In [None]:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Row

In [None]:
val spark = SparkSession.builder
  .appName("Debug")
  .getOrCreate()

## Data loading, parsing, and cleaning

The structure of the dataset is as follows:
- **legId**: An identifier for the flight.
- **searchDate**: The date (`YYYY-MM-DD`) on which this entry was taken from Expedia.
- **flightDate**: The date (`YYYY-MM-DD`) of the flight.
- **startingAirport**: Three-character IATA airport code for the initial location.
- **destinationAirport**: Three-character IATA airport code for the arrival location.
- **fareBasisCode**: The fare basis code.
- **travelDuration**: The travel duration in hours and minutes, as an ISO 8601 duration (e.g., `PT2H30M`).
- **elapsedDays**: The number of elapsed days (usually 0).
- **isBasicEconomy**: Boolean indicating whether the ticket is for basic economy.
- **isRefundable**: Boolean indicating whether the ticket is refundable.
- **isNonStop**: Boolean indicating whether the flight is non-stop.
- **baseFare**: The price of the ticket (in USD).
- **totalFare**: The price of the ticket (in USD) including taxes and other fees.
- **seatsRemaining**: Integer indicating the number of seats remaining.
- **totalTravelDistance**: The total travel distance in miles. This data is sometimes missing.
- **segmentsDepartureTimeEpochSeconds**: String containing the departure time (Unix time) for each leg of the trip, separated by `||`.
- **segmentsDepartureTimeRaw**: String containing the departure time (ISO 8601 format: `YYYY-MM-DDThh:mm:ss.000±[hh]:00`) for each leg of the trip, separated by `||`.
- **segmentsArrivalTimeEpochSeconds**: String containing the arrival time (Unix time) for each leg of the trip, separated by `||`.
- **segmentsArrivalTimeRaw**: String containing the arrival time (ISO 8601 format: `YYYY-MM-DDThh:mm:ss.000±[hh]:00`) for each leg of the trip, separated by `||`.
- **segmentsArrivalAirportCode**: String containing the IATA airport code for the arrival location for each leg of the trip, separated by `||`.
- **segmentsDepartureAirportCode**: String containing the IATA airport code for the departure location for each leg of the trip, separated by `||`.
- **segmentsAirlineName**: String containing the name of the airline that services each leg of the trip, separated by `||`.
- **segmentsAirlineCode**: String containing the two-letter airline code that services each leg of the trip, separated by `||`.
- **segmentsEquipmentDescription**: String containing the type of airplane used for each leg of the trip (e.g., "Airbus A321" or "Boeing 737-800"), separated by `||`.
- **segmentsDurationInSeconds**: String containing the duration of the flight (in seconds) for each leg of the trip, separated by `||`.
- **segmentsDistance**: String containing the distance traveled (in miles) for each leg of the trip, separated by `||`.
- **segmentsCabinCode**: String containing the cabin class for each leg of the trip (e.g., "coach"), separated by `||`.


Because the `travelDuration` values are ISO 8601 durations (for example, `PT2H30M`), we convert them to integer minutes for easier processing.

In [None]:
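def parseDurationToMinutes(duration: String): Int = {
    // Matches the hour component of a duration such as "PT2H30M".
    val hourPattern = "PT(\\d+)H".r
    // Matches the minute component, with or without a preceding hour component.
    val minutePattern = "PT(?:\\d+H)?(\\d+)M".r

    // A component that is absent from the string counts as 0.
    val hours = hourPattern.findFirstMatchIn(duration).map(_.group(1).toInt).getOrElse(0)
    val minutes = minutePattern.findFirstMatchIn(duration).map(_.group(1).toInt).getOrElse(0)

    hours * 60 + minutes
}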
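
As a cross-check on the regex-based parser, the JDK's `java.time.Duration` can parse ISO 8601 durations directly. The sketch below is an alternative, not the function used in this notebook; the name `durationToMinutes` is hypothetical. It differs in two ways: it also accepts a day component (such as `P1DT1H5M`), and it throws a `DateTimeParseException` on malformed input instead of returning 0.

```scala
import java.time.Duration

// Hypothetical helper: parse an ISO 8601 duration and return whole minutes.
def durationToMinutes(duration: String): Long =
  Duration.parse(duration).toMinutes

println(durationToMinutes("PT2H30M"))  // hours and minutes
println(durationToMinutes("PT45M"))    // minutes only
println(durationToMinutes("P1DT1H5M")) // one day, one hour, five minutes
```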

Because Scala tuples are limited to 22 elements and each record has 27 columns, we will use the `Row` class to represent each record in the dataset.

In [None]:
def parseLineToRow(line: String): Row = {
    val columns = line.split(",", -1).map(_.trim)

    try {
        Row(
            columns(0), // legId
            columns(1), // searchDate
            columns(2), // flightDate
            columns(3), // startingAirport
            columns(4), // destinationAirport
            columns(5), // fareBasisCode
            parseDurationToMinutes(columns(6)), // travelDuration (minutes)
            columns(7).toInt, // elapsedDays
            columns(8).toBoolean, // isBasicEconomy
            columns(9).toBoolean, // isRefundable
            columns(10).toBoolean, // isNonStop
            columns(11).toDouble, // baseFare
            columns(12).toDouble, // totalFare
            columns(13).toInt, // seatsRemaining
            if (columns(14).isEmpty) null else columns(14).toDouble, // totalTravelDistance (null when missing)
            columns(15), // segmentsDepartureTimeEpochSeconds
            columns(16), // segmentsDepartureTimeRaw
            columns(17), // segmentsArrivalTimeEpochSeconds
            columns(18), // segmentsArrivalTimeRaw
            columns(19), // segmentsArrivalAirportCode
            columns(20), // segmentsDepartureAirportCode
            columns(21), // segmentsAirlineName
            columns(22), // segmentsAirlineCode
            columns(23), // segmentsEquipmentDescription
            columns(24), // segmentsDurationInSeconds
            columns(25), // segmentsDistance
            columns(26)  // segmentsCabinCode
        )
    } catch {
        case e: Exception =>
            println(s"Error parsing line: $line")
            throw e
    }
}
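
The parser above passes `-1` as the limit to `split`. A small illustration of why, using a made-up line: without the limit, trailing empty fields are dropped, so a record whose last columns are empty would yield fewer than 27 elements and `columns(26)` would fail.

```scala
// Made-up line with an empty middle field and an empty trailing field.
val line = "a,,b,"

// A negative limit keeps trailing empty strings.
val keepTrailing = line.split(",", -1)
// The default limit (0) silently drops trailing empty strings.
val dropTrailing = line.split(",")

println(keepTrailing.length)
println(dropTrailing.length)
```

Note that splitting on every comma assumes that no field contains a quoted comma; if the dataset had such fields, a CSV-aware reader would be needed instead.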

Now we can load the dataset and parse it into an RDD of `Row` objects.

In [None]:
val path = "../../../../datasets/itineraries_sample.csv"

val rddRaw = spark.sparkContext.textFile(path)
val rddParsed = rddRaw
  .filter(line => !line.startsWith("legId")) // Skip header
  .map(parseLineToRow)

In [None]:
rddParsed.take(5).foreach(println)

Before proceeding, we will clean the data by removing records with missing or invalid values.

In [None]:
val rddClean = rddParsed.filter { row =>
    (0 until row.length).forall { i =>
        val value = row.get(i)
        value != null && value.toString.nonEmpty
    }
}
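
To make the effect of this filter concrete, here is the same predicate written over a plain `Seq[Any]` instead of a `Row`, applied to a few made-up records. The name `isComplete` and the sample values are hypothetical. A record is kept only if every field is non-null and has a non-empty string form, so any row whose `totalTravelDistance` was parsed as `null` is dropped.

```scala
// Hypothetical stand-in for the Row-based filter: true only if no field is null or empty.
def isComplete(values: Seq[Any]): Boolean =
  values.forall(v => v != null && v.toString.nonEmpty)

// Made-up records: (legId, travelDuration in minutes, totalTravelDistance).
val complete = Seq[Any]("id1", 120, 350.0)
val missingDistance = Seq[Any]("id2", 95, null)
val emptyString = Seq[Any]("", 95, 410.0)

println(isComplete(complete))
println(isComplete(missingDistance))
println(isComplete(emptyString))
```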

In [None]:
val totalRows = rddClean.count()
println(s"Total rows: $totalRows")

The columns needed for the analysis are now in a suitable format. The remaining columns are left as strings because the jobs below do not use them.

## Jobs

- **Average Flight Price per Route:** Calculate the average ticket price for each departure-destination pair.
- **Direct vs. Connecting Flights Price Comparison:** Compare average prices between non-stop and connecting flights.
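
Both jobs reduce to the same operation: averaging `totalFare` per key, where the key is the route for the first job and the `isNonStop` flag for the second. The sketch below shows that computation on plain Scala collections with made-up data, only to pin down the expected results; it is not the Spark implementation of either job. The `Fare` case class, the `averageBy` helper, and the sample values are all hypothetical.

```scala
// Hypothetical record holding only the fields the two jobs need.
final case class Fare(from: String, to: String, isNonStop: Boolean, totalFare: Double)

// Made-up sample data.
val sample = Seq(
  Fare("ATL", "BOS", isNonStop = true, 200.0),
  Fare("ATL", "BOS", isNonStop = false, 300.0),
  Fare("LAX", "JFK", isNonStop = true, 500.0)
)

// Group the records by key, then divide each group's fare sum by its size.
def averageBy[K](records: Seq[Fare])(key: Fare => K): Map[K, Double] =
  records
    .groupBy(key)
    .map { case (k, group) => k -> (group.map(_.totalFare).sum / group.size) }

// Job 1: average price per (startingAirport, destinationAirport) pair.
val avgPerRoute = averageBy(sample)(f => (f.from, f.to))
// Job 2: average price for non-stop versus connecting flights.
val avgByNonStop = averageBy(sample)(_.isNonStop)

println(avgPerRoute)
println(avgByNonStop)
```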

### Job 1: Average Flight Price per Route

#### Non-optimized version

#### Optimized version


### Job 2: Direct vs. Connecting Flights Price Comparison

#### Non-optimized version

#### Optimized version