# Debug Notebook

This notebook is used to debug and test the Spark environment and configurations.

## Setup

In [1]:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Row

import java.time.LocalDate
import java.time.format.DateTimeFormatter

Initializing Scala interpreter ...

Spark Web UI available at http://DESKTOP-85RDGBL:4040
SparkContext available as 'sc' (version = 3.5.1, master = local[*], app id = local-1751591079362)
SparkSession available as 'spark'


import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Row
import java.time.LocalDate
import java.time.format.DateTimeFormatter


In [2]:
val path_to_datasets = "../../../../datasets/"
val path_to_output = "../../../../src/main/outputs/"

val path_to_sample = path_to_datasets + "itineraries_sample.csv"

val spark = SparkSession.builder
  .appName("Debug")
  .getOrCreate()

path_to_datasets: String = ../../../../datasets/
path_to_output: String = ../../../../src/main/outputs/
path_to_sample: String = ../../../../datasets/itineraries_sample.csv
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@568f509d


## Data loading, parsing, and cleaning

The structure of the dataset is as follows:
- **legId**: An identifier for the flight.
- **searchDate**: The date (`YYYY-MM-DD`) on which this entry was taken from Expedia.
- **flightDate**: The date (`YYYY-MM-DD`) of the flight.
- **startingAirport**: Three-character IATA airport code for the initial location.
- **destinationAirport**: Three-character IATA airport code for the arrival location.
- **fareBasisCode**: The fare basis code.
- **travelDuration**: The travel duration in hours and minutes, encoded as an ISO 8601 duration string (the parser below expects the form `PT<hours>H<minutes>M`).
- **elapsedDays**: The number of elapsed days (usually 0).
- **isBasicEconomy**: Boolean indicating whether the ticket is for basic economy.
- **isRefundable**: Boolean indicating whether the ticket is refundable.
- **isNonStop**: Boolean indicating whether the flight is non-stop.
- **baseFare**: The price of the ticket (in USD).
- **totalFare**: The price of the ticket (in USD) including taxes and other fees.
- **seatsRemaining**: Integer indicating the number of seats remaining.
- **totalTravelDistance**: The total travel distance in miles. This data is sometimes missing.
- **segmentsDepartureTimeEpochSeconds**: String containing the departure time (Unix time) for each leg of the trip, separated by `||`.
- **segmentsDepartureTimeRaw**: String containing the departure time (ISO 8601 format: `YYYY-MM-DDThh:mm:ss.000±[hh]:00`) for each leg of the trip, separated by `||`.
- **segmentsArrivalTimeEpochSeconds**: String containing the arrival time (Unix time) for each leg of the trip, separated by `||`.
- **segmentsArrivalTimeRaw**: String containing the arrival time (ISO 8601 format: `YYYY-MM-DDThh:mm:ss.000±[hh]:00`) for each leg of the trip, separated by `||`.
- **segmentsArrivalAirportCode**: String containing the IATA airport code for the arrival location for each leg of the trip, separated by `||`.
- **segmentsDepartureAirportCode**: String containing the IATA airport code for the departure location for each leg of the trip, separated by `||`.
- **segmentsAirlineName**: String containing the name of the airline that services each leg of the trip, separated by `||`.
- **segmentsAirlineCode**: String containing the two-letter airline code that services each leg of the trip, separated by `||`.
- **segmentsEquipmentDescription**: String containing the type of airplane used for each leg of the trip (e.g., "Airbus A321" or "Boeing 737-800"), separated by `||`.
- **segmentsDurationInSeconds**: String containing the duration of the flight (in seconds) for each leg of the trip, separated by `||`.
- **segmentsDistance**: String containing the distance traveled (in miles) for each leg of the trip, separated by `||`.
- **segmentsCabinCode**: String containing the cabin class for each leg of the trip (e.g., "coach"), separated by `||`.
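
The `||`-separated segment columns listed above can be unpacked into one value per leg. The following is a minimal sketch, assuming that an empty field means no segment data is available; `splitSegments` is a hypothetical helper that is not part of this notebook, and the sample strings are made up for illustration.

```scala
// Hypothetical helper (not part of the notebook): split a `||`-separated
// segment field into one trimmed value per leg.
def splitSegments(field: String): Seq[String] =
  if (field.isEmpty) Seq.empty
  else field.split("\\|\\|", -1).toSeq.map(_.trim)

// Made-up sample values, for illustration only.
val airlineCodes: Seq[String] = splitSegments("AA||DL")
val legDurationsInSeconds: Seq[Int] = splitSegments("3600||5400").map(_.toInt)

// Total time in the air across all legs, in seconds.
val totalSecondsInAir: Int = legDurationsInSeconds.sum
```

The pipe characters are escaped because `String.split` takes a regular expression, and the `-1` limit keeps trailing empty entries so that the per-leg fields stay aligned with each other.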


In [3]:
case class Flight(
    legId: String,
    searchDate: LocalDate,
    flightDate: LocalDate,
    startingAirport: String,
    destinationAirport: String,
    fareBasisCode: String,
    travelDuration: Int,
    elapsedDays: Int,
    isBasicEconomy: Boolean,
    isRefundable: Boolean,
    isNonStop: Boolean,
    baseFare: Double,
    totalFare: Double,
    seatsRemaining: Int,
    totalTravelDistance: Double,
    segmentsDepartureTimeEpochSeconds: String,
    segmentsDepartureTimeRaw: String,
    segmentsArrivalTimeEpochSeconds: String,
    segmentsArrivalTimeRaw: String,
    segmentsArrivalAirportCode: String,
    segmentsDepartureAirportCode: String,
    segmentsAirlineName: String,
    segmentsAirlineCode: String,
    segmentsEquipmentDescription: String,
    segmentsDurationInSeconds: String,
    segmentsDistance: String,
    segmentsCabinCode: String
)

defined class Flight


In [4]:
def parseLine(line: String): Option[Flight] = {
    try {
        // Naive split: assumes no field contains an embedded comma; -1 keeps trailing empty fields
        val cols = line.split(",", -1).map(_.trim)
        val dateFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")

        // Extracts the hour and minute components from durations such as PT#H#M;
        // a component that is absent or does not match counts as 0
        def parseDurationToMinutes(duration: String): Int = {
            val hourPattern = "PT(\\d+)H".r
            val minutePattern = "PT(?:\\d+H)?(\\d+)M".r

            val hours = hourPattern.findFirstMatchIn(duration).map(_.group(1).toInt).getOrElse(0)
            val minutes = minutePattern.findFirstMatchIn(duration).map(_.group(1).toInt).getOrElse(0)

            hours * 60 + minutes
        }

        // Lenient converters: missing or malformed values fall back to a default instead of dropping the row
        def toIntSafe(s: String): Int = try { s.toInt } catch { case _: Exception => 0 }
        def toDoubleSafe(s: String): Double = try { s.toDouble } catch { case _: Exception => 0.0 }
        def toBooleanSafe(s: String): Boolean = try { s.toBoolean } catch { case _: Exception => false }

        val flight = Flight(
            legId = cols(0),
            searchDate = LocalDate.parse(cols(1), dateFormatter),
            flightDate = LocalDate.parse(cols(2), dateFormatter),
            startingAirport = cols(3),
            destinationAirport = cols(4),
            fareBasisCode = cols(5),
            travelDuration = parseDurationToMinutes(cols(6)),
            elapsedDays = toIntSafe(cols(7)),
            isBasicEconomy = toBooleanSafe(cols(8)),
            isRefundable = toBooleanSafe(cols(9)),
            isNonStop = toBooleanSafe(cols(10)),
            baseFare = toDoubleSafe(cols(11)),
            totalFare = toDoubleSafe(cols(12)),
            seatsRemaining = toIntSafe(cols(13)),
            totalTravelDistance = toDoubleSafe(cols(14)),
            segmentsDepartureTimeEpochSeconds = cols(15),
            segmentsDepartureTimeRaw = cols(16),
            segmentsArrivalTimeEpochSeconds = cols(17),
            segmentsArrivalTimeRaw = cols(18),
            segmentsArrivalAirportCode = cols(19),
            segmentsDepartureAirportCode = cols(20),
            segmentsAirlineName = cols(21),
            segmentsAirlineCode = cols(22),
            segmentsEquipmentDescription = cols(23),
            segmentsDurationInSeconds = cols(24),
            segmentsDistance = cols(25),
            segmentsCabinCode = cols(26)
        )

        Some(flight)
    } catch {
        case e: Exception =>
            println(s"[PARSE ERROR] Line: $line\nError: ${e.getMessage}")
            None
    }
}

parseLine: (line: String)Option[Flight]
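
The regex-based duration helper inside `parseLine` only recognises the `PT#H#M` family of strings. As a possible alternative, the sketch below relies on `java.time.Duration.parse`, which also accepts a day component. It assumes that `travelDuration` always holds a valid ISO 8601 duration; `durationToMinutes` is a hypothetical name and is not used elsewhere in this notebook.

```scala
import java.time.Duration
import scala.util.Try

// Hypothetical alternative to parseDurationToMinutes: returns None for
// malformed input instead of silently falling back to 0.
def durationToMinutes(raw: String): Option[Int] =
  Try(Duration.parse(raw.trim).toMinutes.toInt).toOption

// Made-up sample inputs, for illustration only.
val shortTrip = durationToMinutes("PT2H29M")
val overnightTrip = durationToMinutes("P1DT2H")
val malformed = durationToMinutes("")
```

Returning an `Option` makes the caller decide what a missing duration should mean, whereas a default of 0 would be indistinguishable from a real value in later averages.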


### Data parsing

In [5]:
val rddRaw = spark.sparkContext.textFile(path_to_sample).filter(line => !line.startsWith("legId"))
val rddParsed = rddRaw.flatMap(parseLine)

rddRaw: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:31
rddParsed: org.apache.spark.rdd.RDD[Flight] = MapPartitionsRDD[3] at flatMap at <console>:32


In [6]:
val totalRows = rddParsed.count()
println(s"Total rows: $totalRows")

Total rows: 2463214


totalRows: Long = 2463214


## Jobs

- **Average Flight Price per Route:** Calculate the average ticket price for each departure-destination pair, broken down by season.
> There are 16 airports in the dataset: ATL, DFW, DEN, ORD, LAX, CLT, MIA, JFK, EWR, SFO, DTW, BOS, PHL, LGA, IAD, OAK.
- **Direct vs. Connecting Flights Price Comparison:** Compare average prices between non-stop and connecting flights.
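
Both jobs come down to averaging `totalFare` per key. The sketch below illustrates the idea on plain Scala collections, without Spark, by accumulating a `(sum, count)` pair per route and dividing once at the end. The sample records and the names `sampleFares` and `averageFarePerRoute` are made up for illustration and do not come from the dataset.

```scala
// Made-up sample records: (startingAirport, destinationAirport, totalFare).
val sampleFares: Seq[(String, String, Double)] = Seq(
  ("ATL", "BOS", 100.0),
  ("ATL", "BOS", 200.0),
  ("LAX", "JFK", 300.0)
)

// Accumulate (fareSum, count) per route, then divide once at the end.
val averageFarePerRoute: Map[(String, String), Double] =
  sampleFares
    .groupBy { case (start, dest, _) => (start, dest) }
    .map { case (route, rows) =>
      val (fareSum, count) = rows.foldLeft((0.0, 0)) {
        case ((sum, cnt), (_, _, fare)) => (sum + fare, cnt + 1)
      }
      route -> (fareSum / count)
    }
```

Keeping the sum and the count separate until the last step matters because averages cannot be merged directly, while `(sum, count)` pairs can be added together in any order.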

### Non-optimized version

In [7]:
def nonOptimizedFlightPriceAnalysis(flights: RDD[Flight]): Unit = {

    // Prepare season lookup RDD: month -> season
    val monthSeason: RDD[(Int, String)] = flights.context.parallelize(Seq(
        (1, "Winter"), (2, "Winter"), (3, "Spring"), (4, "Spring"),
        (5, "Spring"), (6, "Summer"), (7, "Summer"), (8, "Summer"),
        (9, "Autumn"), (10, "Autumn"), (11, "Autumn"), (12, "Winter")
    ))

    // Creating and joining the month-season RDD
    val flightsByMonth: RDD[(Int, Flight)] = flights.map(f => (f.flightDate.getMonthValue, f))
    val flightsWithSeason: RDD[(Int, (Flight, String))] = flightsByMonth.join(monthSeason)
    val enrichedFlights: RDD[(Flight, String)] = flightsWithSeason.map {
        case (_, (f, season)) => (f, season)
    }

    // Metric 1: Average flight price per route per season
    // groupByKey shuffles every individual fare before the average is computed
    val routeSeasonPrices = enrichedFlights
        .map { case (f, season) => ((f.startingAirport, f.destinationAirport, season), f.totalFare) }
        .groupByKey()

    val averageRouteSeasonPrices = routeSeasonPrices
        .mapValues(fares => fares.sum / fares.size)

    println("Average Flight Price per Route per Season:")
    averageRouteSeasonPrices.collect().foreach {
        case ((start, dest, season), avgFare) =>
            println(f"$start -> $dest, $season : $$${avgFare}%.2f")
    }

    // Metric 2: Average price for Direct vs Connecting flights
    val directVsConnectingPrices = enrichedFlights
        .map { case (f, _) => (if (f.isNonStop) "Non-Stop" else "Connecting", f.totalFare) }
        .groupByKey()

    val averageDirectVsConnecting = directVsConnectingPrices
        .mapValues(fares => fares.sum / fares.size)

    println("\nAverage Price: Direct vs Connecting Flights:")
    averageDirectVsConnecting.collect().foreach {
        case (flightType, avgFare) =>
            println(f"$flightType : $$${avgFare}%.2f")
    }
}

nonOptimizedFlightPriceAnalysis: (flights: org.apache.spark.rdd.RDD[Flight])Unit


In [8]:
val startTimeNonOptimized = System.currentTimeMillis()

nonOptimizedFlightPriceAnalysis(rddParsed)

val endTimeNonOptimized = System.currentTimeMillis()
val durationNonOptimized = endTimeNonOptimized - startTimeNonOptimized
println(s"\n[TIME] Non-optimized job duration: ${durationNonOptimized} ms")

Average Flight Price per Route per Season:
IAD -> LAX, Summer : $475,42
DEN -> DFW, Autumn : $206,94
DEN -> PHL, Spring : $419,45
PHL -> BOS, Autumn : $171,64
OAK -> SFO, Spring : $412,08
SFO -> DFW, Autumn : $253,04
OAK -> SFO, Summer : $413,12
IAD -> ATL, Autumn : $180,75
BOS -> DEN, Summer : $395,63
CLT -> ATL, Autumn : $261,98
SFO -> CLT, Summer : $557,13
DEN -> ORD, Autumn : $250,27
BOS -> LGA, Spring : $152,39
PHL -> ATL, Summer : $254,42
DEN -> LGA, Spring : $285,82
BOS -> OAK, Summer : $666,90
ATL -> LGA, Autumn : $177,45
ATL -> LAX, Spring : $453,27
MIA -> ATL, Spring : $230,45
CLT -> BOS, Summer : $222,18
PHL -> DFW, Summer : $345,76
JFK -> LAX, Spring : $402,79
JFK -> IAD, Spring : $418,89
DFW -> OAK, Summer : $471,89
DEN -> MIA, Summer : $389,71
BOS -> MIA, Summer : $230,93
CLT -> DEN, Spring : $387,58
EWR -> BOS, Spring : $165,90
DEN -> ATL, Autumn : $239,46
BOS -> LAX, Autumn : $287,72
CLT -> EWR, Summer : $247,49
DFW -> DEN, Summer : $235,

startTimeNonOptimized: Long = 1751591095452
endTimeNonOptimized: Long = 1751591133782
durationNonOptimized: Long = 38330


### Optimized version


In [9]:
def optimizedFlightPriceAnalysis(flights: RDD[Flight]): Unit = {

    // Prepare season lookup RDD: month -> season
    val monthSeason = flights.context.parallelize(Seq(
        (1, "Winter"), (2, "Winter"), (3, "Spring"), (4, "Spring"),
        (5, "Spring"), (6, "Summer"), (7, "Summer"), (8, "Summer"),
        (9, "Autumn"), (10, "Autumn"), (11, "Autumn"), (12, "Winter")
    ))

    // Creating and joining the month-season RDD
    val flightsByMonth: RDD[(Int, Flight)] = flights.map(f => (f.flightDate.getMonthValue, f))
    val flightsWithSeason: RDD[(Int, (Flight, String))] = flightsByMonth.join(monthSeason)
    val enrichedFlights: RDD[(Flight, String)] = flightsWithSeason.map {
        case (_, (f, season)) => (f, season)
    }

    // Map to composite key ((start, dest, season), stopType) with (fare, count)
    val compositeKeyFares = enrichedFlights.map {
        case (f, season) =>
            val routeKey = (f.startingAirport, f.destinationAirport, season)
            val stopType = if (f.isNonStop) "Non-Stop" else "Connecting"
            ((routeKey, stopType), (f.totalFare, 1))
    }

    // Aggregate sum and count by composite key (shuffle #1)
    val aggregated = compositeKeyFares.reduceByKey {
        case ((fareSum1, count1), (fareSum2, count2)) =>
            (fareSum1 + fareSum2, count1 + count2)
    }

    // Extract average price per route per season
    val routeSeasonAggregates = aggregated
        .map { case ((routeKey, _), (fareSum, count)) => (routeKey, (fareSum, count)) }
        .reduceByKey {
            case ((sum1, cnt1), (sum2, cnt2)) => (sum1 + sum2, cnt1 + cnt2)
        }
        .mapValues { case (sum, cnt) => sum / cnt }

    println("Average Flight Price per Route per Season (Optimized):")
    routeSeasonAggregates.collect().foreach {
        case ((start, dest, season), avgFare) =>
            println(f"$start -> $dest, $season : $$${avgFare}%.2f")
    }

    // Extract average price Direct vs Connecting
    val directConnectingAggregates = aggregated
        .map { case ((_, stopType), (fareSum, count)) => (stopType, (fareSum, count)) }
        .reduceByKey {
            case ((sum1, cnt1), (sum2, cnt2)) => (sum1 + sum2, cnt1 + cnt2)
        }
        .mapValues { case (sum, cnt) => sum / cnt }

    println("\nAverage Price: Direct vs Connecting Flights (Optimized):")
    directConnectingAggregates.collect().foreach {
        case (stopType, avgFare) =>
            println(f"$stopType : $$${avgFare}%.2f")
    }
}


optimizedFlightPriceAnalysis: (flights: org.apache.spark.rdd.RDD[Flight])Unit


In [10]:
val startTimeOptimized = System.currentTimeMillis()

optimizedFlightPriceAnalysis(rddParsed)

val endTimeOptimized = System.currentTimeMillis()
val durationOptimized = endTimeOptimized - startTimeOptimized
println(s"\n[TIME] Optimized job duration: ${durationOptimized} ms")

Average Flight Price per Route per Season (Optimized):
IAD -> LAX, Summer : $475,42
DEN -> DFW, Autumn : $206,94
DEN -> PHL, Spring : $419,45
PHL -> BOS, Autumn : $171,64
OAK -> SFO, Spring : $412,08
SFO -> DFW, Autumn : $253,04
OAK -> SFO, Summer : $413,12
IAD -> ATL, Autumn : $180,75
BOS -> DEN, Summer : $395,63
CLT -> ATL, Autumn : $261,98
SFO -> CLT, Summer : $557,13
DEN -> ORD, Autumn : $250,27
BOS -> LGA, Spring : $152,39
PHL -> ATL, Summer : $254,42
DEN -> LGA, Spring : $285,82
BOS -> OAK, Summer : $666,90
ATL -> LGA, Autumn : $177,45
ATL -> LAX, Spring : $453,27
MIA -> ATL, Spring : $230,45
CLT -> BOS, Summer : $222,18
PHL -> DFW, Summer : $345,76
JFK -> LAX, Spring : $402,79
JFK -> IAD, Spring : $418,89
DFW -> OAK, Summer : $471,89
BOS -> MIA, Summer : $230,93
DEN -> MIA, Summer : $389,71
CLT -> DEN, Spring : $387,58
EWR -> BOS, Spring : $165,90
DEN -> ATL, Autumn : $239,46
BOS -> LAX, Autumn : $287,72
CLT -> EWR, Summer : $247,49
DFW -> DEN, Su

startTimeOptimized: Long = 1751591206577
endTimeOptimized: Long = 1751591229957
durationOptimized: Long = 23380
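
The two durations reported above can be compared directly. The sketch below hard-codes the values from the cell outputs (38330 ms and 23380 ms) and derives the speedup factor and the relative reduction. Each figure comes from a single run, and `rddParsed` is not cached, so both timings include re-reading and re-parsing the input file; the comparison should be treated as indicative only. The variable names are chosen for this example.

```scala
// Durations copied from the cell outputs above, in milliseconds.
val nonOptimizedMs = 38330L
val optimizedMs = 23380L

// Speedup factor: how many times faster the optimized job ran.
val speedup: Double = nonOptimizedMs.toDouble / optimizedMs

// Relative reduction in wall-clock time, as a percentage.
val reductionPct: Double = 100.0 * (nonOptimizedMs - optimizedMs) / nonOptimizedMs

println(f"Speedup: $speedup%.2fx, time reduction: $reductionPct%.1f%%")
```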
