# Exercise 5: Spark APIs [100 points]

## 1. Accumulators [10 points]
The title of this Q&A is wrong. It’s really about global variables (aka accumulators). The question shows code that is incorrect:
val data = Array(1,2,3,4,5)
var counter = 0
var rdd = sc.parallelize(data)

// Wrong: Don't do this!!
rdd.foreach(x => counter += x)

println("Counter value: " + counter)
Write a corrected version of the code and demonstrate its intended operation.



In [None]:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.LongAccumulator

object SparkAccumulatorExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("AccumulatorExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val data = Array(1, 2, 3, 4, 5)
    val rdd = sc.parallelize(data)

    // Correct approach using Accumulator
    val counter: LongAccumulator = sc.longAccumulator("Counter Accumulator")

    rdd.foreach(x => counter.add(x))

    println("Counter value: " + counter.value)
    
    sc.stop()
  }
}

Output: Counter value: 15
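
The reason the original version fails on a cluster is that Spark serializes the closure and each executor updates its own copy of `counter`, so the driver's variable is never touched. The following is a minimal pure-Python sketch of that behavior, not Spark code: the `SumTask` class and the two-partition split are hypothetical stand-ins, and `copy.deepcopy` only imitates shipping a closure to an executor.

```python
import copy


class SumTask:
    """Hypothetical stand-in for a closure that captures a driver-side counter."""

    def __init__(self):
        self.counter = 0

    def run(self, partition):
        # Mutates only the copy of the task held by the "executor"
        for x in partition:
            self.counter += x
        return self.counter


data = [1, 2, 3, 4, 5]
partitions = [data[:2], data[2:]]  # pretend the RDD was split into two partitions

driver_task = SumTask()

partial_sums = []
for partition in partitions:
    # deepcopy imitates Spark serializing the closure and sending it to an executor
    executor_task = copy.deepcopy(driver_task)
    partial_sums.append(executor_task.run(partition))

# The driver's own counter was never updated by the executor copies
print("Driver-side counter:", driver_task.counter)

# Merging per-task results back on the driver is the job an accumulator performs
total = sum(partial_sums)
print("Merged total:", total)
```

The accumulator in the corrected Scala code plays the role of the final merge step: each task contributes its updates and the driver reads the combined value.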

## 2. Airline Traffic [45 points]

#### 1. [15 points] Describe in words and in code (where applicable) the steps you took to set up the environment for gathering the statistical data in the below questions.

#### 2. [6 points] Which US Airline Has the Least Delays? Report by full names, (e.g., Delta Airlines, not DL)

In [23]:
# Step 1: Import libraries
import pandas as pd
import re


# Step 2: Define documented column names
documented_columns = [
    'Carrier', 'FlightNumber',
    'Undocumented_1', 'Undocumented_2',  # Placeholders for the two undocumented columns
    'OperatingCarrier', 'OperatingFlightNumber',
    'DepartureAirport', 'ArrivalAirport', 'FlightDate', 'DayOfWeek',
    'ScheduledDepartureTime_OAG', 'ScheduledDepartureTime_CRS',
    'ActualDepartureTime', 'ScheduledArrivalTime_OAG', 'ScheduledArrivalTime_CRS',
    'ActualArrivalTime', 'Diff_ScheduledDepartureTimes', 'Diff_ScheduledArrivalTimes',
    'ScheduledElapsedMinutes', 'DepartureDelayMinutes', 'ArrivalDelayMinutes',
    'Diff_ElapsedMinutes', 'WheelsOffTime', 'WheelsOnTime', 'AircraftTailNumber',
    'CancellationCode', 'MinutesLate_DelayCodeE', 'MinutesLate_DelayCodeF',
    'MinutesLate_DelayCodeG', 'MinutesLate_DelayCodeH', 'MinutesLate_DelayCodeI'
]

# Step 3: Determine the total number of columns in the data
with open('ontime.td.202406.asc', 'r') as f:
    first_line = f.readline()
    total_columns = len(first_line.split('|'))

# Step 4: Generate column names, adding placeholders for any trailing extra columns
if total_columns > len(documented_columns):
    extra_columns = total_columns - len(documented_columns)
    column_names = documented_columns + [f'ExtraColumn_{i}' for i in range(1, extra_columns + 1)]
else:
    column_names = documented_columns[:total_columns]

# Step 5: Read every column as a string; numeric columns are converted later as needed
dtype_spec = {col: str for col in column_names}

# Step 6: Load the datasets for June 2024 and July 2024
june_data = pd.read_csv('ontime.td.202406.asc', delimiter='|', header=None, names=column_names, dtype=dtype_spec, low_memory=False)
july_data = pd.read_csv('ontime.td.202407.asc', delimiter='|', header=None, names=column_names, dtype=dtype_spec, low_memory=False)

# Combine both months into a single DataFrame
combined_data = pd.concat([june_data, july_data], ignore_index=True)

# Step 7: Drop the 2 undocumented placeholder columns and any trailing extra columns
columns_to_keep = [col for col in documented_columns if not col.startswith('Undocumented_')]
combined_data = combined_data[columns_to_keep]

# Step 8: Clean Carrier column
combined_data['Carrier'] = combined_data['Carrier'].str.upper().str.strip()

# Step 9: Filter valid Carrier codes (two uppercase letters)
# Note: this pattern also excludes codes that contain a digit, such as B6, F9, and G4
valid_carrier_pattern = re.compile(r'^[A-Z]{2}$')
carrier_filter = combined_data['Carrier'].str.match(valid_carrier_pattern, na=False)
print(f"\nRows with valid Carrier codes: {carrier_filter.sum()}")
# Copy the filtered frame so later column assignments do not raise SettingWithCopyWarning
combined_data = combined_data[carrier_filter].copy()

Rows with valid Carrier codes: 1225398


The code above shows how I parsed and cleaned the raw data to answer the questions below.
The files have no header row, so I assigned labels to the columns myself. I accounted for only 2 extra
(undocumented) columns between B and C, not the 4 mentioned in the instructions, because using 4 shifted the
Arrival and Departure Airport columns out of place for the following questions.
I read every column as a string and converted to int only where necessary. I used this method because
I was getting many data type errors that I could not otherwise resolve. These were likely caused by the
column labels being offset by two at first.
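
One way to sanity-check a column offset like the one described above is to look at where three-letter airport codes actually fall in a raw line. The sketch below is only an illustration: the sample line is made up (not taken from the data files) and `airport_like_positions` is a hypothetical helper.

```python
import re
from typing import List

# Three uppercase letters, the shape of an airport code such as ATL or EWR
AIRPORT_CODE = re.compile(r'^[A-Z]{3}$')


def airport_like_positions(line: str, delimiter: str = '|') -> List[int]:
    """Return the zero-based positions of fields that look like airport codes."""
    fields = [field.strip() for field in line.split(delimiter)]
    return [i for i, field in enumerate(fields) if AIRPORT_CODE.match(field)]


# Made-up line: carrier, flight number, two undocumented fields, operating
# carrier, operating flight number, departure airport, arrival airport, ...
sample_line = 'DL|1234|X|Y|DL|1234|ATL|EWR|20240601|6'

positions = airport_like_positions(sample_line)
print(positions)
```

If the positions reported for real lines match the indexes of `DepartureAirport` and `ArrivalAirport` in `documented_columns` (6 and 7 with two placeholders), the labels are aligned.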

In [53]:
# Step 10: Filter valid DepartureDelayMinutes
# Convert the column to numeric, coercing invalid values to NaN
combined_data['DepartureDelayMinutes'] = pd.to_numeric(combined_data['DepartureDelayMinutes'], errors='coerce')

# Drop rows where the conversion resulted in NaN
combined_data = combined_data.dropna(subset=['DepartureDelayMinutes'])

# Convert from float to int (safe now that NaN values are removed)
combined_data['DepartureDelayMinutes'] = combined_data['DepartureDelayMinutes'].astype(int)

# Step 11: Select the rows used for the analysis (no NaN delays remain at this point)
relevant_data = combined_data.dropna(subset=['DepartureDelayMinutes'])

# Step 12: Analyze the data
if relevant_data.empty:
    print("\nThe cleaned dataset is empty. No valid rows found.")
else:
    # Group by Carrier and calculate the mean delay (including all flights)
    average_delays = relevant_data.groupby('Carrier')['DepartureDelayMinutes'].mean()

    # Sort by average delay, smallest first
    sorted_delays = average_delays.sort_values()

    airline_mapping = {
        'DL': 'Delta Airlines',
        'AA': 'American Airlines',
        'UA': 'United Airlines',
        'WN': 'Southwest Airlines',
        'AS': 'Alaska Airlines',
        'B6': 'JetBlue Airways',
        'NK': 'Spirit Airlines',
        'F9': 'Frontier Airlines',
        'HA': 'Hawaiian Airlines',
        'G4': 'Allegiant Air',
        'YX': 'Republic Airways',
        'OO': 'SkyWest Airlines',
        'MQ': 'Envoy Air',
        'OH': 'PSA Airlines',
        'YV': 'Mesa Airlines',
        'QX': 'Horizon Air',
        'EV': 'ExpressJet Airlines'
    }

    # Convert airline codes to full names, keeping the raw code when no name is known
    sorted_delays.index = sorted_delays.index.map(lambda code: airline_mapping.get(code, code))

    # Report the airline with the least delays
    least_delay_airline = sorted_delays.idxmin()
    least_delay_value = sorted_delays.min()
    print(sorted_delays.head(5))

    print(f"\nThe airline with the least delays is {least_delay_airline} with an average delay of {least_delay_value:.2f} minutes.")



Carrier
Southwest Airlines    125.858359
American Airlines     129.479622
Delta Airlines        131.965433
Hawaiian Airlines     140.665946
United Airlines       141.952446
Name: DepartureDelayMinutes, dtype: float64

The airline with the least delays is Southwest Airlines with an average delay of 125.86 minutes.


The code above calculates the average departure delay for each airline.
I converted the departure delays to int so that the mean could be computed.
The 5 airlines with the lowest average delays are shown above. Southwest Airlines
had the lowest average delay, at 125.86 minutes.

#### 3. [6 points] What Departure Time of Day Is Best to Avoid Flight Delays, segmented into 5 time blocks [night (10 pm - 6 am), morning (6 am to 10 am), mid-day (10 am to 2 pm), afternoon (2 pm - 6 pm), evening (6 pm - 10 pm)]

In [52]:
def categorize_time(hour):
    if 22 <= hour or hour < 6:
        return 'Night'
    elif 6 <= hour < 10:
        return 'Morning'
    elif 10 <= hour < 14:
        return 'Mid-Day'
    elif 14 <= hour < 18:
        return 'Afternoon'
    else:
        return 'Evening'

# Find the best time of day to avoid delays
# The hour is taken from the first two characters of ActualDepartureTime
combined_data['DepartureHour'] = combined_data['ActualDepartureTime'].str[:2]
combined_data['DepartureHour'] = pd.to_numeric(combined_data['DepartureHour'], errors='coerce')
combined_data['TimeBlock'] = combined_data['DepartureHour'].apply(categorize_time)
# Average departure delay per time block, smallest first
timeblock_delays = combined_data.groupby('TimeBlock')['DepartureDelayMinutes'].mean().sort_values()
print(timeblock_delays)

TimeBlock
Night        132.867186
Mid-Day      135.265775
Afternoon    135.455740
Evening      137.914275
Morning      175.985673
Name: DepartureDelayMinutes, dtype: float64


The code above calculates the average departure delay for each time block. Night (10 pm - 6 am) was the best
departure time for avoiding delays, with the lowest average delay (132.87 minutes).
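
One detail worth checking in `categorize_time`: a missing hour (NaN) fails every comparison and falls through to the final `else`, so it is counted as 'Evening'. The sketch below shows one possible way to keep invalid times out of the blocks. It assumes departure times are zero-padded `HHMM` strings, and `time_block` is a hypothetical replacement, not the function used for the results above.

```python
from typing import Optional


def time_block(hhmm: Optional[str]) -> Optional[str]:
    """Return the time block for a zero-padded 'HHMM' string, or None if invalid."""
    if hhmm is None:
        return None
    text = str(hhmm).strip()
    # Reject anything that is not exactly four ASCII digits (including 'nan')
    if len(text) != 4 or not (text.isascii() and text.isdigit()):
        return None
    hour, minute = int(text[:2]), int(text[2:])
    if hour > 23 or minute > 59:
        return None
    if hour >= 22 or hour < 6:
        return 'Night'
    if hour < 10:
        return 'Morning'
    if hour < 14:
        return 'Mid-Day'
    if hour < 18:
        return 'Afternoon'
    return 'Evening'


for value in ['0530', '0930', '1200', '1545', '2100', '2330', '', 'N/A', None]:
    print(repr(value), '->', time_block(value))
```

Applied with something like `combined_data['ActualDepartureTime'].map(time_block)`, rows that return `None` would be left out of a later `groupby('TimeBlock')`, since pandas drops missing group keys by default.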

#### 4. [5 points] Which Airports Have The Most Flight Delays? Report by full name, (e.g., “Newark Liberty International,” not “EWR,” when the airport code EWR is provided).

In [48]:
# Map airport codes to full names
airport_mapping = {
    'ATL': 'Atlanta - Hartsfield Jackson',
    'BWI': "Baltimore/Wash. Int'l Thurgood Marshall",
    'BOS': 'Boston - Logan International',
    'CLT': 'Charlotte - Douglas',
    'MDW': 'Chicago - Midway',
    'ORD': "Chicago - O'Hare",
    'CVG': 'Cincinnati Greater Cincinnati',
    'DFW': 'Dallas-Fort Worth International',
    'DEN': 'Denver - International',
    'DTW': 'Detroit - Metro Wayne County',
    'FLL': 'Fort Lauderdale Hollywood International',
    'IAH': 'Houston - George Bush International',
    'LAS': 'Las Vegas - McCarran International',
    'LAX': 'Los Angeles International',
    'MIA': 'Miami International',
    'MSP': 'Minneapolis-St. Paul International',
    'EWR': 'Newark Liberty International',
    'JFK': 'New York - JFK International',
    'LGA': 'New York - LaGuardia',
    'MCO': 'Orlando International',
    'OAK': 'Oakland International',
    'PHL': 'Philadelphia International',
    'PHX': 'Phoenix - Sky Harbor International',
    'PDX': 'Portland International',
    'SLC': 'Salt Lake City International',
    'STL': 'St. Louis Lambert International',
    'SAN': 'San Diego Intl. Lindbergh Field',
    'SFO': 'San Francisco International',
    'SEA': 'Seattle-Tacoma International',
    'TPA': 'Tampa International',
    'DCA': 'Washington - Reagan National',
    'IAD': 'Washington - Dulles International',
    'PPG': 'Pago Pago International',
    'GUM': 'Guam International',
    'HNL': 'Honolulu International',
    'OGG': 'Kahului Airport',
    'KOA': 'Kona International',
    'LIH': 'Lihue Airport',
    'ITO': 'Hilo International',
    'BQN': 'Aeropuerto Internacional Rafael Hernández',
    'SJU': 'San Juan - Luis Muñoz Marín International',
    'STT': 'Cyril E. King Airport'
}

# Find the airports with the most delays
combined_data['ArrivalDelayMinutes'] = pd.to_numeric(combined_data['ArrivalDelayMinutes'], errors='coerce')
combined_data['DepartureDelayMinutes'] = pd.to_numeric(combined_data['DepartureDelayMinutes'], errors='coerce')

# Mean arrival delay per destination airport and mean departure delay per origin airport
arrival_delay = combined_data.groupby('ArrivalAirport')['ArrivalDelayMinutes'].mean()
departure_delay = combined_data.groupby('DepartureAirport')['DepartureDelayMinutes'].mean()
# Series addition aligns on the airport code; an airport missing from either side becomes NaN
total_delay = arrival_delay + departure_delay

# Replace airport codes with full names and rank from most to least delayed
total_delay.index = total_delay.index.to_series().replace(airport_mapping)
total_delay = total_delay.sort_values(ascending=False)
print(total_delay.head(5))





Pago Pago International                      334.461538
Aeropuerto Internacional Rafael Hernández    254.245902
San Juan - Luis Muñoz Marín International    247.113487
Guam International                           242.926230
Cyril E. King Airport                        242.833724
dtype: float64


The code above ranks airports by the sum of their average arrival delay and average departure delay. Pago Pago International in American Samoa was the airport with the most delays.
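
A ranking by mean delay can be sensitive to airports with very few flights, because a single long delay dominates their average. As an optional refinement, and not part of the analysis above, the sketch below computes per-airport means with a minimum-flight threshold. The rows are synthetic, and `mean_delay_by_airport` and `min_flights` are hypothetical names.

```python
from collections import defaultdict
from statistics import mean


def mean_delay_by_airport(rows, min_flights=1):
    """Average delay per airport, skipping airports with fewer than min_flights rows."""
    delays = defaultdict(list)
    for airport, delay in rows:
        delays[airport].append(delay)
    return {
        airport: mean(values)
        for airport, values in delays.items()
        if len(values) >= min_flights
    }


# Synthetic (airport, delay in minutes) rows, not taken from the dataset
rows = [('AAA', 300), ('BBB', 10), ('BBB', 20), ('BBB', 30)]

all_airports = mean_delay_by_airport(rows)
filtered = mean_delay_by_airport(rows, min_flights=2)
print(all_airports)
print(filtered)
```

With the threshold set to 2, the single-flight airport drops out of the ranking while the others are unchanged.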

#### 5. [5 points] What Are the Top 5 Busiest Airports in the US? Report by full name, (e.g., “Newark Liberty International,” not “EWR”).

In [46]:
# Count total arrivals and departures
arrivals_count = combined_data['ArrivalAirport'].value_counts()
departures_count = combined_data['DepartureAirport'].value_counts()

# Sum arrivals and departures for each airport
total_flights = arrivals_count.add(departures_count, fill_value=0)

# Map airport codes to full names
total_flights.index = total_flights.index.to_series().replace(airport_mapping)

# Sort in descending order and get the top 5
top_5_busiest_airports = total_flights.sort_values(ascending=False).head(5)

# Display result
print("Top 5 Busiest Airports in the US (by total arrivals and departures):")
print(top_5_busiest_airports)


Top 5 Busiest Airports in the US (by total arrivals and departures):
Atlanta - Hartsfield Jackson       114182
Dallas-Fort Worth International    110252
Chicago - O'Hare                   109931
Denver - International             101826
Charlotte - Douglas                 86198
Name: count, dtype: int64


The above output shows the busiest airports in the US by total arrivals and departures. Atlanta was the busiest overall.

## 3. ShortStoryJam [45 pts]

#### 1. [3 points] To seed the effort, the text of about 22 short stories by Edgar Allan Poe, he of “quoth the raven” fame, is available in my GitHub repository. Clean the text and remove stopwords, as specified in a previous assignment.

#### 2. [8 points] Use NLTK to decompose the first story (A_DESCENT_INTO…) into sentences & sentences into tokens. Here is the code for doing that, after you set the variable paragraph to hold the text of the story.

#### 3. [11 points] Tag all remaining words in the story as parts of speech using the Penn POS Tags. This SO answer shows how to obtain the POS tag values. Create and print a dictionary with the Penn POS Tags as keys and a list of words as the values.
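
As a rough sketch of the dictionary-building step, the code below groups words under their tags. It assumes the story has already been tagged into `(word, tag)` pairs, which is the shape `nltk.pos_tag` returns; the sample pairs are hypothetical and `group_words_by_tag` is an illustrative helper name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_words_by_tag(tagged: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Build a dictionary with POS tags as keys and lists of words as values."""
    groups = defaultdict(list)
    for word, tag in tagged:
        groups[tag].append(word)
    return dict(groups)


# Hypothetical (word, tag) pairs standing in for the tagger's output
tagged_sample = [
    ('reached', 'VBD'),
    ('summit', 'NN'),
    ('loftiest', 'JJS'),
    ('crag', 'NN'),
]

tags_to_words = group_words_by_tag(tagged_sample)
print(tags_to_words)
```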

#### 4. [11 points] In this framework, each row will represent a story. The columns will be as follows:
- The text of the story
- Two-letter prefixes of each tag (for example NN, VB, RB, JJ, etc.) and the words belonging to that tag in the story

Show your code and the tag columns, at least for the one story.


#### 5. [12 points] The conjecture of many linguists is that the number of different parts of speech per thousand words (nouns, verbs, adjectives, adverbs, …) is pretty much the same for all stories in a given language. In this case, with all stories in English, and all from the same author, we expect it to be true. Is the conjecture consistent with your findings?
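
One possible way to test the conjecture is to compute, for each story, how often each two-letter tag prefix occurs per thousand tagged words and then compare those rates across stories. The sketch below reads the conjecture that way, which is an interpretation. It again assumes `(word, tag)` pairs as input; the sample pairs are hypothetical and `tag_rates_per_thousand` is an illustrative helper name.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def tag_rates_per_thousand(tagged: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Occurrences of each two-letter tag prefix per 1,000 tagged words."""
    prefix_counts = Counter(tag[:2] for _, tag in tagged)
    total = sum(prefix_counts.values())
    if total == 0:
        return {}
    return {prefix: 1000 * count / total for prefix, count in prefix_counts.items()}


# Hypothetical (word, tag) pairs; NN and NNS both fall under the NN prefix
tagged_sample = [
    ('summit', 'NN'),
    ('crags', 'NNS'),
    ('reached', 'VBD'),
    ('loftiest', 'JJS'),
]

rates = tag_rates_per_thousand(tagged_sample)
print(rates)
```

Running this per story and placing the resulting dictionaries side by side would show whether the rates for NN, VB, JJ, RB, and so on stay roughly constant.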