With the given data set, solve the following tasks using PySpark.  

Task 1: Create a function that gives back how many flights arrived earlier than expected. 

Task 2: Create a function that determines the typical departure time for flights over 2000 miles. 

Task 3: Create a function that gives back the proportion of flights that have arrival delays longer than 60 minutes. 

Task 4: Create a function that gives the average airtime for flights that left earlier than 9:00 am. 

Task 5: Create a function that determines the maximum arrival delay for flights that did not experience a delay upon departure. 

In [2]:
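# Import PySpark libraries.
# Note: the wildcard import shadows Python built-ins such as max, min, sum, and round
# with their Spark column-function counterparts.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *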

In [3]:
# Initialize SparkSession
spark = SparkSession.builder.appName("Domestic Flight Delay Records").getOrCreate()

In [4]:
# Load Data
df = spark.read.csv("Flight Dataset - CSV(in).csv", header=True, inferSchema=True)
df.show(5)

+--------+---------+---------+--------+--------+---------+---------+
| FL_DATE|DEP_DELAY|ARR_DELAY|AIR_TIME|DISTANCE| DEP_TIME| ARR_TIME|
+--------+---------+---------+--------+--------+---------+---------+
|1/1/2006|        5|       19|     350|    2475| 9.083333|12.483334|
|1/2/2006|      167|      216|     343|    2475|11.783334|15.766666|
|1/3/2006|       -7|       -2|     344|    2475| 8.883333|12.133333|
|1/4/2006|       -5|      -13|     331|    2475| 8.916667|    11.95|
|1/5/2006|       -3|      -17|     321|    2475|     8.95|11.883333|
+--------+---------+---------+--------+--------+---------+---------+
only showing top 5 rows
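
The `DEP_TIME` and `ARR_TIME` values in the preview look like clock times stored as decimal hours; this is an assumption, but it is consistent with Task 4, which filters on `DEP_TIME < 9.0` for "before 9:00 am". Under that assumption, a small helper (the name `decimal_hours_to_clock` is hypothetical and not part of the notebook) could turn such a value into a readable `HH:MM` string:

```python
def decimal_hours_to_clock(hours):
    """Format a time given in decimal hours (e.g. 9.083333) as 'HH:MM'.

    Assumes the value is an hour-of-day in the range [0, 24).
    """
    # Round to the nearest whole minute; wrap 24:00 back to 00:00
    total_minutes = round(hours * 60) % (24 * 60)
    return f"{total_minutes // 60:02d}:{total_minutes % 60:02d}"


# Departure times taken from the first and fourth preview rows
print(decimal_hours_to_clock(9.083333))
print(decimal_hours_to_clock(8.916667))
```

The same helper can be applied to a scalar pulled out of a Spark result, for example the value returned by `.first()[0]` on a one-row aggregate.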


In [5]:
df.printSchema()

root
 |-- FL_DATE: string (nullable = true)
 |-- DEP_DELAY: integer (nullable = true)
 |-- ARR_DELAY: integer (nullable = true)
 |-- AIR_TIME: integer (nullable = true)
 |-- DISTANCE: integer (nullable = true)
 |-- DEP_TIME: double (nullable = true)
 |-- ARR_TIME: double (nullable = true)



#### Task 1: Create a function that gives back how many flights arrived earlier than expected.

In [7]:
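def early_arrival_flights(df):
    """Return the number of flights with a negative arrival delay (arrived early)."""

    # A negative ARR_DELAY means the flight landed ahead of schedule
    early_arrival_df = df.filter(col("ARR_DELAY") < 0)
    count_early_flights = early_arrival_df.count()

    return count_early_flights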

In [8]:
result_task1 = early_arrival_flights(df)
print("Number of flights that arrived earlier than expected:", result_task1)

Number of flights that arrived earlier than expected: 534655


#### Task 2: Create a function that determines the typical departure time for flights over 2000 miles.

In [10]:
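def typical_departure_time_for_long_flights(df):
    """Return a one-row DataFrame with the mean DEP_TIME (rounded to a whole hour) for flights over 2000 miles."""

    # Flights over 2000 miles
    flights_over_2000miles = df.filter(col("DISTANCE") > 2000)

    # "Typical" is taken here as the average departure time, rounded to 0 decimal places
    avg_dep_time = flights_over_2000miles.agg(bround(avg(col("DEP_TIME")), 0).alias("TYPICAL_DEPARTURE_TIME"))

    return avg_dep_time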

In [11]:
result_task2 = typical_departure_time_for_long_flights(df)
result_task2.show()

+----------------------+
|TYPICAL_DEPARTURE_TIME|
+----------------------+
|                  14.0|
+----------------------+



#### Task 3: Create a function that gives back the proportion of flights that have arrival delays longer than 60 minutes.

In [13]:
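def proportion_flights_long_arrival_delays(df):
    """Return the fraction of all flights whose ARR_DELAY exceeds 60 minutes."""

    total_flights = df.count()
    # Guard against division by zero on an empty DataFrame
    if total_flights == 0:
        return 0.0
    arrival_delayed_flights = df.filter(col("ARR_DELAY") > 60).count()

    proportion = arrival_delayed_flights / total_flights
    return proportion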

In [14]:
result_task3 = proportion_flights_long_arrival_delays(df)

# Display result
print("Proportion of flights that have arrival delays longer than 60 minutes:", result_task3)

Proportion of flights that have arrival delays longer than 60 minutes: 0.053066


#### Task 4: Create a function that gives the average airtime for flights that left earlier than 9:00 am.

In [16]:
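def average_airtime_for_flights_before_9AM(df):
    """Return a one-row DataFrame with the mean AIR_TIME (2 decimal places) for flights departing before 9:00 am."""

    # DEP_TIME is in decimal hours, so 9.0 corresponds to 9:00 am
    flights_before_9AM = df.filter(col("DEP_TIME") < 9.0)

    average_airtime = flights_before_9AM.agg(bround(avg(col("AIR_TIME")), 2).alias("average_airtime_before_9AM"))

    return average_airtime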

In [17]:
result_task4 = average_airtime_for_flights_before_9AM(df)
result_task4.show()

+--------------------------+
|average_airtime_before_9AM|
+--------------------------+
|                    111.36|
+--------------------------+



#### Task 5: Create a function that determines the maximum arrival delay for flights that did not experience a delay upon departure.

In [19]:
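def max_arrival_delay_for_flights_with_no_dep_delay(df):
    """Return a one-row DataFrame with the largest ARR_DELAY among flights with DEP_DELAY <= 0."""

    # Flights that left on time or early
    flights_with_no_dep_delay = df.filter(col("DEP_DELAY") <= 0)

    # ARR_DELAY is an integer column, so no rounding is needed.
    # max here is pyspark.sql.functions.max (via the wildcard import), not the Python built-in.
    max_arrival_delay = flights_with_no_dep_delay.agg(max(col("ARR_DELAY")).alias("Max_Arrival_Delay_For_On_Time_Flights"))

    return max_arrival_delay

In [20]:
result_task5 = max_arrival_delay_for_flights_with_no_dep_delay(df)
result_task5.show()

+-------------------------------------+
|Max_Arrival_Delay_For_On_Time_Flights|
+-------------------------------------+
|                                  701|
+-------------------------------------+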

