# Group Assignment: C

In [1]:
import findspark
findspark.init()

In [2]:
import pyspark
findspark.find()

'/opt/spark-2.4.4-bin-hadoop2.7'

In [3]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().setAppName('appName').setMaster('local[4]')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

In [4]:
from IPython.display import display, Markdown
from pyspark.sql.functions import when, count, col, countDistinct, desc, first, lit, max, min, avg, stddev, mean
import pyspark.sql.functions as f
from pyspark.sql.types import IntegerType

## Introduction to the Flights dataset

According to a 2010 report by the US Federal Aviation Administration, domestic flight delays cost passengers, airlines and other parts of the economy 32.9 billion dollars a year. More than half of that amount comes from passengers' pockets: they not only waste time waiting for their planes to leave, but also miss connecting flights, spend money on food and have to sleep in hotel rooms while they are stranded.

The report, which focuses on data from 2007, estimated that air transportation delays put a 4 billion dollar dent in the country's gross domestic product that year. The full report can be found
<a href="http://www.isr.umd.edu/NEXTOR/pubs/TDI_Report_Final_10_18_10_V3.pdf">here</a>.

### But, are the causes of Delay related to the Morphology of the Arrival Airports?

In order to answer this question, we are going to analyze the provided dataset, which contains up to 1,936,758 domestic US flights for 2008 and their causes of delay, diversion and cancellation, if any.

##### Disclaimer:
*In this analysis, due to performance and capacity issues, we will use a sample of the original dataset containing 100,000 observations. For the purpose of this project, we will assume that this sample is the whole population of flights that occurred in 2008 (e.g. 100,000 will be the total number of flights for that year, rather than 1,936,758).*


The data comes from the U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS).

This dataset is composed of the following variables:
1. **Year** 2008
2. **Month** 1
3. **DayofMonth** 1-31
4. **DayOfWeek** 1 (Monday) - 7 (Sunday)
5. **DepTime** actual departure time (local, hhmm)
6. **CRSDepTime** scheduled departure time (local, hhmm)
7. **ArrTime** actual arrival time (local, hhmm)
8. **CRSArrTime** scheduled arrival time (local, hhmm)
9. **UniqueCarrier** unique carrier code
10. **FlightNum** flight number
11. **TailNum** plane tail number: aircraft registration, unique aircraft identifier
12. **ActualElapsedTime** in minutes
13. **CRSElapsedTime** in minutes
14. **AirTime** in minutes
15. **ArrDelay** arrival delay, in minutes: A flight is counted as "on time" if it operated less than 15 minutes later than the scheduled time shown in the carriers' Computerized Reservations Systems (CRS).
16. **DepDelay** departure delay, in minutes
17. **Origin** origin IATA airport code
18. **Dest** destination IATA airport code
19. **Distance** in miles
20. **TaxiIn** taxi in time, in minutes
21. **TaxiOut** taxi out time in minutes
22. **Cancelled** whether the flight was cancelled
23. **CancellationCode** reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
24. **Diverted** 1 = yes, 0 = no
25. **CarrierDelay** in minutes: Carrier delay is within the control of the air carrier. Examples of occurrences that may determine carrier delay are: aircraft cleaning, aircraft damage, awaiting the arrival of connecting passengers or crew, baggage, bird strike, cargo loading, catering, computer, outage-carrier equipment, crew legality (pilot or attendant rest), damage by hazardous goods, engineering inspection, fueling, handling disabled passengers, late crew, lavatory servicing, maintenance, oversales, potable water servicing, removal of unruly passenger, slow boarding or seating, stowing carry-on baggage, weight and balance delays.
26. **WeatherDelay** in minutes: Weather delay is caused by extreme or hazardous weather conditions that are forecasted or manifest themselves on point of departure, enroute, or on point of arrival.
27. **NASDelay** in minutes: Delay that is within the control of the National Airspace System (NAS) may include: non-extreme weather conditions, airport operations, heavy traffic volume, air traffic control, etc.
28. **SecurityDelay** in minutes: Security delay is caused by evacuation of a terminal or concourse, re-boarding of aircraft because of security breach, inoperative screening equipment and/or long lines in excess of 29 minutes at screening areas.
29. **LateAircraftDelay** in minutes: Arrival delay at an airport due to the late arrival of the same aircraft at a previous airport. The ripple effect of an earlier delay at downstream airports is referred to as delay propagation.

We will read the CSV file using Spark's default delimiter (","). The first line contains the headers, so we set the header option to true. We also set the inferSchema option to true so that Spark infers the data types from the file.

In [5]:
# Spark is lazy: the data is not loaded here and is only read when an action is executed.
# (inferSchema does make Spark scan the file once up front to determine the column types.)
df0 = spark.read\
            .option("header", "true")\
            .option("inferSchema", "true")\
            .csv("flights_jan08.csv")

In [6]:
cols = len(df0.columns)
print("The Flights dataset has", cols, "columns")
rows = df0.count()
print("The Flights dataset has", rows, "rows")

The Flights dataset has 29 columns
The Flights dataset has 100000 rows


Several features are irrelevant for our analysis, so we will create a new DataFrame that selects only the features of interest:

In [7]:
df = df0.select("DayOfweek", "AirTime", "ArrDelay", "DepDelay", "Origin", "Dest", "Distance", "CarrierDelay", "WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay")

In [8]:
df.printSchema()

root
 |-- DayOfweek: integer (nullable = true)
 |-- AirTime: string (nullable = true)
 |-- ArrDelay: string (nullable = true)
 |-- DepDelay: string (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Distance: integer (nullable = true)
 |-- CarrierDelay: string (nullable = true)
 |-- WeatherDelay: string (nullable = true)
 |-- NASDelay: string (nullable = true)
 |-- SecurityDelay: string (nullable = true)
 |-- LateAircraftDelay: string (nullable = true)



We can see that some features, such as AirTime, ArrDelay and DepDelay, have been imported as strings even though they appear to be numeric. Let's find out why they were imported as strings and get an overall understanding of their values:

In [9]:
print ("Summary of the main columns that we will focus on:")
df.select("AirTime","ArrDelay","DepDelay","Distance").summary().show()

Summary of the main columns that we will focus on:
+-------+-----------------+------------------+------------------+-----------------+
|summary|          AirTime|          ArrDelay|          DepDelay|         Distance|
+-------+-----------------+------------------+------------------+-----------------+
|  count|           100000|            100000|            100000|           100000|
|   mean| 91.8637966321506| 5.729954001094247|10.379048736571649|        630.58632|
| stddev|54.20822434609608|30.966959272464596|28.384428068170926|437.3570752611298|
|    min|              100|                -1|                -1|               66|
|    25%|             53.0|              -9.0|              -2.0|              324|
|    50%|             71.0|              -2.0|               1.0|              453|
|    75%|            118.0|              10.0|              10.0|              843|
|    max|               NA|                NA|                NA|             2363|
+-------+-----------------+------------------+------------------+-----------------+

The `max` row shows that some of these features contain the literal value NA, which is why they were inferred as strings (for string columns, min and max are compared lexicographically). We will first cast the features that should be integers, and then analyze the Null / NA values:
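One way to avoid the string columns in the first place is to tell the CSV reader which token marks a missing value. The sketch below is an alternative under stated assumptions, not the approach used in this notebook: it assumes `NA` is the only missing-value marker in the file, and `read_flights` is a hypothetical helper name. It relies on the `nullValue` option of Spark's CSV reader, so that `NA` is read as null and `inferSchema` can infer numeric types for those columns.

```python
def read_flights(spark, path="flights_jan08.csv"):
    """Read the flights CSV, treating the literal string 'NA' as null.

    `spark` is an existing SparkSession; `path` defaults to the file
    used in this notebook.
    """
    return (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        # Assumption: 'NA' is the only missing-value marker in the file
        .option("nullValue", "NA")
        .csv(path)
    )

# Usage (requires a running Spark session):
# df0 = read_flights(spark)
# df0.printSchema()
```

With this option the explicit casts would likely be unnecessary, although the resulting schema should still be checked with `printSchema()`.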

In [10]:
flights = df.withColumn("AirTimeTemp", df.AirTime.cast(IntegerType()))\
    .withColumn("ArrDelayTemp", df.ArrDelay.cast(IntegerType()))\
    .withColumn("DepDelayTemp", df.DepDelay.cast(IntegerType()))\
    .withColumn("CarrierDelayTemp", df.CarrierDelay.cast(IntegerType()))\
    .withColumn("WeatherDelayTemp", df.WeatherDelay.cast(IntegerType()))\
    .withColumn("NASDelayTemp", df.NASDelay.cast(IntegerType()))\
    .withColumn("SecurityDelayTemp", df.SecurityDelay.cast(IntegerType()))\
    .withColumn("LateAircraftDelayTemp", df.LateAircraftDelay.cast(IntegerType()))\
    .drop("AirTime")\
    .drop("ArrDelay")\
    .drop("DepDelay")\
    .drop("CarrierDelay")\
    .drop("WeatherDelay")\
    .drop("NASDelay")\
    .drop("SecurityDelay")\
    .drop("LateAircraftDelay")\
    .withColumnRenamed("AirTimeTemp", "AirTime")\
    .withColumnRenamed("ArrDelayTemp", "ArrDelay")\
    .withColumnRenamed("DepDelayTemp", "DepDelay")\
    .withColumnRenamed("CarrierDelayTemp", "CarrierDelay")\
    .withColumnRenamed("WeatherDelayTemp", "WeatherDelay")\
    .withColumnRenamed("NASDelayTemp", "NASDelay")\
    .withColumnRenamed("SecurityDelayTemp", "SecurityDelay")\
    .withColumnRenamed("LateAircraftDelayTemp", "PriorDelay")
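The same conversion can be written more compactly by looping over the column names. This is only a sketch of an equivalent approach: `cast_columns_to_int` and `NUMERIC_COLS` are hypothetical names, and it assumes Spark's default (non-ANSI) casting, where values that cannot be parsed, such as `NA`, become null.

```python
# Columns that were read as strings because of the literal "NA" marker
NUMERIC_COLS = [
    "AirTime", "ArrDelay", "DepDelay", "CarrierDelay",
    "WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay",
]

def cast_columns_to_int(df, columns=NUMERIC_COLS):
    """Return df with each listed column cast to integer under the same name."""
    for name in columns:
        # withColumn with an existing name replaces that column in place
        df = df.withColumn(name, df[name].cast("int"))
    # Same rename as in the cell above
    return df.withColumnRenamed("LateAircraftDelay", "PriorDelay")

# Usage (requires a running Spark session):
# flights = cast_columns_to_int(df)
```

Because `withColumn` replaces an existing column in place, the columns keep their original positions, so the column order would differ from the schema printed in the next cell.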

In [11]:
flights.cache()
flights.printSchema()

root
 |-- DayOfweek: integer (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Distance: integer (nullable = true)
 |-- AirTime: integer (nullable = true)
 |-- ArrDelay: integer (nullable = true)
 |-- DepDelay: integer (nullable = true)
 |-- CarrierDelay: integer (nullable = true)
 |-- WeatherDelay: integer (nullable = true)
 |-- NASDelay: integer (nullable = true)
 |-- SecurityDelay: integer (nullable = true)
 |-- PriorDelay: integer (nullable = true)



In [12]:
print ("Summary of the main columns that we will focus on:")
flights.select("AirTime","ArrDelay","DepDelay","Distance").summary().show()

Summary of the main columns that we will focus on:
+-------+-----------------+------------------+------------------+-----------------+
|summary|          AirTime|          ArrDelay|          DepDelay|         Distance|
+-------+-----------------+------------------+------------------+-----------------+
|  count|            98698|             98698|             98858|           100000|
|   mean| 91.8637966321506| 5.729954001094247|10.379048736571649|        630.58632|
| stddev|54.20822434609608|30.966959272464596|28.384428068170926|437.3570752611298|
|    min|               12|               -57|               -44|               66|
|    25%|               53|                -9|                -2|              324|
|    50%|               71|                -2|                 1|              453|
|    75%|              118|                10|                10|              843|
|    max|              369|               500|               516|             2363|
+-------+-----------------+------------------+------------------+-----------------+

OK, the data types have been correctly modified.
#### We will get the number of yearly arrival flights for each destination airport and save it in a new DataFrame, since we will need it for later analysis:

In [13]:
num_flights = flights\
                    .groupBy("Dest").agg(f.count("Dest").alias("NumFlights"))\
                    .orderBy("NumFlights", ascending=False)

num_flights.show()

+----+----------+
|Dest|NumFlights|
+----+----------+
| LAS|      6734|
| MDW|      6255|
| PHX|      5513|
| BWI|      4691|
| OAK|      3916|
| HOU|      3898|
| DAL|      3594|
| LAX|      3382|
| SAN|      3327|
| MCO|      3258|
| SMF|      2639|
| TPA|      2347|
| BNA|      2343|
| ONT|      2249|
| MCI|      2231|
| SJC|      2187|
| ABQ|      2085|
| STL|      2010|
| PHL|      1703|
| BUR|      1674|
+----+----------+
only showing top 20 rows



Let's try to get some insights regarding the NULL values that appeared on the Summary for some of the columns:

In [14]:
print("Checking for nulls on the columns that we will work with:")
flights.select([count(when(col(c).isNull(), c)).alias(c) for c in ["DayOfweek", "AirTime", "ArrDelay", "DepDelay", "Origin", "Dest", "Distance", "CarrierDelay", "WeatherDelay", "NASDelay", "SecurityDelay", "PriorDelay"]]).show()

Checking for nulls on the columns that we will work with:
+---------+-------+--------+--------+------+----+--------+------------+------------+--------+-------------+----------+
|DayOfweek|AirTime|ArrDelay|DepDelay|Origin|Dest|Distance|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|PriorDelay|
+---------+-------+--------+--------+------+----+--------+------------+------------+--------+-------------+----------+
|        0|   1302|    1302|    1142|     0|   0|       0|       80371|       80371|   80371|        80371|     80371|
+---------+-------+--------+--------+------+----+--------+------------+------------+--------+-------------+----------+



In [15]:
flights.show(5)

+---------+------+----+--------+-------+--------+--------+------------+------------+--------+-------------+----------+
|DayOfweek|Origin|Dest|Distance|AirTime|ArrDelay|DepDelay|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|PriorDelay|
+---------+------+----+--------+-------+--------+--------+------------+------------+--------+-------------+----------+
|        4|   IAD| TPA|     810|    116|     -14|       8|        null|        null|    null|         null|      null|
|        4|   IAD| TPA|     810|    113|       2|      19|        null|        null|    null|         null|      null|
|        4|   IND| BWI|     515|     76|      14|       8|        null|        null|    null|         null|      null|
|        4|   IND| BWI|     515|     78|      -6|      -4|        null|        null|    null|         null|      null|
|        4|   IND| BWI|     515|     77|      34|      34|           2|           0|       0|            0|        32|
+---------+------+----+--------+-------+--------+--------+------------+------------+--------+-------------+----------+

As we can see, there are many Null values in the features that describe the reason for the delay: CarrierDelay, WeatherDelay, NASDelay, SecurityDelay and PriorDelay. We assume that these features are Null when ArrDelay is lower than 15 min (and therefore not considered a delay) or when ArrDelay itself is Null. Let's confirm:

In [16]:
print("Checking number of delays with null values:")
flights.select([count(when(col("ArrDelay").isNull(), c)).alias(c) for c in ["ArrDelay","CarrierDelay", "WeatherDelay", "NASDelay", "SecurityDelay", "PriorDelay"]]).show()

Checking number of delays with null values:
+--------+------------+------------+--------+-------------+----------+
|ArrDelay|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|PriorDelay|
+--------+------------+------------+--------+-------------+----------+
|    1302|        1302|        1302|    1302|         1302|      1302|
+--------+------------+------------+--------+-------------+----------+



In [17]:
print("Checking number of delays lower than 15 min:")
flights.select([count(when(col("ArrDelay") < 15, c)).alias(c) for c in ["ArrDelay","CarrierDelay", "WeatherDelay", "NASDelay", "SecurityDelay", "PriorDelay"]]).show()


Checking number of delays lower than 15 min:
+--------+------------+------------+--------+-------------+----------+
|ArrDelay|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|PriorDelay|
+--------+------------+------------+--------+-------------+----------+
|   79069|       79069|       79069|   79069|        79069|     79069|
+--------+------------+------------+--------+-------------+----------+



OK, the counts are consistent with our assumption: the number of Null values in ArrDelay plus the number of delays shorter than 15 min equals the number of Null values in the descriptive columns (1,302 + 79,069 = 80,371). Let's confirm that the remainder (100,000 - 80,371 = 19,629) corresponds to delays of 15 min or more:

In [18]:
print("Checking number of delays greater than or equal to 15 min:")
flights.select([count(when(col("ArrDelay") >= 15, c)).alias(c) for c in ["ArrDelay", "CarrierDelay", "WeatherDelay", "NASDelay", "SecurityDelay", "PriorDelay"]]).show()

Checking number of delays greater than or equal to 15 min:
+--------+------------+------------+--------+-------------+----------+
|ArrDelay|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|PriorDelay|
+--------+------------+------------+--------+-------------+----------+
|   19629|       19629|       19629|   19629|        19629|     19629|
+--------+------------+------------+--------+-------------+----------+
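The counts in cells 16 to 18 depend only on the `ArrDelay` condition, so every column reports the same figure regardless of its own contents. A stricter check, sketched below, counts for each cause column the rows where that column is null even though `ArrDelay` is 15 or more; if the assumption holds, every count should be zero. `unexplained_null_exprs` and `CAUSE_COLS` are hypothetical names, and the column names are those of the `flights` DataFrame after the rename to `PriorDelay`.

```python
CAUSE_COLS = ["CarrierDelay", "WeatherDelay", "NASDelay", "SecurityDelay", "PriorDelay"]

def unexplained_null_exprs(cause_cols=CAUSE_COLS):
    """Build one SQL aggregate expression per cause column.

    Each expression counts rows where the cause column is null although
    ArrDelay >= 15 (a null ArrDelay never satisfies that comparison).
    """
    return [
        f"count(CASE WHEN {c} IS NULL AND ArrDelay >= 15 THEN 1 END) AS {c}"
        for c in cause_cols
    ]

# Usage with the notebook's DataFrame (requires a running Spark session):
# flights.selectExpr(*unexplained_null_exprs()).show()
```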



Now that we have a better understanding of the data, we will create a new DataFrame with the following filters: 

- Keep only the flights considered as actual delays (ArrDelay >= 15)
- Remove rows with NA values in the ArrDelay column, since we cannot get any insight from those

In [19]:
flightsOK = flights\
            .dropna(subset=["ArrDelay"])\
            .where(f.col("ArrDelay") >= 15)

In [20]:
flightsOK.printSchema()

root
 |-- DayOfweek: integer (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Distance: integer (nullable = true)
 |-- AirTime: integer (nullable = true)
 |-- ArrDelay: integer (nullable = true)
 |-- DepDelay: integer (nullable = true)
 |-- CarrierDelay: integer (nullable = true)
 |-- WeatherDelay: integer (nullable = true)
 |-- NASDelay: integer (nullable = true)
 |-- SecurityDelay: integer (nullable = true)
 |-- PriorDelay: integer (nullable = true)



In [21]:
flightsOK.show(5)

+---------+------+----+--------+-------+--------+--------+------------+------------+--------+-------------+----------+
|DayOfweek|Origin|Dest|Distance|AirTime|ArrDelay|DepDelay|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|PriorDelay|
+---------+------+----+--------+-------+--------+--------+------------+------------+--------+-------------+----------+
|        4|   IND| BWI|     515|     77|      34|      34|           2|           0|       0|            0|        32|
|        4|   IND| LAS|    1591|    230|      57|      67|          10|           0|       0|            0|        47|
|        4|   IND| MCO|     828|    107|      80|      94|           8|           0|       0|            0|        72|
|        4|   IND| PHX|    1489|    213|      15|      27|           3|           0|       0|            0|        12|
|        4|   IND| TPA|     838|    110|      16|      28|           0|           0|       0|            0|        16|
+---------+------+----+--------+-------+--------+--------+------------+------------+--------+-------------+----------+

We have now created a subset DataFrame containing the flights that are considered delayed (ArrDelay >= 15 min) and the features that are relevant for our analysis. Let's start working with it:

In [22]:
flightsOK.count()

19629

We will create a new feature called "DelaySeverity" that allows us to categorize and evaluate the impact of the flight delay. It will be categorized as follows:

Delays from 15 up to 30 min are categorized as "Annoying": 1

Delays over 30 and up to 60 min are categorized as "Impactful": 2

Delays above 60 min are categorized as "Unacceptable": 3

In [23]:

delayed_flights = flightsOK\
   .withColumn("DelaySeverity", when((col("ArrDelay")>15) & (col("ArrDelay")<=30), 1)\
                               .when((col("ArrDelay")>30) & (col("ArrDelay")<=60), 2)\
                               .otherwise(3))
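In the expression above, a delay of exactly 15 minutes matches neither `when` branch (the first requires `ArrDelay > 15`), so such a flight falls through to `otherwise(3)`. The sketch below shows one way to make the lower boundary inclusive so that it matches the category description; `delay_severity` and `SEVERITY_SQL` are hypothetical names, and the thresholds are taken from the description above.

```python
def delay_severity(arr_delay):
    """Map an arrival delay in minutes (>= 15) to a severity code from 1 to 3."""
    if arr_delay < 15:
        raise ValueError("not counted as a delay: ArrDelay must be >= 15")
    if arr_delay <= 30:
        return 1  # "Annoying"
    if arr_delay <= 60:
        return 2  # "Impactful"
    return 3      # "Unacceptable"

# The same rule as a Spark SQL expression, for rows already filtered to ArrDelay >= 15
SEVERITY_SQL = (
    "CASE WHEN ArrDelay <= 30 THEN 1 "
    "WHEN ArrDelay <= 60 THEN 2 "
    "ELSE 3 END AS DelaySeverity"
)

# Usage (requires a running Spark session):
# delayed_flights = flightsOK.selectExpr("*", SEVERITY_SQL)
```

Changing the boundary would alter the severity averages reported later in the notebook, which were computed with the original expression.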

In [24]:
delayed_flights.show(5)

+---------+------+----+--------+-------+--------+--------+------------+------------+--------+-------------+----------+-------------+
|DayOfweek|Origin|Dest|Distance|AirTime|ArrDelay|DepDelay|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|PriorDelay|DelaySeverity|
+---------+------+----+--------+-------+--------+--------+------------+------------+--------+-------------+----------+-------------+
|        4|   IND| BWI|     515|     77|      34|      34|           2|           0|       0|            0|        32|            2|
|        4|   IND| LAS|    1591|    230|      57|      67|          10|           0|       0|            0|        47|            2|
|        4|   IND| MCO|     828|    107|      80|      94|           8|           0|       0|            0|        72|            3|
|        4|   IND| PHX|    1489|    213|      15|      27|           3|           0|       0|            0|        12|            3|
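|        4|   IND| TPA|     838|    110|      16|      28|           0|           0|       0|            0|        16|            1|
+---------+------+----+--------+-------+--------+--------+------------+------------+--------+-------------+----------+-------------+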

#### Importing a DataFrame that contains relevant features of each Airport:

In [25]:
Airports = spark.read\
            .option("header", "true")\
            .option("inferSchema", "true")\
            .csv("airports_delay.csv")

In [26]:
Airports.show(10)

+-----------+-------------+------------+----------+---------------+----------------+------------+--------+
|AirportCode|         City|       State|NumRunways|AvgRunwayLength|ConstructionYear|NumTerminals|NumGates|
+-----------+-------------+------------+----------+---------------+----------------+------------+--------+
|        ABQ|   Bernalillo|Nuevo México|         3|           3000|            1939|           3|      25|
|        ALB|       Albany|    New York|         2|           2393|            1996|           1|       5|
|        AMA|     Amarillo|       Texas|         2|           3261|            1929|           1|       7|
|        AUS|       Austin|       Texas|         2|           3200|            1999|           2|      25|
|        BDL|Windsor Locks| Connecticut|         3|           2000|            1947|           1|      30|
|        BFL|      Oildale|  California|         2|           2800|            1957|           2|       4|
|        BHM|   Birmingham|     Alaba

Next, we join the newly imported dataset with the previously cleaned delayed_flights dataset.
The join uses the Dest column of the flights dataset and the AirportCode column of the new one.

In [27]:
Airport_join = delayed_flights.join(Airports, delayed_flights.Dest == Airports.AirportCode)
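`join` performs an inner join by default, so delayed flights whose `Dest` has no row in `Airports` are silently dropped. A minimal sketch for listing such destinations before joining uses a `left_anti` join, which keeps only the left-hand rows without a match; `unmatched_destinations` is a hypothetical helper name.

```python
def unmatched_destinations(flights_df, airports_df):
    """Return the distinct Dest codes that have no matching AirportCode."""
    condition = flights_df["Dest"] == airports_df["AirportCode"]
    return (
        flights_df
        .join(airports_df, condition, "left_anti")  # keep unmatched left rows only
        .select("Dest")
        .distinct()
    )

# Usage (requires a running Spark session):
# unmatched_destinations(delayed_flights, Airports).show()
```

An empty result would indicate that the inner join loses no delayed flights.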

In [28]:
Airport_join.printSchema()

root
 |-- DayOfweek: integer (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Distance: integer (nullable = true)
 |-- AirTime: integer (nullable = true)
 |-- ArrDelay: integer (nullable = true)
 |-- DepDelay: integer (nullable = true)
 |-- CarrierDelay: integer (nullable = true)
 |-- WeatherDelay: integer (nullable = true)
 |-- NASDelay: integer (nullable = true)
 |-- SecurityDelay: integer (nullable = true)
 |-- PriorDelay: integer (nullable = true)
 |-- DelaySeverity: integer (nullable = false)
 |-- AirportCode: string (nullable = true)
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- NumRunways: integer (nullable = true)
 |-- AvgRunwayLength: integer (nullable = true)
 |-- ConstructionYear: integer (nullable = true)
 |-- NumTerminals: integer (nullable = true)
 |-- NumGates: integer (nullable = true)



Because Dest and AirportCode contain the same information, we keep only AirportCode:

In [29]:
Airport_join = Airport_join.drop("Dest")

To this dataset, we will add the number of flights per destination (num_flights) computed earlier from the original flights CSV.

We drop the Dest column again so that it is not duplicated:

In [30]:
CompleteDF = Airport_join.join(num_flights, Airport_join.AirportCode == num_flights.Dest)

In [31]:
CompleteDF = CompleteDF.drop("Dest")
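An alternative that avoids the duplicated key column, sketched here as an assumption rather than the notebook's method, is to rename `Dest` to `AirportCode` in `num_flights` and join on the shared column name; Spark then keeps a single `AirportCode` column. `add_num_flights` is a hypothetical helper name.

```python
def add_num_flights(airport_df, num_flights_df):
    """Attach NumFlights to each row by joining on the AirportCode column name."""
    renamed = num_flights_df.withColumnRenamed("Dest", "AirportCode")
    # Joining on a column name (rather than an expression) yields one key column
    return airport_df.join(renamed, on="AirportCode", how="inner")

# Usage (requires a running Spark session):
# CompleteDF = add_num_flights(Airport_join, num_flights)
```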

Select the columns we would like to work with in our analysis:

In [32]:
CompleteDF = CompleteDF.select("AirportCode", "ArrDelay", "DelaySeverity", "City", "State", "NumFlights", "NumTerminals", "ConstructionYear", "NumGates", "NumRunways", "AvgRunwayLength", "Origin", "Distance", "AirTime", "DepDelay", "CarrierDelay", "WeatherDelay", "NASDelay", "SecurityDelay", "PriorDelay", "DayOfweek")

In [33]:
CompleteDF.printSchema()

root
 |-- AirportCode: string (nullable = true)
 |-- ArrDelay: integer (nullable = true)
 |-- DelaySeverity: integer (nullable = false)
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- NumFlights: long (nullable = false)
 |-- NumTerminals: integer (nullable = true)
 |-- ConstructionYear: integer (nullable = true)
 |-- NumGates: integer (nullable = true)
 |-- NumRunways: integer (nullable = true)
 |-- AvgRunwayLength: integer (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Distance: integer (nullable = true)
 |-- AirTime: integer (nullable = true)
 |-- DepDelay: integer (nullable = true)
 |-- CarrierDelay: integer (nullable = true)
 |-- WeatherDelay: integer (nullable = true)
 |-- NASDelay: integer (nullable = true)
 |-- SecurityDelay: integer (nullable = true)
 |-- PriorDelay: integer (nullable = true)
 |-- DayOfweek: integer (nullable = true)



Let's cache our final dataset, to keep it in memory for faster executions:

In [34]:
CompleteDF.cache()
CompleteDF\
        .select("AirportCode", "ArrDelay", "DelaySeverity", "City", "NumFlights", "NumTerminals", "NumRunways", "NumGates", "ConstructionYear").show()

+-----------+--------+-------------+----------------+----------+------------+----------+--------+----------------+
|AirportCode|ArrDelay|DelaySeverity|            City|NumFlights|NumTerminals|NumRunways|NumGates|ConstructionYear|
+-----------+--------+-------------+----------------+----------+------------+----------+--------+----------------+
|        BWI|      34|            2|      Washington|      4691|           4|         4|      73|            1947|
|        LAS|      57|            2|       Las Vegas|      6734|           2|         4|     110|            1942|
|        MCO|      80|            3|         Orlando|      3258|           2|         5|     129|            1981|
|        PHX|      15|            3|         Phoenix|      5513|           3|         3|     100|            1952|
|        TPA|      16|            1|           Tampa|      2347|           4|         3|      16|            1971|
|        BWI|      37|            2|      Washington|      4691|           4|   

### BUSINESS ANALYSIS: Arrival Delays related to the morphology of the arrival airports

Now that we have the dataset ready for our business analysis, we will answer several questions regarding this topic.

#### 1. Is there a relation between the year the airport was built and the number of flights arriving at it?

In [35]:
print("Average Number of Flights for the 25 oldest airports:")
CompleteDF.groupBy("AirportCode", "ConstructionYear")\
            .agg(avg("NumFlights"))\
            .orderBy("ConstructionYear").show(25)

Average Number of Flights for the 25 oldest airports:
+-----------+----------------+---------------+
|AirportCode|ConstructionYear|avg(NumFlights)|
+-----------+----------------+---------------+
|        OKC|            1911|          848.0|
|        DAL|            1917|         3594.0|
|        TUL|            1919|          685.0|
|        TUS|            1919|          926.0|
|        LGB|            1923|          138.0|
|        CLE|            1925|          498.0|
|        JAX|            1926|          808.0|
|        BOI|            1926|          643.0|
|        PDX|            1926|         1171.0|
|        DTW|            1927|          570.0|
|        SFO|            1927|          743.0|
|        COS|            1927|          183.0|
|        GSO|            1927|            2.0|
|        ROC|            1927|            1.0|
|        MHT|            1927|          848.0|
|        MDW|            1927|         6255.0|
|        OAK|            1927|         3916.0|
|     

In [36]:
print("Average Number of Flights for the 25 newest airports:")
CompleteDF.groupBy("AirportCode", "ConstructionYear")\
            .agg(avg("NumFlights"))\
            .orderBy("ConstructionYear", ascending = False).show(25)

Average Number of Flights for the 25 newest airports:
+-----------+----------------+---------------+
|AirportCode|ConstructionYear|avg(NumFlights)|
+-----------+----------------+---------------+
|        MYR|            2013|            1.0|
|        MAF|            1999|          319.0|
|        AUS|            1999|         1620.0|
|        BUF|            1997|          475.0|
|        ALB|            1996|          388.0|
|        DEN|            1995|         1589.0|
|        SAV|            1994|            1.0|
|        IAH|            1990|           13.0|
|        MCO|            1981|         3258.0|
|        RSW|            1980|          317.0|
|        MCI|            1972|         2231.0|
|        TPA|            1971|         2347.0|
|        HOU|            1969|         3898.0|
|        HRL|            1967|          312.0|
|        SMF|            1967|         2639.0|
|        SJC|            1965|         2187.0|
|        IAD|            1962|          308.0|
|     

#### Conclusions to Q1:
    
Looking at both queries, we can see that there is no apparent relation between the number of flights per airport and the year of construction.

Looking deeper into our data, there are some insights to extract:

- Only 8 airports have been constructed in the last 30 years. It makes sense that these 8 airports receive fewer flights.
- During the 1920s, the air transportation industry experienced a huge boom: 26 out of the 76 analyzed US airports were constructed in these 10 years.
    - This decade followed the end of World War I. Before then, airports were focused on military transportation. Commercial flights then started to develop, and the construction years of these airports are a clear example of this.
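To put a number on the "no apparent relation" observation, the Pearson correlation between construction year and number of flights can be computed over one row per airport. The sketch below assumes the column names of `CompleteDF`; `year_vs_flights_corr` is a hypothetical helper name, and `DataFrame.stat.corr` computes the Pearson coefficient.

```python
def year_vs_flights_corr(df):
    """Pearson correlation between ConstructionYear and NumFlights, per airport."""
    # Keep one row per airport so busy airports are not counted once per flight
    airports = df.dropDuplicates(["AirportCode"])
    return airports.stat.corr("ConstructionYear", "NumFlights")

# Usage (requires a running Spark session):
# print(year_vs_flights_corr(CompleteDF))
```

A value close to 0 would support the conclusion above, while a value near -1 or 1 would contradict it.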

#### 2. Are modern airports located mostly in large cities or not necessarily?

In [37]:
CompleteDF.dropDuplicates(['AirportCode', 'City',])\
                    .select('AirportCode', 'State', 'City', 'ConstructionYear')\
                    .orderBy('ConstructionYear', ascending = False).show()

+-----------+--------------+--------------+----------------+
|AirportCode|         State|          City|ConstructionYear|
+-----------+--------------+--------------+----------------+
|        MYR|South Carolina|  Mirtle Beach|            2013|
|        AUS|         Texas|        Austin|            1999|
|        MAF|         Texas|       Midland|            1999|
|        BUF|      New York|   Cheektowaga|            1997|
|        ALB|      New York|        Albany|            1996|
|        DEN|      Colorado|        Denver|            1995|
|        SAV|South Carolina|      Savannah|            1994|
|        IAH|         Texas|       Houston|            1990|
|        MCO|       Florida|       Orlando|            1981|
|        RSW|       Florida|    Fort Myers|            1980|
|        MCI|        Misuri|        Kansas|            1972|
|        TPA|       Florida|         Tampa|            1971|
|        HOU|         Texas|      Houston |            1969|
|        SMF|    Califor

#### Conclusions to Q2:
    
Having ordered the airports by year of construction, we can look at the cities where they are located.
We decided to take a subset of the 20 most modern airports to look at the cities.

Looking at this list, there are clearly some important cities, such as Austin, Denver, Albany and Houston. However, the rest of the cities are not considered large or important, and there are even some really small cities such as Myrtle Beach, Midland and Cheektowaga.

- We can then assume that new airports have been created as a complement to other larger airports in the state, or out of necessity in large areas that had no nearby airport because they were not close to a large city.

#### 3. Do airports with more terminals have larger delays, or is the opposite true?

In [38]:
print("Average Arrival Delays and Delay Severity for the 25 Airports with the MOST TERMINALS:")
print("Keep in mind that Delay severities are categorized as follows: 1-Annoying, 2-Impactful, 3-Unacceptable")
CompleteDF.groupBy("AirportCode", "NumTerminals")\
            .agg(avg("ArrDelay").alias("Avg_Delay"),\
            avg("DelaySeverity").alias("Avg_Severity"))\
            .orderBy("NumTerminals", ascending = False).show(25)

Average Arrival Delays and Delay Severity for the 25 Airports with the MOST TERMINALS:
Keep in mind that Delay severities are categorized as follows: 1-Annoying, 2-Impactful, 3-Unacceptable
+-----------+------------+------------------+------------------+
|AirportCode|NumTerminals|         Avg_Delay|      Avg_Severity|
+-----------+------------+------------------+------------------+
|        LAX|           9|48.175604626708726|1.8286014721345951|
|        PHL|           7| 54.62300319488818| 1.961661341853035|
|        FLL|           5|47.524096385542165|1.9518072289156627|
|        IAH|           5|              79.0|1.8333333333333333|
|        BWI|           4| 46.78449612403101|1.8387596899224805|
|        CRP|           4|43.333333333333336|1.7666666666666666|
|        SFO|           4|  85.7556270096463|2.3987138263665595|
|        TPA|           4| 44.64536741214057|1.8083067092651757|
|        MCI|           3| 46.03931203931204|1.8083538083538084|
|        TUS|           3| 54.

In [39]:
print("Average Arrival Delays and Delay Severity for the 25 Airports with the LEAST TERMINALS:")
print("Keep in mind that Delay severities are categorized as follows: 1-Annoying, 2-Impactful, 3-Unacceptable")
CompleteDF.groupBy("AirportCode", "NumTerminals")\
            .agg(avg("ArrDelay").alias("Avg_Delay"),\
            avg("DelaySeverity").alias("Avg_Severity"))\
            .orderBy("NumTerminals", ascending = True).show(25)

Average Arrival Delays and Delay Severity for the 25 Airports with the LEAST TERMINALS:
Keep in mind that Delay severities are categorized as follows: 1-Annoying, 2-Impactful, 3-Unacceptable
+-----------+------------+------------------+------------------+
|AirportCode|NumTerminals|         Avg_Delay|      Avg_Severity|
+-----------+------------+------------------+------------------+
|        HRL|           1|50.107142857142854|1.8392857142857142|
|        GSO|           1|              60.0|               2.0|
|        PVD|           1| 46.20645161290322|1.9548387096774194|
|        SAV|           1|              31.0|               2.0|
|        DAL|           1| 43.63193277310924|1.7949579831932774|
|        OKC|           1| 50.94358974358974|1.8461538461538463|
|        MHT|           1| 44.61842105263158|1.8289473684210527|
|        BNA|           1|  45.5974025974026|1.7948051948051948|
|        BOI|           1|              48.6|           1.83125|
|        MRY|           1|52.

#### Conclusions to Q3 (terminals):
    
The number of terminals mostly ranges from 1 to 3. The exceptions are airports with 4, 5, 7 or 9 terminals, and these outliers add up to only 8 US airports.

Looking at the average delay severity, and knowing that almost all airports have only one terminal, there is no apparent relation between the two factors. However, let's take a deeper look:

- The severity scale goes from 1 (annoying) to 3 (unacceptable), and nearly all airports average between 1.7 and 2.0 in both output tables; SFO, at about 2.4, is the one visible exception.

The conclusion we can draw is that airports build their terminals according to the traffic they expect to handle, which would explain why delays are similar regardless of the number of terminals.
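As a rough check of the statement above, the visible rows of the two output tables can be grouped by number of terminals. This is a minimal plain-Python sketch, not part of the original pipeline: the `rows` list is copied by hand from the truncated outputs (rounded to four decimals), so it covers only the airports shown, and each airport counts equally regardless of how many flights it received.

```python
from collections import defaultdict

# (AirportCode, NumTerminals, Avg_Severity) copied from the rows visible in the
# two outputs above; the tables are truncated, so this is NOT the full dataset.
rows = [
    ("LAX", 9, 1.8286), ("PHL", 7, 1.9617), ("FLL", 5, 1.9518),
    ("IAH", 5, 1.8333), ("BWI", 4, 1.8388), ("CRP", 4, 1.7667),
    ("SFO", 4, 2.3987), ("TPA", 4, 1.8083), ("MCI", 3, 1.8084),
    ("HRL", 1, 1.8393), ("GSO", 1, 2.0), ("PVD", 1, 1.9548),
    ("SAV", 1, 2.0), ("DAL", 1, 1.7950), ("OKC", 1, 1.8462),
    ("MHT", 1, 1.8289), ("BNA", 1, 1.7948), ("BOI", 1, 1.8313),
]

# Collect the severities per terminal count, then average each group.
groups = defaultdict(list)
for _, terminals, severity in rows:
    groups[terminals].append(severity)

mean_severity = {t: sum(v) / len(v) for t, v in groups.items()}

# Print: terminal count, mean severity of the group, number of airports in it.
for terminals in sorted(mean_severity):
    print(terminals, round(mean_severity[terminals], 3), len(groups[terminals]))
```

If the group means stay in a narrow band, that supports reading the terminal count as unrelated to severity. On the full data, the same grouping could be done with `CompleteDF.groupBy("NumTerminals")`.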

#### What about the Runways?

In [40]:
print("Average Arrival Delays and Delay Severity for 25 airports with the MOST RUNWAYS:")
print("Keep in mind that Delay severities are categorized as follows: 1-Annoying, 2-Impactful, 3-Unacceptable")
CompleteDF.groupBy("AirportCode", "NumRunways")\
            .agg(avg("ArrDelay").alias("Avg_Delay"),\
            avg("DelaySeverity").alias("Avg_Severity"))\
            .orderBy("NumRunways", ascending = False).show(25)

Average Arrival Delays and Delay Severity for 25 airports with the MOST RUNWAYS:
Keep in mind that Delay severities are categorized as follows: 1-Annoying, 2-Impactful, 3-Unacceptable
+-----------+----------+------------------+------------------+
|AirportCode|NumRunways|         Avg_Delay|      Avg_Severity|
+-----------+----------+------------------+------------------+
|        DTW|         6| 46.05940594059406|1.7623762376237624|
|        DEN|         6| 48.09792284866469|1.8724035608308605|
|        MDW|         5|54.752066115702476|1.9513314967860422|
|        IAD|         5|50.014084507042256|1.8309859154929577|
|        MCO|         5| 46.45985401459854|1.7980535279805352|
|        IAH|         5|              79.0|1.8333333333333333|
|        HOU|         4|47.167655786350146| 1.824925816023739|
|        CLE|         4|45.350515463917525|1.8453608247422681|
|        LAX|         4|48.175604626708726|1.8286014721345951|
|        BWI|         4| 46.78449612403101|1.838759689922480

In [41]:
print("Average Arrival Delays and Delay Severity for 25 airports with the LEAST RUNWAYS:")
print("Keep in mind that Delay severities are categorized as follows: 1-Annoying, 2-Impactful, 3-Unacceptable")
CompleteDF.groupBy("AirportCode", "NumRunways")\
            .agg(avg("ArrDelay").alias("Avg_Delay"),\
            avg("DelaySeverity").alias("Avg_Severity"))\
            .orderBy("NumRunways", ascending = True).show(25)

Average Arrival Delays and Delay Severity for 25 airports with the LEAST RUNWAYS:
Keep in mind that Delay severities are categorized as follows: 1-Annoying, 2-Impactful, 3-Unacceptable
+-----------+----------+------------------+------------------+
|AirportCode|NumRunways|         Avg_Delay|      Avg_Severity|
+-----------+----------+------------------+------------------+
|        MYR|         1|              18.0|               1.0|
|        RSW|         1| 45.71739130434783|1.7826086956521738|
|        SAN|         1|50.898656898656895|1.9157509157509158|
|        MRY|         2|52.142857142857146|1.9642857142857142|
|        BUF|         2|            54.775|               1.9|
|        CMH|         2| 62.68141592920354|1.9646017699115044|
|        FLL|         2|47.524096385542165|1.9518072289156627|
|        BOI|         2|              48.6|           1.83125|
|        AUS|         2|  46.6764705882353|1.7904411764705883|
|        BUR|         2| 46.83375959079284|1.83375959079283

#### Conclusions to Q3 (runways):

The insights gathered from this analysis are similar to those from the terminals analysis.

- Only 6 airports out of 76 have more than 4 runways
    - It is worth considering that airports may follow a building plan established many years ago that appears to be effective. The main characteristics would then vary little from one airport to another, both because the design is known to work and because it is the most efficient way to use the available square meters.

In this case, however, Avg_Delay and Avg_Severity seem to be slightly higher for airports with more runways, which is the opposite of what we were expecting. Airports with more runways usually receive a higher number of flights.

So far, delays do not appear to be related to the airport's morphology, but rather to the volume of flights it receives or to the size of the airport. Let's take a deeper look and analyze the delays compared to the average runway length.

#### 4. Discretizing the arrival delay to relate it to the average length of runways / average number of arriving flights. Is there a relation? Support your conclusions with data

We will create three different dataframes, one per delay severity, so that we can hopefully see different characteristics for each group and understand the potential reasons for delays:

- Annoying:     (15 - 30 min Delay)
- Impactful:    (30 - 60 min Delay)
- Unacceptable: ( > 60 min Delay)
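The bucketing can be sketched as a small plain-Python function. The boundary handling is an assumption: the list above does not say whether a delay of exactly 30 or 60 minutes belongs to the lower or the upper bucket, so the sketch puts it in the lower one. The function name `delay_severity` is hypothetical; the notebook stores the result in the `DelaySeverity` column.

```python
from typing import Optional


def delay_severity(arr_delay_min: float) -> Optional[int]:
    """Map an arrival delay in minutes to 1 (Annoying), 2 (Impactful) or 3 (Unacceptable).

    Assumption: boundaries are inclusive on the lower bucket (30 -> 1, 60 -> 2).
    Assumption: delays under 15 minutes are not counted as delays and return None.
    """
    if arr_delay_min < 15:
        return None
    if arr_delay_min <= 30:
        return 1
    if arr_delay_min <= 60:
        return 2
    return 3


# Try one delay from each bucket.
for delay in (18, 31, 75):
    print(delay, delay_severity(delay))
```

In PySpark the same rule would typically be written with chained `when(...)` conditions on the `ArrDelay` column.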

In [42]:
annoying_delays = CompleteDF\
                    .where(CompleteDF["DelaySeverity"] == 1)

impactful_delays = CompleteDF\
                    .where(CompleteDF["DelaySeverity"] == 2)

unacceptable_delays = CompleteDF\
                    .where(CompleteDF["DelaySeverity"] == 3)

For each delay-severity dataframe we will:

1. Create a new column with the number of annoying / impactful / unacceptable delays per airport.
2. Calculate a ratio:

    - Average length of runway / number of delayed flights in that severity group (e.g. the number of annoying flights)

This ratio gives us the meters of runway dedicated to each delayed flight at each airport. 

With this calculation we want to check whether fewer runway meters per flight go together with higher delays.

It is important to mention that some airports received only 1 flight in a group, so their ratio is simply the runway length. That is why we will analyze the summary statistics of the runway ratio for each group and look for insights. 
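As a worked example of the ratio, here is a minimal plain-Python sketch using three airports whose values appear in the annoying-delays output of this notebook. The helper name `runway_ratio` is hypothetical; the notebook computes the same quantity with `withColumn`.

```python
def runway_ratio(avg_runway_length: float, num_delayed_flights: int) -> float:
    """Runway length divided by the number of delayed flights at one airport."""
    if num_delayed_flights <= 0:
        raise ValueError("an airport needs at least one delayed flight")
    return avg_runway_length / num_delayed_flights


# (AirportCode, AvgRunwayLength, NumAnnoyFlights) taken from the notebook output.
examples = [("MYR", 2897, 1), ("EUG", 2100, 2), ("JAN", 8500, 26)]

for code, length, flights in examples:
    print(code, round(runway_ratio(length, flights), 2))
```

For MYR, with a single flight, the ratio equals the runway length itself, which is why single-flight airports end up at the top of the ranking.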

#### Annoying delays

In [43]:
annoying_flights = annoying_delays\
                    .groupBy("AirportCode").agg(f.count("AirportCode").alias("NumAnnoyFlights"))\
                    .orderBy("NumAnnoyFlights", ascending=False)

In [44]:
annoyingDF = annoying_flights.join(annoying_delays, ["AirportCode"])
annoyingDF = annoyingDF.drop("NumFlights")

In [45]:
annoying_ratio = annoyingDF.withColumn("FlightRunway_Ratio", col("AvgRunwayLength") / col("NumAnnoyFlights"))

In [46]:
print("FlightRunway_Ratio column tells us the Runway Length dedicated to each Flight for a given airport, for all the Delays considered as Annoying (15 - 30) min Delay:")
annoying_ratio.dropDuplicates(['AirportCode'])\
                    .select('AirportCode', 'FlightRunway_Ratio', 'AvgRunwayLength', 'NumAnnoyFlights')\
                    .orderBy('FlightRunway_Ratio', ascending = False).show()

FlightRunway_Ratio column tells us the Runway Length dedicated to each Flight for a given airport, for all the Delays considered as Annoying (15 - 30) min Delay:
+-----------+------------------+---------------+---------------+
|AirportCode|FlightRunway_Ratio|AvgRunwayLength|NumAnnoyFlights|
+-----------+------------------+---------------+---------------+
|        MYR|            2897.0|           2897|              1|
|        EWR|            2750.0|           2750|              1|
|        ROC|            1835.0|           1835|              1|
|        EUG|            1050.0|           2100|              2|
|        IAH|            1000.0|           3000|              3|
|        BFL|             700.0|           2800|              4|
|        LGB|             525.0|           2100|              4|
|        FAT|             382.0|           2674|              7|
|        JAN| 326.9230769230769|           8500|             26|
|        MRY|             250.0|           2000|          

In [47]:
annoying_ratio.select("FlightRunway_Ratio").describe().show()

+-------+------------------+
|summary|FlightRunway_Ratio|
+-------+------------------+
|  count|              8101|
|   mean|24.270954203184793|
| stddev|  66.2308492865673|
|    min|4.3316831683168315|
|    max|            2897.0|
+-------+------------------+



#### Impactful delays

In [48]:
impactful_flights = impactful_delays\
                    .groupBy("AirportCode").agg(f.count("AirportCode").alias("NumImpactFlights"))\
                    .orderBy("NumImpactFlights", ascending=False)

In [49]:
impactfulDF = impactful_flights.join(impactful_delays, ["AirportCode"])
impactfulDF = impactfulDF.drop("NumFlights")

In [50]:
impactful_ratio = impactfulDF.withColumn("FlightRunway_Ratio", col("AvgRunwayLength") / col("NumImpactFlights"))

In [51]:
print("FlightRunway_Ratio column tells us the Runway Length dedicated to each Flight for a given airport, for all the Delays considered as Impactful (30 - 60) min Delay:")
impactful_ratio.dropDuplicates(['AirportCode'])\
                    .select('AirportCode', 'FlightRunway_Ratio', 'AvgRunwayLength', 'NumImpactFlights')\
                    .orderBy('FlightRunway_Ratio', ascending = False).show()

FlightRunway_Ratio column tells us the Runway Length dedicated to each Flight for a given airport, for all the Delays considered as Impactful (30 - 60) min Delay:
+-----------+------------------+---------------+----------------+
|AirportCode|FlightRunway_Ratio|AvgRunwayLength|NumImpactFlights|
+-----------+------------------+---------------+----------------+
|        IAH|            3000.0|           3000|               1|
|        GSO|            2600.0|           2600|               1|
|        SAV|            2550.0|           2550|               1|
|        EWR|            1375.0|           2750|               2|
|        FAT|             668.5|           2674|               4|
|        LGB|             525.0|           2100|               4|
|        BFL| 466.6666666666667|           2800|               6|
|        JAN|447.36842105263156|           8500|              19|
|        COS|             333.1|           3331|              10|
|        RSW|             225.0|           36

In [52]:
impactful_ratio.select("FlightRunway_Ratio").describe().show()

+-------+------------------+
|summary|FlightRunway_Ratio|
+-------+------------------+
|  count|              6050|
|   mean| 32.22099173553719|
| stddev| 80.52843184060251|
|    min|5.2395209580838324|
|    max|            3000.0|
+-------+------------------+



#### Unacceptable delays

In [53]:
unacceptable_flights = unacceptable_delays\
                    .groupBy("AirportCode").agg(f.count("AirportCode").alias("NumUnaccFlights"))\
                    .orderBy("NumUnaccFlights", ascending=False)

In [54]:
unacceptableDF = unacceptable_flights.join(unacceptable_delays, ["AirportCode"])
unacceptableDF = unacceptableDF.drop("NumFlights")

In [55]:
unacceptable_ratio = unacceptableDF.withColumn("FlightRunway_Ratio", col("AvgRunwayLength") / col("NumUnaccFlights"))

In [56]:
print("FlightRunway_Ratio column tells us the Runway Length dedicated to each Flight for a given airport, for all the Delays considered as Unacceptable (> 60) min Delay:")
unacceptable_ratio.dropDuplicates(['AirportCode'])\
                    .select('AirportCode', 'FlightRunway_Ratio', 'AvgRunwayLength', 'NumUnaccFlights')\
                    .orderBy('FlightRunway_Ratio', ascending = False).show()

FlightRunway_Ratio column tells us the Runway Length dedicated to each Flight for a given airport, for all the Delays considered as Unacceptable (> 60) min Delay:
+-----------+------------------+---------------+---------------+
|AirportCode|FlightRunway_Ratio|AvgRunwayLength|NumUnaccFlights|
+-----------+------------------+---------------+---------------+
|        IAH|            1500.0|           3000|              2|
|        EUG|            1050.0|           2100|              2|
|        JAN| 944.4444444444445|           8500|              9|
|        BFL| 466.6666666666667|           2800|              6|
|        RSW|             360.0|           3600|             10|
|        COS| 302.8181818181818|           3331|             11|
|        MRY| 285.7142857142857|           2000|              7|
|        CRP| 285.7142857142857|           2000|              7|
|        AMA|            271.75|           3261|             12|
|        FAT|             267.4|           2674|         

In [57]:
unacceptable_ratio.select("FlightRunway_Ratio").describe().show()

+-------+------------------+
|summary|FlightRunway_Ratio|
+-------+------------------+
|  count|              5478|
|   mean| 34.52665206279664|
| stddev|  66.7038696669687|
|    min| 4.985754985754986|
|    max|            1500.0|
+-------+------------------+



#### Conclusions to Q4:
    
Surprisingly, our expectations were not met:

- The higher the severity of the delay, the higher the mean of the runway ratio. This means that, on average, the longer the flights were delayed, the more meters of runway the airport had available for each flight, which is the opposite of what we were expecting initially.

The mean of the ratio for each group is:
  - 24.27 for annoying delay flights
  - 32.22 for impactful delay flights
  - 34.53 for unacceptable delay flights
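One caveat is worth checking before reading too much into these means. Each `describe()` call runs on a dataframe with one row per delayed flight (the counts are 8,101, 6,050 and 5,478), so every airport's ratio is repeated once per flight. Under that reading, mean × count is the sum of `AvgRunwayLength` over the airports in the group, because an airport with n flights contributes n × (length / n). The sketch below, in plain Python and using only the figures printed above, computes that implied total for each group.

```python
# (mean, count) of FlightRunway_Ratio, copied from the describe() outputs above.
summary = {
    "annoying": (24.270954203184793, 8101),
    "impactful": (32.22099173553719, 6050),
    "unacceptable": (34.52665206279664, 5478),
}

# With one row per flight, sum(ratio) = sum over airports of n * (length / n),
# i.e. the total of AvgRunwayLength over the airports present in the group.
implied_total = {name: mean * count for name, (mean, count) in summary.items()}

# Print: group name, implied total runway length, number of delayed flights.
for name, total in implied_total.items():
    print(name, round(total), summary[name][1])
```

The implied totals differ by under 4% across the three groups, while the flight counts drop by about a third from the annoying to the unacceptable group. This suggests that the rising mean largely reflects the smaller number of flights in the more severe groups rather than longer runways, so the comparison should be read with care.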

At this point, neither the number nor the length of the runways appears to reduce delays or their severity; if anything, the relation runs opposite to what we expected. As we stated before, it is reasonable to assume that large airports with more runways receive a higher number of flights. Let's take a look at the relation between number of flights and delays.

#### 5. Does the Number of Arriving Flights for each Airport impact the Delays?

To answer this question we take a very direct approach that is easy to analyze and understand. We will divide the DataFrame in two, based on the mean of NumFlights:
- Airports that received an above-average number of flights (> 2946).
- Airports that received a below-average number of flights (<= 2946).

Then we will calculate the mean and count of arrival delays and delay severity. We would expect to see significantly higher delays at the airports with the highest number of flights. Let's confirm:

In [58]:
CompleteDF.select("NumFlights").describe().show()

+-------+------------------+
|summary|        NumFlights|
+-------+------------------+
|  count|             19629|
|   mean|2946.7493504508634|
| stddev| 1974.047350692438|
|    min|                 1|
|    max|              6734|
+-------+------------------+



In [59]:
MostFlights = CompleteDF.where(f.col("NumFlights") > 2946)

In [60]:
MostFlights.groupBy("AirportCode", "NumFlights")\
            .agg(avg("ArrDelay").alias("Delay"),\
            avg("DelaySeverity").alias("Delay_Severity"))\
            .orderBy("NumFlights", ascending = False).show(5)

+-----------+----------+------------------+------------------+
|AirportCode|NumFlights|             Delay|    Delay_Severity|
+-----------+----------+------------------+------------------+
|        LAS|      6734| 55.53939008894536|1.9428208386277002|
|        MDW|      6255|54.752066115702476|1.9513314967860422|
|        PHX|      5513| 48.36252189141857|1.8266199649737302|
|        BWI|      4691| 46.78449612403101|1.8387596899224805|
|        OAK|      3916| 46.59293873312565|1.8442367601246106|
+-----------+----------+------------------+------------------+
only showing top 5 rows



Mean and count of delays and delay severity for the airports that received the most flights:

In [61]:
MostFlights.select("ArrDelay", "DelaySeverity").describe().show()

+-------+------------------+------------------+
|summary|          ArrDelay|     DelaySeverity|
+-------+------------------+------------------+
|  count|              8863|              8863|
|   mean|49.832675166422206|1.8702470946632066|
| stddev|46.189813330223664|0.8243901173263337|
|    min|                15|                 1|
|    max|               500|                 3|
+-------+------------------+------------------+



In [62]:
LeastFlights = CompleteDF.where(f.col("NumFlights") <= 2946)

In [63]:
LeastFlights.groupBy("AirportCode", "NumFlights")\
            .agg(avg("ArrDelay").alias("Delay"),\
            avg("DelaySeverity").alias("Delay_Severity"))\
            .orderBy("NumFlights", ascending = True).show(5)

+-----------+----------+-----+------------------+
|AirportCode|NumFlights|Delay|    Delay_Severity|
+-----------+----------+-----+------------------+
|        MYR|         1| 18.0|               1.0|
|        SAV|         1| 31.0|               2.0|
|        ROC|         1| 24.0|               1.0|
|        GSO|         2| 60.0|               2.0|
|        EWR|         9| 43.0|1.6666666666666667|
+-----------+----------+-----+------------------+
only showing top 5 rows



Mean and count of delays and delay severity for the airports that received the fewest flights:

In [64]:
LeastFlights.select("ArrDelay", "DelaySeverity").describe().show()

+-------+-----------------+------------------+
|summary|         ArrDelay|     DelaySeverity|
+-------+-----------------+------------------+
|  count|            10766|             10766|
|   mean|49.07523685677132|1.8631803826862345|
| stddev|45.04494226652239|0.8181331316226551|
|    min|               15|                 1|
|    max|              486|                 3|
+-------+-----------------+------------------+



#### Conclusions to Q5:
    
In summary, these are the results obtained for this last business question:

- Airports with MOST Flights:
    - Mean Delay: 49.83
    - Mean Delay Severity: 1.87
    - Count: 8,863


- Airports with LEAST Flights:
    - Mean Delay: 49.07
    - Mean Delay Severity: 1.86
    - Count: 10,766

We can see that the airports with a higher volume of flights have a slightly higher average delay and delay severity; however, the difference is practically imperceptible and not at all significant from a business perspective.
On the other hand, the number of delayed flights differs noticeably between the two groups. Contrary to what we expected, the low-volume airports account for 21.5% more delayed flights than the high-volume airports.
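The two comparisons above can be reproduced from the printed summary statistics. This is a minimal plain-Python sketch: the percentage is the relative difference in row counts, and the standardized difference (Cohen's d with a pooled standard deviation) is an extra measure introduced here for illustration to put the gap in mean delay on a common scale; it is not part of the original analysis.

```python
import math

# count, mean and stddev of ArrDelay from the two describe() outputs above.
most = {"n": 8863, "mean": 49.832675166422206, "sd": 46.189813330223664}
least = {"n": 10766, "mean": 49.07523685677132, "sd": 45.04494226652239}

# Relative difference in the number of delayed flights between the groups.
pct_more_delayed = (least["n"] / most["n"] - 1) * 100

# Pooled variance, then the standardized mean difference (Cohen's d).
pooled_var = (
    (most["n"] - 1) * most["sd"] ** 2 + (least["n"] - 1) * least["sd"] ** 2
) / (most["n"] + least["n"] - 2)
cohens_d = (most["mean"] - least["mean"]) / math.sqrt(pooled_var)

print(round(pct_more_delayed, 1), round(cohens_d, 3))
```

A standardized difference this close to zero is consistent with the conclusion that the gap in mean delay between the two groups is negligible.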

*As a next stage of this project, we would generate insights on how the types of delay relate to the volume of flights, size and morphology of both departure and arrival airports.*

#### Does the morphology of the airport have an effect on the delays?

From the analysis that we have performed in this project, we can conclude that the morphology of the arrival airports has neither a direct nor a significant impact on flight delays, or at least not one that is interesting from a business point of view.