# Project 3
## Why Do Flights Get Delayed?


- **Dataset(s) to be used:** https://www.kaggle.com/datasets/sriharshaeedala/airline-delay?resource=download
- **Analysis question:** Which factors predict arrival delays (minutes) at the carrier-airport-month level? In particular, do higher traffic volume and larger counts of weather- or NAS-related delays predict greater average arrival delay per flight, after adjusting for carrier?
- **Columns that will (likely) be used:**
  - `carrier_name`
  - `airport_name`
  - `arr_flights`
  - `arr_del15`
  - `arr_delay`
  - `year`, `month`
  - `carrier_delay`, `weather_delay`, `nas_delay`, `security_delay`, `late_aircraft_delay`
- (If you're using multiple datasets) **Columns to be used to merge/join them:** Not applicable; only one dataset is used.
- **Hypothesis**: I expect that the majority of arrival delays in the U.S. airline system are driven not by weather but by operational factors, specifically late-arriving aircraft and airline-controlled issues (carrier delays).

In [23]:
# ensure the visualizations render properly across VSCode, Jupyter Book, etc.
# https://plotly.com/python/renderers/

import plotly.io as pio

pio.renderers.default = "notebook_connected+plotly_mimetype"

In [24]:
import pandas as pd
import plotly.express as px

df = pd.read_csv("airline_delay.csv")  # adjust filename if needed

df.head()

| | year | month | carrier | carrier_name | airport | airport_name | arr_flights | arr_del15 | carrier_ct | weather_ct | ... | security_ct | late_aircraft_ct | arr_cancelled | arr_diverted | arr_delay | carrier_delay | weather_delay | nas_delay | security_delay | late_aircraft_delay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | 8 | 9E | Endeavor Air Inc. | ABE | Allentown/Bethlehem/Easton, PA: Lehigh Valley ... | 89.0 | 13.0 | 2.25 | 1.6 | ... | 0.0 | 5.99 | 2.0 | 1.0 | 1375.0 | 71.0 | 761.0 | 118.0 | 0.0 | 425.0 |
| 1 | 2023 | 8 | 9E | Endeavor Air Inc. | ABY | Albany, GA: Southwest Georgia Regional | 62.0 | 10.0 | 1.97 | 0.04 | ... | 0.0 | 7.42 | 0.0 | 1.0 | 799.0 | 218.0 | 1.0 | 62.0 | 0.0 | 518.0 |
| 2 | 2023 | 8 | 9E | Endeavor Air Inc. | AEX | Alexandria, LA: Alexandria International | 62.0 | 10.0 | 2.73 | 1.18 | ... | 0.0 | 4.28 | 1.0 | 0.0 | 766.0 | 56.0 | 188.0 | 78.0 | 0.0 | 444.0 |
| 3 | 2023 | 8 | 9E | Endeavor Air Inc. | AGS | Augusta, GA: Augusta Regional at Bush Field | 66.0 | 12.0 | 3.69 | 2.27 | ... | 0.0 | 1.57 | 1.0 | 1.0 | 1397.0 | 471.0 | 320.0 | 388.0 | 0.0 | 218.0 |
| 4 | 2023 | 8 | 9E | Endeavor Air Inc. | ALB | Albany, NY: Albany International | 92.0 | 22.0 | 7.76 | 0.0 | ... | 0.0 | 11.28 | 2.0 | 0.0 | 1530.0 | 628.0 | 0.0 | 134.0 | 0.0 | 768.0 |


In [25]:
df.shape
df.info()
df.isna().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171666 entries, 0 to 171665
Data columns (total 21 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   year                 171666 non-null  int64  
 1   month                171666 non-null  int64  
 2   carrier              171666 non-null  object 
 3   carrier_name         171666 non-null  object 
 4   airport              171666 non-null  object 
 5   airport_name         171666 non-null  object 
 6   arr_flights          171426 non-null  float64
 7   arr_del15            171223 non-null  float64
 8   carrier_ct           171426 non-null  float64
 9   weather_ct           171426 non-null  float64
 10  nas_ct               171426 non-null  float64
 11  security_ct          171426 non-null  float64
 12  late_aircraft_ct     171426 non-null  float64
 13  arr_cancelled        171426 non-null  float64
 14  arr_diverted         171426 non-null  float64
 15  arr_delay        

year                     0
month                    0
carrier                  0
carrier_name             0
airport                  0
airport_name             0
arr_flights            240
arr_del15              443
carrier_ct             240
weather_ct             240
nas_ct                 240
security_ct            240
late_aircraft_ct       240
arr_cancelled          240
arr_diverted           240
arr_delay              240
carrier_delay          240
weather_delay          240
nas_delay              240
security_delay         240
late_aircraft_delay    240
dtype: int64

In [26]:
# Compute Total Delay Sources (aggregates delays by type)
delay_cols = [
    "carrier_delay", 
    "weather_delay", 
    "nas_delay", 
    "security_delay", 
    "late_aircraft_delay"
]

total_delays = df[delay_cols].sum().sort_values(ascending=False)
total_delays

late_aircraft_delay    283144335.0
carrier_delay          246370897.0
nas_delay              157823639.0
weather_delay           38153170.0
security_delay           1265591.0
dtype: float64

In [27]:
fig = px.bar(
    x=total_delays.index,          
    y=total_delays.values,         
    title="Total Delay Minutes by Cause (2013–2023)",
    labels={"x": "Delay Cause", "y": "Total Delay Minutes"} 
)

fig.show()

### Which delay source is the biggest?

This chart shows the total number of delay minutes attributed to each cause.
The results support the hypothesis:

- Late Aircraft Delay is the largest contributor (over 283 million minutes, about 39% of the total).
- Carrier Delay is second (about 246 million minutes, about 34%).
- NAS Delay is third (about 158 million minutes, about 22%).
- Weather Delay is far smaller than commonly assumed (about 38 million minutes, about 5%).
- Security Delay is almost negligible (about 1.3 million minutes, under 0.2%).

This is consistent with the idea that late inbound aircraft cause cascading delays across the system.
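
To compare the causes on a common scale, the totals can be converted to shares of all attributed delay minutes. The following is a minimal sketch: the totals are hard-coded from the output above so that it runs without the CSV, and the name `delay_share` is introduced here for illustration. In the notebook itself, the existing `total_delays` Series could be used directly.

```python
import pandas as pd

# Totals copied from the notebook output above (delay minutes by cause).
total_delays = pd.Series(
    {
        "late_aircraft_delay": 283144335.0,
        "carrier_delay": 246370897.0,
        "nas_delay": 157823639.0,
        "weather_delay": 38153170.0,
        "security_delay": 1265591.0,
    }
)

# Share of all attributed delay minutes, in percent, rounded to one decimal.
delay_share = (total_delays / total_delays.sum() * 100).round(1)
print(delay_share)

# Combined share of the two causes named in the hypothesis.
operational_share = delay_share[["late_aircraft_delay", "carrier_delay"]].sum()
print(f"Late aircraft + carrier: {operational_share:.1f}%")
```

Expressed this way, the hypothesis becomes a direct check: do late-aircraft and carrier delays together exceed half of all delay minutes?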

## Seasonal Patterns - how do total delays vary over time?

In [28]:
df["day"] = 1
df["date"] = pd.to_datetime(df[["year", "month", "day"]])

monthly_delay = df.groupby("date")["arr_delay"].sum().reset_index()

fig = px.line(
    monthly_delay,
    x="date",
    y="arr_delay",
    title="Monthly Total Arrival Delay (2013–2023)",
    labels={"arr_delay": "Delay Minutes", "date": "Date"}
)

fig.show()

### Interpretation

- There are clear seasonal spikes in delays.
    - These peaks consistently occur mid-year (summer travel) and near the end of the year (December holidays).
- External factors:
    - There is a dramatic trough throughout 2020 and early 2021, which coincides with the reduction in air travel during the COVID-19 pandemic.
    - There is a spike around 2023 that may relate to post-pandemic travel surges.

Total delay minutes are therefore not constant over time. Because the chart sums minutes across all flights, it reflects changes in flight volume as well as changes in delay per flight.
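
One way to separate flight volume from delay severity is to divide each month's delay minutes by that month's arriving flights. The sketch below is not part of the original analysis: the helper `monthly_delay_per_flight` is a hypothetical name, and the sample rows are made up purely to show the calculation. Only the column names (`year`, `month`, `arr_flights`, `arr_delay`) come from the dataset.

```python
import pandas as pd


def monthly_delay_per_flight(df: pd.DataFrame) -> pd.DataFrame:
    """Average arrival-delay minutes per arriving flight for each year-month."""
    monthly = df.groupby(["year", "month"], as_index=False)[
        ["arr_delay", "arr_flights"]
    ].sum()
    monthly["delay_per_flight"] = monthly["arr_delay"] / monthly["arr_flights"]
    return monthly


# Made-up rows (NOT real data) using the dataset's column names.
sample = pd.DataFrame(
    {
        "year": [2020, 2020, 2020, 2020],
        "month": [1, 1, 4, 4],
        "arr_flights": [100.0, 300.0, 20.0, 20.0],
        "arr_delay": [1000.0, 3000.0, 100.0, 300.0],
    }
)

result = monthly_delay_per_flight(sample)
print(result)
```

In this toy sample, total delay minutes fall by a factor of ten between the two months while the delay per flight stays the same, which is the kind of volume effect a totals-only chart cannot distinguish.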

## How Do Airline Carriers Compare in Delay Performance?

In [29]:
carrier_sums = df.groupby("carrier_name")[["arr_flights", "arr_del15"]].sum()
carrier_sums["delay_rate"] = carrier_sums["arr_del15"] / carrier_sums["arr_flights"]
carrier_delay_rate = carrier_sums.sort_values("delay_rate", ascending=False).reset_index()

worst_carriers = carrier_delay_rate.head(15).copy()
best_carriers = carrier_delay_rate.tail(15).copy()

worst_carriers["category"] = "Highest Delay Rates"
best_carriers["category"] = "Lowest Delay Rates"

combined_carriers = pd.concat([worst_carriers, best_carriers])

In [30]:
fig = px.bar(
    combined_carriers,
    x="carrier_name",
    y="delay_rate",
    color="category",
    barmode="group",
    title="Carriers With Lowest and Highest Delay Rates",
)
fig.show()

### Interpretation - Airline Carrier Reliability

- The least reliable carriers sit at the far left of the chart (blue category), with delay rates between roughly 20% and 25%.
    - Frontier Airlines Inc. and JetBlue Airways have the highest rates, with about a quarter of their arrivals delayed by 15 minutes or more (the threshold counted by `arr_del15`).
- The most reliable carriers sit at the far right of the chart (red/orange category), where rates drop to approximately 12.5%.
    - Hawaiian Airlines Inc. and Endeavor Air Inc. are the most reliable carriers shown.

The analysis reveals a _**performance gap between the best and worst carriers**_. A traveler flying with Frontier (or JetBlue) is approximately twice as likely to experience a delay (25% rate) as a traveler flying with Hawaiian Airlines (12.5% rate).

**However**, apart from a handful of carriers at either extreme, the delay rates form a smooth gradient with no sudden drop-offs; the difference between the 15th worst carrier and the 15th best is small. _**Most U.S. airlines have similar delay rates, operating within a narrow band of overall reliability.**_
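
Because the chart takes the 15 highest and the 15 lowest rates, the two groups would share carriers if the dataset contained fewer than 30 carriers. The notebook does not print the carrier count, so this is only a possibility, not a confirmed problem. A minimal sketch of a guard follows; `split_extremes` is a hypothetical helper and the sample carriers are made up.

```python
import pandas as pd


def split_extremes(ranked: pd.DataFrame, n: int = 15):
    """Return (highest, lowest) groups of up to n rows each, with no overlap.

    `ranked` is assumed to be sorted by delay rate, highest first.
    """
    n = min(n, len(ranked) // 2)  # cap n so head and tail cannot share rows
    return ranked.head(n).copy(), ranked.tail(n).copy()


# Made-up carriers and rates (NOT real data), sorted highest first.
ranked = pd.DataFrame(
    {
        "carrier_name": ["A", "B", "C", "D", "E"],
        "delay_rate": [0.25, 0.20, 0.18, 0.15, 0.12],
    }
)

highest, lowest = split_extremes(ranked, n=15)
print(highest["carrier_name"].tolist(), lowest["carrier_name"].tolist())
```

With five rows and `n=15`, the helper falls back to two carriers per group, so no carrier is plotted under both categories.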

## How Do Delay Rates Differ Across Airports?

In [31]:
airport_sums = df.groupby("airport_name")[["arr_flights", "arr_del15"]].sum()
airport_sums["delay_rate"] = airport_sums["arr_del15"] / airport_sums["arr_flights"]
airport_delay_rate = airport_sums.reset_index()

airport_delay_rate_filtered = airport_delay_rate[
    airport_delay_rate["arr_flights"] >= 10
].copy()

sorted_airports = airport_delay_rate_filtered.sort_values("delay_rate", ascending=True)

sorted_airports

| | airport_name | arr_flights | arr_del15 | delay_rate |
|---|---|---|---|---|
| 250 | Mobile, AL: Mobile International | 12.0 | 0.0 | 0.000000 |
| 173 | Inyokern, CA: Inyokern Airport | 160.0 | 7.0 | 0.043750 |
| 110 | Elko, NV: Elko Regional | 6453.0 | 460.0 | 0.071285 |
| 211 | Lewiston, ID: Lewiston Nez Perce County | 7963.0 | 581.0 | 0.072962 |
| 53 | Butte, MT: Bert Mooney | 6903.0 | 517.0 | 0.074895 |
| ... | ... | ... | ... | ... |
| 3 | Aguadilla, PR: Rafael Hernandez | 17708.0 | 5585.0 | 0.315394 |
| 411 | Wilmington, DE: New Castle | 1260.0 | 406.0 | 0.322222 |
| 373 | Stockton, CA: Stockton Metro | 1375.0 | 448.0 | 0.325818 |
| 76 | Cold Bay, AK: Cold Bay Airport | 262.0 | 90.0 | 0.343511 |


In [32]:
# Function to shorten the name of airports -- Example: 'Topeka, KS: Topeka Regional' -> 'Topeka Regional'
def simplify_name(full_name):
    parts = full_name.split(':')
    
    if len(parts) > 1:
        return parts[1].strip() 
    else:
        return full_name.strip()

In [33]:
# .copy() makes independent DataFrames, so adding columns below does not
# trigger pandas' SettingWithCopyWarning.
worst_airports = sorted_airports.tail(15).copy()
best_airports = sorted_airports.head(15).copy()

best_airports['shortened_airport_name'] = best_airports['airport_name'].apply(simplify_name)
worst_airports['shortened_airport_name'] = worst_airports['airport_name'].apply(simplify_name)

worst_airports["category"] = "Highest Delay Rates"
best_airports["category"] = "Lowest Delay Rates"

combined = pd.concat([worst_airports, best_airports])

fig = px.bar(
    combined,
    x="shortened_airport_name",
    y="delay_rate",
    color="category",
    barmode="group",
    title="Airports With Lowest and Highest Delay Rates",
)
fig.show()

### Interpretation - Airport Reliability

- The most reliable airports are in the red category (Lowest Delay Rates), where rates mostly fall between 4% and 10%. Mobile International shows 0%, but on only 12 recorded arrivals.
- The least reliable airports are in the blue category (Highest Delay Rates), where rates fall between roughly 25% and 34%.

The analysis reveals a _**massive performance gap**_.
- A traveler flying into one of the worst airports is roughly 2.5 to 8 times more likely to experience a delay than a traveler flying into one of the best airports on this list.

_**Neither the best nor the worst performers are major national hubs**_. This does not support the idea that large airports are where arrivals are most likely to be delayed. Instead, delay rates are highly volatile among specific regional and smaller airports. Several airports at both extremes handle few flights (Mobile International has 12 recorded arrivals, Inyokern 160, Cold Bay 262), so their rates rest on small samples and should be read with caution.
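
To illustrate how much uncertainty a small arrival count carries, a confidence interval can be attached to each delay rate. The sketch below uses the Wilson score interval, which is a choice made here and not part of the original analysis. It treats each arrival as an independent trial, which is a simplifying assumption, and `wilson_interval` is a hypothetical helper name. The counts are taken from the airport table above.

```python
import math


def wilson_interval(delayed: float, flights: float, z: float = 1.96):
    """Wilson score interval for a delay rate (z=1.96 is roughly 95%)."""
    if flights <= 0:
        raise ValueError("flights must be positive")
    p = delayed / flights
    denom = 1 + z**2 / flights
    center = (p + z**2 / (2 * flights)) / denom
    half = z * math.sqrt(p * (1 - p) / flights + z**2 / (4 * flights**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Counts from the airport table above.
mobile = wilson_interval(0, 12)      # Mobile, AL: 0 of 12 arrivals delayed
elko = wilson_interval(460, 6453)    # Elko, NV: 460 of 6,453 arrivals delayed

print("Mobile:", mobile)
print("Elko:  ", elko)
```

Under these assumptions, Mobile's 0% rate is compatible with a true rate anywhere from 0% to roughly 24%, while Elko's interval spans only about 6.5% to 7.8%. Raising the minimum-arrivals filter above 10 would be another way to keep such low-volume airports out of the ranking.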

## Conclusion

This project evaluates the factors that contribute most to arrival delays across U.S. airports. I began with the hypothesis that operational issues (specifically late-arriving aircraft and airline-controlled delays) would play a larger role in national delay patterns than weather-related causes. The findings strongly support this expectation.

##### 1. Primary Drivers and Trends
- Across the dataset, late aircraft delays and carrier delays are by far the largest contributors, together accounting for about 73% of total delay minutes nationwide. Weather delays make up a much smaller share (about 5%), and security delays are minimal in comparison (under 0.2%).
- Total delay minutes follow a seasonal pattern, with consistent peaks during the summer and December holiday travel periods. The monthly chart does not separate weather from other causes, so it cannot show how much of that seasonality is weather-driven.

##### 2. Performance at the Carrier Level
- Across airline carriers, a clear performance gap emerges. Less reliable carriers, such as Frontier Airlines and JetBlue Airways, show delay rates near 20–25%, while the most reliable carriers, including Hawaiian Airlines and Endeavor Air, show rates near 12–15%.
- However, the gap is concentrated at the extremes (the three worst and the three best carriers). For the vast majority of U.S. carriers, performance is tightly clustered, suggesting that most airlines operate within a similar band of efficiency, possibly because they share the same national air traffic system.

##### 3. Volatility at the Airport Level
- The airport-level analysis further illustrates how delay rates vary geographically. The combined chart highlights substantial disparities: some regional airports experience extremely high proportions of delayed arrivals, while others maintain much lower rates.
- Neither group is dominated by major national hubs, suggesting that a high delay rate is not simply a function of airport size and may instead reflect localized operational performance. Some of the most extreme rates, however, come from airports with very few recorded arrivals.


**Overall, the findings indicate that delays attributed to operational causes (late-arriving aircraft and carrier-controlled issues) account for the majority of arrival delay minutes in the U.S. aviation system, with delay rates varying most among regional airports. Efforts to improve airline scheduling, airport operations, and aircraft turnaround procedures are likely to yield the greatest reductions in nationwide arrival delays.**