---
title: California Flight Delay Analysis
exports:
  - format: pdf
    template: arxiv_two_column
    output: pdf_builds/main.pdf
bibliography:
  - references.bib
---

## Introduction

This project analyzes and models flight delay behavior in California using data from the U.S. Department of Transportation. It studies domestic flights departing from California in 2015 to understand how delay patterns vary across airlines, airports, and time scales. The main objectives are to assess airline and airport reliability, explore clustering structure among airlines and airports, and evaluate how well flight delays can be predicted using statistical and machine learning methods.

## Research Questions

This project focuses on the following descriptive and predictive questions:

1. **Reliability:** Which airlines and airports are the most reliable, that is, most often on time with the fewest delays for travelers?
2. **Structures:** Do airports and/or airlines cluster into groups with similar delay behavior?
3. **Time Patterns:** How do delays vary by **month**, **day of the month**, **day of the week**, and **hour of the day**?
4. **Prediction:** How well can we predict whether a flight will be delayed using the available data?

## About the Data

The dataset used in this project is the U.S. Flight Delays Dataset (@USDOTFlightDelays2015), available on Kaggle:  
https://www.kaggle.com/datasets/usdot/flight-delays  

The data comes from the Bureau of Transportation Statistics and contains on-time performance records of domestic U.S. flights across 14 airlines and 322 airports during 2015. Because the full dataset is large, this project uses a subset containing only flights departing from California airports.

The dataset includes flight timing variables, airline and airport identifiers, and delay measures for both departures and arrivals.
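
The California subset described above could be derived roughly as follows. This is only a sketch under assumptions: the actual filtering lives in `get_data.ipynb`, which is not shown in this report, and the helper name `filter_california_departures` and the toy rows are hypothetical. It assumes the flights table identifies origins by IATA code and that the airports table carries a `STATE` column.

```python
import pandas as pd


def filter_california_departures(flights: pd.DataFrame, airports: pd.DataFrame) -> pd.DataFrame:
    """Keep only flights whose origin airport is located in California."""
    # Collect the IATA codes of all airports in California.
    ca_codes = set(airports.loc[airports["STATE"] == "CA", "IATA_CODE"])
    # Keep flights departing from one of those airports.
    return flights[flights["ORIGIN_AIRPORT"].isin(ca_codes)].reset_index(drop=True)


# Toy data for illustration only (not taken from the real dataset).
toy_airports = pd.DataFrame({
    "IATA_CODE": ["LAX", "SFO", "ABQ"],
    "STATE": ["CA", "CA", "NM"],
})
toy_flights = pd.DataFrame({
    "AIRLINE": ["AA", "US", "DL"],
    "ORIGIN_AIRPORT": ["LAX", "ABQ", "SFO"],
    "DESTINATION_AIRPORT": ["PBI", "LAX", "MSP"],
})

ca_flights = filter_california_departures(toy_flights, toy_airports)
print(ca_flights)
```

Filtering on the origin only keeps flights that leave California regardless of destination, which matches the scope stated in the introduction.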

## Key Details to Remember, Assumptions and Limitations
1. **Defining "On Time":** A flight is defined as **on time** if its departure or arrival delay is **less than or equal to 15 minutes**. This closely follows the Bureau of Transportation Statistics convention, under which a flight counts as **on time** when it operates less than 15 minutes behind schedule.
2. **Not Claiming Causes of Delays:** This project is predominantly **observational**: we describe patterns rather than make causal claims, since we do not have access to outside data on the factors that delayed the specific flights in this dataset.
3. **Justification for Filtering:** The dataset was filtered heavily to reduce its size, but not so much that too little data remained to work with. We kept only flights departing from California because California is a large state with many airports and airlines. Air travel is also common among Californians, so understanding how delays affect California flights would benefit Californian travelers.
4. **Assumption 1:** Since we did not collect this data ourselves, we assume the delay variables were measured *consistently* across all airlines and airports.
5. **Assumption 2:** In our time analysis, we assume that flights within the same group (grouped by month, week, day, or hour) are comparable *on average*.
6. **Limitation 1:** We assume that the filtered data is a *fair representation* of operational conditions in *California for the year 2015*, but our findings *do not generalize* to other states or years.
7. **Limitation 2:** The delay patterns we observe may reflect congestion, delay propagation, scheduling density, and similar factors, but we neither know nor claim specific causality without additional data on the factors affecting individual flights.
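
The on-time definition in item 1 can be expressed as a simple flag and then averaged per airline or airport. The sketch below is illustrative rather than the project's actual code: the function names `add_on_time_flag` and `on_time_rate_by`, the `ON_TIME` column, and the toy rows are assumptions. It applies the 15-minute threshold to a single delay column chosen by the caller and drops flights with a missing delay, which is itself an assumption.

```python
import pandas as pd

ON_TIME_THRESHOLD_MIN = 15  # threshold from the on-time definition in item 1


def add_on_time_flag(flights: pd.DataFrame, delay_col: str = "ARRIVAL_DELAY") -> pd.DataFrame:
    """Return a copy with a boolean ON_TIME column (delay <= 15 minutes).

    Rows with a missing delay are dropped (an assumption of this sketch).
    """
    out = flights.dropna(subset=[delay_col]).copy()
    out["ON_TIME"] = out[delay_col] <= ON_TIME_THRESHOLD_MIN
    return out


def on_time_rate_by(flagged: pd.DataFrame, group_col: str) -> pd.Series:
    """Share of on-time flights per group, highest first."""
    return flagged.groupby(group_col)["ON_TIME"].mean().sort_values(ascending=False)


# Toy data for illustration only (not taken from the real dataset).
toy = pd.DataFrame({
    "AIRLINE": ["AA", "AA", "DL", "DL", "US"],
    "ARRIVAL_DELAY": [-9.0, 20.0, 8.0, 15.0, None],
})

flagged = add_on_time_flag(toy)
rates = on_time_rate_by(flagged, "AIRLINE")
print(rates)
```

Passing `delay_col="DEPARTURE_DELAY"` gives the departure-based version of the same flag.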

Here is a preview of our data:

## Import Packages

In [1]:
import pandas as pd
import numpy as np
from helper_functions import get_full_data, summarize_df

## Loading the Data

In [5]:
# These tables were derived in the get_data.ipynb notebook
airlines = pd.read_csv("data/airlines.csv")
airports = pd.read_csv("data/airports.csv")
flights = pd.read_csv("data/filtered_flights.csv")
airlines.head()

| | IATA_CODE | AIRLINE |
|---|---|---|
| 0 | UA | United Air Lines Inc. |
| 1 | AA | American Airlines Inc. |
| 2 | US | US Airways Inc. |
| 3 | F9 | Frontier Airlines Inc. |
| 4 | B6 | JetBlue Airways |


In [6]:
airports.head()

| | IATA_CODE | AIRPORT | CITY | STATE | COUNTRY | LATITUDE | LONGITUDE |
|---|---|---|---|---|---|---|---|
| 0 | ABE | Lehigh Valley International Airport | Allentown | PA | USA | 40.65236 | -75.4404 |
| 1 | ABI | Abilene Regional Airport | Abilene | TX | USA | 32.41132 | -99.6819 |
| 2 | ABQ | Albuquerque International Sunport | Albuquerque | NM | USA | 35.04022 | -106.60919 |
| 3 | ABR | Aberdeen Regional Airport | Aberdeen | SD | USA | 45.44906 | -98.42183 |
| 4 | ABY | Southwest Georgia Regional Airport | Albany | GA | USA | 31.53552 | -84.19447 |


In [7]:
flights.head()

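| | MONTH | DAY | DAY_OF_WEEK | AIRLINE | FLIGHT_NUMBER | TAIL_NUMBER | ORIGIN_AIRPORT | DESTINATION_AIRPORT | SCHEDULED_DEPARTURE | DEPARTURE_TIME | ... | TAXI_IN | SCHEDULED_ARRIVAL | ARRIVAL_TIME | ARRIVAL_DELAY | CANCELLED | AIR_SYSTEM_DELAY | SECURITY_DELAY | AIRLINE_DELAY | LATE_AIRCRAFT_DELAY | WEATHER_DELAY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 4 | AA | 2336 | N3KUAA | LAX | PBI | 10 | 2.0 | ... | 4.0 | 750 | 741.0 | -9.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1 | 1 | 4 | US | 840 | N171US | SFO | CLT | 20 | 18.0 | ... | 11.0 | 806 | 811.0 | 5.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1 | 1 | 4 | AA | 258 | N3HYAA | LAX | MIA | 20 | 15.0 | ... | 8.0 | 805 | 756.0 | -9.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1 | 1 | 4 | DL | 806 | N3730B | SFO | MSP | 25 | 20.0 | ... | 6.0 | 602 | 610.0 | 8.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1 | 1 | 4 | US | 2013 | N584UW | LAX | CLT | 30 | 44.0 | ... | 8.0 | 803 | 753.0 | -10.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |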
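
Because the flights table stores airlines as IATA codes, a lookup against the airlines table can make later summaries easier to read. The following is a minimal sketch with toy rows; the helper name `attach_airline_names` and the output column `AIRLINE_NAME` are hypothetical and not part of the project code.

```python
import pandas as pd


def attach_airline_names(flights: pd.DataFrame, airlines: pd.DataFrame) -> pd.DataFrame:
    """Add the full airline name to each flight via its IATA code."""
    # Rename so the join key matches the flights table and the name column does not clash.
    lookup = airlines.rename(columns={"IATA_CODE": "AIRLINE", "AIRLINE": "AIRLINE_NAME"})
    return flights.merge(
        lookup[["AIRLINE", "AIRLINE_NAME"]],
        on="AIRLINE",
        how="left",               # keep every flight, even if the code has no match
        validate="many_to_one",   # fail loudly if a code appears twice in the lookup
    )


# Toy rows for illustration only.
toy_airlines = pd.DataFrame({
    "IATA_CODE": ["UA", "AA", "US"],
    "AIRLINE": ["United Air Lines Inc.", "American Airlines Inc.", "US Airways Inc."],
})
toy_flights = pd.DataFrame({
    "AIRLINE": ["AA", "US", "DL"],
    "ORIGIN_AIRPORT": ["LAX", "SFO", "SFO"],
})

named = attach_airline_names(toy_flights, toy_airlines)
print(named)
```

A left join is used so that a code missing from the lookup shows up as a missing name rather than silently dropping the flight.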


In [8]:
list(flights.columns)

['MONTH',
 'DAY',
 'DAY_OF_WEEK',
 'AIRLINE',
 'FLIGHT_NUMBER',
 'TAIL_NUMBER',
 'ORIGIN_AIRPORT',
 'DESTINATION_AIRPORT',
 'SCHEDULED_DEPARTURE',
 'DEPARTURE_TIME',
 'DEPARTURE_DELAY',
 'TAXI_OUT',
 'WHEELS_OFF',
 'SCHEDULED_TIME',
 'ELAPSED_TIME',
 'AIR_TIME',
 'DISTANCE',
 'WHEELS_ON',
 'TAXI_IN',
 'SCHEDULED_ARRIVAL',
 'ARRIVAL_TIME',
 'ARRIVAL_DELAY',
 'CANCELLED',
 'AIR_SYSTEM_DELAY',
 'SECURITY_DELAY',
 'AIRLINE_DELAY',
 'LATE_AIRCRAFT_DELAY',
 'WEATHER_DELAY']
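
For the time-pattern questions, an hour-of-day variable has to be derived, since the column list has `MONTH`, `DAY`, and `DAY_OF_WEEK` but no hour column. In the preview, `SCHEDULED_DEPARTURE` values such as 10, 20, and 25 look like clock times stored as HHMM integers (00:10, 00:20, 00:25). That reading is an assumption, as are the helper name `scheduled_hour`, the `SCHED_HOUR` column, and the toy rows in this sketch.

```python
import pandas as pd


def scheduled_hour(hhmm: pd.Series) -> pd.Series:
    """Convert HHMM integers (e.g. 1345 for 13:45) to the hour of day (0-23)."""
    # Integer-divide by 100 to drop the minutes; % 24 folds a possible 2400 back to hour 0.
    return (hhmm // 100) % 24


# Toy data for illustration only (not taken from the real dataset).
toy = pd.DataFrame({
    "SCHEDULED_DEPARTURE": [10, 25, 630, 645, 1745],
    "DEPARTURE_DELAY": [-8.0, -5.0, 12.0, 30.0, 4.0],
})

toy["SCHED_HOUR"] = scheduled_hour(toy["SCHEDULED_DEPARTURE"])
mean_delay_by_hour = toy.groupby("SCHED_HOUR")["DEPARTURE_DELAY"].mean()
print(mean_delay_by_hour)
```

The same `groupby` pattern applies to `MONTH`, `DAY`, and `DAY_OF_WEEK`, which are already present as columns.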

# Exploratory Data Analysis

## Airlines and Airports

## Months and Days

## Predictions

## Conclusion

## Author Contributions
Each team member's contributions to this project are described below.

#### Jakob Bjerre Eriksen
.

#### Keyla Jaylin Barcenas
.

#### Keval Darshan Amin
.

#### Noa Adriana Gonzalez
.

## Other

```{bibliography}
```