# Final Report Requirements
- For this phase, please turn in your notebook and your business write-up (this write-up should be part of your notebook).
- This final report should be one complete, well-manicured notebook that tells a story. It should have a beginning, middle, and end. 
- It is expected that you have done many experiments. What we would like to see is a summary of the key experiments (successes and surprises).
- In short, your report should be business-like in structure and language. Please try to use tables and diagrams as much as possible to shed light on and support your findings. 
- In your final report, write a gap analysis of your best pipeline.
- Finally, the abstract, the discussion, and the conclusions are key. So please devote time to fleshing these out carefully.
- NO CODE IN THIS NOTEBOOK!

# Final Report Rubric

## In-class presentation (10 pts)
In-Class Presentation should have a logical and business flow to it. In more detail, your In-Class Presentation should have a logical and scientific flow to it with main sections for each of the following:

- a title slide (with the project name, Group Number, the team member names, and photos).
- an abstract slide
- Make sure it has an outline slide with good descriptive section headings
- Team names, photos
- Project description
- Some summary visual EDA
- Feature engineering and Top features
- Overview of Modeling Pipelines explored
- Results and discussion of results (Accuracy, ROC/AUC, etc.. from this phase and previous phases)
- Conclusions (best performing model, number of features, top 10 best features, hyper-parameters) and next steps

## Team and project meta information (10 pts)
Please provide the following:
* Team ID
* The complete list of team members and project meta information (e.g., **email**).

* Credit assignment plan updates (who does/did what and when, amount of effort in terms of person-hours, start and end dates, estimated improvement in terms of key metrics) in Table format
No Credit assignment plan means ZERO points
A credit assignment plan not in Table format means ZERO points
No start and end dates and (budgeted) hours of effort mean an incomplete plan. This may result in zero points.

## Project Abstract (10 pts)
- Final Abstract: The final form of the abstract! It should have everything covered in previous phases, plus the new experiments and the final model selected, as well as your final results (report the number!)
- Make sure to describe what you focused on and accomplished in this project (include this phase and previous phases). Have a look at the expectations with regard to a good abstract.

## Data and feature engineering (10 pts)
- Summarize the data lineage and key data transformations (joins)
- List of feature families explored and explanation of each
- List of features within each family and description of each, along with THEIR EDA
- Please refer to experiments showing the value of each feature/family

## Neural Network (MLP) (10 pts)
You are expected to train a Neural Network
- Implement Neural Network (NN) model
- Experiment with at least 2 different Network architectures and report results.
- Must show training and performance scores, **including training curves by epoch**

## Leakage (10 pts)
- Define what leakage is and provide a hypothetical example of leakage
- Go through your Pipeline and check if there is any leakage.
- Are you violating any cardinal sins of ML?
- Describe how your pipeline does not suffer from any leakage problem and does not violate any cardinal sins of ML

## Modeling Pipelines (10 pts)
Expectations here are to provide the following in sections and subsections:

- A visualization of the modeling pipeline (s) and subpipelines if necessary
- Families of input features and count per family
- Number of input features
- Hyperparameters and settings considered
- Loss function used (data loss and regularization parts) in latex
- Number of experiments conducted
- Experiment table with the following details per experiment:
    - Baseline experiment
    - Any additional experiments
    - Final model tuned
    - best results (1 to three) for all experiments you conducted with the following details
    - Computational configuration used
    - Wall time for each experiment

## Results and discussion of results (20 pts)
Expectations here are to provide the following: The goal of the Discussion section is to present an interpretation of key results, which means to explain, analyse, and compare them (results from all the phases). Often, this part is the most important, simply because it lets the researcher take a step back and take a broader look at all experiments conducted. Do not discuss any outcomes not presented in the results part.

Make sure to provide the following in sections and subsections:
- Your experiments are properly enumerated/tabulated and discussed (accurate descriptions, performance metrics)
- Discuss results not substantiated in your experimental section above in the modeling pipelines
- Provide gap analysis

## Conclusion (10 pts)
Expectations here are to address the following in your conclusion in a main section by itself (150 words or less):

- Restate your project focus and explain why it’s important. Make sure that this part of the conclusion is concise and clear.
- Restate your hypothesis (e.g., ML pipelines with custom features can accurately predict .......)
- Summarize the main points of your project: Remind your readers of your key points (e.g., best features, best model, hyper-parameters, and so on).
- Discuss the significance of your results
- Discuss the future of your project.

## Extra credit
- Deep learning (5 points)
- Recent data (5 points)

# Phase III Project Report
__`Team 4-1`__

`April 19, 2025`

`Phase III led by Erica Landreth`

## Authored By:

<div style="text-align: center; line-height: 10; padding-top: 30px;  padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/authors.png?raw=true" alt="Authors" style="width: 500px">
</div>

## Phase Leader Plan
| Phase |  Phase Leader | Phase Leader Email|
|:---:|:---:|:---:|
| **Phase 0, HW5**: Finalize Teams, and submitting HW5 | Danielle Yoseloff | dyoseloff@berkeley.edu |
| **Phase 1**: Project Plan, describe datasets, joins, tasks, and metrics  | Mohamed Bakr | m.baker@berkeley.edu |
|**Phase 2**: EDA, baseline pipeline, Scalability, Efficiency, Distributed/parallel Training, and Scoring Pipeline| Shruti Gupta | sguptaray@berkeley.edu |
|**Phase 3**: Select the optimal algorithm, fine-tune and submit a final report| Erica Landreth | |

## Credit assignment plan 

| Phase | Team Member | Tasks | Hrs |
|:---:|:---:|:---:|:---:|
|**PHASE 0**| Danielle Yoseloff | Forming Team, Create Slack Channel, and team introduction |  2 |
|**PHASE 1**| Danielle Yoseloff | Machine algorithms and metrics | 8 |
|| | Pipeline Graph | 1 |
||Erica Landreth | Abstract and Report Editing | 3 |
||| EDA | 2.5 |
||| Data Description | 8 |
|| Shruti Gupta | EDA | 4 |
|| |Missing & Null Value Exploration | 4 |
|| Mohamed Bakr | Phase Leader Table, Credit Assignment Plan, and GANTT chart |  8 |
||| Digesting the Data and Checkpointing Strategy | 4 |
|||Report editing and review| 2 |
|**PHASE 2**| Danielle Yoseloff | Feature Engineering | 15|
|| | Slides and Report| 8|
||Erica Landreth | EDA and Cleaning | 11.5 |
||| Pipeline and Cross Validation Development | 6.5 |
||| Feature Engineering | 9.5 |
||| Slides and Report | 12 |
|| Shruti Gupta | EDA and Cleaning | 15 |
|| | Feature Engineering | 10 |
|| | Hyperparameter Tuning and Analysis | 6 |
|| | Report | 6 |
|| Mohamed Bakr | Setting Up Work Environment and GitHub| 1 |
|| | Join and OTPW EDA | 12|
|| | Join Pipeline | 16|
|| | Slides and Report | 8 |
|**PHASE 3**| Danielle Yoseloff | | |
||Erica Landreth | | |
|| Shruti Gupta | | |
|| Mohamed Bakr | | |


**Detailed Plan and GANTT Chart:** https://docs.google.com/spreadsheets/d/1E4A3SaTAEjh9owH4SBUMv987bktwrW4Q6TXCZ5LJ6Xg/edit?usp=sharing

**Note:** Phase 3 plans are tentative and subject to change.

## Abstract 

According to a 2019 FAA study, national airline delay-related costs exceeded $8 billion due to increased operating expenses.[2] Equipping airports with predictive systems for flight disruptions enables proactive mitigation strategies that absorb operational shocks and prevent cascading delays throughout the system. Our team therefore aims to design a machine learning model that classifies, two hours before scheduled departure, whether a flight will be disrupted (delayed or cancelled) or on schedule. We rely on historical Department of Transportation (DoT) flight data and associated National Oceanic and Atmospheric Administration (NOAA) weather station reports from the years 2015 to 2021. All results discussed in this report are based on a 1-year subset of the full dataset covering 2019.

Logistic regression was chosen as the baseline model because of its suitability for a binary outcome, its interpretability, and its strong performance on linearly separable data. We use the F2 score to evaluate model performance, reflecting the airports' priority to penalize false negatives (i.e., incorrectly predicting disrupted flights as on schedule). Initial results are based on airport, flight, and weather features at the airport of origin observed at least 2 hours before scheduled departure time, as well as daily and weekly seasonality components extracted from a Prophet model trained on the 1-year subset. The baseline model achieved an F2 score of 0.4631 on the training set and 0.4858 on the test set. To improve these results, we focused on estimating recency-based features that track arriving aircraft given information that is 2 hours stale. We also introduced interaction terms to capture more nuanced delay trends, accounting for weather and airport conditions that may not be captured by individual features alone. The expanded model achieved an F2 score of 0.5895 on the training set and 0.5732 on the test set, with only slight improvements after adding interactions, suggesting that the recency-based feature family better informed the model's prediction of flight disruptions and slightly improved its generalization capabilities. However, the model still struggles to predict disruptions.
 
Moving forward, to handle the full 5-year dataset, we will pursue more advanced architectures (random forest, XGBoost, multi-layer perceptron neural networks, and graph-based neural networks). We will also engineer graph features to more effectively capture the anticipated effect of delay propagation and introduce yearly seasonality components from a Prophet model trained on the full dataset.





### Research Objective

Our primary customer is airport management and administration; our aim is therefore to use machine learning models to make a binary prediction of flight disruptions 2 hours prior to the scheduled flight departure time using the models described above. We define a disruption as a delay (according to the FAA definition, a flight that departs 15 minutes or more after its scheduled departure) or a cancellation. We consider flight cancellations functionally analogous to long-term delays, similar to those reported as exceeding 24 hours. This approach is based on the idea that cancellations, like long delays, can disrupt resource allocation and operational flow.
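As an illustration of the target definition above, the following minimal sketch derives the binary label from the `DEP_DELAY` and `CANCELLED` columns of the data dictionary. The function name `is_disrupted` and the handling of a missing delay value are assumptions made for this example, not a description of the project code.

```python
from typing import Optional

DELAY_THRESHOLD_MIN = 15.0  # FAA threshold quoted in the text above


def is_disrupted(dep_delay_min: Optional[float], cancelled: float) -> int:
    """Return 1 for a disrupted flight (delayed >= 15 min or cancelled), else 0."""
    if cancelled == 1.0:
        return 1
    if dep_delay_min is None:
        # Assumption: a non-cancelled flight without a recorded delay
        # is treated as on schedule in this sketch.
        return 0
    return int(dep_delay_min >= DELAY_THRESHOLD_MIN)
```

For example, `is_disrupted(15.0, 0.0)` is treated as a disruption, while `is_disrupted(14.0, 0.0)` is not.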

## Data Description

For this phase of our analysis, we focused on flight and weather data from 2019. This section describes the data sources we used and defines the fields relevant to our analysis.

### Data size and source

We used the following data sources for our modeling and analysis:

| Dataset Name     | Dataset Size    | Dataset Description      |Dataset Source   |
| :-------------: | ------------- | ------------- |  ------------- |
| Flights | 2019 1 year data: 14,844,074 rows by 109 columns | DoT historical flight data from the years 2015-2021 (2019 subset) | [4] |
| Weather | 2019 1 year data: 59,270,147 rows by 130 columns | NOAA weather conditions for the corresponding time period | [1], [3] |
| Stations | 5,004,169 rows by 12 columns | The weather station data defines the distances from various weather stations to various airports. |  |
| Airports | 57,421 rows by 12 columns | The airport dataset provides airport metadata and identifiers necessary for joins. |  |



### Data dictionary

This section defines the variables from each source that we used for our initial modeling and analysis.

#### Flights data

The flights data provide metadata for a given flight, and will also help us to study time-series trends and aggregate delay statistics by characteristics such as airport and airline. The below definitions were informed by DoT documentation [4].

| Column |  Raw Data Type | Meaning | Intended Use |
|:---:|:---:|:---:|:---:|
| QUARTER | Integer | Quarter | Categorical variable to capture seasonal-periodic trends |
| MONTH | Integer | Month | Categorical variable to capture month-periodic trends |
| DAY_OF_WEEK | Integer | 1-7: Monday-Sunday | Categorical variable to capture week-periodic trends |
| FL_DATE | String | Flight date | Used in flight timestamp UTC conversion |
| OP_UNIQUE_CARRIER | String | Unique flight carrier ID | Airline categorical variable |
| TAIL_NUM | String | Aircraft tail number (registration code) | Create time-based tracking features |
| ORIGIN | String | Origin airport IATA code | Join to airports data; create route tracking features; match to seasonal components  |
| DEST | String | Destination airport IATA code | Join to airports data; create time-based tracking feature |
| CRS_DEP_TIME | Integer | Scheduled departure time (local, HHMM format) | Create time-based tracking features |
| DEP_TIME | Integer | Actual departure time (local, HHMM format) | Create time-based tracking features |
| DEP_DELAY | Double | Departure delay (min) | Define Boolean departure disruption status; create time-based tracking features |
| TAXI_OUT | Double | Time taxiing out (min) | Create time-based tracking features |
| TAXI_IN | Double | Time taxiing in (min) | Create time-based traffic flow features |
| CRS_ARR_TIME | Integer | Scheduled arrival time (local, HHMM format) | Create time-based tracking features |
| ARR_TIME | Integer | Actual arrival time (local, HHMM format) | Create time-based tracking features |
| ARR_DELAY | Double | Arrival delay (min) | Create time-based tracking features |
| CANCELLED | Double | 1.0/0.0: Cancelled/not cancelled | Define Boolean departure disruption status |
| CRS_ELAPSED_TIME | Double | Scheduled flight duration (min) | Represent anticipated flight length; create time-based tracking features |
| ACTUAL_ELAPSED_TIME | Double | Actual flight duration (min) | Create time-based tracking features |
| AIRTIME | Double | Time between take-off and landing (min) | Represent flight length |
| DISTANCE | Double | Distance between origin and destination airports | Represent flight length |
| YEAR | Integer | Year | Time series feature engineering |
| DAY_OF_MONTH | Integer | 1-31: Day of month | Categorical variable to capture month-periodic trends |
| ORIGIN_CITY_NAME | | Origin Airport, City Name | |
| DEST_CITY_NAME | | Destination Airport, City Name | |

We chose to drop some variables from the full flights table based on redundancy, the proportion of missing values, and relevance to our analysis. These include alternate representations of airport and airline ID's and diversion information.


#### Weather data

The weather data allows us to define weather conditions relevant to an individual flight, as well as characterize longer-term regional weather trends. The below definitions were informed by NOAA documentation [1] and [3].

| Column |  Raw Data Type | Meaning | Intended Use |
|:---:|:---:|:---:|:---:|
| STATION | String | Weather station ID | Key for joining to stations data |
| DATE | String | Date and time (UTC) of weather report | Filter weather reports in time |
| YEAR | Int | Year | Time series feature engineering |
| LATITUDE | String | Station latitude (degrees North) | Characterize station location |
| LONGITUDE | String | Station longitude (degrees East) | Characterize station location |
| REPORT_TYPE | String | Weather report type | Filter to relevant report types |
| HourlyDewPointTemperature | String | Dew point temperature (degrees F) | Define weather conditions |
| HourlyDryBulbTemperature | String | Air temperature (degrees F) | Define weather conditions |
| HourlyPrecipitation | String | Precipitation amount (in) | Define weather conditions |
| HourlyPresentWeatherType | String | String code defining present weather *e.g.* rain or hail | Parse report to fill in missing information |
| HourlyPressureChange | String | Change in pressure (in Hg) | Define weather conditions |
| HourlyRelativeHumidity | String | Relative humidity (percentage) | Define weather conditions |
| HourlyVisibility | String | Horizontal visibility (mi) | Define weather conditions |
| HourlyWetBulbTemperature | String | Wet bulb temperature (degrees F) | Define weather conditions |
| HourlyWindGustSpeed | String | Wind gust speed (mph) | Define weather conditions |
| HourlyWindSpeed | String | Wind speed (mph) | Define weather conditions |
| NAME | String | Weather Station Name | Used to Identify Weather Stations |
| REM | String | Remarks Data Section | Used for imputing some of the missing values |

We chose to drop some variables from the full weather table based on redundancy, the proportion of missing values, and relevance to our analysis. These include alternate station identifiers, daily and monthly averages, and station backup/maintenance information.

#### Weather station data

The weather station data defines the distances from various weather stations to various airports.

| Column |  Raw Data Type | Meaning | Intended Use |
|:---:|:---:|:---:|:---:|
| station_id | String | Weather station ID | Key for joining to weather data |
| lat | Double | Station latitude (degrees North) | Characterize station location |
| lon | Double | Station longitude (degrees East) | Characterize station location |
| neighbor_name | String | Airport name | Sanity check for joins |
| neighbor_state | String | Airport state | Sanity check for joins |
| neighbor_call | String | Airport ICAO code | Key for joining to airport data |
| neighbor_lat | Double | Airport latitude (degrees North) | Characterize airport location |
| neighbor_lon | Double | Airport longitude (degrees East) | Characterize airport location |
| distance_to_neighbor | Double | Haversine Distance (mi) from station to airport | Find weather stations near a given airport |


We chose to drop some variables from the full stations table based on redundancy and relevance to our analysis. These include alternate station and airport identifiers.

#### Airport data

The airport dataset provides airport metadata and identifiers necessary for joins.

| Column |  Raw Data Type | Meaning | Intended Use |
|:---:|:---:|:---:|:---:|
| icao_code | String | Airport ICAO code | Join to stations data |
| type | String | Airport type | Characterize airport operations |
| iso_region | String | ISO code of airport region | Filtering and sanity check for joins |
| iata_code | String | Airport IATA code | Join to flights data |
| coordinates | String | Airport latitude and longitude | Characterize airport location |

We chose to drop some variables from the full airports table based on redundancy and relevance to our analysis. These include alternative identifiers and local, categorical location codes.

### Joining strategy

We chose to join the flights data with raw weather observations instead of using OTPW data for two key reasons:  
* First, OTPW provides weather only at the origin, while we needed weather conditions at both origin and destination.  
* Second, OTPW uses weather at the scheduled departure time, which risks data leakage in a predictive modeling context.

To address this, we developed a multi-step pipeline with several checkpoints, as outlined below:

1. We created a UDF to extract time zones based on each airport’s latitude and longitude (parsed from the `coordinates` column in the airport codes table). The result was a helper table containing `icao_code`, `latitude`, `longitude`, and `timezone`, covering 2,237 unique airports.

2. We cleaned and deduplicated the flights data, retaining only necessary features. The dataset was reduced from a dimension of **14.8M x 109** to **7.4M x 29**.

3. We joined the flights data with the airport codes table twice—once on the `ORIGIN` IATA code and once on the `DEST` IATA code—to obtain the corresponding `icao_code`, `type`, and `iso_region`.

4. We recalculated the distance between all weather stations and airports using the Haversine formula to get accurate proximity values in kilometers.

5. Before joining flights with stations, we identified seven missing `icao_code`s that were not present in the stations dataset. We augmented the stations data by computing distances for those airports using their coordinates from the airport codes table and saved the updated result for reuse.

6. To improve join efficiency, we filtered the stations dataset down to only the closest station per airport. This reduced the station-airport combination dataset from ~5 million rows to just 2,236 rows, significantly reducing shuffle during the join.

7. We joined the flights dataset with the filtered stations data to retrieve the `station_id`, `station_lat`, `station_lon`, `airport_lat`, `airport_lon`, and `station_distance` for both origin and destination.

8. After ensuring there were no missing `icao_code`s in the time zone helper table, we enriched the flights data by joining it with time zone info using `icao_code`. This enabled us to convert the scheduled departure time into UTC (`sched_depart_utc`) and compute `two_hours_prior_depart_UTC` and `four_hours_prior_depart_UTC` using UDFs.

9. The weather dataset was preprocessed by selecting relevant features, converting date and time to UTC, and filtering to only include station-date combinations that matched those in the flights dataset. This reduced the weather data from a dimension of **59.2M x 130** to **4.4M x 18**.

10. Finally, we joined the flights data with weather data twice: once for the origin station and once for the destination station. For both, we matched on station ID and filtered for weather records where the UTC timestamp was between two and four hours before scheduled departure. (See the data description section for selected weather features.)
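To make steps 8 and 10 concrete, here is a minimal pure-Python sketch of the UTC conversion and the 2 to 4 hour weather window. It is not the Spark UDF used in the pipeline: the function names, the assumed `YYYY-MM-DD` format of `FL_DATE`, and the choice to keep the latest eligible report are assumptions for illustration. The caller supplies the airport's time zone as a `tzinfo` object (for example `zoneinfo.ZoneInfo("America/Chicago")`).

```python
from datetime import datetime, timedelta, timezone, tzinfo


def sched_depart_utc(fl_date: str, crs_dep_time: int, tz: tzinfo) -> datetime:
    """Combine a local date (assumed YYYY-MM-DD) and an HHMM integer into UTC."""
    hours, minutes = divmod(crs_dep_time, 100)
    local_midnight = datetime.strptime(fl_date, "%Y-%m-%d").replace(tzinfo=tz)
    # Adding a timedelta also handles an HHMM value of 2400 (next-day midnight).
    local = local_midnight + timedelta(hours=hours, minutes=minutes)
    return local.astimezone(timezone.utc)


def weather_window(depart_utc: datetime):
    """Return the (start, end) bounds: four and two hours before departure."""
    return depart_utc - timedelta(hours=4), depart_utc - timedelta(hours=2)


def latest_report_in_window(reports, depart_utc: datetime):
    """reports: iterable of (utc_timestamp, payload) for one station.

    Assumption: when several reports fall inside the window, keep the latest.
    """
    start, end = weather_window(depart_utc)
    eligible = [r for r in reports if start <= r[0] <= end]
    return max(eligible, key=lambda r: r[0]) if eligible else None
```

Because the upper bound of the window is two hours before scheduled departure, any report selected this way would already have been available at prediction time.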

The full join pipeline took approximately **24 minutes** using **3–6 workers**, producing a final DataFrame of **7.4M x 100** and a Parquet file of ~1.78 GB for the one-year dataset.
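Steps 4 and 6 of the join (Haversine distance in kilometers and selection of the closest station per airport) can be sketched in plain Python as follows. The Earth radius constant, the function names, and the station IDs in the usage note are assumptions for this example; the pipeline itself runs on Spark.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed constant for this sketch)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def closest_station(airport, stations):
    """airport: (lat, lon); stations: dict of station_id -> (lat, lon).

    Returns (station_id, distance_km) for the nearest station.
    """
    return min(
        ((sid, haversine_km(*airport, *loc)) for sid, loc in stations.items()),
        key=lambda pair: pair[1],
    )
```

With hypothetical inputs, `closest_station((37.62, -122.38), {"STN_A": (37.70, -122.20), "STN_B": (40.0, -120.0)})` selects `STN_A`. Keeping one station per airport is what shrinks the station-airport table before the join.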

All location-specific columns were prefixed with `origin_` or `dest_` to clearly indicate their reference point.

To validate the pipeline, we tested it first on a 3-month sample before scaling to one year. We ensured data quality and maintained full lineage tracking throughout. All joins had a **100% match rate**, except for the weather join, which had a **99.86% match rate** for both origin and destination—expected due to slight gaps in available weather records.

This pipeline provides a robust and well-validated dataset that serves as the foundation for downstream feature engineering and modeling.

<div style="text-align: center; line-height: 6; padding-top: 30px;  padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/Join%20Pipeline.jpeg?raw=true" alt="Join Pipeline" style="width: 200px">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/Join%20pipeline%20ERD.png?raw=true" alt="Join ERD" style="width: 200px">

</div>

## EDA

### Missing Value Analysis

#### Weather

The most critical missing values in the weather data were location-based: without latitude or longitude information, we could not match an observation to the nearest airport. To identify stations, we extracted the USAF and WBAN codes from the first and second halves of the weather station ID and parsed the ICAO code from the text report column ("REM"). We then matched whichever attribute was available against the stations dataset to fill in identifying information for the weather data, and filtered out stations outside the United States and its territories.

Missing feature observations in the weather dataset may stem from sensor malfunctions, and they often compounded into several consecutive hours or days of missing data, despite the large number of duplicate reports. Duplicates are defined as multiple reports emitted from the same station at the same time. Our deduplication rule was therefore simply to keep the record with the fewest null values in the hourly-level columns (our columns of interest). The deduplicated dataset with location identifiers was then used as the weather base for the join.
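The deduplication rule can be sketched as below. This is an illustrative pure-Python version, assuming records are dictionaries keyed by the column names from the data dictionary; the three hourly columns listed and the tie-breaking behavior (first record seen wins) are assumptions, not the project's Spark implementation.

```python
# Subset of hourly columns, chosen only for illustration.
HOURLY_COLS = ["HourlyDryBulbTemperature", "HourlyWindSpeed", "HourlyVisibility"]


def dedupe_reports(reports):
    """Keep, per (STATION, DATE), the record with the fewest nulls in HOURLY_COLS."""
    best = {}
    for rec in reports:
        key = (rec["STATION"], rec["DATE"])
        nulls = sum(rec.get(col) is None for col in HOURLY_COLS)
        # Assumption: on a tie, the first record encountered is kept.
        if key not in best or nulls < best[key][0]:
            best[key] = (nulls, rec)
    return [rec for _, rec in best.values()]
```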

<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/RawWeatherNulls.png?raw=true" alt="Null Counts: Raw Weather Data" style="width: 200px">


_Above: Nulls Distribution in Selected Raw Weather Columns_

To address missing values in the weather data used for modeling, we first parsed the remarks column, which contains METAR reports, to extract relevant values. In cases where the METAR reports contained insufficient information or were also missing, we prioritized spatial imputation. This decision was based on the fact that the weather data matched to each flight was already two hours stale, limiting the usefulness of interpolation over time. Airports were geohashed using the python-geohash package at a precision level of 2, which clusters airports into coarse regional buckets to enable spatially coherent imputation. A more granular precision level left too few airports per bucket, whereas a less granular level was too broad to adequately capture region-specific weather conditions.

<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/geohash.jpg?raw=true" alt="Geohashed Regions" style="width: 100px; display: inline-block;"> 

_Above: Example of Geohashed Regions on 1y Training Data_
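To show what a precision-2 bucket means, the sketch below re-implements the standard geohash encoding in pure Python. The report uses the python-geohash package; this stand-in is only meant to make the bucketing reproducible without that dependency, and the airport coordinates in the usage note are approximate values chosen for illustration.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int = 2) -> str:
    """Encode a coordinate as a geohash string of the given length."""
    lat_rng = [-90.0, 90.0]
    lon_rng = [-180.0, 180.0]
    chars = []
    bits, value = 0, 0
    use_lon = True  # geohash interleaves bits starting with longitude
    while len(chars) < precision:
        rng, coord = (lon_rng, lon) if use_lon else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        value <<= 1
        if coord >= mid:
            value |= 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits map to one base-32 character
            chars.append(_BASE32[value])
            bits, value = 0, 0
    return "".join(chars)
```

At precision 2, approximate coordinates for San Francisco (37.62, -122.38) and Oakland (37.72, -122.22) fall in the same bucket, while New York JFK (40.64, -73.78) falls in a different one, which is the kind of coarse regional grouping the imputation relies on.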

For each missing weather observation, we attempted to impute the value from the most recent non-null weather report timestamped 2–6 hours before the flight's scheduled departure at another airport within the same geohash bucket, in order to capture the most recent weather status and events.

In cases where multiple stations within the geohash region had valid reports in that time window, we selected the most recent single record rather than computing an average, to reduce computational complexity. In cases where all stations in a region were down—due to widespread outages or technical issues—we implemented a fallback strategy by computing an exponential moving average (EMA) over the last 8 non-null records prior to the missing timestamp. This parameter was tuned to sufficiently capture remaining nulls without being unnecessarily wide. We chose the EMA approach to balance responsiveness to recent trends with the need to smooth over noise. Importantly, this method does not introduce label leakage: because all weather data were sourced from a 2–4 hour window prior to each flight's scheduled departure, no future data relative to the prediction target was used.
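The two-stage imputation described above could look roughly like the following sketch. Several details are assumptions because the text does not specify them: the EMA smoothing factor (here the common `2 / (n + 1)` convention), the use of the flight's own station history for the fallback, and the T-2h cutoff applied to that history. All function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta


def ema(values, alpha=None):
    """Exponential moving average of values ordered oldest to newest."""
    if not values:
        return None
    if alpha is None:
        alpha = 2.0 / (len(values) + 1)  # assumed smoothing convention
    estimate = values[0]
    for v in values[1:]:
        estimate = alpha * v + (1 - alpha) * estimate
    return estimate


def impute(depart_utc, neighbor_reports, own_history, n_ema=8):
    """Impute one weather value for a flight.

    neighbor_reports: (utc_ts, value_or_None) pairs from other stations
        in the same geohash bucket.
    own_history: (utc_ts, value_or_None) pairs for the flight's own station
        (assumption: the EMA fallback uses this series).
    """
    start = depart_utc - timedelta(hours=6)
    end = depart_utc - timedelta(hours=2)
    in_window = [(ts, v) for ts, v in neighbor_reports
                 if v is not None and start <= ts <= end]
    if in_window:
        # Most recent single record, rather than an average across stations.
        return max(in_window, key=lambda r: r[0])[1]
    # Fallback: EMA over the last n_ema non-null records up to the cutoff.
    past = sorted((ts, v) for ts, v in own_history
                  if v is not None and ts <= end)
    return ema([v for _, v in past[-n_ema:]])
```

Both stages only read records timestamped at least two hours before scheduled departure, so the sketch stays within the information available at prediction time.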

#### Flights

The flights dataset contained a true duplicate of each record, which was expected because each flight is recorded at both its origin and destination airports. The columns attributing delay minutes to causes (carrier, NAS, weather, security, late aircraft) were missing over 50% of their values, so we elected not to use them in the analysis. Time-related columns such as arrival time, actual elapsed time, and departure time contained missing values only for cancelled flights or, in rare cases, diverted flights. Diverted flights made up just 0.27% of the training dataset and diversion-related columns were extremely sparse, so we elected to drop these columns for modeling and analysis. 

The TAIL_NUM column is essential for relating multiple flights by the same aircraft and contained only 0.29% missing values in the training set, so nulls were treated as a missing value indicator that was inherited by dependent features. We also encountered cases where the same aircraft appeared scheduled to depart to different destinations at the same UTC departure timestamp. These apparent duplicates occurred exclusively when one of the records experienced a severe delay or cancellation, so we concluded that they were not true duplicates but reflected inconsistencies in when system snapshots were recorded.

### Raw Features Analysis

Disrupted flights constitute 21% of the training dataset; of these, 10% are cancellations and 90% are delays over 15 minutes. 


Correlation analysis was conducted on the training dataset, which consists of the first three quarters of the year-long 2019 dataset. Initial analysis focused on the relationship between weather features and a constructed departure delay indicator, which identifies flights that were delayed by more than 15 minutes or cancelled. The Spearman correlation results indicated relatively low correlations overall. The strongest correlations with flight delays were observed for precipitation amount (measured in hundredths of an inch) and wind gust speeds, both of which were positively associated with flight disruptions. Additionally, several weather features were found to be highly correlated with one another, suggesting that they may not contribute additional variance or new information to the model.
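For readers unfamiliar with the statistic, Spearman correlation is the Pearson correlation of the ranks of two variables, with tied values sharing an average rank. The sketch below is a small pure-Python version intended only to illustrate the computation against a binary disruption indicator; it is not the code used to produce the figure, and the sample values in the usage note are made up.

```python
def _average_ranks(values):
    """1-based ranks, with tied values assigned the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = _average_ranks(x), _average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / (vx * vy) ** 0.5
```

For instance, `spearman([0.0, 0.1, 0.5, 0.9], [0, 0, 1, 1])` pairs a made-up precipitation series with a binary indicator; ties in the indicator are handled through the shared average ranks.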

<div style="text-align: center; line-height: 6; padding-top: 30px; padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/spearmancorr_weather.png?raw=true" alt="Spearman Correlation: Weather Features" style="width: 50%; height: auto; display: inline-block;">
</div>



We also explored airport-specific associations with delay. In this figure, the marker size represents the relative proportion of flights departing from the airport during the first three-quarters of the year, with a minimum size for visibility, and color represents continuous delay amount in minutes. Less busy airports appear to have more severe delays. This motivated our inclusion of the categorical origin airport type (small, medium, or large size) during the modeling phase. Delays do not appear to be concentrated in regional patterns, and locations outside the continental US did not exhibit significantly different behavior. The visual is limited because it does not display cancellations.

<div style="text-align: center; line-height: 6; padding-top: 30px; padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/airportdelays.jpg?raw=true" alt="Airport Delays" style="width: 50%; height: auto; display: inline-block;">
</div>




## Feature Engineering

To augment the features available natively in the flights and weather datasets, we engineered features related to prior flights and seasonality.

### Prior Flight / Recency Features

#### Overview
Our primary modeling focus was on incorporating recency features, based on the hypothesis that operational status indicators from the preceding flight leg of an aircraft (such as whether the aircraft was delayed, cancelled, departed, or arrived) would be highly predictive of disruption outcomes for the aircraft at the current origin. This decision was further supported by initial exploratory data analysis (discussed further below): Spearman correlation coefficients between raw features and the target variable revealed limited signal in most static flight attributes, and distributional comparisons showed little meaningful variation. This aligns with domain intuition: disruptions in the aircraft's prior leg are likely to propagate and affect on-time performance at the next origin, which is supported by the improved model performance obtained with these engineered features.

Among the recency-based features we created, we selected the following estimates for modeling, each calculated only from information known at least 2 hours before scheduled departure:

1. Binary indicators capturing prior flight’s status:
   - Whether it departed from its previous origin
   - Whether it was delayed at its previous origin
   - Whether it was cancelled at its previous origin
   - Whether it arrived at the current origin

2. Continuous timing features (in minutes):
   - Departure delay at the previous origin
   - Air time of the prior flight
   - Turnaround time between the prior arrival and scheduled departure of the current flight


When incorporating aircraft tracking data, we focused on addressing two major concerns: data quality issues and leakage.

_Data Quality_

We defined a prior flight by three conditions:

1. Consistent aircraft identified by tail number
2. The aircraft's immediate previous destination matches the current origin
3. The aircraft left its immediate previous origin less than 24 hours before the current flight's T-2 threshold (2 hours before its scheduled departure time)

Our first condition assumes that a flight's actual tail number, i.e. its assigned aircraft, is known at the time of evaluation. The second condition is motivated by observed inconsistencies in aircraft flight routes. For example, within one day an aircraft may arrive at airport A, yet the next record for the same aircraft shows it departing from airport B, with no record of a flight from A to B. This condition is intended to enforce data integrity by ensuring that a prior flight really is the flight the aircraft completed to arrive at the current origin. The third condition is also motivated by the possibility of missing flight records and by the need to preserve the meaning of a prior flight. There exist records where a plane's prior flight to its current origin is several days or even months in the past. We believe a "prior flight" that happened too far in the past does not affect current flight delay in the way we are hoping to capture via these recency features. Furthermore, because we do not understand the context for these gaps, we consider it possible that the true prior flight records are not present.

These filters helped reduce the risk of incorporating misleading features derived from incomplete route chains and uphold the expected meaning of our engineered features.
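
The three conditions above can be read as a single predicate over a candidate prior-flight record. The following is a minimal sketch rather than our pipeline code: the function name, argument names, and the choice to measure the 24-hour window from the prior flight's departure time are illustrative assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional

PREDICTION_LEAD = timedelta(hours=2)  # T-2 prediction threshold described in the text
MAX_LOOKBACK = timedelta(hours=24)    # window allowed before the T-2 threshold


def is_valid_prior_flight(
    tail_num: Optional[str],
    prior_tail_num: Optional[str],
    prior_dest: Optional[str],
    current_origin: str,
    prior_dep_time: Optional[datetime],
    current_sched_dep: datetime,
) -> bool:
    """Return True when a candidate record meets all three prior-flight conditions."""
    # Missing identifiers or timestamps cannot be verified, so they fail the check.
    if tail_num is None or prior_tail_num is None or prior_dep_time is None:
        return False
    same_aircraft = tail_num == prior_tail_num       # condition 1
    route_connects = prior_dest == current_origin    # condition 2
    threshold = current_sched_dep - PREDICTION_LEAD
    recent_enough = (threshold - prior_dep_time) < MAX_LOOKBACK  # condition 3
    return same_aircraft and route_connects and recent_enough
```

Records that fail this check are treated as having no usable prior flight, which is the case handled by the fallback rules in the Methods section.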

_Leakage_

We only wanted to incorporate information that would be known at the threshold T-2 hours before the scheduled departure time. This influenced the variables considered in our calculations, based on whether the estimated or actual timestamp data would be available, and how much of the continuous time duration data would be available.

Two core assumptions were made: Firstly, that all prior flights are scheduled more than 2 hours before a record's scheduled departure time. Secondly, that an airport would know at the time threshold whether the immediate prior flight of an aircraft was cancelled. This is because we do not know at what point a flight is declared cancelled.




#### Methods

We began by calculating a lookback cutoff timestamp 26 hours before each flight's scheduled departure (i.e., 24 hours before the T-2 prediction threshold, matching the 24-hour window in the prior flight definition). Using this, we generated lagged features over the aircraft tail number (i.e., unique aircraft identifier), including origin and destination airports, scheduled and actual departure times, delays, and arrival times.

Contingent on meeting the prior flight information criteria, we created the following features:

_Cancellation_: 
- Indicator (boolean): A binary flag indicating whether the previous flight was cancelled. No restriction on timing was applied, as cancellations are often logged early and knowing about them before the prediction threshold (2 hours prior to departure) aligns with our use case.

_Delays:_
- Continuous variable (minutes): Estimated delay of the prior flight, computed based on available data:

   - If the prior flight's scheduled departure and recorded departure were both before the threshold, we simply used the true recorded delay value.
   - If the prior flight was scheduled to depart before the threshold but did not yet have a recorded departure time, we did not attempt to estimate what the further delay might be. Instead, we assumed that it departed at the threshold (2 hours before the current flight's scheduled departure, in UTC) and recorded the delay as the difference between the threshold and the prior flight's scheduled departure time. In the future, this could be fine-tuned by setting a default parameter relative to the estimated prior flight time or by estimating it from some other indicator, but this case accounted for only a small proportion of records and we did not want to introduce additional computational overhead.
   - If the prior flight was scheduled to depart after the threshold, and the route information met the standard, we assumed there would be no delay, as we had no cause to believe there might be. This could also be tuned by calculating the average delay for that route, but as this represented only a small portion of cases, we similarly hesitated to introduce computationally intensive operations.

   - If the prior flight was cancelled or the route information was missing, leaving us without data on the prior flight, we filled in the delay with the most recent non-null delay recorded for the same route (i.e., the same origin-destination pair) and used it as a proxy.
    
       - This decision is based on the understanding that operational disruptions, including delays, are often correlated within the same route. Delays from one leg of a flight route are likely to impact subsequent flights on the same route. The rationale was further validated by EDA on the initial engineered features, which showed that when the prior flight's destination did not match the current flight's origin—an issue that occurred in 3% of the training dataset—58% of those cases led to disrupted outcomes (delays or cancellations). 



- Indicator (boolean): If the prior flight was estimated to have been delayed, or known to have been cancelled, the delay indicator was set to True.
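
The four delay cases above can be sketched as one function. This is an illustrative reading of the rules, not the production Spark logic: the function and argument names are hypothetical, "before the threshold" is treated as a strict comparison, and the route-level fallback value is assumed to be looked up elsewhere and passed in.

```python
from datetime import datetime
from typing import Optional


def estimate_prior_delay_minutes(
    prior_sched_dep: Optional[datetime],
    prior_actual_dep: Optional[datetime],
    threshold: datetime,
    prior_cancelled: bool = False,
    route_fallback_delay: Optional[float] = None,
) -> Optional[float]:
    """Estimate the prior flight's departure delay using only data known at `threshold`."""
    # Cancelled or missing prior flight: fall back to the most recent delay on the same route.
    if prior_cancelled or prior_sched_dep is None:
        return route_fallback_delay
    # Scheduled to depart at or after the threshold: assume no delay.
    if prior_sched_dep >= threshold:
        return 0.0
    # Departure already recorded before the threshold: use the true delay.
    if prior_actual_dep is not None and prior_actual_dep < threshold:
        return (prior_actual_dep - prior_sched_dep).total_seconds() / 60.0
    # Scheduled before the threshold but not yet departed: delay accrued so far.
    return (threshold - prior_sched_dep).total_seconds() / 60.0
```

An actual departure recorded after the threshold is deliberately ignored in the last branch, since that value would not be available at prediction time.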


_Departures_: 

- Indicator (boolean): If and only if the known prior flight departure time met the data quality standard and was before the threshold, the boolean prior flight departure indicator was set to True.

- Estimator (timestamp): The prior departure time was estimated by adding the estimated delay calculation to the scheduled departure time.

_Arrivals_:

- Indicator (boolean): If and only if the prior flight known arrival time met the data quality standard and was before the threshold, the indicator was set to True.

- Estimator (timestamp):
   - If the prior flight arrived before the 2-hour window, we filled this in with the true arrival time.
   - If the prior flight was known to have departed before the threshold, we filled this in by adding the estimated elapsed time to the known departure time.
   - Otherwise, we simply added the estimated elapsed time to the estimated departure time.


_Turnaround time_: 

- To calculate the estimated amount of time the aircraft had between arriving and departing, we took the difference between the estimated arrival time of the previous flight and the estimated departure time of the current flight. If the previous flight was not confirmed, we again estimated this from the calculated turnaround time from the last record of this route being flown.

### Seasonality Features

We expect fluctuation in flight delays on a number of timescales, as travel demand and airports' ability to meet that demand vary over time. For example, delays vary throughout the day with the volume of traffic at an airport, and as delays from earlier in the day impact later flights. Delays also vary throughout the year as travel habits change--*e.g.* consider spring break, winter holidays, summer travel, or ski trips. We have engineered seasonality features to capture these effects quantitatively, in order to provide the model input about what seasonal effects may be at play for a given record.

To produce these features, we trained seasonality models using the Prophet Python library [5]. For a given training dataset (for each cross-validation fold and overall), a Prophet model was fit for each airport using the UTC departure time as the time field and departure delay in minutes as the outcome variable. Each model assumed linear growth, an uncertainty interval width of 90%, and included weekly, daily, and yearly seasonality components. Each model was used to forecast predictions one week into the future, with an hourly frequency (*i.e.* to get the daily and weekly components for each hour throughout the week). These components, along with the airport identifier, were stored in a lookup table. The example below shows the seasonality components for Boston Logan International Airport (BOS), trained on the January to September 2019 training set.

<div style="text-align: center; line-height: 6; padding-top: 30px;  padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/BOS_seasonality_ex_v2.png?raw=true" alt="BOS Seasonality" style="transform: scale(0.2);">
</div>

To apply these seasonality components to the modeling data, the modeling data was joined to this lookup table on airport, day of week, and hour of day, to get the relevant daily and weekly seasonality components for each record. The following table summarizes the resulting features.

| Feature |  Data Type | Description |
|:---:|:---:|:---:|
| daily | Float | Daily seasonality component (offset from trend) in minutes |
| weekly | Float | Weekly seasonality component (offset from trend) in minutes |

Because these seasonality components are *trained* from data, we had to be mindful of leakage when creating these features. To ensure that the features used in cross-validation and in the overall evaluation were not contaminated with test data information, we trained a separate seasonality model for each cross-validation fold and for the overall dataset, utilizing the relevant training dataset in each case (*e.g.* the seasonality model trained on CV fold 1 training data was applied to the CV fold 1 training and test sets).
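
The lookup join described above, applied with the table fitted on the relevant training data, can be sketched in plain Python. This is a simplified illustration rather than the Spark join we ran: the record field names (`origin`, `sched_dep_utc`), the Monday-as-0 day-of-week convention, and the use of the scheduled UTC departure time as the join timestamp are assumptions.

```python
from datetime import datetime
from typing import Dict, Optional, Tuple

# Key: (airport, day_of_week, hour_of_day); value: (daily, weekly) components in minutes.
SeasonalityKey = Tuple[str, int, int]
SeasonalityLookup = Dict[SeasonalityKey, Tuple[float, float]]


def add_seasonality(record: dict, lookup: SeasonalityLookup) -> dict:
    """Return a copy of one flight record with daily and weekly components attached."""
    dep: datetime = record["sched_dep_utc"]
    key = (record["origin"], dep.weekday(), dep.hour)
    # Airports or hours absent from the training-fold lookup get None (to be imputed later).
    components: Tuple[Optional[float], Optional[float]] = lookup.get(key, (None, None))
    daily, weekly = components
    return {**record, "daily": daily, "weekly": weekly}
```

Passing the lookup table as an argument makes the leakage rule explicit: each fold's records are only ever joined to the table fitted on that fold's training data.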

The figure below summarizes the Spearman correlation of the seasonality features with the outcome variable for the full 2019 (1 year) dataset. We see that daily seasonality is moderately correlated with our outcome, but the weekly feature is only very weakly correlated.

<div style="text-align: center; line-height: 6; padding-top: 30px;  padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/seasonality_correlation.png?raw=true" alt="Seasonality Heatmap" style="width: 200px">
</div>

Note that we have chosen to omit yearly seasonality from our feature set here because our full training set has data only for 9 months of the year. Once our training data spans at least a full year, we will add yearly and holiday-based seasonality features.

### Feature Engineering Next Steps

Given the correlation of the prior flight features to our outcome, incorporating additional recency and frequency features seems promising to improve our predictions. We'd like to extend this idea to airport-related features, *e.g.* capturing the average delay and proportion of flights delayed at the origin airport between 2 and 4 hours prior to a scheduled flight. We suspect this will provide valuable information to the model about broader, airport-wide conditions that may lead to delay.

In addition, we would like to explore and incorporate graph-based features. Preliminary ideas include airport degree and betweenness centrality.

Finally, as mentioned above, we will augment our existing seasonality models with yearly and holiday-based components.

## Modeling

### Modeling Pipeline

The following steps and diagram outline our end-to-end modeling workflow. The remainder of this section provides additional details for each step.

1. **Ingestion:** Load raw data into Spark DataFrames
2. **Feature selection:** Drop unnecessary columns
3. **Join:** Combine data sources into joined DataFrame
4. **Feature engineering and imputation, part I:** Add "non-trained" engineered features and fill missing values using time series methods
5. **Split:** Divide data into training, validation, and test splits
6. **Sample:** Undersample training data
7. **Feature engineering and imputation, part II:** Incorporate or fill features based on training data characteristics
8. **Define machine learning pipelines:** Create Spark Pipeline objects for feature transformations and modeling
9. **Hyperparameter tuning:** Use cross-validation to train a model that balances performance and generalizability
10. **Model training:** Train final model(s) using chosen hyperparameters
11. **Model evaluation:** Assess trained models on test data

<div style="text-align: center; line-height: 10; padding-top: 30px;  padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/ml_pipeline.png?raw=true" alt="ML Pipeline" style="width: 500px">
</div>

### Ingestion, Feature Selection, and Join

Raw data were loaded from the provided parquet files and unnecessary columns were dropped on the basis of relevance (as discussed in **Data Dictionary**) or missing value status (as discussed in **Missing Value Analysis**). The weather and flights data were joined (as discussed in **Joining Strategy**) and saved out to an intermediate parquet file.

#### Feature Engineering and Imputation, part I

In our processing pipeline, we make the distinction between processing steps that are "trained" versus "non-trained." Non-trained processing steps are those which can be applied to a single record in isolation, or which depend only on past-time information for a given record. These steps can be performed before splitting the data without introducing leakage.

In contrast, trained features are those which are trained on a reference dataset, and therefore must be trained on the training set only and applied after splitting the data to avoid test set leakage.

In "part I" of feature engineering and imputation, we addressed the non-trained feature processing steps. These include the weather data imputation and prior flight feature engineering, as discussed in the sections **Missing Value Analysis** and **Prior Flight Features**, respectively. Results from this step were written out to an intermediate parquet file.

#### Train, Test, and Cross Validation Splits

For the five-year dataset, we trained our machine learning models on the first four years (2015-2018) and tested on a held-out set consisting of the last year (2019). To validate our models and tune hyperparameters, the training set was further split into 5 cross-validation folds with 20% overlap. The folds and overlap were defined in terms of number of days (*i.e.* the folds were split so that each included the same number of days' worth of data), with the assumption that this would produce splits with comparable numbers of records.

Each fold had approximately **UPDATE** records in train and test each, and the full train and test sets contained 24,321,796 and 18,108,793 records, respectively. See the table below for the date limits of data included in each split, for each fold. Each date cutoff corresponds to midnight UTC, and the maximum times are exclusive.

| Modeling Case |  Train Time Period | Test Time Period |
|:---:|:---:|:---:|
| CV Fold 1 | 12/31/2014 - 10/09/2015 | 10/09/2015 - 07/17/2016 |
| CV Fold 2 | 08/14/2015 - 05/21/2016 | 05/21/2016 - 02/27/2017 |
| CV Fold 3 | 03/27/2016 - 01/01/2017 | 01/01/2017 - 10/10/2017 |
| CV Fold 4 | 11/08/2016 - 08/14/2017 | 08/14/2017 - 05/23/2018 |
| CV Fold 5 | 06/22/2017 - 03/27/2018 | 03/27/2018 - 01/01/2019 |
| Overall | 12/31/2014 - 01/01/2019 | 01/01/2019 - 01/01/2020 |

#### Sampling Strategy


Computing the distances between data points required by oversampling methods such as SMOTE is impractical at the size of our data, so we took advantage of our large number of observations and used undersampling to address the class imbalance in our outcome variable. Our current undersampling method creates a balanced dataset where the number of samples in each class is roughly equal. However, in Phase III, we will refine this strategy to ensure variables of interest are not disproportionately influenced due to the time series quality of the data.
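
As a rough illustration of the idea, random undersampling of the majority class can be sketched as follows. This is a small in-memory sketch under stated assumptions, not our Spark implementation: the function name and fixed seed are hypothetical, and it balances the classes exactly, whereas our distributed sampling only balances them approximately.

```python
import random
from typing import List, Sequence


def undersample_majority(labels: Sequence[int], seed: int = 42) -> List[int]:
    """Return sorted row indices that form a class-balanced sample of binary labels."""
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    # Identify the smaller and larger class by size.
    minority, majority = sorted((positives, negatives), key=len)
    rng = random.Random(seed)
    # Keep every minority row and an equally sized random subset of majority rows.
    kept_majority = rng.sample(majority, len(minority))
    return sorted(minority + kept_majority)
```

Only the training split is sampled this way; validation and test splits keep their natural class balance so that reported metrics reflect real-world conditions.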

#### Feature Engineering and Imputation, part II

In this step, we addressed the "trained" processing steps discussed above. This consisted of applying the seasonality models (as discussed in the section **Seasonality Features**).

#### Machine Learning Pipelines

We used the PySpark Pipelines API to transform our chosen features and train machine learning models. This subsection explains our outcome variable, which features we used for modeling, and how we defined our Pipeline to transform those features and perform modeling.

#### Outcome variable

We have chosen to pursue a binary classification problem characterizing a flight's departure delay status. Our outcome variable is a binary flag indicating whether or not a flight was either delayed (using the FAA definition of 15 or more minutes late) or cancelled. We choose to include cancellations in our "delay" case since they have similar consequences for our stakeholder in our business case of airport resource management.

#### Feature Families

We explored six different feature families in our modeling experiments. The table below summarizes the features in each family:

| Feature Family (# Features) | Feature Name | Type | Raw or Engineered | Description |
|:---:|:---:|:---:|:---:|:---:|
| **Numeric Weather Features (9)** | origin_HourlyDewPointTemperature | Float | Raw | Hourly dew point temp. at origin airport |
|  | origin_HourlyDryBulbTemperature | Float | Raw | Hourly dry bulb temp. at origin airport |
|  | origin_HourlyPrecipitation | Float | Raw | Hourly precipitation at origin airport |
|  | origin_HourlyPressureChange | Float | Raw | Hourly pressure change at origin airport |
|  | origin_HourlyRelativeHumidity | Float | Raw | Hourly relative humidity at origin airport |
|  | origin_HourlyVisibility | Float | Raw | Hourly visibility at origin airport |
|  | origin_HourlyWetBulbTemperature | Float | Raw | Hourly wet bulb temp. at origin airport |
|  | origin_HourlyWindGustSpeed | Float | Raw | Hourly wind gust speed at origin airport |
|  | origin_HourlyWindSpeed | Float | Raw | Hourly wind speed at origin airport |
| **Flight Metadata (7)** | OP_UNIQUE_CARRIER | Categorical | Raw | Carrier (airline) |
|  | ORIGIN_ICAO | Categorical | Raw | Origin airport |
|  | DEST_ICAO | Categorical | Raw | Destination airport |
|  | origin_type | Categorical | Raw |  Origin airport type |
|  | dest_type | Categorical | Raw |  Destination airport type |
|  | DISTANCE | Float | Raw |  Distance between origin and destination airports |
|  | CRS_ELAPSED_TIME | Float | Raw |  Scheduled flight duration |
| **Date Information (5)** | YEAR | Categorical | Raw | Year of flight date |
|  | QUARTER | Categorical | Raw | Quarter of flight date |
|  | MONTH | Categorical | Raw | Month of flight date |
|  | DAY_OF_MONTH | Categorical | Raw | Day of month of flight |
|  | DAY_OF_WEEK | Categorical | Raw | Day of week of flight |
| **Seasonality Components (2)** | daily | Float | Engineered | Daily seasonality component from Prophet model |
|  | weekly | Float | Engineered | Weekly seasonality component from Prophet model |
| **Prior Flight Features (6)** | priorflight_elapsed_time_calc_raw | Float | Engineered | Estimated prior flight duration |
|  | turnaround_time_calc | Float | Engineered | Calculated time between prior flight arrival and present flight departure |
|  | priorflight_depdelay_calc | Float | Engineered | Estimated prior flight delay |
|  | priorflight_isdeparted | Boolean | Engineered | Indicates whether prior flight has departed |
|  | priorflight_isarrived_calc | Boolean | Engineered | Indicates whether prior flight has arrived |
|  | priorflight_isdelayed_calc | Boolean | Engineered | Indicates whether prior flight was delayed |
| **Interaction Terms (2)** | TempPrecipitation | Float | Engineered | Interaction term of origin_HourlyPrecipitation and origin_HourlyWetBulbTemperature |
| | AirportTurnaround | Float | Engineered | Interaction term of origin_type_vec and turnaround_time_calc |

Note that we have chosen not to employ feature selection techniques so far in our modeling, since the total number of features in use is relatively small. As we explore additional engineered features in Phase III, we will implement feature selection techniques like PCA or Lasso regression as appropriate.

#### Feature Transformations

Our final feature set consisted of both numeric and categorical features, which were handled separately within the pipeline. Numeric features were scaled using Spark's MinMaxScaler. Categorical features were one-hot encoded using Spark's StringIndexer (maps text categories to integers) and OneHotEncoder (one-hot encodes integer categories). For some models, interaction terms were also added via Spark Interaction objects. The transformed numeric, categorical, and interaction features were assembled using Spark's VectorAssembler, to be used as input to the classification object.

##### Spark Pipeline Diagram (interaction terms not depicted)
<div style="text-align: center; line-height: 10; padding-top: 30px;  padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/SparkPipeline.png?raw=true" alt="Spark Pipeline" style="width: 500px">
</div>

#### Modeling

For our baseline model, our final transformed features were input to a Spark LogisticRegression object to complete our pipeline and define the machine learning model. We chose logistic regression for our baseline model because of its suitability for a binary outcome, interpretability, and strong performance on linearly separable data. This final pipeline was fit using the training data, then "transformed" the test data to output predictions for evaluation.

#### Modeling Experiments, Hyperparameter Tuning, and Training

In our modeling exploration, we evaluated four logistic regression models, three without regularization and one with regularization:
- **Baseline (BL):** Used Numeric Weather, Flight Metadata, Date Information, and Seasonality Component feature families.
- **Additional Features (AF):** Used all features from BL, plus the Prior Flight feature family.
- **Interaction (INT):** Used all features from AF, plus the Interaction Terms feature family.
- **Regularization (REG):** Used all features from INT, with regParam\* and elasticNetParam\* hyperparameter values of 0.1 and 0.0, respectively. The regularization parameters were chosen via a grid search over the values 0.0, 0.01, and 0.1 for regParam and 0.0, 0.5, and 1.0 for elasticNetParam. The hyperparameters that yielded the highest average cross-validation F2 score were chosen for the final model.

\*The regParam hyperparameter specifies how much the model is penalized for having large weights (larger values indicating more penalty). The elasticNetParam hyperparameter specifies the balance between Lasso and Ridge regression (0 corresponds to Ridge and 1 to Lasso).
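
The grid search for REG covers nine hyperparameter combinations. The sketch below enumerates that grid and shows the selection rule (highest mean cross-validation F2). It is a minimal illustration: `select_best` and its input dictionary of per-fold scores are hypothetical helpers, not part of the Spark API or of our pipeline code.

```python
from itertools import product
from statistics import mean
from typing import Dict, List, Tuple

REG_PARAMS = [0.0, 0.01, 0.1]
ELASTIC_NET_PARAMS = [0.0, 0.5, 1.0]

# All (regParam, elasticNetParam) pairs: 3 x 3 = 9 cases.
PARAM_GRID: List[Tuple[float, float]] = list(product(REG_PARAMS, ELASTIC_NET_PARAMS))


def select_best(cv_f2_by_params: Dict[Tuple[float, float], List[float]]) -> Tuple[float, float]:
    """Return the parameter pair with the highest mean F2 score across folds."""
    return max(cv_f2_by_params, key=lambda params: mean(cv_f2_by_params[params]))
```

Averaging over folds before comparing keeps a single unusually easy fold from determining the chosen hyperparameters.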

Each of the four models was first trained on each fold of the cross-validation data and output predictions to evaluate cross-validation performance. They were then trained on the full training dataset and output predictions for the held-out test set. For all training, binary cross-entropy loss (equation below) was used as the loss function, where m is the number of training records, y is the true binary label, and p-hat is the model's predicted probability of the positive class.

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\hat{p}^{(i)})+(1-y^{(i)})\log(1-\hat{p}^{(i)})\right]$$
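
For readers who prefer code to notation, the loss above can be written as a short function. This is an illustrative sketch and not the Spark internals; the function name and the small clipping constant `eps` (used to avoid taking the log of zero) are assumptions.

```python
import math
from typing import Sequence


def binary_cross_entropy(y_true: Sequence[int], p_hat: Sequence[float], eps: float = 1e-15) -> float:
    """Mean binary cross-entropy over all records, with probabilities clipped to (eps, 1 - eps)."""
    if len(y_true) != len(p_hat) or len(y_true) == 0:
        raise ValueError("y_true and p_hat must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, p_hat):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)
```

Confident wrong predictions are penalized much more heavily than uncertain ones, which is why the loss rewards well-calibrated probabilities and not only correct labels.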

#### Model Evaluation

In choosing an appropriate metric to evaluate our classification performance, we weighed the cost of each classification error:

**Type I error** (*false positive*): The model predicts a delay, but the flight departs on time. This may cause confusion and unnecessary changes from air traffic control. However, it is more acceptable if we are prioritizing caution and want to minimize unexpected delays.

**Type II error** (*false negative*): The model predicts the flight will be on time, but it ends up delayed. Passengers and crew are not prepared for the delay, potentially leading to missed connections, poor customer satisfaction, and operational disruptions. This error type is more costly when unexpected delays cause major disruptions that the stakeholder aims to avoid.

We estimate that Type II errors are roughly twice as costly as Type I errors to our stakeholder. We therefore chose the F-beta metric with a beta value of 2, which weights recall more heavily than precision. The metric is defined below, where TP, FP, and FN refer to the number of true positives, false positives, and false negatives the model predicts, respectively.

$$F_\beta = \frac{\beta^2 + 1}{\frac{\beta^2(TP + FN)}{TP} + \frac{TP+FP}{TP}}$$
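
The metric can be computed directly from confusion-matrix counts. The sketch below uses an algebraically equivalent form of the definition above (multiplying numerator and denominator by TP); the function name is illustrative, and the counts in the usage line are the baseline (BL) held-out counts reported in the Results section.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta from confusion-matrix counts; returns 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    b2 = beta ** 2
    # Equivalent to (1 + b2) / (b2 / recall + 1 / precision).
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


# Baseline (BL) held-out counts from this report.
bl_f2 = f_beta(tp=208_163, fp=617_280, fn=121_060)
print(round(bl_f2, 2))  # 0.49
```

With beta set to 2, each false negative carries four times the weight of a false positive in the denominator, which is how the metric encodes the emphasis on recall.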

## Results and Discussion

The following table summarizes the models' performance on the cross validation splits and the held-out test set.

### F2 Scores by Model

| Split | **BL** |  **AF**  | **INT** | **REG** |
|:---:|:---:|:---:|:---:|:---:|
| CV1 | 0.4 | 0.58 | 0.58 | 0.58 |
| CV2 | 0.38 | 0.56 | 0.56 | 0.58 |
| CV3 | 0.46 | 0.6 | 0.6 | 0.6 |
| CV4 | 0.6 | 0.64 | 0.62 | 0.62 |
| CV5 | 0.47 | 0.57 | 0.57 | 0.56 |
| CV Avg. | 0.46 | 0.59 | 0.59 | 0.59 |
| Held Out | 0.49 | 0.57 | 0.57 | 0.27 |

AF and INT outperform the baseline (BL) on cross-validation (average F2 of 0.59 versus 0.46) and on the held-out set (0.57 versus 0.49), showing that adding features and modeling complexity improves the model's ability to predict flight disruptions. In contrast, the regularized model (REG) matches AF and INT on cross-validation but suffers a severe drop in performance on the held-out set (0.27). AF and INT perform similarly and generalize similarly, suggesting that the few interaction terms added are insufficient to produce significant improvements. All models were run on a cluster with 3-6 workers.

### Baseline (BL)


#### Test Set Performance (F2 Score)
|                    | Predicted Negative | Predicted Positive |
|--------------------|--------------------|--------------------|
| Actually Negative    | 921,316 (TN) | 617,280 (FP) |
| Actually Positive   | 121,060 (FN) | 208,163 (TP)|


The baseline model took approximately 27 minutes to run 5-fold cross-validation on the training set and 7 minutes to evaluate on the held-out set. Interestingly, the baseline's errors were primarily false positives (617,280, versus 121,060 false negatives), which may be a consequence of excessive downsampling. However, as desired, this kept the more costly false negatives comparatively low. The greatest share of false positives was predicted for flights originating from ORD, suggesting that the model over-predicts risk of disruption at that airport. Notably, the model did not predict any false positives for cancelled flights, suggesting cancellation-related signals in the data are strong and distinguishable enough to avoid confusion with delayed flights.










### Additional Features (AF)

#### Test Set Performance (F2 Score)
|                    | Predicted Negative | Predicted Positive |
|--------------------|--------------------|--------------------|
| Actually Negative    | 1,178,979 (TN) | 359,617 (FP) |
| Actually Positive   | 112,143 (FN) | 217,080 (TP)|



This version of the model incorporated the recency features discussed above: an indicator for whether the aircraft was known to have departed from the prior origin, an indicator for whether the aircraft was known to have arrived at the current origin, and an indicator for whether the aircraft was estimated to be delayed at its prior origin. We also incorporated estimators for how much an aircraft was estimated to be delayed at its prior origin and what the turnaround time between its arrival to and departure from the current origin airport would be.

The model took 36 minutes to run 5-fold cross-validation on the training set and 13 minutes to evaluate the held-out set.

Compared to the baseline, misclassifications remained skewed toward false positives, but the distribution between false positives and false negatives was more balanced overall. Notably, many of the false positives from the baseline model were now correctly classified as true negatives. This model did not significantly reduce false negatives; most flights the baseline misclassified as on time remained false negatives even with the recency features. Overall, adding the recency feature family helped the model correctly predict negatives, but did not appear to help it correctly predict positives compared to the baseline.

### Interactions (INT)

#### Test Set Performance (F2 Score)
|                    | Predicted Negative | Predicted Positive |
|--------------------|--------------------|--------------------|
| Actually Negative    | 1,180,245 (TN) | 358,351 (FP) |
| Actually Positive   | 112,314 (FN) | 216,909 (TP)|


To better capture the impact of adverse snow and ice weather events—particularly delays related to de-icing—we introduced an interaction between the hourly wet bulb temperature and precipitation at the origin airport. In parallel, to account for differences in how turnaround windows affect scheduling depending on airport resources, we added an interaction between airport type (small, medium, or large) and the estimated turnaround time.


This model took approximately 43 minutes to run 5-fold cross-validation on the training set and 9 minutes to evaluate the held-out set.

Despite these new interaction terms, performance improvements were minimal. The majority of false negatives remained unchanged, with fewer than 1% reclassified as true positives. A small portion of false positives were correctly reclassified as true negatives, but overall, these interactions did not significantly shift the model’s predictive behavior. This suggests that the current interactions may not be strong enough to meaningfully impact model performance, and that further or more targeted interaction terms may be necessary to see substantial gains.




### Regularization (REG)


#### Test Set Performance (F2 Score)
|                    | Predicted Negative | Predicted Positive |
|--------------------|--------------------|--------------------|
| Actually Negative    | 1,504,215 (TN) | 34,381 (FP) |
| Actually Positive   | 252,597 (FN) | 76,626 (TP)|

We chose to introduce regularization in order to penalize the model for especially large coefficients. The intended effect is to discourage overfitting and therefore produce a more generalizable final model. A regularization parameter of 0.1 was chosen through a grid search as described above. Notably, the grid search chose an elastic net mixing parameter of 0, indicating pure ridge regularization with no lasso component. This makes sense because we do not yet have a large number of features, especially relative to the number of observations.

This model took approximately 5 hours and 36 minutes to run the hyperparameter grid search (9 cases, an average of 37 minutes per case) over 5-fold cross-validation and 9 minutes to run on the held-out test set.

The regularized model performed very similarly to the interactions model during cross-validation. However, it performed much worse on the held-out test set. The false positive rate improved compared to the interactions model, but the false negative rate worsened. Since F2 weights false negatives as more costly, this resulted in a fairly dramatic decrease in F2.
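
To make this trade-off concrete, precision and recall can be recomputed from the held-out confusion matrices for INT and REG reported above (values rounded to three decimals). The F2 expression used here is an algebraically equivalent form of the F-beta definition given in the Model Evaluation section.

$$\text{INT:}\quad \text{Precision} = \frac{216{,}909}{216{,}909 + 358{,}351} \approx 0.377, \qquad \text{Recall} = \frac{216{,}909}{216{,}909 + 112{,}314} \approx 0.659$$

$$\text{REG:}\quad \text{Precision} = \frac{76{,}626}{76{,}626 + 34{,}381} \approx 0.690, \qquad \text{Recall} = \frac{76{,}626}{76{,}626 + 252{,}597} \approx 0.233$$

$$F_2 = \frac{5\,TP}{5\,TP + 4\,FN + FP}: \qquad F_2^{\text{INT}} \approx 0.573, \qquad F_2^{\text{REG}} \approx 0.268$$

REG is considerably more precise, but it recovers only about 23% of disrupted flights, and the recall-weighted F2 penalizes that loss heavily.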


Initially, we expected ridge regularization to prevent overfitting on the training set by penalizing potentially unnecessary complexity introduced by noisy, highly correlated weather features. However, the poor performance on the held-out test set suggests that the regularized model generalizes poorly not because of noise or irrelevant features, but potentially because the penalty over-constrained the model (underfitting), limiting its ability to learn important patterns from the data.



## Conclusion

Accurate prediction of flight delays enables airports to prepare for operational shocks and limit delay propagation. Our aim is to design a machine learning classification model that predicts whether a flight will be delayed, using historical flight and weather data. In this phase, we introduced engineered features for prior flight status and seasonality. Training logistic regression models on these features, plus numeric weather and flight metadata features from the raw data, we produced models that achieved F2 scores ranging from 0.27 to 0.57 on a held-out set. The models with engineered features outperformed those without them, which supports the predictive power of the engineered features. In the next phase, we plan to introduce additional engineered features capturing airport-level delay recency and frequency, the network structure of flights and airports, and long-term seasonality. We will also explore nonlinear models, including XGBoost, Random Forest, and Multi-Layer Perceptron, and will train these models on a five-year dataset. We anticipate that these additions will allow us to produce a robust and performant model that meets our stakeholders' needs.

## Code Notebooks

- Directory and raw data prep: [0.01-mas-dir-and-raw-data-prep](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/4262628234468304?o=4021782157704243#command/7738973093567460)
- Weather data cleaning: [0.03-sg-weather-clean](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/905158478261251?o=4021782157704243#command/8643339954781431)
- Flights data initial cleaning: [1.04-eil-flights-cleaning](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/1032319343318287?o=4021782157704243#command/7738973093566523)
- Flights data EDA: [1.06-eil-flights-eda](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/905158478261252?o=4021782157704243#command/8643339954781548)
- Join pipeline: [0.08-mas-data-join-pipeline](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/905158478261859?o=4021782157704243#command/7738973093566465)
- Prior flight feature engineering: [1.10-dy-joined-prior-feat-eng](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/1556213624083044?o=4021782157704243#command/7738973093573204)
- Seasonality, cross-validation, and modeling setup/development: [3.11-eil-joined-modeling](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/1556213624087093?o=4021782157704243#command/7738973093574471)
- Joined data cleaning: [0.12-sg-joined-cleaning](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/1556213624085050?o=4021782157704243#command/7738973093573832)
- Joined data cleaning and feature engineering: [1.12-sg-joined-cleaning-engineering](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/3581936500212487?o=4021782157704243#command/3581936500212488)
- Joined data modeling: [3.12-sg-joined-modeling](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/1556213624087634?o=4021782157704243#command/6823299216256987)
- Modeling postprocessing: [3.13-modeling-analysis](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/3581936500213637?o=4021782157704243#command/6823299216258268)

## Presentation

- [Slide Deck](https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/Phase_II_Presentation.pdf) presented on April 2, 2025

## Bibliography

<ol>
    <li>"Federal Climate Complex Data Documentation for Integrated Surface Data (ISD)." NOAA NCEI, 12 Jan. 2018, https://www.ncei.noaa.gov/data/global-hourly/doc/isd-format-document.pdf. Accessed 16 Mar. 2025.</li>
    <li>Lee, Kangoh. “Airline operational disruptions and loss-reduction investment.” Transportation Research Part B: Methodological, vol. 177, Nov. 2023, p. 102817, https://doi.org/10.1016/j.trb.2023.102817. </li>
    <li>“Local Climatological Data (LCD) Dataset Documentation.” Local Climatological Data (LCD) Data, NOAA NCEI, www.ncei.noaa.gov/pub/data/cdo/documentation/LCD_documentation.pdf. Accessed 16 Mar. 2025.</li>
    <li>"Reporting Carrier On-Time Performance (1987-present)." Bureau of Transportation Statistics, https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ. Accessed 16 Mar. 2025.</li>
    <li>Taylor, Sean J., and Benjamin Letham. "Forecasting at scale." PeerJ Preprints 5:e3190v2, 2017, https://doi.org/10.7287/peerj.preprints.3190v2.</li>
    <li>“Understanding the Reporting of Causes of Flight Delays and Cancellations.” Bureau of Transportation Statistics, US Department of Transportation, 15 Apr. 2024, www.bts.gov/topics/airlines-and-airports/understanding-reporting-causes-flight-delays-and-cancellations. </li>
</ol>