# Final Report Requirements
- For this phase, please turn in your notebook and your business write-up (this write-up should be part of your notebook).
- This final report should be one complete, well-manicured notebook that tells a story. It should have a beginning, middle, and end. 
- It is expected that you have done many experiments. What we would like to see is a summary of the key experiments (successes and surprises).
- In short, your report should be businesslike in structure and language. Please try to use tables and diagrams as much as possible to shed light on and support your findings. 
- In your final report, write a gap analysis of your best pipeline.
- Finally, the abstract, the discussion, and the conclusions are key. So please devote time to fleshing these out carefully.
- NO CODE IN THIS NOTEBOOK!

# Final Report Rubric

## In-class presentation (10 pts)
In-Class Presentation should have a logical and business flow to it. In more detail, your In-Class Presentation should have a logical and scientific flow to it with main sections for each of the following:

- a title slide (with the project name, Group Number, the team member names, and photos).
- an abstract slide
- Make sure it has an outline slide with good descriptive section headings
- Team names, photos
- Project description
- Some summary visual EDA
- Feature engineering and Top features
- Overview of Modeling Pipelines explored
- Results and discussion of results (Accuracy, ROC/AUC, etc.. from this phase and previous phases)
- Conclusions (best performing model, number of features, top 10 best features, hyper-parameters) and next steps

## Team and project meta information (10 pts)
Please provide the following:
* Team ID
* The complete list of team members and project meta information (e.g., **email**).

* Credit assignment plan updates (who does/did what and when, amount of effort in terms of person-hours, start and end dates, estimated improvement in terms of key metrics) in Table format
No Credit assignment plan means ZERO points
A credit assignment plan not in Table format means ZERO points
No start and end dates and (budgeted) hours of effort mean an incomplete plan. This may result in zero points.

## Project Abstract (10 pts)
- Final Abstract: The final form of the abstract! It should have everything covered in previous phases, plus the new experiments and the final model selected, as well as your final results (report the number!)
- Make sure to describe what you focused on and accomplished in this project (include this phase and previous phases). Have a look at the expectations with regard to a good abstract.

## Data and feature engineering (10 pts)
- Summarize the data lineage and key data transformations (joins)
- List of feature families explored and explanation of each
- List of features within each family and description of each, along with THEIR EDA
- Please refer to experiments showing the value of each feature/family

## Neural Network (MLP) (10 pts)
You are expected to train a Neural Network
- Implement Neural Network (NN) model
- Experiment with at least 2 different Network architectures and report results.
- Must show training and performance scores, **including training curves by epoch**

## Leakage (10 pts)
- Define what leakage is and provide a hypothetical example of leakage
- Go through your Pipeline and check if there is any leakage.
- Are you violating any cardinal sins of ML?
- Describe how your pipeline does not suffer from any leakage problem and does not violate any cardinal sins of ML

## Modeling Pipelines (10 pts)
Expectations here are to provide the following in sections and subsections:

- A visualization of the modeling pipeline (s) and subpipelines if necessary
- Families of input features and count per family
- Number of input features
- Hyperparameters and settings considered
- Loss function used (data loss and regularization parts) in latex
- Number of experiments conducted
- Experiment table with the following details per experiment:
    - Baseline experiment
    - Any additional experiments
    - Final model tuned
    - best results (1 to three) for all experiments you conducted with the following details
    - Computational configuration used
    - Wall time for each experiment

## Results and discussion of results (20 pts)
Expectations here are to provide the following: The goal of the Discussion section is to present an interpretation of key results, which means explaining, analysing, and comparing them (results from all the phases). Often, this part is the most important, simply because it lets the researcher take a step back and take a broader look at all experiments conducted. Do not discuss any outcomes not presented in the results part.

Make sure to provide the following in sections and subsections:
- Your experiments are properly enumerated/tabulated and discussed (accurate descriptions, performance metrics)
- Discuss results not substantiated in your experimental section above in the modeling pipelines
- Provide gap analysis

## Conclusion (10 pts)
Expectations here are to address the following in your conclusion, in a main section by itself (150 words or less):

- Restate your project focus and explain why it’s important. Make sure that this part of the conclusion is concise and clear.
- Restate your hypothesis (e.g., ML pipelines with custom features can accurately predict .......)
- Summarize the main points of your project: Remind your readers of your key points (e.g., best features, best model, hyper-parameters, and so on).
- Discuss the significance of your results
- Discuss the future of your project.

## Extra credit
- Deep learning (5 points)
- Recent data (5 points)

# Phase III Project Report
__`Team 4-1`__

`April 19, 2025`

`Phase III led by Erica Landreth`

## Authored By:

<div style="text-align: center; line-height: 10; padding-top: 30px;  padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/authors.png?raw=true" alt="ML Pipeline" style="width: 500px">
</div>

## Phase Leader Plan
| Phase |  Phase Leader | Phase Leader Email|
|:---:|:---:|:---:|
| **Phase 0, HW5**: Finalize Teams, and submitting HW5 | Danielle Yoseloff | dyoseloff@berkeley.edu |
| **Phase 1**: Project Plan, describe datasets, joins, tasks, and metrics  | Mohamed Bakr | m.baker@berkeley.edu |
|**Phase 2**: EDA, baseline pipeline, Scalability, Efficiency, Distributed/parallel Training, and Scoring Pipeline| Shruti Gupta | sguptaray@berkeley.edu |
|**Phase 3**: Select the optimal algorithm, fine-tune and submit a final report| Erica Landreth | erica.landreth@berkeley.edu |

## Credit assignment plan 

| Phase | Team Member | Tasks | Hrs |
|:---:|:---:|:---:|:---:|
|**PHASE 0**| Danielle Yoseloff | Forming Team, Create Slack Channel, and team introduction |  2 |
|**PHASE 1**| Danielle Yoseloff | Machine algorithms and metrics | 8 |
|| | Pipeline Graph | 1 |
||Erica Landreth | Abstract and Report Editing | 3 |
||| EDA | 2.5 |
||| Data Description | 8 |
|| Shruti Gupta | EDA | 4 |
|| |Missing & Null Value Exploration | 4 |
|| Mohamed Bakr | Phase Leader Table, Credit Assignment plan, and GANTT chart |  8 |
||| Digesting the Data and Checkpointing Strategy | 4 |
|||Report editing and review| 2 |
|**PHASE 2**| Danielle Yoseloff | Feature Engineering | 15|
|| | Slides and Report| 8|
||Erica Landreth | EDA and Cleaning | 11.5 |
||| Pipeline and Cross Validation Development | 6.5 |
||| Feature Engineering | 9.5 |
||| Slides and Report | 12 |
|| Shruti Gupta | EDA and Cleaning | 15 |
|| | Feature Engineering | 10 |
|| | Hyperparameter Tuning and Analysis | 6 |
|| | Report | 6 |
|| Mohamed Bakr | Setting Up Work Environment and GitHub| 1 |
|| | Join and OTPW EDA | 12|
|| | Join Pipeline | 16|
|| | Slides and Report | 8 |
|**PHASE 3**| Danielle Yoseloff | | |
||Erica Landreth | Feature Engineering | 7 |
||  | Modeling and Hyperparameter Tuning | 8.5 |
||  | Project Management, Report, Slides, Figure Generation | 22.5 |
|| Shruti Gupta | | |
|| Mohamed Bakr | | |


**Detailed Plan and GANTT Chart:** https://docs.google.com/spreadsheets/d/1E4A3SaTAEjh9owH4SBUMv987bktwrW4Q6TXCZ5LJ6Xg/edit?usp=sharing

## Abstract

According to a 2019 FAA study, national airline delay-related costs exceeded $8 billion due to increased operating expenses.[2] Equipping airports with predictive systems for flight disruptions enables proactive mitigation strategies to absorb operational shocks and prevent cascading delays throughout the system. Therefore, our team attempted to design a machine learning classification model to predict whether an upcoming flight's departure would be disrupted or not, using information available two hours or more prior to the scheduled departure time, where "disrupted" flights consist of either delays over 15 minutes or cancellations.


Data was sourced from historic Department of Transportation (DoT) flight data and associated National Oceanic and Atmospheric Administration (NOAA) weather station reports from the years 2015 to 2021. All results discussed in this report are with respect to a 5-year subset of the data (2015-2019, inclusive). 

F2 score was used to evaluate model performance, reflecting airports' priority to penalize false negatives (i.e., incorrectly predicting disrupted flights as on schedule). For the train set, F2 scores were computed within time-series cross-validation folds and averaged to summarize overall training scores. Logistic regression was chosen as the baseline model because of its suitability for a binary outcome and its interpretability, achieving an F2 of .475 on train and .478 on test. Three advanced architectures were also considered and tuned to achieve their best performance: random forest (train: .475; test: .478), multilayer perceptron (train: .507; test: .525), and XGBoost (train: .520; test: .535). Additionally, an ensemble of their predictions yielded an F2 score of .554 on the test set. 
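As a reminder of how the metric behaves, F2 is the F-beta score with beta = 2, which weights recall more heavily than precision. The snippet below is a minimal sketch in plain Python; the function name `fbeta_from_counts` and the example counts are illustrative assumptions, not values or code from our pipeline.

```python
def fbeta_from_counts(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta from confusion-matrix counts; beta=2 weights recall over precision."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    # Guard against the degenerate case with no positives predicted or present.
    return (1 + b2) * tp / denom if denom else 0.0


# Hypothetical counts: the same number of errors costs more when they are
# false negatives (missed disruptions) than when they are false positives.
many_false_alarms = fbeta_from_counts(tp=40, fp=60, fn=10)
many_missed = fbeta_from_counts(tp=40, fp=10, fn=60)
```

With these made-up counts, the model that misses disruptions scores lower than the model that raises false alarms, which is the behavior the metric was chosen for.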

The team focused on engineering recency- and network-related features that would inform the model of delay propagation patterns. **not sure what/how much to put here: describe what your focused on and accomplished in this project**




### Research Objective

Our primary customer is airport management and administration; our aim is therefore to use the machine learning models described above to make a binary prediction of flight disruption 2 hours prior to the scheduled departure time. We define a disruption as a delay (according to the FAA definition, a flight that departs 15 minutes or more after its scheduled departure) or a cancellation. We treat flight cancellations as functionally analogous to very long delays, such as those reported as exceeding 24 hours, because cancellations, like long delays, disrupt resource allocation and operational flow. 
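The target definition above can be sketched as a small labeling function. This is an illustration only: the function name `is_disrupted` is hypothetical, and it assumes the `DEP_DELAY` (minutes) and `CANCELLED` (1.0/0.0) fields from the flights data, with a missing delay for cancelled flights.

```python
from typing import Optional

DELAY_THRESHOLD_MIN = 15.0  # FAA definition: 15 minutes or more counts as a delay


def is_disrupted(dep_delay: Optional[float], cancelled: Optional[float]) -> bool:
    """Binary target: True if the flight was cancelled or departed >= 15 min late."""
    if cancelled is not None and cancelled == 1.0:
        return True
    # A missing delay on a non-cancelled flight is treated as "not disrupted"
    # here; that choice is an assumption made for this sketch.
    return dep_delay is not None and dep_delay >= DELAY_THRESHOLD_MIN
```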

## Data Description

For this phase of our analysis, we focused on flight and weather data spanning the years 2015 through 2021. This section describes the data sources we used and defines the fields relevant to our analysis.

### Data size and source

We used the following data sources for our modeling and analysis:

| Dataset Name     | Dataset Size    | Dataset Description      |Dataset Source   |
| :-------------: | ------------- | ------------- |  ------------- |
| Flights 2015-2021 |   74,177,433 rows by 109 columns | DoT historical flight data from the years 2015-2021 | [4] |
| Weather 2015-2021 | 898,983,399 rows by 128 columns | NOAA weather conditions for the corresponding time period | [1], [3] |
| Stations | 5,004,169 rows by 12 columns | The weather station data defines the distances from various weather stations to various airports. |  |
| Airports | 57,421 rows by 12 columns | The airport dataset provides airport metadata and identifiers necessary for joins. |  |



### Data dictionary

This section defines the variables from each source that we used for our modeling and analysis.

#### Flights data

The flights data provide metadata for a given flight, and will also help us to study time-series trends and aggregate delay statistics by characteristics such as airport and airline. The below definitions were informed by DoT documentation [4].

| Column |  Raw Data Type | Meaning | Intended Use |
|:---:|:---:|:---:|:---:|
| QUARTER | Integer | Quarter | Categorical variable to capture seasonal-periodic trends |
| MONTH | Integer | Month | Categorical variable to capture month-periodic trends |
| DAY_OF_WEEK | Integer | 1-7: Monday-Sunday | Categorical variable to capture week-periodic trends |
| FL_DATE | String | Flight date | Used in flight timestamp UTC conversion |
| OP_UNIQUE_CARRIER | String | Unique flight carrier ID | Airline categorical variable |
| TAIL_NUM | String | Aircraft tail number (registration code) | Create time-based tracking features |
| ORIGIN | String | Origin airport IATA code | Join to airports data; create route tracking features; match to seasonal components  |
| DEST | String | Destination airport IATA code | Join to airports data; create time-based tracking feature |
| CRS_DEP_TIME | Integer | Scheduled departure time (local, HHMM format) | Create time-based tracking features |
| DEP_TIME | Integer | Actual departure time (local, HHMM format) | Create time-based tracking features |
| DEP_DELAY | Double | Departure delay (min) | Define Boolean departure disruption status; create time-based tracking features |
| TAXI_OUT | Double | Time taxiing out (min) | Create time-based tracking features |
| TAXI_IN | Double | Time taxiing in (min) | Create time-based traffic flow features |
| CRS_ARR_TIME | Integer | Scheduled arrival time (local, HHMM format) | Create time-based tracking features |
| ARR_TIME | Integer | Actual arrival time (local, HHMM format) | Create time-based tracking features |
| ARR_DELAY | Double | Arrival delay (min) | Create time-based tracking features |
| CANCELLED | Double | 1.0/0.0: Cancelled/not cancelled | Define Boolean departure disruption status |
| CRS_ELAPSED_TIME | Double | Scheduled flight duration (min) | Represent anticipated flight length; create time-based tracking features |
| ACTUAL_ELAPSED_TIME | Double | Actual flight duration (min) | Create time-based tracking features |
| AIRTIME | Double | Time between take-off and landing (min) | Represent flight length |
| DISTANCE | Double | Distance between origin and destination airports | Represent flight length |
| YEAR | Integer | Year | Time series feature engineering |
| DAY_OF_MONTH | Integer | Day of the month | Categorical variable to capture month-periodic trends |
| ORIGIN_CITY_NAME | String | Origin Airport, City Name | Join sanity check |
| DEST_CITY_NAME | String | Destination Airport, City Name | Join sanity check |

We chose to drop some variables from the full flights table based on redundancy, the proportion of missing values, and relevance to our analysis. These include alternate representations of airport and airline ID's and diversion information.


#### Weather data

The weather data allows us to define weather conditions relevant to an individual flight, as well as characterize longer-term regional weather trends. The below definitions were informed by NOAA documentation [1] and [3].

| Column |  Raw Data Type | Meaning | Intended Use |
|:---:|:---:|:---:|:---:|
| STATION | String | Weather station ID | Key for joining to stations data |
| DATE | String | Date and time (UTC) of weather report | Filter weather reports in time |
| YEAR | Int | Year | Time series feature engineering |
| LATITUDE | String | Station latitude (degrees North) | Characterize station location |
| LONGITUDE | String | Station longitude (degrees East) | Characterize station location |
| REPORT_TYPE | String | Weather report type | Filter to relevant report types |
| HourlyDewPointTemperature | String | Dew point temperature (degrees F) | Define weather conditions |
| HourlyDryBulbTemperature | String | Air temperature (degrees F) | Define weather conditions |
| HourlyPrecipitation | String | Precipitation amount (in) | Define weather conditions |
| HourlyPresentWeatherType | String | String code defining present weather *e.g.* rain or hail | Parse report to fill in missing information |
| HourlyPressureChange | String | Change in pressure (in Hg) | Define weather conditions |
| HourlyRelativeHumidity | String | Relative humidity (percentage) | Define weather conditions |
| HourlyVisibility | String | Horizontal visibility (mi) | Define weather conditions |
| HourlyWetBulbTemperature | String | Wet bulb temperature (degrees F) | Define weather conditions |
| HourlyWindGustSpeed | String | Wind gust speed (mph) | Define weather conditions |
| HourlyWindSpeed | String | Wind speed (mph) | Define weather conditions |
| NAME | String | Weather Station Name | Used to Identify Weather Stations |
| REM | String | Remarks Data Section | Used for imputing some of the missing values |

We chose to drop some variables from the full weather table based on redundancy, the proportion of missing values, and relevance to our analysis. These include alternate station identifiers, daily and monthly averages, and station backup/maintenance information.

#### Weather station data

The weather station data defines the distances from various weather stations to various airports.

| Column |  Raw Data Type | Meaning | Intended Use |
|:---:|:---:|:---:|:---:|
| station_id | String | Weather station ID | Key for joining to weather data |
| lat | Double | Station latitude (degrees North) | Characterize station location |
| lon | Double | Station longitude (degrees East) | Characterize station location |
| neighbor_name | String | Airport name | Sanity check for joins |
| neighbor_state | String | Airport state | Sanity check for joins |
| neighbor_call | String | Airport ICAO code | Key for joining to airport data |
| neighbor_lat | Double | Airport latitude (degrees North) | Characterize airport location |
| neighbor_lon | Double | Airport longitude (degrees East) | Characterize airport location |
| distance_to_neighbor | Double | Haversine Distance (mi) from station to airport | Find weather stations near a given airport |


We chose to drop some variables from the full stations table based on redundancy and relevance to our analysis. These include alternate station and airport identifiers.

#### Airport data

The airport dataset provides airport metadata and identifiers necessary for joins.

| Column |  Raw Data Type | Meaning | Intended Use |
|:---:|:---:|:---:|:---:|
| icao_code | String | Airport ICAO code | Join to stations data |
| type | String | Airport type | Characterize airport operations |
| iso_region | String | ISO code of airport region | Filtering and sanity check for joins |
| iata_code | String | Airport IATA code | Join to flights data |
| coordinates | String | Airport latitude and longitude | Characterize airport location |

We chose to drop some variables from the full airports table based on redundancy and relevance to our analysis. These include alternative identifiers and local, categorical location codes.

## Data Lineage and Key Transformations

### Joining strategy



The team chose to construct a joined dataset from the airports, flights, weather, and stations data to minimize leakage. Our modeling objective is to make predictions exclusively with data obtained two hours _before_ scheduled flight departure, but initial EDA on the provided OTPW dataset revealed weather station records obtained _at_ the scheduled departure time.

To address this, we developed a multi-step pipeline to join the **7 years** of flights and weather data from 2015 to 2021 with several checkpoints, as outlined below:

1. We created a UDF to extract time zones based on each airport’s and station's latitude and longitude (parsed from the `coordinates` column in the airport codes and stations tables). The result was two helper tables:
    
    -  An airport time zones table containing `icao_code`, `latitude`, `longitude`, and `timezone`, covering 2,237 unique airports.
    -  A stations time zones table containing `STATION`, `LATITUDE`, `LONGITUDE`, and `timezone`, covering 2,939 unique weather stations. 

2. We cleaned and deduplicated the flights data, retaining only necessary features. The dataset was reduced from a dimension of **74.18M x 109** to **42.43M x 29**.

3. We joined the flights data with the airport codes table twice—once on the `ORIGIN` IATA code and once on the `DEST` IATA code—to obtain the corresponding `icao_code`, `type`, and `iso_region`.

4. We recalculated the distance between all weather stations and airports using the Haversine formula to get accurate proximity values in kilometers.

5. Before joining flights with stations, we identified seven missing `icao_code`s that were not present in the stations dataset. We augmented the stations data by computing distances for those airports using their coordinates from the airport codes table and saved the updated result for reuse.

6. To improve join efficiency, we filtered the stations dataset down to only the closest station per airport. This reduced the station-airport combination dataset from ~5 million rows to just 2,236 rows, significantly reducing shuffle during the join.

7. We joined the flights dataset with the filtered stations data to retrieve the `station_id`, `station_lat`, `station_lon`, `airport_lat`, `airport_lon`, and `station_distance` for both origin and destination.

8. After ensuring there were no missing `icao_code`s in the time zone helper table, we enriched the flights data by joining it with time zone info using `icao_code`. This enabled us to convert the scheduled departure time into UTC (`sched_depart_utc`) and compute `two_hours_prior_depart_UTC` and `four_hours_prior_depart_UTC` using UDFs.

9. The weather dataset was preprocessed by removing duplicates, filtering to USA locations only, selecting relevant features, converting date and time to UTC, and keeping only station-date combinations that matched those in the flights dataset. This reduced the weather data from a dimension of **898.98M x 124** to **29.27M x 18**.

10. Finally, we joined the flights data with the weather data on station ID, keeping only weather records whose UTC timestamp fell between two and four hours before scheduled departure. (See the data description section for selected weather features.)
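To make steps 4 and 6 concrete, the sketch below computes a Haversine distance in kilometers and picks the closest station for one airport. It is a plain-Python illustration rather than the distributed implementation; the names `haversine_km` and `closest_station`, the tuple layout of `stations`, and the Earth radius constant are assumptions introduced here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant for this sketch)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def closest_station(airport_lat, airport_lon, stations):
    """Return (station_id, distance_km) for the nearest station.

    `stations` is an iterable of (station_id, lat, lon) tuples (hypothetical layout).
    """
    return min(
        ((sid, haversine_km(airport_lat, airport_lon, lat, lon)) for sid, lat, lon in stations),
        key=lambda pair: pair[1],
    )
```

Keeping a single closest station per airport is what shrinks the station-airport table before the join, at the cost of ignoring other nearby stations.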

The full join pipeline took approximately **4.5 hours** using **5–10 workers**, producing a final DataFrame of **42.43M x 78** and a Parquet file of ~6.24 GB for the 7-year dataset.

All location-specific columns were prefixed with `origin_`/ to clearly indicate their reference point.

To validate the pipeline, we tested it first on 3-month and 1-year samples before scaling to the full 7 years. We ensured data quality and maintained full lineage tracking throughout. All joins had a **100% match rate**, except for the weather join, which had a **99.86% match rate** for origin, as expected given slight gaps in available weather records.

This pipeline provides a robust and well-validated dataset that serves as the foundation for downstream feature engineering and modeling.

<div style="text-align: center; line-height: 6; padding-top: 30px;  padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/Join%20Pipeline.jpeg?raw=true" alt="Join Pipeline" style="width: 200px">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/Join%20pipeline%20ERD.png?raw=true" alt="Join ERD" style="width: 200px">

</div>

### Key Transformations


Before running the join pipeline, the data had to be cleaned: missing values were handled and duplicate records were removed.

#### Weather

The most critical missing values in the weather data were location-based: without latitude and longitude, we could not match an observation to the nearest airport. To identify stations, we extracted the USAF and WBAN codes from the first and second halves of the weather station ID and parsed the ICAO code from the text report column (`REM`). We then matched whichever attribute was available against the stations dataset to fill in identifying information, and filtered out stations outside the United States and its territories. Missing feature observations may stem from sensor malfunctions and often compounded into several hours or days of consecutive missing data, despite prolific duplicates. We define duplicates as multiple reports emitted by the same station at the same time, so our deduplication rule was simply to keep the record with the fewest null values in the hourly-level columns (our columns of interest). The de-duplicated dataset with location identifiers was then used as the weather base for the join. 

<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/RawWeatherNulls.png?raw=true" alt="Null Counts: Raw Weather Data" style="width: 200px">


_Above: Nulls Distribution in Selected Raw Weather Columns_
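The deduplication rule described above (keep the report with the fewest nulls per station and timestamp) can be sketched as follows. This is a plain-Python illustration on a list of dictionaries, not the distributed implementation; the function name, the three-column `HOURLY_COLS` subset, and the tie-breaking choice (first report seen wins) are assumptions made for the example.

```python
# Illustrative subset of the hourly-level columns of interest.
HOURLY_COLS = ["HourlyDryBulbTemperature", "HourlyVisibility", "HourlyWindSpeed"]


def dedupe_fewest_nulls(reports, hourly_cols=HOURLY_COLS):
    """Keep one report per (STATION, DATE): the one with the fewest null hourly values."""
    best = {}
    for report in reports:
        key = (report["STATION"], report["DATE"])
        nulls = sum(report.get(col) is None for col in hourly_cols)
        # Strict "<" means ties keep the first report encountered (assumed tie-break).
        if key not in best or nulls < best[key][0]:
            best[key] = (nulls, report)
    return [report for _, report in best.values()]


# Hypothetical duplicate reports from the same station at the same time.
sample = [
    {"STATION": "A", "DATE": "2019-01-01T10:00", "HourlyDryBulbTemperature": None,
     "HourlyVisibility": None, "HourlyWindSpeed": "5"},
    {"STATION": "A", "DATE": "2019-01-01T10:00", "HourlyDryBulbTemperature": "41",
     "HourlyVisibility": "10", "HourlyWindSpeed": "5"},
    {"STATION": "B", "DATE": "2019-01-01T10:00", "HourlyDryBulbTemperature": "38",
     "HourlyVisibility": None, "HourlyWindSpeed": None},
]
deduped = dedupe_fewest_nulls(sample)
```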

To address missing values in the weather data used for modeling, we first parsed the remarks column, which contains METAR reports, to extract relevant values. In cases where the METAR reports contained insufficient information or were also missing, we prioritized spatially-based imputation. This decision was based on the fact that the weather data matched to each flight was already two hours stale, limiting the usefulness of interpolation over time. Airports were geohashed using the python-geohash package at a precision level of 2, which clusters airports into coarse regional buckets to enable spatially coherent imputation. A more granular precision level left too few airports per bucket, whereas a less granular level was too broad to capture region-specific weather conditions. 

<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/geohash.jpg?raw=true" alt="Null Counts: Raw Weather Data" style="width: 100px; display: inline-block;"> 

_Above: Example of Geohashed Regions on 1y Training Data_
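To illustrate what a precision-2 geohash bucket is, the sketch below re-implements the standard geohash encoding in pure Python. It is a stand-in written for this example, not a call into the python-geohash package used in the project, and the function name `geohash_encode` is an assumption.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 2) -> str:
    """Encode a latitude/longitude pair as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, bit_count = [], 0, 0
    use_lon = True  # bits alternate between longitude and latitude, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid
            else:
                bits, lon_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)
```

Each added character splits a bucket into 32 smaller cells, which is why moving from precision 2 to precision 3 quickly leaves few airports per bucket.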

For each missing weather observation, we attempted to impute values by pulling the most recent non-null weather reports timestamped between 2–6 hours prior to the flight's scheduled departure from other airports within the same geohash bucket in an attempt to capture immediate recent weather status and events.

In cases where multiple stations within the geohash region had valid reports in that time window, we selected the most recent single record rather than computing an average, to reduce computational complexity. In cases where all stations in a region were down—due to widespread outages or technical issues—we implemented a fallback strategy by computing an exponential moving average (EMA) over the last 8 non-null records prior to the missing timestamp. This parameter was tuned to sufficiently capture remaining nulls without being unnecessarily wide. We chose the EMA approach to balance responsiveness to recent trends with the need to smooth over noise. Importantly, this method does not introduce label leakage: because all weather data were sourced from a 2–4 hour window prior to each flight's scheduled departure, no future data relative to the prediction target was used.
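The EMA fallback can be sketched as below. This is a minimal plain-Python illustration: the function name `ema_fallback` is hypothetical, the input is assumed to be ordered oldest to newest and to contain only records prior to the missing timestamp, and the smoothing factor `alpha = 2 / (span + 1)` is a common convention assumed here rather than a value documented for our pipeline.

```python
from typing import Optional, Sequence


def ema_fallback(history: Sequence[Optional[float]], span: int = 8,
                 alpha: Optional[float] = None) -> Optional[float]:
    """Exponential moving average over the last `span` non-null values.

    `history` must be ordered oldest -> newest and contain only observations
    taken before the timestamp being imputed (so no future data is used).
    """
    values = [v for v in history if v is not None][-span:]
    if not values:
        return None  # nothing to impute from
    if alpha is None:
        alpha = 2.0 / (span + 1)  # assumed smoothing convention
    ema = values[0]
    for v in values[1:]:
        ema = alpha * v + (1 - alpha) * ema  # newer values get more weight
    return ema
```

Restricting the window to the last 8 non-null records keeps the estimate responsive to recent conditions while still smoothing over isolated noisy readings.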

#### Flights [UPDATE STATS OR NOTE THAT THEY ARE ON 1Y DATA]

The flights dataset had true duplicates for each record, which was expected because its information is recorded at both origin and destination airports. The columns attributing delay minutes to causes (carrier, NAS, weather, security, late aircraft) were missing over 50% of their values, so we elected not to use them in the analysis. Time-related columns such as arrival time, actual elapsed time, and departure time contained missing values only for cancelled flights or, in rare cases, diverted flights. Diverted flights made up just 0.27% of the training dataset and diversion-related columns were extremely sparse, so we elected to drop these columns for modeling and analysis. 

The TAIL_NUM column is essential for relating multiple flights by the same aircraft and contained only 0.29% missing values in the training set, so nulls were treated as a missing-value indicator that was inherited by dependent features. We also encountered cases where the same aircraft appeared scheduled to depart to different destinations at the same UTC departure timestamp. These apparent duplicates occurred exclusively when one of the records experienced a severe delay or cancellation, so we concluded that they were not true duplicates but reflected inconsistencies in when system snapshots were recorded.

We also explored airport-specific associations with delay. In this figure, the marker size represents the relative proportion of flights departing from the airport during the first three-quarters of the year, with a minimum size for visibility, and color represents continuous delay amount in minutes. Less busy airports appear to have more severe delays. This motivated our inclusion of the categorical origin airport type (small, medium, or large size) during the modeling phase. Delays do not appear to be concentrated in regional patterns, and locations outside the continental US did not exhibit significantly different behavior. The visual is limited because it does not display cancellations.







<div style="text-align: center; line-height: 6; padding-top: 30px; padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/airportdelays.jpg?raw=true" alt="Join Pipeline" style="width: 50%; height: auto; display: inline-block;">
</div>




## EDA

### Native Features

Six key feature families were considered: weather, flight, recency, seasonality, airport tendency, and specialty graph features.

#### Flights

Disrupted flights constitute 21% of the training dataset; of these, 10% are cancellations and 90% are delays over 15 minutes. 


<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/time_series_EDA.png?raw=true" alt="Time Series Plot" style="width: 100px; display: inline-block;"> 

[Erica INSERT EXPLANATION]

<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/delay_causes_EDA.png?raw=true" alt="Delay Minutes Plot" style="width: 100px; display: inline-block;"> 

[Erica INSERT EXPLANATION]


Correlation analysis was conducted on the training dataset, which consists of the first three-quarters of a year-long dataset spanning 2019. Initial analysis focused on the relationship between weather features and a constructed departure delay indicator, which identifies flights that were delayed by more than 15 minutes or canceled. The Spearman correlation results indicated relatively low correlations overall. The strongest correlations with flight delays were observed for precipitation amount (measured in hundredths of an inch) and wind gust speeds, both of which were positively associated with flight disruptions. Additionally, several weather features were found to be highly correlated with one another, suggesting that they may not contribute additional variance or new information to the model.

[**UPDATE HEATMAP WITH FULL DATASET**]


%md
We also explored airport-specific associations with delay. In this figure, the marker size represents the relative proportion of flights departing from the airport during the first three-quarters of the year, with a minimum size for visibility, and color represents continuous delay amount in minutes. Less busy airports appear to have more severe delays. This motivated our inclusion of the categorical origin airport type (small, medium, or large size) during the modeling phase. Delays do not appear to be concentrated in regional patterns, and locations outside the continental US did not exhibit significantly different behavior. The visual is limited because it does not display cancellations.





<div style="text-align: center; line-height: 6; padding-top: 30px; padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/airportdelays.jpg?raw=true" alt="Airport Delays Map" style="width: 50%; height: auto; display: inline-block;">
</div>




#### Weather

<div style="text-align: center; line-height: 6; padding-top: 30px; padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/spearmancorr_weather_5yr.png?raw=true" alt="Weather Spearman Correlation Matrix" style="width: 50%; height: auto; display: inline-block;">
</div>



## Feature Engineering

To augment the features available natively in the flights and weather datasets, we engineered features related to prior flights, lagged delay statistics, seasonality, and graph structure.

### Prior Flight / Recency Features
[done on 5yr data]

<div style="text-align: center; line-height: 6; padding-top: 30px; padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/turnaround_hist.png?raw=true" alt="Turnaround Time Histogram" style="width: 50%; height: auto; display: inline-block;">
</div>

<div style="text-align: center; line-height: 6; padding-top: 30px; padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/spearmancorr_prioirflight_5yr.png?raw=true" alt="Prior Flight Correlation Matrix" style="width: 50%; height: auto; display: inline-block;">
</div>

Prior flight features 



#### Overview
One of our primary engineering focuses was recency features, based on the hypothesis that operational status indicators from the preceding flight leg of an aircraft (such as whether the aircraft was delayed, cancelled, departed, or arrived) would be highly predictive of disruption outcomes for the aircraft at the current origin. This decision was further supported by initial exploratory data analysis (discussed further below): Spearman correlation coefficients between raw features and the target variable revealed limited signal in most static flight attributes, and distributional comparisons showed little meaningful variation. This aligns with domain intuition: disruptions in the aircraft’s prior leg are likely to propagate and impact on-time performance at the next origin, which is supported by the success in model performance using these engineered features.

Among the recency-based features we created, we selected the following estimates (i.e., calculated only from information known at or before 2 hours before scheduled departure) for modeling:

1. Binary indicators capturing prior flight’s status:
   - Whether it departed from its previous origin
   - Whether it was delayed at its previous origin
   - Whether it was cancelled at its previous origin
   - Whether it arrived at the current origin

2. Continuous timing features (in minutes):
   - Departure delay at the previous origin
   - Air time of the prior flight
   - Turnaround time between the prior arrival and scheduled departure of the current flight


When incorporating aircraft tracking data, we focused on addressing two major concerns: data quality issues and leakage.

_Data Quality_

We defined a prior flight by three conditions:

1. Consistent aircraft identified by tail number
2. The aircraft's immediate previous destination matches the current origin
3. The aircraft left its immediate previous origin less than 24 hours before the current flight's T-2 threshold (2 hours before its scheduled departure time)

Our first condition assumes that a flight's actual tail number, i.e., its assigned aircraft, is known at the time of evaluation. The second condition is motivated by observed inconsistencies in aircraft flight routes. For example, in one day an aircraft arrives at airport A, yet the next record of the same aircraft shows it departing from airport B, with no record of its flight from A to B. This condition is intended to enforce data integrity by ensuring that a prior flight really is the flight the aircraft completed to arrive at the current origin. The third condition is also motivated by the possibility of missing flight records and by the need to uphold the meaning of a prior flight. There exist records where a plane's prior flight to its current origin may be several days or even months in the past. We believe a "prior flight" that happened too far in the past does not affect current flight delay in the way we are hoping to capture via these recency features. Furthermore, because we don't understand the context for these gaps, we consider it possible that the true prior flight activity records are not present.

These filters helped reduce the risk of incorporating misleading features derived from incomplete route chains and uphold the expected meaning of our engineered features.
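
The three conditions above can be sketched as a simple predicate. This is a minimal illustration rather than the project's Spark implementation: the record layout and field names (`tail_num`, `origin`, `dest`, `dep_time`, `sched_dep`) are hypothetical, and whether the scheduled or actual departure time of the prior leg is compared against the 24-hour limit is an assumption.

```python
from datetime import datetime, timedelta

PREDICTION_LEAD = timedelta(hours=2)   # T-2 prediction threshold
MAX_LOOKBACK = timedelta(hours=24)     # maximum age of a prior flight, measured from T-2

def is_valid_prior_flight(prior, current):
    """Return True if `prior` meets the three prior-flight conditions.

    Both arguments are plain dicts with hypothetical field names.
    """
    if prior is None:
        return False
    threshold = current["sched_dep"] - PREDICTION_LEAD
    same_aircraft = prior["tail_num"] == current["tail_num"]      # condition 1
    route_connects = prior["dest"] == current["origin"]           # condition 2
    recent_enough = prior["dep_time"] > threshold - MAX_LOOKBACK  # condition 3
    return same_aircraft and route_connects and recent_enough

current = {"tail_num": "N123", "origin": "BOS", "sched_dep": datetime(2019, 3, 1, 12, 0)}
prior = {"tail_num": "N123", "dest": "BOS", "dep_time": datetime(2019, 3, 1, 7, 0)}
print(is_valid_prior_flight(prior, current))
```

Records that fail the predicate would fall back to the route-level proxies described in the Methods section.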

_Leakage_

We only wanted to incorporate information that would be known at the threshold T-2 hours before the scheduled departure time. This influenced the variables considered in our calculations, based on whether the estimated or actual timestamp data would be available, and how much of the continuous time duration data would be available.

Two core assumptions were made: Firstly, that all prior flights are scheduled more than 2 hours before a record's scheduled departure time. Secondly, that an airport would know at the time threshold whether the immediate prior flight of an aircraft was cancelled. This is because we do not know at what point a flight is declared cancelled.




#### Methods

We began by calculating a lookback cutoff timestamp 26 hours prior to each flight’s scheduled departure (i.e., 24 hours before the T-2 prediction threshold). Using this, we generated lagged features over the aircraft tail number (i.e., unique aircraft identifier), including origin and destination airports, scheduled and actual departure times, delays, and arrival times.

Contingent on meeting the prior flight information criteria, we created the following features:

_Cancellation_: 
- Indicator (boolean): A binary flag indicating whether the previous flight was cancelled. No restriction on timing was applied, as cancellations are often logged early and knowing about them before the prediction threshold (2 hours prior to departure) aligns with our use case.

_Delays:_
- Continuous variable (minutes): Estimated delay of the prior flight, computed based on available data:

   - If the prior flight's scheduled departure and recorded departure were both before the threshold, we simply used the true recorded delay value.
   - If the prior flight was scheduled to depart before the threshold but did not yet have a recorded departure time, we did not attempt to estimate any further delay. Instead, we assumed that it departed at the threshold (2 hours prior, in UTC) and recorded the delay as the difference between the threshold and the prior flight's scheduled departure time. In the future, this could be fine-tuned by setting a default parameter relative to the estimated prior flight time or by estimating it from some other indicator, but it only accounted for a small proportion of cases and we did not want to introduce additional computational overhead.
   - If the prior flight was scheduled to depart after the threshold, and the route information met the standard, we assumed there would be no delay, as we had no cause to believe there would be one. This could also be tuned by calculating the average delay for that route, but as this only represented a small portion of cases, we similarly hesitated to introduce computationally intensive operations.

   - If the prior flight was cancelled or the route information was missing, leaving us without data on the prior flight, we filled in the delay with the most recent non-null delay recorded for the same route (i.e., the same origin-destination pair), using that value as a proxy.
    
       - This decision is based on the understanding that operational disruptions, including delays, are often correlated within the same route. Delays from one leg of a flight route are likely to impact subsequent flights on the same route. The rationale was further validated by EDA on the initial engineered features, which showed that when the prior flight's destination did not match the current flight's origin—an issue that occurred in 3% of the training dataset—58% of those cases led to disrupted outcomes (delays or cancellations). 



- Indicator (boolean): If the prior flight was estimated to have been delayed, or known to have been cancelled, the delay indicator was set to True.
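
The delay estimation rules above can be summarized in a short function. This is a sketch under stated assumptions, not the project's Spark code: the function name, argument names, and the `has_prior_info` flag are hypothetical, and the route-level proxy is assumed to be looked up elsewhere and passed in.

```python
from datetime import datetime

def estimate_prior_delay(sched_dep, actual_dep, threshold,
                         has_prior_info=True, route_proxy_delay=None):
    """Estimated prior-flight departure delay in minutes, using only
    information that is known at `threshold` (T-2)."""
    if not has_prior_info:
        # Cancelled or missing prior leg: fall back to the route-level proxy.
        return route_proxy_delay
    if sched_dep >= threshold:
        # Scheduled after the threshold: assume no delay.
        return 0.0
    if actual_dep is not None and actual_dep <= threshold:
        # Departure already recorded by the threshold: use the true delay.
        return (actual_dep - sched_dep).total_seconds() / 60
    # Scheduled before the threshold but not yet departed:
    # assume it departs exactly at the threshold.
    return (threshold - sched_dep).total_seconds() / 60

threshold = datetime(2019, 3, 1, 10, 0)
print(estimate_prior_delay(datetime(2019, 3, 1, 8, 0), datetime(2019, 3, 1, 8, 30), threshold))
```

Passing only timestamps known at the threshold keeps the estimate free of information from after T-2.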


_Departures_: 

- Indicator (boolean): If and only if the known prior flight departure time met the data quality standard and was before the threshold, the prior flight departure indicator was set to True.

- Estimator (timestamp): The prior departure time was estimated by adding the estimated delay calculation to the scheduled departure time.

_Arrivals_:

- Indicator (boolean): If and only if the prior flight known arrival time met the data quality standard and was before the threshold, the indicator was set to True.

- Estimator (timestamp):
   - If the prior flight arrived before the 2-hour window, we filled this in with the true arrival time.
   - If the prior flight was known to have departed before the threshold, we filled this in by adding the estimated elapsed time to the known departure time.
   - Otherwise, we simply added the estimated elapsed time to the estimated departure time.


_Turnaround time_: 

- To calculate the estimated amount of time the aircraft had between arriving and departing, we took the difference between the estimated arrival time of the previous flight and the estimated departure time of the current flight. If the previous flight was not confirmed, we again estimated this from the calculated turnaround time from the last record of this route being flown.

### Delay Statistics Features

#### Overview

Many flight delays can be attributed to widespread events happening regionally or at a specific airport, *e.g.* a grounding due to lightning or a computer system glitch. In this way, having information about other flights that have recently been delayed at a given airport can provide valuable insight into whether an upcoming flight will be delayed.

With this in mind, we created engineered features to describe the average delay amount and average proportion of flights disrupted at the origin airport for a given record, considering flights departing between two and four hours prior to the record of interest.
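
As a rough illustration of the windowing idea, the sketch below computes both statistics for one flight from a list of earlier departures. The field names, the 15-minute delay cutoff, the half-open window bounds, and the use of actual departure times are all assumptions made for this example; the report's actual computation may differ.

```python
from datetime import datetime, timedelta

DELAY_CUTOFF_MIN = 15  # assumed cutoff: 15 or more minutes late counts as delayed

def lagged_delay_stats(flights, origin, sched_dep):
    """Mean departure delay and proportion delayed among flights that left
    `origin` between 4 and 2 hours before `sched_dep`."""
    window_start = sched_dep - timedelta(hours=4)
    window_end = sched_dep - timedelta(hours=2)
    delays = [f["dep_delay"] for f in flights
              if f["origin"] == origin and window_start <= f["dep_time"] < window_end]
    if not delays:
        return None, None  # no flights in the window; imputation is handled elsewhere
    mean_dep_delay = sum(delays) / len(delays)
    prop_delayed = sum(d >= DELAY_CUTOFF_MIN for d in delays) / len(delays)
    return mean_dep_delay, prop_delayed

flights = [
    {"origin": "BOS", "dep_time": datetime(2019, 3, 1, 8, 30), "dep_delay": 0.0},
    {"origin": "BOS", "dep_time": datetime(2019, 3, 1, 9, 15), "dep_delay": 30.0},
    {"origin": "BOS", "dep_time": datetime(2019, 3, 1, 11, 0), "dep_delay": 90.0},  # inside the 2-hour blackout
    {"origin": "JFK", "dep_time": datetime(2019, 3, 1, 9, 0), "dep_delay": 45.0},   # different airport
]
print(lagged_delay_stats(flights, "BOS", datetime(2019, 3, 1, 12, 0)))
```

Excluding the most recent two hours is what keeps these features consistent with the T-2 prediction threshold.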

#### Methods

[**ERICA FILL IN**]


| Feature |  Data Type | Description |
|:---:|:---:|:---:|
| mean_dep_delay | Float | Mean departure delay of flights at origin airport departing 2-4 hours prior to the flight of interest |
| prop_delayed | Float | Proportion of flights delayed at origin airport, of those departing 2-4 hours prior to the flight of interest |

<div style="text-align: center; line-height: 6; padding-top: 30px;  padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/delay_stats_correlation.png?raw=true" alt="Delay Stats Heatmap" style="width: 100px">
</div>


### Seasonality Features


#### Overview

We expect fluctuation in flight delays on a number of timescales, as travel demand and airports' ability to meet that demand vary over time. For example, delays vary throughout the day with the volume of traffic at an airport, and as delays from earlier in the day impact later flights. Delays also vary throughout the year as travel habits change (*e.g.* spring break, winter holidays, summer travel, or ski trips). We engineered seasonality features to capture these effects quantitatively, in order to give the model input about what seasonal effects may be at play for a given record.


#### Methods
To produce these features, we trained seasonality models using the Prophet Python library [5]. For a given training dataset (for each cross-validation fold and overall), a Prophet model was fit for each airport using the UTC departure time as the time field and departure delay in minutes as the outcome variable. Each model assumed linear growth, an uncertainty interval width of 90%, and included weekly, daily, yearly, and holiday (based on Prophet's US holiday lookup functionality) seasonality components. Each model was used to forecast predictions one week into the future, with an hourly frequency (*i.e.* to get the daily and weekly components for each hour throughout the week), and one year into the future with a daily frequency (*i.e.* to get the yearly and holiday components for each day of the year). These components, along with the airport identifier, were stored in lookup tables. The example below shows the seasonality components for Boston Logan International Airport (BOS), trained on the 2015-2018 training set.

<div style="text-align: center; line-height: 6; padding-top: 30px;  padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/BOS_seasonality_full_data.png?raw=true" alt="BOS Seasonality" style="transform: scale(0.2);">
</div>

To apply these seasonality components to the modeling data, the modeling data was joined to this lookup table. The daily and weekly components were joined on airport, day of week, and hour of day; and the yearly and holiday components were joined on airport, month, and day of month. The following table summarizes the resulting features.

| Feature |  Data Type | Description |
|:---:|:---:|:---:|
| daily | Float | Daily seasonality component (offset from trend) in minutes |
| weekly | Float | Weekly seasonality component (offset from trend) in minutes |
| yearly | Float | Yearly seasonality component (offset from trend) in minutes |
| holidays | Float | Holiday-related seasonality component (offset from trend) in minutes |

Because these seasonality components are *trained* from data, we had to be mindful of leakage when creating these features. To ensure that our cross-validation and overall test sets were not contaminated with test data information, we trained a separate seasonality model for each cross-validation fold and the overall dataset, utilizing the relevant training dataset in each case (*e.g.* the seasonality trained on CV fold 1 training data was applied to the CV fold 1 training and test sets). This does cause some leakage when the seasonality is applied to the same training set that it was trained on, but avoids leakage during evaluation on the test set.
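
The lookup join and the per-split separation can be illustrated with plain dictionaries. This is only a sketch: the dictionary layout, the field names, and the choice to key on the scheduled UTC departure time are assumptions, and the numeric values are placeholders rather than results from this report.

```python
from datetime import datetime

def add_seasonality(record, hourly_lookup, annual_lookup):
    """Attach seasonality components to one flight record via lookups.

    hourly_lookup is keyed on (airport, day of week, hour of day);
    annual_lookup is keyed on (airport, month, day of month).
    """
    dep = record["sched_dep_utc"]
    hourly = hourly_lookup.get((record["origin"], dep.weekday(), dep.hour), {})
    annual = annual_lookup.get((record["origin"], dep.month, dep.day), {})
    return {**record, **hourly, **annual}

dep = datetime(2019, 7, 4, 17, 0)
# One pair of lookup tables per training set, so that a split only ever
# sees components fit on its own training data. Placeholder values only.
lookups_by_split = {
    "cv_fold_1": (
        {("BOS", dep.weekday(), dep.hour): {"daily": 1.0, "weekly": -0.5}},
        {("BOS", dep.month, dep.day): {"yearly": 2.0, "holidays": -1.0}},
    ),
}
hourly, annual = lookups_by_split["cv_fold_1"]
enriched = add_seasonality({"origin": "BOS", "sched_dep_utc": dep}, hourly, annual)
print(enriched)
```

Keeping the tables in a mapping keyed by split makes it harder to apply components fit on one training set to another split's data by accident.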

The figure below summarizes the Spearman correlation of the seasonality features with the outcome variable for the full five-year (2015-2019) dataset. We see that daily seasonality is moderately correlated with the outcome, and yearly seasonality to a lesser extent, but the weekly and holiday features are only very weakly correlated with the outcome.

<div style="text-align: center; line-height: 6; padding-top: 30px;  padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/seasonality_correlation_full.png?raw=true" alt="Seasonality Heatmap" style="width: 200px">
</div>

### Graph Features [Pending]

## Modeling

### Modeling Pipeline

The following steps and diagram outline our end-to-end modeling workflow. The remainder of this section provides additional details for each step.

1. **Ingestion:** Load raw data into Spark DataFrames
2. **Feature selection:** Drop unnecessary columns
3. **Join:** Combine data sources into joined DataFrame
4. **Feature engineering and imputation:** Add engineered features and fill missing values using time series methods
5. **Split:** Divide data into training, validation, and test splits
6. **Sample:** Undersample training data
7. **Define machine learning pipelines:** Create Spark Pipeline objects for feature transformations and modeling
8. **Hyperparameter tuning:** Use cross-validation to train a model that balances performance and generalizability
9. **Model training:** Train final model(s) using chosen hyperparameters
10. **Model evaluation:** Assess trained models on test data

<div style="text-align: center; line-height: 10; padding-top: 30px;  padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/ml_pipeline.png?raw=true" alt="ML Pipeline" style="width: 500px">
</div>

### Ingestion, Feature Selection, and Join

Raw data were loaded from the provided parquet files and unnecessary columns were dropped on the basis of relevance (as discussed in **Data Dictionary**) or missing value status (as discussed in **Missing Value Analysis**). The weather and flights data were joined (as discussed in **Joining Strategy**) and saved out to an intermediate parquet file.

#### Feature Engineering and Imputation

In our processing pipeline, we make the distinction between processing steps that are "trained" versus "non-trained." Non-trained processing steps are those which can be applied to a single record in isolation, or which depend only on past-time information for a given record. These steps can be performed before splitting the data without introducing leakage.

In contrast, trained features are those which are trained on a reference dataset, and therefore must be trained on the training set only and applied after splitting the data to avoid test set leakage.

In "part I" of feature engineering and imputation, we addressed the non-trained feature processing steps. These include the weather data imputation and prior flight and lagged delay statistics feature engineering, as discussed in the sections **Missing Value Analysis**, **Prior Flight Features**, and **Lagged Delay Features**, respectively. Results from this step were written out to an intermediate parquet file.

In "part II" of feature engineering, we addressed the "trained" processing steps discussed above. This consisted of applying the seasonality models (as discussed in the section **Seasonality Features**).

[**UPDATE**]

#### Train, Test, and Cross Validation Splits

For the five-year dataset, we trained our machine learning models on the first four years (2015-2018) and tested on a held-out set consisting of the last year (2019). To validate our models and tune hyperparameters, the training set was further split into 5 cross-validation folds with 20% overlap. The folds and overlap were defined in terms of number of days (*i.e.* the folds were split so that each included the same number of days' worth of data), with the assumption that this would produce splits with comparable numbers of records.

Each fold had approximately 4.5 million records in each of its train and test sets (before downsampling the training data), and the full train and test sets contained 24,313,177 (before downsampling) and 7,409,309 records, respectively. See the table below for the date limits of the data included in each split, for each fold. Each date cutoff corresponds to midnight UTC, and the maximum times are exclusive.

| Modeling Case | Train Time Period | # Train Records (Downsampled) | Test Time Period | # Test Records |
|:---:|:---:|:---:|:---:|:---:|
| CV Fold 1 | 12/31/2014 - 10/09/2015 | 1,840,924 | 10/09/2015 - 07/17/2016 | 4,337,763 |
| CV Fold 2 | 08/14/2015 - 05/21/2016 | 1,426,778 | 05/21/2016 - 02/27/2017 | 4,323,858 |
| CV Fold 3 | 03/27/2016 - 01/01/2017 | 1,591,596 | 01/01/2017 - 10/10/2017 | 4,418,889 |
| CV Fold 4 | 11/08/2016 - 08/14/2017 | 1,805,438 | 08/14/2017 - 05/23/2018 | 4,889,432 |
| CV Fold 5 | 06/22/2017 - 03/27/2018 | 1,742,243 | 03/27/2018 - 01/01/2019 | 5,587,812 |
| Overall | 12/31/2014 - 01/01/2019 | 9,353,252 | 01/01/2019 - 01/01/2020 | 7,409,309 |
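
For readers who want to see how overlapping, day-based folds of this kind can be generated, here is a minimal sketch. It assumes one particular reading of "20% overlap": each fold consists of a train block and a test block of equal length, and each fold starts 80% of one block length after the previous fold. That reading, the function name, and the rounding are assumptions, so the output approximates but does not exactly reproduce the dates in the table above.

```python
from datetime import datetime, timedelta

def make_time_series_folds(start, end, n_folds=5, overlap=0.2):
    """Return (train_start, train_end, test_end) tuples for rolling folds.

    Train covers [train_start, train_end) and test covers [train_end, test_end).
    """
    total_days = (end - start).days
    stride_fraction = 1 - overlap
    # Solve 2 * block + (n_folds - 1) * stride_fraction * block = total_days.
    block = total_days / (2 + (n_folds - 1) * stride_fraction)
    folds = []
    for k in range(n_folds):
        offset = k * stride_fraction * block
        train_start = start + timedelta(days=round(offset))
        train_end = start + timedelta(days=round(offset + block))
        test_end = start + timedelta(days=round(offset + 2 * block))
        folds.append((train_start, train_end, test_end))
    return folds

folds = make_time_series_folds(datetime(2014, 12, 31), datetime(2019, 1, 1))
for train_start, train_end, test_end in folds:
    print(train_start.date(), train_end.date(), test_end.date())
```

In every fold the test block immediately follows the train block in time, so each model is validated only on data later than what it was trained on.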

#### Sampling Strategy


Computing the pairwise distances between data points required by oversampling methods such as SMOTE is impractical at the scale of our data, so we took advantage of our large number of observations and used undersampling to address the class imbalance in our outcome variable. We randomly sampled the majority class without replacement to create a balanced dataset where the number of samples in each class was roughly equal.
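
A minimal sketch of random undersampling without replacement is shown here for illustration. It operates on an in-memory list rather than a Spark DataFrame, the `disrupted` label name and the seed are hypothetical, and it produces exactly equal class counts, whereas a fraction-based sample on a distributed dataset would be only approximately balanced.

```python
import random

def undersample(records, label_key="disrupted", seed=42):
    """Balance a binary dataset by sampling the majority class
    without replacement down to the size of the minority class."""
    rng = random.Random(seed)
    positives = [r for r in records if r[label_key] == 1]
    negatives = [r for r in records if r[label_key] == 0]
    minority, majority = sorted([positives, negatives], key=len)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced

# Toy data with the same 21% disruption rate reported for the training set.
records = [{"id": i, "disrupted": 1 if i < 21 else 0} for i in range(100)]
balanced = undersample(records)
print(len(balanced), sum(r["disrupted"] for r in balanced))
```

Only the training data is resampled; validation and test sets keep their natural class balance so that the evaluation metrics reflect real operating conditions.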

#### Machine Learning Pipelines

We used the PySpark Pipelines API to transform our chosen features and train machine learning models. This subsection explains our outcome variable, which features we used for modeling, and how we defined our Pipeline to transform those features and perform modeling.

##### Outcome variable

We have chosen to pursue a binary classification problem characterizing a flight's departure disruption status. Our outcome variable is a binary flag indicating whether or not a flight was either delayed (using the FAA definition of 15 or more minutes late) or cancelled. We choose to include cancellations in our "disrupted" case since they have similar consequences for our stakeholders.

##### Feature Families

We explored 7 different feature families in our modeling experiments. The table below summarizes the features in each family:

| Feature Family (# Features) | Raw Feature Name | Type | Raw or Engineered | Description | Alias |
|:---:|:---:|:---:|:---:|:---:|:---:|
| **Numeric Weather Features (9)** | origin_HourlyDewPointTemperature | Float | Raw | Hourly dew point temp. at origin airport |Weather-DewTemp|
|  | origin_HourlyDryBulbTemperature | Float | Raw | Hourly dry bulb temp. at origin airport |Weather-DryBulbTemp|
|  | origin_HourlyPrecipitation | Float | Raw | Hourly precipitation at origin airport |Weather-Precipitation|
|  | origin_HourlyPressureChange | Float | Raw | Hourly pressure change at origin airport |Weather-Pressure|
|  | origin_HourlyRelativeHumidity | Float | Raw | Hourly relative humidity at origin airport |Weather-Humidity|
|  | origin_HourlyVisibility | Float | Raw | Hourly visibility at origin airport |Weather-Visibility|
|  | origin_HourlyWetBulbTemperature | Float | Raw | Hourly wet bulb temp. at origin airport |Weather-WetBulbTemp|
|  | origin_HourlyWindGustSpeed | Float | Raw | Hourly wind gust speed at origin airport |Weather-WindGust|
|  | origin_HourlyWindSpeed | Float | Raw | Hourly wind speed at origin airport |Weather-Wind|
| **Flight Categorical Metadata (5)** | OP_UNIQUE_CARRIER | Categorical | Raw | Carrier (airline) |Flight-Carrier|
|  | ORIGIN_ICAO | Categorical | Raw | Origin airport |Flight-Origin|
|  | DEST_ICAO | Categorical | Raw | Destination airport |Flight-Dest|
|  | origin_type | Categorical | Raw |  Origin airport type |Flight-OType|
|  | dest_type | Categorical | Raw |  Destination airport type |Flight-DType|
| **Date Information (5)** | YEAR | Categorical | Raw | Year of flight date |Date-Year|
|  | QUARTER | Categorical | Raw | Quarter of flight date |Date-Quarter|
|  | MONTH | Categorical | Raw | Month of flight date |Date-Month|
|  | DAY_OF_MONTH | Categorical | Raw | Day of month of flight |Date-DoM|
|  | DAY_OF_WEEK | Categorical | Raw | Day of week of flight |Date-DoW|
| **Seasonality Components (4)** | daily | Float | Engineered | Daily seasonality component from Prophet model |Seasonality-D|
|  | weekly | Float | Engineered | Weekly seasonality component from Prophet model |Seasonality-W|
|  | yearly | Float | Engineered | Yearly seasonality component from Prophet model |Seasonality-Y|
|  | holidays | Float | Engineered | Holiday-based seasonality component from Prophet model |Seasonality-H|
| **Prior and Current Flight Details (8)** | priorflight_elapsed_time_calc_raw | Float | Engineered | Estimated prior flight duration |PF-Duration|
|  | turnaround_time_calc | Float | Engineered | Calculated time between prior flight arrival and present flight departure |PF-TurnaroundT|
|  | priorflight_depdelay_calc | Float | Engineered | Estimated prior flight delay |PF-DelayT|
|  | priorflight_isdeparted | Boolean | Engineered | Indicates whether prior flight has departed |PF-IsDepart|
|  | priorflight_isarrived_calc | Boolean | Engineered | Indicates whether prior flight has arrived |PF-IsArrive|
|  | priorflight_isdelayed_calc | Boolean | Engineered | Indicates whether prior flight was delayed |PF-IsDelay|
|  | DISTANCE | Float | Raw |  Distance between origin and destination airports |Distance|
|  | CRS_ELAPSED_TIME | Float | Raw |  Scheduled flight duration |CF-SchedDuration|
| **Lagged Delay Stat Features (2)** | mean_dep_delay | Float | Engineered | Mean departure delay at origin airport 2-4 hours prior |Lag-MeanDelay|
|  | prop_delayed | Float | Engineered | Proportion of flights delayed at origin airport 2-4 hours prior |Lag-PropDelay|
| **Graph Features (1)** | pagerank | Float | Engineered | Origin airport PageRank value |GF-Pagerank|

Note that we have chosen not to employ feature selection techniques in our modeling, since the total number of features in use is relatively small.

The importance of the feature families was explored via two methods: modeling experiments and XGBoost importance scoring. We have provided aliases of the variable names for the purpose of discussion.

For the modeling experiments, we trained one model per feature family that excluded that feature family from the input features, to assess the impact on the modeling results. For more details, see the **Modeling Experiments** section. From this analysis, we found that removing either the seasonality, weather, or lagged delay statistics families negatively impacted model performance, while removing the other feature families actually improved performance, all else equal.

We also explored feature importance by extracting the importance scores from our trained XGBoost model. The below figure summarizes the sum of weight, average cover, and average gain by family. [**EXPLAIN MORE**]

<div style="text-align: center; line-height: 10; padding-top: 30px;  padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/feat_imp_fam_v3.png?raw=true" alt="Feature Importance by Family" style="width: 500px">
</div>

Note that in the figure above, "time" refers to the Lagged Delay Stat feature family.

Examining the feature family importance metrics, we see that the highest weights are represented by weather features, even though weather features have a lower average gain. This could be because the weather family has the most continuous features, so it could be used in more splits based on different ranges. In other words, the weather features are used for splits often, but don't usually result in large improvements in model performance. Metadata features, like the origin airport and flight carrier, are not often used for splits, as shown by their lower weight, but tend to affect many samples, as shown by their higher average cover.

According to the average gain graph, time features, i.e. mean departure delay and delay proportion, contribute the most to model improvement, even though they are not the most frequently used. This implies they have the most value per use. Prior flight features account for the second highest increase in average gain, are used often for splits, and affect many samples. Even though all prior flight features are calculated with information that is at least 2 hours stale at the time of scheduled departure, they are impactful in improving model performance. This suggests that further development should go into prior flight features.

Somewhat surprisingly, the categorical date features account for a larger share of the average gain increase than the seasonality features. This could suggest redundancy or interference between the two, where the presence of one makes the other less informative to the model.

Overall, we note that engineered feature families account for the largest proportions of increases in average gain.

##### Feature Transformations

Our final feature set consisted of both numeric and categorical features, which were handled separately within the pipeline. Numeric features were scaled using Spark's MinMaxScaler. Categorical features were one-hot encoded using Spark's StringIndexer (maps text categories to integers) and OneHotEncoder (one-hot encodes integer categories). The transformed numeric, categorical, and interaction features were assembled using Spark's VectorAssembler, to be used as input to the classification object.



###### Spark Pipeline Diagram

<div style="text-align: center; line-height: 10; padding-top: 30px;  padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/data_transformation_pipeline.png?raw=true" alt="Data Transformation Pipeline" style="width: 500px">
</div>

##### Modeling

We explored four machine learning modeling architectures throughout our experiments. In all cases, time series cross-validation was used to obtain training and validation scores. For the multilayer perceptron and XGBoost models, all five models trained during cross-validation were used to generate predictions on the test set; these predictions were then exponentially weighted, such that more recent folds had a higher weight, and averaged to produce a final prediction. After some experimentation, we set the exponential weighting parameter to 0.5. For the random forest and logistic regression models, one single pipeline was fit to the entire training set after cross-validation was completed, and used to predict on the test set.
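
To make the fold-weighting idea concrete, the sketch below shows one possible exponential scheme. The exact formula is an assumption made for illustration: fold `k` of `K` receives a raw weight of `alpha ** (K - 1 - k)`, so the most recent fold has the largest weight, and the weights are normalized to sum to one. The function names are hypothetical.

```python
def fold_weights(n_folds, alpha=0.5):
    """Normalized exponential weights; later (more recent) folds weigh more."""
    raw = [alpha ** (n_folds - 1 - k) for k in range(n_folds)]
    total = sum(raw)
    return [w / total for w in raw]

def ensemble_probability(fold_probs, alpha=0.5):
    """Weighted average of per-fold predicted probabilities for one record."""
    weights = fold_weights(len(fold_probs), alpha)
    return sum(w * p for w, p in zip(weights, fold_probs))

weights = fold_weights(5)
print([round(w, 3) for w in weights])
print(ensemble_probability([0.2, 0.3, 0.4, 0.5, 0.6]))
```

With `alpha = 0.5` and five folds, each fold counts twice as much as the one before it, so the two most recent folds carry roughly three quarters of the total weight.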








###### Logistic Regression

We chose logistic regression for our baseline model because of its suitability for a binary outcome, interpretability, and strong performance on linearly separable data. With a simple setup and hyperparameter tuning, logistic regression was a straightforward but strong method with our engineered features.

###### Random Forest
Random forest was chosen because bagged methods perform well at scale. Additionally, we hoped that using an ensemble would mitigate overfitting while also providing interpretable feature importance artifacts. Another benefit of tree methods is their ability to handle null values.

###### XGBoost
Similarly to random forest, XGBoost was chosen for its interpretability and its capacity to optimally leverage our engineered features. We anticipated a slower training time compared to random forest due to the sequential boosted nature but felt that the gains in performance were valuable.

###### Multi-layer Perceptron
A multilayer perceptron model was also chosen for its flexibility and ability to handle complex relationships through deep learning techniques. The tradeoffs of this model include its low interpretability and higher training time, and its reliance on finding the optimal hyperparameter combination. 

#### Modeling Experiments, Hyperparameter Tuning, and Training

We performed modeling experiments to assess and compare the performance of our four chosen architectures, the effect of adjusting hyperparameters, and the importance of feature families. For hyperparameter tuning, we used the Python hyperopt library, which uses Bayesian optimization to efficiently explore the parameter space. See more details in the **Modeling Experiments** section.

For each experiment, a model was first trained on each fold of the cross-validation data and used to output predictions for evaluating cross-validation performance. The model was then trained on the full training dataset and used to output predictions for the held-out test set. For all training, binary cross-entropy loss (equation below) was used as the loss function.

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\hat{p}^{(i)})+(1-y^{(i)})\log(1-\hat{p}^{(i)})\right]$$
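
As a worked illustration of the loss above, the following sketch evaluates it directly in plain Python. The function name and the small clipping constant `eps`, which guards against taking the logarithm of zero, are assumptions added for the example and are not part of the formula.

```python
import math

def binary_cross_entropy(y_true, p_hat, eps=1e-15):
    """Mean binary cross-entropy over m examples.

    y_true holds 0/1 labels and p_hat the predicted probability of class 1.
    """
    total = 0.0
    for y, p in zip(y_true, p_hat):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# A model that always predicts 0.5 scores log(2), about 0.693.
print(binary_cross_entropy([1, 0], [0.5, 0.5]))
```

Confident wrong predictions are penalized much more heavily than uncertain ones, which is why the loss rewards well-calibrated probabilities.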

#### Model Evaluation

In choosing an appropriate metric to evaluate our classification performance, we weighed the cost of each classification error:

**Type I error** (*false positive*: the model predicts a delay, but the flight departs on time): This may cause confusion and unnecessary changes from air traffic control. However, it is more acceptable if we are prioritizing caution and want to minimize unexpected delays.

**Type II error** (*false negative*: the model predicts the flight will be on time, but it ends up delayed): Passengers and crew aren’t prepared for the delay, potentially leading to missed connections, poor customer satisfaction, and operational disruptions. This type is more costly if unexpected delays cause major disruptions and the goal is to avoid them at all costs.

We estimate that Type II errors are roughly twice as costly as Type I errors to our stakeholder, so we chose the F-beta metric with a beta value of 2 to reflect the emphasis on recall over precision. The metric is defined below, where TP, FP, and FN refer to the number of true positives, false positives, and false negatives the model predicts, respectively.

$$F_\beta = \frac{\beta^2 + 1}{\frac{\beta^2(TP + FN)}{TP} + \frac{TP+FP}{TP}}$$
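
To show how a beta of 2 shifts the score toward recall, here is a minimal Python sketch that evaluates the formula above term by term. The function name `f_beta` and the confusion counts are hypothetical and chosen only for illustration; they are not results from our models.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta from confusion counts, written in the same form as the equation above."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    b2 = beta ** 2
    inv_recall = (tp + fn) / tp      # 1 / recall
    inv_precision = (tp + fp) / tp   # 1 / precision
    return (b2 + 1) / (b2 * inv_recall + inv_precision)


# Hypothetical counts: the two cases swap precision (0.4 / 0.6) and recall (0.6 / 0.4).
high_recall = f_beta(tp=60, fp=90, fn=40)   # recall 0.6, precision 0.4
high_precision = f_beta(tp=60, fp=40, fn=90)  # recall 0.4, precision 0.6
```

With these illustrative counts, the high-recall case scores higher under F2 even though the two cases would tie under F1, which is the behavior we want given the cost of false negatives.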

## Modeling Experiments

### Hyperparameter Tuning

We used a Bayesian optimization approach to perform hyperparameter tuning for our random forest and multi-layer perceptron architectures. We chose not to tune the logistic regression model because our Phase II work found minimal impact from incorporating regularization or interaction features. We chose not to tune the XGBoost models because adjusting most of the available hyperparameters would have made our model more conservative, and we did not see overfitting in the cross-validation results.


#### Multi-layer Perceptron

| Description | CV Fold F2 Scores | Average Score | Test Score | Scores Std. Dev |
|:---:|:---:|:---:|:---:|:---:|
| Baseline | [0.46, 0.51, 0.51, 0.47, 0.50] | 0.4912 | 0.4725 | |
| No weather | [0.46, 0.51, 0.51, 0.47, 0.50] | 0.4912 | 0.4933 | |
| Tuned | [0.45, 0.47, 0.49, 0.42, 0.46] | 0.4575* | 0.5254 | |

Three main MLP variants were considered to support feature selection before running hyperparameter tuning experiments. The base (default) set of features present in each variant was:

- Seasonality features: daily, weekly, yearly, holiday
- Delay statistic features: Mean departure delay, proportion delayed
- Graph features: pagerank

First, a model was run with limited flight features to help evaluate the predictive power of weather features:

- Flight features: flight distance, scheduled elapsed time
- Prior flight features: turnaround time calculation, estimated prior flight delay 
- Weather features: Hourly dew point temperature, dry and wet bulb temperatures, precipitation, pressure change, visibility, relative humidity, and wind and wind gust speeds

Next, a model was run without any weather features with an emphasis on airport and aircraft characteristic features:
- Flight features: Flight distance, scheduled elapsed time, airline carrier, quarter, month, day of month, day of week, origin type, origin region
- Prior flight features: turnaround time calculation, estimated prior flight delay, prior flight departed indicator, prior flight arrival indicator, prior flight disruption indicator (estimated to be delayed > 15 minutes or known to be cancelled), prior origin region, prior flight carrier.

Without the weather features, the model performed better on the training and test sets without overfitting, which suggests the weather features may have been adding unnecessary noise. Before proceeding to hyperparameter tuning, we therefore limited the set of weather features used. The final set of features, along with the default seasonality, delay statistic, and graph features, was:

- Weather features: Hourly dew point temperature, precipitation, wind gust speed, visibility, and pressure change 
- Flight features: Flight distance, scheduled elapsed time, airline carrier, quarter, month, day of month, day of week, origin type, origin region
- Prior flight features: turnaround time calculation, estimated prior flight delay, prior flight departed indicator, prior flight arrival indicator, prior flight disruption indicator (estimated to be delayed > 15 minutes or known to be cancelled), prior origin region, prior flight carrier.

The hyperparameters tuned were the hidden layers, step size, max iterations, and block size. The hidden layers determine the architecture and capacity of the neural network. The step size (learning rate) controls how much the model weights are adjusted in response to the estimated error at each update. The max iterations cap the number of optimization iterations, and the block size sets the number of training examples stacked together into a block for each matrix computation.

| Parameter       | Values                              |
|-----------------|-------------------------------------|
| hidden_layers   | [[64, 32], [32, 8, 4], [128, 32]]   |
| stepSize        | 0.01 to 0.3                         |
| maxIter         | [100, 200]                          |
| blockSize       | [64, 128]                           |


The best model had two hidden layers of sizes [128, 32], a maximum of 200 iterations, and a step size of 0.0524, achieving an average F2 score across training folds of 0.51. As mentioned previously, each fold model was used to produce test set predictions. The class 1 probability predictions were combined in an exponentially weighted average such that the most recent fold (fold 5) had the most weight, then thresholded to produce the final prediction. This approach yielded a final F2 test score of 0.5254.
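
The exponentially weighted combination of fold predictions can be sketched as follows. This is a minimal illustration in plain Python, not the pipeline code: the decay factor of 0.5, the threshold of 0.5, and the function names `exp_weights` and `combine_fold_probs` are all assumptions, since the report does not state the values used.

```python
def exp_weights(n_folds, decay=0.5):
    """Normalized weights that grow geometrically toward the most recent fold.

    decay is an assumed value; the last (most recent) fold gets the largest weight.
    """
    raw = [decay ** (n_folds - 1 - k) for k in range(n_folds)]
    total = sum(raw)
    return [w / total for w in raw]


def combine_fold_probs(fold_probs, decay=0.5, threshold=0.5):
    """Blend per-fold class-1 probabilities, then threshold.

    fold_probs: one list of probabilities per fold model, oldest fold first,
    all aligned to the same test rows.
    Returns (labels, blended_probabilities).
    """
    weights = exp_weights(len(fold_probs), decay)
    n_rows = len(fold_probs[0])
    blended = [
        sum(w * probs[i] for w, probs in zip(weights, fold_probs))
        for i in range(n_rows)
    ]
    labels = [1 if p >= threshold else 0 for p in blended]
    return labels, blended


# Hypothetical probabilities for one test flight from five fold models:
# four older folds say 0.9, the most recent fold says 0.1.
labels, blended = combine_fold_probs([[0.9], [0.9], [0.9], [0.9], [0.1]])
```

Under these assumed settings the most recent fold carries just over half of the total weight, so its low probability pulls the blended value below the threshold despite the four older folds agreeing on a delay.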

#### Random Forest

To tune the random forest models, we ran 10 iterations of hyperopt tuning for the parameters numTrees (choice between 20, 40, or 60), maxDepth (between 5 and 12), and featureSubsetStrategy (choice between "auto," "sqrt," and "log2"). In all cases, we included all feature families in the input feature set. We got the following results:

| Run | numTrees | maxDepth | featureSubsetStrategy | CV Fold F2 Scores | Average Score | Std. Dev. Score |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | 60 | 7 | auto | [0.42, 0.46, 0.48, 0.40, 0.46] | 0.4455 | 0.0318 |
| 2 | 60 | 5 | auto | [0.41, 0.45, 0.47, 0.39, 0.44] | 0.4298 | 0.0317 |
| **3** | **20** | **10** | **auto** | **[0.44, 0.49, 0.50, 0.44, 0.47]** | **0.4676** | **0.0274** |
| 4 | 60 | 10 | auto | [0.48, 0.50, 0.43, 0.47, 0.46] | 0.4645 | 0.0301 |
| 5 | 20 | 7 | sqrt | [0.41, 0.47, 0.48, 0.44, 0.45] | 0.4493 | 0.0278 |
| 6 | 20 | 9 | sqrt | [0.42, 0.47, 0.49, 0.42, 0.46] | 0.4531 | 0.0325 |
| 7 | 40 | 11 | sqrt | [0.44, 0.48, 0.50, 0.43, 0.47] | 0.4654 | 0.0263 |
| 8 | 40 | 11 | sqrt | [0.44, 0.48, 0.50, 0.43, 0.47] | 0.4652 | 0.0270 |
| 9 | 60 | 9 | auto | [0.43, 0.48, 0.49, 0.41, 0.47] | 0.4573 | 0.0325 |
| 10 | 60 | 8 | auto | [0.43, 0.47, 0.49, 0.42, 0.46] | 0.4531 | 0.0306 |

From the above, we chose the third model (bolded above), which had the highest average CV F2 score, with numTrees of 20, maxDepth of 10, and featureSubsetStrategy of "auto." Training this model on the full training data and evaluating on the test set gave a test F2 score of 0.4564. Running this hyperparameter tuning on 5-10 workers took 1 hour, 28 minutes.



### Feature Importance Experiments

As one method of assessing the importance of the feature families included in our models, we trained one random forest model per feature family, each time excluding that family's features from the model input. In all cases, we used the hyperparameters chosen by the random forest hyperparameter tuning discussed above. We got the following results:

| Case | CV Fold F2 Scores | Average Score | Std. Dev. Score |
|:---:|:---:|:---:|:---:|
| All features | [0.44, 0.49, 0.50, 0.44, 0.47] | 0.4676 | 0.0274 |
| No weather | [0.45, 0.47, 0.49, 0.42, 0.46] | 0.4575 | 0.0245 |
| No seasonality | [0.42, 0.48, 0.49, 0.40, 0.46] | 0.4502 | 0.0372 |
| No lagged delay stats | [0.45, 0.47, 0.48, 0.43, 0.45] | 0.4565 | 0.0187 |
| No prior flight | [0.45, 0.49, 0.49, 0.44, 0.50] | 0.4749 | 0.0267 |
| No graph | [0.42, 0.48, 0.50, 0.42, 0.48] | 0.4613 | 0.0379 |
| No date info | [0.44, 0.47, 0.51, 0.44, 0.48] | 0.4679 | 0.0286 |
| No flight metadata | [0.45, 0.49, 0.50, 0.44, 0.48] | 0.4713 | 0.0281 |


## Leakage

### What is Leakage?

Leakage in machine learning happens when information from outside the training dataset—or from the future—accidentally sneaks into the model in a way that wouldn’t be available at prediction time. This can lead to overly optimistic performance during training and evaluation, but poor performance in real-world settings because the model is learning from data it shouldn’t have had access to.

### Example of Leakage

In our context, imagine you’re trying to predict if a flight will be disrupted two hours before departure. If you include a weather observation taken at the scheduled departure time, or use the status of an incoming aircraft that hasn’t arrived yet, you’re giving the model future information it wouldn’t have access to in real time. That’s leakage.

### Pipeline Review – Are We Leaking?

We were very intentional about avoiding leakage when building our pipeline. For example:
- We only used the scheduled departure time `CRS_DEP_TIME` instead of the actual departure time because we won't have access to the actual departure time in real time. 
- We aligned weather data to two hours before scheduled departure, never using data closer to the scheduled departure or after takeoff.
- We carefully joined based on location and time zones to ensure realistic data alignment.
- We applied the same cutoff of two hours before scheduled departure to every other feature.
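
The two-hour cutoff described in the list above can be sketched as a point-in-time lookup. This is a minimal standard-library illustration, not our join pipeline: the function name `latest_observation_before_cutoff`, the tuple layout, and the sample timestamps are assumptions, and all times are assumed to already be in a common time zone.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

CUTOFF = timedelta(hours=2)  # prediction is made two hours before scheduled departure


def latest_observation_before_cutoff(scheduled_departure, observations, cutoff=CUTOFF):
    """Return the most recent (obs_time, value) at or before the cutoff, else None.

    observations must be sorted by obs_time ascending.
    """
    limit = scheduled_departure - cutoff
    times = [obs_time for obs_time, _ in observations]
    idx = bisect_right(times, limit)  # number of observations with obs_time <= limit
    if idx == 0:
        return None  # nothing was known yet at prediction time
    return observations[idx - 1]


# Hypothetical weather readings for one station.
departure = datetime(2019, 6, 1, 12, 0)
readings = [
    (datetime(2019, 6, 1, 9, 0), "clear"),
    (datetime(2019, 6, 1, 9, 55), "rain"),
    (datetime(2019, 6, 1, 10, 30), "storm"),   # after the cutoff: must not be used
    (datetime(2019, 6, 1, 12, 0), "storm"),    # at departure: must not be used
]
selected = latest_observation_before_cutoff(departure, readings)
```

Here the lookup selects the 09:55 reading and ignores the two later ones, which is the behavior that keeps post-cutoff information out of the features.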

However, we did identify three key points of potential leakage:

1. **Prior Flight Status Assumption**

    We made the decision to assume the status of the previous flight (e.g., cancellation or delay) was known two hours before the current flight.

      **Why it’s a problem:** In real-time, this may not always be true. The prior flight might still be en route or awaiting departure.

      **Impact:** The model might be trained with more accurate context than would exist at prediction time.

2. **PageRank on the Full Dataset**

    We computed graph-based PageRank features using the entire flight network.
    
      **Why it’s a problem:** This approach allows “future” routes to influence the importance of nodes (airports) in the graph.

      **Impact:** It gives the model indirect access to information from the future, which wouldn’t be available in real time.

3. **Prophet Model Trained on the Full Year**

    Our seasonality features were generated using Prophet fitted to the entire dataset.

      **Why it’s a problem:** It includes future seasonal trends (e.g., holiday surges) that wouldn’t be known in advance.

      **Impact:** This violates the assumption that seasonal patterns must be learned only from past data. 

### Steps to Prevent Leakage Going Forward

We will be working to ensure the final models don't suffer from these issues:

- For **prior flight features**, we’re re-evaluating how we simulate real-time knowledge and will likely restrict this feature to more conservative assumptions or delay signals (e.g., scheduled vs. actual gaps).
- For **PageRank**, we plan to compute it incrementally over time windows—e.g., quarterly, monthly or weekly—so the model only sees past network patterns.

- For **seasonality**, we’ll retrain Prophet using only data prior to the prediction period, using rolling windows (e.g., annual, quarterly, or monthly) to simulate realistic forecasting scenarios.
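
The incremental idea in the list above can be sketched for PageRank: features for each time window are computed only from flights in strictly earlier windows. This is a plain-Python illustration under stated assumptions, not our graph pipeline. The damping factor of 0.85, the 50 power iterations, the function names, and the airport codes are all hypothetical choices.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over directed (origin, dest) edges.

    Repeated edges act as weights; rank from airports with no outgoing
    edges is spread uniformly so the ranks always sum to 1.
    """
    nodes = sorted({airport for edge in edges for airport in edge})
    if not nodes:
        return {}
    out_links = {node: [] for node in nodes}
    for origin, dest in edges:
        out_links[origin].append(dest)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for origin in nodes:
            targets = out_links[origin]
            if targets:
                share = damping * rank[origin] / len(targets)
                for dest in targets:
                    new_rank[dest] += share
        rank = new_rank
    return rank


def past_only_pagerank(edges_by_window):
    """For each chronological window, PageRank built from earlier windows only."""
    features, history = [], []
    for edges in edges_by_window:
        features.append(pagerank(history) if history else None)
        history = history + list(edges)  # current window joins history only afterwards
    return features


# Hypothetical routes in three consecutive windows.
windows = [
    [("SFO", "LAX"), ("LAX", "SFO")],
    [("SFO", "JFK")],
    [("JFK", "SFO")],
]
window_features = past_only_pagerank(windows)
```

In this toy example the second window's features know nothing about JFK, because the first JFK route only appears in that same window. The first window has no features at all, so a real pipeline would need a warm-up period or a default value.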

 

## Results and Discussion

As a result of our hyperparameter tuning and feature selection, we get the following final results for our four model architectures. As a fifth architecture, we created an ensemble of the results of the four models.

### F2 Scores by Model


| Model | CV Fold F2 Scores |  CV Score Avg  | CV Score SD | Test Set F2 | # Workers | Wall Time |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Logistic Regression | [0.47, 0.48, 0.52, 0.45, 0.46] | 0.4752 | 0.0270 | 0.4778 | 5-10 | 63 min (no tuning) |
| Random Forest | [0.44, 0.49, 0.50, 0.44, 0.47] | 0.4676 | 0.0274 |  | 5-10 | 98 min (with tuning) + FINAL EVAL??? |
| XGBoost | [0.48, 0.51, 0.53, 0.47, 0.51] | 0.5026 | 0.0245 |  | 5-10 | 20 min (no tuning) |
| Multilayer Perceptron | [0.486, 0.521, 0.531, 0.474, 0.521] | 0.5066 | 0.0250 | 0.5254 | 5-10? | 120 min (with tuning) |
| Ensemble | N/A | N/A | N/A | 0.5539 | 5-10 | 5 min (no tuning) |



### Logistic Regression

<div style="text-align: center; line-height: 10; padding-top: 30px;  padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/lr_cf.png?raw=true" alt="LR Confusion Matrix" style="width: 500px">
</div>


Logistic regression was used as the baseline. Previous exploration on a one-year subset of data (2015) revealed limited improvement in F2 score after incorporating interaction features, so they were not used. 


| Feature Family | Features Used |
|:---:|:---:|
| Flight | CRS_ELAPSED_TIME, YEAR, MONTH, DAY_OF_MONTH, DAY_OF_WEEK, OP_UNIQUE_CARRIER, ORIGIN_ICAO, DEST_ICAO, DISTANCE |
| Prior Flight | turnaround_time_calc, priorflight_depdelay_calc, priorflight_isdeparted, priorflight_isarrived, priorflight_isdelayed |

The logistic regression confusion matrix shows that most of the incorrect predictions are false negatives rather than false positives. This is undesirable given our business objective, which treats false negatives as the more costly error.

### Random Forest

<div style="text-align: center; line-height: 10; padding-top: 30px;  padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/rf_cf.png?raw=true" alt="RF Confusion Matrix" style="width: 500px">
</div>

The random forest achieves roughly equal proportions of false positives and negatives, and a somewhat improved performance predicting true positives as compared to the logistic regression.

### XGBoost

<div style="text-align: center; line-height: 10; padding-top: 30px;  padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/xgb_cf.png?raw=true" alt="XGB Confusion Matrix" style="width: 500px">
</div>

Even though the XGBoost model is the best at identifying true negatives, it struggles to identify true positives: it predicts fewer true positives than the other models and more false negatives. Overall, it exhibits similar behavior to the logistic regression, except for its improvement in predicting true negatives.

### Multilayer Perceptron

<div style="text-align: center; line-height: 10; padding-top: 30px;  padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/mlp_cf.png?raw=true" alt="MLP Confusion Matrix" style="width: 500px">
</div>

The MLP model performs similarly to the random forest with a slight improvement in predicting both true negatives and true positives. 


<div style="text-align: center; line-height: 10; padding-top: 30px;  padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/objectivehistory.png?raw=true" alt="MLP Objective History" style="width: 500px">
</div>

The objective function declines steadily as iterations continue, and the MLP's test F2 score (0.5254) is in line with its cross-validation average (0.5066), so the model doesn't appear to be overfitting. This supports the choice of 200 iterations as the hyperparameter; increasing iterations further could potentially lead to overfitting. Additionally, some fold models, such as the one for fold 3, show a more promising decline, while others might start overfitting with more iterations.

### Ensemble

<div style="text-align: center; line-height: 10; padding-top: 30px;  padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/avg_cf.png?raw=true" alt="Ensemble Confusion Matrix" style="width: 500px">
</div>


Overall, the ensemble model is not as good as the other models at predicting true negatives but performs the best at predicting true positives. It also has the fewest false negatives and is the only model to have a higher false positive prediction rate. Because our F2 objective prioritizes recall, it is the most suitable model for our priorities.

## Gap Analysis


<div style="text-align: center; line-height: 10; padding-top: 30px;  padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/positives.png?raw=true" alt="Gap Analysis by Day" style="width: 500px">
</div>




The ensemble captures the seasonal structure reasonably well but exhibits systematic underprediction of troughs (i.e., it overestimates the number of low-delay days) and fails to fully capture peaks (i.e., it underestimates the number of high-delay days). This indicates a conservative response to seasonal extremes.

The time series of false negatives resembles white noise, yet with some discernible seasonal patterns. For instance, in June, when there is a known seasonal peak in delays, the model appropriately predicts fewer non-delays (i.e., it shows a slight trough in false negative predictions), aligning with real-world conditions. Conversely, in the fall, during periods with fewer delays, the model predicts more non-delays (i.e., a slight peak in false negatives).

  

<div style="text-align: center; line-height: 10; padding-top: 30px;  padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/misclass_by_hour.png?raw=true" alt="Gap Analysis by Hour" style="width: 500px">
</div>



False negatives are mostly evenly distributed, only following the minor peak for the morning influx of flights, whereas false positive counts increase as the total count increases and share the 5 PM peak.

Both false negative and false positive counts decline as total flight volume declines.

We also analyze prediction type by airport delay rate, where each airport's delay rate is calculated as an average over all five years.

<div style="text-align: center; line-height: 10; padding-top: 30px;  padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/false_neg_by_delay_rate.png?raw=true" alt="False Neg by Delay Rate" style="width: 200px">
</div>




As the delay rate increases from 0.05 to 0.2, the false negative rate also rises, indicating that the model increasingly underpredicts delays even at airports with higher disruption rates. On the other hand, as expected, the false negative count is lower for higher-delay-rate airports past the 0.2 threshold. This implies that the model is learning which airports have higher delay rates and is correctly predicting fewer non-disruptions there. The false negative rate varies widely around the 0.2 threshold, which is expected given that many data points cluster at this delay rate: many airports exhibit this level of disruption, making it harder for the model to extract a consistent predictive signal.

<div style="text-align: center; line-height: 10; padding-top: 30px;  padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/false_pos_by_delay_rate.png?raw=true" alt="False Pos by Delay Rate" style="width: 500px">
</div>


False positives increase as delay rates rise, again suggesting that the model recognizes certain airports as having frequent delays, but overcorrects—predicting delays at those airports even when none occur.


In addition to exploring our results with respect to flight features, we must also consider the weather. The figure below depicts the Spearman correlation between our numeric weather features and false positive and false negative status. In general, we see only weak correlation between false negative status and weather features. This is encouraging: it does not seem that we are systematically missing flight delays during particular weather events. There are some moderate correlations between false positive status and temperatures and wind speeds. However, these may reflect events where extreme weather caused widespread disruption and the model predicted false positives because of the high overall volume of disruption.

<div style="text-align: center; line-height: 10; padding-top: 30px;  padding-bottom: 30px;">
<img src="https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/weather_misclass_corr.png?raw=true" alt="Gap Analysis Weather Corr" style="width: 500px">
</div>

Considering the true and false positive and negative rates by airport type, we see similar behavior for medium and large airports, but worse misclassification behavior for small airports. In particular, small airports have a very high false positive rate. Though we are typically more concerned about false negatives, the extent of overprediction of delays in this case is worthy of further analysis in future modeling endeavors.


| Origin Airport Type | False Negative Rate | False Positive Rate | True Negative Rate | True Positive Rate |
|:---:|:---:|:---:|:---:|:---:|
| Large Airport | 0.067 | 0.299 | 0.488 | 0.146 |
| Medium Airport | 0.078 | 0.290 | 0.514 | 0.119 |
| Small Airport | 0.022 | 0.629 | 0.094 | 0.255 |
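
The four values in each row of the table above sum to approximately 1, which suggests they are shares of all flights for that airport type rather than conventional rates. Under that reading (an assumption on our part), recall, precision, the conventional false positive rate, and F2 can be derived per airport type. The sketch below does this with the table's values; the function name `derived_metrics` is introduced here for illustration.

```python
# Shares of all flights per airport type, copied from the table above.
SHARES = {
    "Large Airport": {"fn": 0.067, "fp": 0.299, "tn": 0.488, "tp": 0.146},
    "Medium Airport": {"fn": 0.078, "fp": 0.290, "tn": 0.514, "tp": 0.119},
    "Small Airport": {"fn": 0.022, "fp": 0.629, "tn": 0.094, "tp": 0.255},
}


def derived_metrics(shares, beta=2.0):
    """Derive standard classification metrics from confusion-matrix shares."""
    tp, fp, tn, fn = shares["tp"], shares["fp"], shares["tn"], shares["fn"]
    recall = tp / (tp + fn)          # share of actual delays that were caught
    precision = tp / (tp + fp)       # share of predicted delays that were real
    fpr = fp / (fp + tn)             # share of on-time flights flagged as delayed
    b2 = beta ** 2
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return {"recall": recall, "precision": precision, "fpr": fpr, "f_beta": f_beta}


metrics = {airport: derived_metrics(s) for airport, s in SHARES.items()}
for airport, m in metrics.items():
    print(airport, {name: round(value, 3) for name, value in m.items()})
```

Under this interpretation, the small-airport row corresponds to a conventional false positive rate near 0.87 alongside a recall near 0.92, so the model flags most small-airport flights as delayed. Precision is similar across the three airport types, at roughly 0.29 to 0.33.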


## Conclusion [UPDATE]

Accurate prediction of flight delays enables airports to prepare for operational shocks, in order to prevent delay propagation. Our aim is to design a machine learning classification model to predict whether a flight will be delayed, using historical flight and weather data. In this phase, we introduced engineered features for prior flight status and seasonality. Training logistic regression models on these features, plus numeric weather and flight metadata features from the raw data, we produced models that achieved F2 scores ranging from 0.27 to 0.57 on a held-out set. The models with engineered features outperformed those without them, which supports the predictive power of the engineered features. In the next phase, we plan to introduce additional engineered features capturing airport-level delay recency and frequency, the network structure of flights and airports, and long-term seasonality. We will also explore nonlinear models including XGBoost, Random Forest, and Multi-Layer Perceptron; and will train these models using a five-year dataset. We anticipate that these additions will allow us to produce a robust and performant model to suit our stakeholders' needs.

## Code Notebooks

- Directory and raw data prep: [0.01-mas-dir-and-raw-data-prep](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/4262628234468304?o=4021782157704243#command/7738973093567460)
- Weather data cleaning: [0.03-sg-weather-clean](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/905158478261251?o=4021782157704243#command/8643339954781431), [0.03-mas-weather-cleanup](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/1017866335725443?o=4021782157704243#command/1017866335725444)
- Flights data initial cleaning (1 year/Phase 2): [1.04-eil-flights-cleaning](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/1032319343318287?o=4021782157704243#command/7738973093566523)
- Flights data EDA (1 year/Phase 2): [1.06-eil-flights-eda](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/905158478261252?o=4021782157704243#command/8643339954781548)
- Join pipeline: [0.08-mas-data-join-pipeline](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/905158478261859?o=4021782157704243#command/7738973093566465)
- Prior flight feature engineering: [1.10-dy-joined-prior-feat-eng](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/1556213624083044?o=4021782157704243#command/7738973093573204)
- Seasonality, cross-validation, and modeling setup/development (1 year/Phase 2): [3.11-eil-joined-modeling](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/1556213624087093?o=4021782157704243#command/7738973093574471)
- Joined data cleaning: [0.12-sg-joined-cleaning](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/1556213624085050?o=4021782157704243#command/7738973093573832)
- Joined data cleaning and feature engineering: [1.12-sg-joined-cleaning-engineering](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/3581936500212487?o=4021782157704243#command/3581936500212488)
- Joined data modeling: [3.12-sg-joined-modeling](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/1556213624087634?o=4021782157704243#command/6823299216256987)
- Hyperparameter tuning: [3.15-mas-modeling-pipeline-with-tuning](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/3336123436792754?o=4021782157704243#command/5381617944351471)
- Models Ensemble: [3.18-sg-ensemble](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/3198055233260107?o=4021782157704243#command/3198055233260108)
- Modeling postprocessing: [3.13-modeling-analysis](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/3581936500213637?o=4021782157704243#command/6823299216258268)
- Time-based feature engineering: [1.13-eil-joined-time-based-feat-eng](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/3336123436788973?o=4021782157704243#command/3336123436788995)
- Initial 5 year modeling exploration and creation of 5 year seasonality data: [3.14-eil-5y-initial-modeling](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/3336123436791044?o=4021782157704243#command/6352054888934311)
- Figure generation for report and slides: [2.15-eil-figures](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/3336123436791855?o=4021782157704243#command/5381617944350124)
- Random Forest hyperparameter tuning: [3.19-eil-modeling-pipeline-with-tuning](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/4472038552165641?o=4021782157704243#command/6352054888932701)
- Feature importance experiments: [3.20-eil-feature-importance-experiments](https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/3198055233260353?o=4021782157704243)

## Presentation

- [Slide Deck](https://github.com/bakr-UCB/261-Final-Project/blob/main/reports/figures/Phase_III_Final_Presentation.pdf) presented on April 16, 2025

## Bibliography

<ol>
    <li>"Federal Climate Complex Data Documentation for Integrated Surface Data (ISD)." NOAA NCEI, 12 Jan. 2018, https://www.ncei.noaa.gov/data/global-hourly/doc/isd-format-document.pdf. Accessed 16 Mar. 2025.</li>
    <li>Lee, Kangoh. “Airline operational disruptions and loss-reduction investment.” Transportation Research Part B: Methodological, vol. 177, Nov. 2023, p. 102817, https://doi.org/10.1016/j.trb.2023.102817. </li>
    <li>“Local Climatological Data (LCD) Dataset Documentation.” Local Climatological Data (LCD) Data, NOAA NCEI, www.ncei.noaa.gov/pub/data/cdo/documentation/LCD_documentation.pdf. Accessed 16 Mar. 2025.</li>
    <li>"Reporting Carrier On-Time Performance (1987-present)." Bureau of Transportation Statistics, https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ. Accessed 16 Mar. 2025.</li>
    <li>Taylor, Sean, and Benjamin Letham. "Forecasting at scale." PeerJ Preprints 5:e3190v2, 2017. https://doi.org/10.7287/peerj.preprints.3190v2</li>
    <li>“Understanding the Reporting of Causes of Flight Delays and Cancellations.” Bureau of Transportation Statistics, US Department of Transportation, 15 Apr. 2024, www.bts.gov/topics/airlines-and-airports/understanding-reporting-causes-flight-delays-and-cancellations. </li>
</ol>