# Calculating Historical Performance for Unique Flights (An Optimized Approach)

Each row in the flights dataset represents a unique flight, containing details such as flight date, carrier, origin, destination, scheduled departure time, arrival time, and performance metrics like delays and cancellations. To extract meaningful historical insights for each flight, I performed rolling window calculations to gather statistics based on recent flight history over a certain period of time. The idea for this aggregation approach to create new features was adapted from [Ashish Jain](https://github.com/aajains/springboard-datascience-intensive/blob/master/capstone_project/DataAcquisitionMerging_v1/data_acquisition_merging.ipynb).

#### Objective:
I wanted to compute historical performance statistics—such as delays, cancellations, and diversions—for each flight, using its recent history. For example, for a flight from Las Vegas (LAS) to Charlotte (CLT) on August 6th, 2023, at 11:59 PM, I wanted to know how many times a flight on the same route, operated by the same airline, and within the same time window, was delayed or canceled over the past 10 days, 20 days, 30 days and 90 days.

#### Key Variables Defining a Unique Flight:
1. **Carrier**: The airline operating the flight.
2. **Origin**: The departure airport.
3. **Destination**: The arrival airport.
4. **Departure Window**: A specific time range during which the flight departs (e.g., morning, afternoon).

For each flight, I gathered historical data based on these four variables.

#### Process Overview:
1. **Grouping by Flight Attributes**: The first step is to group flights by carrier, route (origin-destination pair), and departure window.
    * The function `create_route_ids()` assigns each unique origin-destination pair a `route_id` and adds that id to each flight in the flights dataframe.
    * The function `create_time_windows()` creates a new categorical column called `departure_window` with the categories and times below. These windows are based on the scheduled departure time in the local timezone. Each window includes its start hour and excludes its end hour.

        | Category         | Time Range    |
        |------------------|---------------|
        | Overnight        | 12 AM - 4 AM  |
        | Early morning    | 4 AM - 6 AM   |
        | Morning          | 6 AM - 11 AM  |
        | Midday           | 11 AM - 1 PM  |
        | Early afternoon  | 1 PM - 3 PM   |
        | Afternoon        | 3 PM - 5 PM   |
        | Evening          | 5 PM - 7 PM   |
        | Night            | 7 PM - 10 PM  |
        | Late night       | 10 PM - 12 AM |


   
2. **Rolling Window Aggregations**: Next, for each flight, the `calculate_flight_performance_aggregations(df, windows)` function calculates performance statistics (e.g., delay metrics, cancellations, and diversions) over rolling time windows (e.g., 10, 20, 30, and 90 days) that look back from the flight's scheduled departure datetime. The statistics include:
   - Mean, median, and maximum departure/arrival delays
   - Count of canceled flights
   - Count of diversions
   - Count of flights

3. **Handling Duplicate Data**: The aggregations are computed for every specified time window and every flight in the flights dataset. Duplicates occurred when an airline had multiple flights on the same route, in the same departure window, with the same scheduled departure datetime. Merging these back into the original flights dataset produced more rows than expected, because one flight could match multiple rows of aggregated data. To fix this, I sorted the aggregated dataframe by the key columns (`route_id`, `airline_mkt`, `departure_window`, `scheduled_departure_datetime`) and by `n_flights_90D`, then kept the last row for each key combination. That row has the highest flight count and therefore the most complete flight history.

   This deduplication was applied with the `drop_agg_duplicates()` function before merging the rolling statistics back into the original flights dataframe.

4. **Time Windows**: Flights departing at similar times are grouped into “time windows” (e.g., early morning, afternoon), allowing for tracking consistent patterns in flight operations over time. This reduces the granularity of individual departure times and simplifies grouping.
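
As a minimal sketch of how the departure windows behave at their boundaries, the snippet below applies the same bin edges and labels used by `create_time_windows()` to a handful of toy hours. The `hours` series is made up for illustration and is not taken from the flights dataset.

```python
import pandas as pd

# Bin edges (in hours) and labels matching the departure window table
bins = [0, 4, 6, 11, 13, 15, 17, 19, 22, 24]
labels = ['overnight', 'early morning', 'morning', 'midday', 'early afternoon',
          'afternoon', 'evening', 'night', 'late night']

# Toy hours of day, several of them sitting exactly on a bin edge
hours = pd.Series([0, 3, 4, 10, 11, 22, 23])

# right=False makes every bin left-closed, e.g. [0, 4), [4, 6), ...
windows = pd.cut(hours, bins=bins, labels=labels, right=False)
print(windows.tolist())
```

Because the bins are left-closed, a flight scheduled at exactly 4 AM lands in "early morning" rather than "overnight", and hour 23 still falls inside the final `[22, 24)` bin.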

#### Optimized Approach:
Rather than looping through each flight row by row, I used vectorized `pandas` operations to group flights by carrier, route, and departure window and to apply the rolling windows within each group. This dramatically reduces computation time and allows the process to scale to large datasets with millions of rows. After calculating the aggregations, I merged the results back into the original flight dataset.

By combining these rolling window statistics with the original flight data, I enhanced the dataset with valuable historical features that can be used for further analysis and modeling.
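
The grouped rolling idea can be illustrated on a tiny, made-up dataframe. This is a simplified sketch rather than the notebook's actual function: the column names (`carrier`, `sched_dep`) and values are hypothetical, and it groups on a single key instead of carrier, route, and departure window.

```python
import pandas as pd

# Hypothetical toy data: two carriers, a few flights each
toy = pd.DataFrame({
    'carrier': ['AA', 'AA', 'AA', 'BB', 'BB'],
    'sched_dep': pd.to_datetime([
        '2023-01-01 08:00', '2023-01-05 08:00', '2023-01-20 08:00',
        '2023-01-02 09:00', '2023-01-03 09:00',
    ]),
    'dep_delay': [10.0, 20.0, 30.0, 5.0, None],  # None mimics a cancelled flight
})

# Time-based rolling needs a sorted datetime index
indexed = toy.sort_values('sched_dep').set_index('sched_dep')
grouped = indexed.groupby('carrier')['dep_delay']

# One vectorized pass per statistic, computed separately within each carrier
rolled = pd.DataFrame({
    'dep_delay_mean_10D': grouped.rolling('10D').mean(),
    'n_flights_10D': grouped.rolling('10D').count(),
}).reset_index()
print(rolled)
```

Two behaviors are worth noting. With pandas' default `closed='right'` for offset windows, each row's own value falls inside its window, so the statistics include the current flight. Also, `count()` skips missing delays, so a flight without a recorded delay does not add to the flight count.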

#### Future Steps for EDA
1. **Handling Rows with `n_flights == 0` or `n_flights == NaN`**

    During the rolling window aggregation process, some rows resulted in `n_flights == 0`, indicating that there were no relevant flights within the specified window for the given route, carrier, and departure window. This can happen for several reasons:
    * Sparse Data: Some routes or carriers may not have had enough flights during the rolling window, especially for less common routes or carriers.
    * Edge Cases: Flights near the start of the dataset may not have enough prior data to fill the rolling window (for example, flights in the first 10 days of the year).

    Rows with `n_flights == 0` have NaN values for the aggregated delay columns (mean, median, and maximum), as no delay data was available to compute these statistics. The cancellation and diversion sums are still populated for these rows.

    *Handling Strategy:*
    * Preserving Data: Rather than dropping rows with `n_flights == 0`, I preserved them in the dataset. These rows will be handled during Exploratory Data Analysis (EDA) or preprocessing steps, where I can choose to filter them out, impute missing values, or flag them for special handling, depending on the analysis needs.
    * Analysis Consideration: These rows are important to consider because they reflect scenarios where there is limited historical data for specific flights, which can impact the accuracy of predictions or analysis. By keeping them, I maintain transparency in the data and can address any gaps during the modeling process.
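
The options above can be sketched on a small, made-up frame. The column names follow the notebook's naming pattern, but the values, the `no_history_10D` flag, and the median imputation are illustrative assumptions rather than decisions made in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical slice of aggregated statistics
stats = pd.DataFrame({
    'n_flights_10D': [0.0, 8.0, np.nan],
    'dep_delay_mean_10D': [np.nan, 5.125, np.nan],
})

# Flag rows with no usable history (zero or missing flight count)
stats['no_history_10D'] = stats['n_flights_10D'].fillna(0).eq(0)

# Option A: filter those rows out
with_history = stats[~stats['no_history_10D']]

# Option B: impute with the median of the observed values and keep the flag
fill_value = stats['dep_delay_mean_10D'].median()
stats['dep_delay_mean_10D_filled'] = stats['dep_delay_mean_10D'].fillna(fill_value)
print(stats)
```

Keeping the flag alongside the imputed value lets a downstream model distinguish a genuinely observed statistic from a filled-in one.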

In [1]:
import sys
import os

# Get the current working directory
current_dir = os.getcwd()

# Add the parent directory to sys.path
parent_dir = os.path.abspath(os.path.join(current_dir, '..'))
sys.path.insert(0, parent_dir)

In [2]:
import pandas as pd
import numpy as np

from config.config import DATA_PATH


In [3]:
pd.set_option('display.max_columns', None)

In [4]:
flights = pd.read_csv(DATA_PATH + '/interim/2023-performance-data-clean.csv',  parse_dates=['scheduled_departure_datetime'])



In [5]:
flights.head()

Unnamed: 0,year,quarter,month,day_of_month,day_of_week,marketing_airline_id,flight_number_marketing_airline,operating_airline_id,tail_number,origin_airport_id,origin_city_market_id,origin,origin_state,dest_airport_id,dest_city_market_id,dest,dest_state,dep_delay,dep_del15,departure_delay_groups,taxi_out,taxi_in,arr_delay,cancelled,cancellation_code,diverted,scheduled_elapsed_time,actual_elapsed_time,distance,distance_group,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay,div_airport_landings,code_share_flight,origin_city,destination_city,scheduled_departure_datetime,scheduled_arrival_datetime,actual_departure_datetime,actual_arrival_datetime,airline_mkt,airline_ops,origin_timezone,destination_timezone,scheduled_departure_datetime_utc,actual_departure_datetime_utc,scheduled_arrival_datetime_utc,actual_arrival_datetime_utc,is_holiday
0,2023,3,8,6,7,20416,2252,20416,N978NK,12889,32211,LAS,NV,11057,31057,CLT,NC,,,,,,,1,B,0,263.0,,1916.0,8,,,,,,0.0,0,Las Vegas,Charlotte,2023-08-06 23:59:00,2023-08-07 07:22:00,,,Spirit Air Lines,Spirit Air Lines,America/Los_Angeles,America/New_York,2023-08-07 07:00:00+00:00,,2023-08-07 11:00:00+00:00,,0
1,2023,3,8,7,1,20416,2252,20416,N974NK,12889,32211,LAS,NV,11057,31057,CLT,NC,76.0,1.0,5.0,19.0,9.0,69.0,0,,0,263.0,256.0,1916.0,8,3.0,0.0,1.0,0.0,65.0,0.0,0,Las Vegas,Charlotte,2023-08-07 23:59:00,2023-08-08 07:22:00,2023-08-08 01:15:00,2023-08-08 08:31:00,Spirit Air Lines,Spirit Air Lines,America/Los_Angeles,America/New_York,2023-08-08 07:00:00+00:00,2023-08-08 08:15:00+00:00,2023-08-08 11:00:00+00:00,2023-08-08 12:31:00+00:00,0
2,2023,3,8,9,3,20416,2252,20416,N519NK,12889,32211,LAS,NV,11057,31057,CLT,NC,-11.0,0.0,-1.0,14.0,10.0,-13.0,0,,0,258.0,256.0,1916.0,8,,,,,,0.0,0,Las Vegas,Charlotte,2023-08-09 23:10:00,2023-08-10 06:28:00,2023-08-09 22:59:00,2023-08-10 06:15:00,Spirit Air Lines,Spirit Air Lines,America/Los_Angeles,America/New_York,2023-08-10 06:00:00+00:00,2023-08-10 05:59:00+00:00,2023-08-10 10:00:00+00:00,2023-08-10 10:15:00+00:00,0
3,2023,3,8,10,4,20416,2252,20416,N532NK,12889,32211,LAS,NV,11057,31057,CLT,NC,-8.0,0.0,-1.0,12.0,7.0,-30.0,0,,0,258.0,236.0,1916.0,8,,,,,,0.0,0,Las Vegas,Charlotte,2023-08-10 23:10:00,2023-08-11 06:28:00,2023-08-10 23:02:00,2023-08-11 05:58:00,Spirit Air Lines,Spirit Air Lines,America/Los_Angeles,America/New_York,2023-08-11 06:00:00+00:00,2023-08-11 06:02:00+00:00,2023-08-11 10:00:00+00:00,2023-08-11 09:58:00+00:00,0
4,2023,3,8,12,6,20416,2252,20416,N529NK,12889,32211,LAS,NV,11057,31057,CLT,NC,4.0,0.0,0.0,18.0,6.0,-4.0,0,,0,258.0,250.0,1916.0,8,,,,,,0.0,0,Las Vegas,Charlotte,2023-08-12 23:10:00,2023-08-13 06:28:00,2023-08-12 23:14:00,2023-08-13 06:24:00,Spirit Air Lines,Spirit Air Lines,America/Los_Angeles,America/New_York,2023-08-13 06:00:00+00:00,2023-08-13 06:14:00+00:00,2023-08-13 10:00:00+00:00,2023-08-13 10:24:00+00:00,0


In [6]:
flights.shape

(7276990, 52)

In [7]:
flights.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7276990 entries, 0 to 7276989
Data columns (total 52 columns):
 #   Column                            Dtype         
---  ------                            -----         
 0   year                              int64         
 1   quarter                           int64         
 2   month                             int64         
 3   day_of_month                      int64         
 4   day_of_week                       int64         
 5   marketing_airline_id              int64         
 6   flight_number_marketing_airline   int64         
 7   operating_airline_id              int64         
 8   tail_number                       object        
 9   origin_airport_id                 int64         
 10  origin_city_market_id             int64         
 11  origin                            object        
 12  origin_state                      object        
 13  dest_airport_id                   int64         
 14  dest_city_market_i

In [8]:
dep_delay_99th = flights['dep_delay'].quantile(0.99)
arr_delay_99th = flights['arr_delay'].quantile(0.99)

# Clip values at the 99th percentile
flights['dep_delay_clipped'] = flights['dep_delay'].clip(upper=dep_delay_99th)
flights['arr_delay_clipped'] = flights['arr_delay'].clip(upper=arr_delay_99th)

In [9]:
def create_route_ids(df):
    """
    Assign a unique route ID to each origin-destination airport pair.

    Parameters:
    ----------
    df : pandas.DataFrame
        The DataFrame of flight data, which must contain 'origin' and 'dest' columns.

    Returns:
    -------
    df : pandas.DataFrame
        The input DataFrame with one new column:
        - 'route_id': A unique identifier for each route (origin-destination pair).
    route_mapping : dict
        A dictionary mapping each (origin, dest) tuple to its route ID.
    """

    routes = df.groupby(['origin', 'dest'])[['origin', 'dest']].size().reset_index(name='flight_count')
    routes.index += 1
    routes = routes.reset_index().rename(columns={'index':'route_id'})

    # create a dictionary with route tuples and route id
    route_mapping = routes.set_index(['origin', 'dest'])['route_id'].to_dict()

    # add the route id for each flight
    df['route_id'] = df[['origin', 'dest']].apply(tuple, axis=1).map(route_mapping)

    return df, route_mapping

In [10]:
flights, routes = create_route_ids(flights)

In [11]:
flights['route_id'].nunique()

6714

In [12]:
def create_time_windows(df, datetime_col):
    """
    Categorize flights into departure time windows.

    Time windows (e.g., morning, afternoon) are assigned based on the hour of day using 
    predefined bins. Each bin includes its start hour and excludes its end hour.

    Parameters:
    ----------
    df : pandas.DataFrame
        The DataFrame of flight data, which must contain the datetime column named 
        in `datetime_col`.
    datetime_col : str
        Name of the datetime column to bin, such as 'scheduled_departure_datetime'.

    Returns:
    -------
    df : pandas.DataFrame
        The DataFrame with two new columns:
        - 'hour_of_day': The hour (0-23) extracted from the datetime column.
        - 'departure_window': A categorical time window label (e.g., morning, afternoon).
    """

    bins = [0, 4, 6, 11, 13, 15, 17, 19, 22, 24]
    labels = ['overnight', 'early morning', 'morning', 'midday', 'early afternoon', 'afternoon', 'evening', 'night', 'late night']

    df['hour_of_day'] = df[datetime_col].dt.hour

    df['departure_window'] = pd.cut(df['hour_of_day'], bins=bins, labels=labels, right=False)

    return df

In [13]:
flights = create_time_windows(flights, 'scheduled_departure_datetime')

In [14]:
flights.head()

Unnamed: 0,year,quarter,month,day_of_month,day_of_week,marketing_airline_id,flight_number_marketing_airline,operating_airline_id,tail_number,origin_airport_id,origin_city_market_id,origin,origin_state,dest_airport_id,dest_city_market_id,dest,dest_state,dep_delay,dep_del15,departure_delay_groups,taxi_out,taxi_in,arr_delay,cancelled,cancellation_code,diverted,scheduled_elapsed_time,actual_elapsed_time,distance,distance_group,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay,div_airport_landings,code_share_flight,origin_city,destination_city,scheduled_departure_datetime,scheduled_arrival_datetime,actual_departure_datetime,actual_arrival_datetime,airline_mkt,airline_ops,origin_timezone,destination_timezone,scheduled_departure_datetime_utc,actual_departure_datetime_utc,scheduled_arrival_datetime_utc,actual_arrival_datetime_utc,is_holiday,dep_delay_clipped,arr_delay_clipped,route_id,hour_of_day,departure_window
0,2023,3,8,6,7,20416,2252,20416,N978NK,12889,32211,LAS,NV,11057,31057,CLT,NC,,,,,,,1,B,0,263.0,,1916.0,8,,,,,,0.0,0,Las Vegas,Charlotte,2023-08-06 23:59:00,2023-08-07 07:22:00,,,Spirit Air Lines,Spirit Air Lines,America/Los_Angeles,America/New_York,2023-08-07 07:00:00+00:00,,2023-08-07 11:00:00+00:00,,0,,,3228,23,late night
1,2023,3,8,7,1,20416,2252,20416,N974NK,12889,32211,LAS,NV,11057,31057,CLT,NC,76.0,1.0,5.0,19.0,9.0,69.0,0,,0,263.0,256.0,1916.0,8,3.0,0.0,1.0,0.0,65.0,0.0,0,Las Vegas,Charlotte,2023-08-07 23:59:00,2023-08-08 07:22:00,2023-08-08 01:15:00,2023-08-08 08:31:00,Spirit Air Lines,Spirit Air Lines,America/Los_Angeles,America/New_York,2023-08-08 07:00:00+00:00,2023-08-08 08:15:00+00:00,2023-08-08 11:00:00+00:00,2023-08-08 12:31:00+00:00,0,76.0,69.0,3228,23,late night
2,2023,3,8,9,3,20416,2252,20416,N519NK,12889,32211,LAS,NV,11057,31057,CLT,NC,-11.0,0.0,-1.0,14.0,10.0,-13.0,0,,0,258.0,256.0,1916.0,8,,,,,,0.0,0,Las Vegas,Charlotte,2023-08-09 23:10:00,2023-08-10 06:28:00,2023-08-09 22:59:00,2023-08-10 06:15:00,Spirit Air Lines,Spirit Air Lines,America/Los_Angeles,America/New_York,2023-08-10 06:00:00+00:00,2023-08-10 05:59:00+00:00,2023-08-10 10:00:00+00:00,2023-08-10 10:15:00+00:00,0,-11.0,-13.0,3228,23,late night
3,2023,3,8,10,4,20416,2252,20416,N532NK,12889,32211,LAS,NV,11057,31057,CLT,NC,-8.0,0.0,-1.0,12.0,7.0,-30.0,0,,0,258.0,236.0,1916.0,8,,,,,,0.0,0,Las Vegas,Charlotte,2023-08-10 23:10:00,2023-08-11 06:28:00,2023-08-10 23:02:00,2023-08-11 05:58:00,Spirit Air Lines,Spirit Air Lines,America/Los_Angeles,America/New_York,2023-08-11 06:00:00+00:00,2023-08-11 06:02:00+00:00,2023-08-11 10:00:00+00:00,2023-08-11 09:58:00+00:00,0,-8.0,-30.0,3228,23,late night
4,2023,3,8,12,6,20416,2252,20416,N529NK,12889,32211,LAS,NV,11057,31057,CLT,NC,4.0,0.0,0.0,18.0,6.0,-4.0,0,,0,258.0,250.0,1916.0,8,,,,,,0.0,0,Las Vegas,Charlotte,2023-08-12 23:10:00,2023-08-13 06:28:00,2023-08-12 23:14:00,2023-08-13 06:24:00,Spirit Air Lines,Spirit Air Lines,America/Los_Angeles,America/New_York,2023-08-13 06:00:00+00:00,2023-08-13 06:14:00+00:00,2023-08-13 10:00:00+00:00,2023-08-13 10:24:00+00:00,0,4.0,-4.0,3228,23,late night


In [15]:
def calculate_flight_performance_aggregations(df, windows):
    """
    Calculate rolling window flight performance aggregations over specified time windows.

    This function calculates statistics like mean, median, max of departure and arrival delays,
    as well as the number of cancellations, diversions, and the number of flights, for each 
    flight based on its carrier, route, and departure window.

    Parameters:
    ----------
    df : pandas.DataFrame
        The input DataFrame containing flight data, including columns such as 'airline_mkt', 
        'route_id', 'departure_window', and 'scheduled_departure_datetime'.
        
    windows : list of str
        A list of rolling window durations (e.g., ['10D', '20D', '30D']) to calculate the
        rolling aggregations over. These should be formatted as Pandas-compatible time strings.

    Returns:
    -------
    pandas.DataFrame
        A DataFrame containing the rolling statistics for each window. Each statistic will be 
        suffixed with the corresponding window (e.g., 'dep_delay_mean_10D', 'n_flights_20D').
        The resulting DataFrame includes statistics like:
        - dep_delay_mean, dep_delay_median, dep_delay_max
        - arr_delay_mean, arr_delay_median, arr_delay_max
        - cancelled_sum, div_airport_landings_sum
        - n_flights (number of flights in the window)
    """
    # Sort dataframe by scheduled departure datetime column
    df = df.sort_values('scheduled_departure_datetime')

    # Use the clipped delays so extreme outliers do not dominate the rolling statistics
    df['dep_delay'] = df['dep_delay_clipped']
    df['arr_delay'] = df['arr_delay_clipped']

    # Group dataframe to create unique flights based on airline, route, and departure window
    grouped_df = df.groupby(['airline_mkt', 'route_id', 'departure_window'], observed=False)

    # Initiate empty list to capture aggregate values
    rolling_stats_list = []

    # Iterate over the window sizes; each rolling window is computed within every unique flight grouping
    for window in windows:
        rolling_stats = grouped_df.rolling(window=window, on='scheduled_departure_datetime').agg({
            'dep_delay': ['mean', 'median', 'max'],
            'arr_delay': ['mean', 'median', 'max'],
            'cancelled': 'sum',
            'div_airport_landings': 'sum'
        }).reset_index()

        # Add the number of flights in each rolling window; count() skips NaN delays,
        # so flights without a recorded departure delay (e.g., cancellations) are not counted
        rolling_stats[f'n_flights_{window}'] = grouped_df.rolling(window=window, on='scheduled_departure_datetime')['dep_delay'].count().values

        # Flatten the multi-level column index for easier access
        rolling_stats.columns = [
            f'{col[0]}_{col[1]}_{window}' if isinstance(col, tuple) and col[1] != '' else col[0] for col in rolling_stats.columns
        ]

        rolling_stats_list.append(rolling_stats)

    # Concatenate all rolling stats into a single DataFrame
    all_rolling_stats = pd.concat(rolling_stats_list, axis=1)

    # Remove duplicate columns from the concatenation
    all_rolling_stats = all_rolling_stats.loc[:, ~all_rolling_stats.columns.duplicated()]

    return all_rolling_stats    


In [16]:
# calculate rolling statistics for 2023 unique flights
rolling_stats = calculate_flight_performance_aggregations(flights, ['10D', '30D', '90D'])

In [17]:
rolling_stats.shape

(7276990, 31)

In [18]:
rolling_stats.isna().sum()

airline_mkt                        0
route_id                           0
departure_window                   0
scheduled_departure_datetime       0
dep_delay_mean_10D              1428
dep_delay_median_10D            1428
dep_delay_max_10D               1428
arr_delay_mean_10D              1485
arr_delay_median_10D            1485
arr_delay_max_10D               1485
cancelled_sum_10D                  0
div_airport_landings_sum_10D       0
n_flights_10D                      0
dep_delay_mean_30D              1213
dep_delay_median_30D            1213
dep_delay_max_30D               1213
arr_delay_mean_30D              1261
arr_delay_median_30D            1261
arr_delay_max_30D               1261
cancelled_sum_30D                  0
div_airport_landings_sum_30D       0
n_flights_30D                      0
dep_delay_mean_90D              1029
dep_delay_median_90D            1029
dep_delay_max_90D               1029
arr_delay_mean_90D              1068
arr_delay_median_90D            1068
a

Initially I merged `rolling_stats` with the flights dataframe but due to duplicates, the merged dataframe was several thousand rows longer. The code below is how I investigated the duplicates. 

In [19]:
rolling_stats.duplicated(subset=['route_id', 'airline_mkt', 'departure_window', 'scheduled_departure_datetime']).sum()

152

In [20]:
rolling_stats[rolling_stats.duplicated(subset=['route_id', 'airline_mkt', 'departure_window', 'scheduled_departure_datetime'], keep=False)].sort_values('scheduled_departure_datetime').head()


Unnamed: 0,airline_mkt,route_id,departure_window,scheduled_departure_datetime,dep_delay_mean_10D,dep_delay_median_10D,dep_delay_max_10D,arr_delay_mean_10D,arr_delay_median_10D,arr_delay_max_10D,cancelled_sum_10D,div_airport_landings_sum_10D,n_flights_10D,dep_delay_mean_30D,dep_delay_median_30D,dep_delay_max_30D,arr_delay_mean_30D,arr_delay_median_30D,arr_delay_max_30D,cancelled_sum_30D,div_airport_landings_sum_30D,n_flights_30D,dep_delay_mean_90D,dep_delay_median_90D,dep_delay_max_90D,arr_delay_mean_90D,arr_delay_median_90D,arr_delay_max_90D,cancelled_sum_90D,div_airport_landings_sum_90D,n_flights_90D
3849038,Frontier Airlines Inc.,3781,early morning,2023-03-05 05:45:00,0.3,-1.0,20.0,-4.5,-6.5,18.0,0.0,0.0,10.0,1.055556,-1.5,39.0,-2.166667,-6.0,40.0,0.0,0.0,18.0,1.055556,-1.5,39.0,-2.166667,-6.0,40.0,0.0,0.0,18.0
3849039,Frontier Airlines Inc.,3781,early morning,2023-03-05 05:45:00,-0.181818,-2.0,20.0,-5.363636,-8.0,18.0,0.0,0.0,11.0,0.736842,-2.0,39.0,-2.789474,-7.0,40.0,0.0,0.0,19.0,0.736842,-2.0,39.0,-2.789474,-7.0,40.0,0.0,0.0,19.0
2501285,Delta Air Lines Inc.,288,morning,2023-03-14 09:10:00,-0.1,-1.0,11.0,-0.2,-3.0,20.0,1.0,0.0,10.0,3.310345,-1.0,99.0,3.344828,-2.0,101.0,2.0,0.0,29.0,4.805556,-1.0,113.0,-2.458333,-4.0,104.0,2.0,0.0,72.0
2501284,Delta Air Lines Inc.,288,morning,2023-03-14 09:10:00,0.111111,0.0,11.0,0.888889,-3.0,20.0,1.0,0.0,9.0,3.5,-1.0,99.0,3.821429,-1.0,101.0,2.0,0.0,28.0,4.901408,-1.0,113.0,-2.352113,-4.0,104.0,2.0,0.0,71.0
3684550,Delta Air Lines Inc.,6191,midday,2023-03-14 12:40:00,13.4,2.0,104.0,0.2,-10.5,114.0,1.0,0.0,10.0,10.275862,0.0,106.0,1.655172,-5.0,114.0,2.0,0.0,29.0,6.464789,-2.0,108.0,2.380282,-1.0,114.0,2.0,0.0,71.0


To address the issue of one flight being paired with two rows of aggregated values, I sorted the duplicates by `n_flights_90D` and kept the last row for each key, which is the row with the highest flight count.

In [21]:
def drop_agg_duplicates(df):
    """
    Remove duplicate rows from the aggregated flight performance data while keeping the row 
    with the highest number of flights (n_flights) for each group.

    The DataFrame is sorted by 'route_id', 'airline_mkt', 'departure_window', 
    'scheduled_departure_datetime', and 'n_flights_90D', and duplicates are dropped based on 
    the first four columns. The row with the highest 'n_flights_90D' for each group is retained.

    Parameters:
    ----------
    df : pandas.DataFrame
        The input DataFrame containing the rolling flight performance statistics, 
        including the columns 'route_id', 'airline_mkt', 'departure_window', 
        'scheduled_departure_datetime', and 'n_flights_<window>'.

    Returns:
    -------
    pandas.DataFrame
        A deduplicated DataFrame where only one row per unique combination of 'route_id', 
        'airline_mkt', 'departure_window', and 'scheduled_departure_datetime' is retained,
        specifically the row with the highest 'n_flights'.
    """
    df_sorted = df.sort_values(['route_id', 'airline_mkt', 'departure_window', 'scheduled_departure_datetime', 'n_flights_90D'], ascending=True)

    df_deduped = df_sorted.drop_duplicates(subset=['route_id', 'airline_mkt', 'departure_window', 'scheduled_departure_datetime'], keep='last')

    return df_deduped

In [22]:
rolling_stats_deduped = drop_agg_duplicates(rolling_stats)

In [23]:
rolling_stats_deduped.shape

(7276838, 31)

In [24]:
flights_historical_performance = pd.merge(flights, rolling_stats_deduped, 
    on=['route_id', 'airline_mkt', 'departure_window', 'scheduled_departure_datetime'], 
    how='left')

In [25]:
flights_historical_performance.head()

Unnamed: 0,year,quarter,month,day_of_month,day_of_week,marketing_airline_id,flight_number_marketing_airline,operating_airline_id,tail_number,origin_airport_id,origin_city_market_id,origin,origin_state,dest_airport_id,dest_city_market_id,dest,dest_state,dep_delay,dep_del15,departure_delay_groups,taxi_out,taxi_in,arr_delay,cancelled,cancellation_code,diverted,scheduled_elapsed_time,actual_elapsed_time,distance,distance_group,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay,div_airport_landings,code_share_flight,origin_city,destination_city,scheduled_departure_datetime,scheduled_arrival_datetime,actual_departure_datetime,actual_arrival_datetime,airline_mkt,airline_ops,origin_timezone,destination_timezone,scheduled_departure_datetime_utc,actual_departure_datetime_utc,scheduled_arrival_datetime_utc,actual_arrival_datetime_utc,is_holiday,dep_delay_clipped,arr_delay_clipped,route_id,hour_of_day,departure_window,dep_delay_mean_10D,dep_delay_median_10D,dep_delay_max_10D,arr_delay_mean_10D,arr_delay_median_10D,arr_delay_max_10D,cancelled_sum_10D,div_airport_landings_sum_10D,n_flights_10D,dep_delay_mean_30D,dep_delay_median_30D,dep_delay_max_30D,arr_delay_mean_30D,arr_delay_median_30D,arr_delay_max_30D,cancelled_sum_30D,div_airport_landings_sum_30D,n_flights_30D,dep_delay_mean_90D,dep_delay_median_90D,dep_delay_max_90D,arr_delay_mean_90D,arr_delay_median_90D,arr_delay_max_90D,cancelled_sum_90D,div_airport_landings_sum_90D,n_flights_90D
0,2023,3,8,6,7,20416,2252,20416,N978NK,12889,32211,LAS,NV,11057,31057,CLT,NC,,,,,,,1,B,0,263.0,,1916.0,8,,,,,,0.0,0,Las Vegas,Charlotte,2023-08-06 23:59:00,2023-08-07 07:22:00,,,Spirit Air Lines,Spirit Air Lines,America/Los_Angeles,America/New_York,2023-08-07 07:00:00+00:00,,2023-08-07 11:00:00+00:00,,0,,,3228,23,late night,5.125,2.0,31.0,-5.0,-6.0,16.0,3.0,0.0,8.0,15.518519,5.0,82.0,4.923077,0.5,62.0,4.0,0.0,27.0,15.714286,6.0,163.0,4.903614,-2.0,166.0,5.0,0.0,84.0
1,2023,3,8,7,1,20416,2252,20416,N974NK,12889,32211,LAS,NV,11057,31057,CLT,NC,76.0,1.0,5.0,19.0,9.0,69.0,0,,0,263.0,256.0,1916.0,8,3.0,0.0,1.0,0.0,65.0,0.0,0,Las Vegas,Charlotte,2023-08-07 23:59:00,2023-08-08 07:22:00,2023-08-08 01:15:00,2023-08-08 08:31:00,Spirit Air Lines,Spirit Air Lines,America/Los_Angeles,America/New_York,2023-08-08 07:00:00+00:00,2023-08-08 08:15:00+00:00,2023-08-08 11:00:00+00:00,2023-08-08 12:31:00+00:00,0,76.0,69.0,3228,23,late night,14.0,2.0,76.0,4.714286,-6.0,69.0,3.0,0.0,8.0,17.740741,5.0,82.0,7.961538,5.0,69.0,4.0,0.0,27.0,16.690476,7.0,163.0,5.819277,-1.0,166.0,5.0,0.0,84.0
2,2023,3,8,9,3,20416,2252,20416,N519NK,12889,32211,LAS,NV,11057,31057,CLT,NC,-11.0,0.0,-1.0,14.0,10.0,-13.0,0,,0,258.0,256.0,1916.0,8,,,,,,0.0,0,Las Vegas,Charlotte,2023-08-09 23:10:00,2023-08-10 06:28:00,2023-08-09 22:59:00,2023-08-10 06:15:00,Spirit Air Lines,Spirit Air Lines,America/Los_Angeles,America/New_York,2023-08-10 06:00:00+00:00,2023-08-10 05:59:00+00:00,2023-08-10 10:00:00+00:00,2023-08-10 10:15:00+00:00,0,-11.0,-13.0,3228,23,late night,12.375,1.5,76.0,5.0,-6.0,69.0,3.0,0.0,8.0,15.185185,5.0,82.0,6.192308,0.5,69.0,4.0,0.0,27.0,16.710843,7.0,163.0,5.573171,-2.0,166.0,5.0,0.0,83.0
3,2023,3,8,10,4,20416,2252,20416,N532NK,12889,32211,LAS,NV,11057,31057,CLT,NC,-8.0,0.0,-1.0,12.0,7.0,-30.0,0,,0,258.0,236.0,1916.0,8,,,,,,0.0,0,Las Vegas,Charlotte,2023-08-10 23:10:00,2023-08-11 06:28:00,2023-08-10 23:02:00,2023-08-11 05:58:00,Spirit Air Lines,Spirit Air Lines,America/Los_Angeles,America/New_York,2023-08-11 06:00:00+00:00,2023-08-11 06:02:00+00:00,2023-08-11 10:00:00+00:00,2023-08-11 09:58:00+00:00,0,-8.0,-30.0,3228,23,late night,10.111111,1.0,76.0,0.625,-8.5,69.0,2.0,0.0,9.0,11.851852,3.0,76.0,2.653846,-0.5,69.0,4.0,0.0,27.0,16.120482,7.0,163.0,4.573171,-3.0,166.0,5.0,0.0,83.0
4,2023,3,8,12,6,20416,2252,20416,N529NK,12889,32211,LAS,NV,11057,31057,CLT,NC,4.0,0.0,0.0,18.0,6.0,-4.0,0,,0,258.0,250.0,1916.0,8,,,,,,0.0,0,Las Vegas,Charlotte,2023-08-12 23:10:00,2023-08-13 06:28:00,2023-08-12 23:14:00,2023-08-13 06:24:00,Spirit Air Lines,Spirit Air Lines,America/Los_Angeles,America/New_York,2023-08-13 06:00:00+00:00,2023-08-13 06:14:00+00:00,2023-08-13 10:00:00+00:00,2023-08-13 10:24:00+00:00,0,4.0,-4.0,3228,23,late night,12.25,3.0,76.0,2.5,-5.0,69.0,1.0,0.0,8.0,12.961538,4.5,76.0,4.08,0.0,69.0,4.0,0.0,26.0,15.682927,5.0,163.0,4.098765,-4.0,166.0,5.0,0.0,82.0


In [26]:
flights_historical_performance.shape[0] == flights.shape[0]

True
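
The row-count check above catches the problem after the fact. As a hedged alternative, `pandas` can enforce the expected key relationship during the merge itself through the `validate` argument. The sketch below uses made-up frames (`left`, `right_dupes`) with a single `key` column standing in for the four key columns.

```python
import pandas as pd

# Hypothetical stand-ins for the flights and aggregated dataframes
left = pd.DataFrame({'key': ['a', 'a', 'b'], 'flight': [1, 2, 3]})
right_dupes = pd.DataFrame({'key': ['a', 'a', 'b'], 'n_flights': [10, 11, 5]})

# With duplicate keys on the right, a left merge silently adds rows
merged_bad = left.merge(right_dupes, on='key', how='left')

# validate='many_to_one' raises instead of silently duplicating rows
try:
    left.merge(right_dupes, on='key', how='left', validate='many_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True

# Deduplicate as in drop_agg_duplicates(): sort, then keep the highest count
right_unique = (
    right_dupes.sort_values(['key', 'n_flights'])
               .drop_duplicates(subset='key', keep='last')
)
merged_ok = left.merge(right_unique, on='key', how='left', validate='many_to_one')
print(len(merged_bad), raised, len(merged_ok))
```

Here the unvalidated merge grows from three rows to five, while the deduplicated merge keeps the original three rows and passes validation.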

# Missing Performance Data

In [27]:
missing_performance_data = flights_historical_performance.isna().sum().reset_index(name='count')
missing_performance_data[missing_performance_data['count'] > 0]


Unnamed: 0,index,count
8,tail_number,19648
17,dep_delay,90160
18,dep_del15,90160
19,departure_delay_groups,90160
20,taxi_out,93119
21,taxi_in,93898
22,arr_delay,93897
24,cancellation_code,7183093
26,scheduled_elapsed_time,4
27,actual_elapsed_time,93898


In [28]:
route_time_window_counts = flights.groupby(['airline_mkt', 'route_id', 'departure_window'], observed=False).size().reset_index(name='n_flights')
rare_flights = route_time_window_counts[(route_time_window_counts['n_flights'] <= 5) & (route_time_window_counts['n_flights'] != 0)]
rare_flights



Unnamed: 0,airline_mkt,route_id,departure_window,n_flights
571,Alaska Airlines Inc.,64,early afternoon,5
573,Alaska Airlines Inc.,64,evening,1
574,Alaska Airlines Inc.,64,night,3
920,Alaska Airlines Inc.,103,morning,1
1002,Alaska Airlines Inc.,112,midday,5
...,...,...,...,...
603365,United Air Lines Inc.,6615,afternoon,4
604012,United Air Lines Inc.,6687,early afternoon,1
604030,United Air Lines Inc.,6689,early afternoon,1
604087,United Air Lines Inc.,6695,night,1


In [29]:
rolling_stats_deduped.isna().sum()
missing_stats = rolling_stats_deduped[rolling_stats_deduped['arr_delay_mean_10D'].isna()].sort_values('scheduled_departure_datetime')
missing_stats

Unnamed: 0,airline_mkt,route_id,departure_window,scheduled_departure_datetime,dep_delay_mean_10D,dep_delay_median_10D,dep_delay_max_10D,arr_delay_mean_10D,arr_delay_median_10D,arr_delay_max_10D,cancelled_sum_10D,div_airport_landings_sum_10D,n_flights_10D,dep_delay_mean_30D,dep_delay_median_30D,dep_delay_max_30D,arr_delay_mean_30D,arr_delay_median_30D,arr_delay_max_30D,cancelled_sum_30D,div_airport_landings_sum_30D,n_flights_30D,dep_delay_mean_90D,dep_delay_median_90D,dep_delay_max_90D,arr_delay_mean_90D,arr_delay_median_90D,arr_delay_max_90D,cancelled_sum_90D,div_airport_landings_sum_90D,n_flights_90D
3885328,Frontier Airlines Inc.,5054,overnight,2023-01-01 00:59:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0
5744448,Spirit Air Lines,795,overnight,2023-01-01 03:59:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0
4356697,Southwest Airlines Co.,372,early morning,2023-01-01 05:00:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0
4028081,JetBlue Airways,733,early morning,2023-01-01 05:01:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0
5930681,Spirit Air Lines,4612,early morning,2023-01-01 05:25:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
493404,Allegiant Air,6446,midday,2023-12-27 11:15:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,-10.5,-10.5,-7.0,-34.5,-34.5,-28.0,1.0,0.0,2.0
460890,Allegiant Air,5092,midday,2023-12-27 11:17:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0
421893,Allegiant Air,2361,midday,2023-12-27 11:41:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,31.0,31.0,31.0,38.0,38.0,38.0,1.0,0.0,1.0
452636,Allegiant Air,4700,early afternoon,2023-12-27 14:44:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,42.0,42.0,42.0,44.0,44.0,44.0,1.0,0.0,1.0


In [30]:
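# Distribution of 10-day cancellation counts among the rows with missing stats
missing_stats['cancelled_sum_10D'].value_counts()
# Restrict the rows with missing stats to those scheduled in January
missing_jan_flights = missing_stats[missing_stats['scheduled_departure_datetime'].dt.month == 1]
missing_jan_flights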

Unnamed: 0,airline_mkt,route_id,departure_window,scheduled_departure_datetime,dep_delay_mean_10D,dep_delay_median_10D,dep_delay_max_10D,arr_delay_mean_10D,arr_delay_median_10D,arr_delay_max_10D,cancelled_sum_10D,div_airport_landings_sum_10D,n_flights_10D,dep_delay_mean_30D,dep_delay_median_30D,dep_delay_max_30D,arr_delay_mean_30D,arr_delay_median_30D,arr_delay_max_30D,cancelled_sum_30D,div_airport_landings_sum_30D,n_flights_30D,dep_delay_mean_90D,dep_delay_median_90D,dep_delay_max_90D,arr_delay_mean_90D,arr_delay_median_90D,arr_delay_max_90D,cancelled_sum_90D,div_airport_landings_sum_90D,n_flights_90D
3885328,Frontier Airlines Inc.,5054,overnight,2023-01-01 00:59:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0
5744448,Spirit Air Lines,795,overnight,2023-01-01 03:59:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0
4356697,Southwest Airlines Co.,372,early morning,2023-01-01 05:00:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0
4028081,JetBlue Airways,733,early morning,2023-01-01 05:01:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0
5930681,Spirit Air Lines,4612,early morning,2023-01-01 05:25:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2119066,American Airlines Inc.,5680,early afternoon,2023-01-10 13:03:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0
220849,Alaska Airlines Inc.,5681,afternoon,2023-01-10 15:20:00,,,,,,,2.0,0.0,0.0,,,,,,,2.0,0.0,0.0,,,,,,,2.0,0.0,0.0
2045590,American Airlines Inc.,5291,early morning,2023-01-11 05:10:00,157.0,157.0,157.0,,,,1.0,0.0,1.0,157.0,157.0,157.0,,,,1.0,0.0,1.0,157.0,157.0,157.0,,,,1.0,0.0,1.0
765522,American Airlines Inc.,1157,morning,2023-01-11 07:32:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0


In [31]:
missing_cancelled_flights = missing_stats[missing_stats['cancelled_sum_10D'] != 0]
missing_cancelled_flights.shape

(1485, 31)

In [32]:
# Merge missing cancelled flights with rare flights
rare_missing_cancelled_flights = pd.merge(
    missing_cancelled_flights,
    rare_flights,
    on=['airline_mkt', 'route_id', 'departure_window'],
    how='inner'
)
rare_missing_cancelled_flights

Unnamed: 0,airline_mkt,route_id,departure_window,scheduled_departure_datetime,dep_delay_mean_10D,dep_delay_median_10D,dep_delay_max_10D,arr_delay_mean_10D,arr_delay_median_10D,arr_delay_max_10D,cancelled_sum_10D,div_airport_landings_sum_10D,n_flights_10D,dep_delay_mean_30D,dep_delay_median_30D,dep_delay_max_30D,arr_delay_mean_30D,arr_delay_median_30D,arr_delay_max_30D,cancelled_sum_30D,div_airport_landings_sum_30D,n_flights_30D,dep_delay_mean_90D,dep_delay_median_90D,dep_delay_max_90D,arr_delay_mean_90D,arr_delay_median_90D,arr_delay_max_90D,cancelled_sum_90D,div_airport_landings_sum_90D,n_flights_90D,n_flights
0,Spirit Air Lines,4612,early morning,2023-01-01 05:25:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,4
1,Spirit Air Lines,3228,early morning,2023-01-01 05:30:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,2
2,Spirit Air Lines,3228,early morning,2023-01-02 05:30:00,,,,,,,2.0,0.0,0.0,,,,,,,2.0,0.0,0.0,,,,,,,2.0,0.0,0.0,2
3,Spirit Air Lines,3964,early morning,2023-01-01 05:45:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,4
4,Spirit Air Lines,1992,morning,2023-01-01 06:15:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
129,Allegiant Air,449,early afternoon,2023-12-23 14:49:00,,,,,,,1.0,0.0,0.0,-8.0,-8.0,-8.0,-26.0,-26.0,-26.0,1.0,0.0,1.0,-8.0,-8.0,-8.0,-26.0,-26.0,-26.0,1.0,0.0,1.0,2
130,Southwest Airlines Co.,3909,night,2023-12-23 20:10:00,,,,,,,1.0,0.0,0.0,6.0,6.0,6.0,-3.0,-3.0,-3.0,1.0,0.0,1.0,6.0,6.0,6.0,-3.0,-3.0,-3.0,1.0,0.0,1.0,3
131,Allegiant Air,2426,night,2023-12-23 21:07:00,,,,,,,1.0,0.0,0.0,-2.0,-2.0,-2.0,-10.0,-10.0,-10.0,1.0,0.0,1.0,25.0,25.0,52.0,-2.0,-2.0,6.0,1.0,0.0,2.0,5
132,United Air Lines Inc.,2364,afternoon,2023-12-26 15:20:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,1


In [33]:
# Check the result
print(f"Number of missing cancelled flights that are rare: {len(rare_missing_cancelled_flights)}")

Number of missing cancelled flights that are rare: 134


In [34]:
flights[(flights['route_id'] == 3228) & (flights['airline_mkt'] == 'Spirit Air Lines')].sort_values('scheduled_departure_datetime').head(10)
rare_missing_cancelled_flights.head()

Unnamed: 0,airline_mkt,route_id,departure_window,scheduled_departure_datetime,dep_delay_mean_10D,dep_delay_median_10D,dep_delay_max_10D,arr_delay_mean_10D,arr_delay_median_10D,arr_delay_max_10D,cancelled_sum_10D,div_airport_landings_sum_10D,n_flights_10D,dep_delay_mean_30D,dep_delay_median_30D,dep_delay_max_30D,arr_delay_mean_30D,arr_delay_median_30D,arr_delay_max_30D,cancelled_sum_30D,div_airport_landings_sum_30D,n_flights_30D,dep_delay_mean_90D,dep_delay_median_90D,dep_delay_max_90D,arr_delay_mean_90D,arr_delay_median_90D,arr_delay_max_90D,cancelled_sum_90D,div_airport_landings_sum_90D,n_flights_90D,n_flights
0,Spirit Air Lines,4612,early morning,2023-01-01 05:25:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,4
1,Spirit Air Lines,3228,early morning,2023-01-01 05:30:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,2
2,Spirit Air Lines,3228,early morning,2023-01-02 05:30:00,,,,,,,2.0,0.0,0.0,,,,,,,2.0,0.0,0.0,,,,,,,2.0,0.0,0.0,2
3,Spirit Air Lines,3964,early morning,2023-01-01 05:45:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,4
4,Spirit Air Lines,1992,morning,2023-01-01 06:15:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,4


In [35]:
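# Count rows with missing stats per scheduled departure month.
# Grouping by a derived Series avoids adding and then dropping a temporary column on a slice.
scheduled_month = missing_stats['scheduled_departure_datetime'].dt.month.rename('scheduled_month')
missing_stats.groupby(scheduled_month).size()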

In [36]:
# calculate the average aggregated performance by airline_mkt and route_id
stat_names = ['dep_delay_mean', 'dep_delay_median', 'dep_delay_max',
              'arr_delay_mean', 'arr_delay_median', 'arr_delay_max',
              'cancelled_sum', 'div_airport_landings_sum', 'n_flights']
# Column names for every statistic in every window, ordered 10D, 30D, 90D
rolling_stat_cols = [f'{stat}_{window}' for window in ('10D', '30D', '90D') for stat in stat_names]
route_airline_agg = (
    rolling_stats.groupby(['airline_mkt', 'route_id'])[rolling_stat_cols]
    .mean()
    .reset_index()
)

In [37]:
agg_cols = route_airline_agg.drop(columns=['airline_mkt', 'route_id']).columns.to_list()
agg_cols

['dep_delay_mean_10D',
 'dep_delay_median_10D',
 'dep_delay_max_10D',
 'arr_delay_mean_10D',
 'arr_delay_median_10D',
 'arr_delay_max_10D',
 'cancelled_sum_10D',
 'div_airport_landings_sum_10D',
 'n_flights_10D',
 'dep_delay_mean_30D',
 'dep_delay_median_30D',
 'dep_delay_max_30D',
 'arr_delay_mean_30D',
 'arr_delay_median_30D',
 'arr_delay_max_30D',
 'cancelled_sum_30D',
 'div_airport_landings_sum_30D',
 'n_flights_30D',
 'dep_delay_mean_90D',
 'dep_delay_median_90D',
 'dep_delay_max_90D',
 'arr_delay_mean_90D',
 'arr_delay_median_90D',
 'arr_delay_max_90D',
 'cancelled_sum_90D',
 'div_airport_landings_sum_90D',
 'n_flights_90D']

In [38]:
# merge the aggregated data back into the missing stats to fill missing values
missing_stats_filled = pd.merge(
    missing_stats,
    route_airline_agg,
    on=['airline_mkt', 'route_id'],
    suffixes=('', '_agg'),
    how='left'
)

In [39]:
# Fill missing values in the original columns with the aggregated values
for col in agg_cols:
    missing_stats_filled[col] = missing_stats_filled[col].fillna(missing_stats_filled[col + '_agg'])

In [40]:
# Drop the extra columns used for merging
missing_stats_filled = missing_stats_filled.drop(columns=[col + '_agg' for col in agg_cols])
missing_stats_filled.head()

Unnamed: 0,airline_mkt,route_id,departure_window,scheduled_departure_datetime,dep_delay_mean_10D,dep_delay_median_10D,dep_delay_max_10D,arr_delay_mean_10D,arr_delay_median_10D,arr_delay_max_10D,cancelled_sum_10D,div_airport_landings_sum_10D,n_flights_10D,dep_delay_mean_30D,dep_delay_median_30D,dep_delay_max_30D,arr_delay_mean_30D,arr_delay_median_30D,arr_delay_max_30D,cancelled_sum_30D,div_airport_landings_sum_30D,n_flights_30D,dep_delay_mean_90D,dep_delay_median_90D,dep_delay_max_90D,arr_delay_mean_90D,arr_delay_median_90D,arr_delay_max_90D,cancelled_sum_90D,div_airport_landings_sum_90D,n_flights_90D
0,Frontier Airlines Inc.,5054,overnight,2023-01-01 00:59:00,45.978041,42.412037,124.787037,26.325715,20.74537,103.824074,1.0,0.0,0.0,45.000105,39.356481,161.648148,25.169223,17.722222,144.814815,1.0,0.0,0.0,44.229067,37.555556,163.092593,25.018425,15.601852,146.425926,1.0,0.0,0.0
1,Spirit Air Lines,795,overnight,2023-01-01 03:59:00,14.424169,2.728618,78.963816,9.307353,-1.215461,75.578947,1.0,0.0,0.0,13.889934,-0.166118,113.217105,8.737511,-3.427632,113.555921,1.0,0.0,0.0,14.222407,-0.539474,130.098684,9.261303,-3.375,135.230263,1.0,0.0,0.0
2,Southwest Airlines Co.,372,early morning,2023-01-01 05:00:00,10.20644,4.611667,50.644,5.728662,0.269333,48.464,1.0,0.0,0.0,10.584321,3.998667,78.139333,6.155333,-0.350667,76.206,1.0,0.0,0.0,11.52479,3.494333,113.376,6.533819,-1.441667,111.548,1.0,0.0,0.0
3,JetBlue Airways,733,early morning,2023-01-01 05:01:00,26.772484,12.85724,113.245902,23.127723,9.998292,121.251366,1.0,0.0,0.0,28.036155,11.318648,163.653005,24.233785,8.414617,168.459699,1.0,0.0,0.0,28.705772,11.077186,188.015027,24.324407,7.55806,193.965847,1.0,0.0,0.0
4,Spirit Air Lines,4612,early morning,2023-01-01 05:25:00,17.363181,5.943925,96.278297,9.254138,-1.070613,98.286604,1.0,0.0,0.0,18.203486,4.914849,143.185877,10.37009,-1.106438,141.695742,1.0,0.0,0.0,18.221915,2.67134,170.90135,9.985966,-2.567497,171.582555,1.0,0.0,0.0
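
The merge-and-fill steps above amount to a group-mean fallback: a missing statistic is replaced by the average of that statistic over the same airline and route. A minimal sketch on made-up data follows (the toy values and the names `stats`, `group_means`, and `filled` are assumptions for illustration). It also shows why some rows can stay empty: when a group has no observed values at all, its mean is itself NaN.

```python
import numpy as np
import pandas as pd

# Toy rolling stats: group (A, 1) has observed values, group (B, 2) has none.
stats = pd.DataFrame({
    "airline_mkt": ["A", "A", "A", "B"],
    "route_id": [1, 1, 1, 2],
    "dep_delay_mean_10D": [10.0, 20.0, np.nan, np.nan],
})

keys = ["airline_mkt", "route_id"]
value_cols = ["dep_delay_mean_10D"]

# Per-group averages (NaNs are skipped; an all-NaN group yields NaN)
group_means = stats.groupby(keys)[value_cols].mean().reset_index()

# Attach the averages with an '_agg' suffix and use them only where values are missing
merged = stats.merge(group_means, on=keys, how="left", suffixes=("", "_agg"))
for col in value_cols:
    merged[col] = merged[col].fillna(merged[col + "_agg"])

# Drop the helper columns
filled = merged.drop(columns=[col + "_agg" for col in value_cols])
print(filled)
```

Here the third row is filled with the group average of 15.0, while the (B, 2) row remains NaN because that group has nothing to average.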


In [41]:
missing_stats_filled.isna().sum()
missing_stats_filled[missing_stats_filled['dep_delay_mean_10D'].isna()]

Unnamed: 0,airline_mkt,route_id,departure_window,scheduled_departure_datetime,dep_delay_mean_10D,dep_delay_median_10D,dep_delay_max_10D,arr_delay_mean_10D,arr_delay_median_10D,arr_delay_max_10D,cancelled_sum_10D,div_airport_landings_sum_10D,n_flights_10D,dep_delay_mean_30D,dep_delay_median_30D,dep_delay_max_30D,arr_delay_mean_30D,arr_delay_median_30D,arr_delay_max_30D,cancelled_sum_30D,div_airport_landings_sum_30D,n_flights_30D,dep_delay_mean_90D,dep_delay_median_90D,dep_delay_max_90D,arr_delay_mean_90D,arr_delay_median_90D,arr_delay_max_90D,cancelled_sum_90D,div_airport_landings_sum_90D,n_flights_90D
535,Allegiant Air,2423,morning,2023-01-03 10:23:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0
545,Allegiant Air,6242,early afternoon,2023-01-03 14:44:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0
713,United Air Lines Inc.,1284,evening,2023-01-30 17:00:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0
754,American Airlines Inc.,6590,morning,2023-03-04 09:30:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0
866,Frontier Airlines Inc.,3986,morning,2023-04-14 08:20:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0
1379,American Airlines Inc.,5715,midday,2023-09-24 12:15:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0
1422,Frontier Airlines Inc.,4924,evening,2023-12-17 18:24:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0
1477,United Air Lines Inc.,2364,afternoon,2023-12-26 15:20:00,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0,,,,,,,1.0,0.0,0.0


In [42]:
flights[(flights['route_id'] == 1284) & (flights['airline_mkt'] == 'United Air Lines Inc.')].sort_values('scheduled_departure_datetime').head(10)


Unnamed: 0,year,quarter,month,day_of_month,day_of_week,marketing_airline_id,flight_number_marketing_airline,operating_airline_id,tail_number,origin_airport_id,origin_city_market_id,origin,origin_state,dest_airport_id,dest_city_market_id,dest,dest_state,dep_delay,dep_del15,departure_delay_groups,taxi_out,taxi_in,arr_delay,cancelled,cancellation_code,diverted,scheduled_elapsed_time,actual_elapsed_time,distance,distance_group,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay,div_airport_landings,code_share_flight,origin_city,destination_city,scheduled_departure_datetime,scheduled_arrival_datetime,actual_departure_datetime,actual_arrival_datetime,airline_mkt,airline_ops,origin_timezone,destination_timezone,scheduled_departure_datetime_utc,actual_departure_datetime_utc,scheduled_arrival_datetime_utc,actual_arrival_datetime_utc,is_holiday,dep_delay_clipped,arr_delay_clipped,route_id,hour_of_day,departure_window
5903949,2023,1,1,30,1,19977,6009,20304,N909SW,11092,31092,CNY,UT,14869,34614,SLC,UT,,,,,,,1,B,0,52.0,,183.0,1,,,,,,0.0,1,Moab,Salt Lake City,2023-01-30 17:00:00,2023-01-30 17:52:00,,,United Air Lines Inc.,SkyWest Airlines Inc.,America/Denver,America/Denver,2023-01-31 00:00:00+00:00,,2023-01-31 01:00:00+00:00,,0,,,1284,17,evening


In [43]:
# Keep the rows that already have 10-day arrival-delay stats, then append the imputed rows
rolling_stats_not_missing = rolling_stats_deduped.dropna(subset=['arr_delay_mean_10D'])
rolling_stats_deduped_imputed = pd.concat([rolling_stats_not_missing, missing_stats_filled], axis=0)
rolling_stats_deduped_imputed.shape

(7276838, 31)
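
Splitting a dataframe on missing values and concatenating the two parts should neither lose nor duplicate rows. The sketch below shows one way to check this on made-up data (the toy frame, the `0.0` placeholder fill, and the variable names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the deduplicated rolling stats, keyed by route_id
deduped = pd.DataFrame({
    "route_id": [1, 2, 3, 4],
    "arr_delay_mean_10D": [5.0, np.nan, -2.0, np.nan],
})

not_missing = deduped.dropna(subset=["arr_delay_mean_10D"])
# Placeholder imputation: any fill strategy could be substituted here
missing_filled = deduped[deduped["arr_delay_mean_10D"].isna()].fillna(
    {"arr_delay_mean_10D": 0.0}
)

# ignore_index=True gives the recombined frame a clean 0..n-1 index
recombined = pd.concat([not_missing, missing_filled], axis=0, ignore_index=True)

# Row count is preserved and no key appears twice
assert len(recombined) == len(deduped)
assert not recombined.duplicated(subset=["route_id"]).any()
```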

In [44]:
rolling_stats_deduped.columns.to_list()

['airline_mkt',
 'route_id',
 'departure_window',
 'scheduled_departure_datetime',
 'dep_delay_mean_10D',
 'dep_delay_median_10D',
 'dep_delay_max_10D',
 'arr_delay_mean_10D',
 'arr_delay_median_10D',
 'arr_delay_max_10D',
 'cancelled_sum_10D',
 'div_airport_landings_sum_10D',
 'n_flights_10D',
 'dep_delay_mean_30D',
 'dep_delay_median_30D',
 'dep_delay_max_30D',
 'arr_delay_mean_30D',
 'arr_delay_median_30D',
 'arr_delay_max_30D',
 'cancelled_sum_30D',
 'div_airport_landings_sum_30D',
 'n_flights_30D',
 'dep_delay_mean_90D',
 'dep_delay_median_90D',
 'dep_delay_max_90D',
 'arr_delay_mean_90D',
 'arr_delay_median_90D',
 'arr_delay_max_90D',
 'cancelled_sum_90D',
 'div_airport_landings_sum_90D',
 'n_flights_90D']

In [45]:
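# Summary statistics for the 10-day window, formatted to two decimals.
# DataFrame.map replaces the deprecated DataFrame.applymap (pandas >= 2.1).
rolling_stats[['dep_delay_mean_10D',
 'dep_delay_median_10D',
 'dep_delay_max_10D',
 'arr_delay_mean_10D',
 'arr_delay_median_10D',
 'arr_delay_max_10D',
 'cancelled_sum_10D',
 'div_airport_landings_sum_10D',
 'n_flights_10D']].describe().map(lambda x: '{:.2f}'.format(x))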


Unnamed: 0,dep_delay_mean_10D,dep_delay_median_10D,dep_delay_max_10D,arr_delay_mean_10D,arr_delay_median_10D,arr_delay_max_10D,cancelled_sum_10D,div_airport_landings_sum_10D,n_flights_10D
count,7275562.0,7275562.0,7275562.0,7275505.0,7275505.0,7275505.0,7276990.0,7276990.0,7276990.0
mean,10.5,1.67,71.78,5.52,-2.89,73.55,0.15,0.03,11.74
std,16.95,13.35,67.11,19.57,16.4,70.15,0.52,0.35,7.53
min,-48.0,-48.0,-48.0,-119.0,-119.0,-119.0,0.0,0.0,0.0
25%,-0.89,-4.5,17.0,-7.1,-12.0,18.0,0.0,0.0,9.0
50%,5.7,-2.0,50.0,1.25,-6.0,50.0,0.0,0.0,10.0
75%,16.89,2.0,111.0,13.6,2.0,115.0,0.0,0.0,11.0
max,209.0,209.0,209.0,217.0,217.0,217.0,15.0,28.0,100.0


In [46]:
rolling_stats_deduped.shape

(7276838, 31)

In [47]:
flights_historical_performance_imputed = pd.merge(flights, rolling_stats_deduped_imputed, 
    on=['route_id', 'airline_mkt', 'departure_window', 'scheduled_departure_datetime'], 
    how='left')

In [48]:
flights_historical_performance_imputed.shape

(7276990, 84)

In [49]:
flights_historical_performance.shape

(7276990, 84)
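
The matching shapes indicate that the left merge did not duplicate any flights. One way to enforce this at merge time, rather than checking afterwards, is the `validate` argument of `pd.merge`. The sketch below uses made-up toy frames (`flights_toy`, `stats_toy`, and `stats_dup` are assumptions for illustration):

```python
import pandas as pd

# Several flights can share one stats row, so the relationship is many-to-one
flights_toy = pd.DataFrame({"route_id": [1, 1, 2], "dep_delay": [3.0, 7.0, -1.0]})
stats_toy = pd.DataFrame({"route_id": [1, 2], "dep_delay_mean_10D": [5.0, -1.0]})

# validate='many_to_one' raises if the right-hand keys are not unique
merged = pd.merge(flights_toy, stats_toy, on="route_id", how="left", validate="many_to_one")
assert len(merged) == len(flights_toy)

# A duplicated key on the right side is rejected instead of silently adding rows
stats_dup = pd.concat([stats_toy, stats_toy.iloc[[0]]], ignore_index=True)
rejected = False
try:
    pd.merge(flights_toy, stats_dup, on="route_id", how="left", validate="many_to_one")
except pd.errors.MergeError as exc:
    rejected = True
    print(f"merge rejected: {exc}")
```

In the notebook, the same argument could be added to the merge on `['route_id', 'airline_mkt', 'departure_window', 'scheduled_departure_datetime']`, assuming the deduplicated stats are unique on those four keys.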

# Save to CSV

In [50]:
flights_historical_performance.to_csv(DATA_PATH + '/interim/flights_historical_performance.csv', index=False)

In [51]:
flights_historical_performance_imputed.to_csv(DATA_PATH + '/interim/flights_historical_performance_imputed.csv', index=False)