# Data Pipeline

**Feature Extraction**

1.  **Capture Historical Delay Patterns:**
    * **Feature:** `Delay_Trend`
    * **Goal:** Identify recurring delays between specific airport pairs during particular weeks.
    * **Implementation:** Aggregate weekly delay data, focusing on the top 15 airports to manage memory and ensure generalizability.
    * **Rationale:** Historical delay patterns are strong indicators of future delays.

2.  **Assess Aircraft Reuse:**
    * **Feature:** `Same_Day_Tail_Reuse`
    * **Goal:** Determine if an aircraft is scheduled for multiple flights on the same day.
    * **Rationale:** High reuse can lead to delay propagation.

3.  **Calculate Turnaround Time:**
    * **Feature:** `Turnaround_Time`
    * **Goal:** Measure the time between an aircraft's arrival and its next scheduled departure.
    * **Rationale:** Short turnaround times increase delay vulnerability.

4.  **Determine Slack Time:**
    * **Feature:** `Slack_Time`
    * **Goal:** Calculate the difference between scheduled arrival and expected travel time.
    * **Rationale:** Slack time indicates buffer capacity, which can absorb minor delays.

**Outlier**

1.  **Handle Extreme Aircraft Age Values:**
    * **Feature:** `Aircraft_Age`
    * **Issue:** Presence of unrealistic age values.
    * **Solution:** Clip outliers to a realistic maximum (e.g., 56.9 years).
    * **Rationale:** Prevent outliers from skewing model training.

**Transformation and Imputation**

1.  **Apply Logarithmic Transformation:**
    * **Features:** `Distance`, `Wind_Gust`
    * **Transformation:** `log1p`
    * **Rationale:** Address right-skewed distributions and improve model robustness.

2.  **Impute Missing Values:**
    * **Strategy:** Mean imputation for numeric features, most frequent imputation for categorical features.
    * **Rationale:** Maintain data completeness with a simple and effective approach.

**Feature Pruning**

1.  **Remove Redundant Information:**
    * **Action:** Drop columns like city/state names, latitude/longitude, and detailed marketing airline data.
    * **Rationale:** Eliminate redundant information that doesn't contribute to prediction accuracy.

2.  **Discard Uninformative or Highly Missing Columns:**
    * **Action:** Remove columns with significant missing data or those that offer little predictive value.
    * **Rationale:** Reduce noise and improve model focus.

**Encoding and Categorical**

1.  **One-Hot Encoding with Cardinality Limits:**
    * **Features:** `OriginAirportID`, `DestAirportID`
    * **Encoding:** `OneHotEncoder` with `max_categories=50`.
    * **Rationale:** Handle high-cardinality features efficiently while preserving important information.

2.  **Full One-Hot Encoding:**
    * **Features:** `DayOfWeek`, `Month`, `Operating_Airline`
    * **Encoding:** Complete one-hot encoding.
    * **Rationale:** Preserve category identity for time- and airline-sensitive predictions.

**Dimensionality Reduction**

- **Principal Component Analysis (PCA):**
    * **Technique:** `PCA(n_components=50)`
    * **Rationale:** Address multicollinearity and reduce dimensionality, especially with correlated weather and location data.
    * **Benefit:** Improved training efficiency and reduced overfitting.

In [18]:
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pylab as plt
import seaborn as sns
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer, StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
import time
from sklearn.model_selection import cross_val_score

from sklearn import set_config
set_config(transform_output = "pandas")
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.metrics import classification_report


In [19]:
def stratified_random_split(df: pd.DataFrame, target_column: str, test_size: float = 0.1, random_state: int = 42):
    """
    Performs a stratified random train-test split to ensure all classes in 
    'Flight_Status' are proportionally represented in both sets.

    Parameters:
    df (pd.DataFrame): The dataset containing the target variable.
    target_column (str): The column representing the classification target.
    test_size (float): The proportion of data to be used as test data.
    random_state (int): Random seed for reproducibility.

    Returns:
    tuple: (train_df, test_df) DataFrames.
    """
    train_df, test_df = train_test_split(
        df, test_size=test_size, stratify=df[target_column], random_state=random_state
    )

    print(f"Train size: {len(train_df)} samples")
    print(f"Test size: {len(test_df)} samples")

    return train_df, test_df



In [20]:
# Setting up the DataFrame for the flight data
flight_data = pd.read_parquet("WEATHER121.parquet")


In [21]:
# Stratify the data
train_data, test_data= stratified_random_split(flight_data, target_column="Flight_Status")

Train size: 13184010 samples
Test size: 1464890 samples


In [24]:
#downsample to 100000

train_data = train_data.sample(n=100000, random_state=42)

In [25]:
# Splitting the data into X and y
X_train = train_data.drop(columns=["Flight_Status"])
y_train = train_data["Flight_Status"]

### Pipeline Functions

In [26]:
# Custom progress logger
class ProgressLogger(BaseEstimator, TransformerMixin):
    """
    A transformer that logs progress through a pipeline.
    """
    
    def __init__(self, total_rows, log_interval=0.01, name='Pipeline'):
        self.total_rows = total_rows
        self.log_interval = log_interval
        self.name = name
        self.start_time = None
        self.last_log_percent = -1
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        # Initialize timer if not started
        if self.start_time is None:
            self.start_time = time.time()
            print(f"{self.name} processing started on {self.total_rows:,} rows")
        
        # Calculate current progress
        current_rows = X.shape[0]
        percent_complete = current_rows / self.total_rows
        
        # Check if we need to log progress
        int_percent = int(percent_complete / self.log_interval)
        if int_percent > self.last_log_percent:
            self.last_log_percent = int_percent
            elapsed = time.time() - self.start_time
            
            # Estimate time remaining
            percent_done = percent_complete * 100
            if percent_complete > 0:
                total_est = elapsed / percent_complete
                remaining = total_est - elapsed
                time_str = f" - Est. remaining: {remaining:.1f}s"
            else:
                time_str = ""
                
            print(f"{self.name}: {percent_done:.1f}% complete ({current_rows:,}/{self.total_rows:,} rows){time_str}")
        
        return X

class DelayTrendEncoder(BaseEstimator, TransformerMixin):
    """
    Transformer that precomputes a route-based delay trend using week_of_year and top N airports.

    Parameters:
        date_col (str): Name of the datetime column
        origin_col (str): Origin airport column
        dest_col (str): Destination airport column
        status_col (str): Flight status column for delay signal
        output_col (str): Name of the new delay trend feature
        top_n_airports (int): Number of top airports to retain by frequency
    """
    def __init__(self,
                 date_col='dep_datetime',
                 origin_col='OriginAirportID',
                 dest_col='DestAirportID',
                 status_col='Flight_Status',
                 output_col='Delay_Trend',
                 top_n_airports=15):
        self.date_col = date_col
        self.origin_col = origin_col
        self.dest_col = dest_col
        self.status_col = status_col
        self.output_col = output_col
        self.top_n_airports = top_n_airports

    def fit(self, X, y=None):
        X_temp = X.copy()
        X_temp[self.date_col] = pd.to_datetime(X_temp[self.date_col])
        X_temp['week_of_year'] = X_temp[self.date_col].dt.isocalendar().week

        X_temp['Flight_Status'] = y.reset_index(drop=True)
        
        # Restrict to top N airports
        all_airports = pd.concat([X_temp[self.origin_col], X_temp[self.dest_col]])
        top_airports = all_airports.value_counts().head(self.top_n_airports).index
        X_temp = X_temp[
            X_temp[self.origin_col].isin(top_airports) &
            X_temp[self.dest_col].isin(top_airports)
        ]

        # Create delay signal
        delay_reasons = [
            'CarrierDelay', 'WeatherDelay', 'NASDelay',
            'SecurityDelay', 'LateAircraftDelay'
        ]
        X_temp['delay_signal'] = 0
        mask = X_temp['Flight_Status'].str.contains('Delay', na=False)
        delay_type = X_temp.loc[mask, 'Flight_Status'].str.split(' - ').str[-1]
        valid = delay_type.isin(delay_reasons)
        X_temp.loc[mask[mask].index[valid], 'delay_signal'] = 1

        # Group by week and airport pair
        self.trend_lookup_ = (
            X_temp.groupby(['week_of_year', self.origin_col, self.dest_col])['delay_signal']
            .mean()
            .reset_index()
            .rename(columns={'delay_signal': self.output_col})
        )

        return self

    def transform(self, X):
        X = X.copy()
        X[self.date_col] = pd.to_datetime(X[self.date_col])
        X['week_of_year'] = X[self.date_col].dt.isocalendar().week

        X = X.merge(
            self.trend_lookup_,
            on=['week_of_year', self.origin_col, self.dest_col],
            how='left'
        )

        X[self.output_col] = X[self.output_col].fillna(0)
        return X.drop(columns=['week_of_year'])

class SameDayTailReuseEncoder(BaseEstimator, TransformerMixin):
    """
    Transformer that creates a feature counting how many times
    the same tail number (aircraft) is used on the same day.

    Parameters:
        datetime_col (str): Column containing full departure datetime
        tail_col (str): Column name for tail number
        output_col (str): Name of the output column
    """

    def __init__(self,
                 datetime_col='dep_datetime',
                 tail_col='Tail_Number',
                 output_col='Same_Day_Tail_Reuse'):
        self.datetime_col = datetime_col
        self.tail_col = tail_col
        self.output_col = output_col

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()

        # Ensure datetime is parsed correctly
        X[self.datetime_col] = pd.to_datetime(X[self.datetime_col])

        # Extract just the date portion
        X['dep_date'] = X[self.datetime_col].dt.date

        # Count reuse of same tail number per day
        X[self.output_col] = (
            X.groupby([self.tail_col, 'dep_date'])[self.tail_col]
            .transform('count')
            .astype(float)  
        )

        return X.drop(columns=['dep_date'])

    
class TurnaroundDelayEncoder(BaseEstimator, TransformerMixin):
    """
    Transformer that computes:
    - Previous_Flight_Delay: scheduled departure time of previous flight with same tail number on same day
    - Turnaround_Time: time between previous arrival and current departure

    Parameters:
        datetime_col (str): Column with full departure datetime
        dep_time_col (str): Column with scheduled departure time (e.g. CRSDepTime)
        arr_time_col (str): Column with scheduled arrival time (e.g. CRSArrTime)
        tail_col (str): Tail number column
        output_prefix (str): Prefix to use for new feature columns
    """

    def __init__(self,
                 datetime_col='dep_datetime',
                 dep_time_col='CRSDepTime',
                 arr_time_col='CRSArrTime',
                 tail_col='Tail_Number',
                 output_prefix=''):
        self.datetime_col = datetime_col
        self.dep_time_col = dep_time_col
        self.arr_time_col = arr_time_col
        self.tail_col = tail_col
        self.output_prefix = output_prefix

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X[self.datetime_col] = pd.to_datetime(X[self.datetime_col])
        X['dep_date'] = X[self.datetime_col].dt.date

        # Sort the values
        X = X.sort_values(by=[self.tail_col, 'dep_date', self.dep_time_col])

        prev_delay_col = self.output_prefix + 'Previous_Flight_Delay'
        turnaround_col = self.output_prefix + 'Turnaround_Time'

        # Previous scheduled departure
        X[prev_delay_col] = (
            X.groupby([self.tail_col, 'dep_date'])[self.dep_time_col]
            .shift(1)
        )

        # Previous scheduled arrival time
        X['Previous_Arrival_Time'] = (
            X.groupby([self.tail_col, 'dep_date'])[self.arr_time_col]
            .shift(1)
        )

        # Compute turnaround time using a defined method
        X[turnaround_col] = self._calculate_turnaround(X[self.dep_time_col], X['Previous_Arrival_Time'])

        # Replace missing with 0
        X[prev_delay_col] = X[prev_delay_col].fillna(0)
        X[turnaround_col] = X[turnaround_col].fillna(0)

        return X.drop(columns=['dep_date', 'Previous_Arrival_Time'])

    def _calculate_turnaround(self, current_dep, previous_arr):
        """
        Applies time difference logic with wrap-around at midnight (2400).
        """
        diff = current_dep - previous_arr
        adjusted = diff.mask((~diff.isna()) & (diff < 0), diff + 2400)
        return adjusted

class SlackTimeEncoder(BaseEstimator, TransformerMixin):
    """
    Transformer that calculates slack time as the difference between scheduled
    arrival time (CRSArrTime) and scheduled elapsed time (CRSElapsedTime).

    Parameters:
        arr_col (str): Column name for scheduled arrival time.
        elapsed_col (str): Column name for scheduled elapsed time.
        output_col (str): Name of the output column to store slack time.
    """

    def __init__(self,
                 arr_col='CRSArrTime',
                 elapsed_col='CRSElapsedTime',
                 output_col='Slack_Time'):
        self.arr_col = arr_col
        self.elapsed_col = elapsed_col
        self.output_col = output_col

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X[self.output_col] = X[self.arr_col] - X[self.elapsed_col]
        return X
    
    

class AgeClipper(BaseEstimator, TransformerMixin):
    """
    Transformer that caps aircraft age at a maximum value.
    
    Parameters:
    -----------
    max_age : float, default=56.9
        Maximum value for aircraft age
    column_name : str, default='Aircraft_Age'
        Name of the column to apply the cap to
    """
    
    def __init__(self, max_age=56.9, column_name='Aircraft_Age'):
        self.max_age = max_age
        self.column_name = column_name
    
    def fit(self, X, y=None):
        # Nothing to fit, just return self
        return self
    
    def transform(self, X):
        # Convert to DataFrame if it's not already
        X_transformed = X.copy() if isinstance(X, pd.DataFrame) else pd.DataFrame(X)
        
        # Apply the cap only if the column exists
        if self.column_name in X_transformed.columns:
            X_transformed[self.column_name] = np.minimum(X_transformed[self.column_name], self.max_age)
        
        # Return array if input was array
        if not isinstance(X, pd.DataFrame):
            X_transformed = X_transformed.values
            
        return X_transformed

### Pipeline Model

In [58]:
# import random forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.decomposition import PCA


numerical_features = [
    'Distance', 'relative_humidity_2m', 'temperature_2m',
    'Aircraft_Age', 'temperature_2m_dest_dep_time',
    'snowfall', 'soil_moisture_0_to_1cm',
    'snowfall_dest_dep_time', 
    'et0_fao_evapotranspiration_dest_dep_time',
    'surface_pressure_dest_dep_time', 'pressure_msl_dest_dep_time',
    'wind_direction_10m_dest_dep_time', 'wind_gusts_10m_dest_dep_time', 'CRSElapsedTime'
]

categorical_features = [
    'DayOfWeek', 'Month',
    'OriginAirportID',
    'DestAirportID', 'Is_Holiday_Week', 'Operating_Airline '
]

drop_features = [
            'Aircraft_Model', 'Aircraft_EngineType', 'Holiday',
            'dep_datetime', 'Tail_Number', 'temperature_120m_dest_dep_time', 
            'temperature_180m_dest_dep_time', 'temperature_80m_dest_dep_time', 'temperature_180m', 
            'temperature_120m', 'temperature_80m', 'wind_speed_80m', 'wind_speed_120m',
            'wind_speed_180m', 'wind_direction_80m', 'wind_direction_120m', 'wind_direction_180m',
            'wind_speed_80m_dest_dep_time', 'wind_speed_120m_dest_dep_time', 'wind_speed_180m_dest_dep_time',
            'wind_direction_80m_dest_dep_time', 'wind_direction_120m_dest_dep_time', 'wind_direction_180m_dest_dep_time', 
            'soil_temperature_0cm_dest_dep_time', 'soil_temperature_0cm', 'rain', 'rain_dest_dep_time', 'dew_point_2m_dest_dep_time', 
            'cloud_cover_mid', 'cloud_cover', 'cloud_cover_mid_dest_dep_time', 'cloud_cover_dest_dep_time', 'dew_point_2m', 
            'wind_speed_10m', 'wind_speed_10m_dest_dep_time', 'visibility_dest_dep_time', 'visibility', 'apparent_temperature', 
            'apparent_temperature_dest_dep_time', 'vapour_pressure_deficit', 'vapour_pressure_deficit_dest_dep_time', 'Flights',
            'DOT_ID_Operating_Airline', 'Flight_Number_Operating_Airline', 'Aircraft_Engines', 'Aircraft_Seats', 'Is_Freighter',
            'relative_humidity_2m_dest_dep_time', 'precipitation', 'showers', 'snow_depth', 'et0_fao_evapotranspiration',
            'evapotranspiration', 'cloud_cover_high', 'cloud_cover_low', 'surface_pressure', 'weather_code', 'pressure_msl', 
            'wind_direction_10m', 'wind_gusts_10m', 'precipitation_dest_dep_time', 'showers_dest_dep_time', 'snow_depth_dest_dep_time',
            'soil_moisture_0_to_1cm_dest_dep_time', 'evapotranspiration_dest_dep_time', 'cloud_cover_high_dest_dep_time',
            'cloud_cover_low_dest_dep_time', 'weather_code_dest_dep_time', 'Operated_or_Branded_Code_Share_Partners', 'Year',
            'Quarter', 'Marketing_Airline_Network', 'DayofMonth', 'DOT_ID_Marketing_Airline', 'Flight_Number_Marketing_Airline', 
            'OriginAirportSeqID', 'OriginCityMarketID', 'Origin', 'OriginCityName', 'OriginState', 'OriginStateFips', 
            'OriginStateName', 'OriginWac', 'DestAirportSeqID', 'DestCityMarketID', 'Dest', 'DestCityName', 'DestState', 'DestStateFips', 
            'DestStateName', 'DestWac', 'CRSDepTime', 'DepTimeBlk', 'CRSArrTime', 'ArrTimeBlk', 'DistanceGroup', 'Duplicate', 
            'Aircraft_Airline', 'Aircraft_ModelCode', 'Aircraft_Type'
        ]

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('age_clipper', AgeClipper(max_age=56.9, column_name='Aircraft_Age')),
    ('log1p_distance', ColumnTransformer([
        ('log', FunctionTransformer(np.log1p), ['Distance', 'wind_gusts_10m_dest_dep_time'])
    ], remainder='passthrough', verbose_feature_names_out=False)), 
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('OHE', ColumnTransformer([
        ('top50', OneHotEncoder(handle_unknown='ignore', sparse_output=False, max_categories=51), ['OriginAirportID', 'DestAirportID']),
        ('OHE' , OneHotEncoder(handle_unknown='ignore', sparse_output=False), ['DayOfWeek', 'Month', 'Operating_Airline '])
    ]))
])

preprocessor = Pipeline([
    ('initial_logger', ProgressLogger(total_rows=13000000, name='Starting Pipeline')),
    
    ('delay_trend', DelayTrendEncoder(
        date_col='dep_datetime',
        origin_col='OriginAirportID',
        dest_col='DestAirportID',
        status_col='Flight_Status',
        output_col='Delay_Trend',
        top_n_airports=15
    )),
    
    ('trend_logger', ProgressLogger(total_rows=13000000, name='Delay Trend Complete')),
    
    ('tail_reuse', SameDayTailReuseEncoder(
        datetime_col='dep_datetime',
        tail_col='Tail_Number',
        output_col='Same_Day_Tail_Reuse'
    )),
    
    ('reuse_logger', ProgressLogger(total_rows=13000000, name='Tail Reuse Complete')),
    
    ('turnaround_delay', TurnaroundDelayEncoder(
        datetime_col='dep_datetime',
        dep_time_col='CRSDepTime',
        arr_time_col='CRSArrTime',
        tail_col='Tail_Number',
        output_prefix=''
    )),
    
    ('turnaround_logger', ProgressLogger(total_rows=13000000, name='Turnaround Delay Complete')),
    
    ('slack_time', SlackTimeEncoder(
        arr_col='CRSArrTime',
        elapsed_col='CRSElapsedTime',
        output_col='Slack_Time'
    )),
    
    ('slack_logger', ProgressLogger(total_rows=13000000, name='Slack Time Complete')),
    
    ('feature_prep', ColumnTransformer([
        ('numeric', numeric_transformer, numerical_features),
        ('categorical', categorical_transformer, categorical_features),
        ('drop', 'drop', drop_features)
    ], remainder='passthrough', verbose_feature_names_out=False, n_jobs=16)),
    
    ('prep_logger', ProgressLogger(total_rows=13000000, name='Feature Prep Complete')),
    ('select', PCA(
        n_components=75,
        random_state=42,
    )),
    ('final_logger', ProgressLogger(total_rows=13000000, name='Pipeline Complete'))
])
# so that I dont forget, dep_datetime is dropped because of redundancy

preprocessor

In [59]:
# Train the pipeline in this cell and run it on X, this cell must end with the transformed data
X = preprocessor.fit_transform(X_train, y_train)

pd.set_option('display.max_columns', None)
X

Starting Pipeline processing started on 13,000,000 rows
Starting Pipeline: 0.8% complete (100,000/13,000,000 rows) - Est. remaining: 0.0s
Delay Trend Complete processing started on 13,000,000 rows
Delay Trend Complete: 0.8% complete (100,000/13,000,000 rows) - Est. remaining: 0.0s
Tail Reuse Complete processing started on 13,000,000 rows
Tail Reuse Complete: 0.8% complete (100,000/13,000,000 rows) - Est. remaining: 0.0s
Turnaround Delay Complete processing started on 13,000,000 rows
Turnaround Delay Complete: 0.8% complete (100,000/13,000,000 rows) - Est. remaining: 0.0s
Slack Time Complete processing started on 13,000,000 rows
Slack Time Complete: 0.8% complete (100,000/13,000,000 rows) - Est. remaining: 0.0s
Feature Prep Complete processing started on 13,000,000 rows
Feature Prep Complete: 0.8% complete (100,000/13,000,000 rows) - Est. remaining: 0.0s
Pipeline Complete processing started on 13,000,000 rows
Pipeline Complete: 0.8% complete (100,000/13,000,000 rows) - Est. remaining: 0

Unnamed: 0,pca0,pca1,pca2,pca3,pca4,pca5,pca6,pca7,pca8,pca9,pca10,pca11,pca12,pca13,pca14,pca15,pca16,pca17,pca18,pca19,pca20,pca21,pca22,pca23,pca24,pca25,pca26,pca27,pca28,pca29,pca30,pca31,pca32,pca33,pca34,pca35,pca36,pca37,pca38,pca39,pca40,pca41,pca42,pca43,pca44,pca45,pca46,pca47,pca48,pca49,pca50,pca51,pca52,pca53,pca54,pca55,pca56,pca57,pca58,pca59,pca60,pca61,pca62,pca63,pca64,pca65,pca66,pca67,pca68,pca69,pca70,pca71,pca72,pca73,pca74
84341,369.406386,-9.608162,0.179558,-0.062166,0.596542,0.550946,-0.264805,-0.513814,0.620569,-0.494372,-0.033893,-0.995415,-0.787977,1.142734,0.158463,0.712329,0.101890,0.226943,0.124162,-0.294356,0.021556,0.094708,0.192587,0.501935,1.026374,-0.284926,-0.768755,-0.461073,-0.015108,-0.096331,-0.181976,-0.309769,-0.666685,0.297711,0.118579,-0.368372,0.384661,0.169260,0.034277,0.009300,-0.030203,-0.003334,-0.175433,-0.079776,-0.057985,-0.054252,-0.127423,-0.039765,0.029790,-0.034567,0.015400,-0.125174,-0.107160,0.091005,0.008068,-0.057949,-0.046862,-0.089729,-0.060579,0.085124,0.074155,0.014692,0.126393,-0.077876,-0.009426,0.055077,-0.084227,-0.086071,0.014726,0.175574,-0.087826,0.090813,0.013114,-0.080915,0.057627
37492,-653.419003,-28.480496,1.441660,0.403025,-0.210278,1.273879,1.188273,0.604383,-0.605486,-0.228558,-0.084783,-0.375250,-0.546711,0.029999,-0.257772,0.389423,0.653168,0.186674,0.678997,0.421090,-0.288825,-0.247507,-0.250112,-0.166262,0.223529,-0.114127,-0.280239,-0.372612,0.040899,0.430109,0.691879,0.226532,-0.182182,-0.098686,-0.077041,-0.099270,0.171210,0.110906,0.083219,0.154603,0.012287,-0.048274,-0.041235,-0.006061,0.023991,-0.071345,0.000542,-0.083028,-0.017518,0.050702,-0.095135,-0.087924,0.073731,0.044353,0.019871,-0.037660,-0.044624,-0.133056,0.001022,0.071558,0.031961,0.058267,0.180589,-0.026137,0.029428,-0.001374,-0.070153,0.008106,0.048121,0.252027,-0.103192,0.021629,0.053371,0.068848,0.073725
55696,170.439372,-13.279279,0.424960,-0.865623,0.010005,1.300962,-0.513658,-1.093548,0.485868,-0.025464,-0.085488,-0.288770,-0.799541,1.098683,-0.744922,0.173100,1.028667,0.262057,0.056733,-0.253986,0.099261,0.747357,-0.514250,-0.133495,0.059222,-0.132035,-0.246426,-0.506664,0.039164,-0.126774,-0.426497,0.658513,-0.185086,-0.121259,-0.036137,0.055114,0.065869,0.195934,0.166376,0.117271,0.034822,-0.342663,0.052657,0.039262,0.303700,0.356481,0.250798,0.161384,0.403531,-0.172050,-0.044468,-0.047200,-0.140299,0.012314,0.014660,-0.060238,-0.096416,-0.046480,0.005344,-0.025261,0.164102,0.028976,0.127662,0.009735,0.052783,0.073739,-0.079384,-0.065289,-0.019136,0.220117,-0.083960,0.039033,-0.004053,-0.095916,0.008449
8859,-290.482590,-21.784067,0.993400,-1.629461,0.419143,-0.613777,0.709189,0.458208,0.372986,-0.385654,0.120670,-0.548422,0.245208,-0.089310,-0.753674,-0.209001,0.088275,0.399016,-0.346035,-0.480131,-0.638432,-0.238420,-0.371723,-0.266506,0.720798,-0.196218,-0.650979,-0.475358,-0.506042,0.672654,-0.368695,-0.191316,0.015183,-0.147292,0.008653,-0.275956,-0.114787,-0.227552,-0.160119,0.035556,-0.016678,-0.058753,-0.150326,-0.065618,-0.018790,-0.114727,-0.111390,-0.063124,0.006538,0.039179,-0.042315,-0.108234,-0.046616,0.055069,0.029809,-0.046890,-0.034855,-0.129810,0.011569,0.051555,0.067567,0.008258,0.104265,-0.090196,0.008584,0.004241,-0.035065,-0.079586,-0.043614,0.197611,-0.066696,0.012070,-0.017909,-0.084068,0.058986
63927,180.438408,-13.094857,0.412573,-0.265649,0.577157,-0.261206,-0.811060,-0.919617,0.377864,0.138909,0.088343,0.432462,-0.557569,1.044393,0.444634,0.150587,0.707015,0.235432,0.749725,0.355320,-0.240020,-0.224256,-0.283004,-0.154981,0.301620,-0.172372,-0.339936,-0.412389,-0.429822,0.688057,-0.347213,-0.164563,0.003498,-0.054972,0.051276,-0.063025,-0.010599,-0.217660,-0.095197,0.175298,0.000363,-0.000030,-0.073339,-0.016454,-0.020901,-0.027462,-0.004294,-0.118264,-0.010297,0.018758,0.003588,-0.100117,0.007330,0.107345,0.010003,-0.059084,-0.070466,-0.081514,-0.054371,0.087490,0.065572,0.059322,0.166721,-0.063915,-0.006499,0.032121,-0.086406,-0.141783,0.009698,0.246450,-0.049105,0.167784,0.044545,-0.027023,0.044475
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56457,272.423439,-11.397601,0.299051,1.049562,0.887998,-0.154112,-2.259923,-1.174667,0.939465,-0.324289,0.142413,-0.888163,-0.466841,0.914062,0.226891,0.969369,0.055987,0.235886,-0.334974,-0.548431,-0.665348,-0.257553,-0.384939,-0.269167,0.729279,-0.140257,-0.722307,-0.379208,0.056280,-0.097889,0.027030,0.163891,-0.118186,-0.153676,-0.642712,-0.230875,0.210449,-0.495717,-0.340380,-0.014402,-0.008586,-0.050982,-0.152503,-0.065527,-0.081110,-0.031524,-0.077445,-0.031281,0.080550,-0.029484,-0.057563,-0.133563,-0.067893,0.058199,0.012539,-0.022202,-0.038458,-0.087365,-0.064144,0.106497,0.048255,-0.001019,0.165281,0.004340,0.044619,0.014873,-0.085760,-0.069628,0.057621,0.149355,-0.067925,0.155274,0.021784,-0.076124,0.026221
85107,-380.465280,-23.444177,1.104279,1.295955,-0.199383,-1.560158,0.007369,0.281071,0.204463,-0.168330,-0.035494,-0.348671,0.288222,-0.674397,-0.276337,-0.562774,-0.797003,0.141040,0.098534,-0.194190,0.007139,0.075563,0.033263,-0.033686,0.436093,0.903622,-0.008902,-0.095733,0.048257,-0.077473,0.044971,0.213805,-0.135484,-0.122858,-0.574807,-0.037961,0.227216,-0.323514,-0.273558,0.031939,-0.050307,0.131515,-0.210701,0.075566,-0.013212,0.026382,-0.202430,0.061240,-0.029218,0.008849,-0.007954,-0.104001,-0.174207,0.246146,0.089714,-0.110091,-0.099578,-0.117778,0.055855,-0.183307,0.344112,0.086584,-0.461102,0.581721,-0.338702,-0.239570,-0.134775,0.220266,0.113473,0.109242,0.026682,0.104874,0.059077,0.133471,0.047502
606,809.330375,-1.488648,-0.363055,0.232749,-1.304640,-1.616957,-0.133677,0.103888,0.489555,-0.146001,-0.124768,-0.456670,0.412055,-0.507539,-0.065535,-0.367373,0.036662,0.113159,0.021222,-0.113348,0.092393,0.748955,-0.370062,0.017586,-0.303835,-0.159816,1.042052,-0.388768,0.079151,-0.129915,0.051711,0.218511,-0.161141,-0.164185,-0.598167,-0.052229,0.246792,-0.301275,-0.293132,-0.320089,0.004867,0.163151,-0.268619,0.100604,-0.047159,0.111273,-0.287029,0.028149,-0.178728,0.060060,0.037626,-0.044209,-0.180846,0.468412,1.585908,0.811831,0.496688,0.366944,0.106318,-0.073995,0.195277,-0.023088,-0.368545,0.209962,-0.553170,-0.127615,-0.141667,0.106703,0.182462,0.067878,-0.107040,0.126785,0.155378,0.051131,0.086904
56102,373.657521,573.614184,7.936077,-0.864480,3.430380,0.627452,0.906955,-1.130587,-0.006697,0.072143,0.457770,1.311895,0.721516,-0.216496,-1.143928,-0.488288,0.797072,0.756073,0.154506,-0.408371,0.108292,0.776406,-0.547547,-0.261748,0.582770,-0.235049,-0.100535,0.760732,0.085789,-0.038428,0.075302,0.208538,-0.111765,-0.114521,-0.548806,-0.180342,0.315016,-0.383090,-0.312783,-0.003132,0.062059,0.195384,-0.066369,0.138277,0.073050,0.005418,-0.259615,-0.119560,-0.101697,-0.077715,0.228720,0.052273,-0.056388,0.228059,1.216312,0.406900,0.325878,0.295633,0.027026,0.161872,0.251420,-0.505005,-0.299936,-0.205255,-0.005661,0.120301,-0.049197,-0.002847,-0.179501,0.054250,-0.127216,-0.100144,-0.194842,0.104562,-0.057194


In [1]:
X.isna().sum().sort_values(ascending=False).head(20)

NameError: name 'X' is not defined

In [75]:
# Save the preprocessor as a parquet file
X.to_parquet("preprocessor.parquet")


In [78]:
# pickle the preprocessor
import pickle
with open("preprocessor.pkl", "wb") as f:
    pickle.dump(preprocessor, f, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
import sys

for name, step in preprocessor.named_steps.items():
    print(name, type(step), sys.getsizeof(step))

initial_logger <class '__main__.ProgressLogger'> 48
delay_trend <class '__main__.DelayTrendEncoder'> 48
trend_logger <class '__main__.ProgressLogger'> 48
tail_reuse <class '__main__.SameDayTailReuseEncoder'> 48
reuse_logger <class '__main__.ProgressLogger'> 48
turnaround_delay <class '__main__.TurnaroundDelayEncoder'> 48
turnaround_logger <class '__main__.ProgressLogger'> 48
slack_time <class '__main__.SlackTimeEncoder'> 48
slack_logger <class '__main__.ProgressLogger'> 48
feature_prep <class 'sklearn.compose._column_transformer.ColumnTransformer'> 48
prep_logger <class '__main__.ProgressLogger'> 48
select <class 'sklearn.feature_selection._from_model.SelectFromModel'> 48
final_logger <class '__main__.ProgressLogger'> 48


# Models

## Decision Tree

In [47]:
from sklearn.metrics import make_scorer, precision_score, recall_score, precision_recall_curve
def precision_at_recall(y, y_pred, *, recall, **kwargs):
    """
    Calculate the precision at a given recall level.

    To use this with cross_val_score, you need to use the make_scorer function to
    create a scoring function that can be used with cross_val_score. For example:

        scorer = make_scorer(precision_at_recall, recall=0.9)  # 0.9 is the minimum recall level
        cross_val_score(model, X, y, cv=5, scoring=scorer)

    The default will only work for binary classification problems. You must change the
    average parameter if you want to use for multiclass classification. For example:

        scorer = make_scorer(precision_at_recall, recall=0.9, average="micro")
    """
    return precision_score(y, y_pred, **kwargs) if recall_score(y, y_pred, **kwargs) > recall else 0.0

def recall_at_precision(y, y_pred, *, precision, **kwargs):
    """
    Calculate the recall at a given precision level.

    To use this with cross_val_score, you need to use the make_scorer function to
    create a scoring function that can be used with cross_val_score. For example:

        scorer = make_scorer(recall_at_precision, precision=0.9)
        cross_val_score(model, X, y, cv=5, scoring=scorer)

    The default will only work for binary classification problems. You must change the
    average parameter if you want to use for multiclass classification. For example:

        scorer = make_scorer(recall_at_precision, precision=0.9, average="micro")
    """
    return recall_score(y, y_pred, **kwargs) if precision_score(y, y_pred, **kwargs) > precision else 0.0

In [65]:
X_sample = X.sample(n=100000, random_state=42)
y_train = y_train.sample(n=100000, random_state=42)

In [69]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report, recall_score, precision_score, f1_score

# Create a DecisionTreeClassifier
dt = DecisionTreeClassifier(
    random_state=42,
    max_depth=20,
    min_samples_leaf=10,
    class_weight='balanced'
)

scorer = make_scorer(precision_at_recall, recall=0.25, average='weighted')

scores = cross_val_score(dt, X_sample, y_train, cv=5, scoring=scorer, verbose=1, n_jobs=64)

# Fit the model and predict using cross-validation
y_pred_cv = cross_val_predict(dt, X_sample, y_train, cv=5, n_jobs=64)

# Calculate recall, precision, and F1 score
recall = recall_score(y_train, y_pred_cv, average='macro')
precision = precision_score(y_train, y_pred_cv, average='macro', zero_division=0)
f1 = f1_score(y_train, y_pred_cv, average='macro')

print(f"Recall: {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1 Score: {f1:.4f}")

# Cross-validated performance
print("Cross-Validation Classification Report:")
print(classification_report(y_train, y_pred_cv, zero_division=0))

print(f"Mean precision at recall 0.25: {scores.mean():.4f} ± {scores.std():.4f}")


[Parallel(n_jobs=64)]: Using backend LokyBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Done   5 out of   5 | elapsed:   40.4s finished


Recall: 0.0565
Precision: 0.0547
F1 Score: 0.0130
Cross-Validation Classification Report:
                                  precision    recall  f1-score   support

            Carrier Cancellation       0.00      0.09      0.01       424
      Major Delay - CarrierDelay       0.00      0.05      0.01       514
 Major Delay - LateAircraftDelay       0.01      0.09      0.01       649
          Major Delay - NASDelay       0.00      0.02      0.00        91
      Major Delay - WeatherDelay       0.00      0.02      0.00        96
     Medium Delay - CarrierDelay       0.02      0.03      0.02      1881
Medium Delay - LateAircraftDelay       0.02      0.02      0.02      2886
         Medium Delay - NASDelay       0.01      0.08      0.01       487
     Medium Delay - WeatherDelay       0.00      0.06      0.01       325
      Minor Delay - CarrierDelay       0.04      0.02      0.03      3539
 Minor Delay - LateAircraftDelay       0.03      0.02      0.02      4238
          Minor Delay

In [68]:
y_train.value_counts()

Flight_Status
On-Time                             78470
Minor Delay - LateAircraftDelay      4238
Unknown                              3977
Minor Delay - CarrierDelay           3539
Medium Delay - LateAircraftDelay     2886
Medium Delay - CarrierDelay          1881
Minor Delay - NASDelay               1060
Weather Cancellation                  781
Major Delay - LateAircraftDelay       649
Major Delay - CarrierDelay            514
Medium Delay - NASDelay               487
Carrier Cancellation                  424
Minor Delay - WeatherDelay            326
Medium Delay - WeatherDelay           325
NAS Cancellation                      205
Major Delay - WeatherDelay             96
Major Delay - NASDelay                 91
Security Issue                         51
Name: count, dtype: int64

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import cross_val_score, cross_val_predict

# Create Gradient Boosting classifier
model = GradientBoostingClassifier(
    n_estimators=10,    # Number of trees
    max_depth=3,         # Max tree depth
    learning_rate=0.1,   # Slightly lower learning rate for better stability
    subsample=0.8,       # Use 80% of samples for each tree (helps with large datasets)
    random_state=42,
    verbose=2            # Show progress
)

# Create scorer for precision at recall
scorer = make_scorer(precision_at_recall, recall=0.25, average='weighted')

# Run cross-validation with the Gradient Boosting model
scores = cross_val_score(model, X_sample, y_train, cv=5, scoring=scorer, verbose=1, n_jobs=16)

# Fit the model and predict using cross-validation
y_pred_cv = cross_val_predict(model, X_sample, y_train, cv=5, n_jobs=16)

# Calculate metrics
recall = recall_score(y_train, y_pred_cv, average='macro')
precision = precision_score(y_train, y_pred_cv, average='macro', zero_division=0)
f1 = f1_score(y_train, y_pred_cv, average='macro')

print(f"Recall: {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1 Score: {f1:.4f}")

# Print classification report
print("Cross-Validation Classification Report:")
print(classification_report(y_train, y_pred_cv, zero_division=0))
print(f"Mean precision at recall 0.25: {scores.mean():.4f} ± {scores.std():.4f}")


[Parallel(n_jobs=16)]: Using backend LokyBackend with 16 concurrent workers.


In [8]:
classes = np.unique(y_train)
y_train_bin = label_binarize(y_train, classes=classes)

y_scores = cross_val_predict(dt_clf, X, y_train, cv=cv, method='predict_proba')

plt.figure(figsize=(10, 6))
for i, class_name in enumerate(classes):
    precision, recall, _ = precision_recall_curve(y_train_bin[:, i], y_scores[:, i])
    plt.plot(recall, precision, lw=2, label=f"{class_name}")

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curves (One-vs-Rest)")
plt.legend(loc="best", fontsize='small')
plt.grid(True)
plt.tight_layout()
plt.show()

plt.figure(figsize=(10, 6))
for i, class_name in enumerate(classes):
    fpr, tpr, _ = roc_curve(y_train_bin[:, i], y_scores[:, i])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, lw=2, label=f"{class_name} (AUC = {roc_auc:.2f})")

plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curves (One-vs-Rest)")
plt.legend(loc="lower right", fontsize='small')
plt.grid(True)
plt.tight_layout()
plt.show()

KeyboardInterrupt: 

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report
)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

log_reg = LogisticRegression(
    class_weight='balanced',
    solver='lbfgs',
    max_iter=2000,  
    random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
y_pred_cv_log_reg = cross_val_predict(log_reg, X_scaled, y_train, cv=cv)

log_reg.fit(X_scaled, y_train)
y_pred_train_log_reg = log_reg.predict(X_scaled)

print("Performance on Training Set:")
print(f"Accuracy: {accuracy_score(y_train, y_pred_train_log_reg):.4f}")
print(f"Precision: {precision_score(y_train, y_pred_train_log_reg, average='macro', zero_division=0):.4f}")
print(f"Recall: {recall_score(y_train, y_pred_train_log_reg, average='macro'):.4f}")
print(f"F1 Score: {f1_score(y_train, y_pred_train_log_reg, average='macro'):.4f}\n")

print("Cross-Validated Performance:")
print(f"CV Accuracy: {accuracy_score(y_train, y_pred_cv_log_reg):.4f}")
print(f"CV Precision: {precision_score(y_train, y_pred_cv_log_reg, average='macro', zero_division=0):.4f}")
print(f"CV Recall: {recall_score(y_train, y_pred_cv_log_reg, average='macro'):.4f}")
print(f"CV F1 Score: {f1_score(y_train, y_pred_cv_log_reg, average='macro'):.4f}\n")

print("Cross-Validation Classification Report:")
print(classification_report(y_train, y_pred_cv_log_reg, zero_division=0))

KeyboardInterrupt: 

In [None]:
# use MLP classifier on the pipeline

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, precision_score, f1_score

X_test = test_data.drop('Flight_Status', axis=1)
y_test = test_data['Flight_Status']

# Your X_prepped is already available from previous preprocessing
# Now we'll create and train the MLP classifier optimized for precision

from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, precision_score

# Create the MLP classifier with precision-focused parameters
mlp = MLPClassifier(
    hidden_layer_sizes=(100,), 
    random_state=42            
)

# Train the model using your preprocessed X data
mlp.fit(X_prepped, y)

# For testing, you'll need to apply the same preprocessing to X_test
# If you haven't already done this, you would typically do:
# X_test_prepped = preprocessor.transform(X_test)
# But since you didn't specify this step, I'll assume you need to do it:

# Make predictions (using whatever preprocessed test data you have)
# If X_test is already preprocessed:
y_pred = mlp.predict(X_test)
# If you need to preprocess X_test:
# y_pred = mlp.predict(X_test_prepped)

# Evaluate with focus on precision
print("Precision score:", precision_score(y_test, y_pred, average='macro'))
print("F1 score:", f1_score(y_test, y_pred, average='macro'))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''