# Data Pipeline

### Our pipeline process:
#### Extract Features:
- `Airport_Pair` which will be used for `Delay_Trend_Past_Week` to see if there was a Delay trend in the previous week
- `Same_Day_Tail_Reuse`which will be using **ONLY** the date from `dep_datetime`, as well as using `Tail_Number`
- `Previous_Flight_Delay` which will be using information from `Tail_Number`, `dep_date` which will derive from `dep_datetime`, and `CRSDepTime`
- `Turnaround_Time` will also be extracted from `CRSDepTime` subtracted by `Previous_Arrival_Time`(Which is derived from `Tail_Number`, `dep_date`, and `CRSArrTime`)
- `Slack_Time` that derives from `CRSArrTime` subtracted by `CRSElaspedTime`

#### Outlier
- `Aircraft_Age` outlier must be removed 

#### Transformations: 
- `log_distance` using the log transformation 

#### Drop 
- is_Holidays

Because so much data is missing, it is reasonable to remove 
Originally_Scheduled_Code_Share_Airline
DOT_ID_Originally_Scheduled_Code_Share_Airline
IATA_Code_Originally_Scheduled_Code_Share_Airline
Flight_Num_Originally_Scheduled_Code_Share_Airline
Unnamed: 119
precipitation_probability
precipitation_probability_dest_dep_time

# also not gonna need
'Origin_Lon', 'Origin_Lat',
'Dest_Lat', 'Dest_Lon', 'Duplicate', 'Origin', 'Dest', 'OriginAirportSeqID', 'OriginStateName', 'DestAirportSeqID', 'DestStateName', 'DestWac', 'OriginWac','Operated_or_Branded_Code_Share_Partners',
'OriginCityName', 'DestCityName', 'DestState', 'OriginState', 'DepTimeBlk', 'ArrTimeBlk', 'DistanceGroup', 'Aircraft_Airline', 'Aircraft_ModelCode', 'Aircraft_Type',
'IATA_Code_Marketing_Airline', 'IATA_Code_Operating_Airline', 'latitude_dest_dep_time', 'longitude_dest_dep_time', 'longitude', 'latitude', 'Marketing_Airline_Network', 
'DOT_ID_Marketing_Airline', 'Flight_Number_Marketing_Airline', 'Operating_Airline ', dest_dep_datetime, 'Flights'

because they are redundant

In [1]:
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pylab as plt
import seaborn as sns
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer, StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
import time
from sklearn.model_selection import cross_val_score

from sklearn import set_config
set_config(transform_output = "pandas")
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.metrics import classification_report


In [2]:
def stratified_random_split(df: pd.DataFrame, target_column: str, test_size: float = 0.1, random_state: int = 42):
    """
    Performs a stratified random train-test split to ensure all classes in 
    'Flight_Status' are proportionally represented in both sets.

    Parameters:
    df (pd.DataFrame): The dataset containing the target variable.
    target_column (str): The column representing the classification target.
    test_size (float): The proportion of data to be used as test data.
    random_state (int): Random seed for reproducibility.

    Returns:
    tuple: (train_df, test_df) DataFrames.
    """
    train_df, test_df = train_test_split(
        df, test_size=test_size, stratify=df[target_column], random_state=random_state
    )

    print(f"Train size: {len(train_df)} samples")
    print(f"Test size: {len(test_df)} samples")

    return train_df, test_df



In [3]:
# Setting up the DataFrame for the flight data
flight_data = pd.read_parquet("WEATHER121.parquet")


In [4]:
# Stratify the data
train_data, test_data= stratified_random_split(flight_data, target_column="Flight_Status")

Train size: 13184010 samples
Test size: 1464890 samples


In [5]:
#downsample to 100000

#train_data = train_data.sample(n=100000, random_state=42)

In [6]:
# Splitting the data into X and y
X_train = train_data.drop(columns=["Flight_Status"])
y_train = train_data["Flight_Status"]

### Pipeline Functions

In [7]:
# Custom progress logger
class ProgressLogger(BaseEstimator, TransformerMixin):
    """
    A transformer that logs progress through a pipeline.
    """
    
    def __init__(self, total_rows, log_interval=0.01, name='Pipeline'):
        self.total_rows = total_rows
        self.log_interval = log_interval
        self.name = name
        self.start_time = None
        self.last_log_percent = -1
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        # Initialize timer if not started
        if self.start_time is None:
            self.start_time = time.time()
            print(f"{self.name} processing started on {self.total_rows:,} rows")
        
        # Calculate current progress
        current_rows = X.shape[0]
        percent_complete = current_rows / self.total_rows
        
        # Check if we need to log progress
        int_percent = int(percent_complete / self.log_interval)
        if int_percent > self.last_log_percent:
            self.last_log_percent = int_percent
            elapsed = time.time() - self.start_time
            
            # Estimate time remaining
            percent_done = percent_complete * 100
            if percent_complete > 0:
                total_est = elapsed / percent_complete
                remaining = total_est - elapsed
                time_str = f" - Est. remaining: {remaining:.1f}s"
            else:
                time_str = ""
                
            print(f"{self.name}: {percent_done:.1f}% complete ({current_rows:,}/{self.total_rows:,} rows){time_str}")
        
        return X

class DelayTrendEncoder(BaseEstimator, TransformerMixin):
    """
    Transformer that precomputes a route-based delay trend using week_of_year and top N airports.

    Parameters:
        date_col (str): Name of the datetime column
        origin_col (str): Origin airport column
        dest_col (str): Destination airport column
        status_col (str): Flight status column for delay signal
        output_col (str): Name of the new delay trend feature
        top_n_airports (int): Number of top airports to retain by frequency
    """
    def __init__(self,
                 date_col='dep_datetime',
                 origin_col='OriginAirportID',
                 dest_col='DestAirportID',
                 status_col='Flight_Status',
                 output_col='Delay_Trend',
                 top_n_airports=15):
        self.date_col = date_col
        self.origin_col = origin_col
        self.dest_col = dest_col
        self.status_col = status_col
        self.output_col = output_col
        self.top_n_airports = top_n_airports

    def fit(self, X, y=None):
        X_temp = X.copy()
        X_temp[self.date_col] = pd.to_datetime(X_temp[self.date_col])
        X_temp['week_of_year'] = X_temp[self.date_col].dt.isocalendar().week

        X_temp['Flight_Status'] = y.reset_index(drop=True)
        
        # Restrict to top N airports
        all_airports = pd.concat([X_temp[self.origin_col], X_temp[self.dest_col]])
        top_airports = all_airports.value_counts().head(self.top_n_airports).index
        X_temp = X_temp[
            X_temp[self.origin_col].isin(top_airports) &
            X_temp[self.dest_col].isin(top_airports)
        ]

        # Create delay signal
        delay_reasons = [
            'CarrierDelay', 'WeatherDelay', 'NASDelay',
            'SecurityDelay', 'LateAircraftDelay'
        ]
        X_temp['delay_signal'] = 0
        mask = X_temp['Flight_Status'].str.contains('Delay', na=False)
        delay_type = X_temp.loc[mask, 'Flight_Status'].str.split(' - ').str[-1]
        valid = delay_type.isin(delay_reasons)
        X_temp.loc[mask[mask].index[valid], 'delay_signal'] = 1

        # Group by week and airport pair
        self.trend_lookup_ = (
            X_temp.groupby(['week_of_year', self.origin_col, self.dest_col])['delay_signal']
            .mean()
            .reset_index()
            .rename(columns={'delay_signal': self.output_col})
        )

        return self

    def transform(self, X):
        X = X.copy()
        X[self.date_col] = pd.to_datetime(X[self.date_col])
        X['week_of_year'] = X[self.date_col].dt.isocalendar().week

        X = X.merge(
            self.trend_lookup_,
            on=['week_of_year', self.origin_col, self.dest_col],
            how='left'
        )

        X[self.output_col] = X[self.output_col].fillna(0)
        return X.drop(columns=['week_of_year'])

class SameDayTailReuseEncoder(BaseEstimator, TransformerMixin):
    """
    Transformer that creates a feature counting how many times
    the same tail number (aircraft) is used on the same day.

    Parameters:
        datetime_col (str): Column containing full departure datetime
        tail_col (str): Column name for tail number
        output_col (str): Name of the output column
    """

    def __init__(self,
                 datetime_col='dep_datetime',
                 tail_col='Tail_Number',
                 output_col='Same_Day_Tail_Reuse'):
        self.datetime_col = datetime_col
        self.tail_col = tail_col
        self.output_col = output_col

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()

        # Ensure datetime is parsed correctly
        X[self.datetime_col] = pd.to_datetime(X[self.datetime_col])

        # Extract just the date portion
        X['dep_date'] = X[self.datetime_col].dt.date

        # Count reuse of same tail number per day
        X[self.output_col] = (
            X.groupby([self.tail_col, 'dep_date'])[self.tail_col]
            .transform('count')
            .astype(float)  
        )

        return X.drop(columns=['dep_date'])

    
class TurnaroundDelayEncoder(BaseEstimator, TransformerMixin):
    """
    Transformer that computes:
    - Previous_Flight_Delay: scheduled departure time of previous flight with same tail number on same day
    - Turnaround_Time: time between previous arrival and current departure

    Parameters:
        datetime_col (str): Column with full departure datetime
        dep_time_col (str): Column with scheduled departure time (e.g. CRSDepTime)
        arr_time_col (str): Column with scheduled arrival time (e.g. CRSArrTime)
        tail_col (str): Tail number column
        output_prefix (str): Prefix to use for new feature columns
    """

    def __init__(self,
                 datetime_col='dep_datetime',
                 dep_time_col='CRSDepTime',
                 arr_time_col='CRSArrTime',
                 tail_col='Tail_Number',
                 output_prefix=''):
        self.datetime_col = datetime_col
        self.dep_time_col = dep_time_col
        self.arr_time_col = arr_time_col
        self.tail_col = tail_col
        self.output_prefix = output_prefix

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X[self.datetime_col] = pd.to_datetime(X[self.datetime_col])
        X['dep_date'] = X[self.datetime_col].dt.date

        # Sort the values
        X = X.sort_values(by=[self.tail_col, 'dep_date', self.dep_time_col])

        prev_delay_col = self.output_prefix + 'Previous_Flight_Delay'
        turnaround_col = self.output_prefix + 'Turnaround_Time'

        # Previous scheduled departure
        X[prev_delay_col] = (
            X.groupby([self.tail_col, 'dep_date'])[self.dep_time_col]
            .shift(1)
        )

        # Previous scheduled arrival time
        X['Previous_Arrival_Time'] = (
            X.groupby([self.tail_col, 'dep_date'])[self.arr_time_col]
            .shift(1)
        )

        # Compute turnaround time using a defined method
        X[turnaround_col] = self._calculate_turnaround(X[self.dep_time_col], X['Previous_Arrival_Time'])

        # Replace missing with 0
        X[prev_delay_col] = X[prev_delay_col].fillna(0)
        X[turnaround_col] = X[turnaround_col].fillna(0)

        return X.drop(columns=['dep_date', 'Previous_Arrival_Time'])

    def _calculate_turnaround(self, current_dep, previous_arr):
        """
        Applies time difference logic with wrap-around at midnight (2400).
        """
        diff = current_dep - previous_arr
        adjusted = diff.mask((~diff.isna()) & (diff < 0), diff + 2400)
        return adjusted

class SlackTimeEncoder(BaseEstimator, TransformerMixin):
    """
    Transformer that calculates slack time as the difference between scheduled
    arrival time (CRSArrTime) and scheduled elapsed time (CRSElapsedTime).

    Parameters:
        arr_col (str): Column name for scheduled arrival time.
        elapsed_col (str): Column name for scheduled elapsed time.
        output_col (str): Name of the output column to store slack time.
    """

    def __init__(self,
                 arr_col='CRSArrTime',
                 elapsed_col='CRSElapsedTime',
                 output_col='Slack_Time'):
        self.arr_col = arr_col
        self.elapsed_col = elapsed_col
        self.output_col = output_col

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X[self.output_col] = X[self.arr_col] - X[self.elapsed_col]
        return X
    
    

class AgeClipper(BaseEstimator, TransformerMixin):
    """
    Transformer that caps aircraft age at a maximum value.
    
    Parameters:
    -----------
    max_age : float, default=56.9
        Maximum value for aircraft age
    column_name : str, default='Aircraft_Age'
        Name of the column to apply the cap to
    """
    
    def __init__(self, max_age=56.9, column_name='Aircraft_Age'):
        self.max_age = max_age
        self.column_name = column_name
    
    def fit(self, X, y=None):
        # Nothing to fit, just return self
        return self
    
    def transform(self, X):
        # Convert to DataFrame if it's not already
        X_transformed = X.copy() if isinstance(X, pd.DataFrame) else pd.DataFrame(X)
        
        # Apply the cap only if the column exists
        if self.column_name in X_transformed.columns:
            X_transformed[self.column_name] = np.minimum(X_transformed[self.column_name], self.max_age)
        
        # Return array if input was array
        if not isinstance(X, pd.DataFrame):
            X_transformed = X_transformed.values
            
        return X_transformed

### Pipeline Model

In [16]:
# import random forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.decomposition import PCA


numerical_features = [
    'Distance', 'relative_humidity_2m', 'temperature_2m',
    'Aircraft_Age', 'temperature_2m_dest_dep_time',
    'snowfall', 'soil_moisture_0_to_1cm',
    'snowfall_dest_dep_time', 
    'et0_fao_evapotranspiration_dest_dep_time',
    'surface_pressure_dest_dep_time', 'pressure_msl_dest_dep_time',
    'wind_direction_10m_dest_dep_time', 'wind_gusts_10m_dest_dep_time', 'CRSElapsedTime', 'Slack_Time'
]

categorical_features = [
    'DayOfWeek', 'Month',
    'OriginAirportID',
    'DestAirportID', 'Is_Holiday_Week', 'Operating_Airline '
]

drop_features = [
            'Aircraft_Model', 'Aircraft_EngineType', 'Holiday',
            'dep_datetime', 'Tail_Number', 'temperature_120m_dest_dep_time', 
            'temperature_180m_dest_dep_time', 'temperature_80m_dest_dep_time', 'temperature_180m', 
            'temperature_120m', 'temperature_80m', 'wind_speed_80m', 'wind_speed_120m',
            'wind_speed_180m', 'wind_direction_80m', 'wind_direction_120m', 'wind_direction_180m',
            'wind_speed_80m_dest_dep_time', 'wind_speed_120m_dest_dep_time', 'wind_speed_180m_dest_dep_time',
            'wind_direction_80m_dest_dep_time', 'wind_direction_120m_dest_dep_time', 'wind_direction_180m_dest_dep_time', 
            'soil_temperature_0cm_dest_dep_time', 'soil_temperature_0cm', 'rain', 'rain_dest_dep_time', 'dew_point_2m_dest_dep_time', 
            'cloud_cover_mid', 'cloud_cover', 'cloud_cover_mid_dest_dep_time', 'cloud_cover_dest_dep_time', 'dew_point_2m', 
            'wind_speed_10m', 'wind_speed_10m_dest_dep_time', 'visibility_dest_dep_time', 'visibility', 'apparent_temperature', 
            'apparent_temperature_dest_dep_time', 'vapour_pressure_deficit', 'vapour_pressure_deficit_dest_dep_time', 'Flights',
            'DOT_ID_Operating_Airline', 'Flight_Number_Operating_Airline', 'Aircraft_Engines', 'Aircraft_Seats', 'Is_Freighter',
            'relative_humidity_2m_dest_dep_time', 'precipitation', 'showers', 'snow_depth', 'et0_fao_evapotranspiration',
            'evapotranspiration', 'cloud_cover_high', 'cloud_cover_low', 'surface_pressure', 'weather_code', 'pressure_msl', 
            'wind_direction_10m', 'wind_gusts_10m', 'precipitation_dest_dep_time', 'showers_dest_dep_time', 'snow_depth_dest_dep_time',
            'soil_moisture_0_to_1cm_dest_dep_time', 'evapotranspiration_dest_dep_time', 'cloud_cover_high_dest_dep_time',
            'cloud_cover_low_dest_dep_time', 'weather_code_dest_dep_time', 'Operated_or_Branded_Code_Share_Partners', 'Year',
            'Quarter', 'Marketing_Airline_Network', 'DayofMonth', 'DOT_ID_Marketing_Airline', 'Flight_Number_Marketing_Airline', 
            'OriginAirportSeqID', 'OriginCityMarketID', 'Origin', 'OriginCityName', 'OriginState', 'OriginStateFips', 
            'OriginStateName', 'OriginWac', 'DestAirportSeqID', 'DestCityMarketID', 'Dest', 'DestCityName', 'DestState', 'DestStateFips', 
            'DestStateName', 'DestWac', 'CRSDepTime', 'DepTimeBlk', 'CRSArrTime', 'ArrTimeBlk', 'DistanceGroup', 'Duplicate', 
            'Aircraft_Airline', 'Aircraft_ModelCode', 'Aircraft_Type'
        ]

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('age_clipper', AgeClipper(max_age=56.9, column_name='Aircraft_Age')),
    ('log1p_distance', ColumnTransformer([
        ('log', FunctionTransformer(np.log1p), ['Distance', 'wind_gusts_10m_dest_dep_time'])
    ], remainder='passthrough', verbose_feature_names_out=False)), 
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('OHE', ColumnTransformer([
        ('top50', OneHotEncoder(handle_unknown='ignore', sparse_output=False, max_categories=51), ['OriginAirportID', 'DestAirportID']),
        ('OHE' , OneHotEncoder(handle_unknown='ignore', sparse_output=False), ['DayOfWeek', 'Month', 'Operating_Airline '])
    ]))
])

preprocessor = Pipeline([
    ('initial_logger', ProgressLogger(total_rows=13000000, name='Starting Pipeline')),
    
    ('delay_trend', DelayTrendEncoder(
        date_col='dep_datetime',
        origin_col='OriginAirportID',
        dest_col='DestAirportID',
        status_col='Flight_Status',
        output_col='Delay_Trend',
        top_n_airports=15
    )),
    
    ('trend_logger', ProgressLogger(total_rows=13000000, name='Delay Trend Complete')),
    
    ('tail_reuse', SameDayTailReuseEncoder(
        datetime_col='dep_datetime',
        tail_col='Tail_Number',
        output_col='Same_Day_Tail_Reuse'
    )),
    
    ('reuse_logger', ProgressLogger(total_rows=13000000, name='Tail Reuse Complete')),
    
    ('turnaround_delay', TurnaroundDelayEncoder(
        datetime_col='dep_datetime',
        dep_time_col='CRSDepTime',
        arr_time_col='CRSArrTime',
        tail_col='Tail_Number',
        output_prefix=''
    )),
    
    ('turnaround_logger', ProgressLogger(total_rows=13000000, name='Turnaround Delay Complete')),
    
    ('slack_time', SlackTimeEncoder(
        arr_col='CRSArrTime',
        elapsed_col='CRSElapsedTime',
        output_col='Slack_Time'
    )),
    
    ('slack_logger', ProgressLogger(total_rows=13000000, name='Slack Time Complete')),
    
    ('feature_prep', ColumnTransformer([
        ('numeric', numeric_transformer, numerical_features),
        ('categorical', categorical_transformer, categorical_features),
        ('drop', 'drop', drop_features)
    ], remainder='passthrough', verbose_feature_names_out=False, n_jobs=16)),
    
    ('prep_logger', ProgressLogger(total_rows=13000000, name='Feature Prep Complete')),
    ('select', PCA(
        n_components=50,
        random_state=42,
    )),
    ('final_logger', ProgressLogger(total_rows=13000000, name='Pipeline Complete'))
])
# so that I dont forget, dep_datetime is dropped because of redundancy

preprocessor

In [17]:
# Train the pipeline in this cell and run it on X, this cell must end with the transformed data
X = preprocessor.fit_transform(X_train, y_train)

pd.set_option('display.max_columns', None)
X

Starting Pipeline processing started on 13,000,000 rows
Starting Pipeline: 101.4% complete (13,184,010/13,000,000 rows) - Est. remaining: -0.0s
Delay Trend Complete processing started on 13,000,000 rows
Delay Trend Complete: 101.4% complete (13,184,010/13,000,000 rows) - Est. remaining: -0.0s
Tail Reuse Complete processing started on 13,000,000 rows
Tail Reuse Complete: 101.4% complete (13,184,010/13,000,000 rows) - Est. remaining: -0.0s
Turnaround Delay Complete processing started on 13,000,000 rows
Turnaround Delay Complete: 101.4% complete (13,184,010/13,000,000 rows) - Est. remaining: -0.0s
Slack Time Complete processing started on 13,000,000 rows
Slack Time Complete: 101.4% complete (13,184,010/13,000,000 rows) - Est. remaining: -0.0s
Feature Prep Complete processing started on 13,000,000 rows
Feature Prep Complete: 101.4% complete (13,184,010/13,000,000 rows) - Est. remaining: -0.0s
Pipeline Complete processing started on 13,000,000 rows
Pipeline Complete: 101.4% complete (13,184

Unnamed: 0,pca0,pca1,pca2,pca3,pca4,pca5,pca6,pca7,pca8,pca9,pca10,pca11,pca12,pca13,pca14,pca15,pca16,pca17,pca18,pca19,pca20,pca21,pca22,pca23,pca24,pca25,pca26,pca27,pca28,pca29,pca30,pca31,pca32,pca33,pca34,pca35,pca36,pca37,pca38,pca39,pca40,pca41,pca42,pca43,pca44,pca45,pca46,pca47,pca48,pca49
237129,-854.833838,-22.350535,-2.076370,-0.440297,0.167337,0.572813,-1.737709,-1.939359,0.598997,-0.055869,-0.137670,-0.548254,-0.542312,-0.298612,-2.088348,0.848928,-0.670065,0.047236,-0.212079,-0.128787,-0.051648,-0.342953,0.748292,-0.345412,0.041730,-0.132633,0.142040,0.022152,-0.059687,-0.016585,-0.066402,-0.026409,0.022660,-0.050261,0.101513,0.066042,-0.130050,-0.550974,-0.570871,-0.209702,-0.269762,-0.001562,0.144834,-0.300568,-0.045557,-0.039284,0.002726,-0.013357,-0.034872,0.008718
4084368,-854.834141,-22.338888,-1.070168,0.928008,0.109072,-0.652765,1.616179,0.338744,0.239957,-0.415272,-0.021180,-0.514327,1.026836,0.051468,-0.161360,-0.747006,0.014806,-0.861882,0.250880,-0.119239,-0.133633,-0.564447,-0.641994,-0.424654,0.075837,0.196848,0.176961,-0.352887,-0.314717,0.040957,-0.038851,0.017130,-0.036814,-0.053807,0.053965,0.037875,0.115357,-0.299556,0.404938,0.723895,-0.138716,0.011560,0.018870,-0.168647,-0.030812,0.037567,0.036142,-0.042263,-0.116985,-0.060923
1029548,147.713096,-77.108846,-2.508873,-0.219041,0.183271,0.052362,-1.468815,-0.891923,0.671056,-0.296954,-0.029719,-1.096354,-0.770653,0.639488,-0.545767,-0.418539,-0.463200,0.530307,-0.035674,-0.098353,-0.162390,-0.570236,-0.655938,-0.407612,0.078903,0.174136,0.140763,-0.348507,-0.346499,0.049225,-0.071596,-0.003263,-0.006059,-0.044298,0.045863,0.068025,0.124905,-0.333909,0.390048,0.723567,-0.149370,-0.045389,0.032171,-0.168937,-0.031922,-0.040047,0.011496,-0.131700,-0.011625,-0.015509
11163145,-854.831051,-22.315751,0.932171,1.247213,0.264628,-0.378429,1.092836,0.747042,-0.019225,-0.376363,-0.005954,-0.375436,-0.156368,-0.354746,0.050448,-0.307444,-0.060815,-0.807921,0.260014,-0.296096,-0.442353,0.695594,-0.105101,-0.350372,0.069165,0.235374,0.162357,-0.368976,-0.373376,0.025128,-0.027154,-0.014787,-0.057837,-0.081301,0.079118,0.050852,0.108255,-0.309205,0.403263,0.745437,-0.141418,0.018153,0.017904,-0.187889,-0.014081,0.025421,0.022275,-0.030729,-0.126334,-0.055377
12079249,-119.476685,-38.252221,-0.403234,0.088713,0.273856,-0.885572,-0.906457,0.789604,0.586895,-0.564328,0.173975,-1.293512,-1.045807,0.493408,-0.800316,-0.377957,-0.731336,0.488200,-0.055334,-0.272335,-0.448231,0.696689,-0.107188,-0.339161,0.062435,0.195067,0.137958,-0.358201,-0.377980,0.034737,-0.053403,-0.025451,-0.045867,-0.086990,0.071079,0.072455,0.108627,-0.313937,0.391182,0.760339,-0.142154,-0.032643,0.032447,-0.195417,0.006769,-0.028557,0.003164,-0.126971,-0.017519,-0.006554
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4840623,1446.920810,1804.960411,16.114719,2.422619,1.610649,-0.672875,0.969258,0.592852,0.183474,-0.746838,0.094900,-1.218365,0.083777,-0.472712,-0.095805,-1.852203,1.153189,0.728729,-0.041103,-0.053404,0.011857,-0.055555,0.029238,0.410233,-0.748385,0.465689,0.282304,-0.387809,-0.183253,-0.041832,-0.231200,0.201801,0.159446,-0.228248,-0.209145,0.538293,0.012855,0.290036,0.311480,-0.237649,-0.298210,-0.055947,-0.012867,-0.086320,-0.094366,0.080918,-0.002684,-0.034766,0.057815,0.020743
3552614,1481.735235,1815.057061,15.927770,1.738535,1.102894,-0.766987,-1.048087,-0.406609,0.001147,0.025129,-0.039639,0.033922,-1.720980,0.440557,-1.980516,-0.355618,-1.014928,0.132163,0.074193,-0.259497,0.047569,-0.056746,0.000133,0.344437,-0.708481,0.774321,0.353554,-0.731177,-0.290823,-0.058627,-0.207307,0.196359,0.167049,-0.224005,-0.235772,0.591216,0.245028,0.328596,0.390303,-0.269474,-0.090319,-0.014113,-0.042403,-0.161968,0.010416,0.025028,-0.126482,0.003035,0.014830,0.113438
3919427,1548.791783,1824.509533,15.677436,1.876378,1.746183,-0.700656,-1.643239,-0.711924,0.211755,-0.202342,0.123060,-0.448522,-1.297856,0.692134,-1.638968,-0.015735,-1.040784,0.249929,-0.238392,0.003897,-0.026105,-0.040807,0.009947,0.485450,-0.788205,0.007574,0.209793,0.100577,0.097209,-0.039116,-0.231329,0.152183,0.178976,-0.158408,-0.167657,0.561432,-0.150555,0.334187,0.428725,-0.305908,-0.206265,-0.031733,0.096635,0.175397,0.029542,0.611845,0.371691,0.308198,-0.090158,0.175399
592306,1570.035200,1838.599017,15.483712,1.772993,1.485489,-0.392773,-1.326895,-0.203323,-0.108285,-0.113250,-0.141969,-0.310526,-3.409197,2.777021,1.569630,-0.193509,-0.997710,-0.807068,0.165139,-0.074919,-0.013840,-0.052491,0.027435,0.422841,-0.754428,0.257316,0.220303,-0.386106,-0.367061,-0.057872,-0.231652,0.158008,0.179158,-0.163109,-0.198327,0.584778,0.052776,0.258012,0.453432,-0.352048,-0.233304,0.053192,0.028380,-0.193773,-0.071399,-0.015947,-0.155446,0.019068,-0.020207,0.039576


In [19]:
# Save the preprocessor as a parquet file
X.to_parquet("preprocessor.parquet")


In [20]:
# pickle the preprocessor
import pickle
with open("preprocessor.pkl", "wb") as f:
    pickle.dump(preprocessor, f, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
import sys

for name, step in preprocessor.named_steps.items():
    print(name, type(step), sys.getsizeof(step))

initial_logger <class '__main__.ProgressLogger'> 48
delay_trend <class '__main__.DelayTrendEncoder'> 48
trend_logger <class '__main__.ProgressLogger'> 48
tail_reuse <class '__main__.SameDayTailReuseEncoder'> 48
reuse_logger <class '__main__.ProgressLogger'> 48
turnaround_delay <class '__main__.TurnaroundDelayEncoder'> 48
turnaround_logger <class '__main__.ProgressLogger'> 48
slack_time <class '__main__.SlackTimeEncoder'> 48
slack_logger <class '__main__.ProgressLogger'> 48
feature_prep <class 'sklearn.compose._column_transformer.ColumnTransformer'> 48
prep_logger <class '__main__.ProgressLogger'> 48
select <class 'sklearn.feature_selection._from_model.SelectFromModel'> 48
final_logger <class '__main__.ProgressLogger'> 48


# Models

In [21]:
from sklearn.metrics import make_scorer, precision_score, recall_score, precision_recall_curve
def precision_at_recall(y, y_pred, *, recall, **kwargs):
    """
    Calculate the precision at a given recall level.

    To use this with cross_val_score, you need to use the make_scorer function to
    create a scoring function that can be used with cross_val_score. For example:

        scorer = make_scorer(precision_at_recall, recall=0.9)  # 0.9 is the minimum recall level
        cross_val_score(model, X, y, cv=5, scoring=scorer)

    The default will only work for binary classification problems. You must change the
    average parameter if you want to use for multiclass classification. For example:

        scorer = make_scorer(precision_at_recall, recall=0.9, average="micro")
    """
    return precision_score(y, y_pred, **kwargs) if recall_score(y, y_pred, **kwargs) > recall else 0.0

def recall_at_precision(y, y_pred, *, precision, **kwargs):
    """
    Calculate the recall at a given precision level.

    To use this with cross_val_score, you need to use the make_scorer function to
    create a scoring function that can be used with cross_val_score. For example:

        scorer = make_scorer(recall_at_precision, precision=0.9)
        cross_val_score(model, X, y, cv=5, scoring=scorer)

    The default will only work for binary classification problems. You must change the
    average parameter if you want to use for multiclass classification. For example:

        scorer = make_scorer(recall_at_precision, precision=0.9, average="micro")
    """
    return recall_score(y, y_pred, **kwargs) if precision_score(y, y_pred, **kwargs) > precision else 0.0

In [35]:
X_sample = X.sample(n=100000, random_state=42)
y_train = y_train.sample(n=100000, random_state=42)

## Decision Tree

In [29]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report, recall_score, precision_score, f1_score

dt = DecisionTreeClassifier(
    random_state=42,
    max_depth=20,
    min_samples_leaf=10,
    class_weight='balanced'
)

scorer = make_scorer(precision_at_recall, recall=0.2, average='weighted')

scores = cross_val_score(dt, X_sample, y_train, cv=5, scoring=scorer, verbose=1, n_jobs=64)

# Fit the model and predict using cross-validation
y_pred_cv = cross_val_predict(dt, X_sample, y_train, cv=5, n_jobs=64)

# Calculate recall, precision, and F1 score
recall = recall_score(y_train, y_pred_cv, average='macro')
precision = precision_score(y_train, y_pred_cv, average='macro', zero_division=0)
f1 = f1_score(y_train, y_pred_cv, average='macro')

print(f"Recall: {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1 Score: {f1:.4f}")

# Cross-validated performance
print("Cross-Validation Classification Report:")
print(classification_report(y_train, y_pred_cv, zero_division=0))

print(f"Mean precision at recall 0.2: {scores.mean():.4f} ± {scores.std():.4f}")


[Parallel(n_jobs=64)]: Using backend LokyBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Done   5 out of   5 | elapsed:   19.0s finished


Recall: 0.0520
Precision: 0.0549
F1 Score: 0.0131
Cross-Validation Classification Report:
                                  precision    recall  f1-score   support

            Carrier Cancellation       0.00      0.14      0.01       424
      Major Delay - CarrierDelay       0.01      0.11      0.01       514
 Major Delay - LateAircraftDelay       0.01      0.12      0.01       649
          Major Delay - NASDelay       0.00      0.01      0.00        91
      Major Delay - WeatherDelay       0.00      0.06      0.00        96
     Medium Delay - CarrierDelay       0.02      0.06      0.03      1881
Medium Delay - LateAircraftDelay       0.03      0.02      0.03      2886
         Medium Delay - NASDelay       0.00      0.11      0.01       487
     Medium Delay - WeatherDelay       0.00      0.05      0.01       325
      Minor Delay - CarrierDelay       0.03      0.03      0.03      3539
 Minor Delay - LateAircraftDelay       0.04      0.02      0.02      4238
          Minor Delay

In [24]:
y_train.value_counts()

Flight_Status
On-Time                             78470
Minor Delay - LateAircraftDelay      4238
Unknown                              3977
Minor Delay - CarrierDelay           3539
Medium Delay - LateAircraftDelay     2886
Medium Delay - CarrierDelay          1881
Minor Delay - NASDelay               1060
Weather Cancellation                  781
Major Delay - LateAircraftDelay       649
Major Delay - CarrierDelay            514
Medium Delay - NASDelay               487
Carrier Cancellation                  424
Minor Delay - WeatherDelay            326
Medium Delay - WeatherDelay           325
NAS Cancellation                      205
Major Delay - WeatherDelay             96
Major Delay - NASDelay                 91
Security Issue                         51
Name: count, dtype: int64

In [39]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import cross_val_score, cross_val_predict

# Create Gradient Boosting classifier
model = GradientBoostingClassifier(
    n_estimators=20,    
    max_depth=3,        
    learning_rate=0.1,  
    subsample=0.6,       
    random_state=42,
    verbose=2            
)

# Create scorer for precision at recall
scorer = make_scorer(precision_at_recall, recall=0.2, average='weighted')

# Run cross-validation with the Gradient Boosting model
scores = cross_val_score(model, X_sample, y_train, cv=5, scoring=scorer, verbose=1, n_jobs=64)

# Fit the model and predict using cross-validation
y_pred_cv = cross_val_predict(model, X_sample, y_train, cv=5, n_jobs=64)

# Calculate metrics
recall = recall_score(y_train, y_pred_cv, average='macro')
precision = precision_score(y_train, y_pred_cv, average='macro', zero_division=0)
f1 = f1_score(y_train, y_pred_cv, average='macro')

print(f"Recall: {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1 Score: {f1:.4f}")

# Print classification report
print("Cross-Validation Classification Report:")
print(classification_report(y_train, y_pred_cv, zero_division=0))
print(f"Mean precision at recall 0.2: {scores.mean():.4f} ± {scores.std():.4f}")


[Parallel(n_jobs=64)]: Using backend LokyBackend with 64 concurrent workers.
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
[Parallel(n_jobs=64)]: Done   5 out of   5 | elapsed:  6.9min finished


Recall: 0.0556
Precision: 0.0482
F1 Score: 0.0494
Cross-Validation Classification Report:
                                  precision    recall  f1-score   support

            Carrier Cancellation       0.03      0.00      0.00       424
      Major Delay - CarrierDelay       0.02      0.00      0.00       514
 Major Delay - LateAircraftDelay       0.00      0.00      0.00       649
          Major Delay - NASDelay       0.00      0.00      0.00        91
      Major Delay - WeatherDelay       0.00      0.00      0.00        96
     Medium Delay - CarrierDelay       0.00      0.00      0.00      1881
Medium Delay - LateAircraftDelay       0.00      0.00      0.00      2886
         Medium Delay - NASDelay       0.03      0.00      0.00       487
     Medium Delay - WeatherDelay       0.00      0.00      0.00       325
      Minor Delay - CarrierDelay       0.00      0.00      0.00      3539
 Minor Delay - LateAircraftDelay       0.00      0.00      0.00      4238
          Minor Delay

## SGD Regression

In [40]:
from sklearn.linear_model import SGDClassifier

# Basic SGD classifier for multi-class classification
sgd_classifier = SGDClassifier(
    loss='log_loss',     
    alpha=0.0001,   
    n_jobs=64,
    random_state=42       
)

# Create scorer for precision at recall
scorer = make_scorer(precision_at_recall, recall=0.2, average='weighted')

# Run cross-validation with the Gradient Boosting model
scores = cross_val_score(sgd_classifier, X_sample, y_train, cv=5, scoring=scorer, verbose=1, n_jobs=64)

# Fit the model and predict using cross-validation
y_pred_cv = cross_val_predict(sgd_classifier, X_sample, y_train, cv=5, n_jobs=64)

# Calculate metrics
recall = recall_score(y_train, y_pred_cv, average='macro')
precision = precision_score(y_train, y_pred_cv, average='macro', zero_division=0)
f1 = f1_score(y_train, y_pred_cv, average='macro')

print(f"Recall: {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1 Score: {f1:.4f}")

# Print classification report
print("Cross-Validation Classification Report:")
print(classification_report(y_train, y_pred_cv, zero_division=0))
print(f"Mean precision at recall 0.2: {scores.mean():.4f} ± {scores.std():.4f}")

[Parallel(n_jobs=64)]: Using backend LokyBackend with 64 concurrent workers.
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
[Parallel(n_jobs=64)]: Done   5 out of   5 | elapsed:   29.8s finished


Recall: 0.0555
Precision: 0.0436
F1 Score: 0.0488
Cross-Validation Classification Report:
                                  precision    recall  f1-score   support

            Carrier Cancellation       0.00      0.00      0.00       424
      Major Delay - CarrierDelay       0.00      0.00      0.00       514
 Major Delay - LateAircraftDelay       0.00      0.00      0.00       649
          Major Delay - NASDelay       0.00      0.00      0.00        91
      Major Delay - WeatherDelay       0.00      0.00      0.00        96
     Medium Delay - CarrierDelay       0.00      0.00      0.00      1881
Medium Delay - LateAircraftDelay       0.00      0.00      0.00      2886
         Medium Delay - NASDelay       0.00      0.00      0.00       487
     Medium Delay - WeatherDelay       0.00      0.00      0.00       325
      Minor Delay - CarrierDelay       0.00      0.00      0.00      3539
 Minor Delay - LateAircraftDelay       0.00      0.00      0.00      4238
          Minor Delay

In [41]:
from sklearn.linear_model import LogisticRegression

# Fast but effective logistic regression configuration
log_reg = LogisticRegression(
    penalty='l2',
    C=1.0,                 
    max_iter=200,        
    tol=1e-3,           
    class_weight='balanced', 
    n_jobs=64,          
    random_state=42,     
    verbose=1            
)

# Run cross-validation with the Gradient Boosting model
scores = cross_val_score(log_reg, X_sample, y_train, cv=5, scoring=scorer, verbose=1, n_jobs=64)

# Fit the model and predict using cross-validation
y_pred_cv = cross_val_predict(log_reg, X_sample, y_train, cv=5, n_jobs=64)

# Calculate metrics
recall = recall_score(y_train, y_pred_cv, average='macro')
precision = precision_score(y_train, y_pred_cv, average='macro', zero_division=0)
f1 = f1_score(y_train, y_pred_cv, average='macro')

print(f"Recall: {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1 Score: {f1:.4f}")

# Print classification report
print("Cross-Validation Classification Report:")
print(classification_report(y_train, y_pred_cv, zero_division=0))
print(f"Mean precision at recall 0.2: {scores.mean():.4f} ± {scores.std():.4f}")

[Parallel(n_jobs=64)]: Using backend LokyBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.


RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =          918     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  2.31230D+05    |proj g|=  4.52906D+05
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =          918     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  2.31230D+05    |proj g|=  5.08173D+05
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =          918     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  2.31230D+05    |proj g|=  5.06678D+05
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =          918     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  2.31230D+05    |proj g|=  6.82702D+05
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =     

 This problem is unconstrained.
 This problem is unconstrained.
 This problem is unconstrained.
 This problem is unconstrained.
 This problem is unconstrained.



At iterate   50    f=  2.30413D+05    |proj g|=  1.83678D+04

At iterate   50    f=  2.30460D+05    |proj g|=  7.64161D+03

At iterate   50    f=  2.30432D+05    |proj g|=  2.04515D+04

At iterate   50    f=  2.30363D+05    |proj g|=  3.13326D+04

At iterate   50    f=  2.30345D+05    |proj g|=  1.77993D+05

At iterate  100    f=  2.29901D+05    |proj g|=  6.31362D+04

At iterate  100    f=  2.29820D+05    |proj g|=  1.75573D+04

At iterate  100    f=  2.30006D+05    |proj g|=  7.56836D+04

At iterate  100    f=  2.29745D+05    |proj g|=  3.66572D+04

At iterate  100    f=  2.29718D+05    |proj g|=  9.28460D+04

At iterate  150    f=  2.28940D+05    |proj g|=  2.57050D+04

At iterate  150    f=  2.29335D+05    |proj g|=  4.78086D+04

At iterate  150    f=  2.29303D+05    |proj g|=  1.39796D+04

At iterate  150    f=  2.29740D+05    |proj g|=  2.78229D+04

At iterate  150    f=  2.29359D+05    |proj g|=  1.66245D+04

At iterate  200    f=  2.28725D+05    |proj g|=  8.22377D+04

       

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=64)]: Done   1 out of   1 | elapsed:   10.8s finished
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=64)]: Done   1 out of   1 | elapsed:   10.9s finished



At iterate  200    f=  2.28091D+05    |proj g|=  5.04936D+04

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
  918    200    214      1     0     0   5.049D+04   2.281D+05
  F =   228091.21236615849     

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT                 

At iterate  200    f=  2.29113D+05    |proj g|=  3.50903D+04

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final proj

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=64)]: Done   1 out of   1 | elapsed:   11.1s finished
[Parallel(n_jobs=64)]: Done   1 out of   1 | elapsed:   11.1s finished



At iterate  200    f=  2.28640D+05    |proj g|=  1.39139D+04

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
  918    200    217      1     0     0   1.391D+04   2.286D+05
  F =   228639.94628400612     

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT                 


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=64)]: Done   1 out of   1 | elapsed:   11.3s finished
[Parallel(n_jobs=64)]: Done   5 out of   5 | elapsed:   12.8s finished
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.


RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =          918     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  2.31230D+05    |proj g|=  5.08173D+05
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =          918     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  2.31230D+05    |proj g|=  5.06678D+05
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =          918     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  2.31230D+05    |proj g|=  4.50830D+05
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =          918     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  2.31230D+05    |proj g|=  4.52906D+05
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =     

 This problem is unconstrained.
 This problem is unconstrained.
 This problem is unconstrained.
 This problem is unconstrained.
 This problem is unconstrained.



At iterate   50    f=  2.30345D+05    |proj g|=  1.77993D+05

At iterate   50    f=  2.30432D+05    |proj g|=  2.04515D+04

At iterate   50    f=  2.30460D+05    |proj g|=  7.64161D+03

At iterate   50    f=  2.30413D+05    |proj g|=  1.83678D+04

At iterate   50    f=  2.30363D+05    |proj g|=  3.13326D+04

At iterate  100    f=  2.29718D+05    |proj g|=  9.28460D+04

At iterate  100    f=  2.29820D+05    |proj g|=  1.75573D+04

At iterate  100    f=  2.30006D+05    |proj g|=  7.56836D+04

At iterate  100    f=  2.29901D+05    |proj g|=  6.31362D+04

At iterate  100    f=  2.29745D+05    |proj g|=  3.66572D+04

At iterate  150    f=  2.29303D+05    |proj g|=  1.39796D+04

At iterate  150    f=  2.29359D+05    |proj g|=  1.66245D+04

At iterate  150    f=  2.29740D+05    |proj g|=  2.78229D+04

At iterate  150    f=  2.29335D+05    |proj g|=  4.78086D+04

At iterate  150    f=  2.28940D+05    |proj g|=  2.57050D+04

At iterate  200    f=  2.28655D+05    |proj g|=  1.18179D+04

       

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=64)]: Done   1 out of   1 | elapsed:   10.8s finished



At iterate  200    f=  2.29113D+05    |proj g|=  3.50903D+04

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
  918    200    213      1     0     0   3.509D+04   2.291D+05
  F =   229113.05220798624     

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT                 

At iterate  200    f=  2.28640D+05    |proj g|=  1.39139D+04

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final proj

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=64)]: Done   1 out of   1 | elapsed:   11.7s finished
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=64)]: Done   1 out of   1 | elapsed:   11.7s finished



At iterate  200    f=  2.28725D+05    |proj g|=  8.22377D+04

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
  918    200    211      1     0     0   8.224D+04   2.287D+05
  F =   228724.81232402046     

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT                 


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=64)]: Done   1 out of   1 | elapsed:   12.1s finished



At iterate  200    f=  2.28091D+05    |proj g|=  5.04936D+04

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
  918    200    214      1     0     0   5.049D+04   2.281D+05
  F =   228091.21236615849     

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT                 


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=64)]: Done   1 out of   1 | elapsed:   12.8s finished


Recall: 0.0682
Precision: 0.0681
F1 Score: 0.0058
Cross-Validation Classification Report:
                                  precision    recall  f1-score   support

            Carrier Cancellation       0.00      0.04      0.01       424
      Major Delay - CarrierDelay       0.01      0.07      0.01       514
 Major Delay - LateAircraftDelay       0.00      0.00      0.00       649
          Major Delay - NASDelay       0.00      0.10      0.00        91
      Major Delay - WeatherDelay       0.00      0.27      0.00        96
     Medium Delay - CarrierDelay       0.04      0.00      0.01      1881
Medium Delay - LateAircraftDelay       0.04      0.01      0.01      2886
         Medium Delay - NASDelay       0.00      0.05      0.01       487
     Medium Delay - WeatherDelay       0.00      0.05      0.01       325
      Minor Delay - CarrierDelay       0.03      0.00      0.00      3539
 Minor Delay - LateAircraftDelay       0.04      0.00      0.00      4238
          Minor Delay

## MLP

In [46]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, precision_score

# Create the MLP classifier with precision-focused parameters
mlp = MLPClassifier(
    hidden_layer_sizes=(100),
    max_iter=100,              
    early_stopping=True,
    random_state=42            
)

# Create scorer for precision at recall
scorer = make_scorer(precision_at_recall, recall=0.2, average='weighted')

# Run cross-validation with the Gradient Boosting model
scores = cross_val_score(mlp, X_sample, y_train, cv=5, scoring=scorer, verbose=1, n_jobs=64)

# Fit the model and predict using cross-validation
y_pred_cv = cross_val_predict(mlp, X_sample, y_train, cv=5, n_jobs=64)

# Calculate metrics
recall = recall_score(y_train, y_pred_cv, average='macro')
precision = precision_score(y_train, y_pred_cv, average='macro', zero_division=0)
f1 = f1_score(y_train, y_pred_cv, average='macro')

print(f"Recall: {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1 Score: {f1:.4f}")

# Print classification report
print("Cross-Validation Classification Report:")
print(classification_report(y_train, y_pred_cv, zero_division=0))
print(f"Mean precision at recall 0.2: {scores.mean():.4f} ± {scores.std():.4f}")

ImportError: cannot import name '_MissingValues' from 'sklearn.utils._param_validation' (/sw/external/python/anaconda3_cpu/lib/python3.9/site-packages/sklearn/utils/_param_validation.py)

## Random Forest

In [43]:
from sklearn.ensemble import RandomForestClassifier

# Random Forest classifier
rf = RandomForestClassifier(
    n_estimators=100,        
    max_depth=15,            
    min_samples_split=10,    
    min_samples_leaf=5,     
    max_features='sqrt',     
    bootstrap=True,          
    class_weight='balanced', 
    n_jobs=64,              
    random_state=42,         
    verbose=1               
)

# Create scorer for precision at recall
scorer = make_scorer(precision_at_recall, recall=0.2, average='weighted')

# Run cross-validation with the Gradient Boosting model
scores = cross_val_score(rf, X_sample, y_train, cv=5, scoring=scorer, verbose=1, n_jobs=64)

# Fit the model and predict using cross-validation
y_pred_cv = cross_val_predict(rf, X_sample, y_train, cv=5, n_jobs=64)

# Calculate metrics
recall = recall_score(y_train, y_pred_cv, average='macro')
precision = precision_score(y_train, y_pred_cv, average='macro', zero_division=0)
f1 = f1_score(y_train, y_pred_cv, average='macro')

print(f"Recall: {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1 Score: {f1:.4f}")

# Print classification report
print("Cross-Validation Classification Report:")
print(classification_report(y_train, y_pred_cv, zero_division=0))
print(f"Mean precision at recall 0.2: {scores.mean():.4f} ± {scores.std():.4f}")

[Parallel(n_jobs=64)]: Using backend LokyBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Done  74 out of 100 | elapsed:    7.6s remaining:    2.7s
[Parallel(n_jobs=64)]: Done  74 out of 100 | elapsed:    7.6s remaining:    2.7s
[Parallel(n_jobs=64)]: Done  74 out of 100 | elapsed:    7.7s remaining:    2.7s
[Parallel(n_jobs=64)]: Done  74 out of 100 | elapsed:    7.9s remaining:    2.8s
[Parallel(n_jobs=64)]: Done  74 out of 100 | elapsed:    8.1s remaining:    2.8s
[Parallel(n_jobs=64)]: Done 100 out of 100 | elapsed:    8.4s finished
[Parallel(n_jobs=64)]: Using backend 

Recall: 0.0564
Precision: 0.0559
F1 Score: 0.0523
Cross-Validation Classification Report:
                                  precision    recall  f1-score   support

            Carrier Cancellation       0.00      0.01      0.01       424
      Major Delay - CarrierDelay       0.01      0.01      0.01       514
 Major Delay - LateAircraftDelay       0.01      0.02      0.01       649
          Major Delay - NASDelay       0.00      0.00      0.00        91
      Major Delay - WeatherDelay       0.00      0.00      0.00        96
     Medium Delay - CarrierDelay       0.02      0.04      0.03      1881
Medium Delay - LateAircraftDelay       0.03      0.06      0.04      2886
         Medium Delay - NASDelay       0.01      0.02      0.01       487
     Medium Delay - WeatherDelay       0.00      0.01      0.01       325
      Minor Delay - CarrierDelay       0.04      0.07      0.05      3539
 Minor Delay - LateAircraftDelay       0.04      0.08      0.05      4238
          Minor Delay

In [44]:
# KNN configuration
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(
    n_neighbors=5,   
    weights='distance', 
    n_jobs=64           
)


# Create scorer for precision at recall
scorer = make_scorer(precision_at_recall, recall=0.2, average='weighted')

# Run cross-validation with the Gradient Boosting model
scores = cross_val_score(knn, X_sample, y_train, cv=5, scoring=scorer, verbose=1, n_jobs=64)

# Fit the model and predict using cross-validation
y_pred_cv = cross_val_predict(knn, X_sample, y_train, cv=5, n_jobs=64)

# Calculate metrics
recall = recall_score(y_train, y_pred_cv, average='macro')
precision = precision_score(y_train, y_pred_cv, average='macro', zero_division=0)
f1 = f1_score(y_train, y_pred_cv, average='macro')

print(f"Recall: {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1 Score: {f1:.4f}")

# Print classification report
print("Cross-Validation Classification Report:")
print(classification_report(y_train, y_pred_cv, zero_division=0))
print(f"Mean precision at recall 0.2: {scores.mean():.4f} ± {scores.std():.4f}")

[Parallel(n_jobs=64)]: Using backend LokyBackend with 64 concurrent workers.
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
[Parallel(n_jobs=64)]: Done   5 out of   5 | elapsed:   12.6s finished


Recall: 0.0560
Precision: 0.0589
F1 Score: 0.0513
Cross-Validation Classification Report:
                                  precision    recall  f1-score   support

            Carrier Cancellation       0.00      0.00      0.00       424
      Major Delay - CarrierDelay       0.00      0.00      0.00       514
 Major Delay - LateAircraftDelay       0.00      0.00      0.00       649
          Major Delay - NASDelay       0.00      0.00      0.00        91
      Major Delay - WeatherDelay       0.10      0.01      0.02        96
     Medium Delay - CarrierDelay       0.02      0.00      0.00      1881
Medium Delay - LateAircraftDelay       0.02      0.00      0.00      2886
         Medium Delay - NASDelay       0.00      0.00      0.00       487
     Medium Delay - WeatherDelay       0.00      0.00      0.00       325
      Minor Delay - CarrierDelay       0.03      0.00      0.01      3539
 Minor Delay - LateAircraftDelay       0.05      0.01      0.01      4238
          Minor Delay

In [45]:
from sklearn.ensemble import ExtraTreesClassifier

extra_trees = ExtraTreesClassifier(
    n_estimators=100,
    max_depth=15,
    class_weight='balanced',
    n_jobs=64,
    random_state=42
)

# Create scorer for precision at recall
scorer = make_scorer(precision_at_recall, recall=0.2, average='weighted')

# Run cross-validation with the Gradient Boosting model
scores = cross_val_score(extra_trees, X_sample, y_train, cv=5, scoring=scorer, verbose=1, n_jobs=64)

# Fit the model and predict using cross-validation
y_pred_cv = cross_val_predict(extra_trees, X_sample, y_train, cv=5, n_jobs=64)

# Calculate metrics
recall = recall_score(y_train, y_pred_cv, average='macro')
precision = precision_score(y_train, y_pred_cv, average='macro', zero_division=0)
f1 = f1_score(y_train, y_pred_cv, average='macro')

print(f"Recall: {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1 Score: {f1:.4f}")

# Print classification report
print("Cross-Validation Classification Report:")
print(classification_report(y_train, y_pred_cv, zero_division=0))
print(f"Mean precision at recall 0.2: {scores.mean():.4f} ± {scores.std():.4f}")

[Parallel(n_jobs=64)]: Using backend LokyBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Done   5 out of   5 | elapsed:    4.3s finished


Recall: 0.0566
Precision: 0.0560
F1 Score: 0.0548
Cross-Validation Classification Report:
                                  precision    recall  f1-score   support

            Carrier Cancellation       0.00      0.00      0.00       424
      Major Delay - CarrierDelay       0.01      0.01      0.01       514
 Major Delay - LateAircraftDelay       0.00      0.00      0.00       649
          Major Delay - NASDelay       0.00      0.00      0.00        91
      Major Delay - WeatherDelay       0.01      0.02      0.01        96
     Medium Delay - CarrierDelay       0.02      0.01      0.01      1881
Medium Delay - LateAircraftDelay       0.03      0.01      0.02      2886
         Medium Delay - NASDelay       0.01      0.01      0.01       487
     Medium Delay - WeatherDelay       0.00      0.00      0.00       325
      Minor Delay - CarrierDelay       0.04      0.02      0.02      3539
 Minor Delay - LateAircraftDelay       0.04      0.02      0.02      4238
          Minor Delay