# Data Pipeline

**Feature Extraction**

1.  **Capture Historical Delay Patterns:**
    * **Feature:** `Delay_Trend`
    * **Goal:** Identify recurring delays between specific airport pairs during particular weeks.
    * **Implementation:** Aggregate weekly delay data, focusing on the top 15 airports to manage memory and ensure generalizability.
    * **Rationale:** Historical delay patterns are strong indicators of future delays.

2.  **Assess Aircraft Reuse:**
    * **Feature:** `Same_Day_Tail_Reuse`
    * **Goal:** Determine if an aircraft is scheduled for multiple flights on the same day.
    * **Rationale:** High reuse can lead to delay propagation.

3.  **Calculate Turnaround Time:**
    * **Feature:** `Turnaround_Time`
    * **Goal:** Measure the time between an aircraft's arrival and its next scheduled departure.
    * **Rationale:** Short turnaround times increase delay vulnerability.

4.  **Determine Slack Time:**
    * **Feature:** `Slack_Time`
    * **Goal:** Calculate the difference between scheduled arrival and expected travel time.
    * **Rationale:** Slack time indicates buffer capacity, which can absorb minor delays.

**Outlier**

1.  **Handle Extreme Aircraft Age Values:**
    * **Feature:** `Aircraft_Age`
    * **Issue:** Presence of unrealistic age values.
    * **Solution:** Clip outliers to a realistic maximum (e.g., 56.9 years).
    * **Rationale:** Prevent outliers from skewing model training.

**Transformation and Imputation**

1.  **Apply Logarithmic Transformation:**
    * **Features:** `Distance`, `Wind_Gust`
    * **Transformation:** `log1p`
    * **Rationale:** Address right-skewed distributions and improve model robustness.

2.  **Impute Missing Values:**
    * **Strategy:** Mean imputation for numeric features, most frequent imputation for categorical features.
    * **Rationale:** Maintain data completeness with a simple and effective approach.

**Feature Pruning**

1.  **Remove Redundant Information:**
    * **Action:** Drop columns like city/state names, latitude/longitude, and detailed marketing airline data.
    * **Rationale:** Eliminate redundant information that doesn't contribute to prediction accuracy.

2.  **Discard Uninformative or Highly Missing Columns:**
    * **Action:** Remove columns with significant missing data or those that offer little predictive value.
    * **Rationale:** Reduce noise and improve model focus.

**Encoding and Categorical**

1.  **One-Hot Encoding with Cardinality Limits:**
    * **Features:** `OriginAirportID`, `DestAirportID`
    * **Encoding:** `OneHotEncoder` with `max_categories=50`.
    * **Rationale:** Handle high-cardinality features efficiently while preserving important information.

2.  **Full One-Hot Encoding:**
    * **Features:** `DayOfWeek`, `Month`, `Operating_Airline`
    * **Encoding:** Complete one-hot encoding.
    * **Rationale:** Preserve category identity for time- and airline-sensitive predictions.

**Dimensionality Reduction**

- **Principal Component Analysis (PCA):**
    * **Technique:** `PCA(n_components=50)`
    * **Rationale:** Address multicollinearity and reduce dimensionality, especially with correlated weather and location data.
    * **Benefit:** Improved training efficiency and reduced overfitting.

In [1]:
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pylab as plt
import seaborn as sns
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer, StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
import time
from sklearn.model_selection import cross_val_score

from sklearn import set_config
set_config(transform_output = "pandas")
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.metrics import classification_report


In [2]:
def stratified_random_split(df: pd.DataFrame, target_column: str, test_size: float = 0.1, random_state: int = 42):
    """
    Performs a stratified random train-test split to ensure all classes in 
    'Flight_Status' are proportionally represented in both sets.

    Parameters:
    df (pd.DataFrame): The dataset containing the target variable.
    target_column (str): The column representing the classification target.
    test_size (float): The proportion of data to be used as test data.
    random_state (int): Random seed for reproducibility.

    Returns:
    tuple: (train_df, test_df) DataFrames.
    """
    train_df, test_df = train_test_split(
        df, test_size=test_size, stratify=df[target_column], random_state=random_state
    )

    print(f"Train size: {len(train_df)} samples")
    print(f"Test size: {len(test_df)} samples")

    return train_df, test_df



In [3]:
flight_data = pd.read_parquet("data/WEATHER121.parquet")

In [4]:
train_data, test_data= stratified_random_split(flight_data, target_column="Flight_Status")

Train size: 13184010 samples
Test size: 1464890 samples


In [5]:
#train_data = train_data.sample(n=100000, random_state=42)

In [6]:
X_train = train_data.drop(columns=["Flight_Status"])
y_train = train_data["Flight_Status"]

### Pipeline Functions

In [7]:
class ProgressLogger(BaseEstimator, TransformerMixin):
    """
    A transformer that logs progress through a pipeline.
    """
    
    def __init__(self, total_rows, log_interval=0.01, name='Pipeline'):
        self.total_rows = total_rows
        self.log_interval = log_interval
        self.name = name
        self.start_time = None
        self.last_log_percent = -1
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        if self.start_time is None:
            self.start_time = time.time()
            print(f"{self.name} processing started on {self.total_rows:,} rows")
        
        current_rows = X.shape[0]
        percent_complete = current_rows / self.total_rows
        
        int_percent = int(percent_complete / self.log_interval)
        if int_percent > self.last_log_percent:
            self.last_log_percent = int_percent
            elapsed = time.time() - self.start_time
            
            percent_done = percent_complete * 100
            if percent_complete > 0:
                total_est = elapsed / percent_complete
                remaining = total_est - elapsed
                time_str = f" - Est. remaining: {remaining:.1f}s"
            else:
                time_str = ""
                
            print(f"{self.name}: {percent_done:.1f}% complete ({current_rows:,}/{self.total_rows:,} rows){time_str}")
        
        return X

class DelayTrendEncoder(BaseEstimator, TransformerMixin):
    """
    Transformer that precomputes a route-based delay trend using week_of_year and top N airports.

    Parameters:
        date_col (str): Name of the datetime column
        origin_col (str): Origin airport column
        dest_col (str): Destination airport column
        status_col (str): Flight status column for delay signal
        output_col (str): Name of the new delay trend feature
        top_n_airports (int): Number of top airports to retain by frequency
    """
    def __init__(self,
                 date_col='dep_datetime',
                 origin_col='OriginAirportID',
                 dest_col='DestAirportID',
                 status_col='Flight_Status',
                 output_col='Delay_Trend',
                 top_n_airports=15):
        self.date_col = date_col
        self.origin_col = origin_col
        self.dest_col = dest_col
        self.status_col = status_col
        self.output_col = output_col
        self.top_n_airports = top_n_airports

    def fit(self, X, y=None):
        X_temp = X.copy()
        X_temp[self.date_col] = pd.to_datetime(X_temp[self.date_col])
        X_temp['week_of_year'] = X_temp[self.date_col].dt.isocalendar().week

        X_temp['Flight_Status'] = y.reset_index(drop=True)
        
        all_airports = pd.concat([X_temp[self.origin_col], X_temp[self.dest_col]])
        top_airports = all_airports.value_counts().head(self.top_n_airports).index
        X_temp = X_temp[
            X_temp[self.origin_col].isin(top_airports) &
            X_temp[self.dest_col].isin(top_airports)
        ]

        delay_reasons = [
            'CarrierDelay', 'WeatherDelay', 'NASDelay',
            'SecurityDelay', 'LateAircraftDelay'
        ]
        X_temp['delay_signal'] = 0
        mask = X_temp['Flight_Status'].str.contains('Delay', na=False)
        delay_type = X_temp.loc[mask, 'Flight_Status'].str.split(' - ').str[-1]
        valid = delay_type.isin(delay_reasons)
        X_temp.loc[mask[mask].index[valid], 'delay_signal'] = 1

        self.trend_lookup_ = (
            X_temp.groupby(['week_of_year', self.origin_col, self.dest_col])['delay_signal']
            .mean()
            .reset_index()
            .rename(columns={'delay_signal': self.output_col})
        )

        return self

    def transform(self, X):
        X = X.copy()
        X[self.date_col] = pd.to_datetime(X[self.date_col])
        X['week_of_year'] = X[self.date_col].dt.isocalendar().week

        X = X.merge(
            self.trend_lookup_,
            on=['week_of_year', self.origin_col, self.dest_col],
            how='left'
        )

        X[self.output_col] = X[self.output_col].fillna(0)
        return X.drop(columns=['week_of_year'])

class SameDayTailReuseEncoder(BaseEstimator, TransformerMixin):
    """
    Transformer that creates a feature counting how many times
    the same tail number (aircraft) is used on the same day.

    Parameters:
        datetime_col (str): Column containing full departure datetime
        tail_col (str): Column name for tail number
        output_col (str): Name of the output column
    """

    def __init__(self,
                 datetime_col='dep_datetime',
                 tail_col='Tail_Number',
                 output_col='Same_Day_Tail_Reuse'):
        self.datetime_col = datetime_col
        self.tail_col = tail_col
        self.output_col = output_col

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()

        X[self.datetime_col] = pd.to_datetime(X[self.datetime_col])

        X['dep_date'] = X[self.datetime_col].dt.date

        X[self.output_col] = (
            X.groupby([self.tail_col, 'dep_date'])[self.tail_col]
            .transform('count')
            .astype(float)  
        )

        return X.drop(columns=['dep_date'])

    
class TurnaroundDelayEncoder(BaseEstimator, TransformerMixin):
    """
    Transformer that computes:
    - Previous_Flight_Delay: scheduled departure time of previous flight with same tail number on same day
    - Turnaround_Time: time between previous arrival and current departure

    Parameters:
        datetime_col (str): Column with full departure datetime
        dep_time_col (str): Column with scheduled departure time (e.g. CRSDepTime)
        arr_time_col (str): Column with scheduled arrival time (e.g. CRSArrTime)
        tail_col (str): Tail number column
        output_prefix (str): Prefix to use for new feature columns
    """

    def __init__(self,
                 datetime_col='dep_datetime',
                 dep_time_col='CRSDepTime',
                 arr_time_col='CRSArrTime',
                 tail_col='Tail_Number',
                 output_prefix=''):
        self.datetime_col = datetime_col
        self.dep_time_col = dep_time_col
        self.arr_time_col = arr_time_col
        self.tail_col = tail_col
        self.output_prefix = output_prefix

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X[self.datetime_col] = pd.to_datetime(X[self.datetime_col])
        X['dep_date'] = X[self.datetime_col].dt.date

        X = X.sort_values(by=[self.tail_col, 'dep_date', self.dep_time_col])

        prev_delay_col = self.output_prefix + 'Previous_Flight_Delay'
        turnaround_col = self.output_prefix + 'Turnaround_Time'

        X[prev_delay_col] = (
            X.groupby([self.tail_col, 'dep_date'])[self.dep_time_col]
            .shift(1)
        )

        X['Previous_Arrival_Time'] = (
            X.groupby([self.tail_col, 'dep_date'])[self.arr_time_col]
            .shift(1)
        )

        X[turnaround_col] = self._calculate_turnaround(X[self.dep_time_col], X['Previous_Arrival_Time'])

        X[prev_delay_col] = X[prev_delay_col].fillna(0)
        X[turnaround_col] = X[turnaround_col].fillna(0)

        return X.drop(columns=['dep_date', 'Previous_Arrival_Time'])

    def _calculate_turnaround(self, current_dep, previous_arr):
        """
        Applies time difference logic with wrap-around at midnight (2400).
        """
        diff = current_dep - previous_arr
        adjusted = diff.mask((~diff.isna()) & (diff < 0), diff + 2400)
        return adjusted

class SlackTimeEncoder(BaseEstimator, TransformerMixin):
    """
    Transformer that calculates slack time as the difference between scheduled
    arrival time (CRSArrTime) and scheduled elapsed time (CRSElapsedTime).

    Parameters:
        arr_col (str): Column name for scheduled arrival time.
        elapsed_col (str): Column name for scheduled elapsed time.
        output_col (str): Name of the output column to store slack time.
    """

    def __init__(self,
                 arr_col='CRSArrTime',
                 elapsed_col='CRSElapsedTime',
                 output_col='Slack_Time'):
        self.arr_col = arr_col
        self.elapsed_col = elapsed_col
        self.output_col = output_col

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X[self.output_col] = X[self.arr_col] - X[self.elapsed_col]
        return X
    
    

class AgeClipper(BaseEstimator, TransformerMixin):
    """
    Transformer that caps aircraft age at a maximum value.
    
    Parameters:
    -----------
    max_age : float, default=56.9
        Maximum value for aircraft age
    column_name : str, default='Aircraft_Age'
        Name of the column to apply the cap to
    """
    
    def __init__(self, max_age=56.9, column_name='Aircraft_Age'):
        self.max_age = max_age
        self.column_name = column_name
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X_transformed = X.copy() if isinstance(X, pd.DataFrame) else pd.DataFrame(X)
        
        if self.column_name in X_transformed.columns:
            X_transformed[self.column_name] = np.minimum(X_transformed[self.column_name], self.max_age)
        
        if not isinstance(X, pd.DataFrame):
            X_transformed = X_transformed.values
            
        return X_transformed

### Pipeline Model

**Feature Extraction**

1.  **Capture Historical Delay Patterns:**
    * **New Feature:** `Delay_Trend`
    * **Goal:** Identify recurring delays between specific airport pairs during particular weeks.
    * **Implementation:** Aggregate weekly delay data, focusing on the top 15 airports to manage memory and ensure generalizability.
    * **Why we thought of this:** Historical delay patterns are strong indicators of future delays.

2.  **Assess Aircraft Reuse:**
    * **New Feature:** `Same_Day_Tail_Reuse`
    * **Goal:** Determine if an aircraft is scheduled for multiple flights on the same day.
    * **Why we thought of this:** High reuse can lead to delay propagation.

3.  **Calculate Turnaround Time:**
    * **New Feature:** `Turnaround_Time`
    * **Goal:** Measure the time between an aircraft's arrival and its next scheduled departure.
    * **Why we thought of this:** Short turnaround times increase delay vulnerability.

4.  **Determine Slack Time:**
    * **New Feature:** `Slack_Time`
    * **Goal:** Calculate the difference between scheduled arrival and expected travel time.
    * **Why we thought of this:** Slack time indicates buffer capacity, which can absorb minor delays.

**Outlier**

1.  **Handle Extreme Aircraft Age Values:**
    * **Feature:** `Aircraft_Age`
    * **Issue:**  unrealistic age values (There are hundreds of ages in the thousands)
    * **Solution:** Clip outliers to the second highest value (56.9 years). (more reasonable)
    * **Why:** Prevent outliers from skewing model training.

**Transformation and Imputation**

1.  **Apply Logarithmic Transformation:**
    * **Features:** `Distance`, `Wind_Gust`
    * **Using:** log1p (In case theres 0 values)
    * **Why:** Hopefully improve the models performance

2.  **Impute Missing Values:**
    * **Strategy:** Mean imputation for numeric features, most frequent imputation for categorical features.
    * **Why:** Maintain data completeness with a simple and effective approach.

**Feature Pruning**

1.  **Remove Redundant Information:**
    * **Action:** Drop columns like city/state names, latitude/longitude, and detailed marketing airline data.
    * **Why:** Eliminate redundant information that doesn't contribute to prediction accuracy.

2.  **Discard Uninformative or Highly Missing Columns:**
    * **Action:** Remove columns with significant missing data or those that offer little predictive value.
    * **Why:** Reduce noise and improve model focus.

**Encoding and Categorical**

1.  **One-Hot Encoding with Cardinality Limits:**
    * **Features:** `OriginAirportID`, `DestAirportID`
    * **Encoding:** `OneHotEncoder` with `max_categories=51`.
    * **Why:** We made a max of 50 categories (technically 51) so we kept the top 50 out of 300 something columns. 300 columns would be too much.

2.  **Full One-Hot Encoding:**
    * **Features:** `DayOfWeek`, `Month`, `Operating_Airline`
    * **Encoding:** Complete one-hot encoding.
    * **Why:** Preserve category identity for time- and airline-sensitive predictions.

**Dimensionality Reduction**

- **Principal Component Analysis (PCA):**
    * **Technique:** `PCA(n_components=50)`
    * **Rationale:** Address multicollinearity and reduce dimensionality, especially with correlated weather and location data.
    * **Why:** Improved training efficiency and reduced overfitting.

In [13]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.decomposition import PCA


numerical_features = [
    'Distance', 'relative_humidity_2m', 'temperature_2m',
    'Aircraft_Age', 'temperature_2m_dest_dep_time',
    'snowfall', 'soil_moisture_0_to_1cm',
    'snowfall_dest_dep_time', 
    'et0_fao_evapotranspiration_dest_dep_time',
    'surface_pressure_dest_dep_time', 'pressure_msl_dest_dep_time',
    'wind_direction_10m_dest_dep_time', 'wind_gusts_10m_dest_dep_time', 'CRSElapsedTime', 'Slack_Time'
]

categorical_features = [
    'DayOfWeek', 'Month',
    'OriginAirportID',
    'DestAirportID', 'Is_Holiday_Week', 'Operating_Airline '
]

drop_features = [
            'Aircraft_Model', 'Aircraft_EngineType', 'Holiday',
            'dep_datetime', 'Tail_Number', 'temperature_120m_dest_dep_time', 
            'temperature_180m_dest_dep_time', 'temperature_80m_dest_dep_time', 'temperature_180m', 
            'temperature_120m', 'temperature_80m', 'wind_speed_80m', 'wind_speed_120m',
            'wind_speed_180m', 'wind_direction_80m', 'wind_direction_120m', 'wind_direction_180m',
            'wind_speed_80m_dest_dep_time', 'wind_speed_120m_dest_dep_time', 'wind_speed_180m_dest_dep_time',
            'wind_direction_80m_dest_dep_time', 'wind_direction_120m_dest_dep_time', 'wind_direction_180m_dest_dep_time', 
            'soil_temperature_0cm_dest_dep_time', 'soil_temperature_0cm', 'rain', 'rain_dest_dep_time', 'dew_point_2m_dest_dep_time', 
            'cloud_cover_mid', 'cloud_cover', 'cloud_cover_mid_dest_dep_time', 'cloud_cover_dest_dep_time', 'dew_point_2m', 
            'wind_speed_10m', 'wind_speed_10m_dest_dep_time', 'visibility_dest_dep_time', 'visibility', 'apparent_temperature', 
            'apparent_temperature_dest_dep_time', 'vapour_pressure_deficit', 'vapour_pressure_deficit_dest_dep_time', 'Flights',
            'DOT_ID_Operating_Airline', 'Flight_Number_Operating_Airline', 'Aircraft_Engines', 'Aircraft_Seats', 'Is_Freighter',
            'relative_humidity_2m_dest_dep_time', 'precipitation', 'showers', 'snow_depth', 'et0_fao_evapotranspiration',
            'evapotranspiration', 'cloud_cover_high', 'cloud_cover_low', 'surface_pressure', 'weather_code', 'pressure_msl', 
            'wind_direction_10m', 'wind_gusts_10m', 'precipitation_dest_dep_time', 'showers_dest_dep_time', 'snow_depth_dest_dep_time',
            'soil_moisture_0_to_1cm_dest_dep_time', 'evapotranspiration_dest_dep_time', 'cloud_cover_high_dest_dep_time',
            'cloud_cover_low_dest_dep_time', 'weather_code_dest_dep_time', 'Operated_or_Branded_Code_Share_Partners', 'Year',
            'Quarter', 'Marketing_Airline_Network', 'DayofMonth', 'DOT_ID_Marketing_Airline', 'Flight_Number_Marketing_Airline', 
            'OriginAirportSeqID', 'OriginCityMarketID', 'Origin', 'OriginCityName', 'OriginState', 'OriginStateFips', 
            'OriginStateName', 'OriginWac', 'DestAirportSeqID', 'DestCityMarketID', 'Dest', 'DestCityName', 'DestState', 'DestStateFips', 
            'DestStateName', 'DestWac', 'CRSDepTime', 'DepTimeBlk', 'CRSArrTime', 'ArrTimeBlk', 'DistanceGroup', 'Duplicate', 
            'Aircraft_Airline', 'Aircraft_ModelCode', 'Aircraft_Type'
        ]

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('age_clipper', AgeClipper(max_age=56.9, column_name='Aircraft_Age')),
    ('log1p_distance', ColumnTransformer([
        ('log', FunctionTransformer(np.log1p), ['Distance', 'wind_gusts_10m_dest_dep_time'])
    ], remainder='passthrough', verbose_feature_names_out=False)), 
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('OHE', ColumnTransformer([
        ('top50', OneHotEncoder(handle_unknown='ignore', sparse_output=False, max_categories=51), ['OriginAirportID', 'DestAirportID']),
        ('OHE' , OneHotEncoder(handle_unknown='ignore', sparse_output=False), ['DayOfWeek', 'Month', 'Operating_Airline '])
    ]))
])

preprocessor = Pipeline([
    ('initial_logger', ProgressLogger(total_rows=13000000, name='Starting Pipeline')),
    
    ('delay_trend', DelayTrendEncoder(
        date_col='dep_datetime',
        origin_col='OriginAirportID',
        dest_col='DestAirportID',
        status_col='Flight_Status',
        output_col='Delay_Trend',
        top_n_airports=15
    )),
    
    ('trend_logger', ProgressLogger(total_rows=13000000, name='Delay Trend Complete')),
    
    ('tail_reuse', SameDayTailReuseEncoder(
        datetime_col='dep_datetime',
        tail_col='Tail_Number',
        output_col='Same_Day_Tail_Reuse'
    )),
    
    ('reuse_logger', ProgressLogger(total_rows=13000000, name='Tail Reuse Complete')),
    
    ('turnaround_delay', TurnaroundDelayEncoder(
        datetime_col='dep_datetime',
        dep_time_col='CRSDepTime',
        arr_time_col='CRSArrTime',
        tail_col='Tail_Number',
        output_prefix=''
    )),
    
    ('turnaround_logger', ProgressLogger(total_rows=13000000, name='Turnaround Delay Complete')),
    
    ('slack_time', SlackTimeEncoder(
        arr_col='CRSArrTime',
        elapsed_col='CRSElapsedTime',
        output_col='Slack_Time'
    )),
    
    ('slack_logger', ProgressLogger(total_rows=13000000, name='Slack Time Complete')),
    
    ('feature_prep', ColumnTransformer([
        ('numeric', numeric_transformer, numerical_features),
        ('categorical', categorical_transformer, categorical_features),
        ('drop', 'drop', drop_features)
    ], remainder='passthrough', verbose_feature_names_out=False, n_jobs=1)),
    
    ('prep_logger', ProgressLogger(total_rows=13000000, name='Feature Prep Complete')),
    ('select', PCA(
        n_components=50,
        random_state=42,
    )),
    ('final_logger', ProgressLogger(total_rows=13000000, name='Pipeline Complete'))
])
# so that I dont forget, dep_datetime is dropped because of redundancy

preprocessor

In [14]:
# Train the pipeline in this cell and run it on X, this cell must end with the transformed data
X = preprocessor.fit_transform(X_train, y_train)

pd.set_option('display.max_columns', None)
X

Starting Pipeline processing started on 13,000,000 rows
Starting Pipeline: 101.4% complete (13,184,010/13,000,000 rows) - Est. remaining: -0.0s
Delay Trend Complete processing started on 13,000,000 rows
Delay Trend Complete: 101.4% complete (13,184,010/13,000,000 rows) - Est. remaining: 0.0s
Tail Reuse Complete processing started on 13,000,000 rows
Tail Reuse Complete: 101.4% complete (13,184,010/13,000,000 rows) - Est. remaining: 0.0s
Turnaround Delay Complete processing started on 13,000,000 rows
Turnaround Delay Complete: 101.4% complete (13,184,010/13,000,000 rows) - Est. remaining: -0.0s
Slack Time Complete processing started on 13,000,000 rows
Slack Time Complete: 101.4% complete (13,184,010/13,000,000 rows) - Est. remaining: -0.0s
Feature Prep Complete processing started on 13,000,000 rows
Feature Prep Complete: 101.4% complete (13,184,010/13,000,000 rows) - Est. remaining: -0.0s
Pipeline Complete processing started on 13,000,000 rows
Pipeline Complete: 101.4% complete (13,184,0

Unnamed: 0,pca0,pca1,pca2,pca3,pca4,pca5,pca6,pca7,pca8,pca9,pca10,pca11,pca12,pca13,pca14,pca15,pca16,pca17,pca18,pca19,pca20,pca21,pca22,pca23,pca24,pca25,pca26,pca27,pca28,pca29,pca30,pca31,pca32,pca33,pca34,pca35,pca36,pca37,pca38,pca39,pca40,pca41,pca42,pca43,pca44,pca45,pca46,pca47,pca48,pca49
237129,-854.833838,-22.350535,-2.076370,0.440297,0.167337,-0.572813,-1.737709,1.939359,0.598997,-0.055869,-0.137670,-0.548254,-0.542312,0.298613,2.088348,0.848927,-0.670052,0.047257,0.212088,-0.128773,-0.051730,-0.342995,0.748262,-0.345468,-0.041718,0.132528,-0.142064,0.022071,-0.059568,0.017391,-0.065895,-0.026877,0.021622,-0.050331,0.102271,-0.066844,0.130475,0.551934,0.570553,-0.210321,-0.257098,0.002678,0.173589,-0.273864,-0.091286,0.039982,-0.003104,-0.021177,-0.032929,0.007210
4084368,-854.834141,-22.338888,-1.070168,-0.928008,0.109072,0.652765,1.616179,-0.338744,0.239957,-0.415272,-0.021180,-0.514327,1.026836,-0.051468,0.161360,-0.747007,0.014852,-0.861847,-0.250895,-0.119135,-0.133813,-0.564526,-0.642091,-0.424731,-0.075788,-0.197131,-0.176772,-0.352992,-0.315177,-0.040223,-0.038654,0.016543,-0.037551,-0.054355,0.055144,-0.039692,-0.116110,0.300608,-0.405332,0.724086,-0.120801,-0.010670,0.046517,-0.184143,-0.035533,-0.030640,0.081651,-0.096209,-0.106677,0.052260
1029548,147.713096,-77.108846,-2.508873,0.219041,0.183271,-0.052362,-1.468815,0.891923,0.671056,-0.296954,-0.029719,-1.096354,-0.770653,-0.639487,0.545767,-0.418540,-0.463175,0.530344,0.035675,-0.098316,-0.162530,-0.570328,-0.655989,-0.407679,-0.078887,-0.174331,-0.140635,-0.348590,-0.346866,-0.048655,-0.071341,-0.003489,-0.006502,-0.044877,0.046853,-0.069222,-0.125848,0.334864,-0.390163,0.723194,-0.134809,0.043661,0.057495,-0.173647,-0.040443,0.051205,0.059454,-0.133284,0.035110,0.017812
11163145,-854.831051,-22.315751,0.932171,-1.247213,0.264628,0.378429,1.092836,-0.747042,-0.019225,-0.376363,-0.005954,-0.375436,-0.156368,0.354747,-0.050449,-0.307445,-0.060767,-0.807887,-0.260028,-0.295990,-0.442526,0.695511,-0.105206,-0.350451,-0.069121,-0.235673,-0.162171,-0.369080,-0.373825,-0.024298,-0.026886,-0.015455,-0.058525,-0.081886,0.080335,-0.052614,-0.108992,0.310271,-0.403620,0.745579,-0.123430,-0.017223,0.049851,-0.204998,-0.022656,-0.019727,0.065002,-0.089580,-0.117794,0.048885
12079249,-119.476685,-38.252221,-0.403234,-0.088713,0.273856,0.885572,-0.906457,-0.789604,0.586895,-0.564328,0.173975,-1.293512,-1.045807,-0.493408,0.800315,-0.377958,-0.731310,0.488236,0.055338,-0.272298,-0.448355,0.696611,-0.107243,-0.339225,-0.062423,-0.195257,-0.137837,-0.358293,-0.378293,-0.034077,-0.053095,-0.025850,-0.046371,-0.087511,0.072096,-0.073661,-0.109357,0.314965,-0.391231,0.759849,-0.127114,0.031212,0.063480,-0.202389,-0.007835,0.036338,0.045206,-0.129339,0.028349,0.012828
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4840623,1446.920810,1804.960411,16.114719,-2.422619,1.610649,0.672875,0.969259,-0.592852,0.183473,-0.746838,0.094900,-1.218365,0.083777,0.472713,0.095803,-1.852203,1.153305,0.728748,0.041038,-0.053026,0.011305,-0.055739,0.029012,0.410431,0.748450,-0.466618,-0.281612,-0.387750,-0.185651,0.043890,-0.232125,0.200536,0.158804,-0.229552,-0.207657,-0.536611,-0.010809,-0.290260,-0.307992,-0.237195,-0.266611,0.054686,0.025843,-0.122352,-0.085951,-0.067543,0.059791,-0.030700,0.073462,-0.044373
3552614,1481.735235,1815.057061,15.927770,-1.738535,1.102894,0.766987,-1.048087,0.406609,0.001147,0.025129,-0.039639,0.033922,-1.720980,-0.440556,1.980514,-0.355618,-1.014802,0.132178,-0.074262,-0.259097,0.046958,-0.056985,-0.000113,0.344622,0.708491,-0.775352,-0.352902,-0.731150,-0.293085,0.061118,-0.207876,0.194648,0.166185,-0.225491,-0.233954,-0.589453,-0.242532,-0.328401,-0.386488,-0.269394,-0.055422,0.010671,0.013124,-0.213789,0.007008,-0.020031,-0.065807,-0.036923,0.014594,-0.126170
3919427,1548.791783,1824.509533,15.677436,-1.876378,1.746183,0.700656,-1.643239,0.711923,0.211755,-0.202342,0.123060,-0.448522,-1.297856,-0.692133,1.638966,-0.015737,-1.040638,0.249941,0.238326,0.004336,-0.026771,-0.041061,0.009652,0.485591,0.788253,-0.008662,-0.209193,0.100622,0.095191,0.042284,-0.231527,0.149686,0.177698,-0.159124,-0.165513,-0.559997,0.153371,-0.333992,-0.425397,-0.305514,-0.170967,0.031572,0.138557,0.118684,0.082152,-0.585153,0.406673,0.312473,-0.185027,-0.180811
592306,1570.035200,1838.599017,15.483712,-1.772993,1.485489,0.392773,-1.326895,0.203322,-0.108285,-0.113250,-0.141969,-0.310526,-3.409197,-2.777020,-1.569631,-0.193509,-0.997573,-0.807044,-0.165214,-0.074524,-0.014452,-0.052748,0.027151,0.422977,0.754466,-0.258362,-0.219672,-0.386010,-0.369264,0.060509,-0.232003,0.156227,0.178212,-0.164480,-0.196227,-0.583570,-0.050938,-0.257679,-0.450286,-0.351557,-0.199204,-0.054291,0.085845,-0.231116,-0.079747,0.021328,-0.092450,-0.039104,-0.022430,-0.052356


In [19]:
# Save the preprocessor as a parquet file
X.to_parquet("preprocessor.parquet")


In [None]:
import pickle
with open("preprocessor.pkl", "wb") as f:
    pickle.dump(preprocessor, f, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
import sys

for name, step in preprocessor.named_steps.items():
    print(name, type(step), sys.getsizeof(step))

initial_logger <class '__main__.ProgressLogger'> 48
delay_trend <class '__main__.DelayTrendEncoder'> 48
trend_logger <class '__main__.ProgressLogger'> 48
tail_reuse <class '__main__.SameDayTailReuseEncoder'> 48
reuse_logger <class '__main__.ProgressLogger'> 48
turnaround_delay <class '__main__.TurnaroundDelayEncoder'> 48
turnaround_logger <class '__main__.ProgressLogger'> 48
slack_time <class '__main__.SlackTimeEncoder'> 48
slack_logger <class '__main__.ProgressLogger'> 48
feature_prep <class 'sklearn.compose._column_transformer.ColumnTransformer'> 48
prep_logger <class '__main__.ProgressLogger'> 48
select <class 'sklearn.feature_selection._from_model.SelectFromModel'> 48
final_logger <class '__main__.ProgressLogger'> 48


# Models

In [15]:
from sklearn.metrics import make_scorer, precision_score, recall_score, precision_recall_curve
def precision_at_recall(y, y_pred, *, recall, **kwargs):
    """
    Calculate the precision at a given recall level.

    To use this with cross_val_score, you need to use the make_scorer function to
    create a scoring function that can be used with cross_val_score. For example:

        scorer = make_scorer(precision_at_recall, recall=0.9)  # 0.9 is the minimum recall level
        cross_val_score(model, X, y, cv=5, scoring=scorer)

    The default will only work for binary classification problems. You must change the
    average parameter if you want to use for multiclass classification. For example:

        scorer = make_scorer(precision_at_recall, recall=0.9, average="micro")
    """
    return precision_score(y, y_pred, **kwargs) if recall_score(y, y_pred, **kwargs) > recall else 0.0

def recall_at_precision(y, y_pred, *, precision, **kwargs):
    """
    Calculate the recall at a given precision level.

    To use this with cross_val_score, you need to use the make_scorer function to
    create a scoring function that can be used with cross_val_score. For example:

        scorer = make_scorer(recall_at_precision, precision=0.9)
        cross_val_score(model, X, y, cv=5, scoring=scorer)

    The default will only work for binary classification problems. You must change the
    average parameter if you want to use for multiclass classification. For example:

        scorer = make_scorer(recall_at_precision, precision=0.9, average="micro")
    """
    return recall_score(y, y_pred, **kwargs) if precision_score(y, y_pred, **kwargs) > precision else 0.0

In [None]:
X_sample = X # .sample(n=100000, random_state=42)
y_train = y_train.sample(n=100000, random_state=42)

## Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report, recall_score, precision_score, f1_score

dt = DecisionTreeClassifier(
    random_state=42,
    max_depth=20,
    min_samples_leaf=10,
    class_weight='balanced'
)

scorer = make_scorer(precision_at_recall, recall=0.2, average='weighted')

scores = cross_val_score(dt, X_sample, y_train, cv=5, scoring=scorer, verbose=1, n_jobs=64)

y_pred_cv = cross_val_predict(dt, X_sample, y_train, cv=5, n_jobs=64)

recall = recall_score(y_train, y_pred_cv, average='macro')
precision = precision_score(y_train, y_pred_cv, average='macro', zero_division=0)
f1 = f1_score(y_train, y_pred_cv, average='macro')

print(f"Recall: {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1 Score: {f1:.4f}")

print("Cross-Validation Classification Report:")
print(classification_report(y_train, y_pred_cv, zero_division=0))

print(f"Mean precision at recall 0.2: {scores.mean():.4f} ± {scores.std():.4f}")


[Parallel(n_jobs=64)]: Using backend LokyBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Done   5 out of   5 | elapsed:   19.0s finished


Recall: 0.0520
Precision: 0.0549
F1 Score: 0.0131
Cross-Validation Classification Report:
                                  precision    recall  f1-score   support

            Carrier Cancellation       0.00      0.14      0.01       424
      Major Delay - CarrierDelay       0.01      0.11      0.01       514
 Major Delay - LateAircraftDelay       0.01      0.12      0.01       649
          Major Delay - NASDelay       0.00      0.01      0.00        91
      Major Delay - WeatherDelay       0.00      0.06      0.00        96
     Medium Delay - CarrierDelay       0.02      0.06      0.03      1881
Medium Delay - LateAircraftDelay       0.03      0.02      0.03      2886
         Medium Delay - NASDelay       0.00      0.11      0.01       487
     Medium Delay - WeatherDelay       0.00      0.05      0.01       325
      Minor Delay - CarrierDelay       0.03      0.03      0.03      3539
 Minor Delay - LateAircraftDelay       0.04      0.02      0.02      4238
          Minor Delay

In [24]:
y_train.value_counts()

Flight_Status
On-Time                             78470
Minor Delay - LateAircraftDelay      4238
Unknown                              3977
Minor Delay - CarrierDelay           3539
Medium Delay - LateAircraftDelay     2886
Medium Delay - CarrierDelay          1881
Minor Delay - NASDelay               1060
Weather Cancellation                  781
Major Delay - LateAircraftDelay       649
Major Delay - CarrierDelay            514
Medium Delay - NASDelay               487
Carrier Cancellation                  424
Minor Delay - WeatherDelay            326
Medium Delay - WeatherDelay           325
NAS Cancellation                      205
Major Delay - WeatherDelay             96
Major Delay - NASDelay                 91
Security Issue                         51
Name: count, dtype: int64

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import cross_val_score, cross_val_predict

model = GradientBoostingClassifier(
    n_estimators=20,    
    max_depth=3,        
    learning_rate=0.1,  
    subsample=0.6,       
    random_state=42,
    verbose=2            
)

scorer = make_scorer(precision_at_recall, recall=0.2, average='weighted')

scores = cross_val_score(model, X_sample, y_train, cv=5, scoring=scorer, verbose=1, n_jobs=64)

y_pred_cv = cross_val_predict(model, X_sample, y_train, cv=5, n_jobs=64)

recall = recall_score(y_train, y_pred_cv, average='macro')
precision = precision_score(y_train, y_pred_cv, average='macro', zero_division=0)
f1 = f1_score(y_train, y_pred_cv, average='macro')

print(f"Recall: {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1 Score: {f1:.4f}")

print("Cross-Validation Classification Report:")
print(classification_report(y_train, y_pred_cv, zero_division=0))
print(f"Mean precision at recall 0.2: {scores.mean():.4f} ± {scores.std():.4f}")


[Parallel(n_jobs=64)]: Using backend LokyBackend with 64 concurrent workers.
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
[Parallel(n_jobs=64)]: Done   5 out of   5 | elapsed:  6.9min finished


Recall: 0.0556
Precision: 0.0482
F1 Score: 0.0494
Cross-Validation Classification Report:
                                  precision    recall  f1-score   support

            Carrier Cancellation       0.03      0.00      0.00       424
      Major Delay - CarrierDelay       0.02      0.00      0.00       514
 Major Delay - LateAircraftDelay       0.00      0.00      0.00       649
          Major Delay - NASDelay       0.00      0.00      0.00        91
      Major Delay - WeatherDelay       0.00      0.00      0.00        96
     Medium Delay - CarrierDelay       0.00      0.00      0.00      1881
Medium Delay - LateAircraftDelay       0.00      0.00      0.00      2886
         Medium Delay - NASDelay       0.03      0.00      0.00       487
     Medium Delay - WeatherDelay       0.00      0.00      0.00       325
      Minor Delay - CarrierDelay       0.00      0.00      0.00      3539
 Minor Delay - LateAircraftDelay       0.00      0.00      0.00      4238
          Minor Delay

## SGD Regression

In [None]:
from sklearn.linear_model import SGDClassifier

sgd_classifier = SGDClassifier(
    loss='log_loss',     
    alpha=0.0001,   
    n_jobs=64,
    random_state=42       
)

scorer = make_scorer(precision_at_recall, recall=0.2, average='weighted')

scores = cross_val_score(sgd_classifier, X_sample, y_train, cv=5, scoring=scorer, verbose=1, n_jobs=64)

y_pred_cv = cross_val_predict(sgd_classifier, X_sample, y_train, cv=5, n_jobs=64)

recall = recall_score(y_train, y_pred_cv, average='macro')
precision = precision_score(y_train, y_pred_cv, average='macro', zero_division=0)
f1 = f1_score(y_train, y_pred_cv, average='macro')

print(f"Recall: {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1 Score: {f1:.4f}")

print("Cross-Validation Classification Report:")
print(classification_report(y_train, y_pred_cv, zero_division=0))
print(f"Mean precision at recall 0.2: {scores.mean():.4f} ± {scores.std():.4f}")

[Parallel(n_jobs=64)]: Using backend LokyBackend with 64 concurrent workers.
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
[Parallel(n_jobs=64)]: Done   5 out of   5 | elapsed:   29.8s finished


Recall: 0.0555
Precision: 0.0436
F1 Score: 0.0488
Cross-Validation Classification Report:
                                  precision    recall  f1-score   support

            Carrier Cancellation       0.00      0.00      0.00       424
      Major Delay - CarrierDelay       0.00      0.00      0.00       514
 Major Delay - LateAircraftDelay       0.00      0.00      0.00       649
          Major Delay - NASDelay       0.00      0.00      0.00        91
      Major Delay - WeatherDelay       0.00      0.00      0.00        96
     Medium Delay - CarrierDelay       0.00      0.00      0.00      1881
Medium Delay - LateAircraftDelay       0.00      0.00      0.00      2886
         Medium Delay - NASDelay       0.00      0.00      0.00       487
     Medium Delay - WeatherDelay       0.00      0.00      0.00       325
      Minor Delay - CarrierDelay       0.00      0.00      0.00      3539
 Minor Delay - LateAircraftDelay       0.00      0.00      0.00      4238
          Minor Delay

In [None]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(
    penalty='l2',
    C=1.0,                 
    max_iter=200,        
    tol=1e-3,           
    class_weight='balanced', 
    n_jobs=64,          
    random_state=42,     
    verbose=1            
)

scores = cross_val_score(log_reg, X_sample, y_train, cv=5, scoring=scorer, verbose=1, n_jobs=64)

y_pred_cv = cross_val_predict(log_reg, X_sample, y_train, cv=5, n_jobs=64)

recall = recall_score(y_train, y_pred_cv, average='macro')
precision = precision_score(y_train, y_pred_cv, average='macro', zero_division=0)
f1 = f1_score(y_train, y_pred_cv, average='macro')

print(f"Recall: {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1 Score: {f1:.4f}")

print("Cross-Validation Classification Report:")
print(classification_report(y_train, y_pred_cv, zero_division=0))
print(f"Mean precision at recall 0.2: {scores.mean():.4f} ± {scores.std():.4f}")

[Parallel(n_jobs=64)]: Using backend LokyBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.


RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =          918     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  2.31230D+05    |proj g|=  4.52906D+05
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =          918     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  2.31230D+05    |proj g|=  5.08173D+05
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =          918     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  2.31230D+05    |proj g|=  5.06678D+05
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =          918     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  2.31230D+05    |proj g|=  6.82702D+05
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =     

 This problem is unconstrained.
 This problem is unconstrained.
 This problem is unconstrained.
 This problem is unconstrained.
 This problem is unconstrained.



At iterate   50    f=  2.30413D+05    |proj g|=  1.83678D+04

At iterate   50    f=  2.30460D+05    |proj g|=  7.64161D+03

At iterate   50    f=  2.30432D+05    |proj g|=  2.04515D+04

At iterate   50    f=  2.30363D+05    |proj g|=  3.13326D+04

At iterate   50    f=  2.30345D+05    |proj g|=  1.77993D+05

At iterate  100    f=  2.29901D+05    |proj g|=  6.31362D+04

At iterate  100    f=  2.29820D+05    |proj g|=  1.75573D+04

At iterate  100    f=  2.30006D+05    |proj g|=  7.56836D+04

At iterate  100    f=  2.29745D+05    |proj g|=  3.66572D+04

At iterate  100    f=  2.29718D+05    |proj g|=  9.28460D+04

At iterate  150    f=  2.28940D+05    |proj g|=  2.57050D+04

At iterate  150    f=  2.29335D+05    |proj g|=  4.78086D+04

At iterate  150    f=  2.29303D+05    |proj g|=  1.39796D+04

At iterate  150    f=  2.29740D+05    |proj g|=  2.78229D+04

At iterate  150    f=  2.29359D+05    |proj g|=  1.66245D+04

At iterate  200    f=  2.28725D+05    |proj g|=  8.22377D+04

       

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=64)]: Done   1 out of   1 | elapsed:   10.8s finished
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=64)]: Done   1 out of   1 | elapsed:   10.9s finished



At iterate  200    f=  2.28091D+05    |proj g|=  5.04936D+04

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
  918    200    214      1     0     0   5.049D+04   2.281D+05
  F =   228091.21236615849     

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT                 

At iterate  200    f=  2.29113D+05    |proj g|=  3.50903D+04

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final proj

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=64)]: Done   1 out of   1 | elapsed:   11.1s finished
[Parallel(n_jobs=64)]: Done   1 out of   1 | elapsed:   11.1s finished



At iterate  200    f=  2.28640D+05    |proj g|=  1.39139D+04

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
  918    200    217      1     0     0   1.391D+04   2.286D+05
  F =   228639.94628400612     

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT                 


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=64)]: Done   1 out of   1 | elapsed:   11.3s finished
[Parallel(n_jobs=64)]: Done   5 out of   5 | elapsed:   12.8s finished
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.


RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =          918     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  2.31230D+05    |proj g|=  5.08173D+05
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =          918     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  2.31230D+05    |proj g|=  5.06678D+05
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =          918     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  2.31230D+05    |proj g|=  4.50830D+05
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =          918     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  2.31230D+05    |proj g|=  4.52906D+05
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =     

 This problem is unconstrained.
 This problem is unconstrained.
 This problem is unconstrained.
 This problem is unconstrained.
 This problem is unconstrained.



At iterate   50    f=  2.30345D+05    |proj g|=  1.77993D+05

At iterate   50    f=  2.30432D+05    |proj g|=  2.04515D+04

At iterate   50    f=  2.30460D+05    |proj g|=  7.64161D+03

At iterate   50    f=  2.30413D+05    |proj g|=  1.83678D+04

At iterate   50    f=  2.30363D+05    |proj g|=  3.13326D+04

At iterate  100    f=  2.29718D+05    |proj g|=  9.28460D+04

At iterate  100    f=  2.29820D+05    |proj g|=  1.75573D+04

At iterate  100    f=  2.30006D+05    |proj g|=  7.56836D+04

At iterate  100    f=  2.29901D+05    |proj g|=  6.31362D+04

At iterate  100    f=  2.29745D+05    |proj g|=  3.66572D+04

At iterate  150    f=  2.29303D+05    |proj g|=  1.39796D+04

At iterate  150    f=  2.29359D+05    |proj g|=  1.66245D+04

At iterate  150    f=  2.29740D+05    |proj g|=  2.78229D+04

At iterate  150    f=  2.29335D+05    |proj g|=  4.78086D+04

At iterate  150    f=  2.28940D+05    |proj g|=  2.57050D+04

At iterate  200    f=  2.28655D+05    |proj g|=  1.18179D+04

       

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=64)]: Done   1 out of   1 | elapsed:   10.8s finished



At iterate  200    f=  2.29113D+05    |proj g|=  3.50903D+04

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
  918    200    213      1     0     0   3.509D+04   2.291D+05
  F =   229113.05220798624     

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT                 

At iterate  200    f=  2.28640D+05    |proj g|=  1.39139D+04

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final proj

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=64)]: Done   1 out of   1 | elapsed:   11.7s finished
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=64)]: Done   1 out of   1 | elapsed:   11.7s finished



At iterate  200    f=  2.28725D+05    |proj g|=  8.22377D+04

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
  918    200    211      1     0     0   8.224D+04   2.287D+05
  F =   228724.81232402046     

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT                 


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=64)]: Done   1 out of   1 | elapsed:   12.1s finished



At iterate  200    f=  2.28091D+05    |proj g|=  5.04936D+04

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
  918    200    214      1     0     0   5.049D+04   2.281D+05
  F =   228091.21236615849     

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT                 


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=64)]: Done   1 out of   1 | elapsed:   12.8s finished


Recall: 0.0682
Precision: 0.0681
F1 Score: 0.0058
Cross-Validation Classification Report:
                                  precision    recall  f1-score   support

            Carrier Cancellation       0.00      0.04      0.01       424
      Major Delay - CarrierDelay       0.01      0.07      0.01       514
 Major Delay - LateAircraftDelay       0.00      0.00      0.00       649
          Major Delay - NASDelay       0.00      0.10      0.00        91
      Major Delay - WeatherDelay       0.00      0.27      0.00        96
     Medium Delay - CarrierDelay       0.04      0.00      0.01      1881
Medium Delay - LateAircraftDelay       0.04      0.01      0.01      2886
         Medium Delay - NASDelay       0.00      0.05      0.01       487
     Medium Delay - WeatherDelay       0.00      0.05      0.01       325
      Minor Delay - CarrierDelay       0.03      0.00      0.00      3539
 Minor Delay - LateAircraftDelay       0.04      0.00      0.00      4238
          Minor Delay

## MLP

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, precision_score

mlp = MLPClassifier(
    hidden_layer_sizes=(100),
    max_iter=100,              
    early_stopping=True,
    random_state=42            
)

scorer = make_scorer(precision_at_recall, recall=0.2, average='weighted')

scores = cross_val_score(mlp, X_sample, y_train, cv=5, scoring=scorer, verbose=1, n_jobs=64)

y_pred_cv = cross_val_predict(mlp, X_sample, y_train, cv=5, n_jobs=64)

recall = recall_score(y_train, y_pred_cv, average='macro')
precision = precision_score(y_train, y_pred_cv, average='macro', zero_division=0)
f1 = f1_score(y_train, y_pred_cv, average='macro')

print(f"Recall: {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1 Score: {f1:.4f}")

print("Cross-Validation Classification Report:")
print(classification_report(y_train, y_pred_cv, zero_division=0))
print(f"Mean precision at recall 0.2: {scores.mean():.4f} ± {scores.std():.4f}")

[Parallel(n_jobs=64)]: Using backend LokyBackend with 64 concurrent workers.
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
[Parallel(n_jobs=64)]: Done   5 out of   5 | elapsed:   22.7s finished


Recall: 0.0556
Precision: 0.0571
F1 Score: 0.0489
Cross-Validation Classification Report:
                                  precision    recall  f1-score   support

            Carrier Cancellation       0.00      0.00      0.00       424
      Major Delay - CarrierDelay       0.00      0.00      0.00       514
 Major Delay - LateAircraftDelay       0.00      0.00      0.00       649
          Major Delay - NASDelay       0.00      0.00      0.00        91
      Major Delay - WeatherDelay       0.00      0.00      0.00        96
     Medium Delay - CarrierDelay       0.00      0.00      0.00      1881
Medium Delay - LateAircraftDelay       0.00      0.00      0.00      2886
         Medium Delay - NASDelay       0.00      0.00      0.00       487
     Medium Delay - WeatherDelay       0.00      0.00      0.00       325
      Minor Delay - CarrierDelay       0.00      0.00      0.00      3539
 Minor Delay - LateAircraftDelay       0.17      0.00      0.00      4238
          Minor Delay

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,        
    max_depth=15,            
    min_samples_split=10,    
    min_samples_leaf=5,     
    max_features='sqrt',     
    bootstrap=True,          
    class_weight='balanced', 
    n_jobs=64,              
    random_state=42,         
    verbose=1               
)

scorer = make_scorer(precision_at_recall, recall=0.2, average='weighted')

scores = cross_val_score(rf, X_sample, y_train, cv=5, scoring=scorer, verbose=1, n_jobs=64)

y_pred_cv = cross_val_predict(rf, X_sample, y_train, cv=5, n_jobs=64)

recall = recall_score(y_train, y_pred_cv, average='macro')
precision = precision_score(y_train, y_pred_cv, average='macro', zero_division=0)
f1 = f1_score(y_train, y_pred_cv, average='macro')

print(f"Recall: {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1 Score: {f1:.4f}")

print("Cross-Validation Classification Report:")
print(classification_report(y_train, y_pred_cv, zero_division=0))
print(f"Mean precision at recall 0.2: {scores.mean():.4f} ± {scores.std():.4f}")

[Parallel(n_jobs=64)]: Using backend LokyBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Done  74 out of 100 | elapsed:    7.6s remaining:    2.7s
[Parallel(n_jobs=64)]: Done  74 out of 100 | elapsed:    7.6s remaining:    2.7s
[Parallel(n_jobs=64)]: Done  74 out of 100 | elapsed:    7.7s remaining:    2.7s
[Parallel(n_jobs=64)]: Done  74 out of 100 | elapsed:    7.9s remaining:    2.8s
[Parallel(n_jobs=64)]: Done  74 out of 100 | elapsed:    8.1s remaining:    2.8s
[Parallel(n_jobs=64)]: Done 100 out of 100 | elapsed:    8.4s finished
[Parallel(n_jobs=64)]: Using backend 

Recall: 0.0564
Precision: 0.0559
F1 Score: 0.0523
Cross-Validation Classification Report:
                                  precision    recall  f1-score   support

            Carrier Cancellation       0.00      0.01      0.01       424
      Major Delay - CarrierDelay       0.01      0.01      0.01       514
 Major Delay - LateAircraftDelay       0.01      0.02      0.01       649
          Major Delay - NASDelay       0.00      0.00      0.00        91
      Major Delay - WeatherDelay       0.00      0.00      0.00        96
     Medium Delay - CarrierDelay       0.02      0.04      0.03      1881
Medium Delay - LateAircraftDelay       0.03      0.06      0.04      2886
         Medium Delay - NASDelay       0.01      0.02      0.01       487
     Medium Delay - WeatherDelay       0.00      0.01      0.01       325
      Minor Delay - CarrierDelay       0.04      0.07      0.05      3539
 Minor Delay - LateAircraftDelay       0.04      0.08      0.05      4238
          Minor Delay

In [None]:
# KNN configuration
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(
    n_neighbors=5,   
    weights='distance', 
    n_jobs=64           
)


scorer = make_scorer(precision_at_recall, recall=0.2, average='weighted')

scores = cross_val_score(knn, X_sample, y_train, cv=5, scoring=scorer, verbose=1, n_jobs=64)

y_pred_cv = cross_val_predict(knn, X_sample, y_train, cv=5, n_jobs=64)

recall = recall_score(y_train, y_pred_cv, average='macro')
precision = precision_score(y_train, y_pred_cv, average='macro', zero_division=0)
f1 = f1_score(y_train, y_pred_cv, average='macro')

print(f"Recall: {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1 Score: {f1:.4f}")

print("Cross-Validation Classification Report:")
print(classification_report(y_train, y_pred_cv, zero_division=0))
print(f"Mean precision at recall 0.2: {scores.mean():.4f} ± {scores.std():.4f}")

[Parallel(n_jobs=64)]: Using backend LokyBackend with 64 concurrent workers.
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
[Parallel(n_jobs=64)]: Done   5 out of   5 | elapsed:   12.6s finished


Recall: 0.0560
Precision: 0.0589
F1 Score: 0.0513
Cross-Validation Classification Report:
                                  precision    recall  f1-score   support

            Carrier Cancellation       0.00      0.00      0.00       424
      Major Delay - CarrierDelay       0.00      0.00      0.00       514
 Major Delay - LateAircraftDelay       0.00      0.00      0.00       649
          Major Delay - NASDelay       0.00      0.00      0.00        91
      Major Delay - WeatherDelay       0.10      0.01      0.02        96
     Medium Delay - CarrierDelay       0.02      0.00      0.00      1881
Medium Delay - LateAircraftDelay       0.02      0.00      0.00      2886
         Medium Delay - NASDelay       0.00      0.00      0.00       487
     Medium Delay - WeatherDelay       0.00      0.00      0.00       325
      Minor Delay - CarrierDelay       0.03      0.00      0.01      3539
 Minor Delay - LateAircraftDelay       0.05      0.01      0.01      4238
          Minor Delay

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

extra_trees = ExtraTreesClassifier(
    n_estimators=100,
    max_depth=15,
    class_weight='balanced',
    n_jobs=64,
    random_state=42
)

scorer = make_scorer(precision_at_recall, recall=0.2, average='weighted')

scores = cross_val_score(extra_trees, X_sample, y_train, cv=5, scoring=scorer, verbose=1, n_jobs=64)

y_pred_cv = cross_val_predict(extra_trees, X_sample, y_train, cv=5, n_jobs=64)

recall = recall_score(y_train, y_pred_cv, average='macro')
precision = precision_score(y_train, y_pred_cv, average='macro', zero_division=0)
f1 = f1_score(y_train, y_pred_cv, average='macro')

print(f"Recall: {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1 Score: {f1:.4f}")

print("Cross-Validation Classification Report:")
print(classification_report(y_train, y_pred_cv, zero_division=0))
print(f"Mean precision at recall 0.2: {scores.mean():.4f} ± {scores.std():.4f}")

[Parallel(n_jobs=64)]: Using backend LokyBackend with 64 concurrent workers.
[Parallel(n_jobs=64)]: Done   5 out of   5 | elapsed:    4.3s finished


Recall: 0.0566
Precision: 0.0560
F1 Score: 0.0548
Cross-Validation Classification Report:
                                  precision    recall  f1-score   support

            Carrier Cancellation       0.00      0.00      0.00       424
      Major Delay - CarrierDelay       0.01      0.01      0.01       514
 Major Delay - LateAircraftDelay       0.00      0.00      0.00       649
          Major Delay - NASDelay       0.00      0.00      0.00        91
      Major Delay - WeatherDelay       0.01      0.02      0.01        96
     Medium Delay - CarrierDelay       0.02      0.01      0.01      1881
Medium Delay - LateAircraftDelay       0.03      0.01      0.02      2886
         Medium Delay - NASDelay       0.01      0.01      0.01       487
     Medium Delay - WeatherDelay       0.00      0.00      0.00       325
      Minor Delay - CarrierDelay       0.04      0.02      0.02      3539
 Minor Delay - LateAircraftDelay       0.04      0.02      0.02      4238
          Minor Delay

# Shortlisting Models

1. **Logistic Regression**
    - Precision: 0.0681
    - Recall: 0.0682
    - Precision at Recall 0.2: NA

Precision at recall is non existent because it didnt hit our threshold of 0.2. We are hoping that because it was our best model, giving it more data (we trained it on 100k),and fine tuning it will yield better results and hopefully get the threshold over the 0.2 (and ideally hit 0.25)

2. **MLP**
    - Recall: 0.0556
    - Precision: 0.0571
    - Precision at Recall 0.2: 0.6195
    
On the surface this precision at recall is great and got me super excited, however upon further notice its because it just said it was On Time everytime. Hopefully with further tuning and more data it improves. I am mostly picking this one because its a neural network and I want to use one of those even though it was worse 

# Fine-Tuning models

In [16]:
import pickle
with open('preprocessor.pkl', 'rb') as f:
    preprocessor = pickle.load(f)

X = preprocessor.transform(X_train)

  preprocessor = pickle.load(f)
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


OSError: [WinError 87] The parameter is incorrect

In [18]:
X = X.sample(n=100000, random_state=42)
y_train = y_train.sample(n=100000, random_state=42)

In [20]:
from sklearn.model_selection import RandomizedSearchCV, cross_val_predict, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    recall_score, precision_score, f1_score, classification_report, make_scorer
)
from scipy.stats import loguniform, uniform

precision_at_recall_scorer = make_scorer(
    precision_at_recall,
    recall=0.2,
    average='macro',
    zero_division=0
)

log_reg = LogisticRegression(
    solver='saga',
    max_iter=5000,
    class_weight='balanced',
    n_jobs=4,
    random_state=42,
    verbose=1
)

param_dist = {
    'C': loguniform(1e-3, 1e3),
    'penalty': ['l1', 'l2'],
    'tol': uniform(1e-4, 1e-3),
    'solver': ['saga'],
    'fit_intercept': [True],
    'max_iter': [300],  
    'multi_class': ['auto', 'ovr'],  
}

random_search = RandomizedSearchCV(
    log_reg,
    param_distributions=param_dist,
    n_iter=5,  
    scoring=precision_at_recall_scorer,
    cv=3,       
    verbose=1,
    n_jobs=4,
    random_state=42
)

random_search.fit(X, y_train)
best_model = random_search.best_estimator_

y_pred_cv = cross_val_predict(best_model, X, y_train, cv=3, n_jobs=4)

recall = recall_score(y_train, y_pred_cv, average='macro')
precision = precision_score(y_train, y_pred_cv, average='macro', zero_division=0)
f1 = f1_score(y_train, y_pred_cv, average='macro')

print("Best Parameters Found:", random_search.best_params_)
print(f"Recall: {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1 Score: {f1:.4f}")
print("Classification Report:")
print(classification_report(y_train, y_pred_cv, zero_division=0))

custom_cv_scores = cross_val_score(
    best_model,
    X,
    y_train,
    cv=3,
    scoring=precision_at_recall_scorer,
    n_jobs=4,
    verbose=1
)

print(f"Mean Precision at Recall ≥ 0.2: {custom_cv_scores.mean():.4f} ± {custom_cv_scores.std():.4f}")


Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.


max_iter reached after 147 seconds




Best Parameters Found: {'C': np.float64(0.1767016940294795), 'fit_intercept': True, 'max_iter': 300, 'multi_class': 'auto', 'penalty': 'l1', 'solver': 'saga', 'tol': np.float64(0.0008319939418114052)}
Recall: 0.0516
Precision: 0.0657
F1 Score: 0.0053
Classification Report:
                                  precision    recall  f1-score   support

            Carrier Cancellation       0.00      0.03      0.01       424
      Major Delay - CarrierDelay       0.00      0.05      0.01       514
 Major Delay - LateAircraftDelay       0.01      0.03      0.01       649
          Major Delay - NASDelay       0.00      0.05      0.00        91
      Major Delay - WeatherDelay       0.00      0.16      0.00        96
     Medium Delay - CarrierDelay       0.01      0.00      0.00      1881
Medium Delay - LateAircraftDelay       0.03      0.01      0.01      2886
         Medium Delay - NASDelay       0.01      0.05      0.01       487
     Medium Delay - WeatherDelay       0.00      0.07      

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


Mean Precision at Recall ≥ 0.2: 0.0000 ± 0.0000


[Parallel(n_jobs=4)]: Done   3 out of   3 | elapsed:  1.8min finished
