# Two-Stage Flight Delay Classification Pipeline

**Author:** Daniel Costa  
**Project:** Flight Delay Prediction (W261 Final Project)  
**Dataset:** US Domestic Flights (2015-2019)

---

## Overview

This notebook implements a **two-stage classification pipeline** for predicting flight delays. The approach combines quantile regression with a classifier to handle the inherent uncertainty in delay prediction.

### Model Architecture

```
┌─────────────────────────────────────────────────────────────────────────┐
│                     TWO-STAGE INTERVAL CLASSIFIER                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   STAGE 1: Quantile Regressors                                          │
│   ┌─────────────────────┐    ┌─────────────────────┐                    │
│   │  XGBoost Regressor  │    │  XGBoost Regressor  │                    │
│   │  (Low Quantile)     │    │  (High Quantile)    │                    │
│   │  α = 0.50           │    │  α = 0.98           │                    │
│   └─────────┬───────────┘    └─────────┬───────────┘                    │
│             │                          │                                │
│             ▼                          ▼                                │
│         qLow pred                  qHigh pred                           │
│             │                          │                                │
│             └────────────┬─────────────┘                                │
│                          │                                              │
│                          ▼                                              │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                    DECISION LOGIC                                │   │
│   │                                                                  │   │
│   │  if threshold < qLow:   → DELAYED (confident)                   │   │
│   │  if threshold > qHigh:  → ON-TIME (confident)                   │   │
│   │  else:                  → AMBIGUOUS (use Stage 2)               │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                          │                                              │
│                          ▼ (ambiguous cases only)                       │
│   STAGE 2: Binary Classifier                                            │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │  XGBoost Classifier                                              │   │
│   │  Features: [original features, qLow, qHigh, qHigh - qLow]       │   │
│   │  Trained on undersampled ambiguous cases                        │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```
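
The decision logic in the diagram can be sketched in plain Python for a single flight. This is a minimal illustration rather than the Spark implementation; the function name `route` and the sample quantile values are hypothetical.

```python
def route(q_low: float, q_high: float, threshold: float = 15.0) -> str:
    """Return which branch of the decision logic handles one flight."""
    if threshold < q_low:
        # Even the low quantile exceeds the threshold: confident delay.
        return "delayed"
    if threshold > q_high:
        # Even the high quantile is under the threshold: confident on-time.
        return "on_time"
    # Threshold lies inside [q_low, q_high]: defer to the Stage 2 classifier.
    return "ambiguous"


# Hypothetical (q_low, q_high) predictions in minutes.
examples = [(22.0, 95.0), (0.0, 9.5), (3.0, 48.0)]
for q_low, q_high in examples:
    print(q_low, q_high, route(q_low, q_high))
```

Only the third example, where the 15-minute threshold falls between the two bounds, would be passed to Stage 2.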

### Key Results

| Metric | Training | Test |
|--------|----------|------------|
| **F2 Score** | 0.617 | 0.621 |
| Precision | - | - |
| Recall | - | - |
| PR-AUC | - | - |

### Why Two Stages?

1. **Confidence-Based Routing**: Quantile regressors identify cases where the model is confident vs. uncertain
2. **Focused Classifier**: The second-stage classifier specializes on ambiguous cases
3. **Class Imbalance Handling**: Undersampling the majority class in the ambiguous region gives the Stage 2 classifier a balanced training set

---

## Table of Contents

1. [Setup & Configuration](#1-setup--configuration)
2. [Data Loading & Preprocessing](#2-data-loading--preprocessing)
3. [Model Architecture](#3-model-architecture)
   - [3.1 IntervalClassifier](#31-intervalclassifier)
4. [Training Pipeline](#4-training-pipeline)
5. [Evaluation](#5-evaluation)
6. [Summary](#6-summary)



## 1. Setup & Configuration

In [None]:
# -----------------------------------------------------------------------------
# Imports
# -----------------------------------------------------------------------------
from pyspark.sql import Window
from pyspark.sql.functions import col, lit, when, isnan, count as f_count
import pyspark.sql.functions as F
from pyspark.sql.types import NumericType, StringType
from pyspark.sql import DataFrame

from pyspark.ml import Estimator, Model, Pipeline
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator

from xgboost.spark import SparkXGBRegressor, SparkXGBClassifier

from pyspark import StorageLevel
from mlflow.models.signature import infer_signature
import mlflow
import numpy as np
import math

# -----------------------------------------------------------------------------
# Configuration
# -----------------------------------------------------------------------------
# Data paths
BASE_PATH = "dbfs:/student-groups/Group_2_2"
DATASET_NAME = "5_year"
INPUT_DIR = f"{BASE_PATH}/{DATASET_NAME}_custom_joined/fe_graph_and_holiday/training_splits"
OUTPUT_DIR = f"{BASE_PATH}/2_stage_5y_preds"

# Model hyperparameters
DELAY_THRESHOLD = 15.0  # minutes - flights delayed >= 15 min are considered delayed
LOW_QUANTILE = 0.50     # Lower bound quantile
HIGH_QUANTILE = 0.98    # Upper bound quantile
MAX_DEPTH = 6
N_ESTIMATORS = 100
LEARNING_RATE = 0.05
NUM_ROUND = 200
NUM_WORKERS = 7
CLASSIFIER_MIN_CHILD_WEIGHT = 20

# MLflow configuration
EXPERIMENT_NAME = "/Shared/team_2_2/mlflow-2-stage-fe_graph_holiday-1"
MODEL_NAME = "XGB_5y_2_STAGE_fe_graph_holiday"

# Random seed for reproducibility
RANDOM_SEED = 42

# Enable MLflow autologging
mlflow.spark.autolog()

## 2. Data Loading & Preprocessing

In [None]:
# -----------------------------------------------------------------------------
# Feature Definitions
# -----------------------------------------------------------------------------

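# Base columns from flight data. DEP_DELAY_NEW is the label (delay in minutes),
# not a model input; it is left out of the assembler inputs in the training section.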
BASE_FEATURES = [
    "QUARTER", "MONTH", "YEAR", "DAY_OF_MONTH", "DAY_OF_WEEK",
    "OP_CARRIER", "ORIGIN_AIRPORT_SEQ_ID", "DEST_AIRPORT_SEQ_ID",
    "CRS_ELAPSED_TIME", "DISTANCE", "DEP_DELAY_NEW", "utc_timestamp"
]

# Engineered features from Phase 2
ENGINEERED_FEATURES = [
    "CRS_DEP_MINUTES", "prev_flight_delay_in_minutes", "prev_flight_delay",
    "origin_delays_4h", "delay_origin_7d", "delay_origin_carrier_7d",
    "delay_route_7d", "flight_count_24h", "LANDING_TIME_DIFF_MINUTES",
    "AVG_ARR_DELAY_ORIGIN", "AVG_TAXI_OUT_ORIGIN"
]

# Weather features
WEATHER_FEATURES = [
    "HourlyDryBulbTemperature", "HourlyDewPointTemperature",
    "HourlyRelativeHumidity", "HourlyAltimeterSetting", "HourlyVisibility",
    "HourlyStationPressure", "HourlyWetBulbTemperature", "HourlyPrecipitation",
    "HourlyCloudCoverage", "HourlyCloudElevation", "HourlyWindSpeed"
]

# Graph-based features
GRAPH_FEATURES = [
    "out_degree", "in_degree", "weighted_out_degree", "weighted_in_degree",
    "N_RUNWAYS", "betweenness_unweighted", "closeness", "betweenness",
    "avg_origin_dep_delay", "avg_dest_arr_delay", "avg_daily_route_flights",
    "avg_route_delay", "avg_hourly_flights"
]

# Additional categorical/derived features
OTHER_FEATURES = [
    "IS_HOLIDAY", "IS_HOLIDAY_WINDOW", "AIRPORT_HUB_CLASS",
    "RATING", "AIRLINE_CATEGORY"
]

# Combine all features for model input
MODEL_COLUMNS = (BASE_FEATURES + ENGINEERED_FEATURES + 
                 WEATHER_FEATURES + GRAPH_FEATURES + OTHER_FEATURES)

In [None]:
def deduplicate_columns(df: DataFrame) -> DataFrame:
    """
    Remove duplicate column names from DataFrame.
    
    Args:
        df: Input DataFrame potentially with duplicate columns
        
    Returns:
        DataFrame with unique column names only
    """
    unique_cols = []
    seen = set()
    for col_name in df.columns:
        if col_name not in seen:
            unique_cols.append(col_name)
            seen.add(col_name)
    return df.select(unique_cols)


def clean_features(df: DataFrame, feature_cols: list) -> DataFrame:
    """
    Remove rows with null, NaN, or infinite values in feature columns.
    
    Args:
        df: Input DataFrame
        feature_cols: List of feature column names to check
        
    Returns:
        DataFrame with clean feature values
    """
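    for f in feature_cols:
        data_type = df.schema[f].dataType

        # isinstance matches every numeric subtype (IntegerType, DoubleType, ...);
        # comparing the direct base class to NumericType never matches them.
        if isinstance(data_type, NumericType):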
            df = df.filter(
                col(f).isNotNull() &
                (~isnan(col(f))) &
                (col(f) != float("inf")) &
                (col(f) != float("-inf"))
            )
        elif isinstance(df.schema[f].dataType, StringType):
            df = df.filter(col(f).isNotNull() & (col(f) != ""))
        else:
            df = df.filter(col(f).isNotNull())
    
    return df


def load_data(input_dir: str) -> tuple:
    """
    Load train and test datasets from parquet files.
    
    Args:
        input_dir: Path to directory containing train/test parquet files
        
    Returns:
        Tuple of (train_df, test_df)
    """
    train_df = spark.read.parquet(f"{input_dir}/train.parquet")
    test_df = spark.read.parquet(f"{input_dir}/test.parquet")
    return train_df, test_df


def prepare_data(train_df: DataFrame, test_df: DataFrame, 
                 model_cols: list) -> tuple:
    """
    Prepare data for training by filtering nulls and selecting features.
    
    Args:
        train_df: Raw training DataFrame
        test_df: Raw test DataFrame
        model_cols: List of columns to include
        
    Returns:
        Tuple of (cleaned_train_df, cleaned_test_df)
    """
    # Filter out null labels and select model columns
    train_df = train_df.filter(F.col("DEP_DELAY_NEW").isNotNull()).select(model_cols)
    test_df = test_df.filter(F.col("DEP_DELAY_NEW").isNotNull()).select(model_cols)
    
    # Remove duplicate columns
    train_df = deduplicate_columns(train_df)
    test_df = deduplicate_columns(test_df)
    
    # Clean feature values
    train_df = clean_features(train_df, model_cols)
    test_df = clean_features(test_df, model_cols)
    
    return train_df, test_df

In [None]:
# Load and prepare data
print("Loading data...")
train_df_raw, test_df_raw = load_data(INPUT_DIR)

print("Preparing data...")
train_df, test_df = prepare_data(train_df_raw, test_df_raw, MODEL_COLUMNS)

# Cache for performance
train_df = train_df.persist(StorageLevel.MEMORY_AND_DISK)
test_df = test_df.persist(StorageLevel.MEMORY_AND_DISK)

print(f"Training samples: {train_df.count():,}")
print(f"Test samples:     {test_df.count():,}")

## 3. Model Architecture

### 3.1 IntervalClassifier

The `IntervalClassifier` implements a two-stage approach:
1. **Stage 1**: Two quantile regressors predict confidence bounds
2. **Stage 2**: A classifier handles ambiguous cases where the threshold falls between bounds
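
Each Stage 1 regressor is trained with a quantile objective (`reg:quantileerror` in the model definition), also known as pinball loss. The sketch below, using made-up delay values, shows why the two regressors settle on different bounds: at α = 0.98 an under-prediction costs 49 times as much per minute as an over-prediction, which pushes the prediction toward the upper tail. The function name and sample data are illustrative only.

```python
def pinball_loss(y_true, y_pred, alpha):
    """Mean pinball (quantile) loss for quantile level alpha."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction (diff > 0) costs alpha per minute,
        # over-prediction costs (1 - alpha) per minute.
        total += alpha * diff if diff >= 0 else (alpha - 1.0) * diff
    return total / len(y_true)


delays = [0.0, 0.0, 5.0, 12.0, 90.0]  # hypothetical delays in minutes

# Compare a constant prediction of 5 minutes against one of 90 minutes.
for alpha in (0.50, 0.98):
    low = pinball_loss(delays, [5.0] * len(delays), alpha)
    high = pinball_loss(delays, [90.0] * len(delays), alpha)
    print(f"alpha={alpha}: loss(pred=5)={low:.2f}, loss(pred=90)={high:.2f}")
```

On this toy data the 5-minute prediction has the lower loss at α = 0.50, while the 90-minute prediction has the lower loss at α = 0.98.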

In [None]:
class IntervalClassifier(Estimator, DefaultParamsReadable, DefaultParamsWritable):
    """
    Two-stage classifier using quantile regression for confidence intervals.
    
    Stage 1: Fits lower and upper quantile regressors to predict delay bounds.
    Stage 2: Trains a classifier on ambiguous cases (where threshold falls 
             between predicted bounds).
    
    At inference time:
    - If threshold < lower_bound: predict DELAYED (confident)
    - If threshold > upper_bound: predict ON-TIME (confident)  
    - Otherwise: use classifier prediction (ambiguous)
    
    Args:
        lowerEstimator: Regressor for lower quantile (e.g., 0.5)
        upperEstimator: Regressor for upper quantile (e.g., 0.98)
        baseClassifier: Binary classifier for ambiguous cases
        labelCol: Name of label column
        featuresCol: Name of features column
        predictionCol: Name of output prediction column
        threshold: Delay threshold in minutes (default: 15.0)
        quantile_gap: Gap between quantiles (default: 0.1); stored on the
            estimator but not used by _fit
        undersample_majority: Whether to balance classes in ambiguous region
        undersample_seed: Random seed for undersampling
    """

    def __init__(self,
                 lowerEstimator,
                 upperEstimator,
                 baseClassifier,
                 labelCol="delay_minutes",
                 featuresCol="features",
                 predictionCol="final_prediction",
                 threshold=15.0,
                 quantile_gap=0.1,
                 qLowCol="low_pred",
                 qHighCol="high_pred",
                 clfPredictionCol="clf_prediction",
                 undersample_majority=True,
                 undersample_seed=42):
        super().__init__()
        self.lowerEstimator = lowerEstimator
        self.upperEstimator = upperEstimator
        self.baseClassifier = baseClassifier
        self.labelCol = labelCol
        self.featuresCol = featuresCol
        self.predictionCol = predictionCol
        self.threshold = float(threshold)
        self.quantile_gap = float(quantile_gap)
        self.qLowCol = qLowCol
        self.qHighCol = qHighCol
        self.clfPredictionCol = clfPredictionCol
        self.undersample_majority = undersample_majority
        self.undersample_seed = undersample_seed
        self.qDiffCol = f"{qHighCol}_minus_{qLowCol}"

    def _fit(self, dataset: DataFrame) -> Model:
        """Fit the two-stage model."""
        # Stage 1: Fit quantile regressors
        print("Training Lower Quantile Estimator...")
        lowerModel = self.lowerEstimator.fit(dataset)
        
        print("Training Upper Quantile Estimator...")
        upperModel = self.upperEstimator.fit(dataset)

        # Add quantile predictions
        df_q = lowerModel.transform(dataset).withColumnRenamed("prediction", self.qLowCol)
        df_q = upperModel.transform(df_q).withColumnRenamed("prediction", self.qHighCol)

        # Filter to ambiguous cases (threshold between bounds)
        print("Filtering ambiguous cases...")
        thr = lit(self.threshold)
        df_ambig = df_q.filter(
            (col(self.qLowCol) <= thr) & (thr <= col(self.qHighCol))
        )

        # Create binary label
        df_ambig = df_ambig.withColumn(
            "bin_label",
            (col(self.labelCol) >= thr).cast("double")
        )

        # Undersample majority class
        if self.undersample_majority:
            df_ambig = self._undersample(df_ambig)

        # Augment features with quantile predictions
        df_ambig = self._augment_features(df_ambig)

        # Stage 2: Fit classifier on ambiguous cases
        print("Training Stage 2 Classifier...")
        clf = self.baseClassifier
        clf.setParams(
            label_col="bin_label",
            features_col=self.featuresCol,
            prediction_col=self.clfPredictionCol
        )
        clfModel = clf.fit(df_ambig)

        return IntervalClassifierModel(
            lowerModel=lowerModel,
            upperModel=upperModel,
            clfModel=clfModel,
            labelCol=self.labelCol,
            featuresCol=self.featuresCol,
            predictionCol=self.predictionCol,
            threshold=self.threshold,
            qLowCol=self.qLowCol,
            qHighCol=self.qHighCol,
            qDiffCol=self.qDiffCol,
            clfPredictionCol=self.clfPredictionCol
        )

    def _undersample(self, df_ambig: DataFrame) -> DataFrame:
        """Undersample majority class to balance ambiguous cases."""
        print("Undersampling majority class...")
        class_counts = (
            df_ambig.groupBy("bin_label")
            .agg(f_count("*").alias("cnt"))
            .collect()
        )

        if len(class_counts) != 2:
            raise ValueError("Ambiguous cases must contain both classes.")

        (label0, cnt0), (label1, cnt1) = [
            (row["bin_label"], row["cnt"]) for row in class_counts
        ]

        if cnt0 <= cnt1:
            minority_label, minority_cnt = label0, cnt0
            majority_label, majority_cnt = label1, cnt1
        else:
            minority_label, minority_cnt = label1, cnt1
            majority_label, majority_cnt = label0, cnt0

        print(f"  Majority class {majority_label}: {majority_cnt:,} → {minority_cnt:,}")

        if majority_cnt > 0 and minority_cnt > 0:
            frac_majority = float(minority_cnt) / float(majority_cnt)
            fractions = {
                float(minority_label): 1.0,
                float(majority_label): frac_majority
            }
            df_ambig = df_ambig.sampleBy(
                "bin_label", fractions=fractions, seed=self.undersample_seed
            )

        return df_ambig

    def _augment_features(self, df: DataFrame) -> DataFrame:
        """Add quantile predictions as additional features."""
        df = df.withColumn(self.qDiffCol, col(self.qHighCol) - col(self.qLowCol))
        
        augFeaturesCol = self.featuresCol + "_aug"
        assembler = VectorAssembler(
            inputCols=[self.featuresCol, self.qLowCol, self.qHighCol, self.qDiffCol],
            outputCol=augFeaturesCol
        )
        df = assembler.transform(df)
        df = df.drop(self.featuresCol).withColumnRenamed(augFeaturesCol, self.featuresCol)
        
        return df
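
The sampling fractions that `_undersample` passes to `sampleBy` follow directly from the class counts. Here is a small sketch with hypothetical counts; the helper name `undersample_fractions` is not part of the notebook. Because `sampleBy` samples rows independently, the retained majority count should be expected to land near the minority count rather than match it exactly.

```python
def undersample_fractions(counts: dict) -> dict:
    """Map each class label to the fraction of its rows to keep."""
    if len(counts) != 2:
        raise ValueError("Expected exactly two classes.")
    # Sort by count so the first entry is the minority class.
    (minority, n_min), (majority, n_maj) = sorted(
        counts.items(), key=lambda kv: kv[1]
    )
    if n_min == 0:
        raise ValueError("Both classes must be non-empty.")
    # Keep every minority row; keep majority rows at the minority/majority ratio.
    return {minority: 1.0, majority: n_min / n_maj}


# Hypothetical ambiguous-region counts: label 0.0 = on-time, 1.0 = delayed.
fractions = undersample_fractions({0.0: 800_000, 1.0: 200_000})
print(fractions)
```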

In [None]:
class IntervalClassifierModel(Model, DefaultParamsReadable, DefaultParamsWritable):
    """
    Fitted two-stage interval classifier model.
    
    Applies the decision logic:
    - threshold < qLow: predict 1 (delayed)
    - threshold > qHigh: predict 0 (on-time)
    - otherwise: use classifier prediction
    """

    def __init__(self, lowerModel, upperModel, clfModel, labelCol, featuresCol,
                 predictionCol, threshold, qLowCol, qHighCol, qDiffCol, clfPredictionCol):
        super().__init__()
        self.lowerModel = lowerModel
        self.upperModel = upperModel
        self.clfModel = clfModel
        self.labelCol = labelCol
        self.featuresCol = featuresCol
        self.predictionCol = predictionCol
        self.threshold = float(threshold)
        self.qLowCol = qLowCol
        self.qHighCol = qHighCol
        self.qDiffCol = qDiffCol
        self.clfPredictionCol = clfPredictionCol

    def _transform(self, dataset: DataFrame) -> DataFrame:
        """Apply two-stage prediction logic."""
        # Get quantile predictions
        df_q = self.lowerModel.transform(dataset).withColumnRenamed("prediction", self.qLowCol)
        df_q = self.upperModel.transform(df_q).withColumnRenamed("prediction", self.qHighCol)

        # Augment features
        df_q = df_q.withColumn(self.qDiffCol, col(self.qHighCol) - col(self.qLowCol))
        augFeaturesCol = self.featuresCol + "_aug"
        assembler = VectorAssembler(
            inputCols=[self.featuresCol, self.qLowCol, self.qHighCol, self.qDiffCol],
            outputCol=augFeaturesCol
        )
        df_q = assembler.transform(df_q)
        df_q = df_q.drop(self.featuresCol).withColumnRenamed(augFeaturesCol, self.featuresCol)

        # Get classifier predictions
        df_clf = self.clfModel.transform(df_q)

        # Apply decision logic
        thr = lit(self.threshold)
        df_final = df_clf.withColumn(
            self.predictionCol,
            when(thr < col(self.qLowCol), lit(1.0))       # Confident: delayed
            .when(thr > col(self.qHighCol), lit(0.0))     # Confident: on-time
            .otherwise(col(self.clfPredictionCol))        # Ambiguous: use classifier
        )

        # Add decision source for analysis
        df_final = df_final.withColumn(
            "decision_source",
            when(thr < col(self.qLowCol), lit("quantile_high"))
            .when(thr > col(self.qHighCol), lit("quantile_low"))
            .otherwise(lit("classifier"))
        )

        return df_final
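
Both `_augment_features` and `_transform` append three values to the feature vector before Stage 2: the low quantile prediction, the high quantile prediction, and their difference (the interval width). A plain-Python sketch of that augmentation for one row, with hypothetical values and a hypothetical helper name:

```python
def augment(features, q_low, q_high):
    """Append q_low, q_high, and the interval width to a feature list."""
    return list(features) + [q_low, q_high, q_high - q_low]


row = [3.0, 640.0, 1.0]  # hypothetical original feature values
augmented = augment(row, q_low=4.0, q_high=52.0)
print(augmented)
```

The interval width gives the classifier a direct signal of how uncertain Stage 1 was about the flight.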

## 4. Training Pipeline

In [None]:
# -----------------------------------------------------------------------------
# Feature Encoding
# -----------------------------------------------------------------------------

# Categorical encoding for carrier
carrier_indexer = StringIndexer(
    inputCol="OP_CARRIER", 
    outputCol="carrier_idx", 
    handleInvalid="keep"
)
carrier_encoder = OneHotEncoder(
    inputCol="carrier_idx", 
    outputCol="carrier_vec"
)

# Assembler inputs: numeric columns plus the one-hot carrier vector
NUMERICAL_FEATURES = [
    "QUARTER", "MONTH", "YEAR", "DAY_OF_MONTH", "DAY_OF_WEEK",
    "carrier_vec", "CRS_ELAPSED_TIME", "DISTANCE", "CRS_DEP_MINUTES",
    "prev_flight_delay_in_minutes", "prev_flight_delay", "origin_delays_4h",
    "delay_origin_7d", "delay_origin_carrier_7d", "delay_route_7d",
    "flight_count_24h", "LANDING_TIME_DIFF_MINUTES", "AVG_ARR_DELAY_ORIGIN",
    "AVG_TAXI_OUT_ORIGIN"
] + WEATHER_FEATURES + GRAPH_FEATURES + [
    "IS_HOLIDAY", "IS_HOLIDAY_WINDOW", "AIRPORT_HUB_CLASS", "RATING", "AIRLINE_CATEGORY"
]

# Vector assembler
assembler = VectorAssembler(
    inputCols=NUMERICAL_FEATURES,
    outputCol="features",
    handleInvalid="skip"
)

In [None]:
# -----------------------------------------------------------------------------
# Model Definition
# -----------------------------------------------------------------------------

# Stage 1: Quantile Regressors
xgb_regressor_low = SparkXGBRegressor(
    objective="reg:quantileerror",
    quantile_alpha=LOW_QUANTILE,
    num_round=NUM_ROUND,
    features_col="features",
    label_col="DEP_DELAY_NEW",
    num_workers=NUM_WORKERS,
    max_depth=MAX_DEPTH,
    n_estimators=N_ESTIMATORS,
    learning_rate=LEARNING_RATE
)

xgb_regressor_high = SparkXGBRegressor(
    objective="reg:quantileerror",
    quantile_alpha=HIGH_QUANTILE,
    num_round=NUM_ROUND,
    features_col="features",
    label_col="DEP_DELAY_NEW",
    num_workers=NUM_WORKERS,
    max_depth=MAX_DEPTH,
    n_estimators=N_ESTIMATORS,
    learning_rate=LEARNING_RATE
)

# Stage 2: Binary Classifier
classifier = SparkXGBClassifier(
    num_round=NUM_ROUND,
    features_col="features",
    label_col="bin_label",
    prediction_col="clf_prediction",
    max_depth=MAX_DEPTH,
    n_estimators=N_ESTIMATORS,
    learning_rate=LEARNING_RATE,
    num_workers=NUM_WORKERS,
    min_child_weight=CLASSIFIER_MIN_CHILD_WEIGHT
)

# Two-Stage Interval Classifier
interval_clf = IntervalClassifier(
    lowerEstimator=xgb_regressor_low,
    upperEstimator=xgb_regressor_high,
    baseClassifier=classifier,
    labelCol="DEP_DELAY_NEW",
    featuresCol="features",
    threshold=DELAY_THRESHOLD,
    predictionCol="final_prediction"
)

# Full Pipeline
pipeline = Pipeline(stages=[
    carrier_indexer,
    carrier_encoder,
    assembler,
    interval_clf
])

In [None]:
# -----------------------------------------------------------------------------
# Training
# -----------------------------------------------------------------------------

run_name = f"XGB-2_stage_5y_q{LOW_QUANTILE}_{HIGH_QUANTILE}"

with mlflow.start_run(run_name=run_name):
    # Log hyperparameters
    mlflow.log_params({
        "low_quantile": LOW_QUANTILE,
        "high_quantile": HIGH_QUANTILE,
        "threshold": DELAY_THRESHOLD,
        "max_depth": MAX_DEPTH,
        "n_estimators": N_ESTIMATORS,
        "learning_rate": LEARNING_RATE,
        "num_round": NUM_ROUND,
        "classifier_min_child_weight": CLASSIFIER_MIN_CHILD_WEIGHT
    })
    
    print("Training pipeline...")
    model = pipeline.fit(train_df)
    
    print("Generating predictions...")
    train_preds = model.transform(train_df)
    test_preds = model.transform(test_df)
    
    # Add binary labels for evaluation
    train_preds = train_preds.withColumn(
        "label_bin", 
        (col("DEP_DELAY_NEW") >= lit(DELAY_THRESHOLD)).cast("double")
    )
    test_preds = test_preds.withColumn(
        "label_bin",
        (col("DEP_DELAY_NEW") >= lit(DELAY_THRESHOLD)).cast("double")
    )
    
    print("✓ Training complete")

## 5. Evaluation

In [None]:
def compute_f2(df: DataFrame, label_col: str = "label_bin", 
               pred_col: str = "final_prediction", beta: float = 2.0) -> tuple:
    """
    Compute the F-beta score (beta=2 weights recall more heavily than precision).
    
    Args:
        df: DataFrame with predictions and labels
        label_col: Name of true label column
        pred_col: Name of prediction column
        beta: F-beta parameter (2.0 for F2)
        
    Returns:
        Tuple of (f_beta, precision, recall); a value is NaN when undefined
    """
    tp = df.filter((col(label_col) == 1) & (col(pred_col) == 1)).count()
    fp = df.filter((col(label_col) == 0) & (col(pred_col) == 1)).count()
    fn = df.filter((col(label_col) == 1) & (col(pred_col) == 0)).count()
    
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    recall = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    
    denom = (beta**2) * precision + recall
    f2 = ((1 + beta**2) * precision * recall / denom) if denom > 0 else float("nan")
    
    return f2, precision, recall


def evaluate_model(train_preds: DataFrame, test_preds: DataFrame):
    """Compute and display evaluation metrics."""
    # Compute metrics
    train_f2, train_prec, train_rec = compute_f2(train_preds)
    test_f2, test_prec, test_rec = compute_f2(test_preds)
    
    # Print the metric table
    print("\n" + "="*60)
    print("EVALUATION RESULTS")
    print("="*60)
    
    print(f"\n{'Metric':<20} {'Training':>12} {'Test':>12}")
    print("-"*44)
    print(f"{'F2 Score':<20} {train_f2:>12.4f} {test_f2:>12.4f}")
    print(f"{'Precision':<20} {train_prec:>12.4f} {test_prec:>12.4f}")
    print(f"{'Recall':<20} {train_rec:>12.4f} {test_rec:>12.4f}")
    
    # Decision source distribution
    print("\n" + "-"*44)
    print("Decision Source Distribution (Test Set):")
    test_preds.groupBy("decision_source").count().show()
    
    return {
        "train_f2": train_f2, "test_f2": test_f2,
        "train_precision": train_prec, "test_precision": test_prec,
        "train_recall": train_rec, "test_recall": test_rec
    }


# Run evaluation
metrics = evaluate_model(train_preds, test_preds)

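# Log metrics to the training run. That run has already ended, so it is
# reopened by ID (assumes the training cell started the most recent run).
last_run = mlflow.last_active_run()
with mlflow.start_run(run_id=last_run.info.run_id):
    for name, value in metrics.items():
        mlflow.log_metric(name, value)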

## 6. Summary

This notebook implements a two-stage classification pipeline that achieves **F2 = 0.621** on the test set.

### Key Findings

| Decision Source | Share of Test Set | Description |
|-----------------|-------|-------------|
| `quantile_high` | ~X% | Confidently predicted as DELAYED |
| `quantile_low` | ~X% | Confidently predicted as ON-TIME |
| `classifier` | ~X% | Ambiguous cases resolved by Stage 2 |

### Design Decisions

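1. **Quantile Selection**: α_low = 0.50 and α_high = 0.98. A flight is labeled delayed without the classifier only when its predicted median delay exceeds the 15-minute threshold, and on-time only when its predicted 98th-percentile delay falls below it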
2. **Undersampling**: Balances classes in ambiguous region for better classifier training
3. **Feature Augmentation**: Adding quantile predictions as classifier features improves performance
4. **F2 Metric**: Emphasizes recall (catching delays) over precision
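
To make the recall emphasis concrete, here is a small sketch of the F-beta formula used in `compute_f2`, applied to hypothetical confusion-matrix counts. The two models below have their precision and recall swapped; the function name `f_beta` and all counts are illustrative.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta from confusion counts; assumes tp > 0 so no denominator is zero."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Model A: precision 0.50, recall 0.80 (catches most delays, more false alarms).
f2_a = f_beta(tp=80, fp=80, fn=20)
# Model B: precision 0.80, recall 0.50 (fewer false alarms, misses half the delays).
f2_b = f_beta(tp=40, fp=10, fn=40)

print(f"F2 (high recall):    {f2_a:.3f}")
print(f"F2 (high precision): {f2_b:.3f}")
```

With these counts the high-recall model scores about 0.71 and the high-precision model about 0.54, even though their F1 scores would be identical.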

### Future Improvements

- Tune quantile thresholds more granularly
- Experiment with different classifiers for Stage 2
- Add confidence calibration for probability outputs