# Airline On‑Time Performance (OTP) Analytics

**Theme:** Operational Performance & Service Delivery (OTP, delays, cancellations, diversions)

**Primary data source (official):** U.S. DOT Bureau of Transportation Statistics (BTS) TranStats — *On‑Time: Reporting Carrier On‑Time Performance (1987–present)*.

**Portfolio focus**

- KPI monitoring: OTP, delay minutes, cancellation & diversion rates
- Anomaly detection: routes / airports with persistent (“chronic”) delay patterns and sudden spikes
- Forecasting: short‑horizon delay risk from time series patterns
- Root‑cause by delay type: carrier / weather / NAS / security / late aircraft

**Notes on scale**

BTS flight‑level OTP files can be very large (millions of rows). This notebook supports:

- loading a subset of columns (recommended),
- optionally limiting rows for a demo run,
- generating a synthetic dataset when a local BTS extract is not available.
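
For extracts that do not fit comfortably in memory, one further option (not used elsewhere in this notebook) is to stream the file in chunks and accumulate additive counts. The sketch below is a minimal illustration under stated assumptions: `SAMPLE_CSV` and `chunked_counts` are hypothetical names, and the tiny inline CSV stands in for a real BTS file.

```python
import io
import pandas as pd

# Hypothetical miniature extract standing in for a large BTS CSV.
SAMPLE_CSV = """FlightDate,Origin,Dest,ArrDel15,Cancelled
2024-05-01,ATL,DFW,0,0
2024-05-01,ATL,ORD,1,0
2024-05-02,ORD,ATL,,1
2024-05-02,DFW,ATL,0,0
2024-05-03,ATL,DFW,1,0
"""

def chunked_counts(source, chunksize: int = 2) -> pd.DataFrame:
    """Accumulate per-origin flight counts and 15+ minute delay counts chunk by chunk."""
    parts = []
    reader = pd.read_csv(source, usecols=['Origin', 'ArrDel15'], chunksize=chunksize)
    for chunk in reader:
        parts.append(
            chunk.groupby('Origin').agg(
                flights=('Origin', 'size'),
                delayed=('ArrDel15', 'sum'),  # missing flags are skipped by sum
            )
        )
    # Counts are additive, so summing the partial results by origin gives the totals.
    return pd.concat(parts).groupby(level=0).sum()

totals = chunked_counts(io.StringIO(SAMPLE_CSV))
print(totals)
```

Only additive quantities (counts, sums) can be combined this way; means and percentiles need to be derived from the accumulated counts or computed with a different approach.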

## 1) Data acquisition (BTS TranStats)

TranStats provides a configurable download page (filter by year/month and select columns). The dataset profile and the download UI are available from the official portal:

- Database profile / description: https://transtats.bts.gov/DatabaseInfo.asp (search for “Reporting Carrier On‑Time Performance”)
- Download UI (field selection, year/month filters): https://www.transtats.bts.gov/DL_SelectFields.aspx (dataset: Reporting Carrier On‑Time Performance)
- Field list / dictionary: https://transtats.bts.gov/Fields.asp
- Airline on‑time statistics portal (aggregations): https://www.transtats.bts.gov/ontime/

**Local file expectation**

This notebook expects a locally saved extract (CSV/CSV.GZ/ZIP containing a CSV). A typical workflow:

1. Download a month (or a small range) from TranStats using the filters.
2. Select only the columns needed for analysis (reduces file size and load time).
3. Save the file under `data/` and point `DATA_PATH` to it.

A synthetic dataset is created automatically if `DATA_PATH` does not exist, enabling full execution without external files.

## 2) Environment setup

Recommended packages:

- pandas, numpy
- matplotlib
- scikit‑learn
- statsmodels

Optional (nice‑to‑have): pyarrow (faster CSV), duckdb (SQL on large files)

In [None]:
import os
import zipfile
import warnings
from dataclasses import dataclass

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Modeling
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    roc_auc_score, roc_curve, average_precision_score,
    confusion_matrix, classification_report
)

# Forecasting
from statsmodels.tsa.holtwinters import ExponentialSmoothing

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 200)
plt.rcParams['figure.dpi'] = 130

## 3) Load data

- `DATA_PATH` may point to a `.csv`, `.csv.gz`, or `.zip` (ZIP containing a CSV).
- `MAX_ROWS_FOR_DEMO` can cap the number of rows loaded (useful for quick runs).
- If the file does not exist, a synthetic dataset is generated to keep the notebook runnable.

In [None]:
DATA_PATH = os.environ.get('BTS_OTP_PATH', 'data/On_Time_Reporting_Carrier.csv')
MAX_ROWS_FOR_DEMO = int(os.environ.get('BTS_MAX_ROWS', '0'))  # 0 = load all rows

# Column subset for common OTP/Delay analytics
USE_COLS = [
    'FlightDate','Year','Month','DayOfWeek',
    'Reporting_Airline','Flight_Number_Reporting_Airline',
    'Origin','Dest','Distance',
    'DepDelayMinutes','ArrDelayMinutes','DepDel15','ArrDel15',
    'Cancelled','CancellationCode','Diverted',
    'CarrierDelay','WeatherDelay','NASDelay','SecurityDelay','LateAircraftDelay'
]

def read_bts_file(path: str, usecols=None, max_rows: int = 0) -> pd.DataFrame:
    """Read BTS extract from CSV/CSV.GZ/ZIP (ZIP contains a CSV)."""
    nrows = None if max_rows == 0 else max_rows

    if path.lower().endswith('.zip'):
        with zipfile.ZipFile(path, 'r') as zf:
            candidates = [n for n in zf.namelist() if n.lower().endswith('.csv')]
            if not candidates:
                raise ValueError('ZIP does not contain a CSV file.')
            csv_name = candidates[0]
            with zf.open(csv_name) as f:
                return pd.read_csv(f, usecols=usecols, nrows=nrows, low_memory=False)

    # pandas auto-detects gzip via extension when compression is not specified
    return pd.read_csv(path, usecols=usecols, nrows=nrows, low_memory=False)


def make_synthetic_bts_like(n_rows: int = 60000, seed: int = 7) -> pd.DataFrame:
    """Create a BTS-like dataset (schema subset) for demonstration and reproducibility."""
    rng = np.random.default_rng(seed)

    airlines = np.array(['AA','DL','UA','WN','B6','AS'])
    airports = np.array(['ATL','DFW','DEN','ORD','LAX','JFK','SEA','MCO','PHX','CLT'])

    # Dates across ~4 months
    start = np.datetime64('2024-05-01')
    days = rng.integers(0, 120, size=n_rows)
    flight_dates = start + days.astype('timedelta64[D]')

    origin = rng.choice(airports, size=n_rows)
    dest = rng.choice(airports, size=n_rows)
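    # Avoid origin == dest: re-draw colliding destinations until none remain
    # (a single re-draw can still pick the origin airport again)
    same = origin == dest
    while same.any():
        dest[same] = rng.choice(airports, size=same.sum())
        same = origin == dest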

    reporting_airline = rng.choice(airlines, size=n_rows)
    flight_no = rng.integers(1, 9999, size=n_rows)
    distance = rng.integers(150, 2800, size=n_rows)

    # Baseline delay structure with airport and route effects
    airport_effect = {a: rng.normal(0, 3) for a in airports}
    route_key = np.char.add(np.char.add(origin, '-'), dest)
    unique_routes = np.unique(route_key)
    route_effect = {r: rng.normal(0, 4) for r in unique_routes}

    base_dep = rng.normal(6, 12, size=n_rows)
    base_arr = base_dep + rng.normal(2, 10, size=n_rows)

    dep = base_dep + np.vectorize(airport_effect.get)(origin)
    arr = base_arr + np.vectorize(route_effect.get)(route_key)

    # Introduce a few spike events for anomaly detection
    spike_days = np.array(['2024-06-15', '2024-07-04', '2024-07-25'], dtype='datetime64[D]')
    spike_mask = np.isin(flight_dates.astype('datetime64[D]'), spike_days) & (origin == 'ORD')
    arr[spike_mask] += rng.normal(35, 12, size=spike_mask.sum())

    dep_delay_minutes = np.clip(dep, 0, None)
    arr_delay_minutes = np.clip(arr, 0, None)

    # Cancellations and diversions
    cancelled = (rng.random(n_rows) < 0.015).astype(int)
    diverted = ((rng.random(n_rows) < 0.003) & (cancelled == 0)).astype(int)

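    # Delay flags (15+), based on the rounded minutes that are stored in the frame.
    # Kept as float so cancelled rows can be set to NaN further down.
    dep_del15 = (np.round(dep_delay_minutes) >= 15).astype(float)
    arr_del15 = (np.round(arr_delay_minutes) >= 15).astype(float)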

    # Delay causes (simple allocation for demo)
    total = arr_delay_minutes.copy()
    # Normalize weights per row
    w = rng.dirichlet([2, 1.2, 1.4, 0.2, 1.0], size=n_rows)
    carrier = (total * w[:, 0]).round(0)
    weather = (total * w[:, 1]).round(0)
    nas = (total * w[:, 2]).round(0)
    security = (total * w[:, 3]).round(0)
    late = (total * w[:, 4]).round(0)

    # When not delayed, causes are zero
    not_delayed = total < 1
    carrier[not_delayed] = 0
    weather[not_delayed] = 0
    nas[not_delayed] = 0
    security[not_delayed] = 0
    late[not_delayed] = 0

    dt = pd.to_datetime(flight_dates)

    df = pd.DataFrame({
        'FlightDate': dt.strftime('%Y-%m-%d'),
        'Year': dt.year,
        'Month': dt.month,
        'DayOfWeek': (dt.dayofweek + 1),  # BTS uses 1-7
        'Reporting_Airline': reporting_airline,
        'Flight_Number_Reporting_Airline': flight_no,
        'Origin': origin,
        'Dest': dest,
        'Distance': distance,
        'DepDelayMinutes': dep_delay_minutes.round(0),
        'ArrDelayMinutes': arr_delay_minutes.round(0),
        'DepDel15': dep_del15,
        'ArrDel15': arr_del15,
        'Cancelled': cancelled,
        # Series.where keeps real missing values; np.where with strings would store the text 'nan'
        'CancellationCode': pd.Series(rng.choice(['A','B','C','D'], size=n_rows)).where(cancelled == 1),
        'Diverted': diverted,
        'CarrierDelay': carrier,
        'WeatherDelay': weather,
        'NASDelay': nas,
        'SecurityDelay': security,
        'LateAircraftDelay': late,
    })

    # For cancelled flights, delay minutes are not meaningful
    df.loc[df['Cancelled'] == 1, ['DepDelayMinutes','ArrDelayMinutes','DepDel15','ArrDel15',
                                 'CarrierDelay','WeatherDelay','NASDelay','SecurityDelay','LateAircraftDelay']] = np.nan

    return df


if os.path.exists(DATA_PATH):
    df_raw = read_bts_file(DATA_PATH, usecols=USE_COLS, max_rows=MAX_ROWS_FOR_DEMO)
    data_origin = f'Loaded from file: {DATA_PATH}'
else:
    df_raw = make_synthetic_bts_like()
    data_origin = 'Synthetic dataset (file not found)'

data_origin, df_raw.shape

In [None]:
df_raw.head()

## 4) Data preparation

Key preparation steps:

- Parse `FlightDate` into a proper date type.
- Create `route` = `Origin`–`Dest`.
- Define status flags: cancelled, diverted, and completed (not cancelled; diverted flights count as completed here).
- Standardize delay metrics and create the core KPI fields.

**KPI definitions (common DOT conventions)**

- *On‑time arrival* (OTP): arrival less than **15 minutes** after the scheduled time (`ArrDel15 = 0`).
- *Cancellation rate*: share of flights with `Cancelled = 1`.
- *Diversion rate*: share of flights with `Diverted = 1`.

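Many analyses compute OTP on “completed” flights only. Two variants are worth distinguishing:

- OTP on completed flights (excluding cancellations), which is what the KPI tables in this notebook report
- “Completion‑adjusted” OTP (cancellations count as failures), computed as on‑time arrivals divided by all scheduled flights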
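
The difference between the two OTP variants can be seen on a toy example. This is a minimal sketch with made-up data; the variable names are illustrative and not part of the notebook's pipeline.

```python
import pandas as pd

# Five scheduled flights: three on time, one delayed 15+, one cancelled.
flights = pd.DataFrame({
    'ArrDel15':  [0, 0, 1, None, 0],
    'Cancelled': [0, 0, 0, 1, 0],
})

is_completed = flights['Cancelled'] != 1
is_ontime = (flights['ArrDel15'] == 0) & is_completed

# Denominator = completed flights only (4 here).
otp_completed = is_ontime.sum() / is_completed.sum()

# Denominator = all scheduled flights (5 here), so the cancellation counts as a failure.
otp_adjusted = is_ontime.sum() / len(flights)

print(f'OTP (completed): {otp_completed:.2f}, completion-adjusted: {otp_adjusted:.2f}')
```

The adjusted figure can never exceed the completed-flights figure, and the gap widens as the cancellation rate rises.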

In [None]:
df = df_raw.copy()

df['FlightDate'] = pd.to_datetime(df['FlightDate'], errors='coerce')
df['route'] = df['Origin'].astype(str) + '-' + df['Dest'].astype(str)

# Standardize numeric columns
num_cols = ['Distance','DepDelayMinutes','ArrDelayMinutes',
            'CarrierDelay','WeatherDelay','NASDelay','SecurityDelay','LateAircraftDelay']
for c in num_cols:
    if c in df.columns:
        df[c] = pd.to_numeric(df[c], errors='coerce')

# Flags
df['is_cancelled'] = (df['Cancelled'] == 1)
df['is_diverted'] = (df['Diverted'] == 1)
df['is_completed'] = (~df['is_cancelled'])  # diverted flights are still "not cancelled" here

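# OTP flags (rows with a missing ArrDel15, such as many diverted flights, count as neither on time nor delayed)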
df['is_arr_ontime'] = (df['ArrDel15'] == 0) & df['is_completed']
df['is_arr_delayed15p'] = (df['ArrDel15'] == 1) & df['is_completed']

# Time grain helpers
df['month'] = df['FlightDate'].dt.to_period('M').dt.to_timestamp()
df['week'] = df['FlightDate'].dt.to_period('W').dt.start_time

df[['FlightDate','Reporting_Airline','Origin','Dest','ArrDelayMinutes','is_arr_ontime','is_cancelled']].head(10)

## 5) KPI monitoring (OTP, delays, cancellations)

The KPIs below are computed monthly and can be reused in dashboards:

- Flights
- OTP (completed flights)
- Average arrival delay minutes (completed flights; `ArrDelayMinutes`)
- Cancellation rate
- Diversion rate

In [None]:
def monthly_kpis(frame: pd.DataFrame) -> pd.DataFrame:
    g = frame.groupby('month', dropna=False)

    flights = g.size()
    completed = g['is_completed'].sum(min_count=1)
    otp_completed = g['is_arr_ontime'].sum(min_count=1) / completed

    # Mean over completed flights only; NaN delay values are skipped by mean()
    avg_arr_delay = frame.loc[frame['is_completed']].groupby('month', dropna=False)['ArrDelayMinutes'].mean()
    cancel_rate = g['is_cancelled'].mean()
    divert_rate = g['is_diverted'].mean()

    out = pd.DataFrame({
        'flights': flights,
        'completed_flights': completed,
        'otp_completed': otp_completed,
        'avg_arr_delay_min': avg_arr_delay,
        'cancel_rate': cancel_rate,
        'divert_rate': divert_rate,
    }).reset_index()

    return out


kpi_month = monthly_kpis(df)
kpi_month

In [None]:
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(kpi_month['month'], kpi_month['otp_completed'] * 100, marker='o')
ax.set_title('On-Time Arrival Performance (OTP) — Completed Flights')
ax.set_ylabel('OTP (%)')
ax.set_xlabel('Month')
ax.grid(True, alpha=0.3)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(kpi_month['month'], kpi_month['avg_arr_delay_min'], marker='o')
ax.set_title('Average Arrival Delay Minutes — Completed Flights')
ax.set_ylabel('Minutes')
ax.set_xlabel('Month')
ax.grid(True, alpha=0.3)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(kpi_month['month'], kpi_month['cancel_rate'] * 100, marker='o', label='Cancellation rate')
ax.plot(kpi_month['month'], kpi_month['divert_rate'] * 100, marker='o', label='Diversion rate')
ax.set_title('Service Disruptions')
ax.set_ylabel('Rate (%)')
ax.set_xlabel('Month')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()

### KPI drill‑downs (carrier, airport, route)

A small set of drill‑downs usually covers the majority of operational questions:

- Which carriers have the lowest OTP and highest delay minutes?
- Which origin airports show chronic delays?
- Which routes have persistent delays (high average delay and high delay frequency)?

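The tables below include simple eligibility thresholds to reduce small‑sample noise. The thresholds are sized for full BTS extracts: in the default 60,000‑row synthetic dataset no route has more than roughly 700 flights, so the route table is empty unless `min_flights` is lowered.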

In [None]:
def kpi_by(frame: pd.DataFrame, by_col: str, min_flights: int = 500) -> pd.DataFrame:
    g = frame.groupby(by_col, dropna=False)
    flights = g.size()
    completed = g['is_completed'].sum(min_count=1)

    out = pd.DataFrame({
        'flights': flights,
        'otp_completed': g['is_arr_ontime'].sum(min_count=1) / completed,
        'avg_arr_delay_min': frame.loc[frame['is_completed']].groupby(by_col, dropna=False)['ArrDelayMinutes'].mean(),
        'cancel_rate': g['is_cancelled'].mean(),
        'divert_rate': g['is_diverted'].mean(),
    })

    out = out.loc[out['flights'] >= min_flights].sort_values(['otp_completed', 'avg_arr_delay_min'], ascending=[True, False])
    return out


by_airline = kpi_by(df, 'Reporting_Airline', min_flights=1000)
by_airline.head(12)

In [None]:
by_origin = kpi_by(df, 'Origin', min_flights=1500)
by_origin.head(15)

In [None]:
by_route = kpi_by(df, 'route', min_flights=1500)
by_route.head(15)

## 6) Anomaly detection

Two operationally useful anomaly patterns:

1. **Spikes**: sudden deterioration (e.g., a disrupted day/week) for an airport or route.
2. **Chronic underperformance**: sustained low OTP and high delays over many periods.

Approach used here:

- Aggregate to **weekly** route‑level delay minutes.
- Compute a rolling baseline (mean and standard deviation) per route.
- Flag weeks whose z‑score against that baseline is at or above a threshold (3 here).

The output table highlights the *largest* anomalies and includes context metrics.
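
The code in this section standardizes against a rolling mean and standard deviation, which earlier spikes inside the window can inflate. A median/MAD variant is less sensitive to that. The sketch below is one possible implementation, not the method used in this notebook: `robust_z` is a hypothetical helper, and the 0.6745 scaling with a cutoff of 3.5 follows the common "modified z-score" convention.

```python
import numpy as np

def robust_z(values, window: int = 8, min_periods: int = 4) -> np.ndarray:
    """Modified z-score of each point against the median/MAD of the preceding window."""
    values = np.asarray(values, dtype=float)
    z = np.full(values.shape, np.nan)
    for i in range(len(values)):
        hist = values[max(0, i - window):i]      # prior periods only, no leakage
        hist = hist[~np.isnan(hist)]
        if len(hist) < min_periods:
            continue
        med = np.median(hist)
        mad = np.median(np.abs(hist - med))      # median absolute deviation
        if mad > 0:                              # leave NaN when the baseline is flat
            z[i] = 0.6745 * (values[i] - med) / mad
    return z

# Made-up weekly average delays with one obvious spike at position 6.
weekly_delay = [10, 12, 11, 9, 10, 11, 40, 10]
z = robust_z(weekly_delay)
flagged = np.where(z >= 3.5)[0]
print(flagged)
```

In this toy series the week after the spike is not flagged, because the median baseline barely moves when the spike enters the window.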

In [None]:
# Weekly aggregates per route
weekly_route = (
    df.loc[df['is_completed']]
      .groupby(['route','week'], dropna=False)
      .agg(
          flights=('route','size'),
          otp=('is_arr_ontime','mean'),
          avg_arr_delay=('ArrDelayMinutes','mean'),
          p95_arr_delay=('ArrDelayMinutes', lambda s: np.nanpercentile(s, 95)),
      )
      .reset_index()
)

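# Eligibility threshold (sized for full BTS extracts; the synthetic dataset averages
# roughly 40 flights per route-week, so lower this value for demo runs)
weekly_route = weekly_route.loc[weekly_route['flights'] >= 80].copy()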

# Rolling baseline per route (shifted to avoid leakage)
weekly_route = weekly_route.sort_values(['route','week'])
weekly_route['roll_mean'] = weekly_route.groupby('route')['avg_arr_delay'].transform(lambda s: s.rolling(8, min_periods=4).mean().shift(1))
weekly_route['roll_std']  = weekly_route.groupby('route')['avg_arr_delay'].transform(lambda s: s.rolling(8, min_periods=4).std(ddof=0).shift(1))

weekly_route['z_delay'] = (weekly_route['avg_arr_delay'] - weekly_route['roll_mean']) / weekly_route['roll_std']

# The z-score is undefined when the baseline has zero variance
weekly_route.loc[weekly_route['roll_std'] == 0, 'z_delay'] = np.nan

anomalies = weekly_route.loc[weekly_route['z_delay'] >= 3].sort_values('z_delay', ascending=False)
anomalies.head(20)

In [None]:
# Visualize the most extreme anomaly route, if available
if len(anomalies) > 0:
    sel_route = anomalies.iloc[0]['route']
    plot_df = weekly_route.loc[weekly_route['route'] == sel_route].copy()

    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(plot_df['week'], plot_df['avg_arr_delay'], marker='o', label='Weekly avg arrival delay')
    ax.plot(plot_df['week'], plot_df['roll_mean'], linestyle='--', label='Rolling baseline (8w)')
    ax.set_title(f'Weekly Arrival Delay — Route {sel_route}')
    ax.set_ylabel('Minutes')
    ax.set_xlabel('Week')
    ax.legend()
    ax.grid(True, alpha=0.3)
    plt.show()
else:
    print('No anomalies detected with the current thresholds.')

### Chronic delay ranking

“Chronic” routes and airports are identified by combining:

- high average arrival delay minutes, and
- high frequency of 15+ minute delays.

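The score below is a simple weighted index and can be replaced by a more formal control chart or statistical model. The `min_flights` values used in the calls assume a full BTS extract; on the synthetic dataset (about 6,000 flights per origin airport) both tables are empty unless the thresholds are lowered.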
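
As one example of the control-chart alternative, a p-chart compares each route's or airport's 15+ minute delay rate with binomial limits around a reference rate. The sketch below is illustrative only: `p_chart_limits` is a hypothetical helper, and the reference rate and flight count are made-up numbers.

```python
import math

def p_chart_limits(p_bar: float, n: int, sigmas: float = 3.0):
    """Return (lower, upper) control limits for a proportion observed over n flights."""
    se = math.sqrt(p_bar * (1 - p_bar) / n)   # binomial standard error
    lower = max(0.0, p_bar - sigmas * se)
    upper = min(1.0, p_bar + sigmas * se)
    return lower, upper

# Assumed network-wide delay rate of 20% and a route with 400 completed flights.
lower, upper = p_chart_limits(p_bar=0.20, n=400)
route_rate = 0.30
is_out_of_control = route_rate > upper
print(f'limits = ({lower:.2f}, {upper:.2f}), flagged = {is_out_of_control}')
```

Because the limits tighten as `n` grows, this approach handles small-sample noise without a hard eligibility threshold.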

In [None]:
def chronic_score(frame: pd.DataFrame, by_col: str, min_flights: int = 5000) -> pd.DataFrame:
    g = frame.loc[frame['is_completed']].groupby(by_col, dropna=False)
    flights = g.size()

    avg_delay = g['ArrDelayMinutes'].mean()
    delay_freq = g['is_arr_delayed15p'].mean()

    # Score: emphasize frequency while retaining magnitude
    score = 0.65 * delay_freq + 0.35 * (avg_delay / (avg_delay.median() + 1e-9))

    out = pd.DataFrame({
        'flights': flights,
        'avg_arr_delay_min': avg_delay,
        'delay_15p_rate': delay_freq,
        'chronic_score': score
    })

    out = out.loc[out['flights'] >= min_flights].sort_values('chronic_score', ascending=False)
    return out


chronic_routes = chronic_score(df, 'route', min_flights=6000)
chronic_routes.head(15)

In [None]:
chronic_airports = chronic_score(df, 'Origin', min_flights=8000)
chronic_airports.head(15)

## 7) Root‑cause by delay type

BTS includes delay minutes by cause for many (but not necessarily all) flights/periods:

- `CarrierDelay`
- `WeatherDelay`
- `NASDelay`
- `SecurityDelay`
- `LateAircraftDelay`

Root‑cause analytics is typically framed as:

- distribution of delay minutes by cause over time,
- top causes at specific airports or for specific carriers.

**Interpretation note**

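In BTS extracts the cause fields are generally populated only for flights that arrive 15 or more minutes late and are blank for other records, so the shares below describe delayed flights rather than all flights.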
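
Because of these missing values, it matters how sums are taken. The toy example below (made-up data) shows why the aggregations in this section pass `min_count=1`: without it, a period with no populated cause values sums to 0 instead of staying missing.

```python
import numpy as np
import pandas as pd

# June has no populated WeatherDelay values at all.
cause = pd.DataFrame({
    'month': ['2024-05', '2024-05', '2024-06', '2024-06'],
    'WeatherDelay': [12.0, np.nan, np.nan, np.nan],
})

default_sum = cause.groupby('month')['WeatherDelay'].sum()             # all-NaN group becomes 0.0
guarded_sum = cause.groupby('month')['WeatherDelay'].sum(min_count=1)  # all-NaN group stays NaN

print(pd.DataFrame({'default': default_sum, 'min_count=1': guarded_sum}))
```

Keeping the all-missing month as `NaN` makes "no cause data" distinguishable from "zero weather delay" in downstream shares and charts.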

In [None]:
cause_cols = ['CarrierDelay','WeatherDelay','NASDelay','SecurityDelay','LateAircraftDelay']

cause_month = (
    df.loc[df['is_completed']]
      .groupby('month')[cause_cols]
      .sum(min_count=1)
      .reset_index()
)

cause_month['total_cause_delay'] = cause_month[cause_cols].sum(axis=1)

for c in cause_cols:
    cause_month[c + '_share'] = cause_month[c] / cause_month['total_cause_delay']

cause_month[['month'] + [c + '_share' for c in cause_cols]]

In [None]:
# Stacked area chart of delay cause shares
fig, ax = plt.subplots(figsize=(10, 4))
x = cause_month['month']
y = [cause_month[c + '_share'].fillna(0).values for c in cause_cols]
ax.stackplot(x, y, labels=cause_cols)
ax.set_title('Delay Cause Shares Over Time (Share of Minutes)')
ax.set_ylabel('Share')
ax.set_xlabel('Month')
ax.legend(loc='upper right', ncol=2)
ax.grid(True, alpha=0.2)
plt.show()

In [None]:
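# Delay minutes by cause per origin airport, ranked by total attributed minutes
airport_cause = (
    df.loc[df['is_completed']]
      .groupby('Origin')[cause_cols]
      .sum(min_count=1)
)
airport_cause['total_cause_delay'] = airport_cause[cause_cols].sum(axis=1)
airport_cause = airport_cause.sort_values('total_cause_delay', ascending=False)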

airport_cause.head(12)

## 8) Forecasting: short‑horizon delay outlook

A lightweight forecasting example uses weekly average arrival delay minutes and Holt‑Winters exponential smoothing.

The purpose is a short‑term operational risk outlook, not long‑range forecasting.

- Train/test split: last 20% of weeks held out
- Output: next 8 weeks forecast and backtest performance
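
A backtest MAE is easier to interpret next to a trivial baseline. The sketch below compares against a "last value carried forward" forecast; `mae` and `naive_forecast` are hypothetical helpers and the numbers are made up.

```python
import numpy as np

def mae(actual, predicted) -> float:
    """Mean absolute error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

def naive_forecast(train, horizon: int) -> np.ndarray:
    """Repeat the last observed training value for every step of the horizon."""
    last = float(np.asarray(train, dtype=float)[-1])
    return np.repeat(last, horizon)

# Toy weekly average delays (minutes).
train = [10.0, 11.0, 9.0, 12.0]
test = [11.0, 13.0]

naive_mae = mae(test, naive_forecast(train, len(test)))
print(f'Naive baseline MAE: {naive_mae:.2f} minutes')
```

If a smoothing model does not beat this baseline on the held-out weeks, the series is probably too short or too noisy for the model to add value.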

In [None]:
ts = (
    df.loc[df['is_completed']]
      .groupby('week')['ArrDelayMinutes']
      .mean()
      .dropna()
      .asfreq('W-MON')
)

# Fill gaps (rare in full extracts; possible in synthetic)
ts = ts.interpolate(limit_direction='both')

split_idx = int(len(ts) * 0.8)
ts_train, ts_test = ts.iloc[:split_idx], ts.iloc[split_idx:]

# Additive trend, no seasonality by default (weekly data may show yearly seasonality in larger history)
model = ExponentialSmoothing(ts_train, trend='add', seasonal=None, initialization_method='estimated')
fit = model.fit(optimized=True)

# Backtest on test horizon
pred_test = fit.forecast(len(ts_test))

# Simple error metrics
mae = float(np.mean(np.abs(ts_test.values - pred_test.values)))

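# Forecast the next 8 weeks from a model refit on the full history,
# so the most recent (held-out) weeks also inform the outlook
forecast_h = 8
fit_full = ExponentialSmoothing(ts, trend='add', seasonal=None, initialization_method='estimated').fit(optimized=True)
pred_future = fit_full.forecast(forecast_h)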

mae, pred_future.head()

In [None]:
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(ts_train.index, ts_train.values, label='Train')
ax.plot(ts_test.index, ts_test.values, label='Test')
ax.plot(ts_test.index, pred_test.values, linestyle='--', label='Predicted (test)')
ax.set_title(f'Weekly Avg Arrival Delay — Backtest (MAE = {mae:.2f} minutes)')
ax.set_ylabel('Minutes')
ax.set_xlabel('Week')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10, 3.6))
ax.plot(ts.index, ts.values, label='History')
ax.plot(pred_future.index, pred_future.values, linestyle='--', label='Forecast (next 8 weeks)')
ax.set_title('Weekly Avg Arrival Delay — Short-Horizon Forecast')
ax.set_ylabel('Minutes')
ax.set_xlabel('Week')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()

## 9) Delay risk model: probability of 15+ minute arrival delay

A simple baseline classifier estimates the probability that a completed flight arrives 15+ minutes late.

**Target**

- `delayed15p` = 1 if `ArrDel15 == 1` and flight is completed

**Candidate features** (examples)

- carrier, origin, destination
- day of week, month
- distance

**Validation**

- Time‑based split (train on earlier dates, test on later dates)
- Metrics: ROC‑AUC and average precision

This is deliberately conservative: no leakage features (e.g., actual departure delay) are used.
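
A split by row position on date-sorted data can place flights from the same day on both sides of the boundary. Splitting on a date cutoff avoids that. The sketch below is one way to do it; `split_by_date` is a hypothetical helper, not part of this notebook's pipeline.

```python
import pandas as pd

def split_by_date(frame: pd.DataFrame, date_col: str = 'FlightDate', train_share: float = 0.8):
    """Split so that every date falls entirely in train or entirely in test."""
    dates = frame[date_col].sort_values()
    # Date found at the requested row share; all flights on or after it go to test.
    # Assumes 0 < train_share < 1.
    cutoff = dates.iloc[int(len(dates) * train_share)]
    train = frame.loc[frame[date_col] < cutoff]
    test = frame.loc[frame[date_col] >= cutoff]
    return train, test, cutoff

# Ten toy flights spread over three days.
demo = pd.DataFrame({
    'FlightDate': pd.to_datetime(['2024-05-01'] * 3 + ['2024-05-02'] * 3 + ['2024-05-03'] * 4)
})
train, test, cutoff = split_by_date(demo)
print(len(train), len(test), cutoff.date())
```

The realized train share can be smaller than requested (60% in this toy case) because whole days are moved to the test side.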

In [None]:
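model_df = df.loc[df['is_completed']].copy()

# Feature set
features_cat = ['Reporting_Airline','Origin','Dest','DayOfWeek','Month']
features_num = ['Distance']

# Drop rows without a usable date, arrival-delay flag, or numeric feature before
# building the target; otherwise a missing ArrDel15 would silently become 0
model_df = model_df.dropna(subset=['FlightDate','ArrDel15'] + features_num)

# Target
model_df['delayed15p'] = (model_df['ArrDel15'] == 1).astype(int)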

# Sort by time and split
model_df = model_df.sort_values('FlightDate')
split_idx = int(len(model_df) * 0.8)

train_df = model_df.iloc[:split_idx]
test_df = model_df.iloc[split_idx:]

X_train = train_df[features_cat + features_num]
y_train = train_df['delayed15p']
X_test = test_df[features_cat + features_num]
y_test = test_df['delayed15p']

pre = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), features_cat),
        ('num', StandardScaler(), features_num),
    ]
)

# Generous iteration budget: convergence warnings are suppressed by the global warnings filter
clf = LogisticRegression(max_iter=1000)

pipe = Pipeline(steps=[('pre', pre), ('clf', clf)])
pipe.fit(X_train, y_train)

proba = pipe.predict_proba(X_test)[:, 1]

roc = roc_auc_score(y_test, proba)
ap = average_precision_score(y_test, proba)

roc, ap

In [None]:
fpr, tpr, thr = roc_curve(y_test, proba)

fig, ax = plt.subplots(figsize=(5.2, 4))
ax.plot(fpr, tpr)
ax.plot([0, 1], [0, 1], linestyle='--')
ax.set_title(f'ROC Curve (AUC = {roc:.3f})')
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.grid(True, alpha=0.3)
plt.show()

In [None]:
# Example operating point: threshold that flags roughly the 15% highest-risk flights
threshold = float(np.quantile(proba, 0.85))
pred = (proba >= threshold).astype(int)

cm = confusion_matrix(y_test, pred)
report = classification_report(y_test, pred, digits=3)

cm, report

## 10) Summary and operational mapping

This notebook demonstrates an end‑to‑end workflow on BTS OTP data:

- KPI monitoring for OTP, delays, and service disruptions
- anomaly detection for spikes and chronic underperformance
- delay cause decomposition for root‑cause framing
- short‑horizon forecasting for proactive planning
- baseline delay‑risk scoring for operational triage

Possible extensions (typical in production environments):

- incorporate scheduled time blocks (`CRSDepTime`, `CRSArrTime`) and bank structure
- include aircraft tail number and aircraft type (where available)
- add weather data joins at station/airport level
- build a dashboard layer (e.g., Power BI / Tableau / Streamlit)
- use hierarchical models per airport/network for better generalization
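
As a starting point for the first extension, scheduled times can be turned into hour-of-day and day-part features. BTS reports `CRSDepTime` as local time in hhmm form. The sketch below is illustrative: `crs_to_hour` and `day_part` are hypothetical helpers, and the day-part boundaries are arbitrary choices rather than an industry standard.

```python
def crs_to_hour(crs_time) -> int:
    """Convert an hhmm value such as 1735 (or the string '0005') to an hour in 0-23."""
    return (int(crs_time) // 100) % 24   # 2400 wraps to hour 0

def day_part(hour: int) -> str:
    """Map an hour to a coarse day part (boundaries are assumptions for illustration)."""
    if 5 <= hour < 10:
        return 'morning'
    if 10 <= hour < 16:
        return 'midday'
    if 16 <= hour < 21:
        return 'evening'
    return 'overnight'

for crs in (5, 615, 1735, 2400):
    hour = crs_to_hour(crs)
    print(crs, hour, day_part(hour))
```

Both features are known before departure, so they can be added to the delay-risk model without introducing leakage.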