
# Unit 2 — Team Classification (Flights, BQML)

**Goal (team):** Build an *ops-ready* classifier in **BigQuery ML** to predict **`diverted`** on U.S. flights. Minimal handholding by design.

**What you deliver (inside this notebook):**
- One **LOGISTIC_REG** model (baseline), one **engineered** model using `TRANSFORM`
- **Evaluation** via `ML.EVALUATE` and **confusion matrices** (default 0.5 + your custom threshold)
- **Threshold choice** + 3–5 sentence ops justification
- Embedded **rubric** below (self-check before submission)

> Choose *one* dataset table that exists at your institution:  
> • `bigquery-public-data.faa.us_flights` **or** `bigquery-public-data.flights.*`  
> Make sure the table has `carrier`, `dep_delay`, `arr_delay` (for filters), `origin`, `dest`, `diverted` (or equivalent).


In [111]:
# --- Minimal setup (edit 3 vars) ---
from google.colab import auth
auth.authenticate_user()

import os
from google.cloud import bigquery

PROJECT_ID = "big-data-analysis-472319"      # e.g., mgmt-467-47888
REGION     = "us-central1"
DATASET    = "unit2_flights"
TABLE_PATH = "big-data-analysis-472319.unit2_flights.flights"   # or your `bigquery-public-data.flights` table/view

os.environ["PROJECT_ID"] = PROJECT_ID
os.environ["REGION"]     = REGION
bq = bigquery.Client(project=PROJECT_ID)

print("BQ Project:", PROJECT_ID)
print("Source table:", TABLE_PATH)

BQ Project: big-data-analysis-472319
Source table: big-data-analysis-472319.unit2_flights.flights


### Quick sanity check

In [101]:
preview_sql = f"SELECT * FROM `{TABLE_PATH}` LIMIT 5"
bq.query(preview_sql).result().to_dataframe()

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,...,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
0,2015,6,12,5,AS,64,N708AS,PSG,WRG,1621,...,1627,-18,0,0,,,,,,
1,2015,6,9,2,AS,64,N762AS,PSG,WRG,1621,...,1655,10,0,0,,,,,,
2,2015,9,21,1,AS,65,N767AS,WRG,PSG,1110,...,1134,2,0,0,,,,,,
3,2015,8,13,4,AS,64,N706AS,PSG,WRG,1621,...,1634,-10,0,0,,,,,,
4,2015,11,14,6,AS,64,N762AS,PSG,WRG,1520,...,1534,-9,0,0,,,,,,



## 1) Canonical mapping (adjust as needed)
Map to a minimal schema used in the rest of the notebook:
- `flight_date` (DATE), `dep_delay` (NUM), `distance` (NUM), `carrier` (STRING), `origin` (STRING), `dest` (STRING), `diverted` (BOOL)


In [102]:
CANONICAL_BASE_SQL = f"""
WITH canonical_flights AS (
  SELECT
    PARSE_DATE('%Y-%m-%d', CONCAT(CAST(YEAR AS STRING), '-', CAST(MONTH AS STRING), '-', CAST(DAY AS STRING))) AS flight_date,
    CAST(DEPARTURE_DELAY AS FLOAT64) AS dep_delay,
    CAST(DISTANCE AS FLOAT64) AS distance,
    CAST(AIRLINE AS STRING) AS carrier,
    CAST(ORIGIN_AIRPORT AS STRING) AS origin,
    CAST(DESTINATION_AIRPORT AS STRING) AS dest,
    CAST(
      CASE
        WHEN SAFE_CAST(DIVERTED AS INT64)=1 OR LOWER(CAST(DIVERTED AS STRING))='true'
        THEN TRUE ELSE FALSE
      END
    AS BOOL) AS diverted
  FROM `{TABLE_PATH}`
  WHERE DEPARTURE_DELAY IS NOT NULL
)
"""

print(CANONICAL_BASE_SQL[:600] + "\n...")


WITH canonical_flights AS (
  SELECT
    PARSE_DATE('%Y-%m-%d', CONCAT(CAST(YEAR AS STRING), '-', CAST(MONTH AS STRING), '-', CAST(DAY AS STRING))) AS flight_date,
    CAST(DEPARTURE_DELAY AS FLOAT64) AS dep_delay,
    CAST(DISTANCE AS FLOAT64) AS distance,
    CAST(AIRLINE AS STRING) AS carrier,
    CAST(ORIGIN_AIRPORT AS STRING) AS origin,
    CAST(DESTINATION_AIRPORT AS STRING) AS dest,
    CAST(
      CASE 
        WHEN SAFE_CAST(DIVERTED AS INT64)=1 OR LOWER(CAST(DIVERTED AS STRING))='true'
        THEN TRUE ELSE FALSE 
      END
    AS BOOL) AS diverted
  FROM `big-data-analysis-472319.
...


### 2) Split (80/20)

In [103]:
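# NOTE: RAND() is re-evaluated in every query that embeds this clause, so the
# TRAIN rows used for fitting and the EVAL rows used for scoring are drawn
# independently and can overlap. A hash-based split is reproducible.
SPLIT_CLAUSE = """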
, split AS (
  SELECT
    cf.*,
    CASE WHEN RAND() < 0.8 THEN 'TRAIN' ELSE 'EVAL' END AS data_split
  FROM canonical_flights cf
)
"""
print(SPLIT_CLAUSE)


, split AS (
  SELECT
    cf.*,
    CASE WHEN RAND() < 0.8 THEN 'TRAIN' ELSE 'EVAL' END AS data_split
  FROM canonical_flights cf
)
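
Because `RAND()` is re-evaluated on each query, the clause above does not give the same TRAIN/EVAL membership twice. A minimal sketch of a reproducible alternative follows, assuming a hash of `flight_date` is an acceptable split key (the same idea the later `FARM_FINGERPRINT` cell uses). The helper name `hash_split_clause` and the variable `DETERMINISTIC_SPLIT_CLAUSE` are hypothetical, not part of the original notebook.

```python
def hash_split_clause(key_expr: str = "CAST(cf.flight_date AS STRING)",
                      train_buckets: int = 8) -> str:
    """Build a `split` CTE that assigns TRAIN/EVAL from a hash of key_expr.

    Rows with the same key always land in the same split, so the clause can be
    embedded in separate training and scoring queries without leakage.
    """
    if not 0 < train_buckets < 10:
        raise ValueError("train_buckets must be between 1 and 9")
    return f"""
, split AS (
  SELECT
    cf.*,
    CASE
      WHEN MOD(ABS(FARM_FINGERPRINT({key_expr})), 10) < {train_buckets} THEN 'TRAIN'
      ELSE 'EVAL'
    END AS data_split
  FROM canonical_flights cf
)
"""


# Drop-in replacement for the RAND()-based clause (hypothetical name).
DETERMINISTIC_SPLIT_CLAUSE = hash_split_clause()
print(DETERMINISTIC_SPLIT_CLAUSE)
```

Hashing the date sends whole days to one side, so the split is roughly 80/20 by day rather than exactly 80/20 by row.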




## 3) Baseline model — LOGISTIC_REG (`diverted`)
Use **only** a small set of signals for the baseline (keep it honest).


In [55]:
MODEL_BASE = f"{PROJECT_ID}.unit2_flights.clf_diverted_base"

sql_create_schema = f'CREATE SCHEMA IF NOT EXISTS `{PROJECT_ID}.unit2_flights`;'
bq.query(sql_create_schema).result()
print(f"Schema '{PROJECT_ID}.unit2_flights' ensured to exist.")

sql_create_model = f"""
CREATE OR REPLACE MODEL `{MODEL_BASE}`
OPTIONS (MODEL_TYPE='LOGISTIC_REG', INPUT_LABEL_COLS=['diverted']) AS

{CANONICAL_BASE_SQL}
{SPLIT_CLAUSE}

SELECT
  diverted,
  dep_delay, distance, carrier, origin, dest,
  EXTRACT(DAYOFWEEK FROM flight_date) AS day_of_week
FROM split
WHERE data_split='TRAIN';
"""

job = bq.query(sql_create_model)
job.result()
print("Baseline model trained:", MODEL_BASE)

Schema 'big-data-analysis-472319.unit2_flights' ensured to exist.
Baseline model trained: big-data-analysis-472319.unit2_flights.clf_diverted_base


### Confusion matrix — default 0.5 threshold

In [62]:
cm_default_sql = f"""
WITH scored AS (
  SELECT
    p.diverted AS label,
    CASE
      WHEN p.predicted_diverted_probs[OFFSET(0)].prob >= 0.5 THEN TRUE
      ELSE FALSE
    END AS pred_label,
    p.predicted_diverted_probs[OFFSET(0)].prob AS score
  FROM ML.PREDICT(MODEL `{MODEL_BASE}`,
    (
      {CANONICAL_BASE_SQL}
      {SPLIT_CLAUSE}
      SELECT
        diverted,
        dep_delay, distance, carrier, origin, dest,
        EXTRACT(DAYOFWEEK FROM flight_date) AS day_of_week
      FROM split
      WHERE data_split = 'EVAL'
    )
  ) AS p
)
SELECT
  SUM(CASE WHEN label=TRUE  AND pred_label=TRUE  THEN 1 ELSE 0 END) AS TP,
  SUM(CASE WHEN label=FALSE AND pred_label=TRUE  THEN 1 ELSE 0 END) AS FP,
  SUM(CASE WHEN label=TRUE  AND pred_label=FALSE THEN 1 ELSE 0 END) AS FN,
  SUM(CASE WHEN label=FALSE AND pred_label=FALSE THEN 1 ELSE 0 END) AS TN
FROM scored;
"""

bq.query(cm_default_sql).result().to_dataframe()

Unnamed: 0,TP,FP,FN,TN
0,2,5,3139,1143477


Interpretation: At the default 0.5 threshold the model predicts "not diverted" for almost every flight. It flags only 7 flights (2 true positives, 5 false positives) and misses 3,139 of the 3,141 actual diversions. Accuracy looks high (~99.7%) only because diversions make up about 0.27% of the evaluation rows, so the model is effectively useless for detecting them. Improving detection requires a lower threshold or explicit handling of class imbalance.
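
One common way to address the class imbalance is to let BigQuery ML reweight the classes with the `AUTO_CLASS_WEIGHTS` option of `LOGISTIC_REG`. The sketch below only builds the SQL string. The helper `weighted_model_sql`, the model name `clf_diverted_weighted`, and the placeholder training view are hypothetical, and the effect on this dataset has not been measured.

```python
def weighted_model_sql(model_path: str, training_query: str) -> str:
    """Build a CREATE MODEL statement with class reweighting enabled."""
    return f"""
CREATE OR REPLACE MODEL `{model_path}`
OPTIONS (
  MODEL_TYPE='LOGISTIC_REG',
  INPUT_LABEL_COLS=['diverted'],
  AUTO_CLASS_WEIGHTS=TRUE
) AS
{training_query}
"""


# Placeholder project, model, and view names for illustration only.
example_sql = weighted_model_sql(
    "my-project.unit2_flights.clf_diverted_weighted",
    "SELECT diverted, dep_delay, distance, carrier, origin, dest\n"
    "FROM `my-project.unit2_flights.some_training_view`",
)
print(example_sql)
```

Reweighting usually trades precision for recall, so the confusion matrix should be re-checked after retraining.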

### Confusion matrix — your custom threshold

In [63]:
CUSTOM_THRESHOLD = 0.75  # custom cutoff for divert prediction

cm_thresh_sql = f"""
WITH scored AS (
  SELECT
    p.diverted AS label,
    CASE
      WHEN p.predicted_diverted_probs[OFFSET(0)].prob >= {CUSTOM_THRESHOLD} THEN TRUE
      ELSE FALSE
    END AS pred_label,
    p.predicted_diverted_probs[OFFSET(0)].prob AS score
  FROM ML.PREDICT(MODEL `{MODEL_BASE}`,
    (
      {CANONICAL_BASE_SQL}
      {SPLIT_CLAUSE}
      SELECT
        diverted,
        dep_delay, distance, carrier, origin, dest,
        EXTRACT(DAYOFWEEK FROM flight_date) AS day_of_week
      FROM split
      WHERE data_split = 'EVAL'
    )
  ) AS p
)
SELECT
  SUM(CASE WHEN label=TRUE  AND pred_label=TRUE  THEN 1 ELSE 0 END) AS TP,
  SUM(CASE WHEN label=FALSE AND pred_label=TRUE  THEN 1 ELSE 0 END) AS FP,
  SUM(CASE WHEN label=TRUE  AND pred_label=FALSE THEN 1 ELSE 0 END) AS FN,
  SUM(CASE WHEN label=FALSE AND pred_label=FALSE THEN 1 ELSE 0 END) AS TN
FROM scored;
"""

bq.query(cm_thresh_sql).result().to_dataframe()

Unnamed: 0,TP,FP,FN,TN
0,0,0,3082,1144501


Interpretation: At a 0.75 threshold the model predicts no flights as diverted: 0 true positives, 0 false positives, and all 3,082 actual diversions missed. Raising the threshold above 0.5 can only remove positive predictions, so it cannot help a model that already flags almost nothing; a useful custom threshold for this model would have to be well below 0.5. The evaluation totals also differ from the 0.5 run (3,082 vs. 3,141 diversions) because `RAND()` in the split clause draws a new EVAL sample on every query.
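
For trying several cutoffs, `ML.CONFUSION_MATRIX` accepts a `STRUCT(... AS threshold)` argument, which avoids hand-writing the `CASE` logic for each run. This is a sketch that only builds the query strings. The helper name, the placeholder model path, the placeholder evaluation view, and the candidate thresholds are assumptions rather than values from the notebook.

```python
def confusion_matrix_sql(model_path: str, eval_query: str, threshold: float) -> str:
    """Build an ML.CONFUSION_MATRIX query for one probability threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be within [0, 1]")
    return f"""
SELECT *
FROM ML.CONFUSION_MATRIX(
  MODEL `{model_path}`,
  ({eval_query}),
  STRUCT({threshold} AS threshold)
)
"""


# Hypothetical low thresholds; the right range depends on the score distribution.
eval_query = "SELECT * FROM `my-project.unit2_flights.some_eval_view`"
queries = {
    t: confusion_matrix_sql("my-project.unit2_flights.clf_diverted_base", eval_query, t)
    for t in (0.01, 0.05, 0.1)
}
print(queries[0.05])
```

Each string can then be passed to `bq.query(...)` in the same way as the hand-written confusion-matrix queries.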


## 4) Engineered model — `TRANSFORM` (same label, stricter bar)
Create **route**, extract **day_of_week**, and **bucketize dep_delay**. Compare metrics to baseline.


In [90]:
MODEL_XFORM = f"{PROJECT_ID}.unit2_flights.clf_diverted_xform"

sql_xform_create = f"""
CREATE OR REPLACE MODEL `{MODEL_XFORM}`
TRANSFORM (
  CONCAT(origin, '-', dest) AS route,
  EXTRACT(DAYOFWEEK FROM flight_date) AS day_of_week,
  CASE
    WHEN dep_delay < -5  THEN 'early'
    WHEN dep_delay <=  5 THEN 'on_time'
    WHEN dep_delay <= 15 THEN 'minor'
    WHEN dep_delay <= 45 THEN 'moderate'
    ELSE 'major'
  END AS dep_delay_bucket,
  dep_delay,
  distance,
  carrier,
  origin,
  dest,
  diverted
)
OPTIONS (
  MODEL_TYPE='LOGISTIC_REG',
  INPUT_LABEL_COLS=['diverted']
)
AS
{CANONICAL_BASE_SQL}
{SPLIT_CLAUSE}
SELECT * FROM split WHERE data_split = 'TRAIN'
"""

In [91]:
job = bq.query(sql_xform_create)
job.result()
print("✅ Engineered model created successfully:", MODEL_XFORM)

✅ Engineered model created successfully: big-data-analysis-472319.unit2_flights.clf_diverted_xform
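
To sanity-check the bucket boundaries in the `TRANSFORM` clause, the `CASE` expression can be mirrored in plain Python. This is an illustrative sketch, not code from the notebook; the function name `dep_delay_bucket` is hypothetical. It also reproduces one SQL subtlety: a NULL `dep_delay` fails every `WHEN` comparison and therefore falls through to `'major'`.

```python
from typing import Optional


def dep_delay_bucket(dep_delay: Optional[float]) -> str:
    """Mirror the SQL CASE used for dep_delay_bucket in the TRANSFORM clause."""
    if dep_delay is None:
        # In SQL, comparisons with NULL are never TRUE, so ELSE applies.
        return "major"
    if dep_delay < -5:
        return "early"
    if dep_delay <= 5:
        return "on_time"
    if dep_delay <= 15:
        return "minor"
    if dep_delay <= 45:
        return "moderate"
    return "major"


for minutes in (-6, -5, 5, 15, 45, 46, None):
    print(minutes, dep_delay_bucket(minutes))
```

The training query filters out NULL departure delays, so the NULL case only matters when scoring rows that skip that filter.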


In [92]:
# ML.EVALUATE for baseline logistic regression model

sql_eval_base = f"""
{CANONICAL_BASE_SQL}
{SPLIT_CLAUSE}

SELECT *
FROM ML.EVALUATE(
  MODEL `{MODEL_BASE}`,
  (
    SELECT
      dep_delay,
      distance,
      carrier,
      origin,
      dest,
      EXTRACT(DAYOFWEEK FROM flight_date) AS day_of_week,
      diverted
    FROM split
    WHERE data_split = 'EVAL'
  )
)
"""

bq.query(sql_eval_base).result().to_dataframe()

Unnamed: 0,precision,recall,accuracy,f1_score,log_loss,roc_auc
0,0.125,0.000331,0.997359,0.00066,0.017672,0.71948


Interpretation: The baseline model has low precision (0.125) and near-zero recall (0.00033): it catches roughly 1 in 3,000 diverted flights. Overall accuracy is high (0.997) only because most flights are not diverted. The F1 score (0.00066) is near zero because recall is, while the ROC AUC of 0.72 indicates the model has some ability to rank diverted flights above non-diverted ones.
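
The same metrics can be recomputed by hand from confusion-matrix counts, which makes the accuracy trap explicit. The sketch below uses the counts from the 0.5-threshold matrix reported earlier (TP=2, FP=5, FN=3139, TN=1143477); the function name is hypothetical. The results will not match the `ML.EVALUATE` row exactly, most likely because the `RAND()` split produced a different EVAL sample for each query.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard binary-classification metrics from raw counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
        "accuracy": (tp + tn) / total,
        # Accuracy of always predicting the majority class.
        "majority_baseline_accuracy": max(tp + fn, fp + tn) / total,
    }


metrics = classification_metrics(tp=2, fp=5, fn=3139, tn=1143477)
for name, value in metrics.items():
    print(f"{name:<28} {value:.6f}")
```

On these counts the model's accuracy is marginally below the accuracy of always predicting "not diverted", which is why accuracy is a poor headline metric here.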

In [114]:
sql_split = f"""
CREATE OR REPLACE TABLE `{PROJECT_ID}.{DATASET}.split` AS
SELECT
  *,
  CASE
    WHEN MOD(ABS(FARM_FINGERPRINT(CAST(DATE(YEAR, MONTH, DAY) AS STRING))), 10) < 8 THEN 'TRAIN'
    ELSE 'TEST'
  END AS split_col,
  DATE(YEAR, MONTH, DAY) AS flight_date
FROM `{PROJECT_ID}.{DATASET}.flights`;
"""

bq.query(sql_split).result()
print("✅ Split table recreated successfully with reconstructed flight_date.")

✅ Split table recreated successfully with reconstructed flight_date.


In [121]:
# ML.EVALUATE for engineered model with feature transformations
sql_eval_xform = f"""
SELECT *
FROM ML.EVALUATE(
  MODEL `{PROJECT_ID}.{DATASET}.clf_diverted_xform`,
  (
    SELECT
      AIRLINE AS carrier,
      ORIGIN_AIRPORT AS origin,
      DESTINATION_AIRPORT AS dest,
      DISTANCE AS distance,
      DEPARTURE_DELAY AS dep_delay,
      CAST(DIVERTED AS BOOL) AS diverted,
      flight_date
    FROM `{PROJECT_ID}.{DATASET}.split`
    WHERE split_col = 'TEST'
  )
)
"""

bq.query(sql_eval_xform).result().to_dataframe()

Unnamed: 0,precision,recall,accuracy,f1_score,log_loss,roc_auc
0,1.0,0.001917,0.997377,0.003827,0.01729,0.750426


Interpretation: The engineered model shows perfect precision (1.0) but extremely low recall (0.0019): it flags a diversion only when highly confident and misses almost all actual diversions. Accuracy is marginally higher than the baseline (0.9974), driven by the dominance of non-diverted flights, and the F1 score remains very low (0.0038). The ROC AUC improves to 0.75, indicating a modestly better ability to rank diverted flights. Two caveats apply. This evaluation uses the date-hashed `split` table (`split_col = 'TEST'`), while the model was trained on the `RAND()`-based split, so some TEST rows were likely also training rows. For the same reason the numbers are not measured on the same rows as the baseline's EVAL metrics.
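
A side-by-side table makes the baseline-versus-engineered comparison easier to read. The sketch below simply tabulates the two `ML.EVALUATE` rows reported above and their differences; the helper name `comparison_rows` is hypothetical. Since the two rows were computed on different evaluation sets, the deltas are indicative rather than exact.

```python
# Metrics copied from the two ML.EVALUATE outputs in this notebook.
baseline = {"precision": 0.125, "recall": 0.000331, "accuracy": 0.997359,
            "f1_score": 0.00066, "log_loss": 0.017672, "roc_auc": 0.71948}
engineered = {"precision": 1.0, "recall": 0.001917, "accuracy": 0.997377,
              "f1_score": 0.003827, "log_loss": 0.01729, "roc_auc": 0.750426}


def comparison_rows(base: dict, new: dict) -> list:
    """Return (metric, baseline, engineered, delta) tuples in base's key order."""
    return [(name, base[name], new[name], new[name] - base[name]) for name in base]


rows = comparison_rows(baseline, engineered)
print(f"{'metric':<10} {'baseline':>10} {'engineered':>11} {'delta':>10}")
for name, b, e, d in rows:
    print(f"{name:<10} {b:>10.6f} {e:>11.6f} {d:>+10.6f}")
```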


### Write-up (concise)
- **Threshold chosen & ops rationale:** The default 0.5 threshold was the starting point, and a stricter 0.75 threshold was tested to reduce false positives. A higher threshold is preferable in operations when false alerts (FP) are costly, e.g., unnecessary diversion planning. Here, however, 0.5 already yields only 5 false positives and 0.75 yields no positive predictions at all, so neither threshold is usable for detection as tested.
- **Baseline vs engineered (observed changes in AUC/precision/recall):**
  - Baseline: Precision = 0.125, Recall = 0.00033, AUC = 0.719
  - Engineered: Precision = 1.0, Recall = 0.0019, AUC = 0.750
  - Observed change: Precision improved dramatically in the engineered model, recall remains very low, and AUC increased modestly. The model is better at ranking high-risk flights but still misses almost all actual diversions.
- **Risk framing (cost of FP vs FN for diversion planning; acceptable FN-rate):** False positives (unnecessary diversion planning) are operationally costly but less dangerous than false negatives (missed diversions that could affect safety or schedules). Acceptable FN-rate: ideally below 1–2%, but the current FN-rate is above 99% (recall ≈ 0.002), so nearly all diversions are missed. Operational adjustments or further feature engineering may be needed to improve recall without excessively increasing FP.
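
The FP-versus-FN trade-off in the risk framing can be turned into a threshold choice by assigning each error type a cost and picking the cutoff with the lowest total. The sketch below is illustrative only: the scores, labels, candidate thresholds, and the 1:20 cost ratio are made-up values, not results from this notebook, and `pick_threshold` is a hypothetical helper.

```python
from typing import Iterable, List, Tuple


def pick_threshold(scored: List[Tuple[float, bool]],
                   cost_fp: float,
                   cost_fn: float,
                   candidates: Iterable[float]) -> Tuple[float, float, int, int]:
    """Return (threshold, total_cost, fp, fn) minimizing cost_fp*FP + cost_fn*FN."""
    best = None
    for t in candidates:
        fp = sum(1 for score, label in scored if score >= t and not label)
        fn = sum(1 for score, label in scored if score < t and label)
        cost = cost_fp * fp + cost_fn * fn
        if best is None or cost < best[1]:
            best = (t, cost, fp, fn)
    if best is None:
        raise ValueError("candidates must not be empty")
    return best


# Toy (score, is_diverted) pairs; real scores would come from ML.PREDICT.
toy_scored = [(0.02, False), (0.03, False), (0.04, True),
              (0.08, False), (0.12, True), (0.30, True)]

# Assumed costs: a missed diversion is 20x as costly as a false alert.
best = pick_threshold(toy_scored, cost_fp=1, cost_fn=20,
                      candidates=[0.5, 0.1, 0.05, 0.03])
print("threshold, cost, FP, FN:", best)
```

With a high FN cost the selected cutoff drops well below 0.5, which is the direction the confusion matrices above suggest for this model.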



---

## Rubric (Flights, 100 pts)
**Team-only deliverable in this notebook**

- Baseline LOGISTIC_REG + evaluation (AUC + confusion @0.5) — **20**  
- Custom threshold confusion matrix + ops justification — **20**  
- Engineered model with `TRANSFORM` (route, DOW, delay bucket) — **20**  
- Comparison table (baseline vs engineered) + 3–5 sentence interpretation — **20**  
- Reproducibility: parameters clear, no hidden magic; schema mapping documented — **10**  
- Governance notes: assumptions/limitations + slices you would monitor — **10**

> **Strictness:** No screenshots; use actual results cells. Keep explanations concise (bullet points OK).
