
# **Lab 5: Predicting Diversions (Classification with BQML)**
**Unit 2 • Week 8 (Thu) — Classification & Evaluation**

**Objective:** Train and evaluate a **logistic regression** model to classify whether a flight will be **diverted**. Interpret **precision/recall** and the **confusion matrix**, and practice threshold tuning.


## Setup & Authentication

In [26]:

from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery
import pandas as pd

PROJECT_ID = "sapient-office-471119-g4"
FULL_TABLE = "sapient-office-471119-g4.superstore_data.airdash_base_enriched"  # update if needed

client = bigquery.Client(project=PROJECT_ID)
print("Authenticated. Project:", PROJECT_ID)
print("Using table:", FULL_TABLE)


Authenticated. Project: sapient-office-471119-g4
Using table: sapient-office-471119-g4.superstore_data.airdash_base_enriched


In [27]:
check_table_sql = f"SELECT 1 FROM `{FULL_TABLE}` LIMIT 1"

try:
  client.query(check_table_sql).result()
  print(f"Table '{FULL_TABLE}' exists and is accessible.")
except Exception as e:
  print(f"Error accessing table '{FULL_TABLE}': {e}")
  print("It's possible the dataset or table no longer exists, or there's still an access issue.")

Table 'sapient-office-471119-g4.superstore_data.airdash_base_enriched' exists and is accessible.



---
## Business Context

> An airline wants to proactively identify flights with a high probability of being **diverted** to better manage logistics and passenger communication.

**Question:** Which is more costly for the airline: a **false positive** (predict diversion, but no diversion) or a **false negative** (fail to predict a diversion that occurs)?  
Write your reasoning below in 4–6 sentences.


**False positive:** The model predicts a diversion, but it doesn't happen. The airline might incur costs from unnecessary preparations (e.g., arranging alternative gates, notifying passengers prematurely, re-routing crew) and potentially cause passenger anxiety.

**False negative:** The model fails to predict a diversion that does happen. This could lead to significant disruptions, including stranded passengers, missed connections, emergency re-routing at the last minute (which can be more expensive), and damage to the airline's reputation due to poor communication and handling of the situation.

In most airline scenarios, a false negative (a missed diversion prediction) is likely more costly due to the operational chaos, passenger dissatisfaction, and potential safety implications of being unprepared for an unexpected diversion. While false positives are not ideal and incur some costs, they are generally less impactful than the fallout from a true diversion without prior warning.
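
The comparison above can be made concrete with a small expected-cost calculation. The sketch below is illustrative only: the per-error dollar amounts and the error counts are hypothetical placeholders, not figures from the dataset or from any airline.

```python
# Hypothetical per-error costs in USD (assumptions for illustration only).
COST_FALSE_POSITIVE = 2_000    # unnecessary preparation for a diversion that never happens
COST_FALSE_NEGATIVE = 50_000   # unplanned diversion handled without warning


def total_error_cost(false_positives: int, false_negatives: int) -> int:
    """Return the total cost of a model's mistakes under the assumed unit costs."""
    return (false_positives * COST_FALSE_POSITIVE
            + false_negatives * COST_FALSE_NEGATIVE)


# Two hypothetical models evaluated on the same set of flights.
cautious = total_error_cost(false_positives=300, false_negatives=10)
lenient = total_error_cost(false_positives=20, false_negatives=100)

print(f"Cautious model (more false alarms): ${cautious:,}")
print(f"Lenient model (more missed diversions): ${lenient:,}")
```

Under these assumed costs, the model that raises more false alarms is still the cheaper one, which is consistent with treating a missed diversion as the more expensive error. With different cost assumptions the conclusion could flip, so the unit costs are the part worth debating.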




---
## Train a Classification Model (LOGISTIC_REG)

Use BQML to train a **logistic regression** model predicting `diverted` using a few features.


In [33]:
model_id = f"{PROJECT_ID}.superstore_data.flight_diverted_classifier"

create_model_sql = f"""
CREATE OR REPLACE MODEL `{model_id}`
OPTIONS(
  model_type='logistic_reg',
  input_label_cols=['diverted'],
  enable_global_explain=TRUE
) AS
SELECT
  CAST(diverted AS BOOL)    AS diverted,
  CAST(distance_miles AS FLOAT64) AS distance_miles,
  CAST(dep_delay_min AS FLOAT64) AS dep_delay_min,
  CAST(carrier  AS STRING)  AS carrier
FROM `{FULL_TABLE}`
WHERE diverted IS NOT NULL;
"""
print(f"Using table path: {FULL_TABLE}")
job = client.query(create_model_sql)
job.result()  # block until training finishes
print("Model created:", model_id)

Using table path: sapient-office-471119-g4.superstore_data.airdash_base_enriched
Model created: sapient-office-471119-g4.superstore_data.flight_diverted_classifier
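
Diversions are rare events, so a model trained on the raw rows may learn to predict "not diverted" for every flight. One option BigQuery ML offers for imbalanced labels is `auto_class_weights=TRUE`, which weights each class in inverse proportion to its frequency. The sketch below only builds the SQL string. The helper name `build_weighted_model_sql`, the `_weighted` model suffix, and the `my-project.my_dataset` identifiers are hypothetical, and whether the option actually improves this model has not been tested here.

```python
def build_weighted_model_sql(model_id: str, full_table: str) -> str:
    """Build a CREATE MODEL statement that asks BQML to balance class weights."""
    return f"""
CREATE OR REPLACE MODEL `{model_id}`
OPTIONS(
  model_type='logistic_reg',
  input_label_cols=['diverted'],
  auto_class_weights=TRUE
) AS
SELECT
  CAST(diverted AS BOOL)          AS diverted,
  CAST(distance_miles AS FLOAT64) AS distance_miles,
  CAST(dep_delay_min AS FLOAT64)  AS dep_delay_min,
  CAST(carrier AS STRING)         AS carrier
FROM `{full_table}`
WHERE diverted IS NOT NULL
"""


# Hypothetical identifiers; substitute your own project, dataset, and table.
sql = build_weighted_model_sql(
    "my-project.my_dataset.flight_diverted_classifier_weighted",
    "my-project.my_dataset.airdash_base_enriched",
)
print(sql)
# To train, pass the string to the BigQuery client: client.query(sql).result()
```

Writing the weighted model under a separate name keeps the unweighted one available, so the two can be compared with the same evaluation queries.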


In [32]:
schema_sql = f"SELECT * FROM `{FULL_TABLE}` LIMIT 0"
schema_df = client.query(schema_sql).result().to_dataframe()
display(schema_df.columns)

Index(['date', 'month', 'carrier', 'origin', 'dest', 'route', 'distance_miles',
       'seats', 'passengers', 'dep_delay_min', 'arr_delay_min', 'cancelled',
       'diverted', 'revenue_usd', 'cost_usd', 'fuel_burn_gal', 'nps',
       'on_time', 'load_factor_pct_row', 'yield_usd_per_pax_mile', 'asm',
       'casm', 'operated_flight', 'margin_usd'],
      dtype='object')


---
## Evaluate with `ML.EVALUATE` — Validate

Get **precision**, **recall**, **log_loss**, and other metrics. Also compute a **confusion matrix**.


In [34]:

eval_sql = f"SELECT * FROM ML.EVALUATE(MODEL `{model_id}`)"
eval_df = client.query(eval_sql).result().to_dataframe()
eval_df


|   | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|-----------|--------|----------|----------|----------|---------|
| 0 | 0.0 | 0.0 | 0.995119 | 0.0 | 0.030853 | 0.498643 |


In [35]:

# Confusion matrix at default threshold
cm_sql = f"""
SELECT *
FROM ML.CONFUSION_MATRIX(MODEL `{model_id}`,
  (SELECT * FROM `{FULL_TABLE}` WHERE diverted IS NOT NULL LIMIT 200000))
"""
cm_df = client.query(cm_sql).result().to_dataframe()
cm_df


|   | expected_label | FALSE | TRUE |
|---|----------------|-------|------|
| 0 | 0 | 96062 | 0 |
| 1 | 1 | 462 | 0 |
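
The link between the confusion matrix and the headline metrics can be checked by hand. The sketch below recomputes accuracy, precision, and recall from the four counts in the matrix above. The helper name `metrics_from_counts` is illustrative, and it follows the common convention of reporting 0.0 when a ratio's denominator is zero.

```python
def metrics_from_counts(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tn + fp + fn + tp
    # Precision: of the flights predicted as diverted, how many really were.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the flights that really diverted, how many were caught.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
    }


# Counts read from the confusion matrix: rows are actual labels, columns are predictions.
m = metrics_from_counts(tn=96062, fp=0, fn=462, tp=0)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

The model never predicts a diversion, so it catches none of the 462 diverted flights even though accuracy is above 99%. A rule that always answers "not diverted" would score the same, which is why accuracy alone is a poor guide on rare-event data. The accuracy computed here differs slightly from the `ML.EVALUATE` value, most likely because the two queries score different sets of rows.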



**Gemini Explainer Prompt:**

```python
prompt = "Write Python code using BigQuery ML that evaluates a classification model and shows both its evaluation metrics and confusion matrix. Use ML.EVALUATE to get precision, recall, accuracy, f1_score, log_loss, and roc_auc, then use ML.CONFUSION_MATRIX at the default threshold to display true/false predictions against expected labels. Return both results as pandas DataFrames."
```
Paste your explanation below.


This prompt works because it tells the AI exactly what to generate: Python code that runs model-evaluation and confusion-matrix queries in BigQuery ML. It names the key functions (`ML.EVALUATE` and `ML.CONFUSION_MATRIX`), lists the metrics to return, and specifies the output format (pandas DataFrames). That level of detail makes it more likely that the AI returns code matching the two cells above.


---
## Threshold Tuning

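By default, `ML.PREDICT` uses a classification threshold of **0.5** for binary models. Raising it to 0.75 means a flight is flagged as diverted only when the model is more confident, which typically lowers false positives (FP) and raises false negatives (FN).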

> **Task:** Author your own Gemini prompt asking for an `ML.PREDICT` example that uses **`STRUCT(0.75 AS threshold)`** and explains when/why an airline might pick a higher threshold.
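
To see why the threshold matters, it can help to sweep it over a handful of scored flights and count the errors at each setting. The sketch below uses made-up probabilities and outcomes (not output from the model) and a hypothetical helper `count_errors`. It treats a probability above the threshold as a predicted diversion.

```python
# Made-up (probability of diversion, actual outcome) pairs for illustration.
scored_flights = [
    (0.95, True), (0.80, True), (0.60, True), (0.30, True),
    (0.85, False), (0.65, False), (0.40, False), (0.10, False),
]


def count_errors(scored, threshold):
    """Count false positives and false negatives at a given threshold."""
    # False positive: flagged as diverted, but the flight did not divert.
    fp = sum(1 for prob, actual in scored if prob > threshold and not actual)
    # False negative: not flagged, but the flight did divert.
    fn = sum(1 for prob, actual in scored if prob <= threshold and actual)
    return fp, fn


for threshold in (0.25, 0.50, 0.75):
    fp, fn = count_errors(scored_flights, threshold)
    print(f"threshold={threshold:.2f}  false_positives={fp}  false_negatives={fn}")
```

On this toy data, each step up in the threshold removes one false alarm and adds one missed diversion. An airline might accept that trade when acting on a prediction is expensive, for example when it means repositioning crew, and prefer a lower threshold when a missed diversion is the costlier mistake.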


In [37]:
# Example scaffold: predictions with a higher threshold (0.75)
pred_sql = f"""
SELECT *
FROM ML.PREDICT(
  MODEL `{model_id}`,
  (SELECT
     2000.0 AS distance_miles,
     30.0   AS dep_delay_min,
     'AA'   AS carrier
  ),
  STRUCT(0.75 AS threshold)
)
"""
pred_df = client.query(pred_sql).result().to_dataframe()
pred_df

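|   | predicted_diverted | predicted_diverted_probs | distance_miles | dep_delay_min | carrier |
|---|--------------------|--------------------------|----------------|---------------|---------|
| 0 | False | [{'label': True, 'prob': 0.00509451291516582},... | 2000.0 | 30.0 | AA |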
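
The `predicted_diverted_probs` column holds one `label`/`prob` entry per class, so comparing a prediction against different cutoffs means first pulling out the probability for the `True` label. A minimal sketch follows, with a hypothetical helper `positive_prob` and a hand-built row. The `False` entry in that row is assumed to be the complement of the `True` probability; it is not copied from the model output.

```python
def positive_prob(probs, positive_label=True):
    """Return the probability assigned to positive_label, or None if absent."""
    for entry in probs:
        if entry["label"] == positive_label:
            return entry["prob"]
    return None


# Hand-built example row; the False entry is an assumed complement, not model output.
p_true = 0.00509451291516582
row_probs = [
    {"label": True, "prob": p_true},
    {"label": False, "prob": 1 - p_true},
]

prob = positive_prob(row_probs)
for threshold in (0.5, 0.75):
    print(f"threshold={threshold}: predicted_diverted={prob > threshold}")
```

At roughly 0.5% probability, this flight is classified as not diverted at both 0.5 and 0.75, so for this row the higher threshold changes nothing. Assuming the DataFrame column contains lists of such entries, the same helper could be applied row by row with `pred_df["predicted_diverted_probs"].apply(positive_prob)`.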



---
## ✅ Deliverable for Lab 5

- Completed `Lab5_Classification_BQML.ipynb` with:
  - Business context write-up (FP vs FN)
  - `CREATE MODEL` SQL
  - `ML.EVALUATE` + `ML.CONFUSION_MATRIX` outputs and explanations
  - Threshold tuning example (`STRUCT(0.75 AS threshold)`)
- Push to **GitHub** and submit the link on **Brightspace**.
