
# **Lab 5: Predicting Diversions (Classification with BQML)**
**Unit 2 • Week 8 (Thu) — Classification & Evaluation**

**Objective:** Train and evaluate a **logistic regression** model to classify whether a flight will be **diverted**. Interpret **precision/recall** and the **confusion matrix**, and practice threshold tuning.


## Setup & Authentication

In [1]:

# Authenticate and initialize BigQuery client
from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery
import pandas as pd
import os

# ▶️ Prompt user to enter their GCP Project ID
PROJECT_ID = input("Enter your GCP Project ID: ").strip()
REGION = "us-central1"

# ▶️ Export environment variable for consistency
os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID

# ▶️ Initialize BigQuery client
client = bigquery.Client(project=PROJECT_ID)

print("✅ Authenticated successfully!")
print("Project:", PROJECT_ID)
print("Region:", REGION)

# ▶️ Set active project for gcloud
!gcloud config set project $GOOGLE_CLOUD_PROJECT
!gcloud config get-value project

# ▶️ Example: choose your working table (change if needed)
FULL_TABLE = "mgmt-46700-471119.flights.flights predict"
print("Using table:", FULL_TABLE)

# Done: Auth + Project/Region set


Enter your GCP Project ID: mgmt-46700-471119
✅ Authenticated successfully!
Project: mgmt-46700-471119
Region: us-central1
Updated property [core/project].
mgmt-46700-471119
Using table: mgmt-46700-471119.flights.flights predict



---
## Business Context

> An airline wants to proactively identify flights with a high probability of being **diverted** to better manage logistics and passenger communication.

**Question:** Which is more costly for the airline: a **false positive** (predict diversion, but no diversion) or a **false negative** (fail to predict a diversion that occurs)?  
Write your reasoning below in 4–6 sentences.
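
One way to ground the comparison is to attach a cost to each error type and total them. The sketch below is only an illustration: the dollar figures and the two error profiles are hypothetical placeholders, not airline data, and should be replaced with your own estimates.

```python
# Hypothetical per-flight costs in dollars; replace with your own estimates.
COST_FALSE_POSITIVE = 2_000    # e.g. preparation that turned out to be unnecessary
COST_FALSE_NEGATIVE = 25_000   # e.g. a disruption handled with no preparation


def total_error_cost(false_positives: int, false_negatives: int) -> int:
    """Total cost of a model's mistakes under the assumed unit costs."""
    return (false_positives * COST_FALSE_POSITIVE
            + false_negatives * COST_FALSE_NEGATIVE)


# Two hypothetical models scored on the same set of flights.
cautious_cost = total_error_cost(false_positives=300, false_negatives=20)
lenient_cost = total_error_cost(false_positives=50, false_negatives=60)

print("Cautious model (more FP, fewer FN):", cautious_cost)
print("Lenient model (fewer FP, more FN):", lenient_cost)
```

Under these assumed costs, the model with more false positives is still cheaper overall, because each false negative is priced much higher. Changing the two constants shows how the answer depends on the cost ratio.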



---
## Train a Classification Model (LOGISTIC_REG)

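Use BQML to train a **logistic regression** model that predicts whether a flight departs more than 15 minutes late, a binary label derived from `dep_delay`, using a few features (`distance`, `carrier`, `origin`, `dest`). The working table has no diversion column, so this delay label stands in for the diversion target.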


In [3]:
query = f"""
SELECT *
FROM `{FULL_TABLE}`
LIMIT 5
"""
df = client.query(query).to_dataframe()
print(df.columns.tolist())

['id', 'year', 'month', 'day', 'dep_time', 'sched_dep_time', 'dep_delay', 'arr_time', 'sched_arr_time', 'arr_delay', 'carrier', 'flight', 'tailnum', 'origin', 'dest', 'air_time', 'distance', 'hour', 'minute', 'time_hour', 'name']


In [10]:
model_id = f"{PROJECT_ID}.flights.flight_delay_predictor"

create_model_sql = f"""
CREATE OR REPLACE MODEL `{model_id}`
OPTIONS(
  model_type='logistic_reg',
  input_label_cols=['delayed_label'],
  enable_global_explain=TRUE
) AS
SELECT
  CASE WHEN dep_delay > 15 THEN TRUE ELSE FALSE END AS delayed_label,
  CAST(distance  AS FLOAT64) AS distance,
  CAST(carrier   AS STRING)  AS carrier,
  CAST(origin    AS STRING)  AS origin,
  CAST(dest      AS STRING)  AS dest
FROM `{FULL_TABLE}`
WHERE dep_delay IS NOT NULL
LIMIT 500000;
"""
# Submit the training job and block until it finishes
job = client.query(create_model_sql)
job.result()
print("Model created:", model_id)

Model created: mgmt-46700-471119.flights.flight_delay_predictor



---
## Evaluate with `ML.EVALUATE` — Validate

Get **precision**, **recall**, **log_loss**, and other metrics. Also compute a **confusion matrix**.


In [12]:
eval_sql = f"SELECT * FROM ML.EVALUATE(MODEL `{model_id}`)"
eval_df = client.query(eval_sql).result().to_dataframe()
eval_df

Unnamed: 0,mean_absolute_error,mean_squared_error,mean_squared_log_error,median_absolute_error,r2_score,explained_variance
0,11.481701,257.534469,0.963442,8.880821,0.877425,0.877426


In [13]:
cm_sql = f"""
WITH labeled AS (
  SELECT *,
         CASE WHEN dep_delay > 15 THEN TRUE ELSE FALSE END AS delayed_label
  FROM `{FULL_TABLE}`
  WHERE dep_delay IS NOT NULL
  LIMIT 200000
)
SELECT *
FROM ML.CONFUSION_MATRIX(MODEL `{model_id}`, TABLE labeled)
"""
cm_df = client.query(cm_sql).result().to_dataframe()
cm_df

Unnamed: 0,expected_label,FALSE,TRUE
0,False,154665,53
1,True,45236,46
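
To connect the confusion matrix to precision and recall, the counts from the output above can be plugged into the standard definitions. This is a minimal sketch: the helper name `classification_metrics` is introduced here for illustration, and the counts are copied by hand from the table rather than read from `cm_df`.

```python
# Counts copied from the confusion matrix output above
# (rows = actual label, columns = predicted label).
tn, fp = 154665, 53   # actual False row: predicted FALSE, predicted TRUE
fn, tp = 45236, 46    # actual True row:  predicted FALSE, predicted TRUE


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, and accuracy from raw confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy}


metrics = classification_metrics(tp, fp, fn, tn)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy is about 77% while recall is about 0.1%: the model labels almost every flight as not delayed, so accuracy alone gives a misleading picture.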



**Gemini Explainer Prompt:**

```python
prompt = """
# TASK: Explain a confusion matrix in a specific business context.
# CONTEXT: I'm predicting flight delays where 'delayed' means dep_delay > 15 minutes.
# The confusion matrix from ML.CONFUSION_MATRIX shows:
#   True Positives (TP) = 46
#   False Positives (FP) = 53
#   False Negatives (FN) = 45236
#   True Negatives (TN) = 154665
# GOAL: Explain what 'false positives' and 'false negatives' mean in plain English for an airline operations manager.
# Additionally, interpret what the model is good at, and where it might be weak.
"""
```
Paste your explanation below.

- **True Positives (TP):** Flights the model correctly predicted as delayed.
- **False Positives (FP):** Flights predicted as delayed that were not delayed. These may lead to unnecessary resource allocation.
- **False Negatives (FN):** Delayed flights the model failed to flag. These hurt passenger experience and incur higher costs.
- **True Negatives (TN):** Flights the model correctly predicted as not delayed.


---
## Threshold Tuning

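By default, `ML.PREDICT` uses a classification threshold of **0.5**. Raising it to 0.75 means the model flags a flight only when it is more confident, which typically lowers false positives (FP) and raises false negatives (FN).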

> **Task:** Author your own Gemini prompt asking for an `ML.PREDICT` example that uses **`STRUCT(0.75 AS threshold)`** and explains when/why an airline might pick a higher threshold.
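
The FP/FN trade-off can be seen without BigQuery by applying two thresholds to a handful of scored flights. The sketch below uses made-up probabilities and outcomes purely for illustration. Treating `prob >= threshold` as a positive prediction is an assumption of this sketch, and `count_errors` is a hypothetical helper, not a BQML function.

```python
# Hypothetical (probability_of_delay, actually_delayed) pairs, for illustration only.
scored_flights = [
    (0.95, True), (0.80, True), (0.70, False), (0.65, True),
    (0.55, False), (0.40, True), (0.30, False), (0.10, False),
]


def count_errors(scored, threshold):
    """Return (false_positives, false_negatives) at the given threshold.

    Assumption: a flight is predicted delayed when prob >= threshold.
    """
    fp = sum(1 for prob, actual in scored if prob >= threshold and not actual)
    fn = sum(1 for prob, actual in scored if prob < threshold and actual)
    return fp, fn


for threshold in (0.5, 0.75):
    fp, fn = count_errors(scored_flights, threshold)
    print(f"threshold={threshold}: FP={fp}, FN={fn}")
```

On this toy data, moving from 0.5 to 0.75 removes the false positives but misses one more delayed flight. An airline might accept that trade when acting on a false alarm is expensive.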


In [15]:
# Example scaffold: predictions with a higher threshold (0.75)
pred_sql = f"""
SELECT *
FROM ML.PREDICT(
  MODEL `{model_id}`,
  (SELECT
     2000.0 AS distance,
     30.0   AS dep_delay,  -- not a model feature; passed through to the output unchanged
     'AA'   AS carrier,
     'JFK'  AS origin,
     'LAX'  AS dest
  ),
  STRUCT(0.75 AS threshold)
)
"""
pred_df = client.query(pred_sql).result().to_dataframe()
pred_df

Unnamed: 0,predicted_delayed_label,predicted_delayed_label_probs,distance,dep_delay,carrier,origin,dest
0,False,"[{'label': True, 'prob': 0.1646973105751587}, ...",2000.0,30.0,AA,JFK,LAX
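
The `predicted_delayed_label_probs` column holds a list of label/probability entries, so the effect of a threshold can be checked by hand. The sketch below is a hedged illustration: it mirrors only the `True` entry visible in the output above, the helper names are hypothetical, and treating `prob >= threshold` as positive is an assumption.

```python
# Only the entry visible in the output above is reproduced here.
probs = [{"label": True, "prob": 0.1646973105751587}]


def positive_probability(entries, positive_label=True):
    """Return the probability attached to the positive label."""
    for entry in entries:
        if entry["label"] == positive_label:
            return entry["prob"]
    raise ValueError("positive label not found in entries")


def label_at(entries, threshold):
    """Predicted label under the assumption that prob >= threshold means True."""
    return positive_probability(entries) >= threshold


for threshold in (0.1, 0.5, 0.75):
    print(f"threshold={threshold}: predicted_delayed_label={label_at(probs, threshold)}")
```

For this flight the probability of delay is about 0.16, so the prediction is `False` at both 0.5 and 0.75. The threshold would have to drop below that probability before the label flips.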



---
## ✅ Deliverable for Lab 5

- Completed `Lab5_Classification_BQML.ipynb` with:
  - Business context write-up (FP vs FN)
  - `CREATE MODEL` SQL
  - `ML.EVALUATE` + `ML.CONFUSION_MATRIX` outputs and explanations
  - Threshold tuning example (`STRUCT(0.75 AS threshold)`)
- Push to **GitHub** and submit the link on **Brightspace**.
