

# Lab 5 — Predicting Diversions (Classification with **BQML**)

**Objective:** Train and evaluate a **logistic regression** model in **BigQuery ML** to classify whether a flight will be **diverted** (TRUE/FALSE).  
**Business context:** Proactively flag likely diversions to manage logistics and passenger communication.


## 1) Setup

In [None]:

# If running in Colab:
# from google.colab import auth
# auth.authenticate_user()

!pip -q install -U google-cloud-bigquery pandas-gbq google-cloud-bigquery-storage

from google.cloud import bigquery
import pandas as pd

print("BigQuery libraries installed.")


## 2) Configuration

In [None]:

# >>> EDIT THESE <<<
PROJECT_ID = "confident-nexus-475419-k9"
DATASET_ID = "superstore_data"     # Use an existing dataset or create one
LOCATION   = "US"                  # Match your dataset location
MODEL_ID   = "flight_diversion_classifier"

MODEL_FQN = f"`{PROJECT_ID}.{DATASET_ID}.{MODEL_ID}`"

client = bigquery.Client(project=PROJECT_ID, location=LOCATION)

print("Project:", PROJECT_ID)
print("Dataset:", DATASET_ID)
print("Model:", MODEL_FQN)



## 3) Business Question & Trade-offs

**Question:** Which is more costly for the airline — a **false positive** (predicting a diversion that doesn’t happen) or a **false negative** (missing a diversion that does happen)?

**Answer (document your reasoning):**  
*False negatives* usually cost more: if a real diversion isn’t predicted, the airline may have **no staff/equipment in place**, leading to **delays, missed connections, hotel/meal vouchers**, and reputational harm. *False positives* also have costs (unnecessary preparations/notifications), but they’re typically **less disruptive** than being surprised by an actual diversion.
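
One way to make this trade-off concrete is to attach a cost to each kind of error and compare totals. The sketch below is illustrative only: the dollar figures, the error counts, and the helper name `expected_error_cost` are assumptions, not values from the dataset or from any airline.

```python
def expected_error_cost(false_positives, false_negatives, cost_fp, cost_fn):
    """Total cost of classification errors under fixed per-error costs."""
    return false_positives * cost_fp + false_negatives * cost_fn


# Hypothetical per-error costs (assumptions for illustration only).
COST_FALSE_POSITIVE = 2_000    # unnecessary preparations and notifications
COST_FALSE_NEGATIVE = 50_000   # unplanned diversion: vouchers, rebooking, delays

# Hypothetical error counts for two candidate decision thresholds.
lower_threshold = expected_error_cost(50, 5, COST_FALSE_POSITIVE, COST_FALSE_NEGATIVE)
higher_threshold = expected_error_cost(20, 9, COST_FALSE_POSITIVE, COST_FALSE_NEGATIVE)

print("Lower threshold cost:", lower_threshold)
print("Higher threshold cost:", higher_threshold)
```

Under these assumed costs, the option with more false alarms but fewer missed diversions is cheaper overall. With different cost assumptions the comparison could go the other way, so the per-error costs should come from the business, not from the model.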


## 4) Train a Classification Model

In [None]:

prompt = """
# TASK: Generate a BQML query to create a classification model.
# GOAL: Predict the 'diverted' column using 'distance', 'carrier', and 'dep_delay' as features.
# SPECIFICATIONS: The model type must be 'LOGISTIC_REG'.
"""
print(prompt)


In [None]:

create_model_sql = f"""
CREATE OR REPLACE MODEL {MODEL_FQN}
OPTIONS(
  model_type='LOGISTIC_REG',
  input_label_cols=['diverted_label'],
  enable_global_explain=TRUE
) AS
SELECT
  CAST(diverted AS BOOL) AS diverted_label,
  distance,
  carrier,
  dep_delay
FROM `bigquery-public-data.airline_ontime_data.flights`
WHERE diverted IS NOT NULL
  AND distance IS NOT NULL
  AND dep_delay IS NOT NULL
LIMIT 300000
"""
print(create_model_sql)
job = client.query(create_model_sql)
job.result()  # Block until training finishes; raises if the job fails
print("Model created:", job.job_id)


## 5) Evaluate with ML.EVALUATE (+ Confusion Matrix)

In [None]:

evaluate_sql = f"""
SELECT *
FROM ML.EVALUATE(MODEL {MODEL_FQN})
"""
print(evaluate_sql)
eval_df = client.query(evaluate_sql).result().to_dataframe()
eval_df


In [None]:

# Confusion matrix at default threshold 0.5
cm_05_sql = f"""
SELECT *
FROM ML.CONFUSION_MATRIX(MODEL {MODEL_FQN}, STRUCT(0.5 AS threshold))
"""
print(cm_05_sql)
cm_05 = client.query(cm_05_sql).result().to_dataframe()
cm_05


In [None]:

# Confusion matrix at threshold 0.75
cm_075_sql = f"""
SELECT *
FROM ML.CONFUSION_MATRIX(MODEL {MODEL_FQN}, STRUCT(0.75 AS threshold))
"""
print(cm_075_sql)
cm_075 = client.query(cm_075_sql).result().to_dataframe()
cm_075


### Explainer Prompt (Confusion Matrix)

In [None]:

prompt = """
# TASK: Explain a confusion matrix in a specific business context.
# CONTEXT: I'm predicting flight diversions. My confusion matrix from ML.CONFUSION_MATRIX shows (example) 10 true positives, 50 false positives, 5 false negatives, and 10000 true negatives.
# GOAL: Explain what 'false positives' and 'false negatives' represent in plain English for an airline operations manager. Based on my numbers, what is my model good at, and where is it weak?
"""
print(prompt)



**Example interpretation:**  
- **False positives** = we prepared for a diversion that didn’t happen (extra cost, but manageable).  
- **False negatives** = a diversion occurred without warning (high disruption).  

If false negatives are low and true negatives are very high, the model is **good at identifying non-diversions**. If false positives are elevated, it tends to **over-warn**, which may be acceptable when the business prioritizes avoiding surprise diversions.
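
To ground the interpretation, the standard metrics can be computed directly from the four example counts used in the prompt (10 TP, 50 FP, 5 FN, 10000 TN). This is a minimal sketch; the helper name `confusion_metrics` is hypothetical and the counts are the illustrative ones from the prompt, not real model output.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Derive common classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        # Of all flights flagged as diversions, how many really diverted?
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of all real diversions, how many did the model catch?
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        # Of all non-diverted flights, how many were correctly left unflagged?
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Example counts taken from the prompt above (illustrative, not real results).
metrics = confusion_metrics(tp=10, fp=50, fn=5, tn=10_000)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, precision is 10/60 (about 0.17) and recall is 10/15 (about 0.67), while accuracy is above 0.99. The high accuracy mostly reflects how rare diversions are, which is why precision and recall are the more informative numbers here.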


## 6) Challenge — Custom Threshold with ML.PREDICT

In [None]:

challenge_prompt = """
# TASK: Help me write an ML.PREDICT query in BigQuery ML that uses a custom threshold of 0.75.
# CONTEXT: I'm predicting 'diverted' using a logistic regression model.
# GOAL: Return the predicted label and class probabilities (for this model, the columns predicted_diverted_label and predicted_diverted_label_probs) with threshold=0.75, and explain why an airline might choose a higher threshold for predicting diversions.
"""
print(challenge_prompt)


In [None]:

predict_075_sql = f"""
SELECT *
FROM ML.PREDICT(
  MODEL {MODEL_FQN},
  (
    SELECT distance, carrier, dep_delay
    FROM `bigquery-public-data.airline_ontime_data.flights`
    WHERE diverted IS NOT NULL AND distance IS NOT NULL AND dep_delay IS NOT NULL
    LIMIT 1000
  ),
  STRUCT(0.75 AS threshold)
)
"""
print(predict_075_sql)
pred_df = client.query(predict_075_sql).result().to_dataframe()
pred_df.head()
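
The effect of the threshold can also be sketched locally, without re-running the query. The example below assumes each row's per-class probabilities arrive as a list of `{"label", "prob"}` mappings, which is the shape BQML uses for its `predicted_<label>_probs` column. The helper names and sample rows are hypothetical, and treating a probability exactly equal to the threshold as positive is an assumption.

```python
def positive_probability(probs, positive_label=True):
    """Return the probability assigned to the positive class.

    `probs` is assumed to be an iterable of {"label": ..., "prob": ...}
    mappings, one entry per class.
    """
    for entry in probs:
        if entry["label"] == positive_label:
            return entry["prob"]
    raise ValueError(f"label {positive_label!r} not found in {probs!r}")


def apply_threshold(probs, threshold=0.75, positive_label=True):
    """Predict the positive class only when its probability meets the threshold."""
    return positive_probability(probs, positive_label) >= threshold


# Hypothetical rows standing in for pred_df["predicted_diverted_label_probs"].
sample_rows = [
    [{"label": True, "prob": 0.80}, {"label": False, "prob": 0.20}],
    [{"label": True, "prob": 0.60}, {"label": False, "prob": 0.40}],
    [{"label": True, "prob": 0.10}, {"label": False, "prob": 0.90}],
]

at_050 = [apply_threshold(row, threshold=0.50) for row in sample_rows]
at_075 = [apply_threshold(row, threshold=0.75) for row in sample_rows]
print("threshold 0.50:", at_050)
print("threshold 0.75:", at_075)
```

Raising the threshold from 0.50 to 0.75 flips the middle row from "diverted" to "not diverted": fewer false alarms, at the risk of more missed diversions. If the labels in the real output are not Python booleans, pass the matching value as `positive_label`.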
