
# **Lab 5: Predicting Diversions (Classification with BQML)**
**Unit 2 • Week 8 (Thu) — Classification & Evaluation**

**Objective:** Train and evaluate a **logistic regression** model to classify whether a flight will be **diverted**. Interpret **precision/recall** and the **confusion matrix**, and practice threshold tuning.


## Setup & Authentication

In [None]:
from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery
import pandas as pd

PROJECT_ID = "imposing-coast-442802-a7"
FULL_TABLE = "bigquery-public-data.new_york_citibike.citibike_trips"  # update if needed

client = bigquery.Client(project=PROJECT_ID)
print("Authenticated. Project:", PROJECT_ID)
print("Using table:", FULL_TABLE)

Authenticated. Project: imposing-coast-442802-a7
Using table: bigquery-public-data.new_york_citibike.citibike_trips


In [None]:
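# Requires the `bqml_tutorial` dataset; if it does not exist yet, run the CREATE SCHEMA cell further down first.
IMPROVED_MODEL = f"{PROJECT_ID}.bqml_tutorial.usertype_is_customer_classifier"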

create_v2_sql = f"""
CREATE OR REPLACE MODEL `{IMPROVED_MODEL}`
OPTIONS(
  model_type='logistic_reg',
  input_label_cols=['is_customer'],
  enable_global_explain=TRUE
) TRANSFORM (
  -- pass-through originals
  CAST(tripduration AS FLOAT64) AS tripduration,
  CAST(starttime AS TIMESTAMP) AS starttime,
  CAST(stoptime AS TIMESTAMP) AS stoptime,
  CAST(start_station_id AS STRING) AS start_station_id,
  CAST(end_station_id AS STRING) AS end_station_id,
  CAST(bikeid AS STRING) AS bikeid,
  -- usertype is not passed through as a feature: the label is derived from it, so it would leak the answer
  CAST(birth_year AS INT64) AS birth_year,
  CAST(gender AS STRING) AS gender,
  -- engineered
  CONCAT(CAST(start_station_id AS STRING), '-', CAST(end_station_id AS STRING)) AS route,
  EXTRACT(DAYOFWEEK FROM CAST(starttime AS DATE)) AS day_of_week,
  -- create the label column
  usertype = 'Customer' AS is_customer
) AS
SELECT
  tripduration,
  starttime,
  stoptime,
  start_station_id,
  end_station_id,
  bikeid,
  usertype,
  birth_year,
  gender
FROM `{FULL_TABLE}`
WHERE usertype IS NOT NULL
LIMIT 600000;
"""

job = client.query(create_v2_sql)
job.result()  # block until training finishes
print("Improved model created:", IMPROVED_MODEL)
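
For intuition about the two engineered features in the `TRANSFORM` clause, the sketch below recomputes `route` and `day_of_week` for a single trip in plain Python. The sample station IDs and start time are hypothetical, and the helper names `make_route` and `bq_day_of_week` are illustrative only. It assumes BigQuery's `DAYOFWEEK` convention of 1 = Sunday through 7 = Saturday.

```python
from datetime import datetime


def make_route(start_station_id, end_station_id):
    """Mirror CONCAT(CAST(start AS STRING), '-', CAST(end AS STRING))."""
    return f"{start_station_id}-{end_station_id}"


def bq_day_of_week(ts):
    """Mirror EXTRACT(DAYOFWEEK ...): 1 = Sunday ... 7 = Saturday."""
    # isoweekday() gives Monday=1 ... Sunday=7, so shift Sunday to the front.
    return ts.isoweekday() % 7 + 1


# Hypothetical trip, not taken from the table.
trip_start = datetime(2024, 1, 7, 8, 30)  # a Sunday
route = make_route(519, 497)
day_of_week = bq_day_of_week(trip_start)
print(route, day_of_week)
```

Because `route` combines two station IDs, it is a high-cardinality categorical feature, which is worth keeping in mind when reading the model's global explanations.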


---
## Business Context

> An airline wants to proactively identify flights with a high probability of being **diverted** to better manage logistics and passenger communication.

**Question:** Which is more costly for the airline: a **false positive** (predict diversion, but no diversion) or a **false negative** (fail to predict a diversion that occurs)?  
Write your reasoning below in 4–6 sentences.


In [None]:
create_dataset_sql = f"""
CREATE SCHEMA IF NOT EXISTS `{PROJECT_ID}.bqml_tutorial`
OPTIONS(
    location="US"
);
"""
job = client.query(create_dataset_sql)
job.result()  # block until the dataset exists
print(f"Dataset {PROJECT_ID}.bqml_tutorial created or already exists.")

Dataset imposing-coast-442802-a7.bqml_tutorial created or already exists.



---
## Train a Classification Model (LOGISTIC_REG)

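Use BQML to train a **logistic regression** model using a few features. The cell below trains on the Citi Bike trips table, so its label is `usertype` rather than `diverted`, with `tripduration`, `birth_year`, and `gender` as features.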


In [None]:
model_id = f"{PROJECT_ID}.bqml_tutorial.usertype_classifier"

create_model_sql = f"""
CREATE OR REPLACE MODEL `{model_id}`
OPTIONS(
  model_type='logistic_reg',
  input_label_cols=['usertype'],
  enable_global_explain=TRUE
) AS
SELECT
  CAST(usertype AS STRING)    AS usertype,
  CAST(tripduration AS FLOAT64) AS tripduration,
  CAST(birth_year AS FLOAT64) AS birth_year,
  CAST(gender AS STRING)  AS gender
FROM `{FULL_TABLE}`
WHERE usertype IS NOT NULL AND tripduration IS NOT NULL AND birth_year IS NOT NULL AND gender IS NOT NULL
LIMIT 500000;
"""
job = client.query(create_model_sql)
job.result()  # block until training finishes
print("Model created:", model_id)

Model created: imposing-coast-442802-a7.bqml_tutorial.usertype_classifier


In [None]:
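# INFORMATION_SCHEMA.COLUMNS is dataset-scoped, so this lists columns for every table
# in new_york_citibike; add WHERE table_name = 'citibike_trips' to see only the trips table.
query = "SELECT column_name FROM `bigquery-public-data.new_york_citibike`.INFORMATION_SCHEMA.COLUMNS"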
columns_df = client.query(query).result().to_dataframe()
print("Columns in the table:")
display(columns_df)

Columns in the table:


| | column_name |
|---|---|
| 0 | station_id |
| 1 | name |
| 2 | short_name |
| 3 | latitude |
| 4 | longitude |
| 5 | region_id |
| 6 | rental_methods |
| 7 | capacity |
| 8 | eightd_has_key_dispenser |
| 9 | num_bikes_available |



---
## Evaluate with `ML.EVALUATE` — Validate

Get **precision**, **recall**, **log_loss**, and other metrics. Also compute a **confusion matrix**.


In [None]:
eval_sql = f"SELECT * FROM ML.EVALUATE(MODEL `{model_id}`)"
eval_df = client.query(eval_sql).result().to_dataframe()
eval_df

| | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|
| 0 | 0.99236 | 1.0 | 0.99236 | 0.996165 | 0.044444 | 0.612626 |
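
As a sanity check on the `ML.EVALUATE` output above, the sketch below recomputes F1 from the reported precision and recall using the standard harmonic-mean formula. The numbers are copied from that output; the variable names are illustrative.

```python
# Values copied from the ML.EVALUATE output above.
precision = 0.99236
recall = 1.0
accuracy = 0.99236

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 6))

# With recall = 1.0 there are no false negatives. Accuracy equal to a
# precision below 1.0 then implies (up to rounding) no true negatives,
# i.e. every evaluation row received the positive label.
predicts_one_class = recall == 1.0 and abs(accuracy - precision) < 1e-9
print(predicts_one_class)
```

A very high accuracy next to a `roc_auc` of about 0.61 fits that reading: the headline metrics can largely reflect class imbalance rather than the model's ability to separate the two user types.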


In [None]:
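# Confusion matrix at the default threshold (0.5).
# Note: these rows come from the same table as the training data, so they may overlap with it.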
cm_sql = f"""
SELECT *
FROM ML.CONFUSION_MATRIX(MODEL `{model_id}`,
  (SELECT CAST(usertype AS STRING) AS usertype,
          CAST(tripduration AS FLOAT64) AS tripduration,
          CAST(birth_year AS FLOAT64) AS birth_year,
          CAST(gender AS STRING) AS gender
   FROM `{FULL_TABLE}`
   WHERE usertype IS NOT NULL AND tripduration IS NOT NULL AND birth_year IS NOT NULL AND gender IS NOT NULL
   LIMIT 200000))
"""
cm_df = client.query(cm_sql).result().to_dataframe()
cm_df

| | expected_label | Customer | Subscriber |
|---|---|---|---|
| 0 | Customer | 1 | 3171 |
| 1 | Subscriber | 1 | 196827 |
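
To connect the `ML.CONFUSION_MATRIX` output above to precision and recall, the sketch below derives both metrics per class from the four counts. The counts are copied from that output; the helper `per_class_metrics` is a hypothetical name used only for illustration.

```python
# Counts copied from the ML.CONFUSION_MATRIX output above.
# Rows are the expected (actual) label, columns the predicted label.
confusion = {
    "Customer": {"Customer": 1, "Subscriber": 3171},
    "Subscriber": {"Customer": 1, "Subscriber": 196827},
}


def per_class_metrics(matrix, positive):
    """Return (precision, recall) treating `positive` as the positive class."""
    tp = matrix[positive][positive]
    fn = sum(n for pred, n in matrix[positive].items() if pred != positive)
    fp = sum(row[positive] for actual, row in matrix.items() if actual != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


total = sum(sum(row.values()) for row in confusion.values())
accuracy = sum(confusion[label][label] for label in confusion) / total

for label in confusion:
    p, r = per_class_metrics(confusion, label)
    print(f"{label}: precision={p:.4f} recall={r:.4f}")
print(f"accuracy={accuracy:.4f}")
```

On these counts only 1 of the 3,172 actual Customer trips is labeled Customer, so overall accuracy looks strong while the minority class is almost never detected.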



**Gemini Explainer Prompt:**

```python
prompt = ""  # write your Gemini prompt here
```
Paste your explanation below.



---
## Threshold Tuning

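By default, `ML.PREDICT` uses a threshold of **0.5**. Raising it to 0.75 makes the model predict the positive class less often, which typically lowers false positives (FP) and raises false negatives (FN); compare the two settings to see the trade-off.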

> **Task:** Author your own Gemini prompt asking for an `ML.PREDICT` example that uses **`STRUCT(0.75 AS threshold)`** and explains when/why an airline might pick a higher threshold.
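
To see the FP/FN trade-off without running a query, the sketch below applies two thresholds to a small set of scored rows. The probabilities and labels are hypothetical, the helper `confusion_counts` is illustrative, and treating `prob >= threshold` as a positive prediction is an assumption of this sketch rather than a statement about BQML's tie handling.

```python
# Hypothetical (probability of positive class, true label) pairs; 1 = positive.
scored = [
    (0.95, 1), (0.80, 1), (0.70, 1), (0.60, 0),
    (0.55, 1), (0.40, 0), (0.30, 0), (0.10, 0),
]


def confusion_counts(rows, threshold):
    """Count TP/FP/FN/TN when predicting positive for prob >= threshold."""
    tp = fp = fn = tn = 0
    for prob, label in rows:
        predicted = 1 if prob >= threshold else 0
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


for threshold in (0.5, 0.75):
    print(threshold, confusion_counts(scored, threshold))
```

In this toy data the higher threshold removes the one false positive but turns two true positives into false negatives, which is the cost side of the decision raised in the Business Context question.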


In [None]:
# Example scaffold: predictions with a higher threshold (0.75)
pred_sql = f"""
SELECT *
FROM ML.PREDICT(
  MODEL `{model_id}`,
  (SELECT
     CAST(1200 AS FLOAT64) AS tripduration, # Example trip duration in seconds
     CAST(1990 AS FLOAT64) AS birth_year,   # Example birth year
     CAST('Male' AS STRING) AS gender      # Example gender
  ),
  STRUCT(0.75 AS threshold)
)
"""
pred_df = client.query(pred_sql).result().to_dataframe()
pred_df

| | predicted_usertype | predicted_usertype_probs | tripduration | birth_year | gender |
|---|---|---|---|---|---|
| 0 | Subscriber | [{'label': 'Subscriber', 'prob': 0.86346753829... | 1200.0 | 1990.0 | Male |



---
## ✅ Deliverable for Lab 5

- Completed `Lab5_Classification_BQML.ipynb` with:
  - Business context write-up (FP vs FN)
  - `CREATE MODEL` SQL
  - `ML.EVALUATE` + `ML.CONFUSION_MATRIX` outputs and explanations
  - Threshold tuning example (`STRUCT(0.75 AS threshold)`)
- Push to **GitHub** and submit the link on **Brightspace**.
