<a href="https://colab.research.google.com/github/ahyder32/mgmt467-analytics-portfolio/blob/main/ArmanHyder_Assignment2_BQML_Titanic_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Unit 2 — Team Classification (Titanic, BQML)

**Goal (team):** Build an *ops-ready* classifier in **BigQuery ML** to predict **`survived`** on the Titanic dataset. Requirements mirror the Flights notebook for comparability.
    
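**Dataset:** `bigquery-public-data.ml_datasets.titanic` (the cells below read it through `TABLE_PATH`, which in this run points to `mgmt-467-47888-471119.ml_datasets.titanic`)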

**Deliver (inside this notebook):**
- One **LOGISTIC_REG** baseline + one **engineered** model (`TRANSFORM`)
- **Evaluation** via `ML.EVALUATE` and **confusion matrices** (0.5 + custom threshold)
- **Threshold choice** + 3–5 sentence ops justification (e.g., lifeboat allocation policy in a hypothetical ops setting)
- Embedded **rubric** below


In [None]:

# --- Minimal setup (edit 2 vars) ---
from google.colab import auth
auth.authenticate_user()

import os
from google.cloud import bigquery

PROJECT_ID = "mgmt-467-47888-471119"   # e.g., mgmt-467-47888
REGION     = "us-central1"
TABLE_PATH = f"{PROJECT_ID}.ml_datasets.titanic"  # derived from PROJECT_ID, so only the two values above need editing

os.environ["PROJECT_ID"] = PROJECT_ID
os.environ["REGION"]     = REGION
bq = bigquery.Client(project=PROJECT_ID)

print("BQ Project:", PROJECT_ID)
print("Source table:", TABLE_PATH)

BQ Project: mgmt-467-47888-471119
Source table: mgmt-467-47888-471119.ml_datasets.titanic


### Quick sanity check

In [None]:

bq.query(f"SELECT * FROM `{TABLE_PATH}` LIMIT 5").result().to_dataframe()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 180 | 0 | 3 | Leonard, Mr. Lionel | male | 36.0 | 0 | 0 | LINE | 0.0 | | S |
| 1 | 264 | 0 | 1 | Harrison, Mr. William | male | 40.0 | 0 | 0 | 112059 | 0.0 | B94 | S |
| 2 | 278 | 0 | 2 | Parkes, Mr. Francis "Frank" | male | | 0 | 0 | 239853 | 0.0 | | S |
| 3 | 303 | 0 | 3 | Johnson, Mr. William Cahoone Jr | male | 19.0 | 0 | 0 | LINE | 0.0 | | S |
| 4 | 414 | 0 | 2 | Cunningham, Mr. Alfred Fleming | male | | 0 | 0 | 239853 | 0.0 | | S |



## 1) Canonical mapping (minimal)
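We map the source columns to typed features, keeping the source capitalization (this matches the `CAST`s in the query below):
- `Survived` (INT64, 0/1 label), `Pclass` (INT64), `Sex` (STRING), `Age` (FLOAT64), `SibSp` (INT64), `Parch` (INT64), `Fare` (FLOAT64), `Embarked` (STRING)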


In [23]:
# =========================
# 1) LOAD + CLEAN from BigQuery
# =========================
import pandas as pd
import numpy as np
from google.cloud import bigquery

# Use the client you already created (bq), and TABLE_PATH from your setup cell
query = f"""
WITH titanic AS (
  SELECT
    CAST(Survived AS INT64)    AS Survived,
    CAST(Pclass   AS INT64)    AS Pclass,
    CAST(Sex      AS STRING)   AS Sex,
    CAST(Age      AS FLOAT64)  AS Age,
    CAST(SibSp    AS INT64)    AS SibSp,       -- NOTE: correct column name is SibSp
    CAST(Parch    AS INT64)    AS Parch,
    CAST(Fare     AS FLOAT64)  AS Fare,
    CAST(Embarked AS STRING)   AS Embarked
  FROM `{TABLE_PATH}`
  WHERE Age IS NOT NULL AND Fare IS NOT NULL
)
SELECT * FROM titanic
"""
df = bq.query(query).to_dataframe()

# Basic checks/summary
print("Rows after filtering (Age/Fare not null):", len(df))
print("Columns:", list(df.columns))
print("Positive rate (Survived=1):", df["Survived"].mean())

display(df.head(10))

Rows after filtering (Age/Fare not null): 714
Columns: ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
Positive rate (Survived=1): 0.4061624649859944


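| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 36.0 | 0 | 0 | 0.0 | S |
| 1 | 0 | 1 | male | 40.0 | 0 | 0 | 0.0 | S |
| 2 | 0 | 3 | male | 19.0 | 0 | 0 | 0.0 | S |
| 3 | 0 | 3 | male | 49.0 | 0 | 0 | 0.0 | S |
| 4 | 0 | 1 | male | 39.0 | 0 | 0 | 0.0 | S |
| 5 | 0 | 1 | male | 38.0 | 0 | 0 | 0.0 | S |
| 6 | 0 | 3 | male | 20.0 | 0 | 0 | 4.0125 | C |
| 7 | 0 | 1 | male | 33.0 | 0 | 0 | 5.0 | S |
| 8 | 0 | 3 | male | 61.0 | 0 | 0 | 6.2375 | S |
| 9 | 0 | 3 | male | 34.5 | 0 | 0 | 6.4375 | C |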


### 2) Split (80/20)

In [24]:
# =========================
# 2) TRAIN/TEST SPLIT + PIPELINE
# =========================
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Features/target
X = df.drop(columns=["Survived"])
y = df["Survived"].astype(int)

cat_features = ["Sex", "Embarked"]
num_features = ["Pclass","Age","SibSp","Parch","Fare"]

preprocess = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), cat_features)],
    remainder="passthrough"   # keep numeric features as-is
)

# Split once into train_full/test; we’ll pick a custom threshold using a small validation split
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# A small validation split from the training set to choose the custom threshold
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.20, stratify=y_train_full, random_state=42
)

model = Pipeline([
    ("prep", preprocess),
    ("lr", LogisticRegression(max_iter=1000))
])

# Fit on the smaller training portion
model.fit(X_train, y_train)

# Refit on the full training data (better final model) AFTER we pick threshold below.

The format of the columns of the 'remainder' transformer in ColumnTransformer.transformers_ will change in version 1.7 to match the format of the other transformers.
At the moment the remainder columns are stored as indices (of type int). With the same ColumnTransformer configuration, in the future they will be stored as column names (of type str).
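The two nested 80/20 splits above leave roughly 64% of rows for fitting, 16% for threshold selection, and 20% for testing. The sketch below works out the row counts for the 714 filtered rows. It assumes the test share is rounded up, which is how scikit-learn's `train_test_split` handles a float `test_size`; `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math


def split_sizes(n_rows: int, test_fraction: float) -> tuple[int, int]:
    """Return (n_train, n_test), rounding the test share up."""
    n_test = math.ceil(test_fraction * n_rows)
    return n_rows - n_test, n_test


n_rows = 714  # rows after the Age/Fare filter, from the output above

# First split: full training set vs. untouched test set.
n_train_full, n_test = split_sizes(n_rows, 0.20)

# Second split: fitting portion vs. validation portion for threshold choice.
n_train, n_val = split_sizes(n_train_full, 0.20)

print("train:", n_train, "val:", n_val, "test:", n_test)
```

The test count of 143 agrees with the totals of the confusion matrices reported later in the notebook.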




## 3) Baseline model — LOGISTIC_REG (`survived`)
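Use a small set of signals (keep parity with Flights complexity). The cells in this section evaluate a local scikit-learn logistic regression as a reference; the BigQuery ML baseline itself is trained under **Model A** further down.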


In [26]:
# =========================
# 3) EVALUATION + CONFUSION MATRICES
#    (Default 0.5 and Custom F1-optimized)
# =========================
from sklearn.metrics import (
    roc_auc_score, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
)

# --- Pick a custom threshold on the validation split to maximize F1 ---
y_val_prob = model.predict_proba(X_val)[:, 1]
ths = np.linspace(0.05, 0.95, 91)

best_t = 0.5
best_f1 = -1.0
for t in ths:
    y_val_pred = (y_val_prob >= t).astype(int)
    f1 = f1_score(y_val, y_val_pred, zero_division=0)
    if f1 > best_f1:
        best_f1 = f1
        best_t = t

print(f"Chosen custom threshold (max F1 on validation): {best_t:.3f}  |  F1_val={best_f1:.3f}")

# --- Refit on ALL training data, then evaluate on the untouched test set ---
model.fit(X_train_full, y_train_full)

# Probabilities on test
y_test_prob = model.predict_proba(X_test)[:, 1]

Chosen custom threshold (max F1 on validation): 0.400  |  F1_val=0.787
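The threshold search above is a plain grid sweep: score every candidate threshold by F1 on the validation split and keep the first best one. A dependency-free sketch of the same idea follows. The labels and scores are made-up toy values for illustration only, and `f1_at` is a hypothetical helper.

```python
def f1_at(threshold, labels, scores):
    """F1 of the rule 'predict 1 when score >= threshold' (0.0 if undefined)."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical validation labels and predicted probabilities (toy data).
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.30, 0.35, 0.45, 0.55, 0.70, 0.80, 0.90]

# Same grid as the notebook: 0.05 to 0.95 in steps of 0.01.
thresholds = [round(0.05 + 0.01 * i, 2) for i in range(91)]

# max() keeps the first maximum, like the strict '>' comparison in the loop above.
best_t = max(thresholds, key=lambda t: f1_at(t, labels, scores))
best_f1 = f1_at(best_t, labels, scores)

print(f"best threshold: {best_t:.2f}, F1: {best_f1:.3f}")
```

Because ties resolve to the lowest threshold, a flat F1 plateau pulls the chosen threshold downward. That is worth keeping in mind when the validation split is as small as it is here.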


### Confusion matrix — default 0.5 threshold

In [27]:
# Default threshold 0.5
y_pred_05 = (y_test_prob >= 0.5).astype(int)

metrics_05 = {
    "AUC": roc_auc_score(y_test, y_test_prob),
    "Accuracy@0.5": accuracy_score(y_test, y_pred_05),
    "Precision@0.5": precision_score(y_test, y_pred_05, zero_division=0),
    "Recall@0.5": recall_score(y_test, y_pred_05, zero_division=0),
    "F1@0.5": f1_score(y_test, y_pred_05, zero_division=0),
}
print("\n=== Metrics (Default threshold = 0.5) ===")
for k,v in metrics_05.items():
    print(f"{k}: {v:.4f}")

cm_05 = confusion_matrix(y_test, y_pred_05, labels=[0,1])
cm_05_df = pd.DataFrame(cm_05, index=["Actual 0","Actual 1"], columns=["Pred 0","Pred 1"])
display(cm_05_df.style.set_caption("Confusion Matrix @ 0.5"))


=== Metrics (Default threshold = 0.5) ===
AUC: 0.8684
Accuracy@0.5: 0.8392
Precision@0.5: 0.7778
Recall@0.5: 0.8448
F1@0.5: 0.8099


| | Pred 0 | Pred 1 |
|---|---|---|
| Actual 0 | 71 | 14 |
| Actual 1 | 9 | 49 |


### Confusion matrix — your custom threshold

In [28]:
# Custom threshold
y_pred_custom = (y_test_prob >= best_t).astype(int)
metrics_custom = {
    "ChosenThreshold": best_t,
    "Accuracy": accuracy_score(y_test, y_pred_custom),
    "Precision": precision_score(y_test, y_pred_custom, zero_division=0),
    "Recall": recall_score(y_test, y_pred_custom, zero_division=0),
    "F1": f1_score(y_test, y_pred_custom, zero_division=0),
}
print("\n=== Metrics (Custom threshold) ===")
for k,v in metrics_custom.items():
    if k == "ChosenThreshold":
        print(f"{k}: {v:.3f}")
    else:
        print(f"{k}: {v:.4f}")

cm_custom = confusion_matrix(y_test, y_pred_custom, labels=[0,1])
cm_custom_df = pd.DataFrame(cm_custom, index=["Actual 0","Actual 1"], columns=["Pred 0","Pred 1"])
display(cm_custom_df.style.set_caption("Confusion Matrix @ custom threshold"))


=== Metrics (Custom threshold) ===
ChosenThreshold: 0.400
Accuracy: 0.7902
Precision: 0.7000
Recall: 0.8448
F1: 0.7656


| | Pred 0 | Pred 1 |
|---|---|---|
| Actual 0 | 64 | 21 |
| Actual 1 | 9 | 49 |
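The reported metrics can be recomputed directly from the two confusion matrices, which makes the effect of the threshold change easy to see. The sketch below uses only the counts printed above; `metrics_from_counts` is a hypothetical helper.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1_denom = 2 * tp + fp + fn
    return {
        "accuracy": (tn + tp) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * tp / f1_denom if f1_denom else 0.0,
    }


# Counts copied from the two test-set confusion matrices above.
at_050 = metrics_from_counts(tn=71, fp=14, fn=9, tp=49)
at_040 = metrics_from_counts(tn=64, fp=21, fn=9, tp=49)

for name, m in (("0.50", at_050), ("0.40", at_040)):
    print(name, {k: round(v, 4) for k, v in m.items()})
```

On this test split, lowering the threshold from 0.50 to 0.40 turns 7 true negatives into false positives and recovers no additional survivors, so recall is unchanged while precision and F1 drop.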



## 4) Engineered model — `TRANSFORM`
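Create **family_size**, **fare_bucket**, and a **sex_pclass** interaction (categorical). Compare with baseline. The coefficient cell in this section inspects the scikit-learn reference model only; it does not train a `TRANSFORM` model.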
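One possible shape for the engineered model is sketched here as a SQL string built in Python, so it can be inspected before being passed to `bq.query(...)`. Several choices are assumptions rather than anything this notebook fixes: the model name `titanic_model_B_engineered`, defining family size as `SibSp + Parch + 1`, using 4 quantile buckets for fare, and reusing the Model A options. `build_engineered_model_sql` is a hypothetical helper.

```python
def build_engineered_model_sql(model_path: str, table_path: str) -> str:
    """Return a CREATE MODEL statement with a TRANSFORM clause (sketch only)."""
    return f"""
CREATE OR REPLACE MODEL `{model_path}`
TRANSFORM(
  Survived,                                                  -- label passes through unchanged
  Pclass,
  Sex,
  Age,
  Embarked,
  SibSp + Parch + 1 AS family_size,                          -- assumption: counts the passenger too
  ML.QUANTILE_BUCKETIZE(Fare, 4) OVER () AS fare_bucket,     -- assumption: 4 buckets
  CONCAT(Sex, '_', CAST(Pclass AS STRING)) AS sex_pclass     -- categorical interaction
)
OPTIONS(
  MODEL_TYPE = 'logistic_reg',
  INPUT_LABEL_COLS = ['Survived'],
  DATA_SPLIT_METHOD = 'RANDOM',
  DATA_SPLIT_EVAL_FRACTION = 0.20,
  AUTO_CLASS_WEIGHTS = TRUE
) AS
SELECT
  CAST(Survived AS INT64)   AS Survived,
  CAST(Pclass   AS INT64)   AS Pclass,
  CAST(Sex      AS STRING)  AS Sex,
  CAST(Age      AS FLOAT64) AS Age,
  CAST(SibSp    AS INT64)   AS SibSp,
  CAST(Parch    AS INT64)   AS Parch,
  CAST(Fare     AS FLOAT64) AS Fare,
  CAST(Embarked AS STRING)  AS Embarked
FROM `{table_path}`
WHERE Age IS NOT NULL AND Fare IS NOT NULL
"""


# Hypothetical paths for illustration; substitute PROJECT_ID / TABLE_PATH from the setup cell.
sql = build_engineered_model_sql(
    model_path="my-project.ml_models.titanic_model_B_engineered",
    table_path="my-project.ml_datasets.titanic",
)
print(sql)
```

Keeping the feature logic inside `TRANSFORM` means the same preprocessing is applied automatically at evaluation and prediction time, so callers supply raw columns only.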


In [29]:
# =========================
# 4) COEFFICIENTS / FEATURE IMPORTANCE
# =========================
# Pull feature names (OHE + numeric)
ohe = model.named_steps["prep"].named_transformers_["cat"]
ohe_features = list(ohe.get_feature_names_out(["Sex","Embarked"]))
num_features = ["Pclass","Age","SibSp","Parch","Fare"]
feat_names = ohe_features + num_features

coefs = model.named_steps["lr"].coef_[0]
coef_df = pd.DataFrame({"Feature": feat_names, "Coefficient": coefs}).sort_values("Coefficient", ascending=False)

display(coef_df)

| | Feature | Coefficient |
|---|---|---|
| 0 | Sex_female | 1.254253 |
| 2 | Embarked_C | 0.430629 |
| 4 | Embarked_S | 0.142511 |
| 8 | Parch | 0.051837 |
| 9 | Fare | 0.002879 |
| 6 | Age | -0.045582 |
| 7 | SibSp | -0.413431 |
| 3 | Embarked_Q | -0.558846 |
| 5 | Pclass | -1.172778 |
| 1 | Sex_male | -1.23996 |
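Logistic-regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below does that for the coefficients printed above. Two caveats apply. Both `Sex` levels are one-hot encoded, so the sex effect is the gap between the two coefficients rather than either one alone. The numeric features are unscaled, so coefficient magnitudes are not comparable across features.

```python
import math

# Coefficients copied from the table above (log-odds scale).
coefs = {
    "Sex_female": 1.254253,
    "Embarked_C": 0.430629,
    "Embarked_S": 0.142511,
    "Parch": 0.051837,
    "Fare": 0.002879,
    "Age": -0.045582,
    "SibSp": -0.413431,
    "Embarked_Q": -0.558846,
    "Pclass": -1.172778,
    "Sex_male": -1.23996,
}

# Odds ratio per one-unit increase (or per category indicator switching on).
odds_ratios = {name: math.exp(value) for name, value in coefs.items()}

# Female vs. male, other features held fixed: difference of the two one-hot coefficients.
sex_log_odds_gap = coefs["Sex_female"] - coefs["Sex_male"]
sex_odds_ratio = math.exp(sex_log_odds_gap)

for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} {ratio:.3f}")
print(f"female vs male odds ratio: {sex_odds_ratio:.2f}")
```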


# **Model A**: *Pre-Embarkation Baseline (Global)*

In [32]:
# ===== Model A config (baseline, manifest-only) =====
MODEL_DATASET = "ml_models"   # change if you prefer another dataset
MODEL_NAME    = "titanic_model_A_baseline"
MODEL_PATH    = f"{PROJECT_ID}.{MODEL_DATASET}.{MODEL_NAME}"
print("Model path:", MODEL_PATH)

# Ensure dataset exists (no-op if present)
from google.cloud import bigquery
from google.api_core.exceptions import Conflict

try:
    bq.create_dataset(MODEL_DATASET)
    print(f"Created dataset: {MODEL_DATASET}")
except Conflict:
    print(f"Dataset already exists: {MODEL_DATASET}")

# ===== Train Model A (BQML LOGISTIC_REG, manifest-only features) =====
create_sql = f"""
CREATE OR REPLACE MODEL `{MODEL_PATH}`
OPTIONS(
  MODEL_TYPE = 'logistic_reg',
  INPUT_LABEL_COLS = ['Survived'],
  DATA_SPLIT_METHOD = 'RANDOM',
  DATA_SPLIT_EVAL_FRACTION = 0.20,
  AUTO_CLASS_WEIGHTS = TRUE
) AS
SELECT
  CAST(Survived AS INT64)   AS Survived,
  CAST(Pclass   AS INT64)   AS Pclass,
  CAST(Sex      AS STRING)  AS Sex,
  CAST(Age      AS FLOAT64) AS Age,
  CAST(Fare     AS FLOAT64) AS Fare,
  CAST(Embarked AS STRING)  AS Embarked
FROM `{TABLE_PATH}`
WHERE Age IS NOT NULL AND Fare IS NOT NULL
"""
job = bq.query(create_sql)
job.result()
print("Model A trained.")


# ===== ML.EVALUATE (includes AUC, log_loss, etc.) =====
eval_sql = f"""
SELECT roc_auc, log_loss, accuracy, precision, recall, f1_score
FROM ML.EVALUATE(MODEL `{MODEL_PATH}`)
"""
eval_df = bq.query(eval_sql).to_dataframe()
display(eval_df)  # a bare expression in the middle of a cell is not rendered; display() shows it


# ===== Confusion matrix @ threshold = 0.5 (scored on ALL filtered rows passed below, training rows included, not only the evaluation split) =====
cm_sql = f"""
SELECT *
FROM ML.CONFUSION_MATRIX(
  MODEL `{MODEL_PATH}`,
  (
    SELECT
      CAST(Survived AS INT64)   AS Survived,
      CAST(Pclass   AS INT64)   AS Pclass,
      CAST(Sex      AS STRING)  AS Sex,
      CAST(Age      AS FLOAT64) AS Age,
      CAST(Fare     AS FLOAT64) AS Fare,
      CAST(Embarked AS STRING)  AS Embarked
    FROM `{TABLE_PATH}`
    WHERE Age IS NOT NULL AND Fare IS NOT NULL
  ),
  STRUCT(0.5 AS threshold)
)
"""
cm_df = bq.query(cm_sql).to_dataframe()
cm_df

Model path: mgmt-467-47888-471119.ml_models.titanic_model_A_baseline
Dataset already exists: ml_models
Model A trained.


| | expected_label | `_0` | `_1` |
|---|---|---|---|
| 0 | 0 | 348 | 76 |
| 1 | 1 | 78 | 212 |



### Write-up (concise)
- **Threshold chosen & ops rationale**:

Use the default 0.5 threshold for the baseline. It keeps the decision rule simple for an initial hand-off, and the scikit-learn check above gives no reason to move it yet: lowering the threshold to 0.40 left test recall at 0.8448 while precision fell from 0.7778 to 0.7000.

- **Baseline vs engineered — changes in AUC/precision/recall**:

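This manifest-only model (Pclass, Sex, Age, Fare, Embarked) sets a transparent starting point. At threshold 0.5, its confusion matrix over the 714 filtered rows gives precision 212/288 ≈ 0.736 and recall 212/290 ≈ 0.731. No engineered model is trained in this notebook, so there is no measured change in AUC, precision, or recall to report. The expectation is that later feature engineering (e.g., family size, interactions) nudges AUC up and can improve recall at similar precision.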

- **Risk framing (FP vs FN in rescue/triage)**:

Treat false negatives as costlier (missing someone who needs intervention). If ops prioritizes catch-rate, we’ll lower the threshold in the next iteration to trade some precision for higher recall.
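The risk framing above can be made concrete by attaching a cost to each error type and comparing thresholds on total cost. The sketch below uses the false-positive and false-negative counts from the scikit-learn test matrices earlier in the notebook. The 5:1 cost ratio is a hypothetical value chosen only to illustrate "false negatives are costlier", and `total_cost` is a hypothetical helper.

```python
def total_cost(fp: int, fn: int, cost_fp: float, cost_fn: float) -> float:
    """Linear cost of a decision rule given its error counts."""
    return cost_fp * fp + cost_fn * fn


# Error counts per threshold, copied from the test-set confusion matrices.
errors = {
    0.50: {"fp": 14, "fn": 9},
    0.40: {"fp": 21, "fn": 9},
}

COST_FP = 1.0  # assumption: unit cost for a false positive
COST_FN = 5.0  # assumption: a false negative is 5x as costly

costs = {
    t: total_cost(e["fp"], e["fn"], COST_FP, COST_FN) for t, e in errors.items()
}
best_threshold = min(costs, key=costs.get)

for t, c in costs.items():
    print(f"threshold {t:.2f}: cost {c:.0f}")
print("lowest cost at threshold", best_threshold)
```

Because both thresholds produce the same 9 false negatives on this split, 0.50 comes out cheaper for any positive false-positive cost. A lower threshold only pays off once it actually reduces false negatives.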


---

## Rubric (Titanic, 100 pts)
**Team-only deliverable in this notebook**

- Baseline LOGISTIC_REG + evaluation (AUC + confusion @0.5) — **20**  
- Custom threshold confusion matrix + ops justification — **20**  
- Engineered model with `TRANSFORM` (family_size, fare_bucket, sex_pclass) — **20**  
- Comparison table (baseline vs engineered) + 3–5 sentence interpretation — **20**  
- Reproducibility: parameters clear, no hidden magic; mapping documented — **10**  
- Governance notes: assumptions/limitations + slices you would monitor — **10**

> **Strictness:** No screenshots; use actual results cells. Keep explanations concise (bullet points OK).
