<a href="https://colab.research.google.com/github/garci843/Unit1_TheLook_Team1/blob/main/Assignment_2/Individual/Unit2_Aditya_Ghorpade_BQML_Titanic_ModelD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Unit 2 — Team Classification (Titanic, BQML)

**Goal (team):** Build an *ops-ready* classifier in **BigQuery ML** to predict **`survived`** on the Titanic dataset. Requirements mirror the Flights notebook for comparability.
    
**Dataset:** `bigquery-public-data.ml_datasets.titanic`

**Deliver (inside this notebook):**
- One **LOGISTIC_REG** baseline + one **engineered** model (`TRANSFORM`)
- **Evaluation** via `ML.EVALUATE` and **confusion matrices** (0.5 + custom threshold)
- **Threshold choice** + 3–5 sentence ops justification (e.g., lifeboat allocation policy in a hypothetical ops setting)
- Embedded **rubric** below


In [77]:
# --- Minimal setup (edit 2 vars) ---
from google.colab import auth
auth.authenticate_user()

import os
from google.cloud import bigquery

PROJECT_ID = "sunlit-plasma-471119-s7"
REGION     = "us-central1"
TABLE_PATH = f"{PROJECT_ID}.titanic.Titanic"  # derived from PROJECT_ID so only two vars need editing

os.environ["PROJECT_ID"] = PROJECT_ID
os.environ["REGION"]     = REGION
bq = bigquery.Client(project=PROJECT_ID)

print("BQ Project:", PROJECT_ID)
print("Source table:", TABLE_PATH)


BQ Project: sunlit-plasma-471119-s7
Source table: sunlit-plasma-471119-s7.titanic.Titanic


### Quick sanity check

In [78]:
bq.query(f"SELECT * FROM `{TABLE_PATH}` LIMIT 5").result().to_dataframe()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 180 | 0 | 3 | Leonard, Mr. Lionel | male | 36.0 | 0 | 0 | LINE | 0.0 | | S |
| 264 | 0 | 1 | Harrison, Mr. William | male | 40.0 | 0 | 0 | 112059 | 0.0 | B94 | S |
| 278 | 0 | 2 | Parkes, Mr. Francis "Frank" | male | | 0 | 0 | 239853 | 0.0 | | S |
| 303 | 0 | 3 | Johnson, Mr. William Cahoone Jr | male | 19.0 | 0 | 0 | LINE | 0.0 | | S |
| 414 | 0 | 2 | Cunningham, Mr. Alfred Fleming | male | | 0 | 0 | 239853 | 0.0 | | S |



## 1) Canonical mapping (minimal)
We map to:
- `survived` (BOOL), `pclass` (INT), `sex` (STRING), `age` (NUM), `sibsp` (INT), `parch` (INT), `fare` (NUM), `embarked` (STRING)


In [79]:

CANONICAL_BASE_SQL = f'''
WITH titanic_c AS (
  SELECT
    CAST(survived AS BOOL) AS survived,
    CAST(pclass AS INT64)  AS pclass,
    CAST(sex AS STRING)    AS sex,
    CAST(age AS FLOAT64)   AS age,
    CAST(sibsp AS INT64)   AS sibsp,
    CAST(parch AS INT64)   AS parch,
    CAST(fare AS FLOAT64)  AS fare,
    CAST(embarked AS STRING) AS embarked
  FROM `{TABLE_PATH}`
  WHERE age IS NOT NULL AND fare IS NOT NULL
)
'''
print(CANONICAL_BASE_SQL[:500] + "\n...")



WITH titanic_c AS (
  SELECT
    CAST(survived AS BOOL) AS survived,
    CAST(pclass AS INT64)  AS pclass,
    CAST(sex AS STRING)    AS sex,
    CAST(age AS FLOAT64)   AS age,
    CAST(sibsp AS INT64)   AS sibsp,
    CAST(parch AS INT64)   AS parch,
    CAST(fare AS FLOAT64)  AS fare,
    CAST(embarked AS STRING) AS embarked
  FROM `sunlit-plasma-471119-s7.titanic.Titanic`
  WHERE age IS NOT NULL AND fare IS NOT NULL
)

...


## 2) Split (80/20)

In [80]:

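# NOTE: RAND() is re-evaluated on every query, so the TRAIN/EVAL assignment
# is not stable across the training, sweep, confusion, and fairness cells.
SPLIT_CLAUSE = r'''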
, split AS (
  SELECT t.*,
         CASE WHEN RAND() < 0.8 THEN 'TRAIN' ELSE 'EVAL' END AS split
  FROM titanic_c t
)
'''
print(SPLIT_CLAUSE)



, split AS (
  SELECT t.*,
         CASE WHEN RAND() < 0.8 THEN 'TRAIN' ELSE 'EVAL' END AS split
  FROM titanic_c t
)
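
Because `RAND()` draws new values each time a query runs, the `TRAIN`/`EVAL` assignment above can change between cells. A minimal sketch of a deterministic alternative follows, assuming it is acceptable to use a hash of the whole row (`FARM_FINGERPRINT(TO_JSON_STRING(t))`) as the split key. The name `SPLIT_CLAUSE_DETERMINISTIC` and the helpers `bucket_of` and `split_of` are hypothetical and not part of the original notebook; the helpers only illustrate the idea locally with `hashlib` and do not reproduce BigQuery's hash values.

```python
import hashlib

# Hypothetical drop-in replacement for SPLIT_CLAUSE (assumption: hashing the
# full row is an acceptable split key; identical rows land in the same split).
SPLIT_CLAUSE_DETERMINISTIC = r'''
, split AS (
  SELECT t.*,
         CASE WHEN MOD(ABS(FARM_FINGERPRINT(TO_JSON_STRING(t))), 10) < 8
              THEN 'TRAIN' ELSE 'EVAL' END AS split
  FROM titanic_c t
)
'''


def bucket_of(row_key: str, n_buckets: int = 10) -> int:
    """Local illustration: map a row key to a stable bucket in [0, n_buckets)."""
    digest = hashlib.sha256(row_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def split_of(row_key: str) -> str:
    """Buckets 0-7 go to TRAIN (about 80%), buckets 8-9 go to EVAL."""
    return "TRAIN" if bucket_of(row_key) < 8 else "EVAL"


# The same key always receives the same assignment, unlike RAND().
key = "Leonard, Mr. Lionel|36.0|0.0"
print(split_of(key), split_of(key))
```

With a stable split, the rows scored in later cells are guaranteed to be rows the model did not train on.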




## 3) Baseline model — LOGISTIC_REG (`survived`)
Use a small set of signals (keep parity with Flights complexity).


In [81]:
# Dataset (schema) and baseline model paths

SCHEMA = f"{PROJECT_ID}.unit2_titanic"
MODEL_BASE = f"{SCHEMA}.clf_survived_base"

# Create schema
sql_create_schema = f'''
CREATE SCHEMA IF NOT EXISTS `{SCHEMA}`;
'''
job = bq.query(sql_create_schema); _ = job.result()
print(f"Schema created or exists: {SCHEMA}")

Schema created or exists: sunlit-plasma-471119-s7.unit2_titanic


In [82]:
# Rebuild the Model B feature table (family_size, fare_bucket, sex_pclass) from the raw Titanic table
sql_xform_features = f"""
CREATE OR REPLACE TABLE `titanic.Model_B_features` AS
SELECT
  Pclass,
  Sex,
  Age,
  Fare,
  Embarked,
  SibSp + Parch + 1 AS family_size,
  CASE
    WHEN Fare <= 15 THEN 'Low'
    WHEN Fare <= 50 THEN 'Mid'
    WHEN Fare > 50 THEN 'High'
  END AS fare_bucket,
  CONCAT(Sex, '_', CAST(Pclass AS STRING)) AS sex_pclass,
  Survived
  FROM `titanic.Titanic`;
"""
job = bq.query(sql_xform_features); _ = job.result()

In [83]:
MODEL_XFORM = f"{SCHEMA}.clf_survived_xform"
print(MODEL_XFORM)

sunlit-plasma-471119-s7.unit2_titanic.clf_survived_xform


In [84]:
# 1) Make sure SCHEMA + MODEL_XFORM are defined
SCHEMA = f"{PROJECT_ID}.unit2_titanic"   # or f"{PROJECT_ID}.titanic" if that's your model dataset
MODEL_XFORM = f"{SCHEMA}.clf_survived_xform"
print("MODEL_XFORM =", MODEL_XFORM)

# 2) Ensure the schema exists
sql_create_schema = f"""
CREATE SCHEMA IF NOT EXISTS `{SCHEMA}`;
"""
job = bq.query(sql_create_schema); job.result()

# 3) Create the engineered (Model B) logistic regression model
sql_model_xform = f"""
CREATE OR REPLACE MODEL `{MODEL_XFORM}`
OPTIONS(model_type='logistic_reg', input_label_cols=['Survived']) AS
{CANONICAL_BASE_SQL}
{SPLIT_CLAUSE}
SELECT
  pclass  AS Pclass,
  sex     AS Sex,
  age     AS Age,
  fare    AS Fare,
  embarked AS Embarked,
  (sibsp + parch + 1) AS family_size,
  CASE
    WHEN fare <= 15 THEN 'Low'
    WHEN fare <= 50 THEN 'Mid'
    ELSE 'High'
  END AS fare_bucket,
  CONCAT(sex, '_', CAST(pclass AS STRING)) AS sex_pclass,
  survived AS Survived
FROM split
WHERE split.split = 'TRAIN';
"""

job = bq.query(sql_model_xform)
job.result()
print("Engineered model trained:", MODEL_XFORM)


MODEL_XFORM = sunlit-plasma-471119-s7.unit2_titanic.clf_survived_xform
Engineered model trained: sunlit-plasma-471119-s7.unit2_titanic.clf_survived_xform
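
The deliverables and rubric ask for an engineered model built with `TRANSFORM`, whereas the cell above computes the features in the training `SELECT`, so every `ML.PREDICT` call has to recreate them. A sketch of the same features moved into a `TRANSFORM` clause is shown below. The helper `build_transform_sql` and the model name `clf_survived_xform_t` are hypothetical, and the statement has not been run against this project.

```python
def build_transform_sql(model_path: str, base_sql: str, split_sql: str) -> str:
    """Return a CREATE MODEL statement whose feature engineering lives in TRANSFORM.

    With TRANSFORM, BigQuery ML stores the preprocessing with the model and
    applies it during prediction, so callers pass only the canonical columns.
    """
    return f"""
CREATE OR REPLACE MODEL `{model_path}`
TRANSFORM(
  pclass, sex, age, fare, embarked,
  (sibsp + parch + 1) AS family_size,
  CASE
    WHEN fare <= 15 THEN 'Low'
    WHEN fare <= 50 THEN 'Mid'
    ELSE 'High'
  END AS fare_bucket,
  CONCAT(sex, '_', CAST(pclass AS STRING)) AS sex_pclass,
  survived  -- the label must pass through TRANSFORM as well
)
OPTIONS(model_type='logistic_reg', input_label_cols=['survived']) AS
{base_sql}
{split_sql}
SELECT pclass, sex, age, sibsp, parch, fare, embarked, survived
FROM split
WHERE split.split = 'TRAIN';
"""


# Placeholder fragments stand in for CANONICAL_BASE_SQL and SPLIT_CLAUSE.
sql_model_transform = build_transform_sql(
    "my-project.unit2_titanic.clf_survived_xform_t",  # hypothetical model name
    "WITH titanic_c AS (SELECT ...)",
    ", split AS (SELECT ...)",
)
print(sql_model_transform)
```

In the notebook, the real `CANONICAL_BASE_SQL` and `SPLIT_CLAUSE` strings would be passed instead of the placeholders.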


## 4) Threshold sweep to minimize expected cost

In [85]:
# --- Model D: parameters ---
C_FP = 1
C_FN = 1.7

# initial guess; will be refined by the sweep below
CHOSEN_THRESHOLD = 0.5

# fairness slice to monitor ("sex" or "pclass" are natural here)
FAIRNESS_SLICE_COL = "sex"  # or "pclass"
print("MODEL_XFORM:", MODEL_XFORM)

MODEL_XFORM: sunlit-plasma-471119-s7.unit2_titanic.clf_survived_xform
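
The sweep scores each threshold with a linear cost, `total_cost = C_FP * fp + C_FN * fn`. As a small worked sketch of that formula (the helper name `expected_cost` is illustrative and not part of the notebook, and the counts are made up), a false negative weighs 1.7 times as much as a false positive:

```python
C_FP = 1    # cost of a false positive (same value as the cell above)
C_FN = 1.7  # cost of a false negative


def expected_cost(fp: int, fn: int, c_fp: float = C_FP, c_fn: float = C_FN) -> float:
    """Linear misclassification cost used to rank thresholds."""
    return c_fp * fp + c_fn * fn


# Made-up counts for illustration: 10 false positives and 20 false negatives.
cost = expected_cost(fp=10, fn=20)
print(f"{cost:.1f}")  # 10 * 1 + 20 * 1.7 = 44.0
```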


In [86]:
thresholds = [round(x/100, 2) for x in range(20, 76, 5)]  # 0.20..0.75 in steps of 0.05

sql_sweep = f"""
{CANONICAL_BASE_SQL}
{SPLIT_CLAUSE}
, scored AS (
  SELECT
    Survived AS label,
    Sex,
    Pclass,
    -- BQML outputs ARRAY<STRUCT<label, prob>>; OFFSET(1) assumes the survived=TRUE entry is second
    predicted_Survived_probs[OFFSET(1)].prob AS p_survive
  FROM ML.PREDICT(
    MODEL `{MODEL_XFORM}`,
    (
      -- Recreate engineered features for EVAL split (match Model B training cols)
      SELECT
        survived AS Survived,
        pclass   AS Pclass,
        sex      AS Sex,
        age      AS Age,
        fare     AS Fare,
        embarked AS Embarked,
        (sibsp + parch + 1) AS family_size,
        CASE
          WHEN fare <= 15 THEN 'Low'
          WHEN fare <= 50 THEN 'Mid'
          ELSE 'High'
        END AS fare_bucket,
        CONCAT(sex, '_', CAST(pclass AS STRING)) AS sex_pclass
      FROM split
      WHERE split.split = 'EVAL'
    )
  )
),
grid AS (
  SELECT threshold FROM UNNEST([{','.join(str(t) for t in thresholds)}]) AS threshold
),
labeled AS (
  SELECT
    g.threshold,
    s.label,
    IF(s.p_survive >= g.threshold, 1, 0) AS y_hat
  FROM scored s
  CROSS JOIN grid g
),
agg AS (
  SELECT
    threshold,
    COUNTIF(label=TRUE AND y_hat=1) AS tp,
    COUNTIF(label=FALSE AND y_hat=1) AS fp,
    COUNTIF(label=FALSE AND y_hat=0) AS tn,
    COUNTIF(label=TRUE AND y_hat=0) AS fn
  FROM labeled
  GROUP BY threshold
)
SELECT
  threshold,
  tp, fp, tn, fn,
  SAFE_DIVIDE(tp, tp + fn) AS recall,
  SAFE_DIVIDE(tp, tp + fp) AS precision,
  SAFE_DIVIDE(tp + tn, tp + tn + fp + fn) AS accuracy,
  {C_FP} * fp + {C_FN} * fn AS total_cost
FROM agg
ORDER BY threshold;
"""

sweep_df = bq.query(sql_sweep).to_dataframe()
display(sweep_df)

best = sweep_df.sort_values("total_cost", ascending=True).iloc[0]
CHOSEN_THRESHOLD = float(best["threshold"])
print(f"Suggested cost-minimizing threshold: {CHOSEN_THRESHOLD:.2f} (expected cost={best['total_cost']:.2f})")

| threshold | tp | fp | tn | fn | recall | precision | accuracy | total_cost |
|---|---|---|---|---|---|---|---|---|
| 0.2 | 30 | 76 | 1 | 23 | 0.566038 | 0.283019 | 0.238462 | 115.1 |
| 0.25 | 26 | 75 | 2 | 27 | 0.490566 | 0.257426 | 0.215385 | 120.9 |
| 0.3 | 25 | 75 | 2 | 28 | 0.471698 | 0.25 | 0.207692 | 122.6 |
| 0.35 | 23 | 75 | 2 | 30 | 0.433962 | 0.234694 | 0.192308 | 126.0 |
| 0.4 | 22 | 72 | 5 | 31 | 0.415094 | 0.234043 | 0.207692 | 124.7 |
| 0.45 | 21 | 71 | 6 | 32 | 0.396226 | 0.228261 | 0.207692 | 125.4 |
| 0.5 | 19 | 71 | 6 | 34 | 0.358491 | 0.211111 | 0.192308 | 128.8 |
| 0.55 | 14 | 67 | 10 | 39 | 0.264151 | 0.17284 | 0.184615 | 133.3 |
| 0.6 | 10 | 64 | 13 | 43 | 0.188679 | 0.135135 | 0.176923 | 137.1 |
| 0.65 | 7 | 60 | 17 | 46 | 0.132075 | 0.104478 | 0.184615 | 138.2 |


Suggested cost-minimizing threshold: 0.20 (expected cost=115.10)
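
The `[OFFSET(1)]` lookup in the sweep query assumes the positive class sits in the second position of `predicted_Survived_probs`. A sketch that selects the probability by label instead, so the result does not depend on array order, is given below. `P_SURVIVE_EXPR` and `scored_select` are hypothetical names, and the expression assumes the label column is BOOL as in the canonical mapping.

```python
# Hypothetical replacement for `predicted_Survived_probs[OFFSET(1)].prob`.
# Assumption: the label is BOOL, so each struct's `label` field is TRUE or FALSE.
P_SURVIVE_EXPR = """(
      SELECT p.prob
      FROM UNNEST(predicted_Survived_probs) AS p
      WHERE p.label = TRUE
    )"""


def scored_select(model_path: str, features_sql: str) -> str:
    """Build the body of a `scored` CTE that looks up the probability by label."""
    return f"""
  SELECT
    Survived AS label,
    Sex,
    Pclass,
    {P_SURVIVE_EXPR} AS p_survive
  FROM ML.PREDICT(MODEL `{model_path}`, ({features_sql}))
"""


# "SELECT ..." is a placeholder for the engineered-feature query on the EVAL split.
print(scored_select("my-project.unit2_titanic.clf_survived_xform", "SELECT ..."))
```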


## 5) Confusion & expected cost — 0.5 vs chosen threshold

In [87]:
sql_conf = f"""
{CANONICAL_BASE_SQL}
{SPLIT_CLAUSE}
, scored AS (
  SELECT
    Survived AS label,
    Sex,
    Pclass,
    predicted_Survived_probs[OFFSET(1)].prob AS p_survive
  FROM ML.PREDICT(
    MODEL `{MODEL_XFORM}`,
    (
      SELECT
        survived AS Survived,
        pclass   AS Pclass,
        sex      AS Sex,
        age      AS Age,
        fare     AS Fare,
        embarked AS Embarked,
        (sibsp + parch + 1) AS family_size,
        CASE
          WHEN fare <= 15 THEN 'Low'
          WHEN fare <= 50 THEN 'Mid'
          ELSE 'High'
        END AS fare_bucket,
        CONCAT(sex, '_', CAST(pclass AS STRING)) AS sex_pclass
      FROM split
      WHERE split.split = 'EVAL'
    )
  )
),
labeled AS (
  SELECT
    threshold,
    label,
    Sex,
    Pclass,
    p_survive,
    IF(p_survive >= threshold, 1, 0) AS y_hat
  FROM scored,
  UNNEST([0.5, {CHOSEN_THRESHOLD}]) AS threshold
),
agg AS (
  SELECT
    CASE WHEN threshold = 0.5
         THEN 'baseline_0.5'
         ELSE CONCAT('chosen_', CAST({CHOSEN_THRESHOLD} AS STRING))
    END AS scenario,
    threshold,
    COUNTIF(label=TRUE AND y_hat=1) AS tp,
    COUNTIF(label=FALSE AND y_hat=1) AS fp,
    COUNTIF(label=FALSE AND y_hat=0) AS tn,
    COUNTIF(label=TRUE AND y_hat=0) AS fn
  FROM labeled
  GROUP BY scenario, threshold
)
SELECT
  scenario,
  threshold,
  tp, fp, tn, fn,
  SAFE_DIVIDE(tp, tp + fn) AS recall,
  SAFE_DIVIDE(tp, tp + fp) AS precision,
  SAFE_DIVIDE(tp + tn, tp + tn + fp + fn) AS accuracy,
  {C_FP} * fp + {C_FN} * fn AS total_cost
FROM agg
ORDER BY scenario;
"""

conf_df = bq.query(sql_conf).to_dataframe()
print("=== Model D: Confusion + expected cost (0.5 vs chosen) ===")
display(conf_df)

=== Model D: Confusion + expected cost (0.5 vs chosen) ===


| scenario | threshold | tp | fp | tn | fn | recall | precision | accuracy | total_cost |
|---|---|---|---|---|---|---|---|---|---|
| baseline_0.5 | 0.5 | 16 | 58 | 8 | 41 | 0.280702 | 0.216216 | 0.195122 | 127.7 |
| chosen_0.2 | 0.2 | 31 | 66 | 0 | 26 | 0.54386 | 0.319588 | 0.252033 | 110.2 |
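
As a cross-check on the table above, the derived columns can be recomputed from the raw counts. The sketch below does this in plain Python (the helper `metrics_from_counts` is illustrative) and also reports the accuracy of always predicting the majority class, which is a useful floor when reading the results.

```python
def metrics_from_counts(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Recompute recall, precision, accuracy, and a majority-class baseline."""
    total = tp + fp + tn + fn
    positives = tp + fn
    negatives = fp + tn
    return {
        "recall": tp / positives if positives else None,
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "accuracy": (tp + tn) / total if total else None,
        # Accuracy obtained by always predicting the more frequent class.
        "majority_baseline": max(positives, negatives) / total if total else None,
    }


# Counts copied from the baseline_0.5 row above.
m = metrics_from_counts(tp=16, fp=58, tn=8, fn=41)
print({k: round(v, 4) for k, v in m.items()})
```

For the `baseline_0.5` row the recomputed accuracy is about 0.195 against a majority-class floor of about 0.537.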


## 6) Fairness by slice (precision/recall + parity gaps)

In [95]:
sql_fair = f"""
{CANONICAL_BASE_SQL}
{SPLIT_CLAUSE}
, scored AS (
  SELECT
    Survived AS label,
    {FAIRNESS_SLICE_COL} AS slice_value,
    predicted_Survived_probs[OFFSET(1)].prob AS p_survive
  FROM ML.PREDICT(
    MODEL `{MODEL_XFORM}`,
    (
      SELECT
        survived AS Survived,
        pclass   AS Pclass,
        sex      AS Sex,
        age      AS Age,
        fare     AS Fare,
        embarked AS Embarked,
        (sibsp + parch + 1) AS family_size,
        CASE
          WHEN fare <= 15 THEN 'Low'
          WHEN fare <= 50 THEN 'Mid'
          ELSE 'High'
        END AS fare_bucket,
        CONCAT(sex, '_', CAST(pclass AS STRING)) AS sex_pclass
      FROM split
      WHERE split.split = 'EVAL'
    )
  )
),
by_slice AS (
  SELECT
    slice_value,
    COUNTIF(label=TRUE AND y_hat=1) AS tp,
    COUNTIF(label=FALSE AND y_hat=1) AS fp,
    COUNTIF(label=TRUE AND y_hat=0) AS fn
  FROM (
    SELECT
      label,
      slice_value,
      IF(p_survive >= {CHOSEN_THRESHOLD}, 1, 0) AS y_hat
    FROM scored
  )
  GROUP BY slice_value
),
with_metrics AS (
  SELECT
    slice_value,
    SAFE_DIVIDE(tp, tp + fp) AS precision,
    SAFE_DIVIDE(tp, tp + fn) AS recall
  FROM by_slice
),
parity AS (
  SELECT 'precision' AS metric, MAX(precision) - MIN(precision) AS gap_pp FROM with_metrics
  UNION ALL
  SELECT 'recall',   MAX(recall)    - MIN(recall)    AS gap_pp FROM with_metrics
)
SELECT
  'by_slice' AS section,
  CAST(NULL AS STRING) AS metric,
  slice_value,
  precision,
  recall,
  CAST(NULL AS FLOAT64) AS gap_pp
FROM with_metrics
UNION ALL
SELECT
  'parity_gap' AS section,
  metric,
  CAST(NULL AS STRING) AS slice_value,
  CAST(NULL AS FLOAT64) AS precision,
  CAST(NULL AS FLOAT64) AS recall,
  gap_pp
FROM parity
ORDER BY section, metric, slice_value;
"""

fair_df = bq.query(sql_fair).to_dataframe()
print(f"=== Model D: Fairness @ threshold={CHOSEN_THRESHOLD:.2f}, slice={FAIRNESS_SLICE_COL} ===")
display(fair_df)


=== Model D: Fairness @ threshold=0.20, slice=sex ===


| section | metric | slice_value | precision | recall | gap_pp |
|---|---|---|---|---|---|
| by_slice | | female | 0.551724 | 0.363636 | |
| by_slice | | male | 0.252525 | 1.0 | |
| parity_gap | precision | | | | 0.355341 |
| parity_gap | recall | | | | 0.6 |
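
The parity gap for a metric is the spread between the best and worst slice, `max - min`. A minimal sketch that recomputes it from the `by_slice` rows above (the helper `parity_gap` is illustrative):

```python
def parity_gap(values_by_slice: dict) -> float:
    """Spread between the highest and lowest metric value across slices."""
    values = [v for v in values_by_slice.values() if v is not None]
    return max(values) - min(values)


# Per-slice values copied from the by_slice rows above.
precision = {"female": 0.551724, "male": 0.252525}
recall = {"female": 0.363636, "male": 1.0}

print(round(parity_gap(precision), 6))  # 0.299199
print(round(parity_gap(recall), 6))     # 0.636364
```

These values differ from the `parity_gap` rows in the output (0.355341 and 0.6). A likely cause is that the query references `with_metrics` twice and BigQuery does not materialize non-recursive CTEs, so `RAND()` can draw a different split for each reference.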



### Write-up (concise)
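Using a 1.7× cost penalty for false negatives (`C_FN = 1.7`, `C_FP = 1`), the cost-minimization sweep selects T = 0.20 as the operating threshold, reducing expected cost on the evaluation split from 127.7 to 110.2 relative to the default 0.50 cutoff. At this threshold recall rises from 0.28 to 0.54, so the model recovers more of the passengers who actually survived, which matters when a missed positive is the costlier error. Because 0.20 is the lowest value in the sweep grid, the true cost minimum may lie below it. A fairness audit by sex shows large disparities: the reported parity gaps are 35.5 percentage points for precision and 60.0 for recall (the by-slice rows imply 29.9 and 63.6), and the model captures male survivors (recall 1.00) far more consistently than female survivors (recall 0.36). Because both gaps exceed the 5-pp fairness threshold, this policy would be flagged as operationally unfair. We recommend either subgroup-specific thresholds or operational safeguards to mitigate sex-based disparities before deployment.

**Caveats:** The split uses `RAND()`, which is re-drawn on every query, so the sweep, confusion, and fairness cells each score a different evaluation sample; this is why the 0.5 row differs between the sweep and the confusion table. Accuracy never exceeds 0.26, which is worse than always predicting the majority class and suggests the positional `[OFFSET(1)]` lookup may be reading the non-survivor probability. Both points should be verified before relying on these numbers.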



---

## Rubric (Titanic, 100 pts)
**Team-only deliverable in this notebook**

- Baseline LOGISTIC_REG + evaluation (AUC + confusion @0.5) — **20**  
- Custom threshold confusion matrix + ops justification — **20**  
- Engineered model with `TRANSFORM` (family_size, fare_bucket, sex_pclass) — **20**  
- Comparison table (baseline vs engineered) + 3–5 sentence interpretation — **20**  
- Reproducibility: parameters clear, no hidden magic; mapping documented — **10**  
- Governance notes: assumptions/limitations + slices you would monitor — **10**

> **Strictness:** No screenshots; use actual results cells. Keep explanations concise (bullet points OK).
