
# Unit 2 — Team Classification (Titanic, BQML)

**Goal (team):** Build an *ops-ready* classifier in **BigQuery ML** to predict **`survived`** on the Titanic dataset. Requirements mirror the Flights notebook for comparability.
    
**Dataset:** `bigquery-public-data.ml_datasets.titanic`

**Deliver (inside this notebook):**
- One **LOGISTIC_REG** baseline + one **engineered** model (`TRANSFORM`)
- **Evaluation** via `ML.EVALUATE` and **confusion matrices** (0.5 + custom threshold)
- **Threshold choice** + 3–5 sentence ops justification (e.g., lifeboat allocation policy in a hypothetical ops setting)
- Embedded **rubric** below


In [45]:

# --- Minimal setup (edit 2 vars) ---
from google.colab import auth
auth.authenticate_user()

import os
from google.cloud import bigquery

PROJECT_ID = "mgmt-46700-471119"   # e.g., mgmt-467-47888
REGION     = "us-central1"
TABLE_PATH = "mgmt-46700-471119.assignment2.assignment2titanic"

os.environ["PROJECT_ID"] = PROJECT_ID
os.environ["REGION"]     = REGION
bq = bigquery.Client(project=PROJECT_ID)

print("BQ Project:", PROJECT_ID)
print("Source table:", TABLE_PATH)


BQ Project: mgmt-46700-471119
Source table: mgmt-46700-471119.assignment2.assignment2titanic


### Quick sanity check

In [46]:

bq.query(f"SELECT * FROM `{TABLE_PATH}` LIMIT 5").result().to_dataframe()


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 180 | 0 | 3 | Leonard, Mr. Lionel | male | 36.0 | 0 | 0 | LINE | 0.0 | | S |
| 1 | 264 | 0 | 1 | Harrison, Mr. William | male | 40.0 | 0 | 0 | 112059 | 0.0 | B94 | S |
| 2 | 278 | 0 | 2 | Parkes, Mr. Francis "Frank" | male | | 0 | 0 | 239853 | 0.0 | | S |
| 3 | 303 | 0 | 3 | Johnson, Mr. William Cahoone Jr | male | 19.0 | 0 | 0 | LINE | 0.0 | | S |
| 4 | 414 | 0 | 2 | Cunningham, Mr. Alfred Fleming | male | | 0 | 0 | 239853 | 0.0 | | S |



## 1) Canonical mapping (minimal)
We map to:
- `survived` (BOOL), `pclass` (INT), `sex` (STRING), `age` (NUM), `sibsp` (INT), `parch` (INT), `fare` (NUM), `embarked` (STRING)


In [47]:
CANONICAL_BASE_SQL = f'''
WITH titanic_c AS (
  SELECT
    PassengerId, -- Added PassengerId
    CAST(survived AS BOOL) AS survived,
    CAST(pclass AS INT64)  AS pclass,
    CAST(sex AS STRING)    AS sex,
    CAST(age AS FLOAT64)   AS age,
    CAST(sibsp AS INT64)   AS sibsp,
    CAST(parch AS INT64)   AS parch,
    CAST(fare AS FLOAT64)  AS fare,
    CAST(embarked AS STRING) AS embarked
  FROM `{TABLE_PATH}`
  WHERE age IS NOT NULL AND fare IS NOT NULL
)
'''
print(CANONICAL_BASE_SQL[:500] + "\n...")


WITH titanic_c AS (
  SELECT
    PassengerId, -- Added PassengerId
    CAST(survived AS BOOL) AS survived,
    CAST(pclass AS INT64)  AS pclass,
    CAST(sex AS STRING)    AS sex,
    CAST(age AS FLOAT64)   AS age,
    CAST(sibsp AS INT64)   AS sibsp,
    CAST(parch AS INT64)   AS parch,
    CAST(fare AS FLOAT64)  AS fare,
    CAST(embarked AS STRING) AS embarked
  FROM `mgmt-46700-471119.assignment2.assignment2titanic`
  WHERE age IS NOT NULL AND fare IS NOT NULL
)

...


## 2) Split (80/20)

In [48]:

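# NOTE: RAND() is unseeded and re-evaluated by every query that embeds this clause,
# so TRAIN/EVAL membership differs between the training, ML.EVALUATE, and ML.PREDICT queries.
SPLIT_CLAUSE = r'''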
, split_data AS (
  SELECT t.*,
         CASE WHEN RAND() < 0.8 THEN 'TRAIN' ELSE 'EVAL' END AS data_split
  FROM titanic_c t
)
'''
print(SPLIT_CLAUSE)



, split_data AS (
  SELECT t.*,
         CASE WHEN RAND() < 0.8 THEN 'TRAIN' ELSE 'EVAL' END AS data_split
  FROM titanic_c t
)
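Because `RAND()` is drawn again in every query that embeds the clause above, a row can be labeled TRAIN when the model is fitted and EVAL (or TRAIN again) when it is scored. A minimal sketch of a repeatable alternative follows, assuming `PassengerId` is unique per row. It hashes the key with BigQuery's `FARM_FINGERPRINT` and sends buckets 0 to 7 (of 10) to TRAIN. The helper name `hash_split_clause` and its parameters are hypothetical and not part of the notebook.

```python
def hash_split_clause(key_col: str = "PassengerId",
                      train_buckets: int = 8,
                      total_buckets: int = 10) -> str:
    """Return a CTE fragment that assigns TRAIN/EVAL from a hash of key_col.

    Hypothetical helper: the same key always hashes to the same bucket,
    so every query that embeds the fragment sees the same split.
    """
    if not 0 < train_buckets < total_buckets:
        raise ValueError("need 0 < train_buckets < total_buckets")
    # Hash the key, fold it into [0, total_buckets), and compare with the cutoff.
    bucket = (
        f"MOD(ABS(FARM_FINGERPRINT(CAST(t.{key_col} AS STRING))), {total_buckets})"
    )
    return f"""
, split_data AS (
  SELECT t.*,
         CASE WHEN {bucket} < {train_buckets} THEN 'TRAIN' ELSE 'EVAL' END AS data_split
  FROM titanic_c t
)
"""


DETERMINISTIC_SPLIT_CLAUSE = hash_split_clause()
print(DETERMINISTIC_SPLIT_CLAUSE)
```

The fragment has the same shape as `SPLIT_CLAUSE`, so it could be swapped in without touching the downstream queries. The TRAIN share is only approximately 80%, since it depends on how the hashed keys fall into buckets.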




## 3) Baseline model — LOGISTIC_REG (`survived`)
Use a small set of signals (keep parity with Flights complexity).


In [49]:
SCHEMA = f"{PROJECT_ID}.unit2_titanic"
MODEL_BASE = f"{SCHEMA}.clf_survived_base"

# 1. Create Schema
bq.query(f"CREATE SCHEMA IF NOT EXISTS `{SCHEMA}`;").result()

# 2. Create Model
sql_create_model = f'''
CREATE OR REPLACE MODEL `{MODEL_BASE}`
OPTIONS (MODEL_TYPE='LOGISTIC_REG', INPUT_LABEL_COLS=['survived']) AS
{CANONICAL_BASE_SQL}
{SPLIT_CLAUSE}
SELECT
  survived, pclass, sex, age, sibsp, parch, fare, embarked
FROM split_data
WHERE data_split='TRAIN'
;
'''
job_create = bq.query(sql_create_model); _ = job_create.result()
print("Baseline model trained:", MODEL_BASE)

# 3. Evaluate Model and display results
sql_evaluate_model = f'''
SELECT * FROM ML.EVALUATE(
  MODEL `{MODEL_BASE}`,
  (
    {CANONICAL_BASE_SQL}
    {SPLIT_CLAUSE}
    SELECT survived, pclass, sex, age, sibsp, parch, fare, embarked
    FROM split_data WHERE data_split='EVAL'
  )
);
'''
job_eval = bq.query(sql_evaluate_model); eval_df = job_eval.result().to_dataframe()
eval_df


Baseline model trained: mgmt-46700-471119.unit2_titanic.clf_survived_base


| | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|
| 0 | 0.708333 | 0.829268 | 0.815789 | 0.764045 | 0.416471 | 0.890305 |


### Confusion matrix — default 0.5 threshold

In [50]:
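# Strip the leading "\nWITH" and "\n," so both CTE definitions can be re-listed under one WITH below.
# NOTE: split_data is evaluated twice in this query (outer SELECT and ML.PREDICT input), each with
# its own RAND() draw, so the join keeps only rows that land in EVAL both times.
titanic_c_def = CANONICAL_BASE_SQL[len("WITH "):]
split_data_def = SPLIT_CLAUSE[len(", "):]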

cm_default_sql = f'''
WITH
  {titanic_c_def},
  {split_data_def},
  scored AS (
    SELECT
      t.survived AS label,
      p.predicted_survived AS pred_label,
      p.predicted_survived_probs[OFFSET(0)].prob AS score
    FROM split_data t
    JOIN ML.PREDICT(MODEL `{MODEL_BASE}`,
        (
          {CANONICAL_BASE_SQL}
          {SPLIT_CLAUSE}
          SELECT PassengerId, pclass, sex, age, sibsp, parch, fare, embarked FROM split_data WHERE data_split='EVAL'
        )
    ) AS p
    ON t.PassengerId = p.PassengerId
    WHERE t.data_split='EVAL'
  )
SELECT
  SUM(CASE WHEN label=TRUE  AND pred_label=TRUE  THEN 1 ELSE 0 END) AS TP,
  SUM(CASE WHEN label=FALSE AND pred_label=TRUE  THEN 1 ELSE 0 END) AS FP,
  SUM(CASE WHEN label=TRUE  AND pred_label=FALSE THEN 1 ELSE 0 END) AS FN,
  SUM(CASE WHEN label=FALSE AND pred_label=FALSE THEN 1 ELSE 0 END) AS TN
FROM scored;
'''
bq.query(cm_default_sql).result().to_dataframe()

| | TP | FP | FN | TN |
|---|---|---|---|---|
| 0 | 7 | 0 | 3 | 19 |
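As a cross-check, the usual metrics can be recomputed from the four counts. The sketch below is plain Python with a hypothetical helper name (`metrics_from_counts`); it only restates the standard definitions of precision, recall, and accuracy. Applied to the counts above it gives 29 rows, precision 1.0, and recall 0.7, whereas the `ML.EVALUATE` cell reported precision 0.708 and recall 0.829. That gap is consistent with each query drawing its own `RAND()` split instead of scoring the same rows.

```python
def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Recompute standard classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    nan = float("nan")
    return {
        "n": total,
        # Share of predicted survivors who actually survived.
        "precision": tp / (tp + fp) if (tp + fp) else nan,
        # Share of actual survivors the model identified.
        "recall": tp / (tp + fn) if (tp + fn) else nan,
        "accuracy": (tp + tn) / total if total else nan,
    }


# Counts copied from the 0.5-threshold confusion matrix output.
baseline_cm = metrics_from_counts(tp=7, fp=0, fn=3, tn=19)
print(baseline_cm)
```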


### Confusion matrix — your custom threshold

In [51]:

CUSTOM_THRESHOLD = 0.6   # TODO: justify in ops (e.g., conservative rescue policy)

titanic_c_def = CANONICAL_BASE_SQL[len("WITH "):]
split_data_def = SPLIT_CLAUSE[len(", "):]

cm_thresh_sql = f'''
WITH
  {titanic_c_def},
  {split_data_def},
  scored AS (
    SELECT
      scored_predictions.survived AS label,
      CAST(scored_predictions.score >= {CUSTOM_THRESHOLD} AS BOOL) AS pred_label
    FROM (
      SELECT
        t.survived, -- Explicitly select survived
        t.PassengerId, -- Keep PassengerId for the join
        p.predicted_survived_probs[OFFSET(0)].prob AS score
      FROM split_data t
      JOIN ML.PREDICT(MODEL `{MODEL_BASE}`,
          (
            {CANONICAL_BASE_SQL}
            {SPLIT_CLAUSE}
            SELECT PassengerId, pclass, sex, age, sibsp, parch, fare, embarked FROM split_data WHERE data_split='EVAL'
          )
      ) AS p
      ON t.PassengerId = p.PassengerId
      WHERE t.data_split='EVAL'
    ) AS scored_predictions
  )
SELECT
  SUM(CASE WHEN label=TRUE  AND pred_label=TRUE  THEN 1 ELSE 0 END) AS TP,
  SUM(CASE WHEN label=FALSE AND pred_label=TRUE  THEN 1 ELSE 0 END) AS FP,
  SUM(CASE WHEN label=TRUE  AND pred_label=FALSE THEN 1 ELSE 0 END) AS FN,
  SUM(CASE WHEN label=FALSE AND pred_label=FALSE THEN 1 ELSE 0 END) AS TN
FROM scored;
'''
bq.query(cm_thresh_sql).result().to_dataframe()

| | TP | FP | FN | TN |
|---|---|---|---|---|
| 0 | 6 | 3 | 6 | 17 |
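The two confusion matrices cover different rows: the 0.5 matrix sums to 29 and the 0.6 matrix to 32, and false positives rise from 0 to 3 even though raising a threshold on the same scored rows cannot add predicted positives. A minimal sketch of a query that avoids the join is shown below. It relies on `ML.PREDICT` passing input columns through to its output and on its optional `STRUCT(... AS threshold)` argument for binary classifiers. The helper `confusion_sql` and the `my_project...` names are hypothetical placeholders.

```python
def confusion_sql(model: str, eval_query: str, threshold: float = 0.5) -> str:
    """Build a confusion-matrix query that scores eval_query exactly once.

    Hypothetical helper. eval_query must return the `survived` label plus the
    model's feature columns; because ML.PREDICT passes input columns through,
    no join back to the source rows is needed.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0, 1]")
    return f"""
WITH scored AS (
  SELECT survived AS label, predicted_survived AS pred_label
  FROM ML.PREDICT(
    MODEL `{model}`,
    ({eval_query}),
    STRUCT({threshold} AS threshold)
  )
)
SELECT
  COUNTIF(label AND pred_label)         AS TP,
  COUNTIF(NOT label AND pred_label)     AS FP,
  COUNTIF(label AND NOT pred_label)     AS FN,
  COUNTIF(NOT label AND NOT pred_label) AS TN
FROM scored
"""


# Placeholder names; substitute the real model and a fixed evaluation set.
example_sql = confusion_sql(
    model="my_project.unit2_titanic.clf_survived_base",
    eval_query=(
        "SELECT survived, pclass, sex, age, sibsp, parch, fare, embarked "
        "FROM `my_project.my_dataset.titanic_eval`"
    ),
    threshold=0.6,
)
print(example_sql)
```

Calling the helper twice with the same `eval_query` and two thresholds should give matrices with identical row totals, provided the evaluation set itself is fixed (for example a materialized table or a hash-based split).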



## 4) Engineered model — `TRANSFORM`
Create **family_size**, **fare_bucket**, and a **sex_pclass** interaction (categorical). Compare with baseline.


In [52]:
# Model B — Engineered uplift (global + engineered).
MODEL_XFORM = f"{SCHEMA}.clf_survived_xform"

sql_xform = f'''
CREATE OR REPLACE MODEL `{MODEL_XFORM}`
OPTIONS (MODEL_TYPE='LOGISTIC_REG', INPUT_LABEL_COLS=['survived']) AS
SELECT
  survived,
  pclass,
  sex,
  age,
  sibsp,
  parch,
  fare,
  embarked,
  -- Engineered features
  (sibsp + parch + 1) AS family_size,
  CASE
    WHEN fare < 10 THEN 'low'
    WHEN fare < 50 THEN 'mid'
    ELSE 'high'
  END AS fare_bucket,
  CONCAT(sex, '_', CAST(pclass AS STRING)) AS sex_pclass
FROM (
  SELECT *,
         CASE WHEN RAND() < 0.8 THEN 'TRAIN' ELSE 'EVAL' END AS split
  FROM (
    SELECT
      CAST(survived AS BOOL) AS survived,
      CAST(pclass AS INT64)  AS pclass,
      CAST(sex AS STRING)    AS sex,
      CAST(age AS FLOAT64)   AS age,
      CAST(sibsp AS INT64)   AS sibsp,
      CAST(parch AS INT64)   AS parch,
      CAST(fare AS FLOAT64)  AS fare,
      CAST(embarked AS STRING) AS embarked
    FROM `{TABLE_PATH}`
    WHERE age IS NOT NULL AND fare IS NOT NULL
  )
)
WHERE split='TRAIN';
'''

job_train_xform = bq.query(sql_xform).result()
print("Engineered model trained (Model B):", MODEL_XFORM)


Engineered model trained (Model B): mgmt-46700-471119.unit2_titanic.clf_survived_xform
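The cell above computes the engineered columns in the training `SELECT` list; it does not use a BigQuery ML `TRANSFORM` clause, so every later query has to repeat the same expressions. A minimal sketch of the `TRANSFORM` form follows. With it, BigQuery ML stores the preprocessing alongside the model and reapplies it inside `ML.EVALUATE` and `ML.PREDICT`, so callers pass only the raw columns. The helper `transform_model_sql` and the `my_project...` names are hypothetical placeholders.

```python
def transform_model_sql(model: str, train_query: str) -> str:
    """Build a CREATE MODEL statement whose feature engineering lives in TRANSFORM.

    Hypothetical helper. train_query must return the raw columns only; the
    label (`survived`) is listed in TRANSFORM so it is passed through.
    """
    return f"""
CREATE OR REPLACE MODEL `{model}`
TRANSFORM(
  survived,
  pclass, sex, age, sibsp, parch, fare, embarked,
  (sibsp + parch + 1) AS family_size,
  CASE
    WHEN fare < 10 THEN 'low'
    WHEN fare < 50 THEN 'mid'
    ELSE 'high'
  END AS fare_bucket,
  CONCAT(sex, '_', CAST(pclass AS STRING)) AS sex_pclass
)
OPTIONS (MODEL_TYPE='LOGISTIC_REG', INPUT_LABEL_COLS=['survived']) AS
{train_query}
"""


# Placeholder names; substitute the real model path and training rows.
xform_sql = transform_model_sql(
    model="my_project.unit2_titanic.clf_survived_xform",
    train_query=(
        "SELECT survived, pclass, sex, age, sibsp, parch, fare, embarked "
        "FROM `my_project.my_dataset.titanic_train`"
    ),
)
print(xform_sql)
```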


In [40]:
# Comparison of Baseline and Engineered Models
sql_eval_xform_fixed = f'''
-- Baseline Model A
SELECT 'baseline' AS model_version, *
FROM ML.EVALUATE(
  MODEL `{MODEL_BASE}`,
  (
    SELECT * FROM (
      SELECT
        CAST(survived AS BOOL) AS survived,  -- cast to BOOL
        pclass, sex, age, sibsp, parch, fare, embarked,
        CASE WHEN RAND() < 0.8 THEN 'TRAIN' ELSE 'EVAL' END AS split
      FROM `{TABLE_PATH}`
      WHERE age IS NOT NULL AND fare IS NOT NULL
    )
    WHERE split='EVAL'
  )
)
UNION ALL
-- Engineered Model B
SELECT 'engineered' AS model_version, *
FROM ML.EVALUATE(
  MODEL `{MODEL_XFORM}`,
  (
    SELECT * FROM (
      SELECT
        CAST(survived AS BOOL) AS survived,  -- cast to BOOL
        pclass, sex, age, sibsp, parch, fare, embarked,
        (sibsp + parch + 1) AS family_size,
        CASE WHEN fare < 10 THEN 'low'
             WHEN fare < 50 THEN 'mid'
             ELSE 'high'
        END AS fare_bucket,
        CONCAT(sex, '_', CAST(pclass AS STRING)) AS sex_pclass,
        CASE WHEN RAND() < 0.8 THEN 'TRAIN' ELSE 'EVAL' END AS split
      FROM `{TABLE_PATH}`
      WHERE age IS NOT NULL AND fare IS NOT NULL
    )
    WHERE split='EVAL'
  )
);
'''

# Execute the query and display the results.
job_eval_xform_fixed = bq.query(sql_eval_xform_fixed)
eval_xform_df = job_eval_xform_fixed.result().to_dataframe()
eval_xform_df


| | model_version | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|---|
| 0 | baseline | 0.774194 | 0.716418 | 0.773973 | 0.744186 | 0.462546 | 0.855277 |
| 1 | engineered | 0.875 | 0.688525 | 0.837662 | 0.770642 | 0.36593 | 0.913628 |
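To make the comparison explicit, the per-metric change (engineered minus baseline) can be computed from the two rows above. This is a small plain-Python sketch; the dictionaries simply copy the reported values, and the variable names are illustrative.

```python
# Values copied from the comparison output above.
baseline = {
    "precision": 0.774194, "recall": 0.716418, "accuracy": 0.773973,
    "f1_score": 0.744186, "log_loss": 0.462546, "roc_auc": 0.855277,
}
engineered = {
    "precision": 0.875, "recall": 0.688525, "accuracy": 0.837662,
    "f1_score": 0.770642, "log_loss": 0.36593, "roc_auc": 0.913628,
}

# Positive deltas are improvements for every metric except log_loss,
# where lower is better.
deltas = {name: round(engineered[name] - baseline[name], 6) for name in baseline}

for name, delta in deltas.items():
    print(f"{name:>9}: {delta:+.6f}")
```

Because the two `ML.EVALUATE` calls each draw their own random EVAL rows, these differences are best read as indicative rather than exact.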


In [44]:
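# Model B: Engineered uplift - confusion @ 0.5
# NOTE: the EVAL flag is drawn with a fresh RAND() after scoring the full table,
# so this sample overlaps rows the model was trained on.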
sql_confusion_xform = f'''
WITH scored AS (
  SELECT
    CAST(survived AS BOOL) AS label,
    predicted_survived AS pred_label,
    predicted_survived_probs[OFFSET(0)].prob AS score,
    CASE WHEN RAND() < 0.8 THEN 'TRAIN' ELSE 'EVAL' END AS split
  FROM ML.PREDICT(MODEL `{MODEL_XFORM}`,
    (
      SELECT *,
             (sibsp + parch + 1) AS family_size,
             CASE
               WHEN fare < 10 THEN 'low'
               WHEN fare < 50 THEN 'mid'
               ELSE 'high'
             END AS fare_bucket,
             CONCAT(sex, '_', CAST(pclass AS STRING)) AS sex_pclass
      FROM `{TABLE_PATH}`
      WHERE age IS NOT NULL AND fare IS NOT NULL
    )
  )
)
SELECT
  SUM(CASE WHEN label=TRUE  AND score >= 0.5 THEN 1 ELSE 0 END) AS TP,
  SUM(CASE WHEN label=FALSE AND score >= 0.5 THEN 1 ELSE 0 END) AS FP,
  SUM(CASE WHEN label=TRUE  AND score < 0.5  THEN 1 ELSE 0 END) AS FN,
  SUM(CASE WHEN label=FALSE AND score < 0.5  THEN 1 ELSE 0 END) AS TN
FROM scored
WHERE split='EVAL';
'''

# Execute the query and display the results.
confusion_xform_df = bq.query(sql_confusion_xform).result().to_dataframe()
confusion_xform_df


| | TP | FP | FN | TN |
|---|---|---|---|---|
| 0 | 37 | 8 | 19 | 67 |



### Write-up (concise)
- **Threshold chosen & ops rationale:** …  
- **Baseline vs engineered — changes in AUC/precision/recall:** …  
- **Risk framing:** FP vs FN trade in a rescue/triage-like context: what error hurts more and why? …


### Write-up Answer:

-   **Threshold chosen & ops rationale:** The engineered model is evaluated at the default threshold of 0.5, which treats a false positive and a false negative as equally costly. For a hypothetical lifeboat-allocation policy this is a neutral starting point: it favors neither over-allocating nor under-allocating places until the relative cost of each error type has been quantified. A custom threshold of 0.6 was also tried on the baseline model for comparison.

-   **Baseline vs engineered — changes in AUC/precision/recall:** The engineered model (`clf_survived_xform`) outperformed the baseline on most metrics. ROC AUC rose from 0.855 to 0.914 and log loss fell from 0.463 to 0.366, indicating better ranking of passengers and better-calibrated probabilities. Precision rose from 0.774 to 0.875 (fewer incorrect survival predictions), while recall dipped from 0.716 to 0.689. Each `ML.EVALUATE` call draws its own `RAND()` split, so the two rows are scored on different samples and the deltas are indicative rather than exact; even so, the direction of the AUC and precision gains suggests that `family_size`, `fare_bucket`, and `sex_pclass` add useful signal.

-   **Risk framing:** In a rescue or triage-like context such as the Titanic disaster, false negatives (FN: predicting that a passenger will not survive when they actually do) are typically the costlier error, because acting on that prediction could mean withholding help from someone who could be saved. False positives (FP: predicting survival that does not occur) misallocate scarce resources, but that indirect cost is usually judged smaller than a preventable loss of life.

## Reproducibility

This section records the parameters and methods used for model training and evaluation so that the results can be re-created and audited.

### Model Training and Evaluation Parameters

-   **Model Type:** Logistic Regression
-   **Input Features:**
    -   **Baseline Model Features:** `pclass`, `sex`, `age`, `sibsp`, `parch`, `fare`, `embarked`
    -   **Engineered Model Features:** In addition to baseline features, includes `family_size`, `fare_bucket`, and `sex_pclass`.
-   **Label Column:** `survived` (BOOLEAN)
-   **Data Splitting Method:** A random split (`RAND() < 0.8`) labels roughly 80% of the preprocessed rows 'TRAIN' and the rest 'EVAL'. `RAND()` is unseeded and re-evaluated in every query, so split membership changes on each run and between the training and evaluation queries.
-   **Evaluation Threshold:** Both models are compared at the default threshold of 0.5; a custom threshold of 0.6 was also explored on the baseline model.

### Feature Mapping and Derivation

All features are derived from the Titanic table referenced by `TABLE_PATH` (`mgmt-46700-471119.assignment2.assignment2titanic`), with explicit type casts for the base features and SQL expressions for the engineered ones. No hidden logic or external data sources are used.

-   **Base Features (for both models):**
    -   `survived`: `CAST(survived AS BOOL)`
    -   `pclass`: `CAST(pclass AS INT64)`
    -   `sex`: `CAST(sex AS STRING)`
    -   `age`: `CAST(age AS FLOAT64)`
    -   `sibsp`: `CAST(sibsp AS INT64)`
    -   `parch`: `CAST(parch AS INT64)`
    -   `fare`: `CAST(fare AS FLOAT64)`
    -   `embarked`: `CAST(embarked AS STRING)`
    -   **Filtering:** Only rows where `age` IS NOT NULL AND `fare` IS NOT NULL were included in the analysis.

-   **Engineered Features (for Engineered Model only):**
    -   `family_size`: Calculated as `(sibsp + parch + 1)`, representing the total number of family members including the passenger.
    -   `fare_bucket`: Categorizes `fare` into 'low' (`< 10`), 'mid' (`>= 10` and `< 50`), or 'high' (`>= 50`) bands.
    -   `sex_pclass`: An interaction feature created by concatenating `sex` and `pclass` (e.g., 'male_1', 'female_3').
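
For readers who want to check the derivations outside BigQuery, the three engineered features can be mirrored in plain Python. This is a sketch that restates the SQL expressions used in the notebook; the function names match the feature names but are otherwise illustrative, and NULL handling is omitted because rows with NULL `fare` are filtered out upstream.

```python
def family_size(sibsp: int, parch: int) -> int:
    """Siblings/spouses + parents/children + the passenger themselves."""
    return sibsp + parch + 1


def fare_bucket(fare: float) -> str:
    """Mirror of the SQL CASE: low (< 10), mid (>= 10 and < 50), high (>= 50)."""
    if fare < 10:
        return "low"
    if fare < 50:
        return "mid"
    return "high"


def sex_pclass(sex: str, pclass: int) -> str:
    """Categorical interaction, e.g. 'male_1' or 'female_3'."""
    return f"{sex}_{pclass}"


# Boundary values show which band each edge falls into.
print(family_size(1, 2), fare_bucket(9.99), fare_bucket(10), fare_bucket(50),
      sex_pclass("female", 3))
```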

All transformations and model definitions appear in the SQL queries in this notebook, so the models can be rebuilt from it; exact metric values will vary between runs because of the random split.


## Governance Notes

### Model Assumptions and Limitations

-   **Assumptions:**
    -   The Titanic dataset is assumed to be sufficiently representative of passenger demographics and survival factors for the purpose of demonstrating binary classification. Real-world applicability would require more robust data validation.
    -   The missing-data strategy (dropping rows with NULL `age` or `fare`, rather than imputing values) is assumed not to introduce significant bias for this demonstration. A production system would consider imputation instead.
-   **Limitations:**
    -   **Dataset Size:** The dataset, while suitable for this exercise, is relatively small and historical, limiting the generalizability of the findings to modern contexts or larger populations.
    -   **Feature Engineering:** The engineered features are simple and handcrafted. More complex relationships might be captured with advanced feature engineering techniques or automated feature learning.
    -   **Random Split Variability:** The random data split can lead to slight variations in metrics upon re-execution, though general trends should remain consistent.

### Key Data Slices to Monitor

To ensure fair and reliable model performance, continuous monitoring of predictions across specific data slices is recommended:

-   **Demographic Groups:** Monitor performance metrics (accuracy, precision, recall) for different `sex` and `pclass` categories, and potentially for `age` bands, to detect biases or performance degradation.
-   **Fare Categories:** Track performance across `fare_bucket` categories (`low`, `mid`, `high`), since `fare` is closely tied to `pclass` and, through it, to survival.
-   **Embarkation Point:** Monitor `embarked` locations (C, Q, S) to ensure consistent performance for passengers from different origins.
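
One way to monitor these slices is to compute a metric per group from scored rows. The sketch below is plain Python on a handful of made-up rows (the `rows` list is hypothetical illustration data, not notebook output) and reports recall per slice; the helper name `recall_by_slice` is likewise an assumption.

```python
from collections import defaultdict


def recall_by_slice(rows, slice_key):
    """Return {slice value: recall} for rows carrying 'label' and 'pred' booleans."""
    tp = defaultdict(int)
    fn = defaultdict(int)
    for row in rows:
        if row["label"]:  # recall only looks at actual positives
            if row["pred"]:
                tp[row[slice_key]] += 1
            else:
                fn[row[slice_key]] += 1
    slices = sorted(set(tp) | set(fn))
    return {s: tp[s] / (tp[s] + fn[s]) for s in slices}


# Hypothetical scored rows, for illustration only.
rows = [
    {"sex": "female", "label": True,  "pred": True},
    {"sex": "female", "label": True,  "pred": True},
    {"sex": "female", "label": False, "pred": True},
    {"sex": "male",   "label": True,  "pred": False},
    {"sex": "male",   "label": True,  "pred": True},
    {"sex": "male",   "label": False, "pred": False},
]

slice_recall = recall_by_slice(rows, "sex")
print(slice_recall)
```

The same helper works for `pclass`, `fare_bucket`, or `embarked` by changing `slice_key`; slices with no actual positives are omitted rather than reported as zero.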

### Potential Risks and Implications of Errors

In a rescue or triage-like context, the implications of different types of errors are critical:

-   **False Negatives (FN):** Predicting a passenger will *not* survive when they *actually do*. In a life-or-death scenario, FNs are generally considered more harmful. An FN represents a missed opportunity to save a life, leading to a preventable loss. For our engineered model at a 0.5 threshold, we observed 19 False Negatives (from `confusion_xform_df`).
-   **False Positives (FP):** Predicting a passenger *will* survive when they *do not*. FPs can lead to misallocation of scarce resources (e.g., assigning a lifeboat spot to someone who wouldn't have survived anyway). While undesirable, the direct human cost is often considered less severe than an FN. For our engineered model at a 0.5 threshold, we observed 8 False Positives (from `confusion_xform_df`).

Given this risk profile, a model designed for a real-world rescue scenario might prioritize reducing False Negatives by lowering the classification threshold below 0.5, accepting some additional False Positives in exchange. The 0.5 threshold used here is a balanced default; further tuning should be based on the specific operational cost of each error type.
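
The trade-off can be made concrete with explicit error costs. The sketch below assumes illustrative weights (`COST_FP = 1`, `COST_FN = 5`; these are hypothetical, not derived from the notebook), zero cost for correct predictions, and well-calibrated probabilities. Under those assumptions, predicting "survives" is the cheaper decision whenever the predicted probability exceeds `cost_fp / (cost_fp + cost_fn)`.

```python
def expected_cost(fp: int, fn: int, cost_fp: float, cost_fn: float) -> float:
    """Total cost of a confusion matrix under fixed per-error costs."""
    return fp * cost_fp + fn * cost_fn


def cost_optimal_threshold(cost_fp: float, cost_fn: float) -> float:
    """Threshold that minimizes expected cost for calibrated probabilities.

    Predicting positive costs (1 - p) * cost_fp in expectation and predicting
    negative costs p * cost_fn, so positive is cheaper when
    p > cost_fp / (cost_fp + cost_fn).
    """
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


# Hypothetical weights: a false negative is treated as 5x worse than a false positive.
COST_FP, COST_FN = 1.0, 5.0

# FP and FN counts copied from the engineered model's 0.5-threshold matrix.
cost_at_half = expected_cost(fp=8, fn=19, cost_fp=COST_FP, cost_fn=COST_FN)
suggested = cost_optimal_threshold(COST_FP, COST_FN)

print(f"cost at 0.5 threshold: {cost_at_half}")
print(f"suggested threshold:   {suggested:.3f}")
```

With equal costs the formula returns 0.5, matching the default. Any weighting that penalizes false negatives more heavily pushes the threshold below 0.5.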

## Model B Summary: Engineered Uplift (Titanic Survival Prediction, BigQuery)

**Goal:** Model B aimed to achieve an *engineered uplift* in Titanic survival predictions by specifically capturing family and class interactions, building upon the baseline model's capabilities.

**Features Used:** This model utilized the baseline features (`pclass`, `sex`, `age`, `sibsp`, `parch`, `fare`, `embarked`) and augmented them with three engineered features:
*   `family_size`: Calculated as `sibsp + parch + 1`, representing the total number of family members.
*   `fare_bucket`: Categorizes fare into 'low' (< 10), 'mid' (>= 10 and < 50), and 'high' (>= 50).
*   `sex_pclass`: An interaction term concatenating `sex` and `pclass` (e.g., 'male_1', 'female_3').

**Differences from Baseline:** Unlike the baseline model (Model A), Model B directly incorporates richer relational and categorical information through its engineered features, specifically targeting the known social dynamics of the Titanic disaster.

**Performance Improvements (vs. Model A):** Model B demonstrated significant performance gains over the baseline. Its **AUC increased from 0.855 to 0.914**, and **Log Loss decreased from 0.463 to 0.366**, indicating a more discriminative and better-calibrated model. Precision also improved notably **from 0.774 to 0.875**, suggesting fewer false positive predictions, although recall slightly decreased from 0.716 to 0.689.

**Why Engineered Features Helped:** The engineered features likely improved performance by capturing critical non-linear relationships and interactions not apparent in the raw features. `family_size` and `sex_pclass` directly address hypotheses about survival biases (e.g., "women and children first," class-based resource allocation), allowing the model to better differentiate survival patterns within these key demographic groups.


---

## Rubric (Titanic, 100 pts)
**Team-only deliverable in this notebook**

- Baseline LOGISTIC_REG + evaluation (AUC + confusion @0.5) — **20**  
- Custom threshold confusion matrix + ops justification — **20**  
- Engineered model with `TRANSFORM` (family_size, fare_bucket, sex_pclass) — **20**  
- Comparison table (baseline vs engineered) + 3–5 sentence interpretation — **20**  
- Reproducibility: parameters clear, no hidden magic; mapping documented — **10**  
- Governance notes: assumptions/limitations + slices you would monitor — **10**

> **Strictness:** No screenshots; use actual results cells. Keep explanations concise (bullet points OK).
