
# Titanic Model C — Subgroup Specialization (BigQuery ML)
**Project:** proven-agility-477721-q9  
**Environment:** Vertex AI / Colab Enterprise

This notebook implements **Model C**, trained only on passengers where `pclass = 3`.
It evaluates Model C and compares it to a **global Model B** trained in *this* project, so everything runs end-to-end without cross-project permission issues.


In [1]:
from google.colab import auth
auth.authenticate_user()
print('✅ Authenticated with Google user')

✅ Authenticated with Google user


In [2]:
from google.cloud import bigquery

# Local project where features and both models live
PROJECT_ID = "proven-agility-477721-q9"

DATASET_TITANIC = "titanic"       # titanic_engineered + Model B
DATASET_UNIT2   = "unit2"         # where Model C will be stored

FEATURE_TABLE = f"{PROJECT_ID}.{DATASET_TITANIC}.titanic_engineered"
MODEL_B_PATH  = f"{PROJECT_ID}.{DATASET_TITANIC}.clf_survived_B"   # global Model B in your project
MODEL_C_PATH  = f"{PROJECT_ID}.{DATASET_UNIT2}.clf_survived_C"     # subgroup Model C

client = bigquery.Client(project=PROJECT_ID)

print("FEATURE_TABLE:", FEATURE_TABLE)
print("MODEL_B_PATH :", MODEL_B_PATH)
print("MODEL_C_PATH :", MODEL_C_PATH)


FEATURE_TABLE: proven-agility-477721-q9.titanic.titanic_engineered
MODEL_B_PATH : proven-agility-477721-q9.titanic.clf_survived_B
MODEL_C_PATH : proven-agility-477721-q9.unit2.clf_survived_C


## Step 1 – Train Model B (Global Baseline)
Model B is trained on **all passengers** using the engineered Titanic features.

In [3]:
train_b_sql = f"""
CREATE OR REPLACE MODEL `{MODEL_B_PATH}`
OPTIONS(
  model_type = 'logistic_reg',
  input_label_cols = ['survived']
) AS
SELECT
  survived,
  pclass,
  sex,
  age,
  fare,
  embarked,
  family_size,
  fare_bucket,
  sex_pclass
FROM `{FEATURE_TABLE}`;
"""

job_b = client.query(train_b_sql)
job_b.result()
print("✅ Model B (global) trained at", MODEL_B_PATH)


✅ Model B (global) trained at proven-agility-477721-q9.titanic.clf_survived_B


## Step 2 – Train Model C on Subgroup (pclass = 3)
Model C focuses only on third-class passengers to test whether specialization improves performance.

In [4]:
train_c_sql = f"""
CREATE OR REPLACE MODEL `{MODEL_C_PATH}`
OPTIONS(
  model_type = 'logistic_reg',
  input_label_cols = ['survived']
) AS
SELECT
  survived,
  pclass,  -- always 3 within this subgroup, so it carries no signal for Model C
  sex,
  age,
  fare,
  embarked,
  family_size,
  fare_bucket,
  sex_pclass
FROM `{FEATURE_TABLE}`
WHERE pclass = 3;
"""

job_c = client.query(train_c_sql)
job_c.result()
print("✅ Model C (pclass = 3) trained at", MODEL_C_PATH)


✅ Model C (pclass = 3) trained at proven-agility-477721-q9.unit2.clf_survived_C


## Step 3 – Evaluate Model C Performance on pclass = 3 Subgroup

In [5]:
eval_c_sql = f"""
SELECT *
FROM ML.EVALUATE(
  MODEL `{MODEL_C_PATH}`,
  (SELECT *
   FROM `{FEATURE_TABLE}`
   WHERE pclass = 3)
);
"""

eval_c_df = client.query(eval_c_sql).to_dataframe()
eval_c_df


| | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|
| 0 | 0.7 | 0.247059 | 0.794366 | 0.365217 | 0.457633 | 0.77144 |
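
The evaluation above scores Model C on the same `pclass = 3` rows that were supplied for training, so the metrics may be optimistic. A minimal sketch of one alternative, assuming BigQuery ML's `data_split_method` and `data_split_eval_fraction` training options: reserve a random slice during training, then call `ML.EVALUATE` without an input table so that it reports on the reserved slice. The helper names, the `_holdout` model suffix, and the 0.2 fraction are illustrative assumptions, not part of the original notebook.

```python
# Sketch only: these helpers build SQL strings; nothing is sent to BigQuery here.
FEATURE_COLUMNS = [
    "survived", "pclass", "sex", "age", "fare",
    "embarked", "family_size", "fare_bucket", "sex_pclass",
]


def build_holdout_train_sql(model_path, feature_table, eval_fraction=0.2):
    """Return a CREATE MODEL statement that reserves a random evaluation split."""
    if not 0.0 < eval_fraction < 1.0:
        raise ValueError("eval_fraction must be between 0 and 1 (exclusive)")
    columns = ",\n  ".join(FEATURE_COLUMNS)
    return (
        f"CREATE OR REPLACE MODEL `{model_path}`\n"
        "OPTIONS(\n"
        "  model_type = 'logistic_reg',\n"
        "  input_label_cols = ['survived'],\n"
        "  data_split_method = 'RANDOM',\n"
        f"  data_split_eval_fraction = {eval_fraction}\n"
        ") AS\n"
        f"SELECT\n  {columns}\n"
        f"FROM `{feature_table}`\n"
        "WHERE pclass = 3;"
    )


def build_holdout_eval_sql(model_path):
    """Return an ML.EVALUATE call with no input table (uses the reserved split)."""
    return f"SELECT * FROM ML.EVALUATE(MODEL `{model_path}`);"


# Hypothetical model name, chosen so the original Model C is not overwritten.
HOLDOUT_MODEL_PATH = "proven-agility-477721-q9.unit2.clf_survived_C_holdout"
HOLDOUT_FEATURE_TABLE = "proven-agility-477721-q9.titanic.titanic_engineered"

train_sql = build_holdout_train_sql(HOLDOUT_MODEL_PATH, HOLDOUT_FEATURE_TABLE)
eval_sql = build_holdout_eval_sql(HOLDOUT_MODEL_PATH)
print(train_sql)
print(eval_sql)
```

To run it in this notebook, pass the strings to the existing client: `client.query(train_sql).result()` followed by `client.query(eval_sql).to_dataframe()`. The subgroup has only 355 passengers, so a 20% holdout is roughly 70 rows and the resulting metrics will be noisy.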


## Step 4 – Compare Model B vs Model C on pclass = 3 Subgroup
Both models are evaluated on the **same subgroup** so metrics are directly comparable.

In [6]:
compare_sql = f"""
WITH eval_subgroup AS (
  SELECT
    survived,
    pclass,
    sex,
    age,
    fare,
    embarked,
    family_size,
    fare_bucket,
    sex_pclass
  FROM `{FEATURE_TABLE}`
  WHERE pclass = 3
),
b AS (
  SELECT 'Model B (Global)' AS model, *
  FROM ML.EVALUATE(MODEL `{MODEL_B_PATH}`, TABLE eval_subgroup)
),
c AS (
  SELECT 'Model C (pclass=3)' AS model, *
  FROM ML.EVALUATE(MODEL `{MODEL_C_PATH}`, TABLE eval_subgroup)
)
SELECT
  model,
  roc_auc,
  log_loss,
  accuracy,
  precision,
  recall
FROM b
UNION ALL
SELECT
  model,
  roc_auc,
  log_loss,
  accuracy,
  precision,
  recall
FROM c;
"""

compare_df = client.query(compare_sql).to_dataframe()
compare_df


| | model | roc_auc | log_loss | accuracy | precision | recall |
|---|---|---|---|---|---|---|
| 0 | Model C (pclass=3) | 0.77144 | 0.457633 | 0.794366 | 0.7 | 0.247059 |
| 1 | Model B (Global) | 0.757518 | 0.470845 | 0.780282 | 0.568627 | 0.341176 |
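
The comparison above can be summarized as per-metric deltas (Model C minus Model B). This is a small sketch with the values copied from the output table; the `uplift` helper and the `LOWER_IS_BETTER` set are illustrative names, not part of the notebook.

```python
# Metric values copied from the Step 4 output for the pclass = 3 subgroup.
model_b = {"roc_auc": 0.757518, "log_loss": 0.470845, "accuracy": 0.780282,
           "precision": 0.568627, "recall": 0.341176}
model_c = {"roc_auc": 0.77144, "log_loss": 0.457633, "accuracy": 0.794366,
           "precision": 0.7, "recall": 0.247059}

# For log loss a decrease is an improvement; for the other metrics an increase is.
LOWER_IS_BETTER = {"log_loss"}


def uplift(baseline, candidate):
    """Return {metric: (delta, improved)} where delta = candidate - baseline."""
    rows = {}
    for metric, base_value in baseline.items():
        delta = candidate[metric] - base_value
        improved = delta < 0 if metric in LOWER_IS_BETTER else delta > 0
        rows[metric] = (round(delta, 6), improved)
    return rows


uplift_rows = uplift(model_b, model_c)
for metric, (delta, improved) in uplift_rows.items():
    print(f"{metric:<10} {delta:+.6f} {'better' if improved else 'worse'}")
```

On these numbers, ROC AUC, log loss, accuracy, and precision move in Model C's favor, while recall moves against it.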


## Step 5 – Confusion Matrix for Model B and Model C @ 0.5 Threshold

In [7]:
conf_sql = f"""
WITH eval_subgroup AS (
  SELECT
    survived,
    pclass,
    sex,
    age,
    fare,
    embarked,
    family_size,
    fare_bucket,
    sex_pclass
  FROM `{FEATURE_TABLE}`
  WHERE pclass = 3
)
SELECT 'Model B (Global)' AS model, *
FROM ML.CONFUSION_MATRIX(
  MODEL `{MODEL_B_PATH}`,
  TABLE eval_subgroup,
  STRUCT(0.5 AS threshold)
)
UNION ALL
SELECT 'Model C (pclass=3)' AS model, *
FROM ML.CONFUSION_MATRIX(
  MODEL `{MODEL_C_PATH}`,
  TABLE eval_subgroup,
  STRUCT(0.5 AS threshold)
);
"""

conf_df = client.query(conf_sql).to_dataframe()
conf_df


| | model | expected_label | _0 (predicted 0) | _1 (predicted 1) |
|---|---|---|---|---|
| 0 | Model B (Global) | 0 | 248 | 22 |
| 1 | Model B (Global) | 1 | 56 | 29 |
| 2 | Model C (pclass=3) | 0 | 261 | 9 |
| 3 | Model C (pclass=3) | 1 | 64 | 21 |
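
The headline metrics can be re-derived from these counts, which is a useful check that the matrix is being read correctly. The sketch below assumes that rows are actual labels (`expected_label`) and that the `_0` and `_1` columns are predicted labels; `metrics_from_counts` is a hypothetical helper, not a BigQuery ML function.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Compute precision, recall, accuracy, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1_score": f1}


# Counts copied from the Step 5 output (0.5 threshold, pclass = 3 subgroup).
metrics_b = metrics_from_counts(tn=248, fp=22, fn=56, tp=29)
metrics_c = metrics_from_counts(tn=261, fp=9, fn=64, tp=21)

for name, metrics in [("Model B (Global)", metrics_b),
                      ("Model C (pclass=3)", metrics_c)]:
    summary = ", ".join(f"{key}={value:.6f}" for key, value in metrics.items())
    print(f"{name}: {summary}")
```

The derived precision, recall, and accuracy agree with the `ML.EVALUATE` output in Step 4, which supports this reading of the rows and columns. The same counts also give an F1 score for Model B that is higher than Model C's, because Model B's recall advantage outweighs its lower precision.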


## Step 6 – Interpretation (3–5 Lines on Localized Uplift)

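- Model C is trained only on third-class passengers, so it can fit patterns specific to that segment rather than the averaged patterns learned by the global Model B.
- On the pclass = 3 subgroup, Model C is slightly better on the threshold-free metrics: ROC AUC 0.771 vs. 0.758 and log loss 0.458 vs. 0.471. Accuracy also rises from 0.780 to 0.794. Each gain is small, roughly 0.01.
- At the 0.5 threshold, Model C is more conservative: precision rises from 0.569 to 0.700 (false positives fall from 22 to 9), but recall drops from 0.341 to 0.247 (21 of 85 survivors identified instead of 29).
- Both models are scored on rows that include their own training data, so these figures likely overstate out-of-sample performance, and gaps this small may not hold on unseen passengers.
- Overall, specialization gives a modest uplift for this segment at the cost of recall; whether that is worthwhile depends on whether missed survivors or false positives are the costlier error.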


## Step 7 – Reflection

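Developing Model C showed how segmentation can sharpen a model's fit to one group at the cost of coverage, since Model C is only meaningful for third-class passengers.  
Within that subset, variables such as family size and fare may behave differently than they do across all classes, although this notebook does not inspect the learned weights to confirm it.  
Comparing Model C to Model B illustrated the trade-off: the global model provides consistency across classes, while the localized model captures segment-specific patterns that may matter for equity and precision, here with slightly better AUC and log loss but lower recall at the 0.5 threshold.  
This exercise reinforced that model selection in analytics pipelines should rest on how features and error types behave within each segment, not on a single headline metric.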
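
One way to check how features such as family size and fare are weighted in each model is to compare the learned coefficients. This is a hedged sketch, assuming BigQuery ML's `ML.WEIGHTS` function and its `processed_input` and `weight` output columns; the `build_weights_sql` helper is hypothetical. Categorical features, whose per-category weights are reported separately, are left out for brevity.

```python
def build_weights_sql(models):
    """Build one query listing numeric feature weights for each labelled model.

    `models` maps a display label to a fully qualified BigQuery ML model path.
    """
    parts = [
        f"SELECT '{label}' AS model, processed_input, weight\n"
        f"FROM ML.WEIGHTS(MODEL `{path}`)\n"
        "WHERE weight IS NOT NULL"  # skip categorical features in this sketch
        for label, path in models.items()
    ]
    return "\nUNION ALL\n".join(parts) + "\nORDER BY processed_input, model;"


# Model paths as defined in the setup cell of this notebook.
weights_sql = build_weights_sql({
    "Model B (Global)": "proven-agility-477721-q9.titanic.clf_survived_B",
    "Model C (pclass=3)": "proven-agility-477721-q9.unit2.clf_survived_C",
})
print(weights_sql)
```

Running `client.query(weights_sql).to_dataframe()` would place each numeric feature's weight in Model B next to its weight in Model C. The two models are fit on different populations, so differences are suggestive rather than conclusive.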
