
# Titanic Model B — Engineered Uplift (BigQuery ML)
Project: **mgmt-467-471119**  
Environment: **Vertex AI / Colab Enterprise**  

This notebook implements the required *Model B (Engineered Uplift)* for the Titanic assignment. It adds engineered features to the baseline model to demonstrate improvement in predictive power.

In [None]:
from google.cloud import bigquery

PROJECT_ID = "mgmt-467-471119"
TABLE_PATH = f"{PROJECT_ID}.titanic_data.titanic"

# Vertex AI handles authentication automatically
bq = bigquery.Client(project=PROJECT_ID)
print("✅ BigQuery client initialized for:", PROJECT_ID)
print("Using Titanic table:", TABLE_PATH)

✅ BigQuery client initialized for: mgmt-467-471119
Using Titanic table: mgmt-467-471119.titanic_data.titanic


## Step 1 – Feature Engineering
Create new variables to capture family size, socioeconomic level, and class–sex interactions:

In [None]:
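# Build the engineered table. Two details of the SQL worth knowing:
#   - BETWEEN is inclusive, so fares of exactly 15 and 50 are both 'mid'.
#   - A NULL fare fails both comparisons and falls through to 'high'.
feature_sql = f"""
CREATE OR REPLACE TABLE `{PROJECT_ID}.titanic_data.titanic_engineered` AS
SELECT *,
       sibsp + parch + 1 AS family_size,  -- siblings/spouses + parents/children + the passenger
       CASE WHEN fare < 15 THEN 'low'
            WHEN fare BETWEEN 15 AND 50 THEN 'mid'
            ELSE 'high' END AS fare_bucket,
       CONCAT(sex, '_', CAST(pclass AS STRING)) AS sex_pclass  -- sex joined to class with '_'
FROM `{TABLE_PATH}`
WHERE age IS NOT NULL;  -- rows with missing age are dropped, not imputed
"""
bq.query(feature_sql).result()  # .result() blocks until the job finishes
print("✅ Feature engineered table created: titanic_engineered")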

✅ Feature engineered table created: titanic_engineered
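
To make the SQL logic concrete, here is a minimal pure-Python sketch of the same three derived columns applied to one hand-made row. The names `fare_bucket` (as a function), `engineer_features`, and the sample passenger are hypothetical; only the column names and bucket boundaries come from the query above. The sketch mirrors two SQL details that are easy to miss: `BETWEEN 15 AND 50` is inclusive at both ends, and a NULL fare fails both comparisons and therefore lands in the `ELSE 'high'` branch.

```python
from typing import Optional


def fare_bucket(fare: Optional[float]) -> str:
    """Mirror the CASE expression: low (< 15), mid (15 to 50 inclusive), else high."""
    if fare is not None and fare < 15:
        return "low"
    if fare is not None and 15 <= fare <= 50:
        return "mid"
    # Reached for fares above 50 and, as in the SQL, for a missing (NULL) fare.
    return "high"


def engineer_features(row: dict) -> dict:
    """Return a copy of `row` with the three engineered columns added."""
    out = dict(row)
    out["family_size"] = row["sibsp"] + row["parch"] + 1  # the passenger counts as 1
    out["fare_bucket"] = fare_bucket(row["fare"])
    out["sex_pclass"] = f"{row['sex']}_{row['pclass']}"
    return out


# Hypothetical passenger, for illustration only.
sample = {"sex": "female", "pclass": 1, "sibsp": 1, "parch": 0, "fare": 71.28}
engineered = engineer_features(sample)
print(engineered["family_size"], engineered["fare_bucket"], engineered["sex_pclass"])
```

For this sample the three new values are `2`, `high`, and `female_1`.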


## Step 2 – Train Model B (Logistic Regression with Engineered Features)

In [None]:
train_sql = f"""
CREATE OR REPLACE MODEL `{PROJECT_ID}.unit2_titanic.clf_survived_B`
OPTIONS(MODEL_TYPE='LOGISTIC_REG', INPUT_LABEL_COLS=['survived']) AS
SELECT survived, pclass, sex, age, fare, embarked,
       family_size, fare_bucket, sex_pclass
FROM `{PROJECT_ID}.titanic_data.titanic_engineered`;
"""
bq.query(train_sql).result()
print("✅ Model B trained successfully: clf_survived_B")

✅ Model B trained successfully: clf_survived_B
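
The training statement above sets no data-split options, so BigQuery ML applies its default split behaviour. If an explicit hold-out is wanted, one option is to request it in `OPTIONS`. The sketch below only builds the SQL string and does not contact BigQuery. The helper name `build_train_sql`, the model name `clf_survived_B_split`, and the 0.2 fraction are illustrative assumptions; `DATA_SPLIT_METHOD` and `DATA_SPLIT_EVAL_FRACTION` are documented `CREATE MODEL` options.

```python
def build_train_sql(project_id: str, model_name: str, eval_fraction: float = 0.2) -> str:
    """Build a CREATE MODEL statement with an explicit random hold-out split."""
    if not 0.0 < eval_fraction < 1.0:
        raise ValueError("eval_fraction must be strictly between 0 and 1")
    return f"""
CREATE OR REPLACE MODEL `{project_id}.unit2_titanic.{model_name}`
OPTIONS(
  MODEL_TYPE='LOGISTIC_REG',
  INPUT_LABEL_COLS=['survived'],
  DATA_SPLIT_METHOD='RANDOM',
  DATA_SPLIT_EVAL_FRACTION={eval_fraction}
) AS
SELECT survived, pclass, sex, age, fare, embarked,
       family_size, fare_bucket, sex_pclass
FROM `{project_id}.titanic_data.titanic_engineered`;
"""


# Hypothetical model name, chosen so the sketch would not overwrite clf_survived_B.
split_sql = build_train_sql("mgmt-467-471119", "clf_survived_B_split")
print(split_sql)
```

Passing `split_sql` to `bq.query(...).result()` would train the variant; a later `ML.EVALUATE` call without an input table would then report metrics on the held-out fraction.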


## Step 3 – Evaluate Model Performance (AUC and Log Loss)

In [None]:
# Note: this scores the full engineered table, which includes the rows the model was trained on.
eval_sql = f"""
SELECT * FROM ML.EVALUATE(
  MODEL `{PROJECT_ID}.unit2_titanic.clf_survived_B`,
  (SELECT survived, pclass, sex, age, fare, embarked,
          family_size, fare_bucket, sex_pclass
   FROM `{PROJECT_ID}.titanic_data.titanic_engineered`)
);
"""
df_eval = bq.query(eval_sql).to_dataframe()
display(df_eval)
print("✅ Model evaluation complete")

| precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|
| 0.814346 | 0.665517 | 0.802521 | 0.732448 | 0.425783 | 0.870831 |


✅ Model evaluation complete
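
As a quick consistency check on the metrics above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the two reported values. The helper name `f1_from` is hypothetical, and because the inputs are already rounded to six decimals the result can only be expected to agree approximately.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the ML.EVALUATE output above.
reported = {"precision": 0.814346, "recall": 0.665517, "f1_score": 0.732448}
f1 = f1_from(reported["precision"], reported["recall"])
print(f"recomputed F1 = {f1:.6f}, reported F1 = {reported['f1_score']:.6f}")
```

The gap between precision (0.81) and recall (0.67) means that at the default threshold the model misses roughly a third of actual survivors while being right about four in five of the passengers it flags as survivors.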


## Step 4 – Confusion Matrix @ 0.5 Threshold

In [None]:
conf_sql = f"""
SELECT
  SUM(CASE WHEN predicted_survived = 1 AND survived = 1 THEN 1 ELSE 0 END) AS TP,
  SUM(CASE WHEN predicted_survived = 1 AND survived = 0 THEN 1 ELSE 0 END) AS FP,
  SUM(CASE WHEN predicted_survived = 0 AND survived = 1 THEN 1 ELSE 0 END) AS FN,
  SUM(CASE WHEN predicted_survived = 0 AND survived = 0 THEN 1 ELSE 0 END) AS TN
FROM ML.PREDICT(
  MODEL `{PROJECT_ID}.unit2_titanic.clf_survived_B`,
  (SELECT survived, pclass, sex, age, fare, embarked,
          family_size, fare_bucket, sex_pclass
   FROM `{PROJECT_ID}.titanic_data.titanic_engineered`)
);
"""
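df_conf = bq.query(conf_sql).to_dataframe()
display(df_conf)  # a bare expression in the middle of a cell is not rendered, so display it explicitly
print("✅ Confusion matrix generated")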

✅ Confusion matrix generated
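
The query above tallies the four outcomes from the `predicted_survived` label that `ML.PREDICT` returns, which for a binary logistic regression corresponds to a 0.5 probability cut-off by default. A minimal pure-Python sketch of the same tally follows, in case it helps to see the thresholding spelled out. The function name `confusion_counts` and the five sample rows are hypothetical, and treating a probability of exactly 0.5 as a positive prediction is an assumption of this sketch.

```python
from typing import Dict, Iterable, Tuple


def confusion_counts(rows: Iterable[Tuple[int, float]], threshold: float = 0.5) -> Dict[str, int]:
    """Tally TP/FP/FN/TN from (actual_label, predicted_probability) pairs."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, prob in rows:
        predicted = 1 if prob >= threshold else 0  # assumed tie rule
        if predicted == 1 and actual == 1:
            counts["TP"] += 1
        elif predicted == 1 and actual == 0:
            counts["FP"] += 1
        elif predicted == 0 and actual == 1:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


# Made-up (survived, probability) pairs, not notebook output.
sample_rows = [(1, 0.91), (1, 0.35), (0, 0.62), (0, 0.08), (1, 0.77)]
counts = confusion_counts(sample_rows)
precision = counts["TP"] / (counts["TP"] + counts["FP"])
recall = counts["TP"] / (counts["TP"] + counts["FN"])
print(counts, round(precision, 3), round(recall, 3))
```

Lowering `threshold` trades false negatives for false positives, which is the lever to consider if recall matters more than precision.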


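## Step 5 – Compare Model A vs Model B (AUC and Log Loss)

In [None]:
# --- Compare Model A vs Model B metrics ---
# Model A is read from a different project and dataset than Model B, so both model paths are fully qualified.
# ML.EVALUATE is called without an input table, so each model is scored on the data BigQuery ML kept from training time.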

compare_sql = """
WITH modelA AS (
  SELECT 'Model_A' AS model_name, roc_auc, log_loss
  FROM ML.EVALUATE(MODEL `mgmt-467-47889.titanic.clf_survived_xform`)
),
modelB AS (
  SELECT 'Model_B' AS model_name, roc_auc, log_loss
  FROM ML.EVALUATE(MODEL `mgmt-467-471119.unit2_titanic.clf_survived_B`)
)
SELECT * FROM modelA
UNION ALL
SELECT * FROM modelB;
"""
df_compare = bq.query(compare_sql).to_dataframe()
display(df_compare)
print("✅ Compared Model A vs Model B (AUC & Log Loss)")

| model_name | roc_auc | log_loss |
|---|---|---|
| Model_A | 0.858077 | 0.43943 |
| Model_B | 0.89117 | 0.407957 |


✅ Compared Model A vs Model B (AUC & Log Loss)
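
The uplift can be stated as absolute and relative changes. The sketch below computes both from the two rows above. The dictionary layout and the `uplift` helper are hypothetical; the numbers are copied from the output.

```python
# Metrics copied from the comparison output above.
metrics = {
    "Model_A": {"roc_auc": 0.858077, "log_loss": 0.43943},
    "Model_B": {"roc_auc": 0.89117, "log_loss": 0.407957},
}


def uplift(metric: str) -> tuple:
    """Return (absolute change, relative change) going from Model A to Model B."""
    a = metrics["Model_A"][metric]
    b = metrics["Model_B"][metric]
    return b - a, (b - a) / a


auc_abs, auc_rel = uplift("roc_auc")
ll_abs, ll_rel = uplift("log_loss")
print(f"AUC:      {auc_abs:+.4f} ({auc_rel:+.1%})")
print(f"Log loss: {ll_abs:+.4f} ({ll_rel:+.1%})")
```

Higher AUC is better and lower log loss is better, so a positive AUC change together with a negative log-loss change both favour Model B.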


## Step 6 – Interpretation (3–5 Lines on Uplift Drivers)
Model B improves on Model A: AUC rises from 0.858 to 0.891 (a gain of 0.033) and log loss falls from 0.439 to 0.408 (a drop of 0.031), indicating better ranking of survivors over non-survivors and more accurate predicted probabilities. These figures come from `ML.EVALUATE` called without an input table; scored on the full engineered table in Step 3, Model B reaches an AUC of 0.871 and a log loss of 0.426, so the size of the uplift depends on the evaluation data.
The engineered features add meaningful structure:

- `family_size` captures group-dependency risk: larger families generally had lower survival odds.

- `fare_bucket` groups fares into low (below 15), mid (15 to 50), and high (above 50) tiers, which damps extreme fare values and gives a stable socioeconomic signal.

- `sex_pclass` reflects the strong interaction between gender and class (female 1st-class passengers had the highest survival, male 3rd-class passengers the lowest).

Together these three features produce a measurable and interpretable uplift in predictive performance with minimal added complexity.
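
For readers less familiar with the two metrics cited above, here is a small pure-Python sketch of how they can be computed from labels and predicted probabilities. It is an illustration on made-up data, not BigQuery ML's implementation: `roc_auc` uses the pairwise-ranking definition (ties count as half), and `log_loss` clips probabilities with an assumed `eps` so that `log(0)` never occurs.

```python
import math
from typing import Sequence


def roc_auc(labels: Sequence[int], probs: Sequence[float]) -> float:
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def log_loss(labels: Sequence[int], probs: Sequence[float], eps: float = 1e-15) -> float:
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


# Made-up labels and probabilities, for illustration only.
y_true = [1, 0, 1, 0, 1]
y_prob = [0.9, 0.2, 0.6, 0.7, 0.8]
print(f"AUC = {roc_auc(y_true, y_prob):.3f}, log loss = {log_loss(y_true, y_prob):.3f}")
```

AUC depends only on how the predictions are ordered, while log loss also penalises confident wrong probabilities, which is why the two metrics are reported together.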

## Step 7 – Reflection

Building Model B showed how a few simple engineered features can improve both predictive performance and interpretability.  
The higher AUC and lower log loss suggest that capturing **family relationships**, **fare tiers**, and **class-gender interactions** yields a more realistic view of survival odds.  
The exercise highlighted the value of feature design over model complexity: three contextually meaningful variables produced a tangible uplift while the model remained a plain logistic regression.  
For the team’s final policy, Model B’s balanced performance and transparent predictors make it a strong global model to benchmark against specialized or fairness-adjusted variants.
