<a href="https://colab.research.google.com/github/ews46167-art/mgmt467-analytics-portfolio/blob/main/Unit2_Lab2_Churn_Modeling_FeatureEngineering_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📊 MGMT 467 - Unit 2 Lab 2: Churn Modeling with BigQueryML + Feature Engineering
**Date:** 2025-10-16

In this lab you will:
- Connect to BigQuery from Colab
- Create features and labels
- Engineer new features from user behavior
- Train and evaluate logistic regression models
- Reflect on modeling assumptions and interpret results

In [None]:
# ✅ Authenticate and set up GCP project
from google.colab import auth
auth.authenticate_user()

project_id = "manifest-chain-471119-t8"  # <-- Replace with your actual project ID
!gcloud config set project $project_id

Updated property [core/project].


In [None]:
%%bigquery --project $project_id
# ✅ Verify BigQuery access (the cell magic must be the first line of the cell)
SELECT CURRENT_DATE() AS today, SESSION_USER() AS user

| | today | user |
|---|---|---|
| 0 | 2025-10-26 | ews46167@gmail.com |


In [None]:
%%bigquery --project $project_id
# ✅ Prepare base churn features
CREATE OR REPLACE TABLE `netflix.churn_features` AS
SELECT
  user_id,
  age,
  subscription_plan,
  monthly_spend,
  primary_device,
  household_size,
  is_active AS churn_label # NOTE: is_active is TRUE for active (retained) users, so churn_label = TRUE means "still active"; use NOT is_active if TRUE should mean "churned"
FROM `netflix.users`
WHERE is_active IS NOT NULL;

In [None]:
%%bigquery --project $project_id
# ✅ Train base logistic regression model
CREATE OR REPLACE MODEL `netflix.churn_model`
OPTIONS(model_type='logistic_reg', input_label_cols=['churn_label']) AS
SELECT
  age,
  subscription_plan,
  monthly_spend,
  primary_device,
  household_size,
  churn_label
FROM `netflix.churn_features`;

In [None]:
%%bigquery --project $project_id
# ✅ Evaluate base model
SELECT *
FROM ML.EVALUATE(MODEL `netflix.churn_model`);

| | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|
| 0 | 0.862376 | 1.0 | 0.862376 | 0.926103 | 0.400673 | 0.510935 |


In [None]:
%%bigquery --project $project_id
# ✅ Predict churn with base model
SELECT
  user_id,
  predicted_churn_label,
  predicted_churn_label_probs
FROM ML.PREDICT(MODEL `netflix.churn_model`,
                (SELECT
                   user_id,
                   age,
                   subscription_plan,
                   monthly_spend,
                   primary_device,
                   household_size
                 FROM `netflix.churn_features`));

| | user_id | predicted_churn_label | predicted_churn_label_probs |
|---|---|---|---|
| 0 | user_00008 | True | `[{'label': True, 'prob': 0.858414565124976}, {...` |
| 1 | user_00008 | True | `[{'label': True, 'prob': 0.858414565124976}, {...` |
| 2 | user_00024 | True | `[{'label': True, 'prob': 0.8554535084294278}, ...` |
| 3 | user_00024 | True | `[{'label': True, 'prob': 0.8554535084294278}, ...` |
| 4 | user_00028 | True | `[{'label': True, 'prob': 0.8619933905165235}, ...` |
| ... | ... | ... | ... |
| 20595 | user_02460 | True | `[{'label': True, 'prob': 0.8710831967358763}, ...` |
| 20596 | user_08221 | True | `[{'label': True, 'prob': 0.8652520367827865}, ...` |
| 20597 | user_08221 | True | `[{'label': True, 'prob': 0.8652520367827865}, ...` |
| 20598 | user_09179 | True | `[{'label': True, 'prob': 0.876235430797202}, {...` |
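
Each `predicted_churn_label_probs` cell holds a list of label/probability pairs, so the probability for one label has to be picked out of that list. The following is a minimal Python sketch of one way to do that; the helper name `prob_for_label` is hypothetical, and the sample cell contains only the `True` entry because the remaining entries are truncated in the output above.

```python
def prob_for_label(probs, label=True):
    """Return the probability assigned to `label`, or None if it is absent.

    `probs` is assumed to look like one cell of predicted_churn_label_probs:
    a sequence of {'label': ..., 'prob': ...} mappings.
    """
    for entry in probs:
        if entry["label"] == label:
            return entry["prob"]
    return None


# Only the True entry is visible in the truncated output, so only that
# entry is reproduced here; the rest of the list is deliberately left out.
sample = [{"label": True, "prob": 0.858414565124976}]

p_true = prob_for_label(sample, True)
p_missing = prob_for_label(sample, False)  # not in the sample, so None
print(p_true, p_missing)
```

Assuming the query result is loaded into a pandas DataFrame `df`, something like `df["predicted_churn_label_probs"].apply(prob_for_label)` would give one probability column to sort or threshold on.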



## 🛠️ Feature Engineering Section

We will now engineer new features to improve model performance:

- Bucket continuous variables
- Create interaction terms
- Add behavioral flags


In [None]:
%%bigquery --project $project_id
# ✅ Create enhanced feature set
CREATE OR REPLACE TABLE `netflix.churn_features_enhanced` AS
SELECT
  user_id,
  age,
  subscription_plan,
  monthly_spend,
  primary_device,
  household_size,
  churn_label,
  CASE
    WHEN age < 25 THEN 'young'
    WHEN age BETWEEN 25 AND 40 THEN 'adult'
    ELSE 'senior'
  END AS age_bucket,
  CASE
    WHEN monthly_spend < 10 THEN 'low_spend'
    WHEN monthly_spend BETWEEN 10 AND 20 THEN 'medium_spend'
    ELSE 'high_spend'
  END AS monthly_spend_bucket,
  CONCAT(subscription_plan, '_', primary_device) AS plan_device_combo,
  IF(household_size > 2, 1, 0) AS flag_large_household # Example of a binary flag
FROM `netflix.churn_features`;

In [None]:
%%bigquery --project $project_id
# ✅ Train enhanced model
CREATE OR REPLACE MODEL `netflix.churn_model_enhanced`
OPTIONS(model_type='logistic_reg', input_label_cols=['churn_label']) AS
SELECT
  age,
  subscription_plan,
  monthly_spend,
  primary_device,
  household_size,
  age_bucket,
  monthly_spend_bucket,
  plan_device_combo,
  flag_large_household,
  churn_label
FROM `netflix.churn_features_enhanced`;

In [None]:
%%bigquery --project $project_id
# ✅ Evaluate enhanced model
SELECT *
FROM ML.EVALUATE(MODEL `netflix.churn_model_enhanced`);

| | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|
| 0 | 0.849532 | 1.0 | 0.849532 | 0.918645 | 0.423422 | 0.520218 |



## 🤔 Chain-of-Thought Prompts: Feature Engineering

### 1. Why bucket continuous values like watch time?
- What patterns become clearer by using categories like "low", "medium", "high"?

### 2. What value do interaction terms (e.g., `plan_tier_region`) add?
- Could some plans behave differently in different regions?

### 3. What’s the purpose of binary flags like `flag_binge`?
- Can these capture unique behaviors not reflected in raw totals?

### 4. After evaluating the enhanced model:
- Which new features helped the most?
- Did any surprise you?

✍️ Write your responses in a text cell below or in a shared doc for discussion.


## Answers to Feature Engineering Questions

### 1. Why bucket continuous values like watch time?
The feature table in this lab has no watch-time column, so the same idea is applied to `age` and `monthly_spend`. Bucketing them into categories (`young`, `adult`, `senior` and `low_spend`, `medium_spend`, `high_spend`) can help:
- **Reduce granularity:** A bucket replaces many distinct values with a handful of levels. Note that the enhanced model keeps the raw columns alongside the buckets, so here the buckets add parameters rather than remove them.
- **Capture non-linear relationships:** Logistic regression fits a single weight to each numeric feature, so it can only express a steadily rising or falling effect. One weight per bucket lets the effect differ by range.
- **Handle outliers:** An extreme value lands in the top or bottom bucket instead of pulling on the fitted weight.
- **Improve interpretability:** "Senior users" or "high spenders" are often easier to discuss than a coefficient on a raw number.
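
To make the bucket boundaries concrete, here is a small Python sketch that mirrors the two `CASE` expressions from the enhanced feature query. The function names simply reuse the column names, the test values are made up, and the handling of missing values reflects how the SQL `ELSE` branch behaves rather than anything the lab states explicitly.

```python
def age_bucket(age):
    """Mirror the CASE expression that builds age_bucket."""
    if age is not None and age < 25:
        return "young"
    if age is not None and 25 <= age <= 40:  # BETWEEN is inclusive on both ends
        return "adult"
    return "senior"  # ELSE branch: also catches a missing age, as in SQL


def monthly_spend_bucket(spend):
    """Mirror the CASE expression that builds monthly_spend_bucket."""
    if spend is not None and spend < 10:
        return "low_spend"
    if spend is not None and 10 <= spend <= 20:
        return "medium_spend"
    return "high_spend"  # ELSE branch: also catches a missing spend


# Made-up values chosen to sit on and around the boundaries.
for age in (24, 25, 40, 41, None):
    print(age, age_bucket(age))
```

One thing this makes visible: a row with a missing `age` or `monthly_spend` falls through to `senior` or `high_spend`, which may not be the intended grouping.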

### 2. What value do interaction terms (e.g., `plan_tier_region`) add?
This lab's interaction term is `plan_device_combo`, which joins the subscription plan and primary device into one category. Interaction terms like this add value by:
- **Capturing synergistic or antagonistic effects:** They let the model learn that the effect of one feature depends on the value of another. For example, a certain subscription plan might be more likely to churn *only* when used on a specific device.
- **Increasing model complexity (when needed):** They let the model fit relationships that cannot be captured by treating the features independently. The cost is one extra weight per combination, so rare combinations are estimated from very few rows.
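
As a rough illustration of how a crossed category behaves, the Python sketch below mirrors the `CONCAT` expression and counts rows per combination. The plan and device values are invented for the example, and `plan_device_combo` as a function name is just a convenience.

```python
from collections import Counter


def plan_device_combo(plan, device):
    """Mirror CONCAT(subscription_plan, '_', primary_device).

    BigQuery's CONCAT returns NULL when any argument is NULL, so None is
    returned here if either part is missing.
    """
    if plan is None or device is None:
        return None
    return f"{plan}_{device}"


# Hypothetical rows; these plan and device values are made up.
rows = [
    ("Basic", "Mobile"),
    ("Basic", "TV"),
    ("Premium", "Mobile"),
    ("Premium", "TV"),
    ("Premium", "TV"),
]

combo_counts = Counter(plan_device_combo(p, d) for p, d in rows)
print(combo_counts)
```

Counting rows per combination is a quick way to spot crosses that are too sparse to support a reliable weight.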

### 3. What’s the purpose of binary flags like `flag_binge`?
This lab has no binge-watching data, so the flag used here is `flag_large_household` (1 when household size is greater than 2). Binary flags like this serve to:
- **Highlight specific conditions:** They explicitly mark rows that meet a criterion that might be particularly relevant to the target variable.
- **Capture thresholds or specific behaviors:** They can represent a step change that a single weight on the raw value cannot. For example, churn might shift noticeably once a household exceeds a certain size.
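
A short Python sketch of the same flag, assuming the threshold of 2 from the feature query. The `threshold` parameter is an illustrative addition, and the sample sizes are made up.

```python
def flag_large_household(household_size, threshold=2):
    """Mirror IF(household_size > 2, 1, 0).

    In SQL a NULL comparison is not TRUE, so IF returns the else value;
    a missing size therefore yields 0 here as well.
    """
    if household_size is None:
        return 0
    return 1 if household_size > threshold else 0


sizes = [1, 2, 3, 5, None]
flags = [flag_large_household(s) for s in sizes]
print(flags)  # [0, 0, 1, 1, 0]
```

Because the flag is a deterministic function of `household_size`, it adds a step at the threshold rather than any new information about the user.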

### 4. After evaluating the enhanced model:
Based on the `ML.EVALUATE` results for the base model and the enhanced model shown earlier in this notebook:

- **Base Model Evaluation:**
  - precision: 0.862376
  - recall: 1.0
  - accuracy: 0.862376
  - f1_score: 0.926103
  - log_loss: 0.400673
  - roc_auc: 0.510935

- **Enhanced Model Evaluation:**
  - precision: 0.849532
  - recall: 1.0
  - accuracy: 0.849532
  - f1_score: 0.918645
  - log_loss: 0.423422
  - roc_auc: 0.520218

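Comparing the two:
- The **Enhanced Model** has a marginally higher `roc_auc` (0.520218 vs 0.510935). ROC AUC measures how well the model ranks positive rows above negative ones, and 0.5 is chance level. Both models are therefore barely better than random ranking, and a gap of about 0.009 is too small to treat as a real improvement.
- In both models recall is 1.0 and precision equals accuracy. That combination occurs only when the model predicts the positive class for every evaluation row at the default threshold, so precision and accuracy simply report the share of positive rows (about 86% and 85%). Because `churn_label` is a renamed copy of `is_active`, the positive class here is "active", not "churned".
- Precision, accuracy, f1_score, and log_loss are slightly worse for the Enhanced Model. `ML.EVALUATE` was called without an explicit evaluation table, so each model was probably scored on its own held-out split, and small differences between the two rows may reflect the split rather than the features.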
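
Both evaluation rows report a recall of 1.0 with precision equal to accuracy, which is what a classifier produces when it labels every row positive. The Python sketch below works through that arithmetic: given only the share of positive rows, it reproduces the reported precision, accuracy, and F1. The function name `all_positive_metrics` is hypothetical, and the inputs are the precision values from the two evaluation outputs.

```python
def all_positive_metrics(positive_rate):
    """Metrics for a classifier that predicts the positive class for every row.

    `positive_rate` is the share of truly positive rows in the evaluation data.
    Counts are expressed as fractions of the evaluation set.
    """
    tp = positive_rate        # every positive row is predicted positive
    fp = 1.0 - positive_rate  # every negative row is also predicted positive
    fn = 0.0                  # nothing is ever predicted negative
    tn = 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1_score": f1,
    }


base = all_positive_metrics(0.862376)      # base model's reported precision
enhanced = all_positive_metrics(0.849532)  # enhanced model's reported precision
print(base)
print(enhanced)
```

The computed F1 values agree with the reported 0.926103 and 0.918645 to within rounding, which supports reading the headline metrics as the class balance of each evaluation split rather than as evidence of predictive skill.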

**Which new features helped the most?**
It's difficult to say which *individual* new feature helped the most from aggregate evaluation metrics alone. With ROC AUC close to 0.5 for both models, there is also little evidence that the engineered features, taken together, added meaningful signal for separating the two classes. Pinpointing any contribution would require inspecting the model itself, for example per-feature weights from `ML.WEIGHTS`, or global feature attributions from `ML.GLOBAL_EXPLAIN` if the model is trained with `enable_global_explain=TRUE`.
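
One way to look at feature weights is BigQuery ML's `ML.WEIGHTS` function, which returns the learned weight for each numeric input and per-category weights for categorical inputs. The Python sketch below only builds the query text and, optionally, runs it; the helper names `weights_query` and `fetch_weights` are hypothetical, and running the query assumes the `google-cloud-bigquery` package and valid credentials.

```python
def weights_query(model_name):
    """Build a query that lists the learned weights of a BigQuery ML model."""
    return f"SELECT * FROM ML.WEIGHTS(MODEL `{model_name}`)"


def fetch_weights(model_name, project):
    """Run the weights query and return the result as a DataFrame.

    Needs google-cloud-bigquery and credentials, so the import is done
    lazily to keep weights_query usable without them.
    """
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    return client.query(weights_query(model_name)).to_dataframe()


sql = weights_query("netflix.churn_model_enhanced")
print(sql)
```

In the notebook the same statement could equally be pasted into a `%%bigquery` cell; the Python wrapper is only useful if the weights need further processing.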

**Did any surprise you?**
Without further analysis of individual feature contributions, it's hard to say whether any specific feature was surprisingly impactful. The overall change was small. All four engineered features are derived from columns the base model already used, which may explain why they did not materially change performance on these high-level metrics.