<a href="https://colab.research.google.com/github/elebon26/mgmt467-analytics-portfolio/blob/main/Labs/Unit2/(Ethan_Lebon_Completed)_Unit2_Lab2_PromptStudio_Tasks5onwards.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 🤖 MGMT 467 - Unit 2 Lab 2: Prompt Studio — Feature Engineering & Beyond

**Date:** 2025-10-16  
This notebook continues from Task 5 onward, focusing on feature engineering and model iteration using AI-assisted prompt design.

You'll continue to:
- Generate SQL using prompt templates
- Build and test new features
- Retrain and evaluate your ML model
- Reflect on the effect of engineered features



## Task 5.0: Bucket a Continuous Feature

**🎯 Goal:** Group 'total_minutes' into categories: low, medium, high.  
**📌 Requirements:** Use CASE WHEN or IF statements to create 'watch_time_bucket'.

---

### 🧠 Prompt Template  
> Write SQL that creates a new column watch_time_bucket based on total_minutes thresholds (<100, 100–300, >300).

---

### 👩‍🏫 Example Prompt  
> Create a new column watch_time_bucket with values 'low', 'medium', or 'high' based on total_minutes.

---

### 🔍 Exploration  
How does churn rate vary across these buckets?


In [3]:
!pip -q install google-cloud-bigquery
from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery

PROJECT_ID = "mgmt-467-1234"
DATASET    = "netflix"
SRC_TABLE  = f"{PROJECT_ID}.{DATASET}.churn_features"
DST_TABLE  = f"{PROJECT_ID}.{DATASET}.churn_features_v2"

bq = bigquery.Client(project=PROJECT_ID)

# 1) Create a new features table with the bucketed column
create_sql = f"""
CREATE OR REPLACE TABLE `{DST_TABLE}` AS
SELECT
  /* keep original fields */
  user_id,
  region,
  plan_tier,
  age_band,
  total_minutes,
  churn_label,
  /* new bucket feature */
  CASE
    WHEN total_minutes IS NULL         THEN 'unknown'
    WHEN total_minutes < 100           THEN 'low'
    WHEN total_minutes BETWEEN 100 AND 300 THEN 'medium'
    ELSE 'high'
  END AS watch_time_bucket
FROM `{SRC_TABLE}`;
"""

print("🚀 Creating table with bucketed feature:", DST_TABLE)
bq.query(create_sql).result()
print("✅ Created:", DST_TABLE)

# 2) Quick exploration: churn rate & counts by bucket
explore_sql = f"""
SELECT
  watch_time_bucket,
  COUNT(*) AS users,
  AVG(CAST(churn_label AS FLOAT64)) AS churn_rate
FROM `{DST_TABLE}`
GROUP BY watch_time_bucket
ORDER BY
  CASE watch_time_bucket
    WHEN 'unknown' THEN 0
    WHEN 'low' THEN 1
    WHEN 'medium' THEN 2
    WHEN 'high' THEN 3
    ELSE 4
  END;
"""
print("📊 Churn rate by watch_time_bucket")
explore_df = bq.query(explore_sql).result().to_dataframe()
explore_df

🚀 Creating table with bucketed feature: mgmt-467-1234.netflix.churn_features_v2
✅ Created: mgmt-467-1234.netflix.churn_features_v2
📊 Churn rate by watch_time_bucket


| | watch_time_bucket | users | churn_rate |
|---|---|---|---|
| 0 | low | 403004 | 0.659666 |
| 1 | medium | 209116 | 0.656994 |
| 2 | high | 335480 | 0.66113 |


* * *
**🔍 Exploration: Short Answer**

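Churn rate is nearly identical across the low, medium, and high watch time buckets: it ranges from 0.657 (medium) to 0.661 (high), a spread of about 0.4 percentage points. On its own, watch_time_bucket therefore separates churners from non-churners very weakly.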
* * *
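
To put a number on "very similar", the bucket rows can be summarized directly. This is a minimal sketch that only re-uses the figures printed in the output table above (it does not query BigQuery); the variable names are illustrative.

```python
# Bucket-level results copied from the output table above: (users, churn_rate).
buckets = {
    "low": (403004, 0.659666),
    "medium": (209116, 0.656994),
    "high": (335480, 0.661130),
}

total_users = sum(users for users, _ in buckets.values())

# The user-weighted average of the bucket rates recovers the overall churn rate.
overall_rate = sum(users * rate for users, rate in buckets.values()) / total_users

# Gap between the highest and lowest bucket, in percentage points.
rates = [rate for _, rate in buckets.values()]
spread_pp = (max(rates) - min(rates)) * 100

print(f"users: {total_users}")
print(f"overall churn rate: {overall_rate:.4f}")
print(f"spread across buckets: {spread_pp:.2f} percentage points")
```

A spread this small relative to the overall rate suggests the bucket boundaries are not capturing a churn-relevant difference in this dataset.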


## Task 5.1: Create a Binary Flag Feature

**🎯 Goal:** Add a binary column flag_binge (1 if total_minutes > 500).  
**📌 Requirements:** Use IF logic to create a binary column in SQL.

---

### 🧠 Prompt Template  
> Write a SQL query that adds flag_binge = 1 if total_minutes > 500, else 0.

---

### 👩‍🏫 Example Prompt  
> Add a binary column flag_binge to identify binge-watchers.

---

### 🔍 Exploration  
Are binge-watchers more or less likely to churn?


In [4]:
from google.cloud import bigquery

PROJECT_ID = "mgmt-467-1234"
DATASET    = "netflix"
SRC_TABLE  = f"{PROJECT_ID}.{DATASET}.churn_features_v2"
DST_TABLE  = f"{PROJECT_ID}.{DATASET}.churn_features_v3"

bq = bigquery.Client(project=PROJECT_ID)

# 1️⃣ Create table with binary flag feature
create_sql = f"""
CREATE OR REPLACE TABLE `{DST_TABLE}` AS
SELECT
  *,
  IF(total_minutes > 500, 1, 0) AS flag_binge
FROM `{SRC_TABLE}`;
"""

print("🚀 Creating table with binary flag feature:", DST_TABLE)
bq.query(create_sql).result()
print("✅ Created:", DST_TABLE)

# 2️⃣ Exploration: churn rate by binge flag
explore_sql = f"""
SELECT
  flag_binge,
  COUNT(*) AS users,
  AVG(CAST(churn_label AS FLOAT64)) AS churn_rate
FROM `{DST_TABLE}`
GROUP BY flag_binge
ORDER BY flag_binge;
"""

print("📊 Churn rate by binge-watcher flag")
explore_df = bq.query(explore_sql).result().to_dataframe()
explore_df


🚀 Creating table with binary flag feature: mgmt-467-1234.netflix.churn_features_v3
✅ Created: mgmt-467-1234.netflix.churn_features_v3
📊 Churn rate by binge-watcher flag


| | flag_binge | users | churn_rate |
|---|---|---|---|
| 0 | 0 | 756340 | 0.658952 |
| 1 | 1 | 191260 | 0.662135 |


* * *
**🔍 Exploration:**


Binge-watchers (flag_binge = 1) churn at a slightly higher rate (0.662135) than non-binge-watchers (flag_binge = 0, 0.658952). The gap is about 0.3 percentage points, so the flag carries little practical signal.

* * *
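
With groups this large, even a gap of roughly 0.3 percentage points can register as statistically detectable, which is a separate question from whether it is useful for prediction. The sketch below runs a standard two-proportion z-test on the figures from the output table above. It assumes the rows are independent users and uses only the Python standard library; it is an illustration, not part of the lab's required workflow.

```python
import math

# Figures copied from the output table above.
n_binge, rate_binge = 191260, 0.662135
n_other, rate_other = 756340, 0.658952

diff = rate_binge - rate_other

# Pooled churn rate under the null hypothesis that both groups share one rate.
pooled = (n_binge * rate_binge + n_other * rate_other) / (n_binge + n_other)

# Standard error of the difference between two proportions.
se = math.sqrt(pooled * (1 - pooled) * (1 / n_binge + 1 / n_other))
z = diff / se

# Two-sided p-value from the normal distribution.
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"difference: {diff * 100:.2f} percentage points")
print(f"z = {z:.2f}, two-sided p = {p_value:.4f}")
```

Reading the z-score alongside the size of the difference helps separate "detectable" from "meaningful" when sample sizes are in the hundreds of thousands.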


## Task 5.2: Create an Interaction Term

**🎯 Goal:** Create plan_region_combo by combining plan_tier and region.  
**📌 Requirements:** Use CONCAT or STRING functions.

---

### 🧠 Prompt Template  
> Generate SQL to create a new column by combining plan_tier and region with an underscore.

---

### 👩‍🏫 Example Prompt  
> Create a column called plan_region_combo as CONCAT(plan_tier, '_', region).

---

### 🔍 Exploration  
Which plan-region combos have highest churn?


In [5]:
from google.cloud import bigquery

PROJECT_ID = "mgmt-467-1234"
DATASET    = "netflix"
SRC_TABLE  = f"{PROJECT_ID}.{DATASET}.churn_features_v3"
DST_TABLE  = f"{PROJECT_ID}.{DATASET}.churn_features_v4"

bq = bigquery.Client(project=PROJECT_ID)

# 1️⃣ Create table with interaction feature
create_sql = f"""
CREATE OR REPLACE TABLE `{DST_TABLE}` AS
SELECT
  *,
  CONCAT(plan_tier, '_', region) AS plan_region_combo
FROM `{SRC_TABLE}`;
"""

print("🚀 Creating table with interaction term:", DST_TABLE)
bq.query(create_sql).result()
print("✅ Created:", DST_TABLE)

# 2️⃣ Exploration: churn rate by plan-region combo
explore_sql = f"""
SELECT
  plan_region_combo,
  COUNT(*) AS users,
  AVG(CAST(churn_label AS FLOAT64)) AS churn_rate
FROM `{DST_TABLE}`
GROUP BY plan_region_combo
ORDER BY churn_rate DESC
LIMIT 15;
"""

print("📊 Top 15 plan-region combos by churn rate")
explore_df = bq.query(explore_sql).result().to_dataframe()
explore_df


🚀 Creating table with interaction term: mgmt-467-1234.netflix.churn_features_v4
✅ Created: mgmt-467-1234.netflix.churn_features_v4
📊 Top 15 plan-region combos by churn rate


| | plan_region_combo | users | churn_rate |
|---|---|---|---|
| 0 | Standard_Canada | 101292 | 0.66489 |
| 1 | Premium+_Canada | 25852 | 0.664243 |
| 2 | Basic_Canada | 56488 | 0.661804 |
| 3 | Premium_Canada | 101200 | 0.661779 |
| 4 | Premium_USA | 231748 | 0.659268 |
| 5 | Basic_USA | 129352 | 0.658451 |
| 6 | Premium+_USA | 69460 | 0.658221 |
| 7 | Standard_USA | 232208 | 0.656653 |
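
The query asks for the top 15 combos, but only eight exist, and all four Canada combos rank above all four USA combos. One way to check whether the pattern comes from region rather than from specific plan-region pairs is to roll the combos back up by region. The sketch below does that with the figures printed above; splitting on the last underscore is an assumption that happens to fit these labels.

```python
from collections import defaultdict

# (plan_region_combo, users, churn_rate) copied from the output table above.
combos = [
    ("Standard_Canada", 101292, 0.664890),
    ("Premium+_Canada", 25852, 0.664243),
    ("Basic_Canada", 56488, 0.661804),
    ("Premium_Canada", 101200, 0.661779),
    ("Premium_USA", 231748, 0.659268),
    ("Basic_USA", 129352, 0.658451),
    ("Premium+_USA", 69460, 0.658221),
    ("Standard_USA", 232208, 0.656653),
]

# region -> [users, estimated churners]
totals = defaultdict(lambda: [0, 0.0])
for combo, users, rate in combos:
    # Assumption: the region is everything after the last underscore.
    _plan_tier, region = combo.rsplit("_", 1)
    totals[region][0] += users
    totals[region][1] += users * rate

region_rates = {
    region: churners / users for region, (users, churners) in totals.items()
}

for region, rate in sorted(region_rates.items(), key=lambda kv: -kv[1]):
    print(f"{region}: {rate:.4f} ({totals[region][0]} users)")
```

If the regional rates differ by about as much as the combo rates do, the interaction term is mostly restating the region effect.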



## Task 5.3: Add Missingness Indicator Flags

**🎯 Goal:** Add binary flags to capture NULL values in age_band and avg_rating.  
**📌 Requirements:** Use IS NULL logic to create new flag columns.

---

### 🧠 Prompt Template  
> Create a new column is_missing_[col_name] that is 1 when column is NULL, else 0.

---

### 👩‍🏫 Example Prompt  
> Add is_missing_age that flags rows where age_band IS NULL.

---

### 🔍 Exploration  
Do missing values correlate with churn?


In [7]:
from google.cloud import bigquery

PROJECT_ID = "mgmt-467-1234"
DATASET    = "netflix"
SRC_TABLE  = f"{PROJECT_ID}.{DATASET}.churn_features_v4"
DST_TABLE  = f"{PROJECT_ID}.{DATASET}.churn_features_v5"

bq = bigquery.Client(project=PROJECT_ID)

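# 1️⃣ Create table with missingness flags
# Note: only age_band is flagged. avg_rating is not among the columns selected
# into churn_features_v2 in Task 5.0, so it is not available in this table.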
create_sql = f"""
CREATE OR REPLACE TABLE `{DST_TABLE}` AS
SELECT
  *,
  IF(age_band IS NULL, 1, 0) AS is_missing_age_band
FROM `{SRC_TABLE}`;
"""

print("🚀 Creating table with missingness indicator flags:", DST_TABLE)
bq.query(create_sql).result()
print("✅ Created:", DST_TABLE)

# 2️⃣ Exploration: churn rate by missingness flags
explore_sql = f"""
SELECT
  is_missing_age_band,
  COUNT(*) AS users,
  AVG(CAST(churn_label AS FLOAT64)) AS churn_rate
FROM `{DST_TABLE}`
GROUP BY is_missing_age_band
ORDER BY churn_rate DESC;
"""

print("📊 Churn rate by missingness flags")
explore_df = bq.query(explore_sql).result().to_dataframe()
explore_df

🚀 Creating table with missingness indicator flags: mgmt-467-1234.netflix.churn_features_v5
✅ Created: mgmt-467-1234.netflix.churn_features_v5
📊 Churn rate by missingness flags


| | is_missing_age_band | users | churn_rate |
|---|---|---|---|
| 0 | 0 | 947600 | 0.659595 |


* * *
**🔍 Exploration: Short Answer**

All 947,600 rows have is_missing_age_band = 0, so age_band has no missing values and this data cannot show whether missingness in age_band correlates with churn. avg_rating was not flagged because it is not among the columns carried into the feature tables.
* * *
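
The prompt template describes a pattern, is_missing_[col_name], that can be generated for any list of columns. The sketch below builds that SQL as a string. The helper names are hypothetical, and avg_rating is included only to illustrate the pattern: it would first have to be added to the source table, which the tables built in this notebook do not do.

```python
def missing_flag_sql(columns):
    """Return one IF(col IS NULL, 1, 0) AS is_missing_col expression per column."""
    return ",\n  ".join(
        f"IF({col} IS NULL, 1, 0) AS is_missing_{col}" for col in columns
    )


def build_flag_query(src_table, dst_table, columns):
    """Build a CREATE TABLE statement that appends the missingness flags."""
    return (
        f"CREATE OR REPLACE TABLE `{dst_table}` AS\n"
        f"SELECT\n  *,\n  {missing_flag_sql(columns)}\n"
        f"FROM `{src_table}`;"
    )


sql = build_flag_query(
    "mgmt-467-1234.netflix.churn_features_v4",
    "mgmt-467-1234.netflix.churn_features_v5",
    # Assumption: avg_rating exists in the source table (it does not here).
    ["age_band", "avg_rating"],
)
print(sql)
```

The printed statement could then be passed to `bq.query(sql).result()` in the same way as the other cells.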


## Task 5.5: Assemble Enhanced Feature Table

**🎯 Goal:** Create churn_features_enhanced with all engineered columns.  
**📌 Requirements:** Include all prior features + engineered columns.

---

### 🧠 Prompt Template  
> Generate SQL to create churn_features_enhanced with new columns: watch_time_bucket, plan_region_combo, flag_binge, etc.

---

### 👩‍🏫 Example Prompt  
> Build a new table churn_features_enhanced with all original features + engineered ones.

---

### 🔍 Exploration  
Are row counts stable? Any NULLs introduced?


In [17]:
from google.cloud import bigquery

PROJECT_ID = "mgmt-467-1234"
DATASET    = "netflix"
SRC_TABLE  = f"{PROJECT_ID}.{DATASET}.churn_features_v5"
DST_TABLE  = f"{PROJECT_ID}.{DATASET}.churn_features_enhanced"

bq = bigquery.Client(project=PROJECT_ID)

create_sql = f"""
CREATE OR REPLACE TABLE `{DST_TABLE}` AS
SELECT
  user_id,
  region,
  plan_tier,
  age_band,
  total_minutes,
  churn_label,
  -- engineered features
  watch_time_bucket,
  flag_binge,
  plan_region_combo,
  is_missing_age_band
FROM `{SRC_TABLE}`;
"""

print("🚀 Creating enhanced feature table:", DST_TABLE)
bq.query(create_sql).result()
print("✅ Created:", DST_TABLE)

# 2️⃣ Sanity check: confirm row count and nulls
check_sql = f"""
SELECT
  COUNT(*) AS total_rows,
  SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) AS null_user_id,
  SUM(CASE WHEN churn_label IS NULL THEN 1 ELSE 0 END) AS null_churn_label
FROM `{DST_TABLE}`;
"""
bq.query(check_sql).result().to_dataframe()


🚀 Creating enhanced feature table: mgmt-467-1234.netflix.churn_features_enhanced
✅ Created: mgmt-467-1234.netflix.churn_features_enhanced


| | total_rows | null_user_id | null_churn_label |
|---|---|---|---|
| 0 | 947600 | 0 | 0 |


* * *
**🔍 Exploration:**

The churn_features_enhanced table has 947,600 rows, the same count reported for churn_features_v5 in Task 5.3, so row counts are stable. user_id and churn_label contain no NULLs. The check does not cover the remaining columns, so NULLs in the engineered features are not ruled out by this query alone.
* * *
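
The sanity check above covers two columns. A broader version can count NULLs in every column of the enhanced table. The sketch below generates such a query using BigQuery's `COUNTIF`; the column list mirrors the SELECT in the cell above, and the helper name is hypothetical.

```python
# Columns selected into churn_features_enhanced in the cell above.
COLUMNS = [
    "user_id", "region", "plan_tier", "age_band", "total_minutes",
    "churn_label", "watch_time_bucket", "flag_binge",
    "plan_region_combo", "is_missing_age_band",
]


def null_check_sql(table, columns):
    """Build a query returning the row count plus one NULL count per column."""
    counts = ",\n  ".join(
        f"COUNTIF({col} IS NULL) AS null_{col}" for col in columns
    )
    return f"SELECT\n  COUNT(*) AS total_rows,\n  {counts}\nFROM `{table}`;"


check_all_sql = null_check_sql(
    "mgmt-467-1234.netflix.churn_features_enhanced", COLUMNS
)
print(check_all_sql)
```

A non-zero null_plan_region_combo would be worth watching for in particular, since `CONCAT` in BigQuery returns NULL when any argument is NULL.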


## Task 6: Retrain Model on Engineered Features

**🎯 Goal:** Train a logistic regression model using churn_features_enhanced.  
**📌 Requirements:** Use BQML logistic_reg model with new feature columns.

---

### 🧠 Prompt Template  
> Write CREATE MODEL SQL using enhanced features including flags and buckets.

---

### 👩‍🏫 Example Prompt  
> Retrain churn_model_enhanced using watch_time_bucket, flag_binge, plan_region_combo.

---

### 🔍 Exploration  
Does model accuracy improve?


In [19]:
# Retrain logistic regression on enhanced features (fixed options)
from google.cloud import bigquery

PROJECT_ID = "mgmt-467-1234"
DATASET    = "netflix"
SRC_TABLE  = f"{PROJECT_ID}.{DATASET}.churn_features_enhanced"
MODEL_NAME = f"{PROJECT_ID}.{DATASET}.churn_model_enhanced"

bq = bigquery.Client(project=PROJECT_ID)

train_sql = f"""
CREATE OR REPLACE MODEL `{MODEL_NAME}`
OPTIONS(
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['churn_label'],
  auto_class_weights = TRUE,
  data_split_method = 'AUTO_SPLIT',
  max_iterations = 50,
  enable_global_explain = TRUE
) AS
SELECT
  region,
  plan_tier,
  age_band,
  total_minutes,
  watch_time_bucket,
  flag_binge,
  plan_region_combo,
  is_missing_age_band,
  churn_label
FROM `{SRC_TABLE}`;
"""

print("🚀 Training enhanced churn model …")
bq.query(train_sql).result()
print("✅ Model trained:", MODEL_NAME)

# Evaluate
eval_sql = f"SELECT * FROM ML.EVALUATE(MODEL `{MODEL_NAME}`);"
eval_df = bq.query(eval_sql).result().to_dataframe()
print("📊 Evaluation:")
eval_df


🚀 Training enhanced churn model …
✅ Model trained: mgmt-467-1234.netflix.churn_model_enhanced
📊 Evaluation:


| | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|
| 0 | 0.668545 | 0.468978 | 0.495478 | 0.551256 | 0.693116 | 0.511066 |


* * *
**🔍 Exploration: Short Answer**

The enhanced model reaches an accuracy of 0.495 and a ROC AUC of 0.511, both close to chance level, so the engineered features do not appear to have meaningfully improved the model.
* * *
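
Two quick checks help put these metrics in context: recomputing F1 from precision and recall, and comparing accuracy with a majority-class baseline. The sketch below uses the evaluation row above and the overall churn rate from Task 5.3. It assumes the evaluation split has roughly the same churn rate as the full table, which the notebook does not verify.

```python
# Metrics copied from the ML.EVALUATE output above.
precision, recall, accuracy = 0.668545, 0.468978, 0.495478

# Overall churn rate from the Task 5.3 output (full table).
# Assumption: the evaluation split has a similar class balance.
overall_churn_rate = 0.659595

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Accuracy obtained by always predicting the more common class.
majority_baseline = max(overall_churn_rate, 1 - overall_churn_rate)
beats_baseline = accuracy > majority_baseline

print(f"recomputed F1: {f1:.6f}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
print(f"model beats baseline: {beats_baseline}")
```

Because the model is trained with `auto_class_weights = TRUE`, it is not optimized for raw accuracy, so ROC AUC is probably the fairer metric to compare across models.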


## Task 7: Compare Model Performance

**🎯 Goal:** Compare base model vs enhanced model using ML.EVALUATE.  
**📌 Requirements:** Use same evaluation query for both models.

---

### 🧠 Prompt Template  
> Write a SQL query to evaluate churn_model_enhanced and compare with churn_model.

---

### 👩‍🏫 Example Prompt  
> Compare ML.EVALUATE output from both models side-by-side.

---

### 🔍 Exploration  
Which features made the most difference?


In [25]:
from google.cloud import bigquery
import pandas as pd

PROJECT_ID = "mgmt-467-1234"
DATASET    = "netflix"
BASE_MODEL = f"{PROJECT_ID}.{DATASET}.churn_logreg"
ENH_MODEL  = f"{PROJECT_ID}.{DATASET}.churn_model_enhanced"

bq = bigquery.Client(project=PROJECT_ID)

def get_top_weights(model_name, top_k=10):
    # Do NOT reference 'category' or other optional fields
    sql = f"""
    SELECT *
    FROM ML.WEIGHTS(MODEL `{model_name}`)
    ORDER BY ABS(weight) DESC
    LIMIT {top_k};
    """
    df = bq.query(sql).result().to_dataframe()
    print(f"Columns returned for {model_name}: {list(df.columns)}")
    # Reorder if some helpful columns exist
    preferred = [c for c in ["processed_input","feature","weight","standardized_weight","p_value","stddev"] if c in df.columns]
    return df[preferred] if preferred else df

print("💡 Top weights (base):")
weights_base = get_top_weights(BASE_MODEL, 10)
display(weights_base)

print("💡 Top weights (enhanced):")
weights_enh = get_top_weights(ENH_MODEL, 10)
display(weights_enh)


💡 Top weights (base):
Columns returned for mgmt-467-1234.netflix.churn_logreg: ['processed_input', 'weight', 'category_weights']


| | processed_input | weight |
|---|---|---|
| 0 | \_\_INTERCEPT\_\_ | -0.0001153022 |
| 1 | total_minutes | 3.959847e-07 |
| 2 | region | |
| 3 | plan_tier | |
| 4 | age_band | |

💡 Top weights (enhanced):
Columns returned for mgmt-467-1234.netflix.churn_model_enhanced: ['processed_input', 'weight', 'category_weights']


| | processed_input | weight |
|---|---|---|
| 0 | flag_binge | 0.001946387 |
| 1 | \_\_INTERCEPT\_\_ | -0.0006810665 |
| 2 | total_minutes | 9.91243e-07 |
| 3 | is_missing_age_band | 0.0 |
| 4 | region | |
| 5 | plan_tier | |
| 6 | age_band | |
| 7 | watch_time_bucket | |
| 8 | plan_region_combo | |

* * *
**🔍 Exploration:**

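Among the numeric features, flag_binge has the largest absolute weight in the enhanced model (about 0.0019); it does not exist in the base model. total_minutes appears in both models, but its weight is tiny (about 4.0e-07 in the base model and 9.9e-07 in the enhanced one). Raw weights are not directly comparable across features measured on different scales, such as minutes versus a 0/1 flag. The categorical features (region, plan_tier, age_band, watch_time_bucket, plan_region_combo) show a blank weight because their per-category weights are stored in category_weights, a column the helper function does not display, so their contribution cannot be judged from these tables.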
* * *
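
The task goal asks for an ML.EVALUATE comparison, and the weights tables above omit the categorical features. The sketch below builds two queries that could fill those gaps: a side-by-side evaluation using `UNION ALL`, and an unpacking of `category_weights` with `UNNEST`. The helper names are hypothetical and the SQL has not been run against this project. Note that ML.EVALUATE without an input table scores each model on its own evaluation split, so a stricter comparison would pass the same table to both calls.

```python
def compare_eval_sql(models):
    """Build one query that stacks ML.EVALUATE rows for several models.

    models: mapping of a short label to a fully qualified model name.
    """
    parts = [
        f"SELECT '{label}' AS model, * FROM ML.EVALUATE(MODEL `{name}`)"
        for label, name in models.items()
    ]
    return "\nUNION ALL\n".join(parts) + ";"


def category_weights_sql(model_name):
    """Build a query that lists per-category weights for categorical features."""
    return (
        "SELECT w.processed_input, cw.category, cw.weight\n"
        f"FROM ML.WEIGHTS(MODEL `{model_name}`) AS w,\n"
        "  UNNEST(w.category_weights) AS cw\n"
        "ORDER BY ABS(cw.weight) DESC;"
    )


eval_sql = compare_eval_sql({
    "base": "mgmt-467-1234.netflix.churn_logreg",
    "enhanced": "mgmt-467-1234.netflix.churn_model_enhanced",
})
cat_sql = category_weights_sql("mgmt-467-1234.netflix.churn_model_enhanced")

print(eval_sql)
print(cat_sql)
```

Either string could be run with `bq.query(...).result().to_dataframe()` as in the cells above; the first yields one row per model, the second one row per feature category.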