

# 🤖 MGMT 467 - Unit 2 Lab 2: Prompt Studio — Feature Engineering & Beyond

**Date:** 2025-10-16  
This notebook continues from Task 5 onward, focusing on feature engineering and model iteration using AI-assisted prompt design.

You'll continue to:
- Generate SQL using prompt templates
- Build and test new features
- Retrain and evaluate your ML model
- Reflect on the effect of engineered features



## Task 5.0: Bucket a Continuous Feature

**🎯 Goal:** Group 'total_minutes' into categories: low, medium, high.  
**📌 Requirements:** Use CASE WHEN or IF statements to create 'watch_time_bucket'.

---

### 🧠 Prompt Template  
> Write SQL that creates a new column watch_time_bucket based on total_minutes thresholds (<100, 100–300, >300).

---

### 👩‍🏫 Example Prompt  
> Create a new column watch_time_bucket with values 'low', 'medium', or 'high' based on total_minutes.

---

### 🔍 Exploration  
How does churn rate vary across these buckets?
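
One way to approach this question is to aggregate churn by bucket directly in BigQuery. The sketch below assumes `churn_label` is a 0/1 integer, so `AVG(churn_label)` can be read as a churn rate. `run_query` is a hypothetical helper, not part of the lab. Note that `BETWEEN 100 AND 300` is inclusive at both ends, and a NULL `total_minutes` matches no branch, so its bucket is NULL.

```python
# Sketch: churn rate per watch_time_bucket (assumes churn_label is 0/1).
CHURN_BY_BUCKET_SQL = """
SELECT
  CASE
    WHEN total_minutes < 100 THEN 'low'
    WHEN total_minutes BETWEEN 100 AND 300 THEN 'medium'
    WHEN total_minutes > 300 THEN 'high'
  END AS watch_time_bucket,
  COUNT(*) AS n_rows,
  AVG(churn_label) AS churn_rate
FROM
  `mgmt-467.netflix.cleaned_features`
GROUP BY
  watch_time_bucket
ORDER BY
  churn_rate DESC
"""


def run_query(sql):
    """Hypothetical helper: run `sql` in BigQuery and return a DataFrame."""
    # Imported lazily so this snippet loads without the client library.
    from google.cloud import bigquery

    return bigquery.Client().query(sql).to_dataframe()


# Usage (needs BigQuery credentials):
# print(run_query(CHURN_BY_BUCKET_SQL))
```

Reporting `n_rows` next to the rate makes it easier to spot buckets that are too small to compare reliably.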


In [None]:
# prompt: Using BigQuery to get the tables in mgmt-467.netflix, group total_minutes into categories: low, medium, high. Try to use the table cleaned_features

from google.cloud import bigquery

client = bigquery.Client()

# Construct a BigQuery SQL query
query = """
SELECT
  *,
  CASE
    WHEN total_minutes < 100 THEN 'low'
    WHEN total_minutes BETWEEN 100 AND 300 THEN 'medium'
    WHEN total_minutes > 300 THEN 'high'
  END AS watch_time_bucket
FROM
  `mgmt-467.netflix.cleaned_features`
"""

# Run the query
query_job = client.query(query)

# Fetch the results
results = query_job.result()

# Convert to a Pandas DataFrame for easier analysis
import pandas as pd
df = results.to_dataframe()

print(df.head())


      user_id  region plan_tier age_band  avg_rating  total_minutes  \
0  user_00008  Canada     Basic      65+         0.0          819.8   
1  user_00008  Canada     Basic      65+         0.0          819.8   
2  user_00009     USA   Premium    18-24         0.0          776.0   
3  user_00009     USA   Premium    18-24         0.0          776.0   
4  user_00012     USA   Premium    35-44         0.0         1199.2   

   churn_label watch_time_bucket  
0            0              high  
1            0              high  
2            0              high  
3            0              high  
4            0              high  



## Task 5.1: Create a Binary Flag Feature

**🎯 Goal:** Add a binary column flag_binge (1 if total_minutes > 500).  
**📌 Requirements:** Use IF logic to create a binary column in SQL.

---

### 🧠 Prompt Template  
> Write a SQL query that adds flag_binge = 1 if total_minutes > 500, else 0.

---

### 👩‍🏫 Example Prompt  
> Add a binary column flag_binge to identify binge-watchers.

---

### 🔍 Exploration  
Are binge-watchers more or less likely to churn?
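
The requirement mentions IF logic; BigQuery's `IF(condition, a, b)` is a compact equivalent of a two-branch `CASE`. The sketch below builds a query that compares churn for flagged and unflagged rows. The function name, the default threshold argument, and the assumption that `churn_label` is 0/1 are illustrative, not taken from the lab. When the condition is NULL, `IF` returns the else value, so rows with a NULL `total_minutes` get `flag_binge = 0`.

```python
def binge_churn_sql(threshold=500, table="mgmt-467.netflix.cleaned_features"):
    """Build a query comparing churn rate for binge vs. non-binge rows."""
    threshold = float(threshold)  # reject non-numeric input before formatting
    return f"""
SELECT
  IF(total_minutes > {threshold}, 1, 0) AS flag_binge,
  COUNT(*) AS n_rows,
  AVG(churn_label) AS churn_rate
FROM
  `{table}`
GROUP BY
  flag_binge
ORDER BY
  flag_binge
"""


print(binge_churn_sql())
```

The returned string can be passed to `client.query(...)` in the same way as the other queries in this notebook.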


In [None]:
# prompt: I need you to now continue the work from above and add a binary column flag_binge (1 if the duration is >500). Use if logic to create a binary column

# Construct a BigQuery SQL query
query = """
SELECT
  *,
  CASE
    WHEN total_minutes < 100 THEN 'low'
    WHEN total_minutes BETWEEN 100 AND 300 THEN 'medium'
    WHEN total_minutes > 300 THEN 'high'
  END AS watch_time_bucket,
  CASE
    WHEN total_minutes > 500 THEN 1
    ELSE 0
  END AS flag_binge
FROM
  `mgmt-467.netflix.cleaned_features`
"""

# Run the query
query_job = client.query(query)

# Fetch the results
results = query_job.result()

# Convert to a Pandas DataFrame for easier analysis
df = results.to_dataframe()

print(df.head())

      user_id  region plan_tier age_band  avg_rating  total_minutes  \
0  user_00008  Canada     Basic      65+         0.0          819.8   
1  user_00008  Canada     Basic      65+         0.0          819.8   
2  user_00009     USA   Premium    18-24         0.0          776.0   
3  user_00009     USA   Premium    18-24         0.0          776.0   
4  user_00012     USA   Premium    35-44         0.0         1199.2   

   churn_label watch_time_bucket  flag_binge  
0            0              high           1  
1            0              high           1  
2            0              high           1  
3            0              high           1  
4            0              high           1  



## Task 5.2: Create an Interaction Term

**🎯 Goal:** Create plan_region_combo by combining plan_tier and region.  
**📌 Requirements:** Use CONCAT or STRING functions.

---

### 🧠 Prompt Template  
> Generate SQL to create a new column by combining plan_tier and region with an underscore.

---

### 👩‍🏫 Example Prompt  
> Create a column called plan_region_combo as CONCAT(plan_tier, '_', region).

---

### 🔍 Exploration  
Which plan-region combos have the highest churn?
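
A ranking query is one way to explore this. The sketch below assumes `churn_label` is 0/1 and filters out combos with very few rows, since a rate computed on a handful of users can be misleading. The `min_rows` default of 30 is an arbitrary illustration, and the function name is hypothetical. `CONCAT` returns NULL when any argument is NULL, so rows missing `plan_tier` or `region` end up in a NULL group.

```python
def combo_churn_sql(min_rows=30, table="mgmt-467.netflix.cleaned_features"):
    """Build a query ranking plan_region_combo values by churn rate."""
    min_rows = int(min_rows)  # reject non-integer input before formatting
    return f"""
SELECT
  CONCAT(plan_tier, '_', region) AS plan_region_combo,
  COUNT(*) AS n_rows,
  AVG(churn_label) AS churn_rate
FROM
  `{table}`
GROUP BY
  plan_region_combo
HAVING
  COUNT(*) >= {min_rows}
ORDER BY
  churn_rate DESC
"""


print(combo_churn_sql())
```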


In [None]:
# prompt: Now create a plan_region_combo by combining plan_tier and region from the users table

# Construct a BigQuery SQL query
query = """
SELECT
  *,
  CASE
    WHEN total_minutes < 100 THEN 'low'
    WHEN total_minutes BETWEEN 100 AND 300 THEN 'medium'
    WHEN total_minutes > 300 THEN 'high'
  END AS watch_time_bucket,
  CASE
    WHEN total_minutes > 500 THEN 1
    ELSE 0
  END AS flag_binge,
  CONCAT(plan_tier, '_', region) AS plan_region_combo
FROM
  `mgmt-467.netflix.cleaned_features`
"""

# Run the query
query_job = client.query(query)

# Fetch the results
results = query_job.result()

# Convert to a Pandas DataFrame for easier analysis
df = results.to_dataframe()

print(df.head())

      user_id  region plan_tier age_band  avg_rating  total_minutes  \
0  user_00008  Canada     Basic      65+         0.0          819.8   
1  user_00008  Canada     Basic      65+         0.0          819.8   
2  user_00009     USA   Premium    18-24         0.0          776.0   
3  user_00009     USA   Premium    18-24         0.0          776.0   
4  user_00012     USA   Premium    35-44         0.0         1199.2   

   churn_label watch_time_bucket  flag_binge plan_region_combo  
0            0              high           1      Basic_Canada  
1            0              high           1      Basic_Canada  
2            0              high           1       Premium_USA  
3            0              high           1       Premium_USA  
4            0              high           1       Premium_USA  



## Task 5.3: Add Missingness Indicator Flags

**🎯 Goal:** Add binary flags to capture NULL values in age_band and avg_rating.  
**📌 Requirements:** Use IS NULL logic to create new flag columns.

---

### 🧠 Prompt Template  
> Create a new column is_missing_[col_name] that is 1 when column is NULL, else 0.

---

### 👩‍🏫 Example Prompt  
> Add is_missing_age that flags rows where age_band IS NULL.

---

### 🔍 Exploration  
Do missing values correlate with churn?
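
The prompt template `is_missing_[col_name]` lends itself to generating the flag expressions from a list of column names. The sketch below does that and also builds a per-column churn comparison. Both function names are hypothetical, and `churn_label` is assumed to be 0/1. If an earlier cleaning step already imputed missing values (for example, replacing a NULL rating with 0), these flags would all be 0, so it may be worth checking the raw table as well.

```python
def missing_flag_exprs(columns):
    """Return one `is_missing_<col>` SELECT expression per column name."""
    return [f"IF({col} IS NULL, 1, 0) AS is_missing_{col}" for col in columns]


def missingness_churn_sql(column, table="mgmt-467.netflix.cleaned_features"):
    """Build a query comparing churn for rows with and without `column`."""
    flag = f"is_missing_{column}"
    return f"""
SELECT
  IF({column} IS NULL, 1, 0) AS {flag},
  COUNT(*) AS n_rows,
  AVG(churn_label) AS churn_rate
FROM
  `{table}`
GROUP BY
  {flag}
"""


# Expressions ready to paste into a SELECT list.
print(",\n".join(missing_flag_exprs(["age_band", "avg_rating"])))
```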


In [None]:
# prompt: Now add binary flags to capture NULL values in age_band and avg_rating. Use IS NULL logic to create new flag columns

# Construct a BigQuery SQL query
query = """
SELECT
  *,
  CASE
    WHEN total_minutes < 100 THEN 'low'
    WHEN total_minutes BETWEEN 100 AND 300 THEN 'medium'
    WHEN total_minutes > 300 THEN 'high'
  END AS watch_time_bucket,
  CASE
    WHEN total_minutes > 500 THEN 1
    ELSE 0
  END AS flag_binge,
  CONCAT(plan_tier, '_', region) AS plan_region_combo,
  CASE
    WHEN age_band IS NULL THEN 1
    ELSE 0
  END AS is_missing_age_band,
  CASE
    WHEN avg_rating IS NULL THEN 1
    ELSE 0
  END AS is_missing_avg_rating
FROM
  `mgmt-467.netflix.cleaned_features`
"""

# Run the query
query_job = client.query(query)

# Fetch the results
results = query_job.result()

# Convert to a Pandas DataFrame for easier analysis
df = results.to_dataframe()

print(df.head())


      user_id  region plan_tier age_band  avg_rating  total_minutes  \
0  user_00008  Canada     Basic      65+         0.0          819.8   
1  user_00008  Canada     Basic      65+         0.0          819.8   
2  user_00009     USA   Premium    18-24         0.0          776.0   
3  user_00009     USA   Premium    18-24         0.0          776.0   
4  user_00012     USA   Premium    35-44         0.0         1199.2   

   churn_label watch_time_bucket  flag_binge plan_region_combo  \
0            0              high           1      Basic_Canada   
1            0              high           1      Basic_Canada   
2            0              high           1       Premium_USA   
3            0              high           1       Premium_USA   
4            0              high           1       Premium_USA   

   is_missing_age_band  is_missing_avg_rating  
0                    0                      0  
1                    0                      0  
2                    0          


## Task 5.4: Create Time-Based Features (Optional)

**🎯 Goal:** Add a column days_since_last_login.  
**📌 Requirements:** Use DATE_DIFF with CURRENT_DATE and last_login_date.

---

### 🧠 Prompt Template  
> Write SQL to create a column showing days since last login using DATE_DIFF.

---

### 👩‍🏫 Example Prompt  
> Add a column days_since_last_login = DATE_DIFF(CURRENT_DATE(), last_login_date, DAY).

---

### 🔍 Exploration  
Does login recency affect churn rate?
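
This optional task has no code cell in the notebook, so the following is only a sketch. It assumes a `last_login_date` column of type DATE exists; that column does not appear in the `cleaned_features` output printed above, so it may need to be joined in from another table. `DATE_DIFF(a, b, DAY)` returns `a - b` in days, which the small Python function mirrors so the arithmetic can be checked locally.

```python
from datetime import date

# SQL expression from the example prompt; last_login_date is an assumed column.
DAYS_SINCE_LOGIN_EXPR = (
    "DATE_DIFF(CURRENT_DATE(), last_login_date, DAY) AS days_since_last_login"
)


def days_since_last_login(last_login, today):
    """Python mirror of DATE_DIFF(today, last_login, DAY); None stays None."""
    if last_login is None:
        return None
    return (today - last_login).days


print(days_since_last_login(date(2025, 10, 1), date(2025, 10, 16)))  # 15
```

Because `CURRENT_DATE()` changes every day, the feature's values shift each time the query runs. Using a fixed reference date instead is one way to keep training data reproducible.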



## Task 5.5: Assemble Enhanced Feature Table

**🎯 Goal:** Create churn_features_enhanced with all engineered columns.  
**📌 Requirements:** Include all prior features + engineered columns.

---

### 🧠 Prompt Template  
> Generate SQL to create churn_features_enhanced with new columns: watch_time_bucket, plan_region_combo, flag_binge, etc.

---

### 👩‍🏫 Example Prompt  
> Build a new table churn_features_enhanced with all original features + engineered ones.

---

### 🔍 Exploration  
Are row counts stable? Any NULLs introduced?
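
Both questions can be checked with two small queries, sketched below under the assumption that both tables live in `mgmt-467.netflix` as in the rest of this notebook. The first compares row counts and distinct `user_id` counts for the source and enhanced tables. The second counts NULLs in the engineered columns. The constant names are illustrative.

```python
# Row counts and distinct users, one output row per table.
ROW_COUNT_SQL = """
SELECT
  'cleaned_features' AS table_name,
  COUNT(*) AS n_rows,
  COUNT(DISTINCT user_id) AS distinct_users
FROM
  `mgmt-467.netflix.cleaned_features`
UNION ALL
SELECT
  'churn_features_enhanced' AS table_name,
  COUNT(*) AS n_rows,
  COUNT(DISTINCT user_id) AS distinct_users
FROM
  `mgmt-467.netflix.churn_features_enhanced`
"""

# NULLs in the engineered columns (the 0/1 flags cannot be NULL by construction).
NULL_CHECK_SQL = """
SELECT
  COUNTIF(watch_time_bucket IS NULL) AS null_watch_time_bucket,
  COUNTIF(plan_region_combo IS NULL) AS null_plan_region_combo
FROM
  `mgmt-467.netflix.churn_features_enhanced`
"""
```

If `n_rows` is larger than `distinct_users`, some users appear on more than one row. The printed samples in this notebook show each `user_id` on two consecutive rows, so this check seems worth running.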


In [None]:
# prompt: Create a new churn_features_enhanced table with new columns. Include all prior features plus the newly engineered columns we have created above

# Construct a BigQuery SQL query to create the new table
query = """
CREATE OR REPLACE TABLE
  `mgmt-467.netflix.churn_features_enhanced` AS
SELECT
  *,
  CASE
    WHEN total_minutes < 100 THEN 'low'
    WHEN total_minutes BETWEEN 100 AND 300 THEN 'medium'
    WHEN total_minutes > 300 THEN 'high'
  END AS watch_time_bucket,
  CASE
    WHEN total_minutes > 500 THEN 1
    ELSE 0
  END AS flag_binge,
  CONCAT(plan_tier, '_', region) AS plan_region_combo,
  CASE
    WHEN age_band IS NULL THEN 1
    ELSE 0
  END AS is_missing_age_band,
  CASE
    WHEN avg_rating IS NULL THEN 1
    ELSE 0
  END AS is_missing_avg_rating
FROM
  `mgmt-467.netflix.cleaned_features`
"""

# Run the query to create the table
query_job = client.query(query)

# Wait for the job to complete
query_job.result()

print("Table `mgmt-467.netflix.churn_features_enhanced` created successfully.")

# Optional: Display the first few rows of the new table to verify
query = """
SELECT *
FROM `mgmt-467.netflix.churn_features_enhanced`
LIMIT 5
"""
query_job = client.query(query)
results = query_job.result()
df_enhanced = results.to_dataframe()
print("\nFirst 5 rows of the enhanced table:")
print(df_enhanced.head())

Table `mgmt-467.netflix.churn_features_enhanced` created successfully.

First 5 rows of the enhanced table:
      user_id  region plan_tier age_band  avg_rating  total_minutes  \
0  user_00008  Canada     Basic      65+         0.0          819.8   
1  user_00008  Canada     Basic      65+         0.0          819.8   
2  user_00015  Canada     Basic    45-54         4.0         1060.4   
3  user_00015  Canada     Basic    45-54         4.0         1060.4   
4  user_00021  Canada     Basic    35-44         3.0          450.8   

   churn_label watch_time_bucket  flag_binge plan_region_combo  \
0            0              high           1      Basic_Canada   
1            0              high           1      Basic_Canada   
2            1              high           1      Basic_Canada   
3            1              high           1      Basic_Canada   
4            0              high           0      Basic_Canada   

   is_missing_age_band  is_missing_avg_rating  
0                   


## Task 6: Retrain Model on Engineered Features

**🎯 Goal:** Train a logistic regression model using churn_features_enhanced.  
**📌 Requirements:** Use BQML logistic_reg model with new feature columns.

---

### 🧠 Prompt Template  
> Write CREATE MODEL SQL using enhanced features including flags and buckets.

---

### 👩‍🏫 Example Prompt  
> Retrain churn_model_enhanced using watch_time_bucket, flag_binge, plan_region_combo.

---

### 🔍 Exploration  
Does model accuracy improve?
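
When reading any accuracy change, it may help to check what the model is actually given as input. With `SELECT *`, BigQuery ML treats every column other than the label as a feature, and that includes `user_id`. A per-user identifier can let a model memorize individual rows, especially if the same user lands in both the training and evaluation splits. The sketch below trains a variant without the identifier for comparison. The model name `churn_model_enhanced_no_id` is hypothetical, and whether `user_id` influences the results here is an assumption to test, not a finding.

```python
# Sketch: same model definition, but with the identifier column excluded.
TRAIN_WITHOUT_ID_SQL = """
CREATE OR REPLACE MODEL
  `mgmt-467.netflix.churn_model_enhanced_no_id`  -- hypothetical model name
OPTIONS(
  MODEL_TYPE='LOGISTIC_REG',
  INPUT_LABEL_COLS=['churn_label']
) AS
SELECT
  * EXCEPT (user_id)  -- drop the identifier so it is not used as a feature
FROM
  `mgmt-467.netflix.churn_features_enhanced`
"""

# Usage (needs BigQuery credentials and an existing `client`):
# client.query(TRAIN_WITHOUT_ID_SQL).result()
```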


In [None]:
# prompt: Now train a logistic regression model using the new table we just created. Make sure to use BQML logistic_reg model with the new features

# Construct a BigQuery SQL query to create and train the logistic regression model
query = """
CREATE OR REPLACE MODEL
  `mgmt-467.netflix.churn_model_enhanced`
OPTIONS(
  MODEL_TYPE='LOGISTIC_REG',
  INPUT_LABEL_COLS=['churn_label']
) AS
SELECT
  *  # All columns: churn_label is the label (INPUT_LABEL_COLS); every other column, including user_id, becomes a feature
FROM
  `mgmt-467.netflix.churn_features_enhanced`
"""

# Run the query to create and train the model
query_job = client.query(query)

# Wait for the job to complete
query_job.result()

print("Model `mgmt-467.netflix.churn_model_enhanced` trained successfully.")

# Optional: inspect per-iteration training statistics with ML.TRAINING_INFO
# stats = client.query(
#     "SELECT * FROM ML.TRAINING_INFO(MODEL `mgmt-467.netflix.churn_model_enhanced`)"
# ).to_dataframe()
# print("\nModel training statistics:")
# print(stats)

Model `mgmt-467.netflix.churn_model_enhanced` trained successfully.



## Task 7: Compare Model Performance

**🎯 Goal:** Compare base model vs enhanced model using ML.EVALUATE.  
**📌 Requirements:** Use same evaluation query for both models.

---

### 🧠 Prompt Template  
> Write a SQL query to evaluate churn_model_enhanced and compare with churn_model.

---

### 👩‍🏫 Example Prompt  
> Compare ML.EVALUATE output from both models side-by-side.

---

### 🔍 Exploration  
Which features made the most difference?
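
For a side-by-side view, both evaluations can be stacked into one result with `UNION ALL`, and `ML.WEIGHTS` gives a first look at which inputs carry the largest coefficients. The sketch below assumes both models are logistic regressions, so the `ML.EVALUATE` columns line up. The constant names are illustrative. Called without a table argument, `ML.EVALUATE` scores each model on the evaluation data reserved during its own training, so the two rows may not be computed on identical data.

```python
# One row per model, tagged with the model name.
COMPARE_SQL = """
SELECT 'churn_model' AS model_name, *
FROM ML.EVALUATE(MODEL `mgmt-467.netflix.churn_model`)
UNION ALL
SELECT 'churn_model_enhanced' AS model_name, *
FROM ML.EVALUATE(MODEL `mgmt-467.netflix.churn_model_enhanced`)
"""

# Coefficients of the enhanced model. Numeric inputs have a value in `weight`;
# categorical inputs report per-category values in `category_weights`.
WEIGHTS_SQL = """
SELECT
  processed_input,
  weight,
  category_weights
FROM
  ML.WEIGHTS(MODEL `mgmt-467.netflix.churn_model_enhanced`)
ORDER BY
  ABS(weight) DESC
"""

# Usage (needs BigQuery credentials and an existing `client`):
# print(client.query(COMPARE_SQL).to_dataframe())
```

Weights are only a rough guide to importance, because their magnitude depends on each feature's scale and encoding.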


In [None]:
# prompt: Now compare the base model already created to the enhanced version we just made using ML.EVALUATE. Use the same evaluation query for both models.

# Construct a BigQuery SQL query to evaluate the base model
query_base = """
SELECT
  *
FROM
  ML.EVALUATE(MODEL `mgmt-467.netflix.churn_model`)
"""

# Run the query for the base model
query_job_base = client.query(query_base)
results_base = query_job_base.result()
df_base_eval = results_base.to_dataframe()

print("Evaluation results for the base model:")
print(df_base_eval)

# Construct a BigQuery SQL query to evaluate the enhanced model
query_enhanced = """
SELECT
  *
FROM
  ML.EVALUATE(MODEL `mgmt-467.netflix.churn_model_enhanced`)
"""

# Run the query for the enhanced model
query_job_enhanced = client.query(query_enhanced)
results_enhanced = query_job_enhanced.result()
df_enhanced_eval = results_enhanced.to_dataframe()

print("\nEvaluation results for the enhanced model:")
df_enhanced_eval


Evaluation results for the base model:
   precision  recall  accuracy  f1_score  log_loss   roc_auc
0        0.0     0.0  0.849617       0.0  0.423495  0.505052

Evaluation results for the enhanced model:


   precision    recall  accuracy  f1_score  log_loss   roc_auc
0        1.0  0.841017  0.975834  0.913644  0.090881  0.987422
