<a href="https://colab.research.google.com/github/gabeunix/mgmt467-analytics-portfolio/blob/main/GWang_Unit2_Lab5_PromptStudio_Tasks5onwards.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 🤖 MGMT 467 - Unit 2 Lab 2: Prompt Studio — Feature Engineering & Beyond

**Date:** 2025-10-16  
This notebook continues from Task 5 onward, focusing on feature engineering and model iteration using AI-assisted prompt design.

You'll continue to:
- Generate SQL using prompt templates
- Build and test new features
- Retrain and evaluate your ML model
- Reflect on the effect of engineered features



## Task 5.0: Bucket a Continuous Feature

**🎯 Goal:** Group 'total_minutes' into categories: low, medium, high.  
**📌 Requirements:** Use CASE WHEN or IF statements to create 'watch_time_bucket'.

---

### 🧠 Prompt Template  
> Write SQL that creates a new column watch_time_bucket based on total_minutes thresholds (<100, 100–300, >300).

---

### 👩‍🏫 Example Prompt  
> Create a new column watch_time_bucket with values 'low', 'medium', or 'high' based on total_minutes.

---

### 🔍 Exploration  
How does churn rate vary across these buckets?
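The bucketing rule from the prompt template can be sketched in plain Python to check the boundary cases before running it in BigQuery. This is an illustrative sketch only: the function name `watch_time_bucket` and the `strict` parameter are assumptions, not part of the lab. One point it makes visible is that a SQL comparison with NULL is never true, so a CASE expression with a bare `ELSE 'high'` would label a NULL total_minutes as 'high'.

```python
from typing import Optional


def watch_time_bucket(total_minutes: Optional[float], strict: bool = False) -> Optional[str]:
    """Mirror the prompt thresholds: <100 low, 100-300 medium, >300 high.

    With strict=False, a missing value falls through to 'high', which is what
    a SQL CASE with a bare ELSE does. With strict=True (a hypothetical
    variant), a missing value stays missing.
    """
    if total_minutes is None:
        return None if strict else "high"
    if total_minutes < 100:
        return "low"
    if total_minutes <= 300:
        return "medium"
    return "high"


if __name__ == "__main__":
    # Boundary values plus a missing value.
    for minutes in (99.9, 100, 300, 300.1, None):
        print(minutes, watch_time_bucket(minutes), watch_time_bucket(minutes, strict=True))
```

If NULLs can occur in total_minutes, the strict variant corresponds to adding an explicit `WHEN total_minutes > 300 THEN 'high'` branch and letting the CASE return NULL otherwise.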


In [1]:
%%bigquery --project boxwood-veld-471119-r6
SELECT CURRENT_DATE() AS today, SESSION_USER() AS user;

Unnamed: 0,today,user
0,2025-10-27,wanggabriel791@gmail.com


In [2]:
%%bigquery --project boxwood-veld-471119-r6
CREATE OR REPLACE TABLE `netflix.churn_features_bucketed` AS
SELECT
  *,
  CASE
    WHEN total_minutes < 100 THEN 'low'
    WHEN total_minutes >= 100 AND total_minutes <= 300 THEN 'medium'
    ELSE 'high'
  END AS watch_time_bucket
FROM `netflix.churn_features`;

In [3]:
%%bigquery --project boxwood-veld-471119-r6
SELECT
  watch_time_bucket,
  COUNT(*) AS n_users,
  AVG(CAST(churn_label AS FLOAT64)) AS churn_rate
FROM `netflix.churn_features_bucketed`
GROUP BY watch_time_bucket
ORDER BY
  CASE watch_time_bucket
    WHEN 'low' THEN 1 WHEN 'medium' THEN 2 ELSE 3
  END;

Unnamed: 0,watch_time_bucket,n_users,churn_rate
0,low,52,0.076923
1,medium,1145,0.146725
2,high,9103,0.148522


Explanation: In the table above, churn rate differs mainly between the low bucket and the other two:

- Low watch time: the lowest churn rate, about 7.7%. This bucket holds only 52 users (roughly 4 churners), so the estimate is noisy.
- Medium watch time: about 14.7%.
- High watch time: about 14.9%, nearly identical to the medium bucket.
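To judge how much weight the low-bucket rate can bear, a confidence interval helps. The sketch below computes an approximate 95% Wilson score interval per bucket. The churner counts are not in the notebook; they are reconstructed here as round(n_users * churn_rate) from the table above, and the helper name `wilson_interval` is an assumption.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# (churners, users) per bucket; churners reconstructed from n_users * churn_rate.
buckets = {"low": (4, 52), "medium": (168, 1145), "high": (1352, 9103)}

if __name__ == "__main__":
    for name, (churned, n) in buckets.items():
        lo, hi = wilson_interval(churned, n)
        print(f"{name:<6} rate={churned / n:.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```

A wide interval for the low bucket would indicate that its lower churn rate may not be distinguishable from the other buckets at this sample size.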


## Task 5.1: Create a Binary Flag Feature

**🎯 Goal:** Add a binary column flag_binge (1 if total_minutes > 500).  
**📌 Requirements:** Use IF logic to create a binary column in SQL.

---

### 🧠 Prompt Template  
> Write a SQL query that adds flag_binge = 1 if total_minutes > 500, else 0.

---

### 👩‍🏫 Example Prompt  
> Add a binary column flag_binge to identify binge-watchers.

---

### 🔍 Exploration  
Are binge-watchers more or less likely to churn?


In [4]:
%%bigquery --project boxwood-veld-471119-r6
-- Add a binary flag: 1 if total_minutes > 500 else 0
CREATE OR REPLACE TABLE `netflix.churn_features_flagged` AS
SELECT
  *,
  CASE
    WHEN total_minutes > 500 THEN 1
    ELSE 0
  END AS flag_binge
FROM `netflix.churn_features_bucketed`;

In [5]:
%%bigquery --project boxwood-veld-471119-r6
SELECT
  *
FROM
  `netflix.churn_features_flagged`
LIMIT 10;

Unnamed: 0,user_id,region,plan_tier,age_band,avg_rating,total_minutes,churn_label,watch_time_bucket,flag_binge
0,user_05355,USA,Premium+,25-34,0.0,5.5,0,low,0
1,user_05035,USA,Premium,25-34,0.0,38.6,0,low,0
2,user_04576,Canada,Premium,35-44,0.0,44.1,0,low,0
3,user_00400,USA,Basic,unknown,0.0,45.0,0,low,0
4,user_01200,Canada,Premium,18-24,0.0,56.5,0,low,0
5,user_08766,USA,Premium,35-44,0.0,58.4,0,low,0
6,user_02623,Canada,Premium+,18-24,0.0,60.4,0,low,0
7,user_08154,USA,Standard,unknown,0.0,70.6,0,low,0
8,user_00757,USA,Standard,25-34,0.0,73.4,0,low,0
9,user_06254,Canada,Standard,25-34,0.0,81.2,0,low,0


In [6]:
%%bigquery --project boxwood-veld-471119-r6
-- Compare churn between binge (1) vs non-binge (0)
SELECT
  flag_binge,
  COUNT(*) AS n_users,
  AVG(CAST(churn_label AS FLOAT64)) AS churn_rate
FROM `netflix.churn_features_flagged`
GROUP BY flag_binge
ORDER BY flag_binge DESC;

Unnamed: 0,flag_binge,n_users,churn_rate
0,1,6087,0.149335
1,0,4213,0.145977


Explanation: Users flagged as binge-watchers (more than 500 minutes) had a churn rate of about 14.9%, versus about 14.6% for non-binge watchers. The gap is only about 0.3 percentage points, so binge status by itself appears to be a weak signal of churn.
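One way to check whether a gap this small is meaningful is a two-proportion z-test. The sketch below is a rough check, not part of the lab: the churner counts are reconstructed as round(n_users * churn_rate) from the table above, and the function name `two_proportion_z` is an assumption.

```python
import math
from typing import Tuple


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> Tuple[float, float]:
    """Return (z statistic, two-sided p-value) using a pooled standard error."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value


if __name__ == "__main__":
    # binge: 909 churners of 6087 users; non-binge: 615 churners of 4213 users
    z, p = two_proportion_z(909, 6087, 615, 4213)
    print(f"z = {z:.2f}, two-sided p = {p:.2f}")
```

A |z| well below 1.96 would mean the observed difference is within what sampling noise alone could produce.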


## Task 5.2: Create an Interaction Term

**🎯 Goal:** Create plan_region_combo by combining plan_tier and region.  
**📌 Requirements:** Use CONCAT or STRING functions.

---

### 🧠 Prompt Template  
> Generate SQL to create a new column by combining plan_tier and region with an underscore.

---

### 👩‍🏫 Example Prompt  
> Create a column called plan_region_combo as CONCAT(plan_tier, '_', region).

---

### 🔍 Exploration  
Which plan-region combos have the highest churn?


In [7]:
%%bigquery --project boxwood-veld-471119-r6
-- Create plan_region_combo by combining plan_tier and region
CREATE OR REPLACE TABLE `netflix.churn_features_combined` AS
SELECT
  *,
  CONCAT(plan_tier, '_', region) AS plan_region_combo
FROM
  `netflix.churn_features_flagged`;

In [8]:
%%bigquery --project boxwood-veld-471119-r6
SELECT
  plan_region_combo,
  COUNT(*) AS n_users,
  AVG(CAST(churn_label AS FLOAT64)) AS churn_rate
FROM
  `netflix.churn_features_combined`
GROUP BY
  plan_region_combo
ORDER BY
  churn_rate DESC;

Unnamed: 0,plan_region_combo,n_users,churn_rate
0,Basic_USA,1406,0.158606
1,Standard_Canada,1101,0.158038
2,Premium_USA,2519,0.150854
3,Premium_Canada,1100,0.148182
4,Premium+_Canada,281,0.142349
5,Standard_USA,2524,0.141046
6,Premium+_USA,755,0.140397
7,Basic_Canada,614,0.13355


Response: In the table above, the two plan-region combos with the highest churn rates are:

- Basic_USA: 15.86%
- Standard_Canada: 15.80%

The overall spread is narrow: the lowest combo, Basic_Canada, is at about 13.4%.
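The interaction term itself is simple, but its NULL behavior is worth keeping in mind: BigQuery's CONCAT returns NULL if any argument is NULL, so a missing plan_tier or region would produce a missing combo. A minimal Python sketch of that behavior follows; the function name is illustrative, not part of the lab.

```python
from typing import Optional


def plan_region_combo(plan_tier: Optional[str], region: Optional[str]) -> Optional[str]:
    """Join plan tier and region with an underscore; None if either part is missing."""
    if plan_tier is None or region is None:
        return None
    return f"{plan_tier}_{region}"


if __name__ == "__main__":
    print(plan_region_combo("Basic", "USA"))
    print(plan_region_combo("Premium+", None))
```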


## Task 5.3: Add Missingness Indicator Flags

**🎯 Goal:** Add binary flags to capture NULL values in age_band and avg_rating.  
**📌 Requirements:** Use IS NULL logic to create new flag columns.

---

### 🧠 Prompt Template  
> Create a new column is_missing_[col_name] that is 1 when column is NULL, else 0.

---

### 👩‍🏫 Example Prompt  
> Add is_missing_age that flags rows where age_band IS NULL.

---

### 🔍 Exploration  
Do missing values correlate with churn?


In [9]:
%%bigquery --project boxwood-veld-471119-r6
-- Add binary flags to capture NULL values in age_band and avg_rating
CREATE OR REPLACE TABLE `netflix.churn_features_missing_flags` AS
SELECT
  *,
  CASE
    WHEN age_band IS NULL THEN 1
    ELSE 0
  END AS is_missing_age_band,
  CASE
    WHEN avg_rating IS NULL THEN 1
    ELSE 0
  END AS is_missing_avg_rating
FROM
  `netflix.churn_features_combined`;

In [10]:
%%bigquery --project boxwood-veld-471119-r6
SELECT
  is_missing_age_band,
  is_missing_avg_rating,
  COUNT(*) AS n_users,
  AVG(CAST(churn_label AS FLOAT64)) AS churn_rate
FROM
  `netflix.churn_features_missing_flags`
GROUP BY
  is_missing_age_band,
  is_missing_avg_rating
ORDER BY
  is_missing_age_band,
  is_missing_avg_rating;

Unnamed: 0,is_missing_age_band,is_missing_avg_rating,n_users,churn_rate
0,0,0,10300,0.147961


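Exploration: This table cannot show whether missing values correlate with churn. Both flags are 0 for all 10,300 users, so there is a single group (churn rate 14.8%) and nothing to compare it against. The sample rows shown earlier do contain age_band = 'unknown' and avg_rating = 0.0, which suggests that missing values may have been encoded as placeholder values upstream rather than as NULL; if so, IS NULL flags cannot detect them.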
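As a sketch of what placeholder-aware flags could look like, the snippet below applies them to the ten sample rows printed earlier for `churn_features_flagged`. Treating 'unknown' and 0.0 as missing is an assumption made for illustration; a 0.0 average rating could also be a genuine value, so this would need to be confirmed against how the base table was built.

```python
from typing import Optional, Tuple

# (user_id, age_band, avg_rating) copied from the sample output shown earlier.
rows = [
    ("user_05355", "25-34", 0.0), ("user_05035", "25-34", 0.0),
    ("user_04576", "35-44", 0.0), ("user_00400", "unknown", 0.0),
    ("user_01200", "18-24", 0.0), ("user_08766", "35-44", 0.0),
    ("user_02623", "18-24", 0.0), ("user_08154", "unknown", 0.0),
    ("user_00757", "25-34", 0.0), ("user_06254", "25-34", 0.0),
]


def missing_flags(age_band: Optional[str], avg_rating: Optional[float]) -> Tuple[int, int]:
    """Flag NULLs and (by assumption) the placeholders 'unknown' and 0.0."""
    is_missing_age_band = int(age_band is None or age_band == "unknown")
    is_missing_avg_rating = int(avg_rating is None or avg_rating == 0.0)
    return is_missing_age_band, is_missing_avg_rating


if __name__ == "__main__":
    flags = [missing_flags(age, rating) for _, age, rating in rows]
    print("age_band flagged:", sum(f[0] for f in flags))
    print("avg_rating flagged:", sum(f[1] for f in flags))
```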


## Task 5.4: Create Time-Based Features (Optional)

**🎯 Goal:** Add a column days_since_last_login.  
**📌 Requirements:** Use DATE_DIFF with CURRENT_DATE and last_login_date.

---

### 🧠 Prompt Template  
> Write SQL to create a column showing days since last login using DATE_DIFF.

---

### 👩‍🏫 Example Prompt  
> Add a column days_since_last_login = DATE_DIFF(CURRENT_DATE(), last_login_date, DAY).

---

### 🔍 Exploration  
Does login recency affect churn rate?
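This optional task has no cell in the notebook, and the column listings shown here do not include last_login_date. As a hedged illustration of what `DATE_DIFF(CURRENT_DATE(), last_login_date, DAY)` computes, here is a small Python sketch; the function name, the `today` parameter, and the example dates are all hypothetical.

```python
from datetime import date
from typing import Optional


def days_since_last_login(last_login: Optional[date], today: Optional[date] = None) -> Optional[int]:
    """Whole days between last_login and today; None when last_login is missing."""
    if last_login is None:
        return None
    reference = today or date.today()
    return (reference - last_login).days


if __name__ == "__main__":
    # Fixed reference date so the result is reproducible.
    print(days_since_last_login(date(2025, 10, 1), today=date(2025, 10, 27)))
```

Passing a fixed reference date instead of the current date keeps the feature stable between the training run and later evaluation runs.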



## Task 5.5: Assemble Enhanced Feature Table

**🎯 Goal:** Create churn_features_enhanced with all engineered columns.  
**📌 Requirements:** Include all prior features + engineered columns.

---

### 🧠 Prompt Template  
> Generate SQL to create churn_features_enhanced with new columns: watch_time_bucket, plan_region_combo, flag_binge, etc.

---

### 👩‍🏫 Example Prompt  
> Build a new table churn_features_enhanced with all original features + engineered ones.

---

### 🔍 Exploration  
Are row counts stable? Any NULLs introduced?


In [11]:
%%bigquery --project boxwood-veld-471119-r6
-- Create churn_features_enhanced with all engineered columns
CREATE OR REPLACE TABLE `netflix.churn_features_enhanced` AS
SELECT
  *
FROM
  `netflix.churn_features_missing_flags`;

In [12]:
%%bigquery --project boxwood-veld-471119-r6
SELECT
  *
FROM
  `netflix.churn_features_enhanced`
LIMIT 10;


Unnamed: 0,user_id,region,plan_tier,age_band,avg_rating,total_minutes,churn_label,watch_time_bucket,flag_binge,plan_region_combo,is_missing_age_band,is_missing_avg_rating
0,user_04850,Canada,Basic,18-24,4.0,806.5,0,high,1,Basic_Canada,0,0
1,user_03483,Canada,Basic,18-24,4.5,636.0,0,high,1,Basic_Canada,0,0
2,user_02486,Canada,Basic,18-24,4.0,676.8,0,high,1,Basic_Canada,0,0
3,user_07034,Canada,Basic,18-24,3.0,341.5,0,high,0,Basic_Canada,0,0
4,user_01195,Canada,Basic,18-24,0.0,422.6,0,high,0,Basic_Canada,0,0
5,user_06003,Canada,Basic,18-24,5.0,524.0,0,high,1,Basic_Canada,0,0
6,user_09239,Canada,Basic,18-24,4.333333,442.2,0,high,0,Basic_Canada,0,0
7,user_00180,Canada,Basic,18-24,5.0,534.7,0,high,1,Basic_Canada,0,0
8,user_06030,Canada,Basic,18-24,3.5,702.3,0,high,1,Basic_Canada,0,0
9,user_09255,Canada,Basic,18-24,0.0,272.5,0,medium,0,Basic_Canada,0,0


In [13]:
%%bigquery --project boxwood-veld-471119-r6
-- Check row counts
SELECT
  (SELECT COUNT(*) FROM `netflix.churn_features`)            AS n_base,
  (SELECT COUNT(*) FROM `netflix.churn_features_enhanced`)   AS n_enhanced;

Unnamed: 0,n_base,n_enhanced
0,10300,10300


In [14]:
%%bigquery --project boxwood-veld-471119-r6
-- Check for NULLs in engineered features
SELECT
  SUM(CASE WHEN watch_time_bucket     IS NULL THEN 1 ELSE 0 END) AS null_watch_time_bucket,
  SUM(CASE WHEN flag_binge            IS NULL THEN 1 ELSE 0 END) AS null_flag_binge,
  SUM(CASE WHEN plan_region_combo     IS NULL THEN 1 ELSE 0 END) AS null_plan_region_combo,
  SUM(CASE WHEN is_missing_age_band   IS NULL THEN 1 ELSE 0 END) AS null_is_missing_age_band,
  SUM(CASE WHEN is_missing_avg_rating IS NULL THEN 1 ELSE 0 END) AS null_is_missing_avg_rating
FROM `netflix.churn_features_enhanced`;

Unnamed: 0,null_watch_time_bucket,null_flag_binge,null_plan_region_combo,null_is_missing_age_band,null_is_missing_avg_rating
0,0,0,0,0,0


**Exploration:** Yes, row counts are stable: the original table (n_base) and the enhanced table (n_enhanced) both have 10,300 rows.

No NULLs were introduced: the NULL check returned 0 for all five engineered columns (watch_time_bucket, flag_binge, plan_region_combo, is_missing_age_band, is_missing_avg_rating).
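The same two checks can be expressed as a small reusable helper. The sketch below works on rows represented as Python dictionaries; the function name `check_enhanced_table` and the toy rows are illustrative assumptions, not data from the lab.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[object]]


def check_enhanced_table(
    base_rows: List[Row], enhanced_rows: List[Row], engineered_cols: List[str]
) -> Tuple[bool, Dict[str, int]]:
    """Return (row counts match, number of missing values per engineered column)."""
    counts_match = len(base_rows) == len(enhanced_rows)
    null_counts = {
        col: sum(1 for row in enhanced_rows if row.get(col) is None)
        for col in engineered_cols
    }
    return counts_match, null_counts


if __name__ == "__main__":
    # Toy rows for illustration only.
    base = [{"user_id": "u1"}, {"user_id": "u2"}]
    enhanced = [
        {"user_id": "u1", "flag_binge": 1, "watch_time_bucket": "high"},
        {"user_id": "u2", "flag_binge": 0, "watch_time_bucket": None},
    ]
    print(check_enhanced_table(base, enhanced, ["flag_binge", "watch_time_bucket"]))
```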


## Task 6: Retrain Model on Engineered Features

**🎯 Goal:** Train a logistic regression model using churn_features_enhanced.  
**📌 Requirements:** Use BQML logistic_reg model with new feature columns.

---

### 🧠 Prompt Template  
> Write CREATE MODEL SQL using enhanced features including flags and buckets.

---

### 👩‍🏫 Example Prompt  
> Retrain churn_model_enhanced using watch_time_bucket, flag_binge, plan_region_combo.

---

### 🔍 Exploration  
Does model accuracy improve?


In [15]:
%%bigquery --project boxwood-veld-471119-r6
-- Train a logistic regression model using original features
CREATE OR REPLACE MODEL
  `netflix.churn_model_base`
OPTIONS
  (model_type='logistic_reg',
    input_label_cols=['churn_label']
  ) AS
SELECT
  region,
  plan_tier,
  age_band,
  avg_rating,
  total_minutes,
  churn_label
FROM
  `netflix.churn_features`;

In [16]:
%%bigquery --project boxwood-veld-471119-r6
-- Evaluate the base model
SELECT
  *
FROM
  ML.EVALUATE(MODEL `netflix.churn_model_base`,
    (
    SELECT
      region,
      plan_tier,
      age_band,
      avg_rating,
      total_minutes,
      churn_label
    FROM
      `netflix.churn_features`
    )
  );

Unnamed: 0,precision,recall,accuracy,f1_score,log_loss,roc_auc
0,0.0,0.0,0.852039,0.0,0.419079,0.517648


In [17]:
%%bigquery --project boxwood-veld-471119-r6
-- Train a logistic regression model using churn_features_enhanced
CREATE OR REPLACE MODEL
  `netflix.churn_model_enhanced`
OPTIONS
  (model_type='logistic_reg',
    input_label_cols=['churn_label']
  ) AS
SELECT
  -- Original features
  region,
  plan_tier,
  age_band,
  avg_rating,
  total_minutes,
  -- Engineered features
  watch_time_bucket,
  flag_binge,
  plan_region_combo,
  is_missing_age_band,
  is_missing_avg_rating,
  churn_label
FROM
  `netflix.churn_features_enhanced`;

In [18]:

%%bigquery --project boxwood-veld-471119-r6
-- Evaluate the enhanced model
SELECT
  *
FROM
  ML.EVALUATE(MODEL `netflix.churn_model_enhanced`,
    (
    SELECT
      -- Include all features used for training
      region,
      plan_tier,
      age_band,
      avg_rating,
      total_minutes,
      watch_time_bucket,
      flag_binge,
      plan_region_combo,
      is_missing_age_band,
      is_missing_avg_rating,
      churn_label
    FROM
      `netflix.churn_features_enhanced`
    )
  );


Unnamed: 0,precision,recall,accuracy,f1_score,log_loss,roc_auc
0,0.0,0.0,0.852039,0.0,0.418972,0.521578


Exploration: Accuracy is identical for both models at 0.852039. Precision and recall are both 0, which means neither model predicts churn for any user at the default classification threshold; the accuracy simply equals the share of non-churners (1 - 0.147961). The enhanced model shows a slight improvement in log loss, from 0.419079 to 0.418972, and in ROC AUC, from 0.517648 to 0.521578. Both AUC values are close to 0.5, so the engineered features improved the model's ability to separate churn from non-churn cases only marginally.
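The link between zero recall and the reported accuracy can be verified with a short calculation. The sketch below scores a predictor that labels every user as non-churn; the churner count of 1,524 is reconstructed as round(10300 * 0.147961) from the overall churn rate reported earlier, and the function name is an assumption.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
    }


if __name__ == "__main__":
    n_users, n_churners = 10300, 1524
    # Always predict "no churn": every churner is a false negative.
    print(binary_metrics(tp=0, fp=0, fn=n_churners, tn=n_users - n_churners))
```

The resulting accuracy matches the value reported by ML.EVALUATE for both models, which is why accuracy alone cannot distinguish them here.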


## Task 7: Compare Model Performance

**🎯 Goal:** Compare base model vs enhanced model using ML.EVALUATE.  
**📌 Requirements:** Use same evaluation query for both models.

---

### 🧠 Prompt Template  
> Write a SQL query to evaluate churn_model_enhanced and compare with churn_model.

---

### 👩‍🏫 Example Prompt  
> Compare ML.EVALUATE output from both models side-by-side.

---

### 🔍 Exploration  
Which features made the most difference?


In [19]:
%%bigquery --project boxwood-veld-471119-r6
-- Compare Base vs Enhanced metrics side-by-side
WITH base AS (
  SELECT
    'baseline' AS model,
    *
  FROM
    ML.EVALUATE(MODEL `netflix.churn_model_base`,
      (
      SELECT
        region,
        plan_tier,
        age_band,
        avg_rating,
        total_minutes,
        churn_label
      FROM
        `netflix.churn_features`
      )
    )
),
enhanced AS (
  SELECT
    'enhanced' AS model,
    *
  FROM
    ML.EVALUATE(MODEL `netflix.churn_model_enhanced`,
      (
      SELECT
        -- Include all features used for training
        region,
        plan_tier,
        age_band,
        avg_rating,
        total_minutes,
        watch_time_bucket,
        flag_binge,
        plan_region_combo,
        is_missing_age_band,
        is_missing_avg_rating,
        churn_label
      FROM
        `netflix.churn_features_enhanced`
      )
    )
)
SELECT * FROM base
UNION ALL
SELECT * FROM enhanced;

Unnamed: 0,model,precision,recall,accuracy,f1_score,log_loss,roc_auc
0,baseline,0.0,0.0,0.852039,0.0,0.419079,0.517648
1,enhanced,0.0,0.0,0.852039,0.0,0.418972,0.521578
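With only two rows, the comparison is easiest to read as per-metric differences. A minimal sketch, using the metric values copied from the table above (the helper name `metric_deltas` is an assumption):

```python
from typing import Dict

baseline = {"accuracy": 0.852039, "log_loss": 0.419079, "roc_auc": 0.517648}
enhanced = {"accuracy": 0.852039, "log_loss": 0.418972, "roc_auc": 0.521578}


def metric_deltas(base: Dict[str, float], new: Dict[str, float]) -> Dict[str, float]:
    """Enhanced minus baseline for each metric present in the baseline."""
    return {name: new[name] - base[name] for name in base}


if __name__ == "__main__":
    for name, delta in metric_deltas(baseline, enhanced).items():
        print(f"{name:<9} {delta:+.6f}")
```

Lower is better for log_loss, while higher is better for accuracy and roc_auc, so the signs need to be read per metric.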


In [20]:
%%bigquery --project boxwood-veld-471119-r6
-- Get feature weights for the enhanced model
SELECT
  *
FROM
  ML.WEIGHTS(MODEL `netflix.churn_model_enhanced`);


Unnamed: 0,processed_input,weight,category_weights
0,region,,"[{'category': 'USA', 'weight': -0.301128616323..."
1,plan_tier,,"[{'category': 'Premium+', 'weight': -0.3256082..."
2,age_band,,"[{'category': '65+', 'weight': -0.352634151793..."
3,avg_rating,0.000533,[]
4,total_minutes,5e-06,[]
5,watch_time_bucket,,"[{'category': 'low', 'weight': -0.572433424037..."
6,flag_binge,-0.000592,[]
7,plan_region_combo,,"[{'category': 'Basic_Canada', 'weight': -0.348..."
8,is_missing_age_band,0.0,[]
9,is_missing_avg_rating,0.0,[]


In [21]:
%%bigquery --project boxwood-veld-471119-r6
-- Unnest category weights for easier viewing
SELECT
  processed_input,
  category_weights.category,
  category_weights.weight
FROM
  ML.WEIGHTS(MODEL `netflix.churn_model_enhanced`),
  UNNEST(category_weights) AS category_weights
WHERE
  -- The join with UNNEST drops features whose category_weights array is empty,
  -- so only categorical features remain in the result
  processed_input IS NOT NULL;

Unnamed: 0,processed_input,category,weight
0,region,USA,-0.301129
1,region,Canada,-0.297966
2,plan_tier,Premium+,-0.325608
3,plan_tier,Standard,-0.300381
4,plan_tier,Premium,-0.297072
5,plan_tier,Basic,-0.29206
6,age_band,65+,-0.352634
7,age_band,35-44,-0.332007
8,age_band,55-64,-0.315826
9,age_band,unknown,-0.312408


In [22]:
%%bigquery --project boxwood-veld-471119-r6
-- Analyze weights of individual engineered features
SELECT
  processed_input,
  category_weights.category,
  category_weights.weight,  -- Specify weight from unnested table
  ABS(category_weights.weight) AS abs_weight -- Specify weight from unnested table for ABS
FROM
  ML.WEIGHTS(MODEL `netflix.churn_model_enhanced`),
  UNNEST(category_weights) AS category_weights
WHERE
  processed_input IN ('watch_time_bucket', 'plan_region_combo', 'flag_binge', 'is_missing_age_band', 'is_missing_avg_rating')
ORDER BY
  abs_weight DESC
LIMIT 20; -- Limit to top 20 for clarity

Unnamed: 0,processed_input,category,weight,abs_weight
0,watch_time_bucket,low,-0.572433,0.572433
1,plan_region_combo,Basic_Canada,-0.348778,0.348778
2,plan_region_combo,Premium+_Canada,-0.331466,0.331466
3,plan_region_combo,Premium+_USA,-0.323544,0.323544
4,plan_region_combo,Standard_USA,-0.316385,0.316385
5,watch_time_bucket,high,-0.299847,0.299847
6,plan_region_combo,Premium_USA,-0.297856,0.297856
7,plan_region_combo,Premium_Canada,-0.295352,0.295352
8,watch_time_bucket,medium,-0.289039,0.289039
9,plan_region_combo,Basic_USA,-0.266401,0.266401


In [23]:
%%bigquery --project boxwood-veld-471119-r6
-- Calculate total absolute weight by feature group (revised)
WITH FeatureWeights AS (
  SELECT
    processed_input,
    ABS(weight) AS abs_weight
  FROM ML.WEIGHTS(MODEL `netflix.churn_model_enhanced`)
  WHERE processed_input IS NOT NULL AND category_weights IS NULL -- For numerical and binary features
  UNION ALL
  SELECT
    processed_input,
    ABS(category_weights.weight) AS abs_weight
  FROM ML.WEIGHTS(MODEL `netflix.churn_model_enhanced`),
  UNNEST(category_weights) AS category_weights -- For categorical features
  WHERE processed_input IS NOT NULL
),
FeatureGroups AS (
  SELECT
    processed_input,
    CASE
      WHEN processed_input = 'watch_time_bucket' THEN 'watch_time_bucket'
      WHEN processed_input = 'plan_region_combo' THEN 'plan_region_combo'
      WHEN processed_input = 'flag_binge' THEN 'flag_binge'
      WHEN processed_input = 'is_missing_age_band' OR processed_input = 'is_missing_avg_rating' THEN 'missing_flags'
      ELSE 'original_features' -- Group all other features
    END AS feature_group,
    abs_weight
  FROM FeatureWeights
)
SELECT
  feature_group,
  SUM(abs_weight) AS total_absolute_weight
FROM FeatureGroups
GROUP BY
  feature_group
ORDER BY
  total_absolute_weight DESC;

Unnamed: 0,feature_group,total_absolute_weight
0,original_features,4.227146
1,plan_region_combo,2.444361
2,watch_time_bucket,1.16132


In [24]:
%%bigquery --project boxwood-veld-471119-r6
-- Calculate total absolute weight by feature group (similar to sample code)
WITH W AS (
  SELECT
    CASE
      WHEN processed_input LIKE 'watch_time_bucket=%' THEN 'watch_time_bucket'
      WHEN processed_input LIKE 'plan_region_combo=%' THEN 'plan_region_combo'
      WHEN processed_input = 'flag_binge' THEN 'flag_binge'
      WHEN processed_input LIKE 'is_missing_%' THEN 'missing_flags'
      ELSE 'other'
    END AS feature_group,
    ABS(weight) AS abs_weight
  FROM ML.WEIGHTS(MODEL `netflix.churn_model_enhanced`)
  WHERE processed_input IS NOT NULL
)
SELECT feature_group, SUM(abs_weight) AS total_abs_weight
FROM W
GROUP BY feature_group
ORDER BY total_abs_weight DESC;

Unnamed: 0,feature_group,total_abs_weight
0,other,0.30482
1,flag_binge,0.000592
2,missing_flags,0.0


Exploration: Looking at the total absolute weights by feature group, the original features combined had the highest total (4.23). Among the engineered features, plan_region_combo (2.44) and watch_time_bucket (1.16) had the largest collective totals. Because these totals sum over categories, a feature with more categories accumulates a larger total, so they are a rough guide rather than a strict ranking. At the individual level, the 'low' category of watch_time_bucket (-0.572) and several plan_region_combo categories (about -0.26 to -0.35) carried the largest weights, while flag_binge (-0.000592) and the two missingness flags (0.0) had essentially no influence.

Overall, while the engineered features provided a slight improvement in model performance (in log loss and ROC AUC), the original features were more dominant in influencing the model's predictions.
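Since a summed absolute weight grows with the number of categories, dividing by the category count gives a fairer per-category view. The sketch below does this for the two engineered categorical features, using the totals from the grouped-weight table and category counts taken from earlier outputs (8 plan-region combos, 3 watch-time buckets). The original features are left out because their full category lists are not shown in the notebook; the helper name is an assumption.

```python
from typing import Dict, Tuple

# feature -> (total absolute weight, number of categories)
totals: Dict[str, Tuple[float, int]] = {
    "plan_region_combo": (2.444361, 8),
    "watch_time_bucket": (1.16132, 3),
}


def mean_abs_weight(groups: Dict[str, Tuple[float, int]]) -> Dict[str, float]:
    """Average absolute weight per category for each feature."""
    return {name: total / n_categories for name, (total, n_categories) in groups.items()}


if __name__ == "__main__":
    for name, value in mean_abs_weight(totals).items():
        print(f"{name:<18} {value:.4f}")
```

On a per-category basis the ordering of the two features can differ from the ordering by summed weight, which is worth checking before ranking feature groups by their totals.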