
# 🤖 MGMT 467 - Unit 2 Lab 2: Prompt Studio for AI-Assisted SQL + ML

**Date:** 2025-10-16  
**Objective:** Build and refine a complete ML pipeline for churn prediction using BigQuery — but with **Gemini-style prompts** guiding SQL generation.

You'll learn to:
- Frame SQL goals as clear prompts
- Generate, test, and debug queries with an AI assistant
- Reflect on each modeling step and your prompt design



## Task 0: Connect to BigQuery

**🎯 Goal:** Verify BigQuery access from Colab.  
**📌 Requirements:** Use `%%bigquery` to return the current date and the session user.

---

### 🧠 Prompt Template  
> Write a SQL query that returns CURRENT_DATE() and SESSION_USER(). I will run it with %%bigquery in Colab.

---

### 👩‍🏫 Example Prompt  
> Write a SQL query using BigQuery syntax that returns today’s date and the current session user.

---

### ✅ Expected SQL Output
```sql
SELECT CURRENT_DATE() AS today, SESSION_USER() AS user;
```

---

### 🔍 Checkpoint  
Query should return a single row with today's date and your user.


In [None]:
from google.colab import auth
auth.authenticate_user()


In [None]:
from google.cloud import bigquery

PROJECT_ID = "mgmt467project"
client = bigquery.Client(project=PROJECT_ID)


In [None]:
%env GOOGLE_CLOUD_PROJECT=mgmt467project


env: GOOGLE_CLOUD_PROJECT=mgmt467project


In [None]:
%%bigquery
SELECT CURRENT_DATE() AS today, SESSION_USER() AS user;


| | today | user |
|---|---|---|
| 0 | 2025-10-26 | anuragkoripalli004@gmail.com |


My reflection: Connecting Colab to BigQuery and getting a row back confirmed that authentication and the project setting work, so later `%%bigquery` cells run against `mgmt467project`.


## Task 1: Prepare ML Table

**🎯 Goal:** Create a clean features table for modeling churn.  
**📌 Requirements:** Use cleaned_features as source, select relevant columns, filter rows with churn_label IS NOT NULL.

---

### 🧠 Prompt Template  
> Write a query that creates a new table with columns: [region, plan_tier, age_band, ...] and churn_label from [source_table]. Filter to rows where churn_label IS NOT NULL.

---

### 👩‍🏫 Example Prompt  
> Create a BigQuery table named churn_features from cleaned_features with selected features and where churn_label IS NOT NULL.

---

### ✅ Expected SQL Output
```sql
CREATE OR REPLACE TABLE `your_dataset.churn_features` AS
SELECT region, plan_tier, age_band, avg_rating, total_minutes, churn_label
FROM `your_dataset.cleaned_features`
WHERE churn_label IS NOT NULL;
```

---

### 🔍 Checkpoint  
Table should appear in BigQuery and contain non-null labels.


In [None]:
%%bigquery
CREATE OR REPLACE TABLE `mgmt467project.netflix.cleaned_features` AS
SELECT
  user_id,
  age,
  gender,
  country,
  subscription_plan,
  monthly_spend,
  household_size,
  primary_device,
  -- Convert boolean into numeric churn label (1 = churned, 0 = active)
  CASE
    WHEN is_active = FALSE THEN 1
    WHEN is_active = TRUE THEN 0
    ELSE NULL
  END AS churn_label
FROM
  `mgmt467project.netflix.users`
WHERE
  user_id IS NOT NULL;



In [None]:
%%bigquery
CREATE OR REPLACE TABLE `mgmt467project.netflix.churn_features` AS
SELECT
  country,
  subscription_plan,
  age,
  gender,
  monthly_spend,
  household_size,
  primary_device,
  churn_label
FROM
  `mgmt467project.netflix.cleaned_features`
WHERE churn_label IS NOT NULL;



In [None]:
%%bigquery
SELECT COUNT(*) AS total_rows,
       COUNTIF(churn_label IS NULL) AS null_labels
FROM `mgmt467project.netflix.churn_features`;



| | total_rows | null_labels |
|---|---|---|
| 0 | 20600 | 0 |


My Reflection: Creating a churn_features table filtered to non-null labels left 20,600 rows with no missing churn_label values, so the table contains only labeled rows and the attributes selected for modeling.
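
The label derivation and the null filter used in this task can be sketched in plain Python. This is only an illustration of the `CASE` logic and the `WHERE churn_label IS NOT NULL` filter; the three rows are hypothetical and are not taken from the `users` table.

```python
from typing import Optional


def churn_label(is_active: Optional[bool]) -> Optional[int]:
    """Mirror the CASE expression: inactive -> 1, active -> 0, unknown -> None."""
    if is_active is None:
        return None
    return 0 if is_active else 1


# Hypothetical rows standing in for the `users` table.
users = [
    {"user_id": "u1", "is_active": True},
    {"user_id": "u2", "is_active": False},
    {"user_id": "u3", "is_active": None},
]

# Equivalent of the cleaned_features step: attach the derived label.
labeled = [{**u, "churn_label": churn_label(u["is_active"])} for u in users]

# Equivalent of WHERE churn_label IS NOT NULL.
churn_features = [row for row in labeled if row["churn_label"] is not None]

print(churn_features)
```

Rows with an unknown `is_active` value get no label and are dropped before modeling, which is why the null check on the table returns 0.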


## Task 2: Train Logistic Regression Model

**🎯 Goal:** Train a basic BQML logistic regression model.  
**📌 Requirements:** Use churn_features table, predict churn_label from features.

---

### 🧠 Prompt Template  
> Write a CREATE MODEL SQL for logistic regression using churn_label as label and [features] as inputs.

---

### 👩‍🏫 Example Prompt  
> Train a logistic regression model to predict churn_label using region, plan_tier, total_minutes, avg_rating.

---

### ✅ Expected SQL Output
```sql
CREATE OR REPLACE MODEL `your_dataset.churn_model`
OPTIONS(model_type='logistic_reg') AS
SELECT region, plan_tier, total_minutes, avg_rating, churn_label
FROM `your_dataset.churn_features`;
```

---

### 🔍 Checkpoint  
Model appears in BigQuery under Models. Training completes.


In [None]:
%%bigquery
CREATE OR REPLACE MODEL `mgmt467project.netflix.churn_model`
OPTIONS(
  model_type = 'logistic_reg',
  input_label_cols = ['churn_label']
) AS
SELECT
  age,
  gender,
  country,
  subscription_plan,
  monthly_spend,
  household_size,
  primary_device,
  churn_label
FROM
  `mgmt467project.netflix.churn_features`;



My Reflection: Training the logistic regression model in BigQuery ML successfully linked the churn label with key customer attributes, allowing me to establish a baseline predictive model that quantifies how different factors contribute to customer churn.
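
As a rough sketch of what a logistic regression computes, the snippet below turns a weighted sum of features into a probability with the sigmoid function. The weights, intercept, and feature values are hypothetical and are not the coefficients learned by `churn_model`. BigQuery ML also preprocesses inputs (for example, encoding categorical columns) before fitting, which this sketch leaves out by using numeric inputs only.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def churn_probability(features: dict, weights: dict, intercept: float) -> float:
    """Linear score (intercept + sum of weight * value) passed through the sigmoid."""
    score = intercept + sum(weights[name] * value for name, value in features.items())
    return sigmoid(score)


# Hypothetical coefficients, chosen only for illustration.
weights = {"monthly_spend": -0.02, "household_size": -0.1}
intercept = -1.0

p = churn_probability({"monthly_spend": 15.0, "household_size": 2}, weights, intercept)
print(round(p, 4))
```

A negative weight lowers the score and therefore the predicted churn probability as that feature increases; a positive weight does the opposite.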


## Task 3: Evaluate Model

**🎯 Goal:** Evaluate the logistic regression model.  
**📌 Requirements:** Use ML.EVALUATE.

---

### 🧠 Prompt Template  
> Write a query to evaluate my logistic regression model using ML.EVALUATE.

---

### 👩‍🏫 Example Prompt  
> Evaluate the churn_model using ML.EVALUATE to get accuracy, precision, recall.

---

### ✅ Expected SQL Output
```sql
SELECT * FROM ML.EVALUATE(MODEL `your_dataset.churn_model`);
```

---

### 🔍 Checkpoint  
View performance metrics: accuracy, log_loss, precision, recall.


In [None]:
%%bigquery
SELECT *
FROM ML.EVALUATE(MODEL `mgmt467project.netflix.churn_model`);



| | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.850892 | 0.0 | 0.421135 | 0.510065 |


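My Reflection: ML.EVALUATE shows that the 0.85 accuracy is misleading. Precision, recall, and F1 are all 0.0, so the model identifies no churners at the default threshold, and a roc_auc of 0.51 is close to random ranking. The accuracy likely reflects the share of non-churners in the evaluation data rather than real predictive skill, which points to a need for better feature engineering or class balancing.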
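
The pattern in these metrics (high accuracy with zero precision and recall) is what a classifier produces when it predicts the majority class for every row. The sketch below reproduces it on a hypothetical 1,000-row sample with a 15% churn rate; the counts are made up for illustration and are not the lab's evaluation split.

```python
def binary_metrics(y_true, y_pred):
    """Compute precision, recall, and accuracy for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    # Report 0.0 when a denominator is empty instead of dividing by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "recall": recall, "accuracy": accuracy}


# Hypothetical sample: 150 churners, 850 non-churners.
y_true = [1] * 150 + [0] * 850
# A model that never predicts churn.
y_pred = [0] * 1000

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Accuracy here equals the share of non-churners, so on an imbalanced label it should be read together with precision, recall, and roc_auc.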


## Task 4: Predict Churn

**🎯 Goal:** Use ML.PREDICT to generate churn predictions.  
**📌 Requirements:** Apply model to same input table.

---

### 🧠 Prompt Template  
> Generate SQL to use ML.PREDICT on churn_model and return predictions by user_id.

---

### 👩‍🏫 Example Prompt  
> Predict churn using churn_model. Include user_id, predicted_churn_label, and prediction probability.

---

### ✅ Expected SQL Output
```sql
SELECT user_id, predicted_churn_label, predicted_churn_label_probs
FROM ML.PREDICT(MODEL `your_dataset.churn_model`,
      (SELECT * FROM `your_dataset.churn_features`));
```

---

### 🔍 Checkpoint  
Inspect top churn risk users. Validate probabilities.


In [None]:
%%bigquery
SELECT
  user_id,
  predicted_churn_label,
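  -- NOTE: [OFFSET(1)] returns a STRUCT<label, prob> chosen by array position, not by label.
  -- In the recorded output below it holds label 0, i.e. the probability of NOT churning.
  predicted_churn_label_probs[OFFSET(1)] AS churn_probability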
FROM
  ML.PREDICT(
    MODEL `mgmt467project.netflix.churn_model`,
    (
      SELECT
        user_id,               -- 👈 added here
        age,
        gender,
        country,
        subscription_plan,
        monthly_spend,
        household_size,
        primary_device,
        churn_label
      FROM `mgmt467project.netflix.cleaned_features`
    )
  );


| | user_id | predicted_churn_label | churn_probability |
|---|---|---|---|
| 0 | user_00008 | 0 | {'label': 0, 'prob': 0.8541316243804984} |
| 1 | user_00008 | 0 | {'label': 0, 'prob': 0.8541316243804984} |
| 2 | user_00024 | 0 | {'label': 0, 'prob': 0.8516794217606469} |
| 3 | user_00024 | 0 | {'label': 0, 'prob': 0.8516794217606469} |
| 4 | user_00028 | 0 | {'label': 0, 'prob': 0.8568589527278607} |
| ... | ... | ... | ... |
| 20595 | user_02460 | 0 | {'label': 0, 'prob': 0.8702391415428415} |
| 20596 | user_08221 | 0 | {'label': 0, 'prob': 0.8659344367493615} |
| 20597 | user_08221 | 0 | {'label': 0, 'prob': 0.8659344367493615} |
| 20598 | user_09179 | 0 | {'label': 0, 'prob': 0.8758004522479436} |


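My reflection: ML.PREDICT returns a predicted label and class probabilities for each user. Every row shown is predicted as label 0, and the churn_probability column actually holds the label-0 entry (the probability of staying, about 0.85 to 0.88), so the churn risk is one minus that value. Some user_ids appear twice, which suggests duplicate rows in cleaned_features. Ranking users by churn risk is what would feed retention strategy and customer segmentation.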
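
Because `predicted_churn_label_probs` is an array of label/probability pairs, picking an entry by position can return the wrong class. The sketch below selects the probability by label instead. The label-0 value comes from the first row of the recorded output; the label-1 value is assumed to be its complement, and the array order is inferred from that output rather than documented behavior.

```python
def prob_for_label(probs, label):
    """Return the probability attached to a given label, regardless of array order."""
    for entry in probs:
        if entry["label"] == label:
            return entry["prob"]
    raise KeyError(f"label {label!r} not found")


# Probability of label 0 taken from the first row of the recorded output.
p_active = 0.8541316243804984

# Assumed layout of predicted_churn_label_probs for that row.
row_probs = [
    {"label": 1, "prob": 1.0 - p_active},  # assumed complement for a binary model
    {"label": 0, "prob": p_active},
]

positional = row_probs[1]                      # what [OFFSET(1)] picked up
churn_probability = prob_for_label(row_probs, 1)  # probability of churning

print(positional["label"], round(churn_probability, 4))
```

Selecting by label keeps the column meaning stable even if the array order differs between models.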

In [None]:
%%writefile dq_checks.sql
-- =====================================================
-- MGMT 467 Data Quality (DQ) Queries - Netflix Dataset
-- Author: Anurag Koripalli
-- =====================================================
-- Task 1: Prepare Churn Features Table
CREATE OR REPLACE TABLE `mgmt467project.netflix.churn_features` AS
SELECT
  user_id,
  age,
  gender,
  country,
  subscription_plan,
  monthly_spend,
  household_size,
  primary_device,
  churn_label
FROM `mgmt467project.netflix.cleaned_features`
-- Keep only labeled rows, as required in Task 1.
WHERE churn_label IS NOT NULL;

-- Task 2: Train Logistic Regression Model
CREATE OR REPLACE MODEL `mgmt467project.netflix.churn_model`
OPTIONS(
  model_type = 'logistic_reg',
  input_label_cols = ['churn_label']
) AS
SELECT
  age,
  gender,
  country,
  subscription_plan,
  monthly_spend,
  household_size,
  primary_device,
  churn_label
FROM `mgmt467project.netflix.churn_features`;

-- Task 3: Evaluate Model
SELECT *
FROM ML.EVALUATE(MODEL `mgmt467project.netflix.churn_model`);

-- Task 4: Predict Churn
SELECT
  user_id,
  predicted_churn_label,
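  -- Select the probability for label 1 (churn) by label rather than by array position.
  (SELECT p.prob FROM UNNEST(predicted_churn_label_probs) AS p WHERE p.label = 1) AS churn_probability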
FROM ML.PREDICT(
  MODEL `mgmt467project.netflix.churn_model`,
  (
    SELECT
      user_id,
      age,
      gender,
      country,
      subscription_plan,
      monthly_spend,
      household_size,
      primary_device,
      churn_label
    FROM `mgmt467project.netflix.cleaned_features`
  )
);

-- Task 5.0: Bucket a Continuous Feature
CREATE OR REPLACE TABLE `mgmt467project.netflix.watch_time_buckets` AS
SELECT
  wh.user_id,
  SUM(wh.watch_duration_minutes) AS total_minutes,
  cf.churn_label,
  CASE
    WHEN SUM(wh.watch_duration_minutes) < 100 THEN 'Low'
    WHEN SUM(wh.watch_duration_minutes) BETWEEN 100 AND 300 THEN 'Medium'
    WHEN SUM(wh.watch_duration_minutes) > 300 THEN 'High'
    ELSE 'Unknown'
  END AS watch_time_bucket
FROM `mgmt467project.netflix.watch_history_dedup` AS wh
JOIN `mgmt467project.netflix.cleaned_features` AS cf
USING (user_id)
GROUP BY wh.user_id, cf.churn_label;

-- Task 5.1: Create Binary Flag Feature
CREATE OR REPLACE TABLE `mgmt467project.netflix.binge_flag_features` AS
SELECT
  user_id,
  total_minutes,
  churn_label,
  IF(total_minutes > 500, 1, 0) AS flag_binge
FROM `mgmt467project.netflix.watch_time_buckets`;

-- Task 5.2: Create Interaction Term
CREATE OR REPLACE TABLE `mgmt467project.netflix.plan_region_combo_features` AS
SELECT
  user_id,
  subscription_plan AS plan_tier,
  country AS region,
  churn_label,
  CONCAT(subscription_plan, '_', country) AS plan_region_combo
FROM `mgmt467project.netflix.cleaned_features`
WHERE churn_label IS NOT NULL;

-- Task 5.3: Add Missingness Flags
-- Aggregated to one row per user so later joins do not multiply rows per watch event.
CREATE OR REPLACE TABLE `mgmt467project.netflix.missingness_flags` AS
SELECT
  u.user_id,
  AVG(u.age) AS avg_age,
  AVG(wh.user_rating) AS avg_rating,
  cf.churn_label,
  IF(AVG(u.age) IS NULL, 1, 0) AS is_missing_age_band,
  IF(AVG(wh.user_rating) IS NULL, 1, 0) AS is_missing_avg_rating
FROM `mgmt467project.netflix.users` AS u
LEFT JOIN `mgmt467project.netflix.watch_history_dedup` AS wh
ON u.user_id = wh.user_id
LEFT JOIN `mgmt467project.netflix.cleaned_features` AS cf
ON u.user_id = cf.user_id
GROUP BY u.user_id, cf.churn_label;

-- Task 5.4: Create Time-Based Feature
CREATE OR REPLACE TABLE `mgmt467project.netflix.time_features` AS
SELECT
  wh.user_id,
  MAX(wh.watch_date) AS last_watch_date,
  cf.churn_label,
  DATE_DIFF(CURRENT_DATE(), MAX(wh.watch_date), DAY) AS days_since_last_watch
FROM `mgmt467project.netflix.watch_history_dedup` AS wh
JOIN `mgmt467project.netflix.cleaned_features` AS cf
USING (user_id)
GROUP BY user_id, churn_label;

-- Task 5.5: Assemble Enhanced Feature Table
CREATE OR REPLACE TABLE `mgmt467project.netflix.churn_features_enhanced` AS
SELECT
  cf.user_id,
  cf.age,
  cf.gender,
  cf.country,
  cf.subscription_plan,
  cf.monthly_spend,
  cf.household_size,
  cf.primary_device,
  cf.churn_label,
  CONCAT(cf.subscription_plan, '_', cf.country) AS plan_region_combo,
  IF(whb.total_minutes > 500, 1, 0) AS flag_binge,
  whb.watch_time_bucket,
  tf.days_since_last_watch,
  mf.is_missing_age_band,
  mf.is_missing_avg_rating
FROM `mgmt467project.netflix.cleaned_features` AS cf
LEFT JOIN `mgmt467project.netflix.watch_time_buckets` AS whb
ON cf.user_id = whb.user_id
LEFT JOIN `mgmt467project.netflix.missingness_flags` AS mf
ON cf.user_id = mf.user_id
LEFT JOIN `mgmt467project.netflix.time_features` AS tf
ON cf.user_id = tf.user_id;

-- Task 6: Retrain Model on Enhanced Features
CREATE OR REPLACE MODEL `mgmt467project.netflix.churn_model_enhanced`
OPTIONS(
  model_type = 'logistic_reg',
  input_label_cols = ['churn_label']
) AS
SELECT
  age,
  gender,
  country,
  subscription_plan,
  monthly_spend,
  household_size,
  primary_device,
  plan_region_combo,
  flag_binge,
  watch_time_bucket,
  days_since_last_watch,
  is_missing_age_band,
  is_missing_avg_rating,
  churn_label
FROM `mgmt467project.netflix.churn_features_enhanced`;

-- Task 7: Compare Model Performance
WITH base_eval AS (
  SELECT 'Base Model' AS model_name, *
  FROM ML.EVALUATE(MODEL `mgmt467project.netflix.churn_model`)
),
enhanced_eval AS (
  SELECT 'Enhanced Model' AS model_name, *
  FROM ML.EVALUATE(MODEL `mgmt467project.netflix.churn_model_enhanced`)
)
SELECT * FROM base_eval
UNION ALL
SELECT * FROM enhanced_eval;


Writing dq_checks.sql
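
Joins against `watch_history_dedup`, such as those in Tasks 5.0 to 5.5, return one row per watch event unless the query aggregates per user. The sketch below shows that fan-out and the per-user aggregation that avoids it. It uses Python's built-in SQLite as a stand-in for BigQuery, and the tables and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id TEXT, age INTEGER);
CREATE TABLE watch_history (user_id TEXT, user_rating REAL);
INSERT INTO users VALUES ('u1', 34), ('u2', NULL);
INSERT INTO watch_history VALUES ('u1', 4.0), ('u1', NULL), ('u1', 5.0);
""")

# Row-level join: u1 appears once per watch event.
row_level = conn.execute("""
SELECT u.user_id, wh.user_rating
FROM users AS u
LEFT JOIN watch_history AS wh ON u.user_id = wh.user_id
""").fetchall()

# Per-user aggregation: one row per user, flag set only when no rating exists.
per_user = conn.execute("""
SELECT u.user_id,
       AVG(wh.user_rating) AS avg_rating,
       CASE WHEN AVG(wh.user_rating) IS NULL THEN 1 ELSE 0 END AS is_missing_avg_rating
FROM users AS u
LEFT JOIN watch_history AS wh ON u.user_id = wh.user_id
GROUP BY u.user_id
ORDER BY u.user_id
""").fetchall()

print(len(row_level), per_user)
conn.close()
```

If a row-level table like this is joined into the final feature table, each user is repeated once per watch event, which inflates the training data and overweights heavy viewers.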


In [None]:
from google.colab import files
files.download("dq_checks.sql")
