<a href="https://colab.research.google.com/github/elebon26/mgmt467-analytics-portfolio/blob/main/Labs/Unit2/(Ethan_Lebon_Completed)Unit2_Lab2_PromptStudio_Tasks0to4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 🤖 MGMT 467 - Unit 2 Lab 2: Prompt Studio for AI-Assisted SQL + ML

**Date:** 2025-10-16  
**Objective:** Build and refine a complete ML pipeline for churn prediction using BigQuery — but with **Gemini-style prompts** guiding SQL generation.

You'll learn to:
- Frame SQL goals as clear prompts
- Generate, test, and debug queries with an AI assistant
- Reflect on each modeling step and your prompt design



## Task 0: Connect to BigQuery

**🎯 Goal:** Verify BigQuery access from Colab.  
**📌 Requirements:** Use `%%bigquery` to return the current date and the session user.

---

### 🧠 Prompt Template  
> Write a SQL query that returns CURRENT_DATE() and SESSION_USER(). I will run it with %%bigquery in Colab.

---

### 👩‍🏫 Example Prompt  
> Write a SQL query using BigQuery syntax that returns today’s date and the current session user.

---

### ✅ Expected SQL Output
```sql
SELECT CURRENT_DATE() AS today, SESSION_USER() AS user;
```

---

### 🔍 Checkpoint  
Query should return a single row with today's date and your user.


In [9]:
# Authenticate and load BigQuery
from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery
PROJECT_ID = "mgmt-467-1234"
bq = bigquery.Client(project=PROJECT_ID)

print("Connected to project:", PROJECT_ID)


Connected to project: mgmt-467-1234


In [10]:
query = """
SELECT
  CURRENT_DATE() AS today,
  SESSION_USER() AS user
"""
df = bq.query(query).result().to_dataframe()

# Display result
print("Today's date and session user:")
display(df)

Today's date and session user:


| | today | user |
|---|---|---|
| 0 | 2025-10-27 | ethanpurdue123@gmail.com |



## Task 1: Prepare ML Table

**🎯 Goal:** Create a clean features table for modeling churn.  
**📌 Requirements:** Use cleaned_features as source, select relevant columns, filter rows with churn_label IS NOT NULL.

---

### 🧠 Prompt Template  
> Write a query that creates a new table with columns: [region, plan_tier, age_band, ...] and churn_label from [source_table]. Filter to rows where churn_label IS NOT NULL.

---

### 👩‍🏫 Example Prompt  
> Create a BigQuery table named churn_features from cleaned_features with selected features and where churn_label IS NOT NULL.

---

### ✅ Expected SQL Output
```sql
CREATE OR REPLACE TABLE `your_dataset.churn_features` AS
SELECT region, plan_tier, age_band, avg_rating, total_minutes, churn_label
FROM `your_dataset.cleaned_features`
WHERE churn_label IS NOT NULL;
```

---

### 🔍 Checkpoint  
Table should appear in BigQuery and contain non-null labels.


In [23]:
# ============================
# Task 1 — Prepare ML Table
# Dataset: mgmt-467-1234.netflix
# Source:  feat_churn_lite  (has churn_next_month label)
# Output:  churn_features
# ============================

!pip -q install google-cloud-bigquery
from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery

PROJECT_ID = "mgmt-467-1234"
DATASET    = "netflix"
SOURCE     = f"{PROJECT_ID}.{DATASET}.feat_churn_lite"
DEST       = f"{PROJECT_ID}.{DATASET}.churn_features"

bq = bigquery.Client(project=PROJECT_ID)

# (Optional) peek schema to confirm columns
print("🔎 Using source:", SOURCE)
print([f"{f.name}:{f.field_type}" for f in bq.get_table(SOURCE).schema])

# Build churn_features:
#   region         ← country
#   plan_tier      ← subscription_plan
#   age_band       ← derived from age
#   total_minutes  ← r3_min
#   churn_label    ← churn_next_month (0/1)
create_sql = f"""
CREATE OR REPLACE TABLE `{DEST}` AS
WITH base AS (
  SELECT
    country                       AS region,
    subscription_plan             AS plan_tier,
    CASE
      WHEN age IS NULL             THEN 'unknown'
      WHEN age < 25                THEN 'under_25'
      WHEN age BETWEEN 25 AND 34   THEN '25_34'
      WHEN age BETWEEN 35 AND 44   THEN '35_44'
      WHEN age BETWEEN 45 AND 54   THEN '45_54'
      WHEN age BETWEEN 55 AND 64   THEN '55_64'
      ELSE '65_plus'
    END                            AS age_band,
    r3_min                         AS total_minutes,
    CAST(churn_next_month AS INT64) AS churn_label
  FROM `{SOURCE}`
)
SELECT *
FROM base
WHERE churn_label IS NOT NULL;
"""

print("\n🚀 Creating table:", DEST)
bq.query(create_sql).result()
print("✅ Table created")

# Sanity checks
cnt_df = bq.query(f"SELECT COUNT(*) AS row_count FROM `{DEST}`").result().to_dataframe()
print(cnt_df)

preview_df = bq.query(f"SELECT * FROM `{DEST}` LIMIT 5").result().to_dataframe()
preview_df


🔎 Using source: mgmt-467-1234.netflix.feat_churn_lite
['user_id:STRING', 'month:DATE', 'r3_sess:INTEGER', 'r3_min:FLOAT', 'unique_days_watched:INTEGER', 'avg_watch_duration:FLOAT', 'days_since_last_month_start:INTEGER', 'subscription_plan:STRING', 'country:STRING', 'age:FLOAT', 'churn_next_month:INTEGER']

🚀 Creating table: mgmt-467-1234.netflix.churn_features
✅ Table created
   row_count
0     947600


| | region | plan_tier | age_band | total_minutes | churn_label |
|---|---|---|---|---|---|
| 0 | USA | Basic | 25_34 | 0.0 | 0 |
| 1 | USA | Premium+ | 25_34 | 0.0 | 1 |
| 2 | Canada | Standard | 25_34 | 0.0 | 0 |
| 3 | Canada | Standard | 25_34 | 0.0 | 0 |
| 4 | USA | Standard | 25_34 | 0.0 | 1 |
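The `CASE` expression that derives `age_band` can be mirrored in plain Python to check its boundaries. The sketch below is illustrative only: the function name `age_band` is hypothetical, and it reproduces the SQL branch order rather than anything BigQuery runs. Because the schema lists `age` as `FLOAT`, it also shows that a fractional value such as 34.5 (if any exist in the data, which the output here does not confirm) would match none of the integer `BETWEEN` ranges and fall through to `ELSE`.

```python
from typing import Optional


def age_band(age: Optional[float]) -> str:
    """Mirror the SQL CASE branches in order (BETWEEN is inclusive on both ends)."""
    if age is None:
        return "unknown"
    if age < 25:
        return "under_25"
    for low, high in [(25, 34), (35, 44), (45, 54), (55, 64)]:
        if low <= age <= high:
            return f"{low}_{high}"
    return "65_plus"  # ELSE branch: reached by anything not matched above


# Probe the band edges, including one fractional age between two bands.
for sample in [None, 24.0, 25.0, 34.0, 34.5, 64.0, 65.0]:
    print(sample, "->", age_band(sample))
```

If fractional ages do turn out to be present, half-open comparisons such as `age < 35` would close the gaps; the same change would then be needed everywhere the bands are derived so that training and prediction stay consistent.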



## Task 2: Train Logistic Regression Model

**🎯 Goal:** Train a basic BQML logistic regression model.  
**📌 Requirements:** Use churn_features table, predict churn_label from features.

---

### 🧠 Prompt Template  
> Write a CREATE MODEL SQL for logistic regression using churn_label as label and [features] as inputs.

---

### 👩‍🏫 Example Prompt  
> Train a logistic regression model to predict churn_label using region, plan_tier, total_minutes, avg_rating.

---

### ✅ Expected SQL Output
```sql
CREATE OR REPLACE MODEL `your_dataset.churn_model`
OPTIONS(model_type='logistic_reg') AS
SELECT region, plan_tier, total_minutes, avg_rating, churn_label
FROM `your_dataset.churn_features`;
```

---

### 🔍 Checkpoint  
Model appears in BigQuery under Models. Training completes.


In [25]:
!pip -q install google-cloud-bigquery
from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery

PROJECT_ID = "mgmt-467-1234"
DATASET    = "netflix"
FEATURES   = f"{PROJECT_ID}.{DATASET}.churn_features"
MODEL_NAME = f"{PROJECT_ID}.{DATASET}.churn_logreg"

bq = bigquery.Client(project=PROJECT_ID)

train_sql = f"""
CREATE OR REPLACE MODEL `{MODEL_NAME}`
OPTIONS(
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['churn_label'],
  auto_class_weights = TRUE,
  enable_global_explain = TRUE
) AS
SELECT
  region,
  plan_tier,
  age_band,
  total_minutes,
  churn_label
FROM `{FEATURES}`;
"""

print("🚀 Training logistic regression model on churn_features …")
bq.query(train_sql).result()
print(f"✅ Model trained successfully: {MODEL_NAME}")

eval_sql = f"""
SELECT *
FROM ML.EVALUATE(MODEL `{MODEL_NAME}`);
"""
eval_df = bq.query(eval_sql).result().to_dataframe()
print("📊 Evaluation Metrics:")
display(eval_df)

weights_sql = f"""
SELECT *
FROM ML.WEIGHTS(MODEL `{MODEL_NAME}`)
ORDER BY ABS(weight) DESC
LIMIT 10;
"""
weights_df = bq.query(weights_sql).result().to_dataframe()
print("💡 Top 10 Most Influential Features:")
display(weights_df)

🚀 Training logistic regression model on churn_features …
✅ Model trained successfully: mgmt-467-1234.netflix.churn_logreg
📊 Evaluation Metrics:


| | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|
| 0 | 0.648802 | 0.456819 | 0.485677 | 0.536142 | 0.693149 | 0.494366 |


💡 Top 10 Most Influential Features:


| | processed_input | weight | category_weights |
|---|---|---|---|
| 0 | \_\_INTERCEPT\_\_ | -0.0001153022 | [] |
| 1 | total_minutes | 3.959847e-07 | [] |
| 2 | region | | [{'category': 'USA', 'weight': -0.000334828137... |
| 3 | plan_tier | | [{'category': 'Premium+', 'weight': 5.67662913... |
| 4 | age_band | | [{'category': '35_44', 'weight': -0.0005164572... |
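The weights in this output are all very close to zero, which for a logistic regression means the predicted probability stays near 0.5 whatever the inputs. The following sketch plugs the intercept and the `total_minutes` weight from the table into the standard logistic function. It is a rough illustration, not BigQuery ML's scoring code: the category weights are left out because they are truncated in the output, and `total_minutes = 0.0` is taken from the preview rows in Task 1.

```python
import math


def sigmoid(z: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


INTERCEPT = -0.0001153022        # __INTERCEPT__ weight from ML.WEIGHTS
W_TOTAL_MINUTES = 3.959847e-07   # total_minutes weight from ML.WEIGHTS

total_minutes = 0.0              # value seen in the Task 1 preview rows
# Category weights are omitted here because the output truncates them.
logit = INTERCEPT + W_TOTAL_MINUTES * total_minutes
prob_churn = sigmoid(logit)
print(f"logit={logit:.10f}  P(churn) is about {prob_churn:.5f}")
```

A logit this small leaves the probability a hair under 0.5, so the model's output carries almost no information about the inputs.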



## Task 3: Evaluate Model

**🎯 Goal:** Evaluate the logistic regression model.  
**📌 Requirements:** Use ML.EVALUATE.

---

### 🧠 Prompt Template  
> Write a query to evaluate my logistic regression model using ML.EVALUATE.

---

### 👩‍🏫 Example Prompt  
> Evaluate the churn_model using ML.EVALUATE to get accuracy, precision, recall.

---

### ✅ Expected SQL Output
```sql
SELECT * FROM ML.EVALUATE(MODEL `your_dataset.churn_model`);
```

---

### 🔍 Checkpoint  
View performance metrics: accuracy, log_loss, precision, recall.


In [26]:
# =====================================================
# Task 3 – Evaluate the Logistic Regression Churn Model
# =====================================================

!pip -q install google-cloud-bigquery
from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery

PROJECT_ID = "mgmt-467-1234"
DATASET    = "netflix"
MODEL_NAME = f"{PROJECT_ID}.{DATASET}.churn_logreg"

bq = bigquery.Client(project=PROJECT_ID)

# 🧠 Evaluate the trained model
eval_sql = f"""
SELECT *
FROM ML.EVALUATE(MODEL `{MODEL_NAME}`);
"""

print("📊 Evaluating model:", MODEL_NAME)
eval_df = bq.query(eval_sql).result().to_dataframe()
display(eval_df)


📊 Evaluating model: mgmt-467-1234.netflix.churn_logreg


| | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|
| 0 | 0.648802 | 0.456819 | 0.485677 | 0.536142 | 0.693149 | 0.494366 |
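Two quick sanity checks help when reading these metrics. The sketch below recomputes F1 from the reported precision and recall, and compares the reported log loss with ln 2, which is the log loss of a model that predicts 0.5 for every row. The numbers are copied from the evaluation output; the variable names are illustrative.

```python
import math

# Values copied from the ML.EVALUATE output.
precision, recall = 0.648802, 0.456819
log_loss, roc_auc = 0.693149, 0.494366

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 recomputed: {f1:.6f}")

# A classifier that always outputs probability 0.5 has log loss ln(2).
coin_flip_log_loss = math.log(2)
print(f"ln(2) = {coin_flip_log_loss:.6f}, reported log_loss = {log_loss}")

# ROC AUC of 0.5 corresponds to random ranking.
print("ROC AUC above chance (0.5)?", roc_auc > 0.5)
```

A log loss this close to ln 2, together with a ROC AUC just under 0.5, suggests the model is not separating churners from non-churners with these features.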



## Task 4: Predict Churn

**🎯 Goal:** Use ML.PREDICT to generate churn predictions.  
**📌 Requirements:** Apply model to same input table.

---

### 🧠 Prompt Template  
> Generate SQL to use ML.PREDICT on churn_model and return predictions by user_id.

---

### 👩‍🏫 Example Prompt  
> Predict churn using churn_model. Include user_id, predicted_churn_label, and prediction probability.

---

### ✅ Expected SQL Output
```sql
SELECT user_id, predicted_churn_label, predicted_churn_label_probs
FROM ML.PREDICT(MODEL `your_dataset.churn_model`,
      (SELECT * FROM `your_dataset.churn_features`));
```

---

### 🔍 Checkpoint  
Inspect top churn risk users. Validate probabilities.


In [35]:
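# Task 4: Predict Churn (minimal, matches expected SQL)
# Note: this query selects user_id, so churn_features must contain a user_id
# column; the Task 1 cell shown earlier does not select one.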
!pip -q install google-cloud-bigquery
from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery

PROJECT_ID = "mgmt-467-1234"
DATASET    = "netflix"
MODEL_NAME = f"{PROJECT_ID}.{DATASET}.churn_logreg"
FEATURES   = f"{PROJECT_ID}.{DATASET}.churn_features"

bq = bigquery.Client(project=PROJECT_ID)

sql = f"""
SELECT
  user_id,
  predicted_churn_label,
  predicted_churn_label_probs
FROM ML.PREDICT(
  MODEL `{MODEL_NAME}`,
  (SELECT * FROM `{FEATURES}`)
);
"""

pred_df = bq.query(sql).result().to_dataframe()
pred_df.head(20)  # show a quick sample


| | user_id | predicted_churn_label | predicted_churn_label_probs |
|---|---|---|---|
| 0 | user_00001 | 0 | [{'label': 1, 'prob': 0.49977893331075185}, {'... |
| 1 | user_00001 | 0 | [{'label': 1, 'prob': 0.4997489968752844}, {'l... |
| 2 | user_00001 | 0 | [{'label': 1, 'prob': 0.49976535103889536}, {'... |
| 3 | user_00001 | 0 | [{'label': 1, 'prob': 0.4997489968752844}, {'l... |
| 4 | user_00001 | 0 | [{'label': 1, 'prob': 0.4997489968752844}, {'l... |
| 5 | user_00001 | 0 | [{'label': 1, 'prob': 0.49976535103889536}, {'... |
| 6 | user_00001 | 0 | [{'label': 1, 'prob': 0.49980118764609344}, {'... |
| 7 | user_00001 | 0 | [{'label': 1, 'prob': 0.4997489968752844}, {'l... |
| 8 | user_00001 | 0 | [{'label': 1, 'prob': 0.49977374591242313}, {'... |
| 9 | user_00001 | 0 | [{'label': 1, 'prob': 0.49979093164447164}, {'... |
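Each `predicted_churn_label_probs` cell in this output is a sequence of `{'label': ..., 'prob': ...}` entries, one per class. A small helper can pull out the probability for the positive class on the Python side. This is a sketch under assumptions: `prob_for_label` is a hypothetical name, and the sample row is made up for illustration, since the second entry of each cell is not visible in the truncated output.

```python
from typing import Any, Dict, Iterable, Optional


def prob_for_label(probs: Iterable[Dict[str, Any]], label: int = 1) -> Optional[float]:
    """Return the probability attached to `label`, or None if it is absent."""
    for entry in probs:
        if int(entry["label"]) == label:
            return float(entry["prob"])
    return None


# Illustrative row shaped like the output; the values are made up.
sample_probs = [{"label": 1, "prob": 0.4998}, {"label": 0, "prob": 0.5002}]
p_churn = prob_for_label(sample_probs, label=1)
print(p_churn)
```

On a real result, something like `pred_df["predicted_churn_label_probs"].apply(prob_for_label)` should yield one churn probability per row, assuming the cells iterate as dictionaries the way the printed output suggests.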


In [38]:
# Task 4 — Predict Churn (latest row per user + prob(label=1))
from google.cloud import bigquery

PROJECT_ID = "mgmt-467-1234"
DATASET    = "netflix"
MODEL_NAME = f"{PROJECT_ID}.{DATASET}.churn_logreg"
FEATURE_SRC = f"{PROJECT_ID}.{DATASET}.feat_churn_lite" # Use the source table with 'month'

bq = bigquery.Client(project=PROJECT_ID)

sql_latest = f"""
WITH latest_features AS (
  SELECT
    user_id,
    month, -- Include month for QUALIFY
    country AS region,
    subscription_plan AS plan_tier,
    CASE
      WHEN age IS NULL THEN 'unknown'
      WHEN age < 25 THEN 'under_25'
      WHEN age BETWEEN 25 AND 34 THEN '25_34'
      WHEN age BETWEEN 35 AND 44 THEN '35_44'
      WHEN age BETWEEN 45 AND 54 THEN '45_54'
      WHEN age BETWEEN 55 AND 64 THEN '55_64'
      ELSE '65_plus'
    END AS age_band,
    r3_min AS total_minutes
  FROM `{FEATURE_SRC}`
  QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY month DESC) = 1
)
SELECT
  p.user_id,
  p.predicted_churn_label,
  (
    SELECT probs.prob
    FROM UNNEST(p.predicted_churn_label_probs) AS probs
    WHERE CAST(probs.label AS INT64) = 1
  ) AS predicted_churn_label_prob
-- ML.PREDICT passes its input columns (including user_id) through to the output,
-- so no join back to latest_features is needed. JOIN ... ON TRUE is a cross join
-- and would pair every prediction with every user.
FROM ML.PREDICT(MODEL `{MODEL_NAME}`, (SELECT * FROM latest_features)) AS p
ORDER BY predicted_churn_label_prob DESC
LIMIT 20;
"""

print("🔮 Generating churn predictions for the latest month (top 20 by probability)…")
pred_top_df = bq.query(sql_latest).result().to_dataframe()
pred_top_df

🔮 Generating churn predictions for the latest month (top 20 by probability)…


| | user_id | predicted_churn_label | predicted_churn_label_prob |
|---|---|---|---|
| 0 | user_06504 | 1 | 0.501068 |
| 1 | user_03067 | 1 | 0.501068 |
| 2 | user_01551 | 1 | 0.501068 |
| 3 | user_07444 | 1 | 0.501068 |
| 4 | user_09748 | 1 | 0.501068 |
| 5 | user_01786 | 1 | 0.501068 |
| 6 | user_08123 | 1 | 0.501068 |
| 7 | user_08715 | 1 | 0.501068 |
| 8 | user_01583 | 1 | 0.501068 |
| 9 | user_04124 | 1 | 0.501068 |
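The `QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY month DESC) = 1` clause in the query keeps only the most recent month for each user. The same idea in plain Python, on made-up rows, looks roughly like this (the function name `latest_per_user` and the sample data are illustrative, not taken from the dataset):

```python
from datetime import date
from typing import Dict, List


def latest_per_user(rows: List[dict]) -> List[dict]:
    """Keep the row with the greatest `month` for each `user_id`."""
    latest: Dict[str, dict] = {}
    for row in rows:
        current = latest.get(row["user_id"])
        if current is None or row["month"] > current["month"]:
            latest[row["user_id"]] = row
    return list(latest.values())


# Made-up rows: u1 has two months, u2 has one.
rows = [
    {"user_id": "u1", "month": date(2025, 1, 1)},
    {"user_id": "u1", "month": date(2025, 3, 1)},
    {"user_id": "u2", "month": date(2025, 2, 1)},
]
result = latest_per_user(rows)
print(result)
```

A useful check on the real query result is that each `user_id` appears at most once and that the row count matches the number of distinct users.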
