
# 🤖 MGMT 467 - Unit 2 Lab 2: Prompt Studio for AI-Assisted SQL + ML

**Date:** 2025-10-16  
**Objective:** Build and refine a complete ML pipeline for churn prediction using BigQuery — but with **Gemini-style prompts** guiding SQL generation.

You'll learn to:
- Frame SQL goals as clear prompts
- Generate, test, and debug queries with an AI assistant
- Reflect on each modeling step and your prompt design


In [1]:
from google.colab import auth
auth.authenticate_user()



## Task 0: Connect to BigQuery

**🎯 Goal:** Verify BigQuery access from Colab.  
**📌 Requirements:** Use `%%bigquery`, get current date and user session.

---

### 🧠 Prompt Template  
> Write a SQL query that returns CURRENT_DATE() and SESSION_USER(). I will run it with %%bigquery in Colab.

---

### 👩‍🏫 Example Prompt  
> Write a SQL query using BigQuery syntax that returns today’s date and the current session user.

---

### ✅ Expected SQL Output
```sql
SELECT CURRENT_DATE() AS today, SESSION_USER() AS user;
```

---

### 🔍 Checkpoint  
Query should return a single row with today's date and your user.


In [4]:
# prompt: Write a SQL query that returns CURRENT_DATE() and SESSION_USER(). I will run it with %%bigquery in Colab.

# Make sure the BigQuery extension is loaded
%load_ext google.cloud.bigquery

# Replace with your GCP project ID
project_id = "mgmt-467-471819"



In [5]:
# Run the BigQuery query with explicit project parameter
%%bigquery --project $project_id
SELECT
  CURRENT_DATE() AS today,
  SESSION_USER() AS user;


Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,today,user
0,2025-10-27,ethanquinn504@gmail.com


In [6]:
# prompt: give me the code to connect my bigquery netflix dataset to here

# The following code is for demonstration purposes only.
# Replace `your-project-id` with your actual project ID.
# Replace `your-dataset-id` with your actual dataset ID.
# Replace `your-table-id` with your actual table ID.
# For example, if your dataset is named `netflix` and your table is named `titles`,
# you would use `your-project-id.netflix.titles`.
project_id = "mgmt-467-471819"
table_id = "netflix"

# Construct the full table reference
full_table_ref = f"{project_id}.{table_id}"

# You can now use this `full_table_ref` in your %%bigquery queries.
# For example:
# %%bigquery
# SELECT * FROM `{full_table_ref}` LIMIT 5;

In [8]:
project_id = "mgmt-467-471819"
dataset_id = "netflix"

In [11]:
%%bigquery --project $project_id
SELECT
  table_name,
  table_type
FROM
  `mgmt-467-471819.netflix.INFORMATION_SCHEMA.TABLES`;



Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,table_name,table_type
0,users,BASE TABLE
1,reviews,BASE TABLE
2,watch_history,BASE TABLE
3,watch_history_dedup,BASE TABLE
4,search_logs,BASE TABLE
5,recommendation_logs,BASE TABLE
6,movies,BASE TABLE
7,watch_history_robust,BASE TABLE


In [14]:
# prompt: tell me the schema for each table

%%bigquery --project mgmt-467-471819
SELECT
  table_name,
  column_name,
  data_type
FROM
  `mgmt-467-471819.netflix.INFORMATION_SCHEMA.COLUMNS`
ORDER BY
  table_name,
  ordinal_position;



Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,table_name,column_name,data_type
0,movies,movie_id,STRING
1,movies,title,STRING
2,movies,content_type,STRING
3,movies,genre_primary,STRING
4,movies,genre_secondary,STRING
...,...,...,...
100,watch_history_robust,quality,STRING
101,watch_history_robust,location_country,STRING
102,watch_history_robust,is_download,BOOL
103,watch_history_robust,user_rating,INT64


In [15]:
# prompt: tell me which tables and features are needed to predict churn

print("To predict churn, we typically need tables that contain customer information and their interaction history with the service. Key features would include:")
print("\n1. Customer Demographics:")
print("   - Age, gender, location, subscription tier, signup date.")
print("   - Table: Likely a 'customers' or 'users' table.")
print("\n2. Service Usage and Activity:")
print("   - Frequency of use, session duration, features used, content consumed (if applicable).")
print("   - Tables: 'usage_logs', 'activity_stream', 'viewing_history'.")
print("\n3. Billing and Subscription Information:")
print("   - Payment method, billing cycle, contract length, payment history, discounts applied.")
print("   - Tables: 'subscriptions', 'billing_records'.")
print("\n4. Customer Support Interactions:")
print("   - Number of support tickets, type of issues, resolution time, customer satisfaction scores.")
print("   - Tables: 'support_tickets', 'customer_feedback'.")
print("\n5. Churn Status (Target Variable):")
print("   - A flag indicating whether a customer has churned or not.")
print("   - Table: This might be a separate 'churn_status' table or a column within the 'customers' table, often derived from subscription end dates.")
print("\nBased on the provided schema information, the 'netflix' dataset seems to contain tables that could be relevant. Specifically:")
print("- Tables like 'titles' might indirectly inform usage patterns if we can link them to user activity.")
print("- We would need to explore other tables within the 'netflix' dataset (if available and not shown in the schema output) that directly relate to customer accounts, their interactions, and subscription details to build a comprehensive churn prediction model.")


To predict churn, we typically need tables that contain customer information and their interaction history with the service. Key features would include:

1. Customer Demographics:
   - Age, gender, location, subscription tier, signup date.
   - Table: Likely a 'customers' or 'users' table.

2. Service Usage and Activity:
   - Frequency of use, session duration, features used, content consumed (if applicable).
   - Tables: 'usage_logs', 'activity_stream', 'viewing_history'.

3. Billing and Subscription Information:
   - Payment method, billing cycle, contract length, payment history, discounts applied.
   - Tables: 'subscriptions', 'billing_records'.

4. Customer Support Interactions:
   - Number of support tickets, type of issues, resolution time, customer satisfaction scores.
   - Tables: 'support_tickets', 'customer_feedback'.

5. Churn Status (Target Variable):
   - A flag indicating whether a customer has churned or not.
   - Table: This might be a separate 'churn_status' table or 


## Task 1: Prepare ML Table

**🎯 Goal:** Create a clean features table for modeling churn.  
**📌 Requirements:** Use cleaned_features as source, select relevant columns, filter rows with churn_label IS NOT NULL.

---

### 🧠 Prompt Template  
> Write a query that creates a new table with columns: [region, plan_tier, age_band, ...] and churn_label from [source_table]. Filter to rows where churn_label IS NOT NULL.

---

### 👩‍🏫 Example Prompt  
> Create a BigQuery table named churn_features from cleaned_features with selected features and where churn_label IS NOT NULL.

---

### ✅ Expected SQL Output
```sql
CREATE OR REPLACE TABLE `your_dataset.churn_features` AS
SELECT region, plan_tier, age_band, avg_rating, total_minutes, churn_label
FROM `your_dataset.cleaned_features`
WHERE churn_label IS NOT NULL;
```

---

### 🔍 Checkpoint  
Table should appear in BigQuery and contain non-null labels.


In [19]:
# prompt: build a simple table with minimal features to understand how churn prediction works. Do not use from cleaned_features, since that does not exist anywhere. churn_label also does not exist so do not use it. only use what you generated about the netflix dataset above, as in the table names that you found and not made up ones

%%bigquery --project mgmt-467-471819
CREATE OR REPLACE TABLE `mgmt-467-471819.netflix.churn_features` AS
SELECT
  u.user_id,
  u.country,
  u.subscription_plan,
  u.gender,
  u.age,
  u.primary_device,
  u.household_size,
  COUNT(DISTINCT w.movie_id) AS total_movies_watched,
  AVG(w.progress_percentage) AS avg_progress,
  SUM(w.watch_duration_minutes) AS total_minutes,
  COUNT(DISTINCT s.search_id) AS total_searches,
  COUNT(DISTINCT r.review_id) AS total_reviews,
  IF(DATE_DIFF(CURRENT_DATE(), u.subscription_start_date, DAY) > 365, 1, 0) AS churn_label
FROM
  `mgmt-467-471819.netflix.users` AS u
LEFT JOIN
  `mgmt-467-471819.netflix.watch_history_robust` AS w
ON
  u.user_id = w.user_id
LEFT JOIN
  `mgmt-467-471819.netflix.search_logs` AS s
ON
  u.user_id = s.user_id
LEFT JOIN
  `mgmt-467-471819.netflix.reviews` AS r
ON
  u.user_id = r.user_id
GROUP BY
  u.user_id, u.country, u.subscription_plan, u.gender, u.age,
  u.primary_device, u.household_size, u.subscription_start_date
LIMIT 100;



Query is running:   0%|          |


## Task 2: Train Logistic Regression Model

**🎯 Goal:** Train a basic BQML logistic regression model.  
**📌 Requirements:** Use churn_features table, predict churn_label from features.

---

### 🧠 Prompt Template  
> Write a CREATE MODEL SQL for logistic regression using churn_label as label and [features] as inputs.

---

### 👩‍🏫 Example Prompt  
> Train a logistic regression model to predict churn_label using region, plan_tier, total_minutes, avg_rating.

---

### ✅ Expected SQL Output
```sql
CREATE OR REPLACE MODEL `your_dataset.churn_model`
OPTIONS(model_type='logistic_reg') AS
SELECT region, plan_tier, total_minutes, avg_rating, churn_label
FROM `your_dataset.churn_features`;
```

---

### 🔍 Checkpoint  
Model appears in BigQuery under Models. Training completes.


In [21]:
%%bigquery --project mgmt-467-471819
CREATE OR REPLACE MODEL `mgmt-467-471819.netflix.churn_model`
OPTIONS(
  model_type = 'logistic_reg',
  input_label_cols = ['churn_label']
) AS
SELECT
  country,
  subscription_plan,
  gender,
  age,
  primary_device,
  household_size,
  total_movies_watched,
  avg_progress,
  total_minutes,
  total_searches,
  total_reviews,
  churn_label
FROM
  `mgmt-467-471819.netflix.churn_features`;



Query is running:   0%|          |


## Task 3: Evaluate Model

**🎯 Goal:** Evaluate the logistic regression model.  
**📌 Requirements:** Use ML.EVALUATE.

---

### 🧠 Prompt Template  
> Write a query to evaluate my logistic regression model using ML.EVALUATE.

---

### 👩‍🏫 Example Prompt  
> Evaluate the churn_model using ML.EVALUATE to get accuracy, precision, recall.

---

### ✅ Expected SQL Output
```sql
SELECT * FROM ML.EVALUATE(MODEL `your_dataset.churn_model`);
```

---

### 🔍 Checkpoint  
View performance metrics: accuracy, log_loss, precision, recall.


In [22]:
%%bigquery --project mgmt-467-471819
SELECT *
FROM ML.EVALUATE(MODEL `mgmt-467-471819.netflix.churn_model`);


Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,precision,recall,accuracy,f1_score,log_loss,roc_auc
0,0.760417,0.986486,0.76,0.858824,0.494476,0.755575



## Task 4: Predict Churn

**🎯 Goal:** Use ML.PREDICT to generate churn predictions.  
**📌 Requirements:** Apply model to same input table.

---

### 🧠 Prompt Template  
> Generate SQL to use ML.PREDICT on churn_model and return predictions by user_id.

---

### 👩‍🏫 Example Prompt  
> Predict churn using churn_model. Include user_id, predicted_churn_label, and prediction probability.

---

### ✅ Expected SQL Output
```sql
SELECT user_id, predicted_churn_label, predicted_churn_label_probs
FROM ML.PREDICT(MODEL `your_dataset.churn_model`,
      (SELECT * FROM `your_dataset.churn_features`));
```

---

### 🔍 Checkpoint  
Inspect top churn risk users. Validate probabilities.


In [23]:
%%bigquery --project mgmt-467-471819
SELECT
  user_id,
  predicted_churn_label,
  predicted_churn_label_probs AS churn_probability
FROM
  ML.PREDICT(
    MODEL `mgmt-467-471819.netflix.churn_model`,
    (
      SELECT *
      FROM `mgmt-467-471819.netflix.churn_features`
    )
  );


Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,user_id,predicted_churn_label,churn_probability
0,user_06018,1,"[{'label': 1, 'prob': 0.8866245400521383}, {'l..."
1,user_04365,1,"[{'label': 1, 'prob': 0.6615704959618837}, {'l..."
2,user_05156,1,"[{'label': 1, 'prob': 0.8462067721737059}, {'l..."
3,user_07168,1,"[{'label': 1, 'prob': 0.8237881723136039}, {'l..."
4,user_06769,1,"[{'label': 1, 'prob': 0.808682109531796}, {'la..."
...,...,...,...
95,user_03012,1,"[{'label': 1, 'prob': 0.9094382898185381}, {'l..."
96,user_08765,1,"[{'label': 1, 'prob': 0.8188085885896399}, {'l..."
97,user_02051,0,"[{'label': 1, 'prob': 0.4385902698951316}, {'l..."
98,user_09812,1,"[{'label': 1, 'prob': 0.6636332776942119}, {'l..."
