

# 🤖 MGMT 467 - Unit 2 Lab 2: Prompt Studio for AI-Assisted SQL + ML

**Date:** 2025-10-16  
**Objective:** Build and refine a complete ML pipeline for churn prediction using BigQuery, with **Gemini-style prompts** guiding SQL generation.

You'll learn to:
- Frame SQL goals as clear prompts
- Generate, test, and debug queries with an AI assistant
- Reflect on each modeling step and your prompt design



## Task 0: Connect to BigQuery

**🎯 Goal:** Verify BigQuery access from Colab.  
**📌 Requirements:** Use `%%bigquery`, get current date and user session.

---

### 🧠 Prompt Template  
> Write a SQL query that returns CURRENT_DATE() and SESSION_USER(). I will run it with %%bigquery in Colab.

---

### 👩‍🏫 Example Prompt  
> Write a SQL query using BigQuery syntax that returns today’s date and the current session user.

---

### ✅ Expected SQL Output
```sql
SELECT CURRENT_DATE() AS today, SESSION_USER() AS user;
```

---

### 🔍 Checkpoint  
Query should return a single row with today's date and your user.


In [None]:
# prompt: Using BigQuery, test connection by writing a SQL query to get current date and user session.

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

query = """
SELECT CURRENT_DATE() AS today, SESSION_USER() AS user;
"""

query_job = client.query(query)  # Make an API request.
results = query_job.result()  # Wait for the job to complete.

for row in results:
  print(f"Today: {row.today}, User: {row.user}")

Today: 2025-10-26, User: bjrodarmel@gmail.com



## Task 1: Prepare ML Table

**🎯 Goal:** Create a clean features table for modeling churn.  
**📌 Requirements:** Use cleaned_features as source, select relevant columns, filter rows with churn_label IS NOT NULL.

---

### 🧠 Prompt Template  
> Write a query that creates a new table with columns: [region, plan_tier, age_band, ...] and churn_label from [source_table]. Filter to rows where churn_label IS NOT NULL.

---

### 👩‍🏫 Example Prompt  
> Create a BigQuery table named churn_features from cleaned_features with selected features and where churn_label IS NOT NULL.

---

### ✅ Expected SQL Output
```sql
CREATE OR REPLACE TABLE `your_dataset.churn_features` AS
SELECT region, plan_tier, age_band, avg_rating, total_minutes, churn_label
FROM `your_dataset.cleaned_features`
WHERE churn_label IS NOT NULL;
```

---

### 🔍 Checkpoint  
Table should appear in BigQuery and contain non-null labels.


In [None]:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# NOTE: Ensure the dataset `netflix` exists within your `mgmt-467` project.
# If it doesn't, you can create it manually in the BigQuery UI.
query = """
CREATE OR REPLACE TABLE `mgmt-467.netflix.churn_features` AS
SELECT
    region,
    plan_tier,
    age_band,
    avg_rating,
    total_minutes,
    churn_label
FROM
    `mgmt-467.netflix.cleaned_features`
WHERE
    churn_label IS NOT NULL;
"""

query_job = client.query(query)  # Make an API request.
print(f"Job {query_job.job_id} started.")
query_job.result()  # Wait for the job to complete.
print("Table `mgmt-467.netflix.churn_features` created successfully.")

Job 1c251122-c5ac-4471-8227-5f162ed47a4e started.


NotFound: 404 Not found: Table mgmt-467:netflix.cleaned_features was not found in location US; reason: notFound, message: Not found: Table mgmt-467:netflix.cleaned_features was not found in location US

Location: US
Job ID: 1c251122-c5ac-4471-8227-5f162ed47a4e


# Task
Define the schema for the `mgmt-467.netflix.cleaned_features` table, including how to derive `region`, `plan_tier`, `age_band`, `avg_rating`, `total_minutes`, and `churn_label` from the `movies`, `recommendation_logs`, `reviews`, `search_log`, `users`, and `watch_history` tables. Then, provide the SQL query to create this `cleaned_features` table.

## Define cleaned_features Schema and Joins

### Subtask:
Explicitly define how each target feature (region, plan_tier, age_band, avg_rating, total_minutes, churn_label) will be derived from the source tables (movies, recommendation_logs, reviews, search_log, users, watch_history).


```markdown
### Feature Derivation Details

For the `cleaned_features` table, the target features will be derived from the following hypothetical source tables and columns:

*   **users**: `user_id`, `region`, `plan_tier`, `age`, `churn_label` (or `subscription_status`)
*   **reviews**: `user_id`, `rating`
*   **watch_history**: `user_id`, `duration_minutes`

Here's how each target feature is derived:

1.  **`region`**
    *   **Source Column(s):** `users.region`
    *   **Join Conditions:** No direct join needed if `users` is the base table for selection. If not, join with `users` on `user_id`.
    *   **Transformation Logic:** Direct selection.

2.  **`plan_tier`**
    *   **Source Column(s):** `users.plan_tier`
    *   **Join Conditions:** No direct join needed if `users` is the base table for selection. If not, join with `users` on `user_id`.
    *   **Transformation Logic:** Direct selection.

3.  **`age_band`**
    *   **Source Column(s):** `users.age`
    *   **Join Conditions:** No direct join needed if `users` is the base table for selection. If not, join with `users` on `user_id`.
    *   **Transformation Logic:** Categorization of `users.age` into bands. For example:
        ```sql
        CASE
            WHEN users.age < 18 THEN '0-17'
            WHEN users.age BETWEEN 18 AND 24 THEN '18-24'
            WHEN users.age BETWEEN 25 AND 34 THEN '25-34'
            WHEN users.age BETWEEN 35 AND 44 THEN '35-44'
            WHEN users.age BETWEEN 45 AND 54 THEN '45-54'
            WHEN users.age BETWEEN 55 AND 64 THEN '55-64'
            ELSE '65+'
        END AS age_band
        ```

4.  **`avg_rating`**
    *   **Source Column(s):** `reviews.rating`
    *   **Join Conditions:** `users` `LEFT JOIN` `reviews` `ON users.user_id = reviews.user_id`
    *   **Transformation Logic:** Average the `rating` values given by each user, producing one row per user. Users with no reviews get `NULL` for `avg_rating`, which can be kept as `NULL` or imputed (for example, to 0).
        ```sql
        AVG(reviews.rating) AS avg_rating  -- aggregated with GROUP BY user_id
        ```

5.  **`total_minutes`**
    *   **Source Column(s):** `watch_history.duration_minutes`
    *   **Join Conditions:** `users` `LEFT JOIN` `watch_history` `ON users.user_id = watch_history.user_id`
    *   **Transformation Logic:** Sum `duration_minutes` over all items watched by each user, producing one row per user. Users with no watch history get `NULL` for `total_minutes`, which can be kept as `NULL` or imputed (for example, to 0).
        ```sql
        SUM(watch_history.duration_minutes) AS total_minutes  -- aggregated with GROUP BY user_id
        ```

6.  **`churn_label`**
    *   **Source Column(s):** `users.churn_label` (assuming a direct column, or `users.subscription_status` for derivation)
    *   **Join Conditions:** No direct join needed if `users` is the base table for selection. If not, join with `users` on `user_id`.
    *   **Transformation Logic:** Direct selection of `users.churn_label`. (If `churn_label` is to be derived from `subscription_status`, it would be `CASE WHEN users.subscription_status = 'churned' THEN 1 ELSE 0 END AS churn_label`).

**Overall Join Strategy:**
The `users` table will serve as the central table. `reviews` and `watch_history` tables will be joined to the `users` table using `user_id` to aggregate user-specific metrics. A `LEFT JOIN` is preferred to retain all users, even those without reviews or watch history.

```
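
The `age_band` CASE expression above can be checked outside BigQuery. The sketch below mirrors it in plain Python and compares it with a half-open variant. Both function names are hypothetical and are not part of the lab. It assumes that `age` may be fractional or missing, which is plausible because the `users` schema retrieved later in this notebook reports `age` as FLOAT64, but the notebook does not confirm that such values actually occur.

```python
from typing import Optional


def age_band_as_written(age: Optional[float]) -> str:
    """Mirror the CASE expression above, including SQL NULL semantics."""
    if age is None:
        # In SQL every WHEN comparison against NULL is not true, so ELSE fires.
        return "65+"
    if age < 18:
        return "0-17"
    for low, high in [(18, 24), (25, 34), (35, 44), (45, 54), (55, 64)]:
        if low <= age <= high:  # BETWEEN is inclusive on both ends
            return f"{low}-{high}"
    return "65+"


def age_band_half_open(age: Optional[float]) -> str:
    """Hypothetical alternative: half-open ranges plus an explicit unknown band."""
    if age is None:
        return "unknown"
    upper_bounds = [
        (18, "0-17"), (25, "18-24"), (35, "25-34"),
        (45, "35-44"), (55, "45-54"), (65, "55-64"),
    ]
    for upper, label in upper_bounds:
        if age < upper:
            return label
    return "65+"


for age in (17, 24, 24.5, 70, None):
    print(age, age_band_as_written(age), age_band_half_open(age))
```

With the expression as written, a fractional age such as 24.5 and a missing age both fall through to `'65+'`. If either case exists in the data, comparisons of the form `age < 25` and a separate `age IS NULL` branch would avoid the mislabeling.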

## Create cleaned_features Table

### Subtask:
Generate and execute the SQL query to create the `mgmt-467.netflix.cleaned_features` table based on the defined schema and join logic.


**Reasoning**:
The goal is to create the `mgmt-467.netflix.cleaned_features` table. This requires constructing a SQL query that joins the `users`, `reviews`, and `watch_history` tables, calculates `age_band`, `avg_rating`, and `total_minutes`, and then executing this query using the BigQuery client.
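
One reason to aggregate `reviews` and `watch_history` separately before joining is that joining both detail tables directly to `users` multiplies rows. The sketch below illustrates this with Python's built-in `sqlite3` module and a tiny in-memory dataset. The rows and the simplified three-table schema are invented for illustration, and SQLite stands in for BigQuery only because the join and aggregation semantics shown here are standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id TEXT);
CREATE TABLE reviews (user_id TEXT, rating REAL);
CREATE TABLE watch_history (user_id TEXT, duration_minutes REAL);
INSERT INTO users VALUES ('u1'), ('u2');
INSERT INTO reviews VALUES ('u1', 4), ('u1', 2);
INSERT INTO watch_history VALUES ('u1', 10), ('u1', 20), ('u1', 30);
""")

# Naive approach: join both detail tables, then aggregate.
# u1 has 2 reviews x 3 sessions = 6 joined rows, so SUM is inflated.
naive = conn.execute("""
SELECT u.user_id, AVG(r.rating), SUM(w.duration_minutes)
FROM users AS u
LEFT JOIN reviews AS r ON u.user_id = r.user_id
LEFT JOIN watch_history AS w ON u.user_id = w.user_id
GROUP BY u.user_id
ORDER BY u.user_id
""").fetchall()

# Pre-aggregated approach: one row per user in each CTE, then join.
pre_aggregated = conn.execute("""
WITH ar AS (
  SELECT user_id, AVG(rating) AS avg_rating
  FROM reviews GROUP BY user_id
),
tm AS (
  SELECT user_id, SUM(duration_minutes) AS total_minutes
  FROM watch_history GROUP BY user_id
)
SELECT u.user_id,
       COALESCE(ar.avg_rating, 0),
       COALESCE(tm.total_minutes, 0)
FROM users AS u
LEFT JOIN ar ON u.user_id = ar.user_id
LEFT JOIN tm ON u.user_id = tm.user_id
ORDER BY u.user_id
""").fetchall()

print("naive:         ", naive)
print("pre-aggregated:", pre_aggregated)
```

In the naive result, `u1` shows 120 total minutes instead of 60, because each watch session is counted once per review. The pre-aggregated form keeps one row per `user_id` on the right side of each join, so the totals stay correct as long as `users` itself has one row per user.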



In [None]:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

query_create_cleaned_features = """
CREATE OR REPLACE TABLE `mgmt-467.netflix.cleaned_features` AS
WITH
  avg_ratings_cte AS (
    SELECT
      user_id,
      AVG(rating) AS avg_rating
    FROM
      `mgmt-467.netflix.reviews`
    GROUP BY
      user_id
  ),
  total_minutes_cte AS (
    SELECT
      user_id,
      SUM(duration_minutes) AS total_minutes
    FROM
      `mgmt-467.netflix.watch_history`
    GROUP BY
      user_id
  )
SELECT
  u.user_id,
  u.region,
  u.plan_tier,
  CASE
    WHEN u.age < 18 THEN '0-17'
    WHEN u.age BETWEEN 18 AND 24 THEN '18-24'
    WHEN u.age BETWEEN 25 AND 34 THEN '25-34'
    WHEN u.age BETWEEN 35 AND 44 THEN '35-44'
    WHEN u.age BETWEEN 45 AND 54 THEN '45-54'
    WHEN u.age BETWEEN 55 AND 64 THEN '55-64'
    ELSE '65+'
  END AS age_band,
  COALESCE(ar.avg_rating, 0) AS avg_rating, -- Default to 0 if no reviews
  COALESCE(tm.total_minutes, 0) AS total_minutes, -- Default to 0 if no watch history
  u.churn_label
FROM
  `mgmt-467.netflix.users` AS u
LEFT JOIN
  avg_ratings_cte AS ar
ON
  u.user_id = ar.user_id
LEFT JOIN
  total_minutes_cte AS tm
ON
  u.user_id = tm.user_id;
"""

try:
    query_job = client.query(query_create_cleaned_features)  # Make an API request.
    print(f"Job {query_job.job_id} started to create cleaned_features table.")
    query_job.result()  # Wait for the job to complete.
    print("Table `mgmt-467.netflix.cleaned_features` created successfully.")
except Exception as e:
    print(f"Error creating table `mgmt-467.netflix.cleaned_features`: {e}")

Job c39ea640-256a-4069-8d0a-9873bc418afb started to create cleaned_features table.
Error creating table `mgmt-467.netflix.cleaned_features`: 400 Unrecognized name: duration_minutes at [16:11]; reason: invalidQuery, location: query, message: Unrecognized name: duration_minutes at [16:11]

Location: US
Job ID: c39ea640-256a-4069-8d0a-9873bc418afb



**Reasoning**:
The previous code execution failed because the `duration_minutes` column was not found in the `watch_history` table, indicating an incorrect column name in the SQL query. I will update the query to use a more plausible column name, `watch_duration`, for the total minutes calculation. If this still fails, the actual schema for `watch_history` will need to be verified.



In [None]:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

query_create_cleaned_features = """
CREATE OR REPLACE TABLE `mgmt-467.netflix.cleaned_features` AS
WITH
  avg_ratings_cte AS (
    SELECT
      user_id,
      AVG(rating) AS avg_rating
    FROM
      `mgmt-467.netflix.reviews`
    GROUP BY
      user_id
  ),
  total_minutes_cte AS (
    SELECT
      user_id,
      SUM(watch_duration) AS total_minutes -- Changed from duration_minutes
    FROM
      `mgmt-467.netflix.watch_history`
    GROUP BY
      user_id
  )
SELECT
  u.user_id,
  u.region,
  u.plan_tier,
  CASE
    WHEN u.age < 18 THEN '0-17'
    WHEN u.age BETWEEN 18 AND 24 THEN '18-24'
    WHEN u.age BETWEEN 25 AND 34 THEN '25-34'
    WHEN u.age BETWEEN 35 AND 44 THEN '35-44'
    WHEN u.age BETWEEN 45 AND 54 THEN '45-54'
    WHEN u.age BETWEEN 55 AND 64 THEN '55-64'
    ELSE '65+'
  END AS age_band,
  COALESCE(ar.avg_rating, 0) AS avg_rating, -- Default to 0 if no reviews
  COALESCE(tm.total_minutes, 0) AS total_minutes, -- Default to 0 if no watch history
  u.churn_label
FROM
  `mgmt-467.netflix.users` AS u
LEFT JOIN
  avg_ratings_cte AS ar
ON
  u.user_id = ar.user_id
LEFT JOIN
  total_minutes_cte AS tm
ON
  u.user_id = tm.user_id;
"""

try:
    query_job = client.query(query_create_cleaned_features)  # Make an API request.
    print(f"Job {query_job.job_id} started to create cleaned_features table.")
    query_job.result()  # Wait for the job to complete.
    print("Table `mgmt-467.netflix.cleaned_features` created successfully.")
except Exception as e:
    print(f"Error creating table `mgmt-467.netflix.cleaned_features`: {e}")

Job 5f926127-1c67-4338-a5a5-f6e3d83de6a8 started to create cleaned_features table.
Error creating table `mgmt-467.netflix.cleaned_features`: 400 Unrecognized name: watch_duration at [16:11]; reason: invalidQuery, location: query, message: Unrecognized name: watch_duration at [16:11]

Location: US
Job ID: 5f926127-1c67-4338-a5a5-f6e3d83de6a8



**Reasoning**:
The previous attempts to create `cleaned_features` failed due to unrecognized column names in the `watch_history` table. To resolve this, I need to inspect the schema of the `mgmt-467.netflix.watch_history` table to find the correct column name for watch duration. I will use `INFORMATION_SCHEMA.COLUMNS` to get column details.



In [None]:
from google.cloud import bigquery

client = bigquery.Client()

query_schema = """
SELECT
    column_name, data_type
FROM
    `mgmt-467.netflix.INFORMATION_SCHEMA.COLUMNS`
WHERE
    table_name = 'watch_history';
"""

try:
    query_job = client.query(query_schema)
    print(f"Job {query_job.job_id} started to retrieve schema for watch_history.")
    results = query_job.result() # Wait for the job to complete.

    print("Schema for `mgmt-467.netflix.watch_history`:")
    for row in results:
        print(f"  Column: {row.column_name}, Type: {row.data_type}")

except Exception as e:
    print(f"Error retrieving schema for watch_history: {e}")

Job 20bf0440-368e-4682-9ea5-70f38b1f63c7 started to retrieve schema for watch_history.
Schema for `mgmt-467.netflix.watch_history`:
  Column: session_id, Type: STRING
  Column: user_id, Type: STRING
  Column: movie_id, Type: STRING
  Column: watch_date, Type: DATE
  Column: device_type, Type: STRING
  Column: watch_duration_minutes, Type: FLOAT64
  Column: progress_percentage, Type: FLOAT64
  Column: action, Type: STRING
  Column: quality, Type: STRING
  Column: location_country, Type: STRING
  Column: is_download, Type: BOOL
  Column: user_rating, Type: INT64


**Reasoning**:
The schema retrieval revealed that the correct column name for watch duration in the `watch_history` table is `watch_duration_minutes`. I will now update the SQL query to use this correct column name and re-attempt to create the `cleaned_features` table.
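
Once the schema has been retrieved, a guessed column name can be compared against the real ones instead of retrying queries. The sketch below uses `difflib.get_close_matches` from the Python standard library on the column list printed above. The helper name `suggest_column` and the 0.6 cutoff are assumptions chosen for illustration.

```python
import difflib

# Column names copied from the INFORMATION_SCHEMA output above.
watch_history_columns = [
    "session_id", "user_id", "movie_id", "watch_date", "device_type",
    "watch_duration_minutes", "progress_percentage", "action", "quality",
    "location_country", "is_download", "user_rating",
]


def suggest_column(guess: str, columns: list[str], cutoff: float = 0.6) -> list[str]:
    """Return up to three real column names that are textually close to the guess."""
    return difflib.get_close_matches(guess, columns, n=3, cutoff=cutoff)


print(suggest_column("duration_minutes", watch_history_columns))
```

String similarity only catches near misses in spelling. It would not connect `region` to `country` or `plan_tier` to `subscription_plan`, so reading the schema remains necessary for renamed concepts.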



In [None]:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

query_create_cleaned_features = """
CREATE OR REPLACE TABLE `mgmt-467.netflix.cleaned_features` AS
WITH
  avg_ratings_cte AS (
    SELECT
      user_id,
      AVG(rating) AS avg_rating
    FROM
      `mgmt-467.netflix.reviews`
    GROUP BY
      user_id
  ),
  total_minutes_cte AS (
    SELECT
      user_id,
      SUM(watch_duration_minutes) AS total_minutes -- Corrected column name
    FROM
      `mgmt-467.netflix.watch_history`
    GROUP BY
      user_id
  )
SELECT
  u.user_id,
  u.region,
  u.plan_tier,
  CASE
    WHEN u.age < 18 THEN '0-17'
    WHEN u.age BETWEEN 18 AND 24 THEN '18-24'
    WHEN u.age BETWEEN 25 AND 34 THEN '25-34'
    WHEN u.age BETWEEN 35 AND 44 THEN '35-44'
    WHEN u.age BETWEEN 45 AND 54 THEN '45-54'
    WHEN u.age BETWEEN 55 AND 64 THEN '55-64'
    ELSE '65+'
  END AS age_band,
  COALESCE(ar.avg_rating, 0) AS avg_rating, -- Default to 0 if no reviews
  COALESCE(tm.total_minutes, 0) AS total_minutes, -- Default to 0 if no watch history
  u.churn_label
FROM
  `mgmt-467.netflix.users` AS u
LEFT JOIN
  avg_ratings_cte AS ar
ON
  u.user_id = ar.user_id
LEFT JOIN
  total_minutes_cte AS tm
ON
  u.user_id = tm.user_id;
"""

try:
    query_job = client.query(query_create_cleaned_features)  # Make an API request.
    print(f"Job {query_job.job_id} started to create cleaned_features table.")
    query_job.result()  # Wait for the job to complete.
    print("Table `mgmt-467.netflix.cleaned_features` created successfully.")
except Exception as e:
    print(f"Error creating table `mgmt-467.netflix.cleaned_features`: {e}")

Job 02d6a53d-3fba-4aab-8ca7-51dce2fde1af started to create cleaned_features table.
Error creating table `mgmt-467.netflix.cleaned_features`: 400 Name region not found inside u at [24:5]; reason: invalidQuery, location: query, message: Name region not found inside u at [24:5]

Location: US
Job ID: 02d6a53d-3fba-4aab-8ca7-51dce2fde1af



**Reasoning**:
The previous attempt to create `cleaned_features` failed because the column `region` was not found in the `users` table (`u`). To resolve this, I need to inspect the schema of the `mgmt-467.netflix.users` table to find the correct column names for all user-related fields (`region`, `plan_tier`, `age`, and `churn_label`). I will use `INFORMATION_SCHEMA.COLUMNS` to get column details.



In [None]:
from google.cloud import bigquery

client = bigquery.Client()

query_schema = """
SELECT
    column_name, data_type
FROM
    `mgmt-467.netflix.INFORMATION_SCHEMA.COLUMNS`
WHERE
    table_name = 'users';
"""

try:
    query_job = client.query(query_schema)
    print(f"Job {query_job.job_id} started to retrieve schema for users.")
    results = query_job.result() # Wait for the job to complete.

    print("Schema for `mgmt-467.netflix.users`:")
    for row in results:
        print(f"  Column: {row.column_name}, Type: {row.data_type}")

except Exception as e:
    print(f"Error retrieving schema for users: {e}")

Job 82f77627-f598-40fa-b7d7-9d88a8ec2a1f started to retrieve schema for users.
Schema for `mgmt-467.netflix.users`:
  Column: user_id, Type: STRING
  Column: email, Type: STRING
  Column: first_name, Type: STRING
  Column: last_name, Type: STRING
  Column: age, Type: FLOAT64
  Column: gender, Type: STRING
  Column: country, Type: STRING
  Column: state_province, Type: STRING
  Column: city, Type: STRING
  Column: subscription_plan, Type: STRING
  Column: subscription_start_date, Type: DATE
  Column: is_active, Type: BOOL
  Column: monthly_spend, Type: FLOAT64
  Column: primary_device, Type: STRING
  Column: household_size, Type: FLOAT64
  Column: created_at, Type: TIMESTAMP


**Reasoning**:
The schema retrieval for the `users` table revealed that `region` should be `country`, `plan_tier` should be `subscription_plan`, `age` is correct, and `churn_label` can be derived from `is_active`. I will update the SQL query with these corrected column names and re-attempt to create the `cleaned_features` table.
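
The mapping described above can be written down as data and checked against the retrieved schema before the query is run. This is a minimal sketch: `feature_sources`, `first_attempt`, and `missing_sources` are hypothetical names, and the column set is copied from the `users` schema output above.

```python
# Column names copied from the `users` schema output above.
users_columns = {
    "user_id", "email", "first_name", "last_name", "age", "gender",
    "country", "state_province", "city", "subscription_plan",
    "subscription_start_date", "is_active", "monthly_spend",
    "primary_device", "household_size", "created_at",
}

# Target feature -> source column in `users`, as described in the reasoning above.
feature_sources = {
    "region": "country",
    "plan_tier": "subscription_plan",
    "age_band": "age",
    "churn_label": "is_active",
}

# The column names assumed by the earlier, failing attempts.
first_attempt = {
    "region": "region",
    "plan_tier": "plan_tier",
    "age_band": "age",
    "churn_label": "churn_label",
}


def missing_sources(mapping: dict[str, str], available: set[str]) -> list[str]:
    """List the source columns in the mapping that the table does not have."""
    return sorted(col for col in mapping.values() if col not in available)


print("first attempt missing:", missing_sources(first_attempt, users_columns))
print("corrected mapping missing:", missing_sources(feature_sources, users_columns))
```

A check like this reports every unknown column at once, whereas BigQuery stops at the first unrecognized name in each failed job.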



In [None]:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

query_create_cleaned_features = """
CREATE OR REPLACE TABLE `mgmt-467.netflix.cleaned_features` AS
WITH
  avg_ratings_cte AS (
    SELECT
      user_id,
      AVG(rating) AS avg_rating
    FROM
      `mgmt-467.netflix.reviews`
    GROUP BY
      user_id
  ),
  total_minutes_cte AS (
    SELECT
      user_id,
      SUM(watch_duration_minutes) AS total_minutes
    FROM
      `mgmt-467.netflix.watch_history`
    GROUP BY
      user_id
  )
SELECT
  u.user_id,
  u.country AS region, -- Corrected column name
  u.subscription_plan AS plan_tier, -- Corrected column name
  CASE
    WHEN u.age < 18 THEN '0-17'
    WHEN u.age BETWEEN 18 AND 24 THEN '18-24'
    WHEN u.age BETWEEN 25 AND 34 THEN '25-34'
    WHEN u.age BETWEEN 35 AND 44 THEN '35-44'
    WHEN u.age BETWEEN 45 AND 54 THEN '45-54'
    WHEN u.age BETWEEN 55 AND 64 THEN '55-64'
    ELSE '65+'
  END AS age_band,
  COALESCE(ar.avg_rating, 0) AS avg_rating,
  COALESCE(tm.total_minutes, 0) AS total_minutes,
  CASE WHEN u.is_active = FALSE THEN 1 ELSE 0 END AS churn_label -- Derived from is_active
FROM
  `mgmt-467.netflix.users` AS u
LEFT JOIN
  avg_ratings_cte AS ar
ON
  u.user_id = ar.user_id
LEFT JOIN
  total_minutes_cte AS tm
ON
  u.user_id = tm.user_id;
"""

try:
    query_job = client.query(query_create_cleaned_features)  # Make an API request.
    print(f"Job {query_job.job_id} started to create cleaned_features table.")
    query_job.result()  # Wait for the job to complete.
    print("Table `mgmt-467.netflix.cleaned_features` created successfully.")
except Exception as e:
    print(f"Error creating table `mgmt-467.netflix.cleaned_features`: {e}")

Job a673a298-3363-4479-9fbc-4dba9a6b6199 started to create cleaned_features table.
Table `mgmt-467.netflix.cleaned_features` created successfully.


## Create Churn Features Table

### Subtask:
After `cleaned_features` is successfully created, generate and execute the SQL query to create the `mgmt-467.netflix.churn_features` table using `region`, `plan_tier`, `age_band`, `avg_rating`, `total_minutes`, and `churn_label` from `mgmt-467.netflix.cleaned_features`, filtering where `churn_label IS NOT NULL`.


**Reasoning**:
The previous step successfully created the `cleaned_features` table. Now, I will generate and execute the SQL query to create the `churn_features` table based on the specified columns and filtering criteria from the `cleaned_features` table, as per the subtask instructions.
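
It is worth checking whether the `churn_label IS NOT NULL` filter can remove any rows. The label in `cleaned_features` was built with `CASE WHEN is_active = FALSE THEN 1 ELSE 0 END`, and under SQL's three-valued logic a `NULL` `is_active` falls into the `ELSE` branch. The sketch below reproduces this with Python's built-in `sqlite3` module, using 0 and 1 in place of BigQuery's BOOL values. The three rows are invented, and whether the real `users` table contains any `NULL` `is_active` values is not established in this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id TEXT, is_active INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("a", 1), ("b", 0), ("c", None)],  # active, inactive, unknown
)

# Same shape as the label expression used for cleaned_features.
as_written = conn.execute("""
SELECT user_id,
       CASE WHEN is_active = 0 THEN 1 ELSE 0 END AS churn_label
FROM users ORDER BY user_id
""").fetchall()

# Hypothetical variant that keeps unknown status as NULL.
null_preserving = conn.execute("""
SELECT user_id,
       CASE WHEN is_active IS NULL THEN NULL
            WHEN is_active = 0 THEN 1
            ELSE 0 END AS churn_label
FROM users ORDER BY user_id
""").fetchall()

print("as written:     ", as_written)
print("null-preserving:", null_preserving)
```

With the expression as written, the user with unknown status is labeled 0 (not churned), so the label is never `NULL` and the filter keeps every row. The null-preserving variant is one way to let the filter exclude users whose status is unknown.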



In [None]:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

query_create_churn_features = """
CREATE OR REPLACE TABLE `mgmt-467.netflix.churn_features` AS
SELECT
    region,
    plan_tier,
    age_band,
    avg_rating,
    total_minutes,
    churn_label
FROM
    `mgmt-467.netflix.cleaned_features`
WHERE
    churn_label IS NOT NULL;
"""

try:
    query_job = client.query(query_create_churn_features)  # Make an API request.
    print(f"Job {query_job.job_id} started to create churn_features table.")
    query_job.result()  # Wait for the job to complete.
    print("Table `mgmt-467.netflix.churn_features` created successfully.")
except Exception as e:
    print(f"Error creating table `mgmt-467.netflix.churn_features`: {e}")

Job 7aca3e03-4e3f-4f41-bfa1-4841079ce86d started to create churn_features table.
Table `mgmt-467.netflix.churn_features` created successfully.


## Final Task

### Subtask:
Confirm the successful creation of both `cleaned_features` and `churn_features` tables.


## Summary:

### Data Analysis Key Findings

*   **`cleaned_features` Table Schema Defined:** A comprehensive schema for the `cleaned_features` table was defined, outlining the derivation of each field:
    *   `region` and `plan_tier` were derived from `users.country` and `users.subscription_plan`, respectively.
    *   `age_band` was categorized from `users.age` into specific ranges (e.g., '0-17', '18-24', '65+').
    *   `avg_rating` was calculated as the average `rating` per user from the `reviews` table.
    *   `total_minutes` represented the sum of `watch_duration_minutes` per user from the `watch_history` table.
    *   `churn_label` was derived from `users.is_active` (0 for active, 1 for inactive/churned).
*   **Successful Creation of `cleaned_features`:** The `mgmt-467.netflix.cleaned_features` table was successfully created in BigQuery after iterative debugging. Initial attempts failed due to incorrect column names (`duration_minutes`, `watch_duration` in `watch_history`; `region`, `plan_tier`, `churn_label` in `users`), which were resolved by querying the `INFORMATION_SCHEMA.COLUMNS` for the respective tables.
*   **Successful Creation of `churn_features`:** The `mgmt-467.netflix.churn_features` table was subsequently created from `cleaned_features`, including `region`, `plan_tier`, `age_band`, `avg_rating`, `total_minutes`, and `churn_label`, with a filter applied to include only records where `churn_label IS NOT NULL`.

### Insights or Next Steps

*   The features in `cleaned_features` cover user demographics (`region`, `age_band`), subscription tier (`plan_tier`), engagement (`total_minutes`), and ratings (`avg_rating`), which are the inputs used for churn prediction in the following tasks.
*   The `churn_features` table holds these five features together with `churn_label` and is the training input for the model in Task 2.



## Task 2: Train Logistic Regression Model

**🎯 Goal:** Train a basic BQML logistic regression model.  
**📌 Requirements:** Use churn_features table, predict churn_label from features.

---

### 🧠 Prompt Template  
> Write a CREATE MODEL SQL for logistic regression using churn_label as label and [features] as inputs.

---

### 👩‍🏫 Example Prompt  
> Train a logistic regression model to predict churn_label using region, plan_tier, total_minutes, avg_rating.

---

### ✅ Expected SQL Output
```sql
CREATE OR REPLACE MODEL `your_dataset.churn_model`
OPTIONS(model_type='logistic_reg') AS
SELECT region, plan_tier, total_minutes, avg_rating, churn_label
FROM `your_dataset.churn_features`;
```

---

### 🔍 Checkpoint  
Model appears in BigQuery under Models. Training completes.


In [None]:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

query_create_model = """
CREATE OR REPLACE MODEL `mgmt-467.netflix.churn_model`
OPTIONS(model_type='logistic_reg', input_label_cols=['churn_label'])
AS
SELECT
    region,
    plan_tier,
    age_band,
    avg_rating,
    total_minutes,
    churn_label
FROM
    `mgmt-467.netflix.churn_features`;
"""

try:
    query_job = client.query(query_create_model)  # Make an API request.
    print(f"Job {query_job.job_id} started to create churn_model.")
    query_job.result()  # Wait for the job to complete.
    print("Model `mgmt-467.netflix.churn_model` created successfully.")
except Exception as e:
    print(f"Error creating model `mgmt-467.netflix.churn_model`: {e}")

Job 116144e7-7b0f-409d-9497-6ae0e57dd2d5 started to create churn_model.
Model `mgmt-467.netflix.churn_model` created successfully.



## Task 3: Evaluate Model

**🎯 Goal:** Evaluate the logistic regression model.  
**📌 Requirements:** Use ML.EVALUATE.

---

### 🧠 Prompt Template  
> Write a query to evaluate my logistic regression model using ML.EVALUATE.

---

### 👩‍🏫 Example Prompt  
> Evaluate the churn_model using ML.EVALUATE to get accuracy, precision, recall.

---

### ✅ Expected SQL Output
```sql
SELECT * FROM ML.EVALUATE(MODEL `your_dataset.churn_model`);
```

---

### 🔍 Checkpoint  
View performance metrics: accuracy, log_loss, precision, recall.


In [None]:
# prompt: Use ML.EVALUATE to assess the model created in the previous step. Please report metrics such as accuracy, precision, recall, and log loss.

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

query_evaluate_model = """
SELECT
    *
FROM
    ML.EVALUATE(MODEL `mgmt-467.netflix.churn_model`);
"""

try:
    query_job = client.query(query_evaluate_model)  # Make an API request.
    print(f"Job {query_job.job_id} started to evaluate churn_model.")
    results = query_job.result()  # Wait for the job to complete.

    print("Model Evaluation Results:")
    for row in results:
        print(f"  Accuracy: {row.accuracy}")
        print(f"  Log Loss: {row.log_loss}")
        print(f"  Precision: {row.precision}")
        print(f"  Recall: {row.recall}")

except Exception as e:
    print(f"Error evaluating model `mgmt-467.netflix.churn_model`: {e}")


Job 5dc67173-4589-4551-bed0-7a8d24845bdd started to evaluate churn_model.
Model Evaluation Results:
  Accuracy: 0.8496166213208014
  Log Loss: 0.42349457226866244
  Precision: 0.0
  Recall: 0.0
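
An accuracy near 0.85 together with precision and recall of 0.0 is the pattern produced by a classifier that never predicts the positive class on an imbalanced dataset. The sketch below computes the three metrics from a confusion matrix in plain Python. The counts (1,000 users, 15% churners) are hypothetical and chosen only to resemble the reported accuracy. They are not taken from the model's evaluation data.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Accuracy, precision and recall from confusion-matrix counts.

    Precision and recall are reported as 0.0 when their denominator is zero.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical evaluation set: 850 non-churners, 150 churners,
# and a model that predicts "no churn" for everyone.
all_negative = binary_metrics(tp=0, fp=0, tn=850, fn=150)
print(all_negative)
```

If the model behaves this way, accuracy alone overstates its usefulness, since it identifies no churners. Two BigQuery ML options that may be worth trying are the `auto_class_weights` training option in `CREATE MODEL` and a lower classification threshold passed to `ML.EVALUATE`. Neither was tested in this notebook.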



## Task 4: Predict Churn

**🎯 Goal:** Use ML.PREDICT to generate churn predictions.  
**📌 Requirements:** Apply model to same input table.

---

### 🧠 Prompt Template  
> Generate SQL to use ML.PREDICT on churn_model and return predictions by user_id.

---

### 👩‍🏫 Example Prompt  
> Predict churn using churn_model. Include user_id, predicted_churn_label, and prediction probability.

---

### ✅ Expected SQL Output
```sql
SELECT user_id, predicted_churn_label, predicted_churn_label_probs
FROM ML.PREDICT(MODEL `your_dataset.churn_model`,
      (SELECT * FROM `your_dataset.churn_features`));
```

---

### 🔍 Checkpoint  
Inspect top churn risk users. Validate probabilities.


In [None]:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

query_predict_churn = """
SELECT
    user_id,
    predicted_churn_label,
    predicted_churn_label_probs
FROM
    ML.PREDICT(MODEL `mgmt-467.netflix.churn_model`,
        (SELECT
            user_id,
            region,
            plan_tier,
            age_band,
            avg_rating,
            total_minutes
        FROM
            `mgmt-467.netflix.cleaned_features`)); -- cleaned_features is used because churn_features has no user_id column
"""

try:
    query_job = client.query(query_predict_churn)  # Make an API request.
    print(f"Job {query_job.job_id} started to predict churn.")
    results = query_job.result()  # Wait for the job to complete.

    print("Churn Prediction Results (Top 5):")
    for i, row in enumerate(results):
        if i >= 5: # Display only top 5 for brevity
            break
        print(f"  User ID: {row.user_id}, Predicted Churn: {row.predicted_churn_label}, Probability: {row.predicted_churn_label_probs}")

except Exception as e:
    print(f"Error predicting churn using model `mgmt-467.netflix.churn_model`: {e}")

Job c28d358a-df85-43df-a77a-2c9f542df537 started to predict churn.
Churn Prediction Results (Top 5):
  User ID: user_00008, Predicted Churn: 0, Probability: [{'label': 1, 'prob': 0.15123665873560657}, {'label': 0, 'prob': 0.8487633412643935}]
  User ID: user_00008, Predicted Churn: 0, Probability: [{'label': 1, 'prob': 0.15123665873560657}, {'label': 0, 'prob': 0.8487633412643935}]
  User ID: user_00009, Predicted Churn: 0, Probability: [{'label': 1, 'prob': 0.15192225799575576}, {'label': 0, 'prob': 0.8480777420042442}]
  User ID: user_00009, Predicted Churn: 0, Probability: [{'label': 1, 'prob': 0.15192225799575576}, {'label': 0, 'prob': 0.8480777420042442}]
  User ID: user_00012, Predicted Churn: 0, Probability: [{'label': 1, 'prob': 0.14792381486266493}, {'label': 0, 'prob': 0.8520761851373351}]
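
The rows above are simply the first five returned, not the highest-risk users, and some `user_id` values appear twice. The sketch below shows one way to pull the churn probability (label 1) out of `predicted_churn_label_probs`, collapse repeated users, and rank by risk. The input is the five rows printed above, and the helper name `churn_probability` is hypothetical. Why the duplicates occur is not established in this notebook. One possibility is repeated `user_id` rows in the `users` table.

```python
# The five rows printed above, as plain Python dictionaries.
rows = [
    {"user_id": "user_00008", "predicted_churn_label_probs": [
        {"label": 1, "prob": 0.15123665873560657}, {"label": 0, "prob": 0.8487633412643935}]},
    {"user_id": "user_00008", "predicted_churn_label_probs": [
        {"label": 1, "prob": 0.15123665873560657}, {"label": 0, "prob": 0.8487633412643935}]},
    {"user_id": "user_00009", "predicted_churn_label_probs": [
        {"label": 1, "prob": 0.15192225799575576}, {"label": 0, "prob": 0.8480777420042442}]},
    {"user_id": "user_00009", "predicted_churn_label_probs": [
        {"label": 1, "prob": 0.15192225799575576}, {"label": 0, "prob": 0.8480777420042442}]},
    {"user_id": "user_00012", "predicted_churn_label_probs": [
        {"label": 1, "prob": 0.14792381486266493}, {"label": 0, "prob": 0.8520761851373351}]},
]


def churn_probability(probs: list[dict], positive_label: int = 1) -> float:
    """Pick the probability assigned to the positive (churn) label."""
    return next(p["prob"] for p in probs if p["label"] == positive_label)


# Keep one probability per user, then sort from highest to lowest risk.
risk_by_user = {
    row["user_id"]: churn_probability(row["predicted_churn_label_probs"])
    for row in rows
}
ranked = sorted(risk_by_user.items(), key=lambda item: item[1], reverse=True)

for user_id, prob in ranked:
    print(f"{user_id}: churn probability {prob:.4f}")
```

All three users sit near a 0.15 churn probability, below the default 0.5 decision threshold, which matches the all-zero predicted labels shown above.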
