# MGMT 467 — Prompt-Driven Lab (with Commented Examples)
## Kaggle ➜ Google Cloud Storage ➜ BigQuery ➜ Data Quality (DQ)

**How to use this notebook**
- Each section gives you a **Build Prompt** to paste into Gemini/Vertex AI (or Gemini in Colab).
- Below each prompt, you’ll see a **commented example** of what a good LLM answer might look like.
- **Do not** just uncomment and run. Use the prompt to generate your own code, then compare to the example.
- After every step, run the **Verification Prompt**, and write the **Reflection** in Markdown.

> Goal today: Download the Netflix dataset (Kaggle) → Stage on GCS → Load into BigQuery → Run DQ profiling (missingness, duplicates, outliers, anomaly flags).


### Academic integrity & LLM usage
- Use the prompts here to generate your own code cells.
- Read concept notes and write the reflection answers in your own words.
- Keep credentials out of code. Upload `kaggle.json` when asked.


## Learning objectives
1) Explain **why** we stage data in GCS and load it to BigQuery.  
2) Build an **idempotent**, auditable pipeline.  
3) Diagnose **missingness**, **duplicates**, and **outliers** and justify cleaning choices.  
4) Connect DQ decisions to **business/ML impact**.


## 0) Environment setup — What & Why
Authenticate Colab to Google Cloud so we can use `gcloud`, GCS, and BigQuery. Set **PROJECT_ID** and **REGION** once for consistency (cost/latency).

### Build Prompt (paste to LLM)
You are my cloud TA. Generate a single **Colab code cell** that:
1) Authenticates to Google Cloud in Colab,  
2) Prompts for `PROJECT_ID` via `input()` and sets `REGION="us-central1"` (editable),  
3) Exports `GOOGLE_CLOUD_PROJECT`,  
4) Runs `gcloud config set project $GOOGLE_CLOUD_PROJECT`,  
5) Prints both values. Add 2–3 comments explaining what/why.
End with a comment: `# Done: Auth + Project/Region set`.


In [1]:
# Authenticate to Google Cloud in Colab
from google.colab import auth
auth.authenticate_user()

import os

# Prompt for Project ID and set Region
PROJECT_ID = input("Enter your GCP Project ID: ").strip()
REGION = "us-central1"  # Keep consistent; change if instructed

# Export GOOGLE_CLOUD_PROJECT for gcloud commands
os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID

# Set the active project for gcloud and BigQuery CLI
!gcloud config set project $GOOGLE_CLOUD_PROJECT

# Print the set values
print("Project:", PROJECT_ID, "| Region:", REGION)

# Done: Auth + Project/Region set

Enter your GCP Project ID: mgmt-467-25259
Updated property [core/project].
Project: mgmt-467-25259 | Region: us-central1


### Verification Prompt
Generate a short cell that prints the active project using `gcloud config get-value project` and echoes the `REGION` you set.


**Reflection:** Why do we set `PROJECT_ID` and `REGION` at the top? What can go wrong if we don’t?



Setting `PROJECT_ID` and `REGION` once at the top keeps every later command pointed at the same project and location. If we don't, commands can fail, and resources can be created in the wrong project or region.
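One way to surface these failures early is a small guard that refuses to continue when a value is missing. This is a minimal sketch: `require_config` is a hypothetical helper name, and it assumes `REGION` has been exported to the environment alongside `GOOGLE_CLOUD_PROJECT`.

```python
import os

def require_config(env=os.environ):
    """Return (project_id, region) or raise if either is unset or blank.

    Hypothetical helper for illustration; it reads the variable names
    used in this notebook and assumes REGION is exported as well.
    """
    project_id = env.get("GOOGLE_CLOUD_PROJECT", "").strip()
    region = env.get("REGION", "").strip()
    # Collect every missing name so the error reports all problems at once.
    missing = [name for name, value in
               (("GOOGLE_CLOUD_PROJECT", project_id), ("REGION", region))
               if not value]
    if missing:
        raise RuntimeError("Missing configuration: " + ", ".join(missing))
    return project_id, region
```

Calling `require_config()` at the top of later cells would stop the run before any resource is created against an unset project.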

## 1) Kaggle API — What & Why
Use Kaggle CLI for reproducible downloads. Store `kaggle.json` at `~/.kaggle/kaggle.json` with `0600` permissions to protect secrets.

### Build Prompt
Generate a **single Colab code cell** that:
- Prompts me to upload `kaggle.json`,
- Saves to `~/.kaggle/kaggle.json` with `0600` permissions,
- Prints `kaggle --version`.
Add comments about security and reproducibility.


In [2]:
# Prompt user to upload their kaggle.json file
# This file contains your Kaggle API credentials. Keep it secure.
from google.colab import files
print("Upload your kaggle.json (Kaggle > Account > Create New API Token)")
uploaded = files.upload()

# Ensure the .kaggle directory exists in the home directory
# This is where the Kaggle CLI expects to find the credentials file.
import os
os.makedirs('/root/.kaggle', exist_ok=True)

# Save the uploaded file to the correct location
# Use the first file uploaded (assuming only one: kaggle.json)
with open('/root/.kaggle/kaggle.json', 'wb') as f:
    f.write(uploaded[list(uploaded.keys())[0]])

# Set file permissions to 0600 (owner read/write only)
# This is crucial for security to prevent other users/processes from accessing your token.
os.chmod('/root/.kaggle/kaggle.json', 0o600)

# Verify the Kaggle CLI is installed and configured
# This also confirms the credential file is in the right place and has correct permissions.
!kaggle --version

Upload your kaggle.json (Kaggle > Account > Create New API Token)


Saving kaggle.json to kaggle.json
Kaggle API 1.7.4.5


### Verification Prompt
Generate a one-liner that runs `kaggle --help | head -n 20` to show the CLI is ready.


**Reflection:** Why require strict `0600` permissions on API tokens? What risks are we avoiding?


API tokens are effectively passwords. Setting `0600` ensures that only the file's owner can read and write it, so other users or processes on the same machine cannot read the token and act on our behalf.
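The permission check can be made explicit in code. The sketch below is illustrative only: `write_secret` and `is_owner_only` are hypothetical helper names, the file content is a placeholder, and the behavior assumes a POSIX file system such as the one in Colab.

```python
import os
import stat
import tempfile

def write_secret(path, data):
    """Write bytes to path so that only the owner can read or write it."""
    # Create the file with 0600 from the start so it is never briefly readable by others.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    # If the file already existed with looser permissions, tighten them.
    os.chmod(path, 0o600)

def is_owner_only(path):
    """True when group and other have no permission bits set."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o077 == 0

# Demo on a throwaway file (placeholder content, not a real token).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "kaggle.json")
    write_secret(demo_path, b'{"username": "example", "key": "placeholder"}')
    print(is_owner_only(demo_path))
```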


## 2) Download & unzip dataset — What & Why
Keep raw files under `/content/data/raw` for predictable paths and auditing.
**Dataset:** `sayeeduddin/netflix-2025user-behavior-dataset-210k-records`

In [3]:
# Create directory for raw data
# This ensures a consistent location for downloaded files.
!mkdir -p /content/data/raw

# Download the dataset from Kaggle
# The -p flag specifies the download path.
!kaggle datasets download -d sayeeduddin/netflix-2025user-behavior-dataset-210k-records -p /content/data

# Unzip the downloaded file into the raw data directory
# The -o flag allows overwriting if files already exist.
!unzip -o /content/data/*.zip -d /content/data/raw

# List all CSV files in the raw data directory with their sizes
# This provides a clear inventory of the downloaded and unzipped files.
!ls -lh /content/data/raw/*.csv

Dataset URL: https://www.kaggle.com/datasets/sayeeduddin/netflix-2025user-behavior-dataset-210k-records
License(s): CC0-1.0
Downloading netflix-2025user-behavior-dataset-210k-records.zip to /content/data
  0% 0.00/4.02M [00:00<?, ?B/s]
100% 4.02M/4.02M [00:00<00:00, 778MB/s]
Archive:  /content/data/netflix-2025user-behavior-dataset-210k-records.zip
  inflating: /content/data/raw/README.md  
  inflating: /content/data/raw/movies.csv  
  inflating: /content/data/raw/recommendation_logs.csv  
  inflating: /content/data/raw/reviews.csv  
  inflating: /content/data/raw/search_logs.csv  
  inflating: /content/data/raw/users.csv  
  inflating: /content/data/raw/watch_history.csv  
-rw-r--r-- 1 root root 114K Aug  2 19:36 /content/data/raw/movies.csv
-rw-r--r-- 1 root root 4.5M Aug  2 19:36 /content/data/raw/recommendation_logs.csv
-rw-r--r-- 1 root root 1.8M Aug  2 19:36 /content/data/raw/reviews.csv
-rw-r--r-- 1 root root 2.2M Aug  2 19:36 /content/data/raw/search_logs.csv
-rw-r--r-- 1 root 

### Build Prompt
Generate a **Colab code cell** that:
- Creates `/content/data/raw`,
- Downloads the dataset to `/content/data` with Kaggle CLI,
- Unzips into `/content/data/raw` (overwrite OK),
- Lists all CSVs with sizes in a neat table.
Include comments describing each step.


In [4]:
# Create directory for raw data
# This ensures a consistent location for downloaded files.
!mkdir -p /content/data/raw

# Download the dataset from Kaggle
# The -p flag specifies the download path.
!kaggle datasets download -d sayeeduddin/netflix-2025user-behavior-dataset-210k-records -p /content/data

# Unzip the downloaded file into the raw data directory
# The -o flag allows overwriting if files already exist.
!unzip -o /content/data/*.zip -d /content/data/raw

# List all CSV files in the raw data directory with their sizes
# This provides a clear inventory of the downloaded and unzipped files.
!ls -lh /content/data/raw/*.csv

Dataset URL: https://www.kaggle.com/datasets/sayeeduddin/netflix-2025user-behavior-dataset-210k-records
License(s): CC0-1.0
netflix-2025user-behavior-dataset-210k-records.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  /content/data/netflix-2025user-behavior-dataset-210k-records.zip
  inflating: /content/data/raw/README.md  
  inflating: /content/data/raw/movies.csv  
  inflating: /content/data/raw/recommendation_logs.csv  
  inflating: /content/data/raw/reviews.csv  
  inflating: /content/data/raw/search_logs.csv  
  inflating: /content/data/raw/users.csv  
  inflating: /content/data/raw/watch_history.csv  
-rw-r--r-- 1 root root 114K Aug  2 19:36 /content/data/raw/movies.csv
-rw-r--r-- 1 root root 4.5M Aug  2 19:36 /content/data/raw/recommendation_logs.csv
-rw-r--r-- 1 root root 1.8M Aug  2 19:36 /content/data/raw/reviews.csv
-rw-r--r-- 1 root root 2.2M Aug  2 19:36 /content/data/raw/search_logs.csv
-rw-r--r-- 1 root root 1.6M Aug  2 1

### Verification Prompt
Generate a snippet that asserts there are exactly **six** CSV files and prints their names.


In [5]:
# Verify there are exactly six CSV files and print their names
import os
import glob

csv_files = glob.glob('/content/data/raw/*.csv')

# Assert that there are exactly 6 CSV files
assert len(csv_files) == 6, f"Expected 6 CSV files, but found {len(csv_files)}"

# Print the names of the CSV files
print("Found the following CSV files:")
for csv_file in csv_files:
    print(os.path.basename(csv_file))

Found the following CSV files:
movies.csv
watch_history.csv
search_logs.csv
users.csv
recommendation_logs.csv
reviews.csv


**Reflection:** Why is keeping a clean file inventory (names, sizes) useful downstream?

A clean file inventory supports reproducibility, auditing and provenance, troubleshooting, and resource management. It also gives scripts and automation a known list of files to check against.
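As an illustration, an inventory can be captured as data rather than read off `ls` output. This is a sketch under assumptions: `build_inventory` is a hypothetical helper, and the SHA-256 checksum is an addition that the lab itself does not ask for.

```python
import hashlib
import tempfile
from pathlib import Path

def build_inventory(raw_dir):
    """Return one record per CSV file: name, size in bytes, SHA-256 digest."""
    records = []
    # Sort by name so the inventory is identical across runs.
    for path in sorted(Path(raw_dir).glob("*.csv")):
        records.append({
            "name": path.name,
            "bytes": path.stat().st_size,
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        })
    return records

# Demo on a temporary directory with one tiny placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "example.csv").write_text("id,value\n1,a\n")
    for record in build_inventory(tmp):
        print(record["name"], record["bytes"], record["sha256"][:12])
```

In the notebook, the same function could be pointed at `/content/data/raw` and the result saved next to the data so that later runs can detect a changed file.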

## 3) Create GCS bucket & upload — What & Why
Stage in GCS → consistent, versionable source for BigQuery loads. Bucket names must be **globally unique**.
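Because the name has to be unique across all of Cloud Storage, a random suffix is a common approach. The sketch below generates a candidate name and checks only the basic naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, dots; starting and ending with a letter or digit). It does not cover every restriction, and `make_bucket_name` is a hypothetical helper.

```python
import re
import uuid

# Basic shape of a valid bucket name; not the full set of Cloud Storage rules.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")

def make_bucket_name(prefix="mgmt467-netflix"):
    """Return prefix plus an 8-character random hex suffix, validated."""
    name = f"{prefix}-{uuid.uuid4().hex[:8]}"
    if not _BUCKET_RE.match(name):
        raise ValueError(f"Invalid bucket name: {name!r}")
    return name

print(make_bucket_name())
```

A passing check only means the name is well formed. Whether it is actually free is known only once the create command succeeds.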

### Build Prompt
Generate a **Colab code cell** that:
- Creates a unique bucket in `${REGION}` (random suffix),
- Saves name to `BUCKET_NAME` env var,
- Uploads all CSVs to `gs://$BUCKET_NAME/netflix/`,
- Prints the bucket name and explains staging benefits.


In [6]:
# --- Create a unique GCS bucket and upload Netflix dataset for BigQuery Staging ---
import os, uuid

# 1) Define region and bucket name
region = "US"
bucket_name = f"mgmt467-netflix-{uuid.uuid4().hex[:8]}"

# 2) Expose values to both Python and shell
os.environ["BUCKET_NAME"] = bucket_name
os.environ["REGION"] = region

# 3) Create the bucket in the chosen location
!gcloud storage buckets create gs://$BUCKET_NAME --location=$REGION

# 4) UPLOAD CSVs and verify
!gcloud storage cp /content/data/raw/* gs://$BUCKET_NAME/netflix/
!gcloud storage ls gs://$BUCKET_NAME/netflix/

Creating gs://mgmt467-netflix-94a76365/...
Copying file:///content/data/raw/movies.csv to gs://mgmt467-netflix-94a76365/netflix/movies.csv
Copying file:///content/data/raw/README.md to gs://mgmt467-netflix-94a76365/netflix/README.md
Copying file:///content/data/raw/recommendation_logs.csv to gs://mgmt467-netflix-94a76365/netflix/recommendation_logs.csv
Copying file:///content/data/raw/reviews.csv to gs://mgmt467-netflix-94a76365/netflix/reviews.csv
Copying file:///content/data/raw/search_logs.csv to gs://mgmt467-netflix-94a76365/netflix/search_logs.csv
Copying file:///content/data/raw/users.csv to gs://mgmt467-netflix-94a76365/netflix/users.csv
Copying file:///content/data/raw/watch_history.csv to gs://mgmt467-netflix-94a76365/netflix/watch_history.csv

Average throughput: 9.0MiB/s
gs://mgmt467-netflix-94a76365/netflix/README.md
gs://mgmt467-netflix-94a76365/netflix/movies.csv
gs://mgmt467-netflix-94a76365/netflix/recommendation_logs.csv
gs://mgmt467-netflix-94a76365/netflix/reviews.cs

### Verification Prompt
Generate a snippet that lists the `netflix/` prefix and shows object sizes.


In [7]:
# List objects in the GCS bucket under the 'netflix/' prefix with sizes
# The -l flag provides a long listing, including object size.
!gcloud storage ls -l gs://$BUCKET_NAME/netflix/

      8002  2025-10-26T20:57:46Z  gs://mgmt467-netflix-94a76365/netflix/README.md
    115942  2025-10-26T20:57:46Z  gs://mgmt467-netflix-94a76365/netflix/movies.csv
   4695557  2025-10-26T20:57:47Z  gs://mgmt467-netflix-94a76365/netflix/recommendation_logs.csv
   1861942  2025-10-26T20:57:48Z  gs://mgmt467-netflix-94a76365/netflix/reviews.csv
   2250902  2025-10-26T20:57:47Z  gs://mgmt467-netflix-94a76365/netflix/search_logs.csv
   1606820  2025-10-26T20:57:47Z  gs://mgmt467-netflix-94a76365/netflix/users.csv
   9269425  2025-10-26T20:57:48Z  gs://mgmt467-netflix-94a76365/netflix/watch_history.csv
TOTAL: 7 objects, 19808590 bytes (18.89MiB)


**Reflection:** Name two benefits of staging in GCS vs loading directly from local Colab.


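The first benefit is scalability and accessibility: BigQuery and other services can read the files directly from GCS. The second is reproducibility and collaboration: teammates and later runs load from the same stored copy instead of a temporary Colab disk.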

## 4) BigQuery dataset & loads — What & Why
Create dataset `netflix` and load six CSVs with **autodetect** for speed (we’ll enforce schemas later).

### Build Prompt (two cells)
**Cell A:** Create (idempotently) dataset `netflix` in US multi-region; if it exists, print a friendly message.  
**Cell B:** Load tables from `gs://$BUCKET_NAME/netflix/`:
`users, movies, watch_history, recommendation_logs, search_logs, reviews`
with `--skip_leading_rows=1 --autodetect --source_format=CSV`.
Finish with row-count queries for each table.
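The load step in Cell B can be sketched by generating one `bq load` command per table. This is an illustrative sketch, not the notebook's recorded solution: `build_load_commands` is a hypothetical helper, the file names are assumed to match the table names (as in the bucket listing above), and `--replace` is an added flag so that re-running overwrites each table instead of appending.

```python
TABLES = ["users", "movies", "watch_history",
          "recommendation_logs", "search_logs", "reviews"]

def build_load_commands(bucket_name, dataset="netflix", tables=TABLES):
    """Return one `bq load` command string per table."""
    commands = []
    for table in tables:
        # Assumes each CSV is named after its table under the netflix/ prefix.
        uri = f"gs://{bucket_name}/netflix/{table}.csv"
        commands.append(
            "bq load --replace --skip_leading_rows=1 --autodetect "
            f"--source_format=CSV {dataset}.{table} {uri}"
        )
    return commands

for cmd in build_load_commands("example-bucket"):
    print(cmd)
```

In Colab, each generated string could be executed with `!{cmd}` once `BUCKET_NAME` holds the real bucket name.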


In [8]:
%%bigquery
# Verify row counts for all tables
SELECT 'users' AS table_name, COUNT(*) AS n FROM `mgmt-467-4677.netflix.users`
UNION ALL
SELECT 'movies' AS table_name, COUNT(*) AS n FROM `mgmt-467-4677.netflix.movies`
UNION ALL
SELECT 'watch_history' AS table_name, COUNT(*) AS n FROM `mgmt-467-4677.netflix.watch_history`
UNION ALL
SELECT 'recommendation_logs' AS table_name, COUNT(*) AS n FROM `mgmt-467-4677.netflix.recommendation_logs`
UNION ALL
SELECT 'search_logs' AS table_name, COUNT(*) AS n FROM `mgmt-467-4677.netflix.search_logs`
UNION ALL
SELECT 'reviews' AS table_name, COUNT(*) AS n FROM `mgmt-467-4677.netflix.reviews`

Executing query with job ID: 1e768df0-6803-4f0f-bea5-b35ca7a221bc
Query executing: 0.50s


ERROR:
 403 Access Denied: Table mgmt-467-4677:netflix.movies: User does not have permission to query table mgmt-467-4677:netflix.movies, or perhaps it does not exist.; reason: accessDenied, message: Access Denied: Table mgmt-467-4677:netflix.movies: User does not have permission to query table mgmt-467-4677:netflix.movies, or perhaps it does not exist.

Location: US
Job ID: 1e768df0-6803-4f0f-bea5-b35ca7a221bc



### Verification Prompt
Generate a single query that returns `table_name, row_count` for all six tables in `${GOOGLE_CLOUD_PROJECT}.netflix`.
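The recorded runs in this section query tables under `mgmt-467-4677`, while the session in section 0 was configured for `mgmt-467-25259`. Whether that mismatch caused the 403 errors is not confirmed here, but building the query from the configured project avoids hard-coded IDs. A minimal sketch, with `build_row_count_query` as a hypothetical helper:

```python
TABLES = ["users", "movies", "watch_history",
          "recommendation_logs", "search_logs", "reviews"]

def build_row_count_query(project_id, dataset="netflix", tables=TABLES):
    """Return a single UNION ALL query with one (table_name, row_count) row per table."""
    parts = [
        f"SELECT '{t}' AS table_name, COUNT(*) AS row_count "
        f"FROM `{project_id}.{dataset}.{t}`"
        for t in tables
    ]
    return "\nUNION ALL\n".join(parts)

# "my-project" is a placeholder; in the notebook, pass PROJECT_ID instead.
print(build_row_count_query("my-project"))
```

The resulting string could then be run with the BigQuery Python client, for example `bigquery.Client(project=PROJECT_ID).query(sql).to_dataframe()`, assuming the `google-cloud-bigquery` package is available in the runtime.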


In [9]:
%%bigquery
# Row counts for all six tables in the netflix dataset (one row per table)
SELECT 'users' AS table_name, COUNT(*) AS row_count FROM `mgmt-467-4677.netflix.users`
UNION ALL
SELECT 'movies' AS table_name, COUNT(*) AS row_count FROM `mgmt-467-4677.netflix.movies`
UNION ALL
SELECT 'watch_history' AS table_name, COUNT(*) AS row_count FROM `mgmt-467-4677.netflix.watch_history`
UNION ALL
SELECT 'recommendation_logs' AS table_name, COUNT(*) AS row_count FROM `mgmt-467-4677.netflix.recommendation_logs`
UNION ALL
SELECT 'search_logs' AS table_name, COUNT(*) AS row_count FROM `mgmt-467-4677.netflix.search_logs`
UNION ALL
SELECT 'reviews' AS table_name, COUNT(*) AS row_count FROM `mgmt-467-4677.netflix.reviews`

Executing query with job ID: d8522df0-3f82-4945-8f0b-173c3d8b1262
Query executing: 0.42s


ERROR:
 403 Access Denied: Table mgmt-467-4677:netflix.movies: User does not have permission to query table mgmt-467-4677:netflix.movies, or perhaps it does not exist.; reason: accessDenied, message: Access Denied: Table mgmt-467-4677:netflix.movies: User does not have permission to query table mgmt-467-4677:netflix.movies, or perhaps it does not exist.

Location: US
Job ID: d8522df0-3f82-4945-8f0b-173c3d8b1262



**Reflection:** When is `autodetect` acceptable? When should you enforce explicit schemas and why?

Autodetect is acceptable for exploratory data analysis, rapid prototyping, and well-structured, consistent data. We should enforce explicit schemas in production ETL/ELT pipelines and whenever data quality and integrity must be preserved, because inferred types can change from one load to the next.

## 5) Data Quality (DQ) — Concepts we care about
- **Missingness** (MCAR/MAR/MNAR). Impute vs drop. Add `is_missing_*` indicators.
- **Duplicates** (exact vs near). Double-counted engagement corrupts labels & KPIs.
- **Outliers** (IQR). Winsorize/cap vs robust models. Always **flag** and explain.
- **Reproducibility**. Prefer `CREATE OR REPLACE` and deterministic keys.


### 5.1 Missingness (users) — What & Why
Measure % missing and check if missingness depends on another variable (MAR) → potential bias & instability.
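The same check can be illustrated outside SQL. The sketch below computes the share of missing values in one column per group, using a tiny made-up sample; `pct_missing_by_group` is a hypothetical helper and the rows are not taken from the dataset.

```python
from collections import defaultdict

def pct_missing_by_group(rows, group_col, target_col):
    """Return {group: % of rows where target_col is None}, highest first."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        key = row.get(group_col)
        totals[key] += 1
        if row.get(target_col) is None:
            missing[key] += 1
    result = {k: round(100 * missing[k] / totals[k], 2) for k in totals}
    # Order by percentage only, so groups with a None key sort without errors.
    return dict(sorted(result.items(), key=lambda kv: kv[1], reverse=True))

# Made-up sample rows for illustration.
sample = [
    {"country": "US", "subscription_plan": "Basic"},
    {"country": "US", "subscription_plan": None},
    {"country": "CA", "subscription_plan": "Premium"},
    {"country": "CA", "subscription_plan": "Basic"},
]
print(pct_missing_by_group(sample, "country", "subscription_plan"))
```

If the percentages differ clearly between groups, missingness depends on an observed variable, which is the pattern MAR describes.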

### Build Prompt
Generate **two BigQuery SQL cells**:
1) Total rows and % missing in `region`, `plan_tier`, `age_band` from `users`.
2) `% plan_tier missing by region` ordered descending. Add comments on MAR.


In [10]:
%%bigquery
# Verification query: print the three missingness percentages
# Recomputes the percentages for country, subscription_plan, and age
# in a subquery and selects only the three percentage columns.
SELECT pct_missing_country, pct_missing_subscription_plan, pct_missing_age
FROM (
  WITH base AS (
    SELECT COUNT(*) n,
           COUNTIF(country IS NULL) miss_country,
           COUNTIF(subscription_plan IS NULL) miss_plan,
           COUNTIF(age IS NULL) miss_age
    FROM `mgmt-467-4677.netflix.users`
  )
  SELECT n,
         ROUND(100*miss_country/n,2) AS pct_missing_country,
         ROUND(100*miss_plan/n,2)   AS pct_missing_subscription_plan,
         ROUND(100*miss_age/n,2)    AS pct_missing_age
  FROM base
)
LIMIT 1;

Executing query with job ID: d185869e-32f5-43d3-bc68-343cffaf629c
Query executing: 0.52s


ERROR:
 403 Access Denied: Table mgmt-467-4677:netflix.users: User does not have permission to query table mgmt-467-4677:netflix.users, or perhaps it does not exist.; reason: accessDenied, message: Access Denied: Table mgmt-467-4677:netflix.users: User does not have permission to query table mgmt-467-4677:netflix.users, or perhaps it does not exist.

Location: US
Job ID: d185869e-32f5-43d3-bc68-343cffaf629c



In [11]:
%%bigquery
# % subscription_plan missing by country, ordered descending
# Calculates the percentage of missing 'subscription_plan' values for each country.
# If the rate varies by country, missingness depends on an observed variable (Missing At Random, MAR).
SELECT country,
       COUNT(*) AS n,
       ROUND(100*COUNTIF(subscription_plan IS NULL)/COUNT(*),2) AS pct_missing_subscription_plan
FROM `mgmt-467-4677.netflix.users`
GROUP BY country
ORDER BY pct_missing_subscription_plan DESC;

Executing query with job ID: fbfd50c3-b308-46d7-8a7b-88ff2bdf5d6a
Query executing: 0.43s


ERROR:
 403 Access Denied: Table mgmt-467-4677:netflix.users: User does not have permission to query table mgmt-467-4677:netflix.users, or perhaps it does not exist.; reason: accessDenied, message: Access Denied: Table mgmt-467-4677:netflix.users: User does not have permission to query table mgmt-467-4677:netflix.users, or perhaps it does not exist.

Location: US
Job ID: fbfd50c3-b308-46d7-8a7b-88ff2bdf5d6a



In [12]:
%%bigquery
# Select a few rows from the users table to check column names
SELECT *
FROM `mgmt-467-4677.netflix.users`
LIMIT 5

Executing query with job ID: 7108f604-55e9-43c6-8544-4a973d513322
Query executing: 0.40s


ERROR:
 403 Access Denied: Table mgmt-467-4677:netflix.users: User does not have permission to query table mgmt-467-4677:netflix.users, or perhaps it does not exist.; reason: accessDenied, message: Access Denied: Table mgmt-467-4677:netflix.users: User does not have permission to query table mgmt-467-4677:netflix.users, or perhaps it does not exist.

Location: US
Job ID: 7108f604-55e9-43c6-8544-4a973d513322



In [13]:
%%bigquery
# Users: % missing per column
# Calculates the total number of rows and the percentage of missing values
# for the 'country', 'subscription_plan', and 'age' columns in the 'users' table.
WITH base AS (
  SELECT COUNT(*) n,
         COUNTIF(country IS NULL) miss_country,
         COUNTIF(subscription_plan IS NULL) miss_plan,
         COUNTIF(age IS NULL) miss_age
  FROM `mgmt-467-4677.netflix.users`
)
SELECT n,
       ROUND(100*miss_country/n,2) AS pct_missing_country,
       ROUND(100*miss_plan/n,2)   AS pct_missing_subscription_plan,
       ROUND(100*miss_age/n,2)    AS pct_missing_age
FROM base;

Executing query with job ID: 744e3c62-b807-4c21-9a48-caa31b46fe9b
Query executing: 0.37s


ERROR:
 403 Access Denied: Table mgmt-467-4677:netflix.users: User does not have permission to query table mgmt-467-4677:netflix.users, or perhaps it does not exist.; reason: accessDenied, message: Access Denied: Table mgmt-467-4677:netflix.users: User does not have permission to query table mgmt-467-4677:netflix.users, or perhaps it does not exist.

Location: US
Job ID: 744e3c62-b807-4c21-9a48-caa31b46fe9b



### Verification Prompt
Generate a query that prints the three missingness percentages from (1), rounded to two decimals.


In [14]:
%%bigquery
# Verification query: print the three missingness percentages
# Recomputes the percentages for country, subscription_plan, and age
# in a subquery and selects only the three percentage columns.
SELECT pct_missing_country, pct_missing_subscription_plan, pct_missing_age
FROM (
  WITH base AS (
    SELECT COUNT(*) n,
           COUNTIF(country IS NULL) miss_country,
           COUNTIF(subscription_plan IS NULL) miss_plan,
           COUNTIF(age IS NULL) miss_age
    FROM `mgmt-467-4677.netflix.users`
  )
  SELECT n,
         ROUND(100*miss_country/n,2) AS pct_missing_country,
         ROUND(100*miss_plan/n,2)   AS pct_missing_subscription_plan,
         ROUND(100*miss_age/n,2)    AS pct_missing_age
  FROM base
)
LIMIT 1;

Executing query with job ID: d7dd7e6f-64fa-4665-b7db-2e31cf7d459f
Query executing: 0.38s


ERROR:
 403 Access Denied: Table mgmt-467-4677:netflix.users: User does not have permission to query table mgmt-467-4677:netflix.users, or perhaps it does not exist.; reason: accessDenied, message: Access Denied: Table mgmt-467-4677:netflix.users: User does not have permission to query table mgmt-467-4677:netflix.users, or perhaps it does not exist.

Location: US
Job ID: d7dd7e6f-64fa-4665-b7db-2e31cf7d459f



**Reflection:** Which columns are most missing? Hypothesize MCAR/MAR/MNAR and why.

### 5.2 Duplicates (watch_history) — What & Why
Find exact duplicate interaction records and keep **one best** per group (deterministic policy).
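A deterministic policy means the same input always yields the same surviving row. The sketch below mirrors that idea in plain Python on made-up rows. `dedup_watch_history` is a hypothetical helper, and treating a missing value as 0 when ranking is a simplifying assumption.

```python
KEY_COLS = ("user_id", "movie_id", "watch_date", "device_type")

def dedup_watch_history(rows):
    """Keep one row per key: highest progress_percentage, then watch_duration_minutes."""
    best = {}
    for row in rows:
        key = tuple(row[c] for c in KEY_COLS)
        # Missing values are ranked as 0 here (an assumption for this sketch).
        rank = (row.get("progress_percentage") or 0,
                row.get("watch_duration_minutes") or 0)
        # Strict ">" keeps the first row seen when two rows tie.
        if key not in best or rank > best[key][0]:
            best[key] = (rank, row)
    return [row for _, row in best.values()]

# Made-up rows: the first two share a key, the third is unique.
sample = [
    {"user_id": "u1", "movie_id": "m1", "watch_date": "2025-01-01", "device_type": "TV",
     "progress_percentage": 40.0, "watch_duration_minutes": 30.0},
    {"user_id": "u1", "movie_id": "m1", "watch_date": "2025-01-01", "device_type": "TV",
     "progress_percentage": 90.0, "watch_duration_minutes": 70.0},
    {"user_id": "u2", "movie_id": "m1", "watch_date": "2025-01-01", "device_type": "TV",
     "progress_percentage": 10.0, "watch_duration_minutes": 5.0},
]
print(len(dedup_watch_history(sample)))
```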

### Build Prompt
Generate **two BigQuery SQL cells**:
1) Report duplicate groups on `(user_id, movie_id, event_ts, device_type)` with counts (top 20).
2) Create table `watch_history_dedup` that keeps one row per group (prefer higher `progress_ratio`, then `minutes_watched`). Add comments.


In [15]:
%%bigquery
# Verification query: before/after count comparing raw vs watch_history_dedup
# Compares the number of rows in the original watch_history table
# and the new watch_history_dedup table to show the effect of deduplication.
SELECT 'watch_history_raw' AS table_name, COUNT(*) AS n FROM `mgmt-467-4677.netflix.watch_history`
UNION ALL
SELECT 'watch_history_dedup' AS table_name, COUNT(*) AS n FROM `mgmt-467-4677.netflix.watch_history_dedup`;

Executing query with job ID: c050f839-1468-4144-82c0-9d42a863e406
Query executing: 0.46s


ERROR:
 403 Access Denied: Table mgmt-467-4677:netflix.watch_history: User does not have permission to query table mgmt-467-4677:netflix.watch_history, or perhaps it does not exist.; reason: accessDenied, message: Access Denied: Table mgmt-467-4677:netflix.watch_history: User does not have permission to query table mgmt-467-4677:netflix.watch_history, or perhaps it does not exist.

Location: US
Job ID: c050f839-1468-4144-82c0-9d42a863e406



In [16]:
%%bigquery
# Select a few rows from the watch_history table to check column names
SELECT *
FROM `mgmt-467-4677.netflix.watch_history`
LIMIT 5

Executing query with job ID: d3df3df4-bdce-4d98-bffb-3fc8fd412321
Query executing: 0.44s


ERROR:
 403 Access Denied: Table mgmt-467-4677:netflix.watch_history: User does not have permission to query table mgmt-467-4677:netflix.watch_history, or perhaps it does not exist.; reason: accessDenied, message: Access Denied: Table mgmt-467-4677:netflix.watch_history: User does not have permission to query table mgmt-467-4677:netflix.watch_history, or perhaps it does not exist.

Location: US
Job ID: d3df3df4-bdce-4d98-bffb-3fc8fd412321



In [17]:
%%bigquery
# Report duplicate groups on (user_id, movie_id, watch_date, device_type) with counts (top 20)
# Identifies and counts duplicate rows based on key columns in the watch_history table.
# Understanding duplicates is crucial for accurate analysis and modeling.
SELECT user_id, movie_id, watch_date, device_type, COUNT(*) AS dup_count
FROM `mgmt-467-4677.netflix.watch_history`
GROUP BY user_id, movie_id, watch_date, device_type
HAVING dup_count > 1
ORDER BY dup_count DESC
LIMIT 20;

Executing query with job ID: 16a58bc0-767b-4bb2-bcb4-5af9da349bc2
Query executing: 0.48s


ERROR:
 403 Access Denied: Table mgmt-467-4677:netflix.watch_history: User does not have permission to query table mgmt-467-4677:netflix.watch_history, or perhaps it does not exist.; reason: accessDenied, message: Access Denied: Table mgmt-467-4677:netflix.watch_history: User does not have permission to query table mgmt-467-4677:netflix.watch_history, or perhaps it does not exist.

Location: US
Job ID: 16a58bc0-767b-4bb2-bcb4-5af9da349bc2



In [18]:
%%bigquery
# Create table watch_history_dedup that keeps one row per group
# Removes duplicates by keeping one row per group according to a fixed policy
# (prefer higher progress_percentage, then higher watch_duration_minutes).
# This ensures a clean dataset for downstream tasks.
CREATE OR REPLACE TABLE `mgmt-467-4677.netflix.watch_history_dedup` AS
SELECT * EXCEPT(rk) FROM (
  SELECT h.*,
         ROW_NUMBER() OVER (
           PARTITION BY user_id, movie_id, watch_date, device_type
           ORDER BY progress_percentage DESC, watch_duration_minutes DESC
         ) AS rk
  FROM `mgmt-467-4677.netflix.watch_history` h
)
WHERE rk = 1;

Executing query with job ID: 064ef730-df57-4f19-8eee-0d65c6894b0a
Query executing: 0.44s


ERROR:
 403 Access Denied: Table mgmt-467-4677:netflix.watch_history: User does not have permission to query table mgmt-467-4677:netflix.watch_history, or perhaps it does not exist.; reason: accessDenied, message: Access Denied: Table mgmt-467-4677:netflix.watch_history: User does not have permission to query table mgmt-467-4677:netflix.watch_history, or perhaps it does not exist.

Location: US
Job ID: 064ef730-df57-4f19-8eee-0d65c6894b0a



### Verification Prompt
Generate a before/after count query comparing raw vs `watch_history_dedup`.


In [19]:
%%bigquery
# Verification query: before/after count comparing raw vs watch_history_dedup
# Compares the number of rows in the original watch_history table
# and the new watch_history_dedup table to show the effect of deduplication.
SELECT 'watch_history_raw' AS table_name, COUNT(*) AS n FROM `mgmt-467-4677.netflix.watch_history`
UNION ALL
SELECT 'watch_history_dedup' AS table_name, COUNT(*) AS n FROM `mgmt-467-4677.netflix.watch_history_dedup`;

Executing query with job ID: 98e45ae2-afa1-45d5-aba7-a827dd3372ae
Query executing: 0.33s


ERROR:
 403 Access Denied: Table mgmt-467-4677:netflix.watch_history: User does not have permission to query table mgmt-467-4677:netflix.watch_history, or perhaps it does not exist.; reason: accessDenied, message: Access Denied: Table mgmt-467-4677:netflix.watch_history: User does not have permission to query table mgmt-467-4677:netflix.watch_history, or perhaps it does not exist.

Location: US
Job ID: 98e45ae2-afa1-45d5-aba7-a827dd3372ae



**Reflection:** Why do duplicates arise (natural vs system-generated)? How do they corrupt labels and KPIs?

### 5.3 Outliers (minutes_watched) — What & Why
Estimate extreme values via IQR; report % outliers; **winsorize** to P01/P99 for robustness while also **flagging** extremes.
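The two ideas, flagging with IQR bounds and capping at percentiles, can be sketched on a small made-up list. This uses exact quantiles from Python's `statistics` module, whereas BigQuery's `APPROX_QUANTILES` is approximate, so the numbers need not match exactly. `iqr_bounds` and `winsorize` are hypothetical helper names.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def winsorize(values, lower_pct=1, upper_pct=99):
    """Cap values at the given percentiles; return (capped_values, (low, high))."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values], (low, high)

# Made-up data: 1..100 minutes plus one extreme session.
minutes = list(range(1, 101)) + [1000]

lower, upper = iqr_bounds(minutes)
flags = [v < lower or v > upper for v in minutes]   # flag first, keep the raw value
capped, (p01, p99) = winsorize(minutes)

print("outliers flagged:", sum(flags))
print("max before:", max(minutes), "| max after:", max(capped))
```

Keeping the flag next to the capped value preserves the information that a row was extreme, even after its value has been pulled in.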

### Build Prompt
Generate **two BigQuery SQL cells**:
1) Compute IQR bounds for `minutes_watched` on `watch_history_dedup` and report % outliers.
2) Create `watch_history_robust` with `minutes_watched_capped` capped at P01/P99; return quantile summaries before/after.


In [20]:
%%bigquery
# Select a few rows from the watch_history_dedup table to check column names
SELECT *
FROM `mgmt-467-4677.netflix.watch_history_dedup`
LIMIT 5

Executing query with job ID: 3b92be03-f22a-489d-acd7-18b8dd915d19
Query executing: 0.44s


ERROR:
 403 Access Denied: Table mgmt-467-4677:netflix.watch_history_dedup: User does not have permission to query table mgmt-467-4677:netflix.watch_history_dedup, or perhaps it does not exist.; reason: accessDenied, message: Access Denied: Table mgmt-467-4677:netflix.watch_history_dedup: User does not have permission to query table mgmt-467-4677:netflix.watch_history_dedup, or perhaps it does not exist.

Location: US
Job ID: 3b92be03-f22a-489d-acd7-18b8dd915d19



In [21]:
%%bigquery
# Compute IQR bounds for watch_duration_minutes and report % outliers
# Calculates the interquartile range (IQR) for 'watch_duration_minutes' and the
# percentage of values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR (outliers).
SELECT
    COUNT(*) AS total_rows,
    COUNTIF(watch_duration_minutes < q1 - 1.5 * iqr OR watch_duration_minutes > q3 + 1.5 * iqr) AS outlier_count,
    ROUND(100 * COUNTIF(watch_duration_minutes < q1 - 1.5 * iqr OR watch_duration_minutes > q3 + 1.5 * iqr) / COUNT(*), 2) AS pct_outliers
FROM
    `mgmt-467-4677.netflix.watch_history_dedup`,
    (
        SELECT
            APPROX_QUANTILES(watch_duration_minutes, 4)[OFFSET(1)] AS q1,
            APPROX_QUANTILES(watch_duration_minutes, 4)[OFFSET(3)] AS q3,
            APPROX_QUANTILES(watch_duration_minutes, 4)[OFFSET(3)] - APPROX_QUANTILES(watch_duration_minutes, 4)[OFFSET(1)] AS iqr
        FROM `mgmt-467-4677.netflix.watch_history_dedup`
        WHERE watch_duration_minutes IS NOT NULL
    )
WHERE watch_duration_minutes IS NOT NULL;

Executing query with job ID: 88abae07-a97b-45c3-a605-6198e503384d
Query executing: 0.48s


ERROR:
 403 Access Denied: Table mgmt-467-4677:netflix.watch_history_dedup: User does not have permission to query table mgmt-467-4677:netflix.watch_history_dedup, or perhaps it does not exist.; reason: accessDenied, message: Access Denied: Table mgmt-467-4677:netflix.watch_history_dedup: User does not have permission to query table mgmt-467-4677:netflix.watch_history_dedup, or perhaps it does not exist.

Location: US
Job ID: 88abae07-a97b-45c3-a605-6198e503384d



In [22]:
%%bigquery
# Create watch_history_robust with watch_duration_minutes capped at P01/P99,
# then return quantile summaries before and after capping.
# Selecting h.* keeps the helper columns p1 and p99 out of the new table.
CREATE OR REPLACE TABLE `mgmt-467-4677.netflix.watch_history_robust` AS
SELECT
    h.*,
    IF(h.watch_duration_minutes < b.p1, b.p1, IF(h.watch_duration_minutes > b.p99, b.p99, h.watch_duration_minutes)) AS watch_duration_minutes_capped
FROM
    `mgmt-467-4677.netflix.watch_history_dedup` AS h,
    (
        SELECT
            APPROX_QUANTILES(watch_duration_minutes, 100)[OFFSET(1)] AS p1,
            APPROX_QUANTILES(watch_duration_minutes, 100)[OFFSET(99)] AS p99
        FROM `mgmt-467-4677.netflix.watch_history_dedup`
        WHERE watch_duration_minutes IS NOT NULL
    ) AS b;

SELECT
  'before_capping' AS summary_type,
  APPROX_QUANTILES(watch_duration_minutes, 100)[OFFSET(0)] AS min,
  APPROX_QUANTILES(watch_duration_minutes, 100)[OFFSET(50)] AS median,
  APPROX_QUANTILES(watch_duration_minutes, 100)[OFFSET(100)] AS max
FROM `mgmt-467-4677.netflix.watch_history_dedup`
UNION ALL
SELECT
  'after_capping' AS summary_type,
  APPROX_QUANTILES(watch_duration_minutes_capped, 100)[OFFSET(0)] AS min,
  APPROX_QUANTILES(watch_duration_minutes_capped, 100)[OFFSET(50)] AS median,
  APPROX_QUANTILES(watch_duration_minutes_capped, 100)[OFFSET(100)] AS max
FROM `mgmt-467-4677.netflix.watch_history_robust`
WHERE watch_duration_minutes_capped IS NOT NULL;

Executing query with job ID: 0f1e2bb2-8a92-499e-bf96-991a390952bc
Query executing: 0.29s


ERROR:
 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/mgmt-467-25259/queries/0f1e2bb2-8a92-499e-bf96-991a390952bc?maxResults=0&location=US&prettyPrint=false: Access Denied: Table mgmt-467-4677:netflix.watch_history_dedup: User does not have permission to query table mgmt-467-4677:netflix.watch_history_dedup, or perhaps it does not exist. at [1:1]

Location: US
Job ID: 0f1e2bb2-8a92-499e-bf96-991a390952bc
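
The request URL in the error above names project `mgmt-467-25259`, while every table reference in the queries is hard-coded to `mgmt-467-4677`. If that mismatch is unintended, one way to avoid it is to build table names from a single `PROJECT_ID` value. The sketch below is only an illustration: the helper name `table_ref` and its identifier check are assumptions, not part of the lab.

```python
import re

# Hypothetical helper: build a backticked BigQuery table reference from one
# PROJECT_ID so that cells cannot drift to a different project.
_IDENT = re.compile(r"^[A-Za-z0-9_-]+$")


def table_ref(project_id: str, dataset: str, table: str) -> str:
    """Return `project.dataset.table`, rejecting unexpected characters."""
    for part in (project_id, dataset, table):
        if not _IDENT.match(part):
            raise ValueError(f"unexpected characters in identifier: {part!r}")
    return f"`{project_id}.{dataset}.{table}`"


# Assumption: replace with the project that actually owns the dataset.
PROJECT_ID = "mgmt-467-4677"
print(table_ref(PROJECT_ID, "netflix", "watch_history_dedup"))
```

The returned string can be interpolated into a query built in Python, which keeps the project name in one place.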



### Verification Prompt
Generate a query that shows min/median/max before vs after capping.


In [23]:
%%bigquery
-- Verification query: min/median/max before vs. after capping.
-- Returns one summary row for the original column and one for the capped column
-- to show the effect of capping.
SELECT
  'before_capping' AS summary_type,
  APPROX_QUANTILES(watch_duration_minutes, 100)[OFFSET(0)] AS min,
  APPROX_QUANTILES(watch_duration_minutes, 100)[OFFSET(50)] AS median,
  APPROX_QUANTILES(watch_duration_minutes, 100)[OFFSET(100)] AS max
FROM `mgmt-467-4677.netflix.watch_history_dedup`
UNION ALL
SELECT
  'after_capping' AS summary_type,
  APPROX_QUANTILES(watch_duration_minutes_capped, 100)[OFFSET(0)] AS min,
  APPROX_QUANTILES(watch_duration_minutes_capped, 100)[OFFSET(50)] AS median,
  APPROX_QUANTILES(watch_duration_minutes_capped, 100)[OFFSET(100)] AS max
FROM `mgmt-467-4677.netflix.watch_history_robust`
WHERE watch_duration_minutes_capped IS NOT NULL;

Executing query with job ID: 0d81051d-af76-46f9-b83f-f4e32b6551ba
Query executing: 0.37s


ERROR:
 403 Access Denied: Table mgmt-467-4677:netflix.watch_history_dedup: User does not have permission to query table mgmt-467-4677:netflix.watch_history_dedup, or perhaps it does not exist.; reason: accessDenied, message: Access Denied: Table mgmt-467-4677:netflix.watch_history_dedup: User does not have permission to query table mgmt-467-4677:netflix.watch_history_dedup, or perhaps it does not exist.

Location: US
Job ID: 0d81051d-af76-46f9-b83f-f4e32b6551ba
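
The effect of P01/P99 capping can also be illustrated offline on a small list. The sketch below is a minimal illustration, not the lab's method: the durations are made up, the function names are hypothetical, and `statistics.quantiles` computes exact interpolated percentiles, whereas `APPROX_QUANTILES` in BigQuery returns approximate ones.

```python
import statistics


def cap_at_percentiles(values, lower_pct=1, upper_pct=99):
    """Clip values to the [P_lower, P_upper] range (winsorizing)."""
    # With n=100, quantiles() returns 99 cut points: index 0 is P01, index 98 is P99.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]


def summary(values):
    """Return min, median, and max, mirroring the verification query."""
    return {
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
    }


# Made-up watch durations in minutes, with one extreme session.
durations = [5, 20, 30, 45, 60, 75, 90, 120, 150, 2000]
capped = cap_at_percentiles(durations)

print("before_capping", summary(durations))
print("after_capping ", summary(capped))
```

In this toy data the median is unchanged while the minimum and maximum move toward the center, which is the pattern the before/after query is meant to reveal.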



**Reflection:** When might capping be harmful? Name a model type less sensitive to outliers and why.

### 5.4 Business anomaly flags — What & Why
Human-readable flags turn raw thresholds into named signals that support both product decisions and ML features (e.g., a binge-behavior flag for sessions over 8 hours).
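
As a rough illustration of what "count and percentage" means for a flag, the sketch below mirrors the `COUNTIF(...) / COUNT(*)` pattern with a non-null filter in plain Python. The function name `summarize_flag` and the sample session lengths are assumptions made up for this example; the 480-minute threshold is the 8-hour cutoff from the Build Prompt.

```python
def summarize_flag(values, is_flagged):
    """Return (count, pct) over non-null values for which is_flagged(value) is true."""
    present = [v for v in values if v is not None]
    if not present:
        # Avoid dividing by zero when every value is missing.
        return 0, 0.0
    count = sum(1 for v in present if is_flagged(v))
    return count, round(100 * count / len(present), 2)


# Made-up session lengths in minutes; None stands for a missing value.
sessions = [30, 95, None, 510, 45, 600, 120, 15]
count, pct = summarize_flag(sessions, lambda minutes: minutes > 480)
print(f"flag_binge: {count} sessions ({pct}%)")
```

The same helper works for the other flags by swapping the predicate, for example `lambda age: age < 10 or age > 100`.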

### Build Prompt
Generate **three BigQuery SQL cells** (adjust if columns differ):
1) In `watch_history_robust`, compute and summarize `flag_binge` for sessions > 8 hours.
2) In `users`, compute and summarize `flag_age_extreme` (age < 10 or > 100) if age can be parsed from `age_band`.
3) In `movies`, compute and summarize `flag_duration_anomaly` where `duration_min` < 15 or > 480 (if the column exists).
Each cell should output the count and percentage of flagged rows and include 1–2 comments.


In [24]:
%%bigquery
-- Compute and summarize flag_binge for sessions > 8 hours (480 minutes) in watch_history_robust.
-- The flag uses the raw watch_duration_minutes column: the capped column is clipped at P99,
-- so it can never exceed 480 when P99 is at or below 480.
SELECT
    COUNT(*) AS total_rows,
    COUNTIF(watch_duration_minutes > 480) AS binge_session_count,
    ROUND(100 * COUNTIF(watch_duration_minutes > 480) / COUNT(*), 2) AS pct_binge_sessions
FROM `mgmt-467-4677.netflix.watch_history_robust`
WHERE watch_duration_minutes IS NOT NULL;

Executing query with job ID: 1b95e758-6481-4bd6-a416-ffbec08c7331
Query executing: 0.44s


ERROR:
 403 Access Denied: Table mgmt-467-4677:netflix.watch_history_robust: User does not have permission to query table mgmt-467-4677:netflix.watch_history_robust, or perhaps it does not exist.; reason: accessDenied, message: Access Denied: Table mgmt-467-4677:netflix.watch_history_robust: User does not have permission to query table mgmt-467-4677:netflix.watch_history_robust, or perhaps it does not exist.

Location: US
Job ID: 1b95e758-6481-4bd6-a416-ffbec08c7331



In [25]:
%%bigquery
-- Compute and summarize flag_age_extreme (age < 10 or > 100) in users.
-- Counts users with extreme age values and reports them as a percentage of rows with a non-null age.
SELECT
    COUNT(*) AS total_rows,
    COUNTIF(age < 10 OR age > 100) AS extreme_age_count,
    ROUND(100 * COUNTIF(age < 10 OR age > 100) / COUNT(*), 2) AS pct_extreme_age
FROM `mgmt-467-4677.netflix.users`
WHERE age IS NOT NULL;

Executing query with job ID: 05c7ca14-647f-4fa7-ae67-b1c32e8d436c
Query executing: 0.43s


ERROR:
 403 Access Denied: Table mgmt-467-4677:netflix.users: User does not have permission to query table mgmt-467-4677:netflix.users, or perhaps it does not exist.; reason: accessDenied, message: Access Denied: Table mgmt-467-4677:netflix.users: User does not have permission to query table mgmt-467-4677:netflix.users, or perhaps it does not exist.

Location: US
Job ID: 05c7ca14-647f-4fa7-ae67-b1c32e8d436c



In [26]:
%%bigquery
-- Compute and summarize flag_duration_anomaly (duration_minutes < 15 or > 480) in movies.
-- Counts movies whose duration falls outside 15 to 480 minutes and reports them as a percentage of rows with a non-null duration.
SELECT
    COUNT(*) AS total_rows,
    COUNTIF(duration_minutes < 15 OR duration_minutes > 480) AS duration_anomaly_count,
    ROUND(100 * COUNTIF(duration_minutes < 15 OR duration_minutes > 480) / COUNT(*), 2) AS pct_duration_anomaly
FROM `mgmt-467-4677.netflix.movies`
WHERE duration_minutes IS NOT NULL;

Executing query with job ID: c41e2fd0-156a-48f1-89e7-839afe4191df
Query executing: 0.37s


ERROR:
 403 Access Denied: Table mgmt-467-4677:netflix.movies: User does not have permission to query table mgmt-467-4677:netflix.movies, or perhaps it does not exist.; reason: accessDenied, message: Access Denied: Table mgmt-467-4677:netflix.movies: User does not have permission to query table mgmt-467-4677:netflix.movies, or perhaps it does not exist.

Location: US
Job ID: c41e2fd0-156a-48f1-89e7-839afe4191df



### Verification Prompt
Generate a single compact summary query that returns one row per flag with two columns: `flag_name` and `pct_of_rows`.


In [27]:
%%bigquery
-- Verification query: one row per flag with two columns, flag_name and pct_of_rows.
-- Combines the three anomaly-flag percentages into a single summary table.
-- flag_binge uses the raw watch_duration_minutes column, because the capped column is clipped at P99.
SELECT
    'flag_binge' AS flag_name,
    ROUND(100 * COUNTIF(watch_duration_minutes > 480) / COUNT(*), 2) AS pct_of_rows
FROM `mgmt-467-4677.netflix.watch_history_robust`
WHERE watch_duration_minutes IS NOT NULL
UNION ALL
SELECT
    'flag_age_extreme' AS flag_name,
    ROUND(100 * COUNTIF(age < 10 OR age > 100) / COUNT(*), 2) AS pct_of_rows
FROM `mgmt-467-4677.netflix.users`
WHERE age IS NOT NULL
UNION ALL
SELECT
    'flag_duration_anomaly' AS flag_name,
    ROUND(100 * COUNTIF(duration_minutes < 15 OR duration_minutes > 480) / COUNT(*), 2) AS pct_of_rows
FROM `mgmt-467-4677.netflix.movies`
WHERE duration_minutes IS NOT NULL;

Executing query with job ID: 6ecb2a92-2341-4fcc-b56f-07caf01ea781
Query executing: 0.48s


ERROR:
 403 Access Denied: Table mgmt-467-4677:netflix.movies: User does not have permission to query table mgmt-467-4677:netflix.movies, or perhaps it does not exist.; reason: accessDenied, message: Access Denied: Table mgmt-467-4677:netflix.movies: User does not have permission to query table mgmt-467-4677:netflix.movies, or perhaps it does not exist.

Location: US
Job ID: 6ecb2a92-2341-4fcc-b56f-07caf01ea781



**Reflection:** Which anomaly flag is most common? Which would you keep as a feature and why?

## 6) Save & submit — What & Why
Reproducibility: save artifacts and document decisions so others can rerun and audit.

### Build Prompt
Generate a checklist (Markdown) students can paste at the end:
- Save this notebook to the team Drive.
- Export a `.sql` file with your DQ queries and save to repo.
- Push notebook + SQL to the **team GitHub** with a descriptive commit.
- Add a README with your `PROJECT_ID`, `REGION`, bucket, dataset, and today’s row counts.
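
The "export a `.sql` file" step could be scripted along these lines. This is only a sketch under stated assumptions: the helper `export_queries`, the file name `dq_queries.sql`, and the placeholder query are all hypothetical, and the file is written to the system temp directory purely for demonstration.

```python
from pathlib import Path
import tempfile


def export_queries(queries, path):
    """Write named SQL statements to one .sql file, each under a comment header."""
    parts = []
    for name, sql in queries.items():
        # Normalize so every statement ends with exactly one semicolon.
        statement = sql.strip().rstrip(";")
        parts.append(f"-- {name}\n{statement};\n")
    path = Path(path)
    path.write_text("\n".join(parts), encoding="utf-8")
    return path


# Hypothetical query collection; paste your own DQ queries here.
dq_queries = {
    "users_row_count": "SELECT COUNT(*) AS row_count FROM `your-project.netflix.users`",
}
out = export_queries(dq_queries, Path(tempfile.gettempdir()) / "dq_queries.sql")
print(out.read_text(encoding="utf-8"))
```

Keeping the queries in one dictionary makes it easy to regenerate the file whenever a query changes, which supports the rerun-and-audit goal of this section.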


## Grading rubric (quick)
- Profiling completeness (30)  
- Cleaning policy correctness & reproducibility (40)  
- Reflection/insight (20)  
- Hygiene (naming, verification, idempotence) (10)
