<a href="https://colab.research.google.com/github/elebon26/mgmt467-analytics-portfolio/blob/main/Labs/Unit2/(Completed)_Ethan_Lebon_Unit2_Lab1_PromptPlusExamples_Colab_Kaggle_GCS_BQ_DQ.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MGMT 467 — Prompt-Driven Lab (with Commented Examples)
## Kaggle ➜ Google Cloud Storage ➜ BigQuery ➜ Data Quality (DQ)

**How to use this notebook**
- Each section gives you a **Build Prompt** to paste into Gemini/Vertex AI (or Gemini in Colab).
- Below each prompt, you’ll see a **commented example** of what a good LLM answer might look like.
- **Do not** just uncomment and run. Use the prompt to generate your own code, then compare to the example.
- After every step, run the **Verification Prompt**, and write the **Reflection** in Markdown.

> Goal today: Download the Netflix dataset (Kaggle) → Stage on GCS → Load into BigQuery → Run DQ profiling (missingness, duplicates, outliers, anomaly flags).


### Academic integrity & LLM usage
- Use the prompts here to generate your own code cells.
- Read concept notes and write the reflection answers in your own words.
- Keep credentials out of code. Upload `kaggle.json` when asked.


## Learning objectives
1) Explain **why** we stage data in GCS and load it to BigQuery.  
2) Build an **idempotent**, auditable pipeline.  
3) Diagnose **missingness**, **duplicates**, and **outliers** and justify cleaning choices.  
4) Connect DQ decisions to **business/ML impact**.


## 0) Environment setup — What & Why
Authenticate Colab to Google Cloud so we can use `gcloud`, GCS, and BigQuery. Set **PROJECT_ID** and **REGION** once for consistency (cost/latency).

### Build Prompt (paste to LLM)
You are my cloud TA. Generate a single **Colab code cell** that:
1) Authenticates to Google Cloud in Colab,  
2) Prompts for `PROJECT_ID` via `input()` and sets `REGION="us-central1"` (editable),  
3) Exports `GOOGLE_CLOUD_PROJECT`,  
4) Runs `gcloud config set project $GOOGLE_CLOUD_PROJECT`,  
5) Prints both values. Add 2–3 comments explaining what/why.
End with a comment: `# Done: Auth + Project/Region set`.


In [4]:
# Authenticate to Google Cloud in Colab
# This allows Colab to access your GCP resources
from google.colab import auth
auth.authenticate_user()

import os

# Prompt for Project ID and set Region
# PROJECT_ID is required for all GCP operations; REGION keeps resources consistent
PROJECT_ID = input("Enter your GCP Project ID: ").strip()
REGION = "us-central1"  # keep consistent; change if instructed

# Export PROJECT_ID as an environment variable
os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID

# Set the active project for gcloud commands
!gcloud config set project $GOOGLE_CLOUD_PROJECT

# Print the set values for verification
print("Project:", PROJECT_ID, "| Region:", REGION)

# Done: Auth + Project/Region set

Enter your GCP Project ID: mgmt-467-1234
Updated property [core/project].
Project: mgmt-467-1234 | Region: us-central1


In [24]:
PROJECT_ID = "mgmt-467-1234"
!gcloud config set project $PROJECT_ID
%env GOOGLE_CLOUD_PROJECT=$PROJECT_ID


Updated property [core/project].
env: GOOGLE_CLOUD_PROJECT=mgmt-467-1234


In [25]:
from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID)
print("Connected to project:", PROJECT_ID)


Connected to project: mgmt-467-1234


In [5]:
# # EXAMPLE (from LLM) — Auth + Project/Region (commented; write your own cell using the prompt)
# # from google.colab import auth
# # auth.authenticate_user()
# #
# # import os
# # PROJECT_ID = input("Enter your GCP Project ID: ").strip()
# # REGION = "us-central1"  # keep consistent; change if instructed
# # os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID
# # print("Project:", PROJECT_ID, "| Region:", REGION)
# #
# # # Set active project for gcloud/BigQuery CLI
# # !gcloud config set project $GOOGLE_CLOUD_PROJECT
# # !gcloud config get-value project
# # # Done: Auth + Project/Region set

### Verification Prompt
Generate a short cell that prints the active project using `gcloud config get-value project` and echoes the `REGION` you set.


In [6]:
# Verify the active project and region
!gcloud config get-value project
!echo "Region: $REGION"

mgmt-467-1234
Region: us-central1


**Reflection:** Why do we set `PROJECT_ID` and `REGION` at the top? What can go wrong if we don’t?

**Reflection:** Setting `PROJECT_ID` and `REGION` at the top of the notebook ensures consistency and reproducibility for all subsequent cloud operations. If these are not set, or are set inconsistently, you could encounter errors related to resource location, permissions, or even create resources in unintended projects or regions, leading to confusion, increased costs, or difficulty managing resources. It provides a single point of control for where your cloud resources will be created and accessed.

## 1) Kaggle API — What & Why
Use Kaggle CLI for reproducible downloads. Store `kaggle.json` at `~/.kaggle/kaggle.json` with `0600` permissions to protect secrets.

### Build Prompt
Generate a **single Colab code cell** that:
- Prompts me to upload `kaggle.json`,
- Saves to `~/.kaggle/kaggle.json` with `0600` permissions,
- Prints `kaggle --version`.
Add comments about security and reproducibility.


In [8]:
# Prompt the user to upload their kaggle.json file
# This file contains your Kaggle API credentials
from google.colab import files
print("Upload your kaggle.json file (from Kaggle > Account > Create New API Token)")
uploaded = files.upload()

import os

# Ensure the .kaggle directory exists and save the uploaded file
# Storing the API key here is required for the kaggle CLI to work
os.makedirs('/root/.kaggle', exist_ok=True)
with open('/root/.kaggle/kaggle.json', 'wb') as f:
    f.write(uploaded[list(uploaded.keys())[0]])

# Set file permissions to owner-only (0600) for security
# This prevents other users or processes from reading your API key
os.chmod('/root/.kaggle/kaggle.json', 0o600)

# Verify the Kaggle installation by printing the version
# This confirms the CLI is ready to use
!kaggle --version

Upload your kaggle.json file (from Kaggle > Account > Create New API Token)


Saving kaggle (1).json to kaggle (1).json
Kaggle API 1.7.4.5


In [None]:
# # EXAMPLE (from LLM) — Kaggle setup (commented)
# # from google.colab import files
# # print("Upload your kaggle.json (Kaggle > Account > Create New API Token)")
# # uploaded = files.upload()
# #
# # import os
# # os.makedirs('/root/.kaggle', exist_ok=True)
# # with open('/root/.kaggle/kaggle.json', 'wb') as f:
# #     f.write(uploaded[list(uploaded.keys())[0]])
# # os.chmod('/root/.kaggle/kaggle.json', 0o600)  # owner-only
# #
# # !kaggle --version

### Verification Prompt
Generate a one-liner that runs `kaggle --help | head -n 20` to show the CLI is ready.


In [9]:
# Verify the Kaggle CLI is ready by showing the first 20 lines of the help output
!kaggle --help | head -n 20

usage: kaggle [-h] [-v] [-W]
              {competitions,c,datasets,d,kernels,k,models,m,files,f,config}
              ...

options:
  -h, --help            show this help message and exit
  -v, --version         Print the Kaggle API version

commands:
  {competitions,c,datasets,d,kernels,k,models,m,files,f,config}
                        Use one of:
                        competitions {list, files, download, submit, submissions, leaderboard}
                        datasets {list, files, download, create, version, init, metadata, status}
                        kernels {list, files, init, push, pull, output, status}
                        models {instances, get, list, init, create, delete, update}
                        models instances {versions, get, files, init, create, delete, update}
                        models instances versions {init, create, download, delete, files}
                        config {view, set, unset}
    competitions (c)    Commands related to Kaggle compe

**Reflection:** Why require strict `0600` permissions on API tokens? What risks are we avoiding?

**Reflection:** Requiring strict `0600` permissions on API tokens (owner-only read/write) is crucial for security. It prevents unauthorized access to your credentials by other users or processes on the system. If an attacker or malicious program were to gain access to your API token, they could potentially perform actions on your behalf, such as accessing private data, making purchases, or incurring costs on cloud services. By setting `0600` permissions, we significantly reduce the attack surface and protect against the compromise of your accounts and data.

## 2) Download & unzip dataset — What & Why
Keep raw files under `/content/data/raw` for predictable paths and auditing.
**Dataset:** `sayeeduddin/netflix-2025user-behavior-dataset-210k-records`

### Build Prompt
Generate a **Colab code cell** that:
- Creates `/content/data/raw`,
- Downloads the dataset to `/content/data` with Kaggle CLI,
- Unzips into `/content/data/raw` (overwrite OK),
- Lists all CSVs with sizes in a neat table.
Include comments describing each step.


In [10]:
# Create directory for raw data
# This ensures a predictable location for the downloaded files
!mkdir -p /content/data/raw

# Download the dataset using Kaggle CLI
# The -d flag specifies the dataset, -p specifies the download path
!kaggle datasets download -d sayeeduddin/netflix-2025user-behavior-dataset-210k-records -p /content/data

# Unzip the downloaded file into the raw data directory
# The -o flag allows overwriting existing files
!unzip -o /content/data/*.zip -d /content/data/raw

# List all CSV files in the raw data directory with their sizes
# This helps to verify the download and unzipping process
!ls -lh /content/data/raw/*.csv

Dataset URL: https://www.kaggle.com/datasets/sayeeduddin/netflix-2025user-behavior-dataset-210k-records
License(s): CC0-1.0
Downloading netflix-2025user-behavior-dataset-210k-records.zip to /content/data
  0% 0.00/4.02M [00:00<?, ?B/s]
100% 4.02M/4.02M [00:00<00:00, 776MB/s]
Archive:  /content/data/netflix-2025user-behavior-dataset-210k-records.zip
  inflating: /content/data/raw/README.md  
  inflating: /content/data/raw/movies.csv  
  inflating: /content/data/raw/recommendation_logs.csv  
  inflating: /content/data/raw/reviews.csv  
  inflating: /content/data/raw/search_logs.csv  
  inflating: /content/data/raw/users.csv  
  inflating: /content/data/raw/watch_history.csv  
-rw-r--r-- 1 root root 114K Aug  2 19:36 /content/data/raw/movies.csv
-rw-r--r-- 1 root root 4.5M Aug  2 19:36 /content/data/raw/recommendation_logs.csv
-rw-r--r-- 1 root root 1.8M Aug  2 19:36 /content/data/raw/reviews.csv
-rw-r--r-- 1 root root 2.2M Aug  2 19:36 /content/data/raw/search_logs.csv
-rw-r--r-- 1 root 

In [11]:
# # EXAMPLE (from LLM) — Download & unzip (commented)
# # !mkdir -p /content/data/raw
# # !kaggle datasets download -d sayeeduddin/netflix-2025user-behavior-dataset-210k-records -p /content/data
# # !unzip -o /content/data/*.zip -d /content/data/raw
# # # List CSV inventory
# # !ls -lh /content/data/raw/*.csv

### Verification Prompt
Generate a snippet that asserts there are exactly **six** CSV files and prints their names.


In [12]:
# Verify the number of CSV files and print their names
import glob
csv_files = glob.glob('/content/data/raw/*.csv')
assert len(csv_files) == 6, f"Expected 6 CSV files, but found {len(csv_files)}"
print("Found 6 CSV files:")
for csv_file in csv_files:
    print(csv_file)

Found 6 CSV files:
/content/data/raw/movies.csv
/content/data/raw/watch_history.csv
/content/data/raw/search_logs.csv
/content/data/raw/users.csv
/content/data/raw/recommendation_logs.csv
/content/data/raw/reviews.csv


**Reflection:** Why is keeping a clean file inventory (names, sizes) useful downstream?

**Reflection:** Keeping a clean file inventory with names and sizes is useful downstream for several reasons:
1. **Auditing and Reproducibility:** It provides a clear record of the raw data files used, making it easier to audit the data pipeline and reproduce results later.
2. **Data Validation:** You can quickly verify that all expected files were downloaded and that their sizes are within a reasonable range, catching potential download errors early.
3. **Downstream Processing:** Knowing the exact filenames is essential for scripting subsequent steps, such as loading data into databases or processing files individually.
4. **Troubleshooting:** If issues arise later in the pipeline, the file inventory helps pinpoint whether the problem originated during the data download or unzipping phase.

## 3) Create GCS bucket & upload — What & Why
Stage in GCS → consistent, versionable source for BigQuery loads. Bucket names must be **globally unique**.

### Build Prompt
Generate a **Colab code cell** that:
- Creates a unique bucket in `${REGION}` (random suffix),
- Saves name to `BUCKET_NAME` env var,
- Uploads all CSVs to `gs://$BUCKET_NAME/netflix/`,
- Prints the bucket name and explains staging benefits.


In [None]:
# # EXAMPLE (from LLM) — GCS staging (commented)
# # import uuid, os
# # bucket_name = f"mgmt467-netflix-{uuid.uuid4().hex[:8]}"
# # os.environ["BUCKET_NAME"] = bucket_name
# # !gcloud storage buckets create gs://$BUCKET_NAME --location=$REGION
# # !gcloud storage cp /content/data/raw/* gs://$BUCKET_NAME/netflix/
# # print("Bucket:", bucket_name)
# # # Verify contents
# # !gcloud storage ls gs://$BUCKET_NAME/netflix/

In [13]:
# Create GCS bucket (robust to location policies) and upload CSVs
import os, uuid, subprocess, sys

bucket_name = f"mgmt467-netflix-{uuid.uuid4().hex[:8]}"
os.environ["BUCKET_NAME"] = bucket_name

project = os.environ.get("GOOGLE_CLOUD_PROJECT", "").strip()
if not project:
    raise SystemExit("GOOGLE_CLOUD_PROJECT is not set. Run the Auth/Project cell first.")

# Try your REGION first, then fall back to common US locations (policy-friendly)
preferred = os.environ.get("REGION", "").strip()
candidates = [loc for loc in [preferred, "US", "us-east1", "us-central1"] if loc]

created = False
for loc in candidates:
    print(f"→ Trying to create gs://{bucket_name} in location '{loc}' ...")
    cmd = [
        "gcloud","storage","buckets","create", f"gs://{bucket_name}",
        f"--location={loc}", f"--project={project}"
    ]
    res = subprocess.run(cmd, capture_output=True, text=True)
    if res.returncode == 0:
        print(f"✅ Bucket created in '{loc}'.")
        created = True
        chosen_loc = loc
        break
    else:
        print(f"   Failed: {res.stderr.strip()}")

if not created:
    raise SystemExit("❌ Could not create a bucket in any candidate location. "
                     "Ask your admin for the allowed Storage locations/org policy.")

# Upload all CSVs and list them
upload = subprocess.run(
    ["gcloud","storage","cp","/content/data/raw/*.csv", f"gs://{bucket_name}/netflix/"],
    text=True
)
# If shell-style glob didn’t expand, do a Python loop:
if upload.returncode != 0:
    import glob
    for p in glob.glob("/content/data/raw/*.csv"):
        subprocess.check_call(["gcloud","storage","cp", p, f"gs://{bucket_name}/netflix/"])

print(f"\nBucket: {bucket_name} | Project: {project} | Location: {chosen_loc}")
print("Inventory under gs://{}/netflix/:".format(bucket_name))
subprocess.check_call(["gcloud","storage","ls","-l", f"gs://{bucket_name}/netflix/"])


→ Trying to create gs://mgmt467-netflix-066b19a6 in location 'US' ...
✅ Bucket created in 'US'.

Bucket: mgmt467-netflix-066b19a6 | Project: mgmt-467-1234 | Location: US
Inventory under gs://mgmt467-netflix-066b19a6/netflix/:


0

### Verification Prompt
Generate a snippet that lists the `netflix/` prefix and shows object sizes.


In [14]:
# List objects in the 'netflix/' prefix of the bucket with sizes
# This verifies the files were uploaded correctly
import os
bucket_name = os.environ.get("BUCKET_NAME")
if bucket_name:
    !gcloud storage ls -l gs://{bucket_name}/netflix/
else:
    print("BUCKET_NAME environment variable not set. Please run the previous cell to create the bucket.")

    115942  2025-10-26T16:10:28Z  gs://mgmt467-netflix-066b19a6/netflix/movies.csv
   4695557  2025-10-26T16:10:28Z  gs://mgmt467-netflix-066b19a6/netflix/recommendation_logs.csv
   1861942  2025-10-26T16:10:28Z  gs://mgmt467-netflix-066b19a6/netflix/reviews.csv
   2250902  2025-10-26T16:10:28Z  gs://mgmt467-netflix-066b19a6/netflix/search_logs.csv
   1606820  2025-10-26T16:10:28Z  gs://mgmt467-netflix-066b19a6/netflix/users.csv
   9269425  2025-10-26T16:10:28Z  gs://mgmt467-netflix-066b19a6/netflix/watch_history.csv
TOTAL: 6 objects, 19800588 bytes (18.88MiB)


**Reflection:** Name two benefits of staging in GCS vs loading directly from local Colab.

## 4) BigQuery dataset & loads — What & Why
Create dataset `netflix` and load six CSVs with **autodetect** for speed (we’ll enforce schemas later).

### Build Prompt (two cells)
**Cell A:** Create (idempotently) dataset `netflix` in US multi-region; if it exists, print a friendly message.  
**Cell B:** Load tables from `gs://$BUCKET_NAME/netflix/`:
`users, movies, watch_history, recommendation_logs, search_logs, reviews`
with `--skip_leading_rows=1 --autodetect --source_format=CSV`.
Finish with row-count queries for each table.


In [None]:
# Create BigQuery dataset 'netflix' in US multi-region
# The '--location=US' flag specifies the multi-region
# The '-d' flag specifies dataset creation
# The '--description' adds a description
# The '|| echo "Dataset may already exist."' handles idempotency by printing a message if it fails (i.e., dataset exists)
DATASET = "netflix"
!bq --location=US mk -d --description "MGMT467 Netflix dataset" $DATASET || echo "Dataset may already exist."

In [None]:
# # EXAMPLE (from LLM) — BigQuery dataset (commented)
# # DATASET="netflix"
# # # Attempt to create; ignore if exists
# # !bq --location=US mk -d --description "MGMT467 Netflix dataset" $DATASET || echo "Dataset may already exist."

In [15]:
# Load tables from GCS → BigQuery and show row counts (location-aware, safe quoting)
import os, subprocess

DATASET = "netflix"
project  = os.environ["GOOGLE_CLOUD_PROJECT"]
bucket   = os.environ["BUCKET_NAME"]

# Discover the bucket (and hence dataset) location once
BQ_LOC = subprocess.check_output(
    ["gcloud","storage","buckets","describe", f"gs://{bucket}", "--format=value(location)"],
    text=True
).strip()
print(f"Project={project}  Dataset={DATASET}  Location={BQ_LOC}")

tables = {
    "users": "users.csv",
    "movies": "movies.csv",
    "watch_history": "watch_history.csv",
    "recommendation_logs": "recommendation_logs.csv",
    "search_logs": "search_logs.csv",
    "reviews": "reviews.csv",
}

# Load each table
for tbl, fname in tables.items():
    src = f"gs://{bucket}/netflix/{fname}"
    print(f"\nLoading {tbl} from {src}")
    !bq --location="$BQ_LOC" load \
        --skip_leading_rows=1 --autodetect --source_format=CSV \
        {DATASET}.{tbl} {src}

# Row counts for each table (single query; single quotes so backticks are literal)
sql = f'''
SELECT 'users' AS table_name, COUNT(*) AS row_count
FROM `{project}.{DATASET}.users`
UNION ALL
SELECT 'movies', COUNT(*) FROM `{project}.{DATASET}.movies`
UNION ALL
SELECT 'watch_history', COUNT(*) FROM `{project}.{DATASET}.watch_history`
UNION ALL
SELECT 'recommendation_logs', COUNT(*) FROM `{project}.{DATASET}.recommendation_logs`
UNION ALL
SELECT 'search_logs', COUNT(*) FROM `{project}.{DATASET}.search_logs`
UNION ALL
SELECT 'reviews', COUNT(*) FROM `{project}.{DATASET}.reviews`
ORDER BY table_name
'''
print("\nRow counts:")
# Use a temp file to avoid any quoting surprises
with open("rowcounts.sql","w") as f: f.write(sql)
!bq --location="$BQ_LOC" query --nouse_legacy_sql < rowcounts.sql


Project=mgmt-467-1234  Dataset=netflix  Location=US

Loading users from gs://mgmt467-netflix-066b19a6/netflix/users.csv
Waiting on bqjob_r1543f1d0e568004f_0000019a21496790_1 ... (1s) Current status: DONE   

Loading movies from gs://mgmt467-netflix-066b19a6/netflix/movies.csv
Waiting on bqjob_r2236b2191eafe734_0000019a21497e8e_1 ... (1s) Current status: DONE   

Loading watch_history from gs://mgmt467-netflix-066b19a6/netflix/watch_history.csv
Waiting on bqjob_r6b1ad786a77e27ba_0000019a214995f9_1 ... (2s) Current status: DONE   

Loading recommendation_logs from gs://mgmt467-netflix-066b19a6/netflix/recommendation_logs.csv
Waiting on bqjob_r2227f3aa03a3d524_0000019a2149b2c4_1 ... (2s) Current status: DONE   

Loading search_logs from gs://mgmt467-netflix-066b19a6/netflix/search_logs.csv
Waiting on bqjob_r63a033e154e150a5_0000019a2149d027_1 ... (2s) Current status: DONE   

Loading reviews from gs://mgmt467-netflix-066b19a6/netflix/reviews.csv
Waiting on bqjob_r1f43fef83e7ddac0_0000019a

In [None]:
# # EXAMPLE (from LLM) — Load tables (commented)
# # tables = {
# #   "users": "users.csv",
# #   "movies": "movies.csv",
# #   "watch_history": "watch_history.csv",
# #   "recommendation_logs": "recommendation_logs.csv",
# #   "search_logs": "search_logs.csv",
# #   "reviews": "reviews.csv",
# # }
# # import os
# # for tbl, fname in tables.items():
# #   src = f"gs://{os.environ['BUCKET_NAME']}/netflix/{fname}"
# #   print("Loading", tbl, "from", src)
# #   !bq load --skip_leading_rows=1 --autodetect --source_format=CSV $DATASET.$tbl $src
# #
# # # Row counts
# # for tbl in tables.keys():
# #   !bq query --nouse_legacy_sql "SELECT '{tbl}' AS table_name, COUNT(*) AS n FROM `${GOOGLE_CLOUD_PROJECT}.netflix.{tbl}`".format(tbl=tbl)

### Verification Prompt
Generate a single query that returns `table_name, row_count` for all six tables in `${GOOGLE_CLOUD_PROJECT}.netflix`.


In [16]:
import os

project_id = "mgmt-467-1234"

sql = f"""
SELECT 'users' AS table_name, COUNT(*) AS row_count
FROM `{project_id}.netflix.users`
UNION ALL
SELECT 'movies', COUNT(*) FROM `{project_id}.netflix.movies`
UNION ALL
SELECT 'watch_history', COUNT(*) FROM `{project_id}.netflix.watch_history`
UNION ALL
SELECT 'recommendation_logs', COUNT(*) FROM `{project_id}.netflix.recommendation_logs`
UNION ALL
SELECT 'search_logs', COUNT(*) FROM `{project_id}.netflix.search_logs`
UNION ALL
SELECT 'reviews', COUNT(*) FROM `{project_id}.netflix.reviews`
ORDER BY table_name;
"""

# Save the query to a temp file and run it with bq
with open("verify_row_counts.sql", "w") as f:
    f.write(sql)

!bq query --nouse_legacy_sql < verify_row_counts.sql


+---------------------+-----------+
|     table_name      | row_count |
+---------------------+-----------+
| movies              |      3120 |
| recommendation_logs |    156000 |
| reviews             |     46350 |
| search_logs         |     79500 |
| users               |     30900 |
| watch_history       |    315000 |
+---------------------+-----------+


In [22]:
%%bigquery
-- Total rows and % missing in region, plan_tier, age_band from users
SELECT
    COUNT(*) AS total_rows,
    COUNTIF(region IS NULL) AS missing_region,
    ROUND(SAFE_DIVIDE(COUNTIF(region IS NULL), COUNT(*)) * 100, 2) AS pct_missing_region,
    COUNTIF(plan_tier IS NULL) AS missing_plan_tier,
    ROUND(SAFE_DIVIDE(COUNTIF(plan_tier IS NULL), COUNT(*)) * 100, 2) AS pct_missing_plan_tier,
    COUNTIF(age_band IS NULL) AS missing_age_band,
    ROUND(SAFE_DIVIDE(COUNTIF(age_band IS NULL), COUNT(*)) * 100, 2) AS pct_missing_age_band
FROM
    `mgmt-467-1234.netflix.users`

Executing query with job ID: dc6e6482-1a0d-4e9c-a970-3245d477c28a
Query executing: 0.38s


ERROR:
 400 Unrecognized name: region at [4:13]; reason: invalidQuery, location: query, message: Unrecognized name: region at [4:13]

Location: US
Job ID: dc6e6482-1a0d-4e9c-a970-3245d477c28a



\**Reflection:** When is `autodetect` acceptable? When should you enforce explicit schemas and why?

## 5) Data Quality (DQ) — Concepts we care about
- **Missingness** (MCAR/MAR/MNAR). Impute vs drop. Add `is_missing_*` indicators.
- **Duplicates** (exact vs near). Double-counted engagement corrupts labels & KPIs.
- **Outliers** (IQR). Winsorize/cap vs robust models. Always **flag** and explain.
- **Reproducibility**. Prefer `CREATE OR REPLACE` and deterministic keys.


### 5.1 Missingness (users) — What & Why
Measure % missing and check if missingness depends on another variable (MAR) → potential bias & instability.

### Build Prompt
Generate **two BigQuery SQL cells**:
1) Total rows and % missing in `region`, `plan_tier`, `age_band` from `users`.
2) `% plan_tier missing by region` ordered descending. Add comments on MAR.


In [30]:
# Install/upgrade BigQuery libs
!pip -q install --upgrade google-cloud-bigquery pandas-gbq db-dtypes

# Authenticate
from google.colab import auth
auth.authenticate_user()

# Set project (Python var + env, and tell the BigQuery magics)
PROJECT_ID = "mgmt-467-1234"

import os
os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID

from google.cloud.bigquery.magics import context
context.project = PROJECT_ID

print("Using project:", context.project)


Using project: mgmt-467-1234


In [31]:
%%bigquery --project {PROJECT_ID}
--standardSQL
WITH u AS (
  SELECT
    NULLIF(TRIM(country), '')                AS region,
    NULLIF(TRIM(subscription_plan), '')      AS plan_tier,
    CASE
      WHEN age IS NULL THEN NULL
      WHEN age < 18 THEN 'under_18'
      WHEN age BETWEEN 18 AND 24 THEN '18-24'
      WHEN age BETWEEN 25 AND 34 THEN '25-34'
      WHEN age BETWEEN 35 AND 44 THEN '35-44'
      WHEN age BETWEEN 45 AND 54 THEN '45-54'
      WHEN age BETWEEN 55 AND 64 THEN '55-64'
      WHEN age >= 65 THEN '65_plus'
      ELSE NULL
    END AS age_band
  FROM `mgmt-467-1234.netflix.users`
)
SELECT
  COUNT(*) AS total_rows,
  COUNTIF(region IS NULL) AS missing_region,
  ROUND(100 * SAFE_DIVIDE(COUNTIF(region IS NULL), COUNT(*)), 2) AS pct_missing_region,
  COUNTIF(plan_tier IS NULL) AS missing_plan_tier,
  ROUND(100 * SAFE_DIVIDE(COUNTIF(plan_tier IS NULL), COUNT(*)), 2) AS pct_missing_plan_tier,
  COUNTIF(age_band IS NULL) AS missing_age_band,
  ROUND(100 * SAFE_DIVIDE(COUNTIF(age_band IS NULL), COUNT(*)), 2) AS pct_missing_age_band
FROM u;


Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,total_rows,missing_region,pct_missing_region,missing_plan_tier,pct_missing_plan_tier,missing_age_band,pct_missing_age_band
0,30900,0,0.0,0,0.0,3687,11.93


In [32]:
%%bigquery --project {PROJECT_ID}
--standardSQL
WITH u AS (
  SELECT
    NULLIF(TRIM(country), '')           AS region,
    NULLIF(TRIM(subscription_plan), '') AS plan_tier
  FROM `mgmt-467-1234.netflix.users`
)
SELECT
  COALESCE(region, '∅ (NULL/blank)') AS region,
  COUNT(*)                           AS n_rows,
  ROUND(100 * SAFE_DIVIDE(COUNTIF(plan_tier IS NULL), COUNT(*)), 2)
    AS pct_plan_tier_missing
FROM u
GROUP BY region
ORDER BY pct_plan_tier_missing DESC, n_rows DESC;


Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,region,n_rows,pct_plan_tier_missing
0,USA,21612,0.0
1,Canada,9288,0.0


In [None]:
# # EXAMPLE (from LLM) — MAR by region (commented)
# # SELECT region,
# #        COUNT(*) AS n,
# #        ROUND(100*COUNTIF(plan_tier IS NULL)/COUNT(*),2) AS pct_missing_plan_tier
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.users`
# # GROUP BY region
# # ORDER BY pct_missing_plan_tier DESC;

### Verification Prompt
Generate a query that prints the three missingness percentages from (1), rounded to two decimals.


In [34]:
%%bigquery --project mgmt-467-1234
--standardSQL
-- Verification Prompt — print three missingness percentages (region, plan_tier, age_band)
WITH base AS (
  SELECT
    COUNT(*) AS n,
    COUNTIF(country IS NULL OR TRIM(country) = '')        AS miss_region,
    COUNTIF(subscription_plan IS NULL OR TRIM(subscription_plan) = '') AS miss_plan,
    COUNTIF(age IS NULL)                                  AS miss_age
  FROM `mgmt-467-1234.netflix.users`
)
SELECT
  ROUND(100 * miss_region / n, 2) AS pct_missing_region,
  ROUND(100 * miss_plan   / n, 2) AS pct_missing_plan_tier,
  ROUND(100 * miss_age    / n, 2) AS pct_missing_age_band
FROM base;


Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,pct_missing_region,pct_missing_plan_tier,pct_missing_age_band
0,0.0,0.0,11.93


**Reflection:** Which columns are most missing? Hypothesize MCAR/MAR/MNAR and why.

### 5.2 Duplicates (watch_history) — What & Why
Find exact duplicate interaction records and keep **one best** per group (deterministic policy).

### Build Prompt
Generate **two BigQuery SQL cells**:
1) Report duplicate groups on `(user_id, movie_id, event_ts, device_type)` with counts (top 20).
2) Create table `watch_history_dedup` that keeps one row per group (prefer higher `progress_ratio`, then `minutes_watched`). Add comments.


In [41]:
%%bigquery --project mgmt-467-1234
--standardSQL
-- Find duplicate groups on (user_id, movie_id, watch_date, device_type)
SELECT
  user_id,
  movie_id,
  watch_date,
  device_type,
  COUNT(*) AS dup_count
FROM `mgmt-467-1234.netflix.watch_history`
GROUP BY user_id, movie_id, watch_date, device_type
HAVING dup_count > 1
ORDER BY dup_count DESC
LIMIT 20;


Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,user_id,movie_id,watch_date,device_type,dup_count
0,user_00391,movie_0893,2024-08-26,Laptop,12
1,user_03310,movie_0640,2024-09-08,Smart TV,12
2,user_01143,movie_0166,2024-05-28,Laptop,9
3,user_07594,movie_0133,2025-03-24,Laptop,9
4,user_08681,movie_0332,2024-06-13,Laptop,9
5,user_06462,movie_0588,2025-02-10,Laptop,9
6,user_01469,movie_0237,2025-01-17,Laptop,9
7,user_06103,movie_0113,2025-04-08,Laptop,9
8,user_01383,movie_0015,2025-04-29,Desktop,9
9,user_07738,movie_0793,2025-07-28,Desktop,9


In [None]:
# # EXAMPLE (from LLM) — Detect duplicate groups (commented)
# # SELECT user_id, movie_id, event_ts, device_type, COUNT(*) AS dup_count
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history`
# # GROUP BY user_id, movie_id, event_ts, device_type
# # HAVING dup_count > 1
# # ORDER BY dup_count DESC
# # LIMIT 20;

In [45]:
%%bigquery --project mgmt-467-1234
--standardSQL
-- Keep one row per (user_id, movie_id, watch_date, device_type)
-- Prefer higher progress_percentage, then watch_duration_minutes
CREATE OR REPLACE TABLE `mgmt-467-1234.netflix.watch_history_dedup` AS
SELECT * EXCEPT(rk)
FROM (
  SELECT
    h.*,
    ROW_NUMBER() OVER (
      PARTITION BY user_id, movie_id, watch_date, device_type
      ORDER BY progress_percentage DESC, watch_duration_minutes DESC
    ) AS rk
  FROM `mgmt-467-1234.netflix.watch_history` AS h
)
WHERE rk = 1;


Query is running:   0%|          |

In [None]:
# # EXAMPLE (from LLM) — Keep-one policy (commented)
# # CREATE OR REPLACE TABLE `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup` AS
# # SELECT * EXCEPT(rk) FROM (
# #   SELECT h.*,
# #          ROW_NUMBER() OVER (
# #            PARTITION BY user_id, movie_id, event_ts, device_type
# #            ORDER BY progress_ratio DESC, minutes_watched DESC
# #          ) AS rk
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history` h
# # )
# # WHERE rk = 1;

### Verification Prompt
Generate a before/after count query comparing raw vs `watch_history_dedup`.


In [46]:
%%bigquery --project mgmt-467-1234
--standardSQL
SELECT 'raw_watch_history' AS table_name, COUNT(*) AS row_count
FROM `mgmt-467-1234.netflix.watch_history`
UNION ALL
SELECT 'watch_history_dedup' AS table_name, COUNT(*) AS row_count
FROM `mgmt-467-1234.netflix.watch_history_dedup`;


Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,table_name,row_count
0,raw_watch_history,315000
1,watch_history_dedup,100000


**Reflection:** Why do duplicates arise (natural vs system-generated)? How do
they corrupt labels and KPIs?

### 5.3 Outliers (minutes_watched) — What & Why
Estimate extreme values via IQR; report % outliers; **winsorize** to P01/P99 for robustness while also **flagging** extremes.

### Build Prompt
Generate **two BigQuery SQL cells**:
1) Compute IQR bounds for `minutes_watched` on `watch_history_dedup` and report % outliers.
2) Create `watch_history_robust` with `minutes_watched_capped` capped at P01/P99; return quantile summaries before/after.


In [49]:
%%bigquery --project mgmt-467-1234
--standardSQL
-- 5.3 — IQR outlier rate for watch_duration_minutes

WITH dist AS (
  SELECT
    APPROX_QUANTILES(watch_duration_minutes, 4)[OFFSET(1)] AS q1,   -- 25th
    APPROX_QUANTILES(watch_duration_minutes, 4)[OFFSET(3)] AS q3    -- 75th
  FROM `mgmt-467-1234.netflix.watch_history_dedup`
),
bounds AS (
  SELECT
    q1,
    q3,
    (q3 - q1) AS iqr,
    q1 - 1.5 * (q3 - q1) AS lo_bound,
    q3 + 1.5 * (q3 - q1) AS hi_bound
  FROM dist
),
agg AS (
  SELECT
    COUNTIF(h.watch_duration_minutes < b.lo_bound
         OR h.watch_duration_minutes > b.hi_bound) AS outlier_rows,
    COUNT(*) AS total_rows
  FROM `mgmt-467-1234.netflix.watch_history_dedup` h
  CROSS JOIN bounds b
)
SELECT
  b.q1, b.q3, b.iqr, b.lo_bound, b.hi_bound,
  a.outlier_rows, a.total_rows,
  ROUND(100 * SAFE_DIVIDE(a.outlier_rows, a.total_rows), 2) AS pct_outliers
FROM bounds b
CROSS JOIN agg a;


Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,q1,q3,iqr,lo_bound,hi_bound,outlier_rows,total_rows,pct_outliers
0,28.9,82.6,53.7,-51.65,163.15,3462,100000,3.46


In [None]:
# # EXAMPLE (from LLM) — IQR outlier rate (commented)
# # WITH dist AS (
# #   SELECT
# #     APPROX_QUANTILES(minutes_watched, 4)[OFFSET(1)] AS q1,
# #     APPROX_QUANTILES(minutes_watched, 4)[OFFSET(3)] AS q3
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup`
# # ),
# # bounds AS (
# #   SELECT q1, q3, (q3-q1) AS iqr,
# #          q1 - 1.5*(q3-q1) AS lo,
# #          q3 + 1.5*(q3-q1) AS hi
# #   FROM dist
# # )
# # SELECT
# #   COUNTIF(h.minutes_watched < b.lo OR h.minutes_watched > b.hi) AS outliers,
# #   COUNT(*) AS total,
# #   ROUND(100*COUNTIF(h.minutes_watched < b.lo OR h.minutes_watched > b.hi)/COUNT(*),2) AS pct_outliers
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup` h
# # CROSS JOIN bounds b;

In [50]:
%%bigquery --project mgmt-467-1234
--standardSQL
-- 5.3 — Create watch_history_robust with capping at P01/P99
-- Using column watch_duration_minutes (aka "minutes_watched" in prompt)
CREATE OR REPLACE TABLE `mgmt-467-1234.netflix.watch_history_robust` AS
WITH qs AS (
  SELECT
    -- 101 buckets → offsets 1 and 99 approximate P01 and P99
    APPROX_QUANTILES(watch_duration_minutes, 101) AS q
  FROM `mgmt-467-1234.netflix.watch_history_dedup`
),
caps AS (
  SELECT
    q[OFFSET(1)]  AS p01,
    q[OFFSET(99)] AS p99
  FROM qs
)
SELECT
  d.*,
  -- capped value (maps to prompt's minutes_watched_capped)
  CASE
    WHEN d.watch_duration_minutes < c.p01 THEN c.p01
    WHEN d.watch_duration_minutes > c.p99 THEN c.p99
    ELSE d.watch_duration_minutes
  END AS watch_duration_minutes_capped
FROM `mgmt-467-1234.netflix.watch_history_dedup` d
CROSS JOIN caps c;

-- Quantile summaries before vs after capping
WITH pre AS (
  SELECT
    'before' AS stage,
    MIN(watch_duration_minutes) AS min_val,
    APPROX_QUANTILES(watch_duration_minutes, 3)[OFFSET(2)] AS median_val,
    MAX(watch_duration_minutes) AS max_val
  FROM `mgmt-467-1234.netflix.watch_history_dedup`
),
post AS (
  SELECT
    'after' AS stage,
    MIN(watch_duration_minutes_capped) AS min_val,
    APPROX_QUANTILES(watch_duration_minutes_capped, 3)[OFFSET(2)] AS median_val,
    MAX(watch_duration_minutes_capped) AS max_val
  FROM `mgmt-467-1234.netflix.watch_history_robust`
)
SELECT * FROM pre
UNION ALL
SELECT * FROM post;


Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,stage,min_val,median_val,max_val
0,before,0.2,68.8,799.3
1,after,4.4,70.6,205.1


In [None]:
# # EXAMPLE (from LLM) — Winsorize + quantiles (commented)
# # CREATE OR REPLACE TABLE `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_robust` AS
# # WITH q AS (
# #   SELECT
# #     APPROX_QUANTILES(minutes_watched, 100)[OFFSET(1)]  AS p01,
# #     APPROX_QUANTILES(minutes_watched, 100)[OFFSET(98)] AS p99
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup`
# # )
# # SELECT
# #   h.*,
# #   GREATEST(q.p01, LEAST(q.p99, h.minutes_watched)) AS minutes_watched_capped
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup` h, q;
# #
# # -- Quantiles before vs after
# # WITH before AS (
# #   SELECT 'before' AS which, APPROX_QUANTILES(minutes_watched, 5) AS q
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup`
# # ),
# # after AS (
# #   SELECT 'after' AS which, APPROX_QUANTILES(minutes_watched_capped, 5) AS q
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_robust`
# # )
# # SELECT * FROM before UNION ALL SELECT * FROM after;

### Verification Prompt
Generate a query that shows min/median/max before vs after capping.


In [51]:
%%bigquery --project mgmt-467-1234
--standardSQL
-- Verification Prompt — min/median/max before vs after capping

WITH before AS (
  SELECT
    'before' AS stage,
    MIN(watch_duration_minutes) AS min_val,
    APPROX_QUANTILES(watch_duration_minutes, 3)[OFFSET(2)] AS median_val,
    MAX(watch_duration_minutes) AS max_val
  FROM `mgmt-467-1234.netflix.watch_history_dedup`
),
after AS (
  SELECT
    'after' AS stage,
    MIN(watch_duration_minutes_capped) AS min_val,
    APPROX_QUANTILES(watch_duration_minutes_capped, 3)[OFFSET(2)] AS median_val,
    MAX(watch_duration_minutes_capped) AS max_val
  FROM `mgmt-467-1234.netflix.watch_history_robust`
)
SELECT * FROM before
UNION ALL
SELECT * FROM after;


Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,stage,min_val,median_val,max_val
0,after,4.4,70.6,205.1
1,before,0.2,68.8,799.3


**Reflection:** When might capping be harmful? Name a model type less sensitive to outliers and why.

**Reflection:** Capping outliers might be harmful when the extreme values are not errors but represent genuine, albeit rare, occurrences that are important for the analysis or model. For example, in fraud detection, extreme transaction amounts might be critical indicators. Capping these could obscure valuable information. Tree-based models like Decision Trees, Random Forests, and Gradient Boosting Machines are generally less sensitive to outliers compared to linear models (like Linear Regression or Logistic Regression) because they make decisions based on splitting data at certain thresholds rather than relying on the magnitude of individual data points. Outliers typically won't drastically affect the split points in a tree.

### 5.4 Business anomaly flags — What & Why
Human-readable flags help both product decisioning and ML features (e.g., binge behavior).

### Build Prompt
Generate **three BigQuery SQL cells** (adjust if columns differ):
1) In `watch_history_robust`, compute and summarize `flag_binge` for sessions > 8 hours.
2) In `users`, compute and summarize `flag_age_extreme` if age can be parsed from `age_band` (<10 or >100).
3) In `movies`, compute and summarize `flag_duration_anomaly` where `duration_min` < 15 or > 480 (if exists).
Each cell should output count and percentage and include 1–2 comments.


In [None]:
# # EXAMPLE (from LLM) — flag_binge (commented)
# # SELECT
# #   COUNTIF(minutes_watched > 8*60) AS sessions_over_8h,
# #   COUNT(*) AS total,
# #   ROUND(100*COUNTIF(minutes_watched > 8*60)/COUNT(*),2) AS pct
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_robust`;

In [52]:
%%bigquery --project mgmt-467-1234
--standardSQL
-- 5.4-1: flag_binge for sessions > 8 hours
SELECT
  COUNTIF(watch_duration_minutes_capped > 8*60) AS sessions_over_8h,
  COUNT(*) AS total_rows,
  ROUND(100 * SAFE_DIVIDE(COUNTIF(watch_duration_minutes_capped > 8*60), COUNT(*)), 2) AS pct_flag_binge
FROM `mgmt-467-1234.netflix.watch_history_robust`;


Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,sessions_over_8h,total_rows,pct_flag_binge
0,0,100000,0.0


In [None]:
# # EXAMPLE (from LLM) — flag_age_extreme (commented)
# # SELECT
# #   COUNTIF(CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) < 10 OR
# #           CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) > 100) AS extreme_age_rows,
# #   COUNT(*) AS total,
# #   ROUND(100*COUNTIF(CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) < 10 OR
# #                     CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) > 100)/COUNT(*),2) AS pct
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.users`;

In [53]:
%%bigquery --project mgmt-467-1234
--standardSQL
-- 5.4-2: flag_age_extreme using numeric age
SELECT
  COUNTIF(age < 10 OR age > 100) AS extreme_age_rows,
  COUNT(*) AS total_rows,
  ROUND(100 * SAFE_DIVIDE(COUNTIF(age < 10 OR age > 100), COUNT(*)), 2) AS pct_flag_age_extreme
FROM `mgmt-467-1234.netflix.users`;


Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,extreme_age_rows,total_rows,pct_flag_age_extreme
0,537,30900,1.74


In [None]:
# # EXAMPLE (from LLM) — flag_duration_anomaly (commented)
# # SELECT
# #   COUNTIF(duration_min < 15) AS titles_under_15m,
# #   COUNTIF(duration_min > 8*60) AS titles_over_8h,
# #   COUNT(*) AS total
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.movies`;

In [59]:
%%bigquery --project mgmt-467-1234
--standardSQL
-- Titles < 15 min or > 480 min
SELECT
  COUNTIF(duration_minutes < 15 OR duration_minutes > 8*60) AS duration_anomaly_rows,
  COUNT(*) AS total_rows,
  ROUND(
    100 * SAFE_DIVIDE(
      COUNTIF(duration_minutes < 15 OR duration_minutes > 8*60), COUNT(*)
    ), 2
  ) AS pct_flag_duration_anomaly
FROM `mgmt-467-1234.netflix.movies`;


Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,duration_anomaly_rows,total_rows,pct_flag_duration_anomaly
0,69,3120,2.21


### Verification Prompt
Generate a single compact summary query that returns two columns per flag: `flag_name, pct_of_rows`.


In [60]:
%%bigquery --project mgmt-467-1234
--standardSQL
WITH binge AS (
  SELECT ROUND(100 * SAFE_DIVIDE(COUNTIF(watch_duration_minutes_capped > 8*60), COUNT(*)), 2) AS pct
  FROM `mgmt-467-1234.netflix.watch_history_robust`
),
age_extreme AS (
  SELECT ROUND(100 * SAFE_DIVIDE(COUNTIF(age < 10 OR age > 100), COUNT(*)), 2) AS pct
  FROM `mgmt-467-1234.netflix.users`
),
dur_anom AS (
  SELECT ROUND(100 * SAFE_DIVIDE(COUNTIF(duration_minutes < 15 OR duration_minutes > 8*60), COUNT(*)), 2) AS pct
  FROM `mgmt-467-1234.netflix.movies`
)
SELECT 'flag_binge' AS flag_name, pct AS pct_of_rows FROM binge
UNION ALL
SELECT 'flag_age_extreme', pct FROM age_extreme
UNION ALL
SELECT 'flag_duration_anomaly', pct FROM dur_anom;


Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,flag_name,pct_of_rows
0,flag_binge,0.0
1,flag_age_extreme,1.74
2,flag_duration_anomaly,2.21


**Reflection:** Which anomaly flag is most common? Which would you keep as a feature and why?

## 6) Save & submit — What & Why
Reproducibility: save artifacts and document decisions so others can rerun and audit.

### Build Prompt
Generate a checklist (Markdown) students can paste at the end:
- Save this notebook to the team Drive.
- Export a `.sql` file with your DQ queries and save to repo.
- Push notebook + SQL to the **team GitHub** with a descriptive commit.
- Add a README with your `PROJECT_ID`, `REGION`, bucket, dataset, and today’s row counts.


### Submission Checklist

- [ ] Save this notebook to your team's shared Google Drive folder. Go to `File > Save a copy in Drive`.
- [ ] Export your BigQuery SQL queries:
    - Go to the BigQuery UI (link in the left panel).
    - Find the queries you wrote in the job history.
    - Save the relevant queries as `.sql` files (e.g., `dq_queries.sql`).
- [ ] Clone your team's GitHub repository locally.
- [ ] Add this notebook file (`.ipynb`) and your exported `.sql` file(s) to the repository.
- [ ] Commit your changes with a descriptive message (e.g., "feat: Completed DQ lab, added notebook and SQL queries").
- [ ] Push your committed changes to the team GitHub repository.
- [ ] Update your team's README file in the GitHub repository to include:
    - Your `PROJECT_ID`
    - The `REGION` used
    - The GCS bucket name created today
    - The BigQuery dataset name (`netflix`)
    - The final row counts for all tables (`users`, `movies`, `watch_history`, `recommendation_logs`, `search_logs`, `reviews`, `watch_history_dedup`, `watch_history_robust`) from the verification steps.

## Grading rubric (quick)
- Profiling completeness (30)  
- Cleaning policy correctness & reproducibility (40)  
- Reflection/insight (20)  
- Hygiene (naming, verification, idempotence) (10)
