
# MGMT 467 — Prompt-Driven Lab (with Commented Examples)
## Kaggle ➜ Google Cloud Storage ➜ BigQuery ➜ Data Quality (DQ)

**How to use this notebook**
- Each section gives you a **Build Prompt** to paste into Gemini/Vertex AI (or Gemini in Colab).
- Below each prompt, you’ll see a **commented example** of what a good LLM answer might look like.
- **Do not** just uncomment and run. Use the prompt to generate your own code, then compare to the example.
- After every step, run the **Verification Prompt**, and write the **Reflection** in Markdown.

> Goal today: Download the Netflix dataset (Kaggle) → Stage on GCS → Load into BigQuery → Run DQ profiling (missingness, duplicates, outliers, anomaly flags).


### Academic integrity & LLM usage
- Use the prompts here to generate your own code cells.
- Read concept notes and write the reflection answers in your own words.
- Keep credentials out of code. Upload `kaggle.json` when asked.


## Learning objectives
1) Explain **why** we stage data in GCS and load it to BigQuery.  
2) Build an **idempotent**, auditable pipeline.  
3) Diagnose **missingness**, **duplicates**, and **outliers** and justify cleaning choices.  
4) Connect DQ decisions to **business/ML impact**.


## 0) Environment setup — What & Why
Authenticate Colab to Google Cloud so we can use `gcloud`, GCS, and BigQuery. Set **PROJECT_ID** and **REGION** once for consistency (cost/latency).

### Build Prompt (paste to LLM)
You are my cloud TA. Generate a single **Colab code cell** that:
1) Authenticates to Google Cloud in Colab,  
2) Prompts for `PROJECT_ID` via `input()` and sets `REGION="us-central1"` (editable),  
3) Exports `GOOGLE_CLOUD_PROJECT`,  
4) Runs `gcloud config set project $GOOGLE_CLOUD_PROJECT`,  
5) Prints both values. Add 2–3 comments explaining what/why.
End with a comment: `# Done: Auth + Project/Region set`.


In [None]:
from google.colab import auth
auth.authenticate_user()

import os
# Prompt user for GCP Project ID
PROJECT_ID = input("Enter your GCP Project ID: ").strip()
# Set region - keep consistent; change if instructed
REGION = "us-central1"
# Export PROJECT_ID as an environment variable
os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID
print("Project:", PROJECT_ID, "| Region:", REGION)

# Set active project for gcloud/BigQuery CLI
!gcloud config set project $GOOGLE_CLOUD_PROJECT
!gcloud config get-value project

# Done: Auth + Project/Region set

Enter your GCP Project ID: manifest-chain-471119-t8
Project: manifest-chain-471119-t8 | Region: us-central1
Updated property [core/project].
manifest-chain-471119-t8


In [None]:
# # EXAMPLE (from LLM) — Auth + Project/Region (commented; write your own cell using the prompt)
# # from google.colab import auth
# # auth.authenticate_user()
# #
# # import os
# # PROJECT_ID = input("Enter your GCP Project ID: ").strip()
# # REGION = "us-central1"  # keep consistent; change if instructed
# # os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID
# # print("Project:", PROJECT_ID, "| Region:", REGION)
# #
# # # Set active project for gcloud/BigQuery CLI
# # !gcloud config set project $GOOGLE_CLOUD_PROJECT
# # !gcloud config get-value project
# # # Done: Auth + Project/Region set

### Verification Prompt
Generate a short cell that prints the active project using `gcloud config get-value project` and echoes the `REGION` you set.


In [None]:
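# Verify active project and region
!gcloud config get-value project
# REGION is a Python variable from the setup cell. Only GOOGLE_CLOUD_PROJECT was
# exported, so os.environ.get("REGION") would return None.
print("Region:", REGION)

manifest-chain-471119-t8
Region: us-central1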


**Reflection:** Why do we set `PROJECT_ID` and `REGION` at the top? What can go wrong if we don’t?

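*Setting `PROJECT_ID` and `REGION` once at the top means every later `gcloud`, GCS, and BigQuery command targets the same project and location. Without them, resources can be created in (and billed to) the wrong project, buckets and datasets can end up in mismatched locations so that loads can fail or add cross-region cost and latency, and cells that rely on `$GOOGLE_CLOUD_PROJECT` break when it is unset.*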
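
One thing that can go wrong early is a mistyped project ID. Below is a minimal sketch of a pre-flight format check, assuming the documented project ID shape (6 to 30 characters, lowercase letters, digits, and hyphens, starting with a letter and not ending with a hyphen). `looks_like_project_id` is a hypothetical helper, not part of any Google library, and it does not confirm that the project exists.

```python
import re

# Assumed format: 6-30 chars, lowercase letters/digits/hyphens,
# starts with a letter, does not end with a hyphen.
_PROJECT_ID_RE = re.compile(r"^[a-z][a-z0-9-]{4,28}[a-z0-9]$")


def looks_like_project_id(value: str) -> bool:
    """Return True if value is shaped like a GCP project ID (format check only)."""
    return bool(_PROJECT_ID_RE.fullmatch(value))


print(looks_like_project_id("manifest-chain-471119-t8"))  # True
print(looks_like_project_id("My Project"))                # False
```

Running a check like this right after `input()` catches a pasted project *name* (which may contain spaces and capitals) before any `gcloud` command runs.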

## 1) Kaggle API — What & Why
Use Kaggle CLI for reproducible downloads. Store `kaggle.json` at `~/.kaggle/kaggle.json` with `0600` permissions to protect secrets.

### Build Prompt
Generate a **single Colab code cell** that:
- Prompts me to upload `kaggle.json`,
- Saves to `~/.kaggle/kaggle.json` with `0600` permissions,
- Prints `kaggle --version`.
Add comments about security and reproducibility.


In [None]:
from google.colab import files
import os

print("Upload your kaggle.json (Kaggle > Account > Create New API Token)")
uploaded = files.upload()

# Create the .kaggle directory if it doesn't exist
os.makedirs('/root/.kaggle', exist_ok=True)

# Save the uploaded file to the .kaggle directory
# The list(uploaded.keys())[0] gets the filename from the uploaded dictionary
with open('/root/.kaggle/kaggle.json', 'wb') as f:
    f.write(uploaded[list(uploaded.keys())[0]])

# Set permissions to 0600 (owner read and write only) for security
os.chmod('/root/.kaggle/kaggle.json', 0o600)

# Verify Kaggle installation
!kaggle --version

# Done: Kaggle setup

Upload your kaggle.json (Kaggle > Account > Create New API Token)


Saving kaggle.json to kaggle.json
Kaggle API 1.7.4.5


In [None]:
# # EXAMPLE (from LLM) — Kaggle setup (commented)
# # from google.colab import files
# # print("Upload your kaggle.json (Kaggle > Account > Create New API Token)")
# # uploaded = files.upload()
# #
# # import os
# # os.makedirs('/root/.kaggle', exist_ok=True)
# # with open('/root/.kaggle/kaggle.json', 'wb') as f:
# #     f.write(uploaded[list(uploaded.keys())[0]])
# # os.chmod('/root/.kaggle/kaggle.json', 0o600)  # owner-only
# #
# # !kaggle --version

### Verification Prompt
Generate a one-liner that runs `kaggle --help | head -n 20` to show the CLI is ready.


In [None]:
# Verify Kaggle CLI is ready
!kaggle --help | head -n 20

usage: kaggle [-h] [-v] [-W]
              {competitions,c,datasets,d,kernels,k,models,m,files,f,config}
              ...

options:
  -h, --help            show this help message and exit
  -v, --version         Print the Kaggle API version

commands:
  {competitions,c,datasets,d,kernels,k,models,m,files,f,config}
                        Use one of:
                        competitions {list, files, download, submit, submissions, leaderboard}
                        datasets {list, files, download, create, version, init, metadata, status}
                        kernels {list, files, init, push, pull, output, status}
                        models {instances, get, list, init, create, delete, update}
                        models instances {versions, get, files, init, create, delete, update}
                        models instances versions {init, create, download, delete, files}
                        config {view, set, unset}
    competitions (c)    Commands related to Kaggle compe

**Reflection:** Why require strict `0600` permissions on API tokens? What risks are we avoiding?

Setting `0600` on an API token file means only the file's owner can read or write it. If other users or processes on the machine could read `kaggle.json`, they could act as you on Kaggle: download data, create or modify datasets, or submit to competitions under your account. Owner-only permissions reduce the chance of the token leaking through a shared filesystem, and the Kaggle CLI itself warns when the file is readable by others.
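
The permission rule can be checked in code. The sketch below uses a hypothetical helper `is_owner_only` and a throwaway temp file instead of a real token; it assumes a POSIX filesystem such as the Colab VM.

```python
import os
import stat
import tempfile


def is_owner_only(path: str) -> bool:
    """Return True if group and others have no permission bits set on path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o077 == 0


# Demo on a throwaway file rather than a real token
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name

os.chmod(demo_path, 0o644)
print(is_owner_only(demo_path))  # False: group and others can read

os.chmod(demo_path, 0o600)
print(is_owner_only(demo_path))  # True: owner read/write only

os.remove(demo_path)
```

The mask `0o077` covers the group and other permission bits, so any nonzero result means someone besides the owner has access.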

## 2) Download & unzip dataset — What & Why
Keep raw files under `/content/data/raw` for predictable paths and auditing.
**Dataset:** `sayeeduddin/netflix-2025user-behavior-dataset-210k-records`

### Build Prompt
Generate a **Colab code cell** that:
- Creates `/content/data/raw`,
- Downloads the dataset to `/content/data` with Kaggle CLI,
- Unzips into `/content/data/raw` (overwrite OK),
- Lists all CSVs with sizes in a neat table.
Include comments describing each step.


In [None]:
# Create directory for raw data
!mkdir -p /content/data/raw

# Download the dataset to /content/data
# -d specifies the dataset
# -p specifies the download path
!kaggle datasets download -d sayeeduddin/netflix-2025user-behavior-dataset-210k-records -p /content/data

# Unzip the downloaded file into the raw data directory
# -o flag means overwrite existing files without prompting
!unzip -o /content/data/*.zip -d /content/data/raw

# List all CSV files in the raw directory with sizes in a neat table
!ls -lh /content/data/raw/*.csv

Dataset URL: https://www.kaggle.com/datasets/sayeeduddin/netflix-2025user-behavior-dataset-210k-records
License(s): CC0-1.0
Downloading netflix-2025user-behavior-dataset-210k-records.zip to /content/data
  0% 0.00/4.02M [00:00<?, ?B/s]
100% 4.02M/4.02M [00:00<00:00, 769MB/s]
Archive:  /content/data/netflix-2025user-behavior-dataset-210k-records.zip
  inflating: /content/data/raw/README.md  
  inflating: /content/data/raw/movies.csv  
  inflating: /content/data/raw/recommendation_logs.csv  
  inflating: /content/data/raw/reviews.csv  
  inflating: /content/data/raw/search_logs.csv  
  inflating: /content/data/raw/users.csv  
  inflating: /content/data/raw/watch_history.csv  
-rw-r--r-- 1 root root 114K Aug  2 19:36 /content/data/raw/movies.csv
-rw-r--r-- 1 root root 4.5M Aug  2 19:36 /content/data/raw/recommendation_logs.csv
-rw-r--r-- 1 root root 1.8M Aug  2 19:36 /content/data/raw/reviews.csv
-rw-r--r-- 1 root root 2.2M Aug  2 19:36 /content/data/raw/search_logs.csv
-rw-r--r-- 1 root 

In [None]:
# # EXAMPLE (from LLM) — Download & unzip (commented)
# # !mkdir -p /content/data/raw
# # !kaggle datasets download -d sayeeduddin/netflix-2025user-behavior-dataset-210k-records -p /content/data
# # !unzip -o /content/data/*.zip -d /content/data/raw
# # # List CSV inventory
# # !ls -lh /content/data/raw/*.csv

### Verification Prompt
Generate a snippet that asserts there are exactly **six** CSV files and prints their names.


In [None]:
import glob
import os

csv_files = glob.glob('/content/data/raw/*.csv')

# Assert that there are exactly six CSV files
assert len(csv_files) == 6, f"Expected 6 CSV files, but found {len(csv_files)}"

# Print the names of the CSV files
print("Found exactly 6 CSV files:")
for csv_file in csv_files:
    print(os.path.basename(csv_file))

Found exactly 6 CSV files:
reviews.csv
search_logs.csv
watch_history.csv
recommendation_logs.csv
users.csv
movies.csv


**Reflection:** Why is keeping a clean file inventory (names, sizes) useful downstream?

Keeping a clean file inventory with names and sizes is useful downstream for several reasons:

1.  **Reproducibility:** It ensures that you are always working with the expected data files. If a file is missing or has an unexpected size, it can indicate a problem in the data pipeline.
2.  **Auditing:** It provides a clear record of the raw data used in the analysis, which is important for auditing and compliance.
3.  **Debugging:** If errors occur later in the pipeline, having a clear inventory helps in quickly identifying if the issue is related to missing or corrupted input files.
4.  **Automation:** When automating data processing pipelines, predictable file names and locations are essential.
5.  **Documentation:** It serves as documentation of the input data, making it easier for others to understand and replicate your work.
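
An inventory is more useful when it is machine-readable. The following is a sketch, not part of the lab's required code: `build_inventory` and `inventory_to_csv` are hypothetical helpers that record each file's name, size, and SHA-256 digest so a later run can detect a changed or truncated file.

```python
import csv
import hashlib
import io
from pathlib import Path


def build_inventory(folder: str, pattern: str = "*.csv") -> list[dict]:
    """Return name, size in bytes, and SHA-256 for each matching file, sorted by name."""
    rows = []
    for path in sorted(Path(folder).glob(pattern)):
        data = path.read_bytes()  # fine for small files; stream in chunks for large ones
        rows.append({
            "name": path.name,
            "bytes": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    return rows


def inventory_to_csv(rows: list[dict]) -> str:
    """Serialize the inventory as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["name", "bytes", "sha256"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

In this notebook it could be called as `print(inventory_to_csv(build_inventory("/content/data/raw")))`, and the resulting text saved next to the raw files as an audit record.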

## 3) Create GCS bucket & upload — What & Why
Stage in GCS → consistent, versionable source for BigQuery loads. Bucket names must be **globally unique**.
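
A random suffix is the usual way to get a globally unique name. The sketch below generates one and applies a simplified subset of the Cloud Storage naming rules (3 to 63 characters; lowercase letters, digits, hyphens, and underscores; starting and ending with a letter or digit; no `goog` prefix). Names containing dots have extra rules and are deliberately rejected here. Both helper names are hypothetical.

```python
import re
import uuid

# Simplified rule set (an assumption of this sketch): dots are not allowed here.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]$")


def make_bucket_name(prefix: str = "mgmt467-netflix") -> str:
    """Append 8 random hex characters so the name is very unlikely to collide."""
    return f"{prefix}-{uuid.uuid4().hex[:8]}"


def is_plausible_bucket_name(name: str) -> bool:
    """Format check only; it cannot tell whether the name is already taken."""
    return bool(_BUCKET_RE.fullmatch(name)) and not name.startswith("goog")


name = make_bucket_name()
print(name, is_plausible_bucket_name(name))
```

Uniqueness itself can only be confirmed by the create call, so a failed `buckets create` is still possible and should be handled by generating a new suffix.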

### Build Prompt
Generate a **Colab code cell** that:
- Creates a unique bucket in `${REGION}` (random suffix),
- Saves name to `BUCKET_NAME` env var,
- Uploads all CSVs to `gs://$BUCKET_NAME/netflix/`,
- Prints the bucket name and explains staging benefits.


In [None]:
import uuid
import os

# Create a unique bucket name
BUCKET_NAME = f"mgmt467-netflix-{uuid.uuid4().hex[:8]}"
os.environ["BUCKET_NAME"] = BUCKET_NAME

# Create the GCS bucket
# --location specifies the region
!gcloud storage buckets create gs://$BUCKET_NAME --location=$REGION

# Upload all CSV files to the bucket under the 'netflix/' prefix
!gcloud storage cp /content/data/raw/*.csv gs://$BUCKET_NAME/netflix/

# Print the bucket name
print("Created bucket:", BUCKET_NAME)

# Explain staging benefits
print("\nBenefits of staging data in GCS:")
print("- **Centralized storage:** A single location for your data.")
print("- **Version control:** GCS supports object versioning.")
print("- **Accessibility:** Easily accessible by other Google Cloud services like BigQuery.")
print("- **Cost-effective:** Generally cheaper than storing data in databases for raw files.")
print("- **Decoupling:** Separates storage from compute.")

Creating gs://mgmt467-netflix-384948fa/...
Copying file:///content/data/raw/movies.csv to gs://mgmt467-netflix-384948fa/netflix/movies.csv
Copying file:///content/data/raw/recommendation_logs.csv to gs://mgmt467-netflix-384948fa/netflix/recommendation_logs.csv
Copying file:///content/data/raw/reviews.csv to gs://mgmt467-netflix-384948fa/netflix/reviews.csv
Copying file:///content/data/raw/search_logs.csv to gs://mgmt467-netflix-384948fa/netflix/search_logs.csv
Copying file:///content/data/raw/users.csv to gs://mgmt467-netflix-384948fa/netflix/users.csv
Copying file:///content/data/raw/watch_history.csv to gs://mgmt467-netflix-384948fa/netflix/watch_history.csv

Average throughput: 46.2MiB/s
Created bucket: mgmt467-netflix-384948fa

Benefits of staging data in GCS:
- **Centralized storage:** A single location for your data.
- **Version control:** GCS supports object versioning.
- **Accessibility:** Easily accessible by other Google Cloud services like BigQuery.
- **Cost-effective:** Gen

In [None]:
# # EXAMPLE (from LLM) — GCS staging (commented)
# # import uuid, os
# # bucket_name = f"mgmt467-netflix-{uuid.uuid4().hex[:8]}"
# # os.environ["BUCKET_NAME"] = bucket_name
# # !gcloud storage buckets create gs://$BUCKET_NAME --location=$REGION
# # !gcloud storage cp /content/data/raw/* gs://$BUCKET_NAME/netflix/
# # print("Bucket:", bucket_name)
# # # Verify contents
# # !gcloud storage ls gs://$BUCKET_NAME/netflix/

### Verification Prompt
Generate a snippet that lists the `netflix/` prefix and shows object sizes.


In [None]:
# List objects in the netflix/ prefix and show sizes
!gcloud storage ls -l gs://$BUCKET_NAME/netflix/

    115942  2025-10-23T19:35:35Z  gs://mgmt467-netflix-384948fa/netflix/movies.csv
   4695557  2025-10-23T19:35:35Z  gs://mgmt467-netflix-384948fa/netflix/recommendation_logs.csv
   1861942  2025-10-23T19:35:35Z  gs://mgmt467-netflix-384948fa/netflix/reviews.csv
   2250902  2025-10-23T19:35:35Z  gs://mgmt467-netflix-384948fa/netflix/search_logs.csv
   1606820  2025-10-23T19:35:35Z  gs://mgmt467-netflix-384948fa/netflix/users.csv
   9269425  2025-10-23T19:35:35Z  gs://mgmt467-netflix-384948fa/netflix/watch_history.csv
TOTAL: 6 objects, 19800588 bytes (18.88MiB)


**Reflection:** Name two benefits of staging in GCS vs loading directly from local Colab.

Two benefits of staging data in GCS versus loading directly from local Colab are:

1.  **Scalability and Accessibility:** GCS is a scalable, central location that other Google Cloud services such as BigQuery, Dataproc, and Vertex AI can read directly. Files on the Colab VM are limited by its disk, disappear when the runtime is recycled, and are not reachable by those services.
2.  **Durability and Reliability:** GCS offers high durability and reliability with data redundancy across multiple locations. This protects your data from loss due to hardware failures or other issues that could affect data stored locally in the Colab environment.

## 4) BigQuery dataset & loads — What & Why
Create dataset `netflix` and load six CSVs with **autodetect** for speed (we’ll enforce schemas later).

### Build Prompt (two cells)
**Cell A:** Create (idempotently) dataset `netflix` in US multi-region; if it exists, print a friendly message.  
**Cell B:** Load tables from `gs://$BUCKET_NAME/netflix/`:
`users, movies, watch_history, recommendation_logs, search_logs, reviews`
with `--skip_leading_rows=1 --autodetect --source_format=CSV`.
Finish with row-count queries for each table.


In [None]:
# Load tables from GCS
import os

DATASET = "netflix"  # must match the dataset created in Cell A

tables = {
  "users": "users.csv",
  "movies": "movies.csv",
  "watch_history": "watch_history.csv",
  "recommendation_logs": "recommendation_logs.csv",
  "search_logs": "search_logs.csv",
  "reviews": "reviews.csv",
}

for tbl, fname in tables.items():
  src = f"gs://{os.environ['BUCKET_NAME']}/netflix/{fname}"
  print("Loading", tbl, "from", src)
  # --replace overwrites the table so re-running this cell does not append duplicate rows
  !bq load --replace --skip_leading_rows=1 --autodetect --source_format=CSV {DATASET}.{tbl} {src}

# Row counts
print("\nRow counts:")
for tbl in tables:
  # No backticks around the table name: inside a double-quoted shell string they
  # trigger command substitution. dataset.table resolves against the active project.
  !bq query --nouse_legacy_sql "SELECT '{tbl}' AS table_name, COUNT(*) AS n FROM {DATASET}.{tbl}"

Loading users from gs://mgmt467-netflix-384948fa/netflix/users.csv
Waiting on bqjob_r507f4c4014d002a4_0000019a129877a9_1 ... (1s) Current status: DONE   
Loading movies from gs://mgmt467-netflix-384948fa/netflix/movies.csv
Waiting on bqjob_r7e3b4d2fca7279c9_0000019a12988d4a_1 ... (2s) Current status: DONE   
Loading watch_history from gs://mgmt467-netflix-384948fa/netflix/watch_history.csv
Waiting on bqjob_r4284473acb27aee6_0000019a1298a6e1_1 ... (3s) Current status: DONE   
Loading recommendation_logs from gs://mgmt467-netflix-384948fa/netflix/recommendation_logs.csv
Waiting on bqjob_r33dfddbc58d3bf67_0000019a1298c552_1 ... (1s) Current status: DONE   
Loading search_logs from gs://mgmt467-netflix-384948fa/netflix/search_logs.csv
Waiting on bqjob_r7b9f13cf84eaaeec_0000019a1298da39_1 ... (1s) Current status: DONE   
Loading reviews from gs://mgmt467-netflix-384948fa/netflix/reviews.csv
Waiting on bqjob_r4e681e2c5dfafa73_0000019a1298ef3d_1 ... (1s) Current status: DONE   

Row counts:
/

In [None]:
# Create BigQuery dataset (idempotent)
DATASET = "netflix"
# Attempt to create; ignore if exists and print message
!bq --location=US mk -d --description "MGMT467 Netflix dataset" $DATASET 2> /dev/null || echo "Dataset '$DATASET' may already exist."

Dataset 'manifest-chain-471119-t8:netflix' successfully created.


In [None]:
# # EXAMPLE (from LLM) — BigQuery dataset (commented)
# # DATASET="netflix"
# # # Attempt to create; ignore if exists
# # !bq --location=US mk -d --description "MGMT467 Netflix dataset" $DATASET || echo "Dataset may already exist."

In [None]:
# # EXAMPLE (from LLM) — Load tables (commented)
# # tables = {
# #   "users": "users.csv",
# #   "movies": "movies.csv",
# #   "watch_history": "watch_history.csv",
# #   "recommendation_logs": "recommendation_logs.csv",
# #   "search_logs": "search_logs.csv",
# #   "reviews": "reviews.csv",
# # }
# # import os
# # for tbl, fname in tables.items():
# #   src = f"gs://{os.environ['BUCKET_NAME']}/netflix/{fname}"
# #   print("Loading", tbl, "from", src)
# #   !bq load --skip_leading_rows=1 --autodetect --source_format=CSV $DATASET.$tbl $src
# #
# # # Row counts (no backticks: the shell would treat them as command substitution)
# # for tbl in tables.keys():
# #   !bq query --nouse_legacy_sql "SELECT '{tbl}' AS table_name, COUNT(*) AS n FROM netflix.{tbl}"

### Verification Prompt
Generate a single query that returns `table_name, row_count` for all six tables in `${GOOGLE_CLOUD_PROJECT}.netflix`.
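
The six SELECT statements differ only in the table name, so the query text can be generated from a list. This is a sketch with a hypothetical helper `build_row_count_query`; `my-project` is a placeholder, and table names should come only from trusted constants because they are interpolated directly into SQL.

```python
def build_row_count_query(project_id: str, dataset: str, tables: list[str]) -> str:
    """Build one UNION ALL query returning (table_name, row_count) per table."""
    selects = [
        f"SELECT '{t}' AS table_name, COUNT(*) AS row_count "
        f"FROM `{project_id}.{dataset}.{t}`"
        for t in tables
    ]
    return "\nUNION ALL\n".join(selects)


TABLES = [
    "users", "movies", "watch_history",
    "recommendation_logs", "search_logs", "reviews",
]

# "my-project" is a placeholder project ID
print(build_row_count_query("my-project", "netflix", TABLES))
```

The returned string can be passed to `bigquery.Client.query()`; adding a seventh table then means editing only the list.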


In [None]:
import os

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")

if project_id:
  query = f"""
  SELECT 'users' AS table_name, COUNT(*) AS row_count FROM `{project_id}.netflix.users`
  UNION ALL
  SELECT 'movies' AS table_name, COUNT(*) AS row_count FROM `{project_id}.netflix.movies`
  UNION ALL
  SELECT 'watch_history' AS table_name, COUNT(*) AS row_count FROM `{project_id}.netflix.watch_history`
  UNION ALL
  SELECT 'recommendation_logs' AS table_name, COUNT(*) AS row_count FROM `{project_id}.netflix.recommendation_logs`
  UNION ALL
  SELECT 'search_logs' AS table_name, COUNT(*) AS row_count FROM `{project_id}.netflix.search_logs`
  UNION ALL
  SELECT 'reviews' AS table_name, COUNT(*) AS row_count FROM `{project_id}.netflix.reviews`
  """
  from IPython.display import display
  from google.cloud import bigquery

  client = bigquery.Client(project=project_id)
  query_job = client.query(query)
  results = query_job.result()
  display(results.to_dataframe())

else:
  print("GOOGLE_CLOUD_PROJECT environment variable not set.")

Unnamed: 0,table_name,row_count
0,watch_history,210000
1,users,20600
2,movies,2080
3,search_logs,53000
4,recommendation_logs,104000
5,reviews,30900


**Reflection:** When is `autodetect` acceptable? When should you enforce explicit schemas and why?

`autodetect` is acceptable for initial exploration and quick loading of data, especially when the data is relatively clean and consistent. It's convenient for getting a feel for the data without manually defining each column's type.

However, you should enforce explicit schemas when:

1.  **Data Quality and Consistency are Critical:** Explicit schemas ensure that data conforms to expected types and structures, catching errors early in the loading process.
2.  **Production Pipelines:** In production, predictable data types are essential for reliable downstream processes and analyses.
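3.  **Ambiguous Values:** `autodetect` infers types from a sample of rows, so columns such as IDs with leading zeros, postal codes, or mostly empty fields can be typed incorrectly. In this notebook, for example, `age` and `household_size` in `users` were loaded as FLOAT.
4.  **Cost and Performance:** Correct types such as DATE or TIMESTAMP make partitioning and clustering possible and avoid repeated casts in queries.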
5.  **Documentation and Clarity:** Explicit schemas serve as clear documentation of the data structure for anyone using the dataset.

Enforcing explicit schemas provides better control, reliability, and data quality assurance compared to relying solely on `autodetect`.
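
As a sketch of what an explicit schema could look like, the code below builds a BigQuery JSON schema for `users`. The column names and types are copied from the autodetected `users` schema printed in section 5.1 of this notebook; treating every column as `NULLABLE` is an assumption, and a real pipeline would review each type (for example, whether `age` should stay FLOAT). `to_bq_schema` is a hypothetical helper.

```python
import json

# (name, type) pairs taken from the autodetected users schema in this notebook
USERS_COLUMNS = [
    ("user_id", "STRING"),
    ("email", "STRING"),
    ("first_name", "STRING"),
    ("last_name", "STRING"),
    ("age", "FLOAT"),
    ("gender", "STRING"),
    ("country", "STRING"),
    ("state_province", "STRING"),
    ("city", "STRING"),
    ("subscription_plan", "STRING"),
    ("subscription_start_date", "DATE"),
    ("is_active", "BOOLEAN"),
    ("monthly_spend", "FLOAT"),
    ("primary_device", "STRING"),
    ("household_size", "FLOAT"),
    ("created_at", "TIMESTAMP"),
]


def to_bq_schema(columns: list[tuple[str, str]]) -> list[dict]:
    """Convert (name, type) pairs to BigQuery JSON schema entries (all NULLABLE)."""
    return [{"name": name, "type": col_type, "mode": "NULLABLE"} for name, col_type in columns]


schema_json = json.dumps(to_bq_schema(USERS_COLUMNS), indent=2)
print(schema_json)
```

Saved to a file such as `users_schema.json`, this can be passed to `bq load` as the schema argument after the source URI, in place of `--autodetect`, so a column that no longer matches its declared type fails the load instead of silently changing the table.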

## 5) Data Quality (DQ) — Concepts we care about
- **Missingness** (MCAR/MAR/MNAR). Impute vs drop. Add `is_missing_*` indicators.
- **Duplicates** (exact vs near). Double-counted engagement corrupts labels & KPIs.
- **Outliers** (IQR). Winsorize/cap vs robust models. Always **flag** and explain.
- **Reproducibility**. Prefer `CREATE OR REPLACE` and deterministic keys.
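
To make the missingness and outlier bullets concrete, here is a small pure-Python sketch of the IQR rule with flags and capping. The sample values are made up for illustration, `flag_and_cap` is a hypothetical helper, and the multiplier `k=1.5` is the conventional default rather than something this lab prescribes.

```python
import statistics


def iqr_bounds(values: list[float], k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_and_cap(values: list, k: float = 1.5) -> list[dict]:
    """Flag missing values and outliers; cap (winsorize) outliers to the fences."""
    present = [v for v in values if v is not None]
    lower, upper = iqr_bounds(present, k)
    result = []
    for v in values:
        if v is None:
            result.append({"value": None, "is_missing": True, "is_outlier": False, "capped": None})
        else:
            result.append({
                "value": v,
                "is_missing": False,
                "is_outlier": v < lower or v > upper,
                "capped": min(max(v, lower), upper),
            })
    return result


# Toy watch durations in minutes (illustrative, not from the dataset)
minutes = [30, 45, 50, 55, 60, None, 600]
for row in flag_and_cap(minutes):
    print(row)
```

Keeping the original value next to the flag and the capped value means the cleaning choice stays visible and reversible downstream.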


### 5.1 Missingness (users) — What & Why
Measure % missing and check if missingness depends on another variable (MAR) → potential bias & instability.

### Build Prompt
Generate **two BigQuery SQL cells**:
1) Total rows and % missing in `region`, `plan_tier`, `age_band` from `users`.
2) `% plan_tier missing by region` ordered descending. Add comments on MAR.


In [None]:
from google.cloud import bigquery
import os

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")
dataset_id = "netflix"
table_id = "users"

if project_id:
    client = bigquery.Client(project=project_id)
    # Client.dataset() is deprecated; pass the fully qualified table ID instead
    table = client.get_table(f"{project_id}.{dataset_id}.{table_id}")

    print(f"Schema for table {project_id}.{dataset_id}.{table_id}:")
    for field in table.schema:
        print(f"- {field.name}: {field.field_type}")
else:
    print("GOOGLE_CLOUD_PROJECT environment variable not set.")

Schema for table manifest-chain-471119-t8.netflix.users:
- user_id: STRING
- email: STRING
- first_name: STRING
- last_name: STRING
- age: FLOAT
- gender: STRING
- country: STRING
- state_province: STRING
- city: STRING
- subscription_plan: STRING
- subscription_start_date: DATE
- is_active: BOOLEAN
- monthly_spend: FLOAT
- primary_device: STRING
- household_size: FLOAT
- created_at: TIMESTAMP


In [None]:
import os
from google.cloud import bigquery
from IPython.display import display

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")

if project_id:
  client = bigquery.Client(project=project_id)
  query = """
  -- % plan_tier missing by country (MAR check)
  SELECT country,
         COUNT(*) AS n,
         ROUND(100*COUNTIF(subscription_plan IS NULL)/COUNT(*),2) AS pct_missing_plan_tier
  FROM `{}.netflix.users`
  GROUP BY country
  ORDER BY pct_missing_plan_tier DESC;
  """.format(project_id)
  query_job = client.query(query)
  results = query_job.result()
  display(results.to_dataframe())
else:
  print("GOOGLE_CLOUD_PROJECT environment variable not set.")

Unnamed: 0,country,n,pct_missing_plan_tier
0,USA,14408,0.0
1,Canada,6192,0.0


In [None]:
import os
from google.cloud import bigquery
from IPython.display import display

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")

if project_id:
  client = bigquery.Client(project=project_id)
  query = """
  -- Users: % missing per column
  WITH base AS (
    SELECT COUNT(*) n,
           COUNTIF(country IS NULL) miss_country,
           COUNTIF(subscription_plan IS NULL) miss_plan,
           COUNTIF(age IS NULL) miss_age
    FROM `{}.netflix.users`
  )
  SELECT n,
         ROUND(100*miss_country/n,2) AS pct_missing_country,
         ROUND(100*miss_plan/n,2)   AS pct_missing_plan_tier,
         ROUND(100*miss_age/n,2)    AS pct_missing_age
  FROM base;
  """.format(project_id)
  query_job = client.query(query)
  results = query_job.result()
  display(results.to_dataframe())
else:
  print("GOOGLE_CLOUD_PROJECT environment variable not set.")

Unnamed: 0,n,pct_missing_country,pct_missing_plan_tier,pct_missing_age
0,20600,0.0,0.0,11.93


In [None]:
# # EXAMPLE (from LLM) — Missingness profile (commented)
# # -- Users: % missing per column
# # WITH base AS (
# #   SELECT COUNT(*) n,
# #          COUNTIF(region IS NULL) miss_region,
# #          COUNTIF(plan_tier IS NULL) miss_plan,
# #          COUNTIF(age_band IS NULL) miss_age
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.users`
# # )
# # SELECT n,
# #        ROUND(100*miss_region/n,2) AS pct_missing_region,
# #        ROUND(100*miss_plan/n,2)   AS pct_missing_plan_tier,
# #        ROUND(100*miss_age/n,2)    AS pct_missing_age_band
# # FROM base;

In [None]:
# # EXAMPLE (from LLM) — MAR by region (commented)
# # SELECT region,
# #        COUNT(*) AS n,
# #        ROUND(100*COUNTIF(plan_tier IS NULL)/COUNT(*),2) AS pct_missing_plan_tier
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.users`
# # GROUP BY region
# # ORDER BY pct_missing_plan_tier DESC;

### Verification Prompt
Generate a query that prints the three missingness percentages from (1), rounded to two decimals.


In [None]:
import os
from google.cloud import bigquery
from IPython.display import display

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")

if project_id:
  client = bigquery.Client(project=project_id)
  query = """
  -- Print missingness percentages
  SELECT ROUND(100*COUNTIF(country IS NULL)/COUNT(*),2) AS pct_missing_country,
         ROUND(100*COUNTIF(subscription_plan IS NULL)/COUNT(*),2)   AS pct_missing_plan_tier,
         ROUND(100*COUNTIF(age IS NULL)/COUNT(*),2)    AS pct_missing_age
  FROM `{}.netflix.users`;
  """.format(project_id)
  query_job = client.query(query)
  results = query_job.result()
  display(results.to_dataframe())
else:
  print("GOOGLE_CLOUD_PROJECT environment variable not set.")

Unnamed: 0,pct_missing_country,pct_missing_plan_tier,pct_missing_age
0,0.0,0.0,11.93


**Reflection:** Which columns are most missing? Hypothesize MCAR/MAR/MNAR and why.

Based on the query results, the `age` column has the most missing values (11.93%), while `country` and `subscription_plan` have no missing values.

It is difficult to definitively determine if the missingness is MCAR, MAR, or MNAR without further investigation. However, here are some hypotheses:

*   **Age (MAR or MNAR):** Missing age data could be **Missing At Random (MAR)** if the missingness is related to another observed variable, such as the user's device type or subscription date (e.g., older users who signed up earlier might be less likely to provide their age). It could also be **Missing Not At Random (MNAR)** if the missingness is related to the age itself (e.g., users at extreme ages or those who are sensitive about revealing their age might be less likely to provide it).
*   **Country and Subscription Plan (not applicable):** Both columns are fully populated (0% missing), so there is no missingness mechanism to classify. MCAR, MAR, and MNAR describe why values are absent; with no absent values, none of these labels can be inferred from this data.
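
One way to probe the MAR hypothesis for `age` is to compare its missing rate across the levels of another observed column. The sketch below does this in plain Python on toy rows (the values are made up; the column names `primary_device` and `age` come from the `users` schema). `pct_missing_by_group` is a hypothetical helper.

```python
from collections import defaultdict


def pct_missing_by_group(rows: list[dict], group_key: str, target_key: str) -> dict:
    """Return {group: percent of rows where target_key is None}, rounded to 2 decimals."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row.get(group_key)
        totals[group] += 1
        if row.get(target_key) is None:
            missing[group] += 1
    return {g: round(100 * missing[g] / totals[g], 2) for g in totals}


# Toy rows for illustration only
toy_users = [
    {"primary_device": "Mobile", "age": None},
    {"primary_device": "Mobile", "age": 31.0},
    {"primary_device": "Smart TV", "age": 45.0},
    {"primary_device": "Smart TV", "age": 52.0},
]
print(pct_missing_by_group(toy_users, "primary_device", "age"))
```

A clear gap between groups is evidence for MAR. Similar rates are consistent with MCAR but do not prove it, and MNAR cannot be confirmed from observed data alone.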

### 5.2 Duplicates (watch_history) — What & Why
Find exact duplicate interaction records and keep **one best** per group (deterministic policy).
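
The keep-one policy can be sketched outside SQL to make the rule explicit. The code below mirrors a `ROW_NUMBER() ... WHERE rk = 1` pattern on toy rows: within each key group it keeps the row with the highest ranking columns, treating missing values as lowest. The `session_id` tie-break is an assumption added here so that ties resolve the same way on every run; `dedup_keep_best` is a hypothetical helper.

```python
def _rank(row: dict, rank_cols: list[str]) -> tuple:
    """Higher is better; None is treated as the lowest possible value."""
    return tuple(float("-inf") if row[c] is None else row[c] for c in rank_cols)


def dedup_keep_best(rows: list[dict], key_cols: list[str],
                    rank_cols: list[str], tie_col: str) -> list[dict]:
    """Keep one row per key: best rank first, then smallest tie_col value."""
    best = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        current = best.get(key)
        if current is None:
            best[key] = row
            continue
        new_rank, cur_rank = _rank(row, rank_cols), _rank(current, rank_cols)
        if new_rank > cur_rank or (new_rank == cur_rank and row[tie_col] < current[tie_col]):
            best[key] = row
    return list(best.values())


# Toy rows for illustration only
toy_history = [
    {"session_id": "s2", "user_id": "u1", "movie_id": "m1", "progress_percentage": 40.0},
    {"session_id": "s1", "user_id": "u1", "movie_id": "m1", "progress_percentage": 90.0},
    {"session_id": "s3", "user_id": "u2", "movie_id": "m1", "progress_percentage": None},
]
kept = dedup_keep_best(toy_history, ["user_id", "movie_id"], ["progress_percentage"], "session_id")
print([r["session_id"] for r in kept])
```

Without a final tie-break, two rows with identical ranking values could be kept in either order, which makes before/after counts match but row contents differ between runs.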

### Build Prompt
Generate **two BigQuery SQL cells**:
1) Report duplicate groups on `(user_id, movie_id, event_ts, device_type)` with counts (top 20).
2) Create table `watch_history_dedup` that keeps one row per group (prefer higher `progress_ratio`, then `minutes_watched`). Add comments.


In [None]:
import os
from google.cloud import bigquery

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")

if project_id:
  client = bigquery.Client(project=project_id)
  query = """
  -- Create table watch_history_dedup that keeps one row per group
  CREATE OR REPLACE TABLE `{}.netflix.watch_history_dedup` AS
  SELECT * EXCEPT(rk) FROM (
    SELECT h.*,
           ROW_NUMBER() OVER (
             PARTITION BY user_id, movie_id, watch_date, device_type
             -- session_id breaks ties so reruns keep the same row
             ORDER BY progress_percentage DESC, watch_duration_minutes DESC, session_id
           ) AS rk
    FROM `{}.netflix.watch_history` h
  )
  WHERE rk = 1;
  """.format(project_id, project_id)
  query_job = client.query(query)
  query_job.result() # Wait for the job to complete
  print(f"Table `{project_id}.netflix.watch_history_dedup` created or replaced.")
else:
  print("GOOGLE_CLOUD_PROJECT environment variable not set.")

Table `manifest-chain-471119-t8.netflix.watch_history_dedup` created or replaced.


In [None]:
from google.cloud import bigquery
import os

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")
dataset_id = "netflix"
table_id = "watch_history"

if project_id:
    client = bigquery.Client(project=project_id)
    # Client.dataset() is deprecated; pass the fully qualified table ID instead
    table = client.get_table(f"{project_id}.{dataset_id}.{table_id}")

    print(f"Schema for table {project_id}.{dataset_id}.{table_id}:")
    for field in table.schema:
        print(f"- {field.name}: {field.field_type}")
else:
    print("GOOGLE_CLOUD_PROJECT environment variable not set.")

Schema for table manifest-chain-471119-t8.netflix.watch_history:
- session_id: STRING
- user_id: STRING
- movie_id: STRING
- watch_date: DATE
- device_type: STRING
- watch_duration_minutes: FLOAT
- progress_percentage: FLOAT
- action: STRING
- quality: STRING
- location_country: STRING
- is_download: BOOLEAN
- user_rating: INTEGER


In [None]:
import os
from google.cloud import bigquery
from IPython.display import display

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")

if project_id:
  client = bigquery.Client(project=project_id)
  query = """
  -- Report duplicate groups on (user_id, movie_id, watch_date, device_type) with counts (top 20)
  SELECT user_id, movie_id, watch_date, device_type, COUNT(*) AS dup_count
  FROM `{}.netflix.watch_history`
  GROUP BY user_id, movie_id, watch_date, device_type
  HAVING COUNT(*) > 1
  ORDER BY dup_count DESC
  LIMIT 20;
  """.format(project_id)
  query_job = client.query(query)
  results = query_job.result()
  display(results.to_dataframe())
else:
  print("GOOGLE_CLOUD_PROJECT environment variable not set.")

Unnamed: 0,user_id,movie_id,watch_date,device_type,dup_count
0,user_03310,movie_0640,2024-09-08,Smart TV,8
1,user_00391,movie_0893,2024-08-26,Laptop,8
2,user_05629,movie_0697,2025-01-23,Desktop,6
3,user_07617,movie_0785,2024-07-14,Desktop,6
4,user_07738,movie_0793,2025-07-28,Desktop,6
5,user_05811,movie_0177,2024-05-07,Desktop,6
6,user_07594,movie_0133,2025-03-24,Laptop,6
7,user_04899,movie_0142,2025-01-20,Desktop,6
8,user_02976,movie_0987,2024-09-19,Desktop,6
9,user_02652,movie_0352,2024-10-22,Desktop,6


In [None]:
# # EXAMPLE (from LLM) — Detect duplicate groups (commented)
# # SELECT user_id, movie_id, event_ts, device_type, COUNT(*) AS dup_count
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history`
# # GROUP BY user_id, movie_id, event_ts, device_type
# # HAVING dup_count > 1
# # ORDER BY dup_count DESC
# # LIMIT 20;

In [None]:
# # EXAMPLE (from LLM) — Keep-one policy (commented)
# # CREATE OR REPLACE TABLE `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup` AS
# # SELECT * EXCEPT(rk) FROM (
# #   SELECT h.*,
# #          ROW_NUMBER() OVER (
# #            PARTITION BY user_id, movie_id, event_ts, device_type
# #            ORDER BY progress_ratio DESC, minutes_watched DESC
# #          ) AS rk
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history` h
# # )
# # WHERE rk = 1;

### Verification Prompt
Generate a before/after count query comparing raw vs `watch_history_dedup`.


In [None]:
import os
from google.cloud import bigquery
from IPython.display import display

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")

if project_id:
  client = bigquery.Client(project=project_id)
  query = """
  -- Compare raw vs deduplicated row counts
  SELECT 'raw' AS table_name, COUNT(*) AS row_count FROM `{}.netflix.watch_history`
  UNION ALL
  SELECT 'deduplicated' AS table_name, COUNT(*) AS row_count FROM `{}.netflix.watch_history_dedup`;
  """.format(project_id, project_id)
  query_job = client.query(query)
  results = query_job.result()
  display(results.to_dataframe())
else:
  print("GOOGLE_CLOUD_PROJECT environment variable not set.")

Unnamed: 0,table_name,row_count
0,raw,210000
1,deduplicated,100000


**Reflection:** Why do duplicates arise (natural vs system-generated)? How do they corrupt labels and KPIs?

Duplicates can arise from several sources, broadly categorized as natural or system-generated:

*   **Natural Duplicates:** These occur in the real world and are reflected in the data. Examples include a customer placing the same order twice, or a person submitting the same review multiple times.
*   **System-Generated Duplicates:** These are introduced by data collection, processing, or integration errors. Examples include:
    *   **Double Logging:** An event is recorded multiple times due to retry mechanisms or errors in the logging system.
    *   **Data Integration Issues:** Combining data from different sources without proper matching keys can lead to the same entity appearing multiple times.
    *   **Manual Data Entry Errors:** Typos or repeated entries during manual data input.
    *   **System Glitches:** Software bugs can cause data to be duplicated during storage or transfer.

Duplicates can significantly corrupt labels and KPIs:

*   **Corrupted Labels:** If duplicates are not handled, they can lead to incorrect labeling for machine learning models. For example, if a user's watch history contains duplicate entries for the same movie, it might incorrectly inflate the perceived engagement with that movie, leading to a biased training label for a recommendation model.
*   **Inflated KPIs:** Duplicates inflate any metric that counts rows or sums values, such as the number of movie views or total watch time. In this lab the raw table has 210,000 rows versus 100,000 after deduplication, so a view count taken from the raw table would be about 2.1 times too high, and reporting and decisions built on it would be wrong by the same factor.
*   **Skewed Analysis:** Duplicates can skew statistical analysis and aggregations, giving undue weight to the duplicated records.
*   **Resource Waste:** Storing and processing duplicate data wastes storage space and computational resources.
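
The KPI inflation described above can be reproduced on a handful of rows. The sketch below is a minimal standard-library illustration, not the lab's pipeline: the rows are invented, and the keep-one rule (longest `watch_duration_minutes` per `(user_id, movie_id, watch_date, device_type)` key) is one assumed policy among several reasonable ones.

```python
# Hypothetical toy rows: (user_id, movie_id, watch_date, device_type, watch_duration_minutes)
rows = [
    ("user_1", "movie_A", "2024-09-08", "Smart TV", 50.0),
    ("user_1", "movie_A", "2024-09-08", "Smart TV", 50.0),  # retry-style exact duplicate
    ("user_1", "movie_A", "2024-09-08", "Smart TV", 42.0),  # same key, shorter duration
    ("user_2", "movie_B", "2024-09-09", "Laptop", 30.0),
]


def dedup_keep_longest(rows):
    """Keep one row per (user, movie, date, device) key: the longest session."""
    best = {}
    for row in rows:
        key = row[:4]
        if key not in best or row[4] > best[key][4]:
            best[key] = row
    return list(best.values())


def kpis(rows):
    """Two simple KPIs that are sensitive to duplicated rows."""
    return {"views": len(rows), "total_minutes": sum(r[4] for r in rows)}


raw_kpis = kpis(rows)
dedup_kpis = kpis(dedup_keep_longest(rows))
print("raw:  ", raw_kpis)
print("dedup:", dedup_kpis)
```

Both KPIs roughly double on the raw rows even though only two distinct viewing events exist. Whether same-day repeat views should count as duplicates is a policy decision worth stating in the reflection.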

### 5.3 Outliers (minutes_watched) — What & Why
Estimate extreme values via IQR; report % outliers; **winsorize** to P01/P99 for robustness while also **flagging** extremes. In the `watch_history` table loaded above, this column is named `watch_duration_minutes`.
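
The two steps named above, IQR fences and P01/P99 winsorizing with a flag, can be sketched outside BigQuery. This is a minimal standard-library illustration on invented durations; `statistics.quantiles` interpolates exactly, so its numbers will not match `APPROX_QUANTILES` on real data. One indexing difference is worth noting: `statistics.quantiles(..., n=100)` returns 99 cut points (P01 at index 0, P99 at index 98), whereas BigQuery's `APPROX_QUANTILES(x, 100)` returns 101 values that include the minimum and maximum, so P01 and P99 sit at `OFFSET(1)` and `OFFSET(99)`.

```python
import statistics

# Hypothetical durations in minutes: 1..99 plus one extreme session.
durations = [float(v) for v in range(1, 100)] + [799.0]

# IQR fences: n=4 returns the three cut points Q1, Q2, Q3.
q1, _, q3 = statistics.quantiles(durations, n=4, method="inclusive")
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = [v < lo or v > hi for v in durations]
pct_outliers = 100 * sum(is_outlier) / len(durations)

# Winsorize to P01/P99: n=100 returns 99 cut points (index 0 = P01, index 98 = P99).
cuts = statistics.quantiles(durations, n=100, method="inclusive")
p01, p99 = cuts[0], cuts[98]
capped = [min(max(v, p01), p99) for v in durations]

# Flag the rows the cap changed, so the extremes stay visible as a feature.
was_capped = [v != c for v, c in zip(durations, capped)]

print(f"IQR fences: [{lo}, {hi}], outliers: {pct_outliers}%")
print(f"P01={p01}, P99={p99}, rows capped: {sum(was_capped)}")
```

Keeping `was_capped` next to the capped value means a downstream model can still learn from the extremes without their magnitude dominating averages.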

In [None]:
import os
from google.cloud import bigquery
from IPython.display import display

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")

if project_id:
  client = bigquery.Client(project=project_id)
  query = """
  -- Create watch_history_robust with minutes_watched_capped at P01/P99
  CREATE OR REPLACE TABLE `{0}.netflix.watch_history_robust` AS
  WITH q AS (
    SELECT
      -- APPROX_QUANTILES(x, 100) returns 101 values (min, P01..P99, max), so P99 is OFFSET(99)
      APPROX_QUANTILES(watch_duration_minutes, 100)[OFFSET(1)]  AS p01,
      APPROX_QUANTILES(watch_duration_minutes, 100)[OFFSET(99)] AS p99
    FROM `{0}.netflix.watch_history_dedup`
  )
  SELECT
    h.*,
    GREATEST(q.p01, LEAST(q.p99, h.watch_duration_minutes)) AS minutes_watched_capped
  FROM `{0}.netflix.watch_history_dedup` h, q;
  """.format(project_id)
  query_job = client.query(query)
  query_job.result() # Wait for the table creation to complete

  query_quantiles = """
  -- Quantiles before vs after
  WITH before AS (
    SELECT 'before' AS which, APPROX_QUANTILES(watch_duration_minutes, 5) AS q
    FROM `{0}.netflix.watch_history_dedup`
  ),
  after AS (
    SELECT 'after' AS which, APPROX_QUANTILES(minutes_watched_capped, 5) AS q
    FROM `{0}.netflix.watch_history_robust`
  )
  SELECT * FROM before UNION ALL SELECT * FROM after;
  """.format(project_id)
  query_job_quantiles = client.query(query_quantiles)
  results_quantiles = query_job_quantiles.result()
  display(results_quantiles.to_dataframe())

else:
  print("GOOGLE_CLOUD_PROJECT environment variable not set.")

Unnamed: 0,which,q
0,after,"[4.4, 24.6, 41.5, 61.5, 92.0, 203.6]"
1,before,"[0.2, 24.9, 41.7, 61.3, 91.9, 799.3]"


In [None]:
import os
from google.cloud import bigquery
from IPython.display import display

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")

if project_id:
  client = bigquery.Client(project=project_id)
  query = """
  -- Compute IQR bounds for minutes_watched and report % outliers
  WITH dist AS (
    SELECT
      APPROX_QUANTILES(watch_duration_minutes, 4)[OFFSET(1)] AS q1,
      APPROX_QUANTILES(watch_duration_minutes, 4)[OFFSET(3)] AS q3
    FROM `{}.netflix.watch_history_dedup`
  ),
  bounds AS (
    SELECT q1, q3, (q3-q1) AS iqr,
           q1 - 1.5*(q3-q1) AS lo,
           q3 + 1.5*(q3-q1) AS hi
    FROM dist
  )
  SELECT
    COUNTIF(h.watch_duration_minutes < b.lo OR h.watch_duration_minutes > b.hi) AS outliers,
    COUNT(*) AS total,
    ROUND(100*COUNTIF(h.watch_duration_minutes < b.lo OR h.watch_duration_minutes > b.hi)/COUNT(*),2) AS pct_outliers
  FROM `{}.netflix.watch_history_dedup` h
  CROSS JOIN bounds b;
  """.format(project_id, project_id)
  query_job = client.query(query)
  results = query_job.result()
  display(results.to_dataframe())
else:
  print("GOOGLE_CLOUD_PROJECT environment variable not set.")

Unnamed: 0,outliers,total,pct_outliers
0,3505,100000,3.5


### Build Prompt
Generate **two BigQuery SQL cells**:
1) Compute IQR bounds for `minutes_watched` on `watch_history_dedup` and report % outliers.
2) Create `watch_history_robust` with `minutes_watched_capped` capped at P01/P99; return quantile summaries before/after.


In [None]:
import os
from google.cloud import bigquery
from IPython.display import display

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")

if project_id:
  client = bigquery.Client(project=project_id)

  # 1. Compute IQR bounds for minutes_watched on watch_history_dedup and report % outliers.
  query_outliers = """
  -- Compute IQR bounds for minutes_watched and report % outliers
  WITH dist AS (
    SELECT
      APPROX_QUANTILES(watch_duration_minutes, 4)[OFFSET(1)] AS q1,
      APPROX_QUANTILES(watch_duration_minutes, 4)[OFFSET(3)] AS q3
    FROM `{}.netflix.watch_history_dedup`
  ),
  bounds AS (
    SELECT q1, q3, (q3-q1) AS iqr,
           q1 - 1.5*(q3-q1) AS lo,
           q3 + 1.5*(q3-q1) AS hi
    FROM dist
  )
  SELECT
    COUNTIF(h.watch_duration_minutes < b.lo OR h.watch_duration_minutes > b.hi) AS outliers,
    COUNT(*) AS total,
    ROUND(100*COUNTIF(h.watch_duration_minutes < b.lo OR h.watch_duration_minutes > b.hi)/COUNT(*),2) AS pct_outliers
  FROM `{}.netflix.watch_history_dedup` h
  CROSS JOIN bounds b;
  """.format(project_id, project_id)
  print("Outlier analysis:")
  query_job_outliers = client.query(query_outliers)
  results_outliers = query_job_outliers.result()
  display(results_outliers.to_dataframe())

  # 2. Create table watch_history_robust with minutes_watched_capped at P01/P99; return quantile summaries before/after.
  query_winsorize = """
  -- Create watch_history_robust with minutes_watched_capped at P01/P99
  CREATE OR REPLACE TABLE `{0}.netflix.watch_history_robust` AS
  WITH q AS (
    SELECT
      -- APPROX_QUANTILES(x, 100) returns 101 values (min, P01..P99, max), so P99 is OFFSET(99)
      APPROX_QUANTILES(watch_duration_minutes, 100)[OFFSET(1)]  AS p01,
      APPROX_QUANTILES(watch_duration_minutes, 100)[OFFSET(99)] AS p99
    FROM `{0}.netflix.watch_history_dedup`
  )
  SELECT
    h.*,
    GREATEST(q.p01, LEAST(q.p99, h.watch_duration_minutes)) AS minutes_watched_capped
  FROM `{0}.netflix.watch_history_dedup` h, q;
  """.format(project_id)
  print("\nCreating watch_history_robust and showing quantile summaries:")
  query_job_winsorize = client.query(query_winsorize)
  query_job_winsorize.result() # Wait for the table creation to complete

  query_quantiles = """
  -- Quantiles before vs after
  WITH before AS (
    SELECT 'before' AS which, APPROX_QUANTILES(watch_duration_minutes, 5) AS q
    FROM `{0}.netflix.watch_history_dedup`
  ),
  after AS (
    SELECT 'after' AS which, APPROX_QUANTILES(minutes_watched_capped, 5) AS q
    FROM `{0}.netflix.watch_history_robust`
  )
  SELECT * FROM before UNION ALL SELECT * FROM after;
  """.format(project_id)
  query_job_quantiles = client.query(query_quantiles)
  results_quantiles = query_job_quantiles.result()
  display(results_quantiles.to_dataframe())

else:
  print("GOOGLE_CLOUD_PROJECT environment variable not set.")

Outlier analysis:


Unnamed: 0,outliers,total,pct_outliers
0,3505,100000,3.5



Creating watch_history_robust and showing quantile summaries:


Unnamed: 0,which,q
0,after,"[4.4, 24.6, 41.5, 61.5, 92.0, 203.6]"
1,before,"[0.2, 24.9, 41.7, 61.3, 91.9, 799.3]"


In [None]:
# # EXAMPLE (from LLM) — IQR outlier rate (commented)
# # WITH dist AS (
# #   SELECT
# #     APPROX_QUANTILES(minutes_watched, 4)[OFFSET(1)] AS q1,
# #     APPROX_QUANTILES(minutes_watched, 4)[OFFSET(3)] AS q3
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup`
# # ),
# # bounds AS (
# #   SELECT q1, q3, (q3-q1) AS iqr,
# #          q1 - 1.5*(q3-q1) AS lo,
# #          q3 + 1.5*(q3-q1) AS hi
# #   FROM dist
# # )
# # SELECT
# #   COUNTIF(h.minutes_watched < b.lo OR h.minutes_watched > b.hi) AS outliers,
# #   COUNT(*) AS total,
# #   ROUND(100*COUNTIF(h.minutes_watched < b.lo OR h.minutes_watched > b.hi)/COUNT(*),2) AS pct_outliers
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup` h
# # CROSS JOIN bounds b;

In [None]:
# # EXAMPLE (from LLM) — Winsorize + quantiles (commented)
# # CREATE OR REPLACE TABLE `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_robust` AS
# # WITH q AS (
# #   SELECT
# #     -- APPROX_QUANTILES(x, 100) returns 101 values (min, P01..P99, max), so P99 is OFFSET(99)
# #     APPROX_QUANTILES(minutes_watched, 100)[OFFSET(1)]  AS p01,
# #     APPROX_QUANTILES(minutes_watched, 100)[OFFSET(99)] AS p99
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup`
# # )
# # SELECT
# #   h.*,
# #   GREATEST(q.p01, LEAST(q.p99, h.minutes_watched)) AS minutes_watched_capped
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup` h, q;
# #
# # -- Quantiles before vs after
# # WITH before AS (
# #   SELECT 'before' AS which, APPROX_QUANTILES(minutes_watched, 5) AS q
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup`
# # ),
# # after AS (
# #   SELECT 'after' AS which, APPROX_QUANTILES(minutes_watched_capped, 5) AS q
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_robust`
# # )
# # SELECT * FROM before UNION ALL SELECT * FROM after;

### Verification Prompt
Generate a query that shows min/median/max before vs after capping.


In [None]:
import os
from google.cloud import bigquery
from IPython.display import display

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")

if project_id:
  client = bigquery.Client(project=project_id)
  query = """
  -- Min/Median/Max before vs after capping
  WITH before AS (
    SELECT 'before' AS which,
           MIN(watch_duration_minutes) AS min_val,
           APPROX_QUANTILES(watch_duration_minutes, 2)[OFFSET(1)] AS median_val,
           MAX(watch_duration_minutes) AS max_val
    FROM `{}.netflix.watch_history_dedup`
  ),
  after AS (
    SELECT 'after' AS which,
           MIN(minutes_watched_capped) AS min_val,
           APPROX_QUANTILES(minutes_watched_capped, 2)[OFFSET(1)] AS median_val,
           MAX(minutes_watched_capped) AS max_val
    FROM `{}.netflix.watch_history_robust`
  )
  SELECT * FROM before UNION ALL SELECT * FROM after;
  """.format(project_id, project_id)
  query_job = client.query(query)
  results = query_job.result()
  display(results.to_dataframe())
else:
  print("GOOGLE_CLOUD_PROJECT environment variable not set.")

Unnamed: 0,which,min_val,median_val,max_val
0,after,4.4,51.4,203.6
1,before,0.2,51.0,799.3


**Reflection:** When might capping be harmful? Name a model type less sensitive to outliers and why.

Capping might be harmful when:

1.  **Outliers represent important information:** If extreme values are not errors but represent genuine, albeit rare, events (e.g., a customer with exceptionally high engagement), capping can distort the data and lead to a loss of valuable information. This can negatively impact models that need to identify and act on these extreme cases.
2.  **The underlying distribution is naturally skewed:** Capping can artificially alter the distribution of the data, which might be problematic for models that assume a certain data distribution.
3.  **The goal is to understand the full range of data:** Capping obscures the true minimum and maximum values, making it difficult to understand the complete variability and range of the data.

**Tree-based models**, such as **Decision Trees, Random Forests, and Gradient Boosting Machines (for example XGBoost or LightGBM)**, are less sensitive to outliers.

**Why they are less sensitive:** Tree-based models partition the data by comparing a feature to a threshold, so a split depends on the *ordering* of the values rather than their exact magnitude. An extreme feature value lands on the same side of a split whether it is 200 or 800, so its effect stays confined to the leaf it falls into. In models that rely on distance or magnitude, such as linear regression or k-nearest neighbors, the same point can shift the whole fit. This robustness applies to outliers in the features; extreme target values can still pull a regression tree's leaf averages.
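
The rank-based argument above can be checked on a toy example. The sketch below is a hand-rolled, standard-library illustration (not scikit-learn or any library's implementation): it fits a single-split "stump" and an ordinary least squares slope on invented data, once with an extreme value of 799 minutes and once with that value capped at 203.6, the cap seen in the recorded output. All other numbers are assumptions.

```python
def best_stump_partition(xs, ys):
    """Return the indices sent left by the single split that minimizes squared error."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])

    def sse(idx):
        mean = sum(ys[i] for i in idx) / len(idx)
        return sum((ys[i] - mean) ** 2 for i in idx)

    best_score, best_left = None, None
    for k in range(1, len(order)):
        left, right = order[:k], order[k:]
        score = sse(left) + sse(right)
        if best_score is None or score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left


def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


minutes = [10.0, 20.0, 30.0, 40.0, 50.0, 799.0]   # one extreme session
label = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]            # hypothetical engagement label
minutes_capped = [min(v, 203.6) for v in minutes]  # winsorized copy

partition_raw = best_stump_partition(minutes, label)
partition_capped = best_stump_partition(minutes_capped, label)
slope_raw = ols_slope(minutes, label)
slope_capped = ols_slope(minutes_capped, label)

print("same tree partition:", partition_raw == partition_capped)
print("OLS slope raw vs capped:", round(slope_raw, 5), round(slope_capped, 5))
```

Capping preserves the ordering of the values, so the stump groups the rows identically, while the regression slope changes several-fold.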

### 5.4 Business anomaly flags — What & Why
Human-readable flags help both product decisioning and ML features (e.g., binge behavior).

### Build Prompt
Generate **three BigQuery SQL cells** (adjust if columns differ):
1) In `watch_history_robust`, compute and summarize `flag_binge` for sessions > 8 hours.
2) In `users`, compute and summarize `flag_age_extreme` if age can be parsed from `age_band` (<10 or >100).
3) In `movies`, compute and summarize `flag_duration_anomaly` where `duration_min` < 15 or > 480 (if exists).
Each cell should output count and percentage and include 1–2 comments.


In [None]:
import os
from google.cloud import bigquery
from IPython.display import display

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")

if project_id:
  client = bigquery.Client(project=project_id)
  query = """
  -- Compute and summarize flag_age_extreme if age is <10 or >100
  SELECT
    COUNTIF(age < 10 OR age > 100) AS extreme_age_rows,
    COUNT(*) AS total,
    ROUND(100*COUNTIF(age < 10 OR age > 100)/COUNT(*),2) AS pct_extreme_age
  FROM `{}.netflix.users`;
  """.format(project_id)
  print("Age extreme flag analysis:")
  query_job = client.query(query)
  results = query_job.result()
  display(results.to_dataframe())
else:
  print("GOOGLE_CLOUD_PROJECT environment variable not set.")

Age extreme flag analysis:


Unnamed: 0,extreme_age_rows,total,pct_extreme_age
0,358,20600,1.74


In [None]:
import os
from google.cloud import bigquery
from IPython.display import display

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")

if project_id:
  client = bigquery.Client(project=project_id)
  # Assuming the movies table has a 'duration_minutes' column.
  # If not, you may need to adjust the column name based on the actual schema.
  query = """
  -- Compute and summarize flag_duration_anomaly where duration is < 15 or > 480 minutes
  SELECT
    COUNTIF(duration_minutes < 15) AS titles_under_15m,
    COUNTIF(duration_minutes > 480) AS titles_over_8h,
    COUNT(*) AS total,
    ROUND(100*COUNTIF(duration_minutes < 15 OR duration_minutes > 480)/COUNT(*),2) AS pct_duration_anomaly
  FROM `{}.netflix.movies`;
  """.format(project_id)
  print("Duration anomaly flag analysis:")
  query_job = client.query(query)
  results = query_job.result()
  display(results.to_dataframe())
else:
  print("GOOGLE_CLOUD_PROJECT environment variable not set.")

Duration anomaly flag analysis:


Unnamed: 0,titles_under_15m,titles_over_8h,total,pct_duration_anomaly
0,24,22,2080,2.21


In [None]:
import os
from google.cloud import bigquery
from IPython.display import display

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")

if project_id:
  client = bigquery.Client(project=project_id)
  query = """
  -- Compute and summarize flag_binge for sessions > 8 hours (480 minutes)
  SELECT
    COUNTIF(watch_duration_minutes > 480) AS sessions_over_8h,
    COUNT(*) AS total,
    ROUND(100*COUNTIF(watch_duration_minutes > 480)/COUNT(*),2) AS pct_sessions_over_8h
  FROM `{}.netflix.watch_history_robust`;
  """.format(project_id)
  print("Binge watching flag analysis:")
  query_job = client.query(query)
  results = query_job.result()
  display(results.to_dataframe())
else:
  print("GOOGLE_CLOUD_PROJECT environment variable not set.")

Binge watching flag analysis:


Unnamed: 0,sessions_over_8h,total,pct_sessions_over_8h
0,639,100000,0.64


In [None]:
# # EXAMPLE (from LLM) — flag_binge (commented)
# # SELECT
# #   COUNTIF(minutes_watched > 8*60) AS sessions_over_8h,
# #   COUNT(*) AS total,
# #   ROUND(100*COUNTIF(minutes_watched > 8*60)/COUNT(*),2) AS pct
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_robust`;

In [None]:
# # EXAMPLE (from LLM) — flag_age_extreme (commented)
# # SELECT
# #   COUNTIF(CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) < 10 OR
# #           CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) > 100) AS extreme_age_rows,
# #   COUNT(*) AS total,
# #   ROUND(100*COUNTIF(CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) < 10 OR
# #                     CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) > 100)/COUNT(*),2) AS pct
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.users`;

In [None]:
# # EXAMPLE (from LLM) — flag_duration_anomaly (commented)
# # SELECT
# #   COUNTIF(duration_min < 15) AS titles_under_15m,
# #   COUNTIF(duration_min > 8*60) AS titles_over_8h,
# #   COUNT(*) AS total,
# #   ROUND(100*COUNTIF(duration_min < 15 OR duration_min > 8*60)/COUNT(*),2) AS pct
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.movies`;

### Verification Prompt
Generate a single compact summary query that returns two columns per flag: `flag_name, pct_of_rows`.


In [None]:
import os
from google.cloud import bigquery
from IPython.display import display

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")

if project_id:
  client = bigquery.Client(project=project_id)
  query = """
  -- Single compact summary query for anomaly flags
  SELECT 'flag_binge' AS flag_name,
         ROUND(100*COUNTIF(watch_duration_minutes > 480)/COUNT(*),2) AS pct_of_rows
  FROM `{}.netflix.watch_history_robust`
  UNION ALL
  SELECT 'flag_age_extreme' AS flag_name,
         ROUND(100*COUNTIF(age < 10 OR age > 100)/COUNT(*),2) AS pct_of_rows
  FROM `{}.netflix.users`
  UNION ALL
  SELECT 'flag_duration_anomaly' AS flag_name,
         ROUND(100*COUNTIF(duration_minutes < 15 OR duration_minutes > 480)/COUNT(*),2) AS pct_of_rows
  FROM `{}.netflix.movies`;
  """.format(project_id, project_id, project_id)
  print("Anomaly Flag Summary:")
  query_job = client.query(query)
  results = query_job.result()
  display(results.to_dataframe())
else:
  print("GOOGLE_CLOUD_PROJECT environment variable not set.")

Anomaly Flag Summary:


Unnamed: 0,flag_name,pct_of_rows
0,flag_binge,0.64
1,flag_age_extreme,1.74
2,flag_duration_anomaly,2.21


**Reflection:** Which anomaly flag is most common? Which would you keep as a feature and why?

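Based on the summary query, `flag_duration_anomaly` is the most common flag by rate (2.21%), followed by `flag_age_extreme` (1.74%) and `flag_binge` (0.64%). Each rate is relative to its own table, so in absolute counts the order reverses: 639 sessions, 358 users, and 46 titles (24 + 22).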

The decision of which anomaly flag to keep as a feature depends on the specific business problem or machine learning task. However, **`flag_binge`** would likely be a valuable feature to keep, especially for recommendation systems or churn prediction.

**Reasoning:**

*   **Business Relevance:** Binge-watching behavior (`flag_binge`) is a strong indicator of user engagement and content affinity. Understanding and predicting binge patterns can be crucial for personalized recommendations, content scheduling, and identifying highly engaged users.
*   **Potential for Insight:** Analyzing characteristics of users who binge-watch or content that is often binged can provide valuable business insights.
*   **Actionable:** Identifying binge-watching sessions can lead to actionable strategies, such as suggesting the next episode, recommending similar content, or understanding the impact of content format on viewing habits.

While `flag_age_extreme` and `flag_duration_anomaly` might also be relevant depending on the context, `flag_binge` seems to have a more direct and immediate link to user behavior that Netflix would likely be interested in leveraging for personalization and engagement.
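
Keeping `flag_binge` as a feature could look like the sketch below. It is a minimal standard-library illustration with invented sessions; only the 8-hour threshold and the column names come from the lab. It also shows one pitfall: in the recorded run the capped maximum is 203.6 minutes, so a flag computed from `minutes_watched_capped` could never fire, and the flag has to be derived from the raw `watch_duration_minutes`.

```python
BINGE_THRESHOLD_MINUTES = 8 * 60  # the lab's "sessions > 8 hours" rule

# Hypothetical sessions with both the raw and the winsorized duration.
sessions = [
    {"session_id": "s1", "watch_duration_minutes": 45.0, "minutes_watched_capped": 45.0},
    {"session_id": "s2", "watch_duration_minutes": 610.0, "minutes_watched_capped": 203.6},
    {"session_id": "s3", "watch_duration_minutes": 95.0, "minutes_watched_capped": 95.0},
    {"session_id": "s4", "watch_duration_minutes": 12.0, "minutes_watched_capped": 12.0},
]


def add_binge_flag(rows, column="watch_duration_minutes"):
    """Return copies of the rows with a 0/1 flag_binge feature computed from `column`."""
    return [{**r, "flag_binge": int(r[column] > BINGE_THRESHOLD_MINUTES)} for r in rows]


def pct_flagged(rows, flag_name):
    """Percentage of rows where the flag is set, rounded like the SQL summaries."""
    return round(100 * sum(r[flag_name] for r in rows) / len(rows), 2)


from_raw = add_binge_flag(sessions)
from_capped = add_binge_flag(sessions, column="minutes_watched_capped")

print(("flag_binge", pct_flagged(from_raw, "flag_binge")))
print(("flag_binge_from_capped", pct_flagged(from_capped, "flag_binge")))
```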

## 6) Save & submit — What & Why
Reproducibility: save artifacts and document decisions so others can rerun and audit.

### Build Prompt
Generate a checklist (Markdown) students can paste at the end:
- Save this notebook to the team Drive.
- Export a `.sql` file with your DQ queries and save to repo.
- Push notebook + SQL to the **team GitHub** with a descriptive commit.
- Add a README with your `PROJECT_ID`, `REGION`, bucket, dataset, and today’s row counts.
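
For the `.sql` export item in the checklist, one possible approach is to collect the query strings used in the notebook into a dictionary and write them to a single file. The helper `export_sql`, the file name `dq_queries.sql`, and the project ID `my-project` below are all hypothetical placeholders; the temporary directory stands in for wherever the team repo is checked out.

```python
import tempfile
from pathlib import Path


def export_sql(queries, path):
    """Write named queries to one .sql file, each preceded by a comment header."""
    path = Path(path)
    blocks = [f"-- {name}\n{sql.strip().rstrip(';')};\n" for name, sql in queries.items()]
    path.write_text("\n".join(blocks), encoding="utf-8")
    return path


# Hypothetical query set; replace `my-project` with your own PROJECT_ID.
dq_queries = {
    "row_counts": "SELECT COUNT(*) AS row_count FROM `my-project.netflix.watch_history_dedup`;",
    "binge_rate": (
        "SELECT ROUND(100*COUNTIF(watch_duration_minutes > 480)/COUNT(*),2) AS pct "
        "FROM `my-project.netflix.watch_history_robust`"
    ),
}

out_path = export_sql(dq_queries, Path(tempfile.mkdtemp()) / "dq_queries.sql")
print(out_path.read_text(encoding="utf-8"))
```

Writing the file from the same strings the notebook executes keeps the repo copy and the notebook from drifting apart.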


## Grading rubric (quick)
- Profiling completeness (30)  
- Cleaning policy correctness & reproducibility (40)  
- Reflection/insight (20)  
- Hygiene (naming, verification, idempotence) (10)
