# A: Where Data Comes From

This section will cover the main places data comes from, what assumptions each source bakes in, and how to do a quick *source audit* before cleaning.

## A.1 Files: CSV, JSON, Excel

Flat files are common, portable, and simple.

**What these formats are good at:**
<table style="text-align:left;">
    <tr>
        <th>Format</th>
        <th>Best suited for</th>
        <th>Common misuse</th>
    </tr>
    <tr>
        <th>CSV</th>
        <th>Simple rectangular tables; system exports; quick sharing</th>
        <th>Encoding complex structure or multiple tables in one file</th>
    </tr>
    <tr>
        <th>Excel</th>
        <th>Human-facing analysis, manual edits, reporting</th>
        <th>Using as a source of truth or automated data pipeline input</th>
    </tr>
    <tr>
        <th>JSON</th>
        <th>Nested or semi-structured records; API responses</th>
        <th>Assuming fields are stable or consistently present</th>
    </tr>
</table>

### Hidden assumptions
* Someone chose column names, encodings, delimiters, and header rows.
* Missing values may be implicit ("", NA, -1).
* Types are inferred later, not enforced at creation.
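
One way to make these assumptions visible is to state them at load time instead of relying on inference. The sketch below is illustrative only: the sample text, the choice of `-1` as an age sentinel, and the use of the nullable `Int64` dtype are assumptions made for this example.

```python
import pandas as pd
from io import StringIO

# Hypothetical export: blanks, "NA", and -1 all stand in for "missing".
raw = StringIO(
    "user_id,age,income\n"
    "101,34,72000\n"
    "102,,NA\n"
    "103,-1,65000\n"
)

df = pd.read_csv(
    raw,
    dtype={"user_id": "string"},  # IDs are labels, not quantities
    na_values={"age": ["-1"]},    # declare the sentinel for this column only
)

# Nullable integers keep whole numbers whole even when values are missing.
df["age"] = df["age"].astype("Int64")
df["income"] = df["income"].astype("Int64")

print(df.dtypes.to_dict())
print(df.isna().sum().to_dict())
```

Declaring the sentinel per column avoids treating a legitimate `-1` in some other column as missing.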

### Microlab: File sanity check

Simulate loading messy exports and practice the first questions that should be asked:
* What are the columns?
* What looks like missing data?
* What types are being inferred?

In [1]:
import pandas as pd
from io import StringIO
import json

print("=== CSV export (note blanks, NA, weird spacing) ===")
csv_text = """
user_id,age,income,city
101,34,72000,Denver
102,,NA, Boulder
103,29,65000,Denver
104, -1,5400, "Colorado Springs"
"""
df_csv = pd.read_csv(StringIO(csv_text),
                     skipinitialspace=True)
print(df_csv)
print("\nDtypes:", df_csv.dtypes.to_dict())
print("Missing by column:", df_csv.isna().sum().to_dict())

print("\n=== JSON export (nested fields) ===")
json_text = json.dumps([
      {"user_id": 201, "profile": {"age": 31, "city": "Denver"}, "income": 80000},
      {"user_id": 202, "profile": {"age": None, "city": "Boulder"}, "income": "79000"},
      {"user_id": 203, "profile": {"city": "Denver"}, "income": 61000}
])

records = json.loads(json_text)
df_json = pd.json_normalize(records)
print(df_json)
print("\nDtypes:", df_json.dtypes.to_dict())

print("\nTry:")
print("- Replace -1 with a real age and re-check missingness")
print("- Make income consistently numeric")
print("- Rename columns to something consistent (snake_case)")


=== CSV export (note blanks, NA, weird spacing) ===
   user_id   age   income              city
0      101  34.0  72000.0            Denver
1      102   NaN      NaN           Boulder
2      103  29.0  65000.0            Denver
3      104  -1.0   5400.0  Colorado Springs

Dtypes: {'user_id': dtype('int64'), 'age': dtype('float64'), 'income': dtype('float64'), 'city': dtype('O')}
Missing by column: {'user_id': 0, 'age': 1, 'income': 1, 'city': 0}

=== JSON export (nested fields) ===
   user_id income  profile.age profile.city
0      201  80000         31.0       Denver
1      202  79000          NaN      Boulder
2      203  61000          NaN       Denver

Dtypes: {'user_id': dtype('int64'), 'income': dtype('O'), 'profile.age': dtype('float64'), 'profile.city': dtype('O')}

Try:
- Replace -1 with a real age and re-check missingness
- Make income consistently numeric
- Rename columns to something consistent (snake_case)


#### Example with cleaned data

In [3]:
import pandas as pd
from io import StringIO
import json

print("=== CSV export (note blanks, NA, weird spacing) ===")
csv_text = """
user_id,age,income,city
101,34,72000,Denver
102,27,99000,Boulder
103,29,65000,Denver
104,44,5400,Colorado Springs
"""
df_csv = pd.read_csv(StringIO(csv_text),
                     skipinitialspace=True)
print(df_csv)
print("\nDtypes:", df_csv.dtypes.to_dict())
print("Missing by column:", df_csv.isna().sum().to_dict())

print("\n=== JSON export (nested fields) ===")
json_text = json.dumps([
      {"user_id": 201, "profile": {"age": 31, "city": "Denver"}, "income": 80000},
      {"user_id": 202, "profile": {"age": 27, "city": "Boulder"}, "income": 79000},
      {"user_id": 203, "profile": {"age": 44, "city": "Denver"}, "income": 61000}
])

records = json.loads(json_text)
df_json = pd.json_normalize(records)
print(df_json)
print("\nDtypes:", df_json.dtypes.to_dict())

=== CSV export (note blanks, NA, weird spacing) ===
   user_id  age  income              city
0      101   34   72000            Denver
1      102   27   99000           Boulder
2      103   29   65000            Denver
3      104   44    5400  Colorado Springs

Dtypes: {'user_id': dtype('int64'), 'age': dtype('int64'), 'income': dtype('int64'), 'city': dtype('O')}
Missing by column: {'user_id': 0, 'age': 0, 'income': 0, 'city': 0}

=== JSON export (nested fields) ===
   user_id  income  profile.age profile.city
0      201   80000           31       Denver
1      202   79000           27      Boulder
2      203   61000           44       Denver

Dtypes: {'user_id': dtype('int64'), 'income': dtype('int64'), 'profile.age': dtype('int64'), 'profile.city': dtype('O')}


## A.2 SQL databases

Databases reflect how applications think about the world.  

**Why databases exist**  

Databases are optimized for transactions, consistency, and multi-user access, not analysis. Their schemas encode business logic: users, orders, events, states.  

<table style="text-align:left;">
    <tr>
        <th>Type</th>
        <th>What a row represents</th>
        <th>Typical pitfall</th>
    </tr>
    <tr>
        <th>Events</th>
        <th>Something that happened at a time</th>
        <th>Accidentally double counting or missing time windows</th>
    </tr>
    <tr>
        <th>State</th>
        <th>Current snapshot of something</th>
        <th>Assuming it contains historical truth</th>
    </tr>
</table>

### Microlab: Join logic and granularity traps

This microlab creates two tables (users and orders) and shows how a join can silently change your "row meaning".

In [2]:
import sqlite3

# Create an in-memory SQLite database
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create tables
cur.execute("""
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    plan TEXT
)
""")

cur.execute("""
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    user_id INTEGER,
    amount REAL
)
""")

# Insert data
cur.executemany(
    "INSERT INTO users (user_id, plan) VALUES (?, ?)",
    [(1, "free"), (2, "paid"), (3, "paid")]
)

cur.executemany(
    "INSERT INTO orders (order_id, user_id, amount) VALUES (?, ?, ?)",
    [(101, 2, 20.0), (102, 2, 35.0), (103, 3, 15.0), (104, 3, 60.0)]
)

conn.commit()

print("=== users (state table) ===")
for row in cur.execute("SELECT * FROM users"):
    print(row)

print("\n=== orders (event table) ====")
for row in cur.execute("SELECT * FROM orders"):
    print(row)

print("\n=== joined view ===")
for row in cur.execute("""
SELECT u.user_id, u.plan, o.order_id, o.amount
FROM users u
LEFT JOIN orders o
ON u.user_id = o.user_id
"""):
    print(row)

print("\nQuestion: How many paid users are there?")

# Correct answer
cur.execute("SELECT COUNT(*) FROM users WHERE plan = 'paid'")
print("Correct (from users table):", cur.fetchone()[0])

# WRONG answer: counting rows after join
cur.execute("""
SELECT COUNT(*)
FROM users u
LEFT JOIN orders o
ON u.user_id = o.user_id
WHERE u.plan = 'paid'
""")
print("WRONG (after join, counting rows):", cur.fetchone()[0])

print("\nFix: define unit of analysis explicitly.")

# Correct fix using DISTINCT
cur.execute("""
SELECT COUNT(DISTINCT u.user_id)
FROM users u
LEFT JOIN orders o
ON u.user_id = o.user_id
WHERE u.plan = 'paid'
""")
print("Paid users (DISTINCT user_id):", cur.fetchone()[0])

print("\nTry:")
print("- Insert another order for user 2")
print("- Compute total revenue by plan using GROUP BY")
print("- Ask: what does ONE ROW represent after each query?")

=== users (state table) ===
(1, 'free')
(2, 'paid')
(3, 'paid')

=== orders (event table) ====
(101, 2, 20.0)
(102, 2, 35.0)
(103, 3, 15.0)
(104, 3, 60.0)

=== joined view ===
(1, 'free', None, None)
(2, 'paid', 101, 20.0)
(2, 'paid', 102, 35.0)
(3, 'paid', 103, 15.0)
(3, 'paid', 104, 60.0)

Question: How many paid users are there?
Correct (from users table): 2
WRONG (after join, counting rows): 4

Fix: define unit of analysis explicitly.
Paid users (DISTINCT user_id): 2

Try:
- Insert another order for user 2
- Compute total revenue by plan using GROUP BY
- Ask: what does ONE ROW represent after each query?
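
One of the exercises above asks for revenue by plan. A possible approach, sketched here on the same toy schema, is to aggregate orders to one row per user first and only then join, so the join cannot multiply rows. The subquery alias `per_user` and the `COALESCE` default of 0 for users without orders are choices made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, plan TEXT)")
cur.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO users VALUES (?, ?)", [(1, "free"), (2, "paid"), (3, "paid")])
cur.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(101, 2, 20.0), (102, 2, 35.0), (103, 3, 15.0), (104, 3, 60.0)],
)

# Inner query: one row per user. Outer query: one row per plan.
rows = cur.execute("""
SELECT u.plan,
       COUNT(*) AS n_users,
       COALESCE(SUM(per_user.total), 0) AS revenue
FROM users u
LEFT JOIN (
    SELECT user_id, SUM(amount) AS total
    FROM orders
    GROUP BY user_id
) AS per_user
ON u.user_id = per_user.user_id
GROUP BY u.plan
ORDER BY u.plan
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

Because the subquery returns at most one row per user, `COUNT(*)` in the outer query counts users, not orders.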


## A.3 APIs

APIs give you data through someone else's interface and rules.  

**What APIs provide**  

* Programmatic access to live or regularly updated data.
* Structured responses (often JSON).
* Authentication, quotas, and versioning.

**Common constraints**
* Rate limits: you cannot pull everything at once.
* Partial views: pagination, filters, or redacted fields.
* Instability: fields can change or disappear.
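
The pagination and rate-limit constraints above can be sketched as a simple loop. Everything here is simulated: `fetch_page`, the `next_page` field, and the fixed pause are hypothetical stand-ins, and a real API will document its own paging scheme and limits.

```python
import time

# Simulated API responses; fetch_page is a stand-in for a real HTTP call.
FAKE_PAGES = {
    1: {"items": [{"id": 1}, {"id": 2}], "next_page": 2},
    2: {"items": [{"id": 3}], "next_page": None},
}

def fetch_page(page):
    """Return one page of results; a real client would make a request here."""
    return FAKE_PAGES[page]

def fetch_all(first_page=1, pause_seconds=0.0, max_pages=100):
    """Follow next_page links until exhausted, pausing between calls."""
    items, page, pages_seen = [], first_page, 0
    while page is not None and pages_seen < max_pages:
        response = fetch_page(page)
        items.extend(response["items"])
        page = response["next_page"]
        pages_seen += 1
        time.sleep(pause_seconds)  # crude rate limiting between requests
    return items

all_items = fetch_all()
print(len(all_items), "items:", all_items)
```

The `max_pages` cap is a guardrail against a paging bug that would otherwise loop forever.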

### Microlab: Pagination + schema drift

In [5]:
import pandas as pd

# Simulated API responses (page 1 vs page 2)
page1 = [
    {"id": 1, "name": "Ava", "score": 0.91},
    {"id": 2, "name": "Ben", "score": 0.74},
]

page2 = [
    {"id": 3, "name": "Cam", "score": "0.88", "country": "US"},
    {"id": 4, "name": "Dee", "score": 0.67, "country": None},
]

df = pd.json_normalize(page1 + page2)

print(df)
print("\nDtypes:", df.dtypes.to_dict())

print("\nCommon API tasks:")
print("1) Combine pages (done)")
print("2) Normalize fields (types + names)")
print("3) Decide how to handle missing optional fields")

# Example: force score numeric
df["score"] = pd.to_numeric(df["score"], errors="coerce")
print("\nAfter forcing score numeric:")
print(df)
print("\nDtypes:", df.dtypes.to_dict())

print("\nTry:")
print("- Add a page3 missing 'name'")
print("- Rename 'id' to 'user_id'")
print("- Drop rows with missing critical fields vs keep and flag")

   id name score country
0   1  Ava  0.91     NaN
1   2  Ben  0.74     NaN
2   3  Cam  0.88      US
3   4  Dee  0.67    None

Dtypes: {'id': dtype('int64'), 'name': dtype('O'), 'score': dtype('O'), 'country': dtype('O')}

Common API tasks:
1) Combine pages (done)
2) Normalize fields (types + names)
3) Decide how to handle missing optional fields

After forcing score numeric:
   id name  score country
0   1  Ava   0.91     NaN
1   2  Ben   0.74     NaN
2   3  Cam   0.88      US
3   4  Dee   0.67    None

Dtypes: {'id': dtype('int64'), 'name': dtype('O'), 'score': dtype('float64'), 'country': dtype('O')}

Try:
- Add a page3 missing 'name'
- Rename 'id' to 'user_id'
- Drop rows with missing critical fields vs keep and flag


## A.4 Web Scraping

Scraping is extracting structure from pages built for humans.  

**Why scraping exists**  

Sometimes the data is visible but not downloadable. Scraping turns HTML pages into rows and columns.  

**Why scraping is fragile**  
* HTML structure changes without notice
* Content may be dynamically loaded
* Legal and ethical constraints apply


### Microlab: parse a tiny HTML snippet into a table

This is a minimal example:

In [11]:
import pandas as pd
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Widget A</td><td>$10</td></tr>
  <tr><td>Widget B</td><td>$12</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find_all("tr")[1:]:
    tds = tr.find_all("td")
    rows.append({
        "name": tds[0].get_text(strip=True),
        "price": tds[1].get_text(strip=True),
    })

df = pd.DataFrame(rows)
print(df)

print("\nTry:")
print("- Change <td>$10</td> to <td>10 usd</td> and clean price")
print("- Add a new column in the HTML and update your parser")
print("- Remove the header row and see how the assumptions break")

       name price
0  Widget A   $10
1  Widget B   $12

Try:
- Change <td>$10</td> to <td>10 usd</td> and clean price
- Add a new column in the HTML and update your parser
- Remove the header row and see how the assumptions break
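
For the first exercise above, one possible price cleaner is sketched below using only the standard library. It assumes commas are thousands separators and ignores the currency entirely, which would be wrong for mixed-currency pages; `parse_price` is a name made up for this example.

```python
import re

def parse_price(text):
    """Return the first number found in a price string as a float, else None."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None

samples = ["$10", "10 usd", "$1,250.50", "call for price"]
for s in samples:
    print(repr(s), "->", parse_price(s))
```

Returning `None` instead of raising keeps the failures countable, so the extraction error rate can be measured afterwards.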


## A.5 Practice

**Goals**
* **Files:** clean a messy CSV, normalize a nested JSON, handle Excel-style human edits
* **SQL:** use SQLite to see how joins change row meaning and how to aggregate safely
* **APIs:** combine paginated responses and handle schema drift (simulated)
* **Scraping:** parse a tiny HTML table and clean extracted fields.
* **MiniProject:** merge sources into one analysis-ready table + write a data quality report

# B. Data Quality and Structure

"Messy data" is not random. It has structure and causes. Quality problems usually come from how systems record events, how humans edit files, and how pipelines evolve over time.  

This section focuses on auditing what the data means before it is trusted.

## B.1 Missing Data: types and causes

**Three patterns of missingness**

<table style="text-align:left;">
    <tr>
        <th>Pattern</th>
        <th>What it means (plain language)</th>
        <th>Example</th>
        <th>Why it matters</th>
    </tr>
    <tr>
        <th>MCAR (Missing Completely at Random)</th>
        <th>Values are missing for reasons unrelated to the data itself (pure noise or chance)</th>
        <th>A sensor randomly drops readings due to network glitches</th>
        <th>Dropping or simple imputation usually does not bias results (but reduces sample size)</th>
    </tr>
    <tr>
        <th>MAR (Missing At Random, conditional on observed data)</th>
        <th>Missingness depends on other columns you can see, but not on the missing value itself once you account for them</th>
        <th>Income is missing more often for younger users, but within each age group it is random.</th>
        <th>Imputation using observed features can be reasonable, but assumptions must be documented.</th>
    </tr>
    <tr>
        <th>MNAR (Missing Not At Random)</th>
        <th>Missingness depends on the value that is missing, even after accounting for observed data</th>
        <th>People with very low income skip the income question because it is low</th>
        <th>Naive dropping or imputation can bias conclusions; this is the hardest case to fix.</th>
    </tr>
</table>
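
A small simulation can make the difference between MCAR and MNAR concrete. The numbers below (income distribution, missingness probabilities, the 30% cutoff) are invented for illustration; the point is only the direction of the effect.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
income = rng.normal(60_000, 15_000, size=10_000)

# MCAR: every value has the same 30% chance of being missing.
mcar_missing = rng.random(income.size) < 0.30

# MNAR: the lowest 30% of incomes are missing far more often than the rest.
cutoff = np.quantile(income, 0.30)
p_missing = np.where(income < cutoff, 0.80, 0.10)
mnar_missing = rng.random(income.size) < p_missing

print("True mean:           ", round(income.mean()))
print("Mean after MCAR drop:", round(income[~mcar_missing].mean()))
print("Mean after MNAR drop:", round(income[~mnar_missing].mean()))
```

Under MCAR the observed mean should stay close to the true mean, while under MNAR dropping the missing rows pushes the mean upward because low incomes are underrepresented.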

**Missing is often encoded**
* **Sentinels:** -1, 999, "unknown", "N/A"
* **Empty strings:** ""
* **Whitespace:** " "
* **Type coercion:** numbers stored as strings, where values that fail to convert silently become nulls

### Microlab: detect missingness + missingness as a feature

In [1]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
  "user_id": [101, 102, 103, 104, 105, 106],
  "age": [34, "", 29, -1, 41, "  "],
  "income": ["72000", "NA", "65000", "54000", None, "999"],
  "city": ["Denver", "Boulder", "", "Colorado Springs", "Denver", "unknown"],
  "plan": ["paid", "free", "paid", "free", "paid", "free"]
})

print("Raw data:")
print(df)

# Step 1: standardize common missing encodings
missing_tokens = ["", "  ", "NA", "N/A", "unknown", "999"]
df_clean = df.replace(missing_tokens, np.nan)

# Step 2: coerce numeric columns
df_clean["age"] = pd.to_numeric(df_clean["age"], errors="coerce")
df_clean.loc[df_clean["age"] < 0, "age"] = np.nan
df_clean["income"] = pd.to_numeric(df_clean["income"], errors="coerce")

print("\nAfter normalizing missing encodings:")
print(df_clean)
print("\nMissing counts:", df_clean.isna().sum().to_dict())

# Missingness can be informative:
df_clean["income_missing"] = df_clean["income"].isna().astype(int)

print("\nIncome missing rate by plan:")
print(df_clean.groupby("plan")["income_missing"].mean())

print("\nTry:")
print("- Add more rows where only 'free' users hide income")
print("- Ask: does dropping missing-income rows change the plan mix?")

Raw data:
   user_id age income              city  plan
0      101  34  72000            Denver  paid
1      102         NA           Boulder  free
2      103  29  65000                    paid
3      104  -1  54000  Colorado Springs  free
4      105  41   None            Denver  paid
5      106        999           unknown  free

After normalizing missing encodings:
   user_id   age   income              city  plan
0      101  34.0  72000.0            Denver  paid
1      102   NaN      NaN           Boulder  free
2      103  29.0  65000.0               NaN  paid
3      104   NaN  54000.0  Colorado Springs  free
4      105  41.0      NaN            Denver  paid
5      106   NaN      NaN               NaN  free

Missing counts: {'user_id': 0, 'age': 3, 'income': 3, 'city': 2, 'plan': 0}

Income missing rate by plan:
plan
free    0.666667
paid    0.333333
Name: income_missing, dtype: float64

Try:
- Add more rows where only 'free' users hide income
- Ask: does dropping missing-income rows change the plan mix?


## B.2 Duplicates and inconsistencies

Duplication is usually a modeling bug waiting to happen: double-counting, label leakage, or identity confusion.

<table style="text-align:left;">
    <tr>
        <th>Type</th>
        <th>What it looks like</th>
        <th>Common cause</th>
        <th>What to do</th>
    </tr>
    <tr>
        <th>Exact duplicates</th>
        <th>Same value across all columns</th>
        <th>Export glitch, retry logic, copy/paste</th>
        <th>Usually safe to drop</th>
    </tr>
    <tr>
        <th>Entity duplicates</th>
        <th>Same real-world entity appears multiple times</th>
        <th>Multiple IDs, casing/typos, merges</th>
        <th>Requires a dedupe rule</th>
    </tr>
</table>

**Inconsistency is often "almost the same"**  

* **Strings:** casing, whitespace, punctuation ("NY" vs "New York")
* **Categories:** synonyms ("M", "male", "man")
* **Units:** kg vs lb, dollars vs cents
* **Timezones:** mixing UTC and local time
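
Two of these cases, category synonyms and mixed units, can be sketched as follows. The synonym map and the sample rows are hypothetical; in practice the map would come from reviewing the actual label counts.

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["M", "male", " Man ", "F", "female", "nonbinary"],
    "weight": [70.0, 154.0, 80.0, 60.0, 132.0, 75.0],
    "weight_unit": ["kg", "lb", "kg", "kg", "lb", "kg"],
})

# Hypothetical synonym map; labels not listed become missing so they get reviewed.
SEX_MAP = {"m": "male", "male": "male", "man": "male", "f": "female", "female": "female"}
df["sex_norm"] = df["sex"].str.strip().str.lower().map(SEX_MAP)

# Convert everything to one unit (1 lb = 0.45359237 kg).
LB_TO_KG = 0.45359237
is_lb = df["weight_unit"] == "lb"
df["weight_kg"] = df["weight"].where(~is_lb, df["weight"] * LB_TO_KG).round(1)

print(df[["sex", "sex_norm", "weight_kg"]])
print("Unmapped labels:", df.loc[df["sex_norm"].isna(), "sex"].tolist())
```

Leaving unmapped labels as missing, instead of guessing, keeps the normalization auditable.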

### Microlab: Exact duplicates vs entity duplicates 

Remove exact duplicates and then face the harder case: multiple records that probably refer to the same customer. Practice writing a dedupe rule.

In [2]:
import pandas as pd

df = pd.DataFrame([
  {"customer_id": "C001", "email": "ava@example.com", "name": "Ava Li", "updated_at": "2025-01-01", "city": "Denver"},
  {"customer_id": "C001", "email": "ava@example.com", "name": "Ava Li", "updated_at": "2025-01-01", "city": "Denver"},  # exact dup
  {"customer_id": "C009", "email": "ava@example.com", "name": "Ava L.", "updated_at": "2025-02-10", "city": "Denver"},   # entity dup (same email)
  {"customer_id": "C010", "email": "ben@example.com", "name": "Ben", "updated_at": "2025-02-09", "city": " Boulder "},    # whitespace
  {"customer_id": "C011", "email": "ben@example.com", "name": "Benjamin", "updated_at": "2025-02-12", "city": "Boulder"}, # entity dup
])

df["updated_at"] = pd.to_datetime(df["updated_at"])
print("Raw:")
print(df)

print("\n1) Drop exact duplicates:")
df1 = df.drop_duplicates().copy()  # explicit copy so the column assignment below is safe
print(df1)

print("\n2) Normalize strings (city):")
df1["city"] = df1["city"].str.strip()
print(df1[["customer_id","email","city"]])

print("\n3) Entity dedupe rule: keep the most recent row per email")
df2 = df1.sort_values("updated_at").drop_duplicates(subset=["email"], keep="last")
print(df2)

print("\nTry:")
print("- Change the rule: keep row with longest name, or fewest missing fields")
print("- Ask: what is your unit of analysis — customer, email, or account?")

Raw:
  customer_id            email      name updated_at       city
0        C001  ava@example.com    Ava Li 2025-01-01     Denver
1        C001  ava@example.com    Ava Li 2025-01-01     Denver
2        C009  ava@example.com    Ava L. 2025-02-10     Denver
3        C010  ben@example.com       Ben 2025-02-09   Boulder 
4        C011  ben@example.com  Benjamin 2025-02-12    Boulder

1) Drop exact duplicates:
  customer_id            email      name updated_at       city
0        C001  ava@example.com    Ava Li 2025-01-01     Denver
2        C009  ava@example.com    Ava L. 2025-02-10     Denver
3        C010  ben@example.com       Ben 2025-02-09   Boulder 
4        C011  ben@example.com  Benjamin 2025-02-12    Boulder

2) Normalize strings (city):
  customer_id            email     city
0        C001  ava@example.com   Denver
2        C009  ava@example.com   Denver
3        C010  ben@example.com  Boulder
4        C011  ben@example.com  Boulder

3) Entity dedupe rule: keep the most recent row per email


## B.3 Schema drift

Over time, the meaning of columns changes, sometimes quietly.  

**What "drift" looks like in raw data**
* New columns appear (a new feature launches)
* Columns disappear (a field is deprecated)
* Type changes (integer becomes string; cents becomes dollars)
* Semantic changes (same name, different meaning)

**Why this matters**

Drift can break pipelines, but the worse outcome is a silent correctness failure: your model may keep training, but on a different problem than intended.
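
Column and type drift can be detected mechanically by comparing schema snapshots. The helper names `schema_of` and `diff_schemas` and the sample frames below are made up for this sketch. Note what it cannot see: a semantic change that keeps the same name and dtype.

```python
import pandas as pd

def schema_of(df):
    """Map each column name to its dtype as a string."""
    return {col: str(dtype) for col, dtype in df.dtypes.items()}

def diff_schemas(old, new):
    """Return added, removed, and retyped columns between two schema dicts."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": {
            col: (old[col], new[col])
            for col in sorted(set(old) & set(new))
            if old[col] != new[col]
        },
    }

# Yesterday: amount in cents (integers). Today: amount in dollars (floats), plus a new column.
yesterday = pd.DataFrame({"id": [1, 2], "amount": [1999, 2500]})
today = pd.DataFrame({"id": [3, 4], "amount": [19.99, 25.00], "tier": ["pro", "free"]})

report = diff_schemas(schema_of(yesterday), schema_of(today))
print(report)
```

Here the unit change happens to show up as a type change. Had the dollars been whole numbers, only a range check would have caught it.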


### Microlab: Compare "yesterday vs today" schemas

Simulate two daily exports, detect differences, and choose a conservative strategy:
* Align columns
* Coerce types
* Log drift for review

In [3]:
import pandas as pd

day1 = pd.DataFrame([
  {"id": 1, "score": 0.91, "country": "US"},
  {"id": 2, "score": 0.74, "country": "US"},
])

day2 = pd.DataFrame([
  {"id": 3, "score": "0.88", "country": "CA", "tier": "pro"},  # score becomes string; new column tier
  {"id": 4, "score": "0.67", "country": None, "tier": "free"},
])

print("=== Day 1 ===")
print(day1)
print("Dtypes:", day1.dtypes.to_dict())

print("\n=== Day 2 ===")
print(day2)
print("Dtypes:", day2.dtypes.to_dict())

# Detect drift
cols1, cols2 = set(day1.columns), set(day2.columns)
print("\nSchema drift:")
print("Added columns:", sorted(list(cols2 - cols1)))
print("Removed columns:", sorted(list(cols1 - cols2)))

# Align columns conservatively
all_cols = sorted(list(cols1 | cols2))
aligned = pd.concat([day1.reindex(columns=all_cols), day2.reindex(columns=all_cols)], ignore_index=True)

# Coerce score numeric as a normalization step
aligned["score"] = pd.to_numeric(aligned["score"], errors="coerce")

print("\n=== Combined (aligned + normalized) ===")
print(aligned)
print("Dtypes:", aligned.dtypes.to_dict())

print("\nTry:")
print("- Rename 'score' to 'risk_score' on day2 and detect semantic drift")
print("- Decide: should missing 'tier' on day1 be 'unknown' or null?")

=== Day 1 ===
   id  score country
0   1   0.91      US
1   2   0.74      US
Dtypes: {'id': dtype('int64'), 'score': dtype('float64'), 'country': dtype('O')}

=== Day 2 ===
   id score country  tier
0   3  0.88      CA   pro
1   4  0.67    None  free
Dtypes: {'id': dtype('int64'), 'score': dtype('O'), 'country': dtype('O'), 'tier': dtype('O')}

Schema drift:
Added columns: ['tier']
Removed columns: []

=== Combined (aligned + normalized) ===
  country  id  score  tier
0      US   1   0.91   NaN
1      US   2   0.74   NaN
2      CA   3   0.88   pro
3    None   4   0.67  free
Dtypes: {'country': dtype('O'), 'id': dtype('int64'), 'score': dtype('float64'), 'tier': dtype('O')}

Try:
- Rename 'score' to 'risk_score' on day2 and detect semantic drift
- Decide: should missing 'tier' on day1 be 'unknown' or null?


## B.4 A practical quality checklist

A small set of checks catches most downstream bugs.  

<table style="text-align:left;">
    <thead>
        <tr>
            <th>Check</th>
            <th>Question</th>
            <th>Quick Method</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Row meaning</td>
            <td>What does one row represent?</td>
            <td>Write a one-line row definition</td>
        </tr>
        <tr>
            <td>Missingness</td>
            <td>What is missing, and for whom?</td>
            <td>Missing rate by column and by group</td>
        </tr>
        <tr>
            <td>Uniqueness</td>
            <td>What should be unique?</td>
            <td>Check duplicates on key fields</td>
        </tr>
        <tr>
            <td>Ranges</td>
            <td>Are numeric values plausible?</td>
            <td>Min/max, quantiles, unit sanity check</td>
        </tr>
        <tr>
            <td>Categories</td>
            <td>Do labels look consistent?</td>
            <td>Top values + "other" bucket</td>
        </tr>
        <tr>
            <td>Time</td>
            <td>Are timestamps consistent?</td>
            <td>Timezone check; gaps; ordering</td>
        </tr>
        <tr>
            <td>Schema stability</td>
            <td>Did fields/types change over time?</td>
            <td>Compare schema snapshots (diff)</td>
        </tr>
    </tbody>
</table>
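
A few of these checks fit in one small helper. This is a rough sketch, not a complete audit: `quick_audit` is a hypothetical name, and it only covers the missingness, uniqueness, and range rows of the table.

```python
import pandas as pd

def quick_audit(df, key, group=None):
    """Print a few checklist items for a DataFrame and return a summary dict."""
    summary = {
        "n_rows": len(df),
        "duplicate_keys": int(df.duplicated(subset=key).sum()),
        "missing_rate": df.isna().mean().round(3).to_dict(),
    }
    print("Rows:", summary["n_rows"])
    print("Duplicate keys:", summary["duplicate_keys"])
    print("Missing rate by column:", summary["missing_rate"])
    if group is not None:
        # Share of missing values per column within each group
        print(df.drop(columns=[group]).isna().groupby(df[group]).mean())
    print(df.describe().loc[["min", "max"]])  # ranges for numeric columns only
    return summary

df = pd.DataFrame({
    "user_id": [1, 2, 2, 4],
    "plan": ["free", "paid", "paid", "free"],
    "age": [34, None, 29, 212],
})
summary = quick_audit(df, key=["user_id"], group="plan")
```

In this toy frame the helper surfaces one duplicated `user_id`, one missing age, and an implausible maximum age of 212.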

## B.5 - Lab in other notebook

# C - Data Wrangling and Transformation

Cleaning is only step one. To build models, dashboards, or decisions, you typically need to **reshape**, **combine**, and **encode** raw data into analysis-ready form.  

This section will cover the pandas patterns that show up everywhere:  
* selecting
* filtering
* creating columns
* grouping / aggregating
* joining tables
* extracting structure

The goal is learning how to preserve meaning while transforming data.

## C.1 - Pandas Fundamentals

**Three fundamental questions**
* What does one row represent right now?
* What columns are inputs vs derived outputs?
* What assumptions am I encoding when I transform?

Rule of Thumb:  
Transformations should be reversible in your head: for any output value, you should be able to say which input rows and columns produced it.

### Microlab - Pandas Basics (columns, filters, summaries)  

Create a few clean derived columns without changing row meaning.

In [3]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
  "unique_key": [1, 2, 3, 4],
  "created_date": pd.to_datetime(["2025-02-01 10:00", "2025-02-01 11:30", "2025-02-02 09:15", "2025-02-02 12:00"]),
  "closed_date":  pd.to_datetime(["2025-02-01 14:00", None, "2025-02-02 10:00", "2025-02-03 12:00"]),
  "borough": ["MANHATTAN", "BROOKLYN", "MANHATTAN", "QUEENS"],
  "status": ["Closed", "Open", "Closed", "Closed"],
})

print("Raw:")
print(df)


# Derived columns
df["is_closed"] = (df["status"] == "Closed").astype(int)

# Duration in hours (only when closed)
df["resolution_hours"] = (df["closed_date"] - df["created_date"]).dt.total_seconds() / 3600
df.loc[df["closed_date"].isna(), "resolution_hours"] = np.nan

print("\nWith derived column:")
print(df)

print("\nSummary:")
print("Closed rate:", df["is_closed"].mean())
print("Median resolution (hrs):", df["resolution_hours"].median())

print("\nTry:")
print("- Filter to MANHATTAN and compute the same stats")
print("- Decide: should negative or huge durations be flagged?")


Raw:
   unique_key        created_date         closed_date    borough  status
0           1 2025-02-01 10:00:00 2025-02-01 14:00:00  MANHATTAN  Closed
1           2 2025-02-01 11:30:00                 NaT   BROOKLYN    Open
2           3 2025-02-02 09:15:00 2025-02-02 10:00:00  MANHATTAN  Closed
3           4 2025-02-02 12:00:00 2025-02-03 12:00:00     QUEENS  Closed

With derived column:
   unique_key        created_date         closed_date    borough  status  \
0           1 2025-02-01 10:00:00 2025-02-01 14:00:00  MANHATTAN  Closed   
1           2 2025-02-01 11:30:00                 NaT   BROOKLYN    Open   
2           3 2025-02-02 09:15:00 2025-02-02 10:00:00  MANHATTAN  Closed   
3           4 2025-02-02 12:00:00 2025-02-03 12:00:00     QUEENS  Closed   

   is_closed  resolution_hours  
0          1              4.00  
1          0               NaN  
2          1              0.75  
3          1             24.00  

Summary:
Closed rate: 0.75
Median resolution (hrs): 4.0

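Try:
- Filter to MANHATTAN and compute the same stats
- Decide: should negative or huge durations be flagged?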

## C.2 - groupby, joins/merges

Aggregation changes what a row means. Joins can multiply rows.

**groupby: choosing the unit of analysis**  

When you group and aggregate, you are redefining the dataset. A dataset of requests can become a dataset of boroughs, agencies, or days.  

**Joins: the most common silent bug**  

<table style="text-align:left;">
    <thead>
        <tr>
            <th>Operation</th>
            <th>Risk</th>
            <th>Common Symptom</th>
            <th>Guardrail</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>groupby + agg</td>
            <td>Unit-of-analysis shift</td>
            <td>Metrics no longer comparable</td>
            <td>Write the new row definition</td>
        </tr>
        <tr>
            <td>merge/join</td>
            <td>Row multiplication</td>
            <td>Counts and totals inflate</td>
            <td>Check keys + row counts before/after</td>
        </tr>
    </tbody>
</table>
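
The "check keys + row counts" guardrail can be demonstrated on a tiny case. The `contacts` lookup table below is hypothetical and deliberately contains a duplicate key.

```python
import pandas as pd

requests = pd.DataFrame({"unique_key": [1, 2, 3], "agency": ["NYPD", "DOT", "DOT"]})

# Hypothetical lookup table with an accidental duplicate key for DOT.
contacts = pd.DataFrame({
    "agency": ["NYPD", "DOT", "DOT"],
    "contact": ["a@example.com", "b@example.com", "c@example.com"],
})

# Without a guardrail, each DOT request matches two contact rows.
merged = requests.merge(contacts, on="agency", how="left")
print("Rows before:", len(requests), "| rows after:", len(merged))

# validate= turns the silent inflation into an explicit error.
try:
    requests.merge(contacts, on="agency", how="left", validate="many_to_one")
    validation_failed = False
except pd.errors.MergeError as exc:
    validation_failed = True
    print("MergeError:", exc)
```

Comparing `len()` before and after is the cheapest check; `validate=` states the expected key relationship in the code itself.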

### Microlab - aggregate, then join back safely  

Compute an agency-level metric (like median resolution time) and then merge it back into the request-level table. This is a standard pattern for feature engineering.

In [4]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
  "unique_key": [1, 2, 3, 4, 5],
  "agency": ["NYPD", "NYPD", "DOT", "DOT", "DOT"],
  "resolution_hours": [4.0, np.nan, 1.0, 10.0, 2.0],
  "status": ["Closed", "Open", "Closed", "Closed", "Closed"],
})

print("Request-level rows:", len(df))
print(df)

# Aggregate to agency-level (new unit of analysis)
agency_stats = (
    df.groupby("agency", dropna=False)
      .agg(
          n_requests=("unique_key", "count"),
          closed_rate=("status", lambda s: (s == "Closed").mean()),
          median_resolution=("resolution_hours", "median"),
      )
      .reset_index()
)

print("\nAgency-level table:")
print(agency_stats)

# Merge back: add columns, should not change row count
df2 = df.merge(agency_stats, on="agency", how="left", validate="many_to_one")

print("\nAfter merge back (rows should match original):", len(df2))
print(df2)

print("\nTry:")
print("- Change validate to see what happens if the keys are not unique")
print("- Add another table that has multiple rows per agency and observe row multiplication")

Request-level rows: 5
   unique_key agency  resolution_hours  status
0           1   NYPD               4.0  Closed
1           2   NYPD               NaN    Open
2           3    DOT               1.0  Closed
3           4    DOT              10.0  Closed
4           5    DOT               2.0  Closed

Agency-level table:
  agency  n_requests  closed_rate  median_resolution
0    DOT           3          1.0                2.0
1   NYPD           2          0.5                4.0

After merge back (rows should match original): 5
   unique_key agency  resolution_hours  status  n_requests  closed_rate  \
0           1   NYPD               4.0  Closed           2          0.5   
1           2   NYPD               NaN    Open           2          0.5   
2           3    DOT               1.0  Closed           3          1.0   
3           4    DOT              10.0  Closed           3          1.0   
4           5    DOT               2.0  Closed           3          1.0   

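   median_resolution  
0                4.0  
1                4.0  
2                2.0  
3                2.0  
4                2.0  

Try:
- Change validate to see what happens if the keys are not unique
- Add another table that has multiple rows per agency and observe row multiplication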

## C.3 - String handling and regex

Real datasets hide structure inside messy text. Extract it carefully.  

**What text cleaning is (and is not)**  
* **Cleaning:** removing noise (whitespace, casing), normalizing formats.
* **Extraction:** pulling a structured field out of text (zip code, street number, category).
* **Not magic:** extraction always has errors; you must measure and handle them.  

### Microlab - extract a number and normalize a label  

Pull a street number from an address-like string, and normalize a messy category label. Then measure how many rows failed extraction.

In [5]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
  "incident_address": ["123 Main St", " 55-01 31st Ave ", None, "Broadway & W 50th", "12B Elm Street"],
  "complaint_type": ["Noise - Street/Sidewalk", "noise - street/sidewalk", "NOISE - Street/Sidewalk", "Rodent", "Rodents"],
})

# Normalize labels (a light touch)
df["complaint_norm"] = (
  df["complaint_type"]
    .astype("string")
    .str.strip()
    .str.lower()
)

# Regex: try to extract leading street number (very imperfect)
df["street_number"] = (
  df["incident_address"]
    .astype("string")
    .str.extract(r"^\s*(\d+)", expand=False)
)

df["street_number"] = pd.to_numeric(df["street_number"], errors="coerce")

print(df)

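# Share of rows with no street number. Note that this also counts rows where
# the address itself was missing, not only rows where the regex failed to match.
fail_rate = df["street_number"].isna().mean()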
print("\nStreet number extraction fail rate:", round(fail_rate, 3))

print("\nTry:")
print("- Improve the regex to capture patterns like '55-01' or '12B'")
print("- Decide whether you want a strict or lenient extractor")
print("- Measure: how many rows become null after extraction?")

    incident_address           complaint_type           complaint_norm  \
0        123 Main St  Noise - Street/Sidewalk  noise - street/sidewalk   
1    55-01 31st Ave   noise - street/sidewalk  noise - street/sidewalk   
2               None  NOISE - Street/Sidewalk  noise - street/sidewalk   
3  Broadway & W 50th                   Rodent                   rodent   
4     12B Elm Street                  Rodents                  rodents   

   street_number  
0            123  
1             55  
2           <NA>  
3           <NA>  
4             12  

Street number extraction fail rate: 0.4

Try:
- Improve the regex to capture patterns like '55-01' or '12B'
- Decide whether you want a strict or lenient extractor
- Measure: how many rows become null after extraction?
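
One possible response to the "Try" prompts above is sketched below. It compares a strict and a lenient extractor on the same five addresses and measures failures only among rows that had an address to begin with. Both regex patterns and the names `strict`, `lenient`, and `has_input` are illustrative assumptions, not the notebook's official solution, and the lenient pattern will still miss many real address formats.

```python
import pandas as pd

addresses = pd.Series(
    ["123 Main St", " 55-01 31st Ave ", None, "Broadway & W 50th", "12B Elm Street"],
    dtype="string",
)

# Strict: leading digits that must be followed by whitespace.
# This rejects '55-01' and '12B' instead of silently truncating them.
strict = addresses.str.extract(r"^\s*(\d+)(?=\s)", expand=False)

# Lenient: leading digits, an optional '-digits' part, and an optional
# single-letter suffix, again followed by whitespace.
lenient = addresses.str.extract(r"^\s*(\d+(?:-\d+)?[A-Za-z]?)(?=\s)", expand=False)

# Separate "no address given" from "address given but nothing extracted".
has_input = addresses.notna()
strict_fail = strict[has_input].isna().mean()
lenient_fail = lenient[has_input].isna().mean()

print(pd.DataFrame({"address": addresses, "strict": strict, "lenient": lenient}))
print("Strict fail rate (non-null inputs): ", strict_fail)
print("Lenient fail rate (non-null inputs):", lenient_fail)
```

The lenient result stays a string because values like `55-01` and `12B` are not numbers. Whether to keep them as labels or split them into a numeric part and a suffix depends on what the downstream analysis needs.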


## C.4 - Feature construction from raw fields  

Features are not "more columns". They are structured signals that match your decision.  

**Good features are predictable and auditable**  
* **Stable:** defined the same way over time.
* **Non-leaky:** available at prediction time.
* **Interpretable:** you can explain what it means and why it helps.
* **Measurable quality:** you can quantify missingness and noise.

**Common "first-pass" features for event-like data**  

<table style="text-align:left;">
    <thead>
        <tr>
            <th>Raw field</th>
            <th>Feature idea</th>
            <th>Why it helps</th>
            <th>Risk</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td><code>created_date</code></td>
            <td>hour, day-of-week, <br>is_weekend</td>
            <td>captures seasonality and staffing effects</td>
            <td>timezone issues</td>
        </tr>
        <tr>
            <td><code>complaint_type</code></td>
            <td>top-k categories + "Other"</td>
            <td>reduces high-cardinality noise</td>
            <td>rare categories get hidden</td>
        </tr>
        <tr>
            <td><code>borough</code></td>
            <td>one-hot or grouping</td>
            <td>location differences</td>
            <td>encodes demographic proxies</td>
        </tr>
        <tr>
            <td><code>text fields</code></td>
            <td>keyword flags / extracted tokens</td>
            <td>adds signal from descriptions</td>
            <td>high error rate; drift</td>
        </tr>
    </tbody>
</table>
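
The timezone risk in the first row of the table can be made concrete with a small sketch. It assumes, purely for illustration, that the timestamps are stored in UTC without a timezone label and that the events happened in New York; the sample values and the choice of `America/New_York` are hypothetical.

```python
import pandas as pd

# Hypothetical timestamps, assumed to be UTC but stored without a timezone.
raw = pd.Series(pd.to_datetime(["2025-02-01 03:30", "2025-02-03 04:15"]))

# Features taken directly from the stored values.
naive_hour = raw.dt.hour
naive_weekend = raw.dt.dayofweek.isin([5, 6])

# Same features after declaring the source timezone and converting to local time.
local = raw.dt.tz_localize("UTC").dt.tz_convert("America/New_York")
local_hour = local.dt.hour
local_weekend = local.dt.dayofweek.isin([5, 6])

print(pd.DataFrame({
    "raw": raw,
    "naive_hour": naive_hour,
    "local_hour": local_hour,
    "naive_weekend": naive_weekend,
    "local_weekend": local_weekend,
}))
```

In this example both the hour and the weekend flag change for every row once the timestamps are read in local time, which is why the timezone of a date field is worth confirming before building features from it.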

### Microlab - Create a few "safe" features  

Create time-based features and a simple top-k category encoding. These are the kind of features that are often good enough for a baseline model.  

In [6]:
import pandas as pd

df = pd.DataFrame({
  "unique_key": [1, 2, 3, 4, 5, 6],
  "created_date": pd.to_datetime([
    "2025-02-01 10:00", "2025-02-01 23:30", "2025-02-02 09:15",
    "2025-02-02 12:00", "2025-02-03 08:05", "2025-02-03 17:45"
  ]),
  "complaint_type": ["Noise", "Noise", "Rodent", "Water Leak", "Noise", "Other Weird Thing"],
})

df["hour"] = df["created_date"].dt.hour
df["dayofweek"] = df["created_date"].dt.dayofweek  # 0=Mon
df["is_weekend"] = df["dayofweek"].isin([5,6]).astype(int)

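# Top-k category encoding: keep the k most frequent labels, map the rest to "Other".
# Three labels here are tied at one row each, so which of them survives the
# cutoff depends on sort order, not on a rule you chose.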
k = 2
topk = df["complaint_type"].value_counts().head(k).index
df["complaint_topk"] = df["complaint_type"].where(df["complaint_type"].isin(topk), other="Other")

print(df)

print("\nTry:")
print("- Change k to 3 or 4")
print("- Add a rare category and see how it collapses into 'Other'")
print("- Decide: is 'Other' acceptable for your decision?")

   unique_key        created_date     complaint_type  hour  dayofweek  \
0           1 2025-02-01 10:00:00              Noise    10          5   
1           2 2025-02-01 23:30:00              Noise    23          5   
2           3 2025-02-02 09:15:00             Rodent     9          6   
3           4 2025-02-02 12:00:00         Water Leak    12          6   
4           5 2025-02-03 08:05:00              Noise     8          0   
5           6 2025-02-03 17:45:00  Other Weird Thing    17          0   

   is_weekend complaint_topk  
0           1          Noise  
1           1          Noise  
2           1         Rodent  
3           1          Other  
4           0          Noise  
5           0          Other  

Try:
- Change k to 3 or 4
- Add a rare category and see how it collapses into 'Other'
- Decide: is 'Other' acceptable for your decision?
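
To keep a top-k encoding stable and non-leaky, one option is to learn the kept categories from training data only and then reuse that list on new rows. The sketch below is one way to do this; the helper names `fit_topk` and `apply_topk`, the alphabetical tie-break, and the sample data are assumptions made for illustration.

```python
import pandas as pd

def fit_topk(train_labels: pd.Series, k: int) -> list:
    """Return the k most frequent labels, breaking ties alphabetically."""
    counts = train_labels.value_counts()
    ordered = sorted(counts.index, key=lambda label: (-counts[label], label))
    return ordered[:k]

def apply_topk(labels: pd.Series, keep: list, other: str = "Other") -> pd.Series:
    """Replace every label outside `keep` with `other`."""
    return labels.where(labels.isin(keep), other=other)

# Hypothetical training rows and later, unseen rows.
train = pd.Series(["Noise", "Noise", "Rodent", "Water Leak", "Noise", "Rodent"])
new = pd.Series(["Noise", "Graffiti", "Rodent", "Water Leak"])

keep = fit_topk(train, k=2)      # learned once, from training data only
encoded = apply_topk(new, keep)  # reused unchanged on new data

print("Kept categories:", keep)
print(encoded.tolist())
```

A label that never appeared in training, such as `Graffiti` here, falls into `Other` without changing the definition of the feature. Tracking the share of rows that land in `Other` over time is a simple way to notice drift.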


## C.5 - Lab in other notebook