# A: Where Data Comes From

This section covers the main places data comes from, the assumptions each source bakes in, and how to run a quick *source audit* before cleaning.

## A.1 Files: CSV, JSON, Excel

Flat files are common, portable, and simple.

**What these formats are good at:**
<table style="text-align:left;">
    <tr>
        <th>Format</th>
        <th>Best suited for</th>
        <th>Common misuse</th>
    </tr>
    <tr>
        <th>CSV</th>
        <th>Simple rectangular tables; system exports; quick sharing</th>
        <th>Encoding complex structure or multiple tables in one file</th>
    </tr>
    <tr>
        <th>Excel</th>
        <th>Human-facing analysis, manual edits, reporting</th>
        <th>Using as a source of truth or automated data pipeline input</th>
    </tr>
    <tr>
        <th>JSON</th>
        <th>Nested or semi-structured records; API responses</th>
        <th>Assuming fields are stable or consistently present</th>
    </tr>
</table>

### Hidden assumptions
* Someone chose column names, encodings, delimiters, and header rows.
* Missing values may be implicit ("", NA, -1).
* Types are inferred later, not enforced at creation.
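
The assumptions above can be made explicit at load time. The following is a minimal sketch, not a general recipe: the inline `raw` text is illustrative, and the rule that a negative age means "unknown" is an assumption made for this example. It counts what pandas treats as missing by default, converts the `-1` sentinel into a real missing value, and then declares the column types instead of leaving them inferred.

```python
import pandas as pd
from io import StringIO

# Illustrative export: one blank, one "NA" string, and one -1 sentinel.
raw = "user_id,age,income\n101,34,72000\n102,,NA\n103,-1,65000\n"

df = pd.read_csv(StringIO(raw))

# By default pandas reads "" and "NA" as NaN, but -1 is just a number to it.
missing_before = int(df["age"].isna().sum())

# Assumption for this sketch: a negative age is a sentinel for "unknown".
df["age"] = df["age"].mask(df["age"] < 0)
missing_after = int(df["age"].isna().sum())

# Declare types explicitly; "Int64" is pandas' nullable integer dtype.
df = df.astype({"age": "Int64", "income": "Int64"})

print("Missing ages before/after:", missing_before, missing_after)
print(df.dtypes.to_dict())
```

The point of the before/after count is that a sentinel hides missingness from `isna()` until someone decides what it means.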

### Microlab: File sanity check

Simulate loading messy exports and practice the first questions to ask of any new file:
* What are the columns?
* What looks like missing data?
* What types are being inferred?

In [1]:
import pandas as pd
from io import StringIO
import json

print("=== CSV export (note blanks, NA, weird spacing) ===")
csv_text = """
user_id,age,income,city
101,34,72000,Denver
102,,NA, Boulder
103,29,65000,Denver
104, -1,5400, "Colorado Springs"
"""
df_csv = pd.read_csv(StringIO(csv_text),
                     skipinitialspace=True)
print(df_csv)
print("\nDtypes:", df_csv.dtypes.to_dict())
print("Missing by column:", df_csv.isna().sum().to_dict())

print("\n=== JSON export (nested fields) ===")
json_text = json.dumps([
      {"user_id": 201, "profile": {"age": 31, "city": "Denver"}, "income": 80000},
      {"user_id": 202, "profile": {"age": None, "city": "Boulder"}, "income": "79000"},
      {"user_id": 203, "profile": {"city": "Denver"}, "income": 61000}
])

records = json.loads(json_text)
df_json = pd.json_normalize(records)
print(df_json)
print("\nDtypes:", df_json.dtypes.to_dict())

print("\nTry:")
print("- Replace -1 with a real age and re-check missingness")
print("- Make income consistently numeric")
print("- Rename columns to something consistent (snake_case)")


=== CSV export (note blanks, NA, weird spacing) ===
   user_id   age   income              city
0      101  34.0  72000.0            Denver
1      102   NaN      NaN           Boulder
2      103  29.0  65000.0            Denver
3      104  -1.0   5400.0  Colorado Springs

Dtypes: {'user_id': dtype('int64'), 'age': dtype('float64'), 'income': dtype('float64'), 'city': dtype('O')}
Missing by column: {'user_id': 0, 'age': 1, 'income': 1, 'city': 0}

=== JSON export (nested fields) ===
   user_id income  profile.age profile.city
0      201  80000         31.0       Denver
1      202  79000          NaN      Boulder
2      203  61000          NaN       Denver

Dtypes: {'user_id': dtype('int64'), 'income': dtype('O'), 'profile.age': dtype('float64'), 'profile.city': dtype('O')}

Try:
- Replace -1 with a real age and re-check missingness
- Make income consistently numeric
- Rename columns to something consistent (snake_case)
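
Two of the "Try" items can be sketched as follows. This is one possible approach among several: `pd.to_numeric` with `errors="coerce"` turns unparseable values into NaN rather than raising, and replacing the dot in flattened column names with an underscore is an assumed naming choice.

```python
import pandas as pd

# Same nested records as the JSON export in the microlab.
records = [
    {"user_id": 201, "profile": {"age": 31, "city": "Denver"}, "income": 80000},
    {"user_id": 202, "profile": {"age": None, "city": "Boulder"}, "income": "79000"},
    {"user_id": 203, "profile": {"city": "Denver"}, "income": 61000},
]

df = pd.json_normalize(records)

# Make income consistently numeric; values that cannot be parsed become NaN.
df["income"] = pd.to_numeric(df["income"], errors="coerce")

# Flattened names such as "profile.age" become "profile_age".
df.columns = [name.replace(".", "_") for name in df.columns]

print(df)
print("\nDtypes:", df.dtypes.to_dict())
```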


#### Example with cleaned data

In [3]:
import pandas as pd
from io import StringIO
import json

print("=== CSV export (note blanks, NA, weird spacing) ===")
csv_text = """
user_id,age,income,city
101,34,72000,Denver
102,27,99000,Boulder
103,29,65000,Denver
104,44,5400,Colorado Springs
"""
df_csv = pd.read_csv(StringIO(csv_text),
                     skipinitialspace=True)
print(df_csv)
print("\nDtypes:", df_csv.dtypes.to_dict())
print("Missing by column:", df_csv.isna().sum().to_dict())

print("\n=== JSON export (nested fields) ===")
json_text = json.dumps([
      {"user_id": 201, "profile": {"age": 31, "city": "Denver"}, "income": 80000},
      {"user_id": 202, "profile": {"age": 27, "city": "Boulder"}, "income": 79000},
      {"user_id": 203, "profile": {"age": 44, "city": "Denver"}, "income": 61000}
])

records = json.loads(json_text)
df_json = pd.json_normalize(records)
print(df_json)
print("\nDtypes:", df_json.dtypes.to_dict())

=== CSV export (note blanks, NA, weird spacing) ===
   user_id  age  income              city
0      101   34   72000            Denver
1      102   27   99000           Boulder
2      103   29   65000            Denver
3      104   44    5400  Colorado Springs

Dtypes: {'user_id': dtype('int64'), 'age': dtype('int64'), 'income': dtype('int64'), 'city': dtype('O')}
Missing by column: {'user_id': 0, 'age': 0, 'income': 0, 'city': 0}

=== JSON export (nested fields) ===
   user_id  income  profile.age profile.city
0      201   80000           31       Denver
1      202   79000           27      Boulder
2      203   61000           44       Denver

Dtypes: {'user_id': dtype('int64'), 'income': dtype('int64'), 'profile.age': dtype('int64'), 'profile.city': dtype('O')}


## A.2 SQL databases

Databases reflect how applications think about the world.  

**Why databases exist**  

Databases are optimized for transactions, consistency, and multi-user access, not analysis. Their schemas encode business logic: users, orders, events, states.  

<table style="text-align:left;">
    <tr>
        <th>Type</th>
        <th>What a row represents</th>
        <th>Typical pitfall</th>
    </tr>
    <tr>
        <th>Events</th>
        <th>Something that happened at a time</th>
        <th>Accidentally double counting or missing time windows</th>
    </tr>
    <tr>
        <th>State</th>
        <th>Current snapshot of something</th>
        <th>Assuming it contains historical truth</th>
    </tr>
</table>
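
The difference between the two row types can be illustrated with a small sketch. The `plan_changes` table and the dates are hypothetical, introduced only for this example: a state table is updated in place, so the earlier value is gone, while an event table keeps one row per change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# State table: one row per user, overwritten in place.
cur.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, plan TEXT)")
# Event table (hypothetical): one row per plan change, append-only.
cur.execute(
    "CREATE TABLE plan_changes (user_id INTEGER, new_plan TEXT, changed_at TEXT)"
)

# User 1 signs up on the free plan.
cur.execute("INSERT INTO users VALUES (1, 'free')")
cur.execute("INSERT INTO plan_changes VALUES (1, 'free', '2024-01-01')")

# User 1 upgrades: the state row is updated, the event log gains a row.
cur.execute("UPDATE users SET plan = 'paid' WHERE user_id = 1")
cur.execute("INSERT INTO plan_changes VALUES (1, 'paid', '2024-03-01')")

state_rows = cur.execute("SELECT user_id, plan FROM users").fetchall()
event_rows = cur.execute(
    "SELECT new_plan, changed_at FROM plan_changes "
    "WHERE user_id = 1 ORDER BY changed_at"
).fetchall()
conn.close()

print("State now:", state_rows)
print("History:  ", event_rows)
```

Only the event table can answer "what plan was this user on in February?"; the state table has no record that the free plan ever applied.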

### Microlab: Join logic and granularity traps

This microlab creates two tables (users and orders) and shows how a join can silently change your "row meaning".

In [2]:
import sqlite3

# Create an in-memory SQLite database
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create tables
cur.execute("""
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    plan TEXT
)
""")

cur.execute("""
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    user_id INTEGER,
    amount REAL
)
""")

# Insert data
cur.executemany(
    "INSERT INTO users (user_id, plan) VALUES (?, ?)",
    [(1, "free"), (2, "paid"), (3, "paid")]
)

cur.executemany(
    "INSERT INTO orders (order_id, user_id, amount) VALUES (?, ?, ?)",
    [(101, 2, 20.0), (102, 2, 35.0), (103, 3, 15.0), (104, 3, 60.0)]
)

conn.commit()

print("=== users (state table) ===")
for row in cur.execute("SELECT * FROM users"):
    print(row)

print("\n=== orders (event table) ====")
for row in cur.execute("SELECT * FROM orders"):
    print(row)

print("\n=== joined view ===")
for row in cur.execute("""
SELECT u.user_id, u.plan, o.order_id, o.amount
FROM users u
LEFT JOIN orders o
ON u.user_id = o.user_id
"""):
    print(row)

print("\nQuestion: How many paid users are there?")

# Correct answer
cur.execute("SELECT COUNT(*) FROM users WHERE plan = 'paid'")
print("Correct (from users table):", cur.fetchone()[0])

# WRONG answer: counting rows after join
cur.execute("""
SELECT COUNT(*)
FROM users u
LEFT JOIN orders o
ON u.user_id = o.user_id
WHERE u.plan = 'paid'
""")
print("WRONG (after join, counting rows):", cur.fetchone()[0])

print("\nFix: define unit of analysis explicitly.")

# Correct fix using DISTINCT
cur.execute("""
SELECT COUNT(DISTINCT u.user_id)
FROM users u
LEFT JOIN orders o
ON u.user_id = o.user_id
WHERE u.plan = 'paid'
""")
print("Paid users (DISTINCT user_id):", cur.fetchone()[0])

print("\nTry:")
print("- Insert another order for user 2")
print("- Compute total revenue by plan using GROUP BY")
print("- Ask: what does ONE ROW represent after each query?")

=== users (state table) ===
(1, 'free')
(2, 'paid')
(3, 'paid')

=== orders (event table) ====
(101, 2, 20.0)
(102, 2, 35.0)
(103, 3, 15.0)
(104, 3, 60.0)

=== joined view ===
(1, 'free', None, None)
(2, 'paid', 101, 20.0)
(2, 'paid', 102, 35.0)
(3, 'paid', 103, 15.0)
(3, 'paid', 104, 60.0)

Question: How many paid users are there?
Correct (from users table): 2
WRONG (after join, counting rows): 4

Fix: define unit of analysis explicitly.
Paid users (DISTINCT user_id): 2

Try:
- Insert another order for user 2
- Compute total revenue by plan using GROUP BY
- Ask: what does ONE ROW represent after each query?
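
The GROUP BY item from the "Try" list might look like the sketch below, which rebuilds the same two tables so it can run on its own. After grouping, one row represents one plan, so users are counted with `DISTINCT` and revenue is summed over order rows. `COALESCE` is used here as one way to report 0 rather than NULL for plans with no orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, plan TEXT)")
cur.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)"
)
cur.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "free"), (2, "paid"), (3, "paid")],
)
cur.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(101, 2, 20.0), (102, 2, 35.0), (103, 3, 15.0), (104, 3, 60.0)],
)

# One output row per plan: users are de-duplicated, orders are not.
rows = cur.execute("""
SELECT u.plan,
       COUNT(DISTINCT u.user_id)    AS n_users,
       COUNT(o.order_id)            AS n_orders,
       COALESCE(SUM(o.amount), 0.0) AS revenue
FROM users u
LEFT JOIN orders o
ON u.user_id = o.user_id
GROUP BY u.plan
ORDER BY u.plan
""").fetchall()
conn.close()

for row in rows:
    print(row)
```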


## A.3 APIs

APIs give you data through someone else's interface and rules.

**What APIs provide**  

* Programmatic access to live or regularly updated data.
* Structured responses (often JSON).
* Authentication, quotas, and versioning.

**Common constraints**
* Rate limits: you cannot pull everything at once.
* Partial views: pagination, filters, or redacted fields.
* Instability: fields can change or disappear.
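
The constraints above usually show up as a loop in client code. The sketch below is a simulation, not a real client: `fetch_page`, `FAKE_PAGES`, and the `pause_seconds` parameter are hypothetical names standing in for an HTTP call and a rate-limit policy. It pulls pages until one comes back empty, then reports which fields are absent from some records.

```python
import time

# Hypothetical stand-in for a paginated endpoint.
FAKE_PAGES = {
    1: [{"id": 1, "name": "Ava", "score": 0.91},
        {"id": 2, "name": "Ben", "score": 0.74}],
    2: [{"id": 3, "name": "Cam", "score": "0.88", "country": "US"},
        {"id": 4, "name": "Dee", "score": 0.67, "country": None}],
}

def fetch_page(page):
    """Simulated request; returns an empty list past the last page."""
    return FAKE_PAGES.get(page, [])

def fetch_all(max_pages=100, pause_seconds=0.0):
    """Collect records page by page, pausing between requests."""
    records, page = [], 1
    while page <= max_pages:          # hard cap guards against endless loops
        batch = fetch_page(page)
        if not batch:                 # empty page signals the end
            break
        records.extend(batch)
        page += 1
        time.sleep(pause_seconds)     # crude way to stay under a rate limit
    return records

records = fetch_all()

# Schema drift check: how many records lack each field entirely?
seen_fields = set().union(*(r.keys() for r in records))
absent_by_field = {f: sum(f not in r for r in records) for f in sorted(seen_fields)}

print("Records:", len(records))
print("Absent by field:", absent_by_field)
```

Counting absent keys separately from null values keeps "the API did not send this field" distinct from "the API sent it empty".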

### Microlab: Pagination + schema drift

In [5]:
import pandas as pd

# Simulated API responses (page 1 vs page 2)
page1 = [
    {"id": 1, "name": "Ava", "score": 0.91},
    {"id": 2, "name": "Ben", "score": 0.74},
]

page2 = [
    {"id": 3, "name": "Cam", "score": "0.88", "country": "US"},
    {"id": 4, "name": "Dee", "score": 0.67, "country": None},
]

df = pd.json_normalize(page1 + page2)

print(df)
print("\nDtypes:", df.dtypes.to_dict())

print("\nCommon API tasks:")
print("1) Combine pages (done)")
print("2) Normalize fields (types + names)")
print("3) Decide how to handle missing optional fields")

# Example: force score numeric
df["score"] = pd.to_numeric(df["score"], errors="coerce")
print("\nAfter forcing score numeric:")
print(df)
print("\nDtypes:", df.dtypes.to_dict())

print("\nTry:")
print("- Add a page3 missing 'name'")
print("- Rename 'id' to 'user_id'")
print("- Drop rows with missing critical fields vs keep and flag")

   id name score country
0   1  Ava  0.91     NaN
1   2  Ben  0.74     NaN
2   3  Cam  0.88      US
3   4  Dee  0.67    None

Dtypes: {'id': dtype('int64'), 'name': dtype('O'), 'score': dtype('O'), 'country': dtype('O')}

Common API tasks:
1) Combine pages (done)
2) Normalize fields (types + names)
3) Decide how to handle missing optional fields

After forcing score numeric:
   id name  score country
0   1  Ava   0.91     NaN
1   2  Ben   0.74     NaN
2   3  Cam   0.88      US
3   4  Dee   0.67    None

Dtypes: {'id': dtype('int64'), 'name': dtype('O'), 'score': dtype('float64'), 'country': dtype('O')}

Try:
- Add a page3 missing 'name'
- Rename 'id' to 'user_id'
- Drop rows with missing critical fields vs keep and flag
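
The last "Try" item, drop versus keep and flag, can be sketched like this. The `page3` record and the choice of which fields count as critical are assumptions made for illustration. Flagging keeps every row and lets a later step decide, while dropping is simply a filter on the flag.

```python
import pandas as pd

page1 = [
    {"id": 1, "name": "Ava", "score": 0.91},
    {"id": 2, "name": "Ben", "score": 0.74},
]
page2 = [
    {"id": 3, "name": "Cam", "score": "0.88", "country": "US"},
    {"id": 4, "name": "Dee", "score": 0.67, "country": None},
]
# Hypothetical third page whose record has no 'name' field.
page3 = [{"id": 5, "score": 0.5}]

df = pd.json_normalize(page1 + page2 + page3)
df["score"] = pd.to_numeric(df["score"], errors="coerce")
df = df.rename(columns={"id": "user_id"})

# Assumed for this sketch: these fields must be present for a row to be usable.
critical = ["user_id", "name", "score"]

# Option 1: keep every row and flag the incomplete ones.
df["is_complete"] = df[critical].notna().all(axis=1)

# Option 2: drop incomplete rows (derived from the same flag).
df_dropped = df[df["is_complete"]]

print(df)
print("\nRows kept after dropping:", len(df_dropped))
```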


## A.4 Web scraping

Scraping is extracting structure from pages built for humans.  

**Why scraping exists**  

Sometimes the data is visible but not downloadable. Scraping turns HTML pages into rows and columns.  

**Why scraping is fragile**  
* HTML structure changes without notice
* Content may be dynamically loaded
* Legal and ethical constraints apply
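
The first point can be made concrete with a parser that checks the structure it expects before trusting it. This is a sketch assuming the `beautifulsoup4` package is installed; `parse_table`, `GOOD_HTML`, and `CHANGED_HTML` are hypothetical names for this example. It fails loudly when the header row changes instead of silently reading the wrong column.

```python
from bs4 import BeautifulSoup

GOOD_HTML = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Widget A</td><td>$10</td></tr>
</table>
"""

# Same page after a redesign added a leading column.
CHANGED_HTML = """
<table>
  <tr><th>SKU</th><th>Name</th><th>Price</th></tr>
  <tr><td>A-1</td><td>Widget A</td><td>$10</td></tr>
</table>
"""

def parse_table(html, expected_headers=("Name", "Price")):
    """Parse table rows into dicts, refusing unexpected layouts."""
    soup = BeautifulSoup(html, "html.parser")
    trs = soup.find_all("tr")
    if not trs:
        raise ValueError("no table rows found")
    headers = [th.get_text(strip=True) for th in trs[0].find_all("th")]
    if tuple(headers) != tuple(expected_headers):
        raise ValueError(f"unexpected headers: {headers}")
    rows = []
    for tr in trs[1:]:
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) != len(headers):
            continue  # skip rows that do not match the header width
        rows.append(dict(zip(headers, cells)))
    return rows

print(parse_table(GOOD_HTML))
try:
    parse_table(CHANGED_HTML)
except ValueError as err:
    print("Layout changed:", err)
```

A positional parser would have accepted the changed page and stored the SKU as the name; the header check turns that silent error into a visible one.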


### Microlab: Parse a tiny HTML snippet into a table

This is a minimal example:

In [11]:
import pandas as pd
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Widget A</td><td>$10</td></tr>
  <tr><td>Widget B</td><td>$12</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find_all("tr")[1:]:
    tds = tr.find_all("td")
    rows.append({
        "name": tds[0].get_text(strip=True),
        "price": tds[1].get_text(strip=True),
    })

df = pd.DataFrame(rows)
print(df)

print("\nTry:")
print("- Change <td>$10</td> to <td>10 usd</td> and clean price")
print("- Add a new column in the HTML and update your parser")
print("- Remove the header row and see how the assumptions break")

       name price
0  Widget A   $10
1  Widget B   $12

Try:
- Change <td>$10</td> to <td>10 usd</td> and clean price
- Add a new column in the HTML and update your parser
- Remove the header row and see how the assumptions break
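
For the price-cleaning item, one possible approach is to extract the first number from each string and flag anything that does not parse. The "Widget C" row and the "call us" value are made up for this sketch, and the code assumes all prices share one currency, which a real page may not guarantee.

```python
import pandas as pd

# Scraped strings in mixed formats; the last row is a hypothetical bad value.
df = pd.DataFrame({
    "name": ["Widget A", "Widget B", "Widget C"],
    "price": ["$10", "10 usd", "call us"],
})

# Capture the first integer or decimal number in each string.
number = df["price"].str.extract(r"(\d+(?:\.\d+)?)", expand=False)

# Strings with no number become NaN instead of raising.
df["price_value"] = pd.to_numeric(number, errors="coerce")
df["price_parsed"] = df["price_value"].notna()

print(df)
```

Keeping the raw `price` column next to the parsed one makes it easy to audit what the cleaning step actually did.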


## A.5 Practice

**Goals**
* **Files:** clean a messy CSV, normalize a nested JSON, handle Excel-style human edits
* **SQL:** use SQLite to see how joins change row meaning and how to aggregate safely
* **APIs:** combine paginated responses and handle schema drift (simulated)
* **Scraping:** parse a tiny HTML table and clean extracted fields
* **MiniProject:** merge sources into one analysis-ready table + write a data quality report
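
As a starting point for the mini-project, the sketch below merges a user table with per-user order totals and builds a small quality report. It reuses the users and orders from A.2 as DataFrames; the report columns (`dtype`, `n_missing`, `n_unique`) are an assumed layout, not a required one. Orders are aggregated to one row per user before the merge so that the unit of analysis stays "one row per user".

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3], "plan": ["free", "paid", "paid"]})
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "user_id": [2, 2, 3, 3],
    "amount": [20.0, 35.0, 15.0, 60.0],
})

# Aggregate events to the user level first, then join.
per_user = orders.groupby("user_id", as_index=False).agg(
    n_orders=("order_id", "count"),
    revenue=("amount", "sum"),
)

# validate raises if either side has duplicate user_id values.
table = users.merge(per_user, on="user_id", how="left", validate="one_to_one")

# Record the gap the join exposed before filling it.
users_without_orders = int(table["revenue"].isna().sum())
table[["n_orders", "revenue"]] = table[["n_orders", "revenue"]].fillna(0)

# Minimal data quality report: one row per column of the merged table.
report = pd.DataFrame({
    "dtype": table.dtypes.astype(str),
    "n_missing": table.isna().sum(),
    "n_unique": table.nunique(),
})

print(table)
print("\nUsers without orders:", users_without_orders)
print("\n", report)
```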