# Module 2.A - Where Data Comes From (Real-World Pipeline Starter)

### Core dataset for the whole module 

**NYC 311 Service Requests (2020 - present)** (NYC Open Data/Socrata)  
Why this dataset works for learning:
* **Real mess:** missing values, inconsistent strings, free-text fields, and "weird" categories.
* **Real scale:** the full dataset is huge, so we must learn how to pull a *slice*.
* **Multiple access modes:** the same data is available as **CSV, API (JSON), and SQL**.
* **Real change over time:** fields and value distributions can shift (schema drift).

We will also create two supporting assets that are reused later:
* **Data dictionary Excel file** (`.xlsx`): the publisher's documentation for the dataset
* **Borough reference table:** either scraped from the web or created as a seed file

### What will be built in 2.A  

This notebook produces a local, module-scoped workspace with the following layout:  
```bash
work/m2/data/     # relative to the notebook's working directory
    raw/          # downloaded files, API responses
    reference/    # lookup tables, dictionaries
    warehouse/    # SQLite databases
```

Later notebooks will assume these exist:
* **2.B** Data quality: missingness, duplicates, inconsistent categories, schema drift
* **2.C** Wrangling: groupby, joins, string cleaning, feature construction
* **2.D** Scaling: incremental refresh, "raw → staged → curated" thinking
* **2.E** Outliers/validation: response times, anomaly checks, "is this plausible?" rules

The goal in 2.A is not perfect cleaning. It is to learn how to acquire data reliably and keep the process reproducible.

## Setup (requests, BeautifulSoup, and a writable workspace)

Common libraries:
* **requests:** a simple way to make HTTP requests
* **BeautifulSoup:** parses HTML to extract pieces
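
As a minimal sketch of what "parses HTML to extract pieces" means in practice, the example below reads a small table out of an HTML string. The HTML fragment and its `id="boroughs"` attribute are hypothetical stand-ins for a scraped page, not a real source used by this module.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a scraped page (not a real source).
html = """
<table id="boroughs">
  <tr><th>Borough</th><th>County</th></tr>
  <tr><td>Manhattan</td><td>New York</td></tr>
  <tr><td>Brooklyn</td><td>Kings</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="boroughs")

# Collect the text of every cell, one list per table row.
rows = []
for tr in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

# The first row is the header; the rest become one dict per record.
header, body = rows[0], rows[1:]
records = [dict(zip(header, r)) for r in body]
print(records)
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame(records)` when a reference table is needed.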

In [4]:
from __future__ import annotations

import json
import sqlite3
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Any, Dict, List

import pandas as pd
import requests
from bs4 import BeautifulSoup

pd.set_option("display.max_columns", 60)
pd.set_option("display.width", 140)

# Writable workspace (module-scoped)
WORK_DIR   = Path("work")
MODULE_DIR = WORK_DIR / "m2"
DATA_DIR   = MODULE_DIR / "data"
RAW_DIR    = DATA_DIR / "raw"
REF_DIR    = DATA_DIR / "reference"
WH_DIR     = DATA_DIR / "warehouse"

for d in [RAW_DIR, REF_DIR, WH_DIR]:
    d.mkdir(parents=True, exist_ok=True)

print("Writable Module 2 data workspace ready:")
print(" ", DATA_DIR.resolve())

Writable Module 2 data workspace ready:
  /home/glake/Nextcloud/Classwork/CS6678 - Advanced Machine Learning/Jupyter Notebooks/work/m2/data


## A.0 - Source Audit Template

Before cleaning, answer these questions:  
* What does one row represent?
* What system produced it?
* What time range does it cover?
* What are known limitations?
* Which fields look risky (missing, free-text, inconsistent categories)?

We will keep a small structured dictionary of notes that can be reused later.

In [5]:
source_audit = {
    "dataset_name": "NYC 311 Service Requests (2020-present)",
    "publisher": "NYC Open Data / 311",
    "where_it_comes_from": "City 311 request intake system (customer service requests routed to agencies).",
    "unit_of_analysis": "Each row represents one 311 service request.", 
    "time_grain": "Requests are created continuously; rows include timestamps for created/closed when available.",
    "known_limitations": [
        "Many fields are optional depending on request type (expect missingness).",
        "Free-text fields (descriptor/address) can be inconsistent and messy.",
        "The dataset is continuously updated; results can change between runs.",
    ],
    "Notes 1/30/26": []
}

source_audit

{'dataset_name': 'NYC 311 Service Requests (2020-present)',
 'publisher': 'NYC Open Data / 311',
 'where_it_comes_from': 'City 311 request intake system (customer service requests routed to agencies).',
 'unit_of_analysis': 'Each row represents one 311 service request.',
 'time_grain': 'Requests are created continuously; rows include timestamps for created/closed when available.',
 'known_limitations': ['Many fields are optional depending on request type (expect missingness).',
  'Free-text fields (descriptor/address) can be inconsistent and messy.',
  'The dataset is continuously updated; results can change between runs.'],
 'Notes 1/30/26': []}
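
For the audit notes to be reusable in later notebooks, they need to live on disk rather than only in memory. The sketch below shows one way to do that with JSON. The file name `nyc311_source_audit.json` is an assumption, the dictionary is a trimmed stand-in for `source_audit`, and a temporary directory stands in for the reference folder so the example runs anywhere.

```python
import json
import tempfile
from pathlib import Path

# Trimmed stand-in for the source_audit dictionary defined above.
audit = {
    "dataset_name": "NYC 311 Service Requests (2020-present)",
    "unit_of_analysis": "Each row represents one 311 service request.",
    "known_limitations": [
        "The dataset is continuously updated; results can change between runs.",
    ],
}

# Assumption: a temporary directory stands in for the reference folder (REF_DIR).
ref_dir = Path(tempfile.mkdtemp())
audit_path = ref_dir / "nyc311_source_audit.json"  # hypothetical file name

# Write the notes as human-readable JSON.
audit_path.write_text(json.dumps(audit, indent=2), encoding="utf-8")

# Read them back, as a later notebook would.
reloaded = json.loads(audit_path.read_text(encoding="utf-8"))
print(reloaded["unit_of_analysis"])
```

JSON keeps the notes diffable in version control, which helps when the audit is updated between runs.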

## A.1 - Files (CSV): Download a Reproducible Slice

Large public datasets are often too big to download in full for learning. A useful technique is to define a slice that is:  
* small enough to iterate quickly (seconds, not minutes)
* recent enough to include real mess
* refreshable

We will pull the **last 14 days** of NYC 311 requests as a CSV.  

**Note on Socrata Timestamps**  

NYC Open Data uses Socrata. Many timestamp fields are "floating timestamps", which expect ISO 8601 values without a timezone suffix (no `Z`, no `+00:00`) in the query string. So we format timestamps like this: `2026-01-04T04:03:21`
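
To see why the suffix has to be stripped explicitly, the short sketch below compares Python's default ISO output for a timezone-aware datetime with the suffix-free form. The instant is a fixed example value chosen so the output is reproducible.

```python
from datetime import datetime, timezone

# Fixed example instant (timezone-aware, with microseconds).
dt = datetime(2026, 1, 4, 4, 3, 21, 123456, tzinfo=timezone.utc)

# Default ISO output keeps both the microseconds and the UTC offset.
with_offset = dt.isoformat()

# Dropping tzinfo and formatting to whole seconds gives the suffix-free form.
floating = dt.replace(tzinfo=None).strftime("%Y-%m-%dT%H:%M:%S")

print(with_offset)  # 2026-01-04T04:03:21.123456+00:00
print(floating)     # 2026-01-04T04:03:21
```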

In [10]:
NYC311_BASE = "https://data.cityofnewyork.us/resource/erm2-nwe9"

# Stable subset of columns that will be reused across Module 2.
NYC311_COLUMNS = [
    "unique_key",
    "created_date",
    "closed_date",
    "agency",
    "agency_name",
    "complaint_type",
    "descriptor",
    "status",
    "borough",
    "incident_zip",
    "incident_address",
    "street_name",
    "city",
    "latitude",
    "longitude",
]

def iso_floating(dt: datetime) -> str:
    """
    Socrata floating timestamps expect ISO8601 without a timezone suffix.
    We convert to UTC, then drop tzinfo and sub-second precision to be conservative.
    """
    dt = dt.astimezone(timezone.utc).replace(tzinfo=None)
    return dt.strftime("%Y-%m-%dT%H:%M:%S")

def download_if_missing(url: str, path: Path, params: dict | None = None, timeout: int = 30) -> Path:
    """
    Download a URL to disk only if the file is not already cached.
    We cache downloads so later notebooks (2.B - 2.E) can reuse the same files
    without hammering the public API repeatedly.
    """
    if path.exists() and path.stat().st_size > 0:
        print("Used cached:", path)
        return path

    print("Downloading:", url)
    r = requests.get(url, params=params, timeout=timeout)

    if r.status_code >= 400:
        print("Status:", r.status_code)
        print("Body (first 300 chars):", r.text[:300])

    r.raise_for_status()
    path.write_bytes(r.content)
    print("Saved:", path, f"({path.stat().st_size/1e6:.2f} MB)")
    return path

def socrata_csv_params(days: int=14, limit: int=5000) -> dict:
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=days)

    select = ",".join(NYC311_COLUMNS)
    where = (
        f"created_date >= '{iso_floating(start)}' "
        f"AND created_date < '{iso_floating(end)}'"
    )
    return {"$select": select, "$where": where, "$order": "created_date DESC", "$limit": limit}

CSV_PATH = RAW_DIR / "nyc311_last14d.csv"
params = socrata_csv_params(days=14, limit=50000)

print("Where clause:", params["$where"])
download_if_missing(f"{NYC311_BASE}.csv", CSV_PATH, params=params)

Where clause: created_date >= '2026-01-17T04:58:39' AND created_date < '2026-01-31T04:58:39'
Downloading: https://data.cityofnewyork.us/resource/erm2-nwe9.csv
Saved: work/m2/data/raw/nyc311_last14d.csv (12.14 MB)


PosixPath('work/m2/data/raw/nyc311_last14d.csv')
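
Because `$limit` caps the number of rows returned, a busy 14-day window can be cut off at the limit instead of covering the whole period. A row count exactly equal to the limit is a hint that this happened. One common way around it is to page through results with SoQL's `$offset` parameter. The sketch below only builds the per-page parameter dictionaries and makes no network calls. The helper name `paged_params` and the `max_rows=120000` ceiling are illustrative assumptions.

```python
from typing import Iterator


def paged_params(base: dict, page_size: int, max_rows: int) -> Iterator[dict]:
    """Yield one query-parameter dict per page (hypothetical helper)."""
    for offset in range(0, max_rows, page_size):
        page = dict(base)  # copy so the caller's dict is never mutated
        page["$limit"] = min(page_size, max_rows - offset)
        page["$offset"] = offset
        yield page


# A stable $order matters: without it, pages may overlap or skip rows.
base = {"$order": "created_date DESC"}

pages = list(paged_params(base, page_size=50000, max_rows=120000))
for p in pages:
    print(p["$offset"], p["$limit"])
```

Each dictionary could be passed as `params` to a separate request, with the pages concatenated afterwards. Since the dataset is updated continuously, rows can still shift between page requests, so paging reduces truncation but does not guarantee a perfectly consistent snapshot.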

### Load the CSV and do a quick source audit

In [11]:
df_csv = pd.read_csv(CSV_PATH)
df_csv.head(5)

| | unique_key | created_date | closed_date | agency | agency_name | complaint_type | descriptor | status | borough | incident_zip | incident_address | street_name | city | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 67720523 | 2026-01-30T01:51:21.000 | | NYPD | New York City Police Department | Noise - Commercial | Loud Music/Party | Unspecified | Unspecified | | | | | | |
| 1 | 67746090 | 2026-01-30T01:51:04.000 | | DOE | Department of Education | School Maintenance | Heating Problem | In Progress | BROOKLYN | 11226.0 | 911 FLATBUSH AVENUE | FLATBUSH AVENUE | BROOKLYN | 40.649787 | -73.95855 |
| 2 | 67758820 | 2026-01-30T01:50:53.000 | | NYPD | New York City Police Department | Noise - Residential | Loud Music/Party | In Progress | MANHATTAN | 10025.0 | 936 AMSTERDAM AVENUE | AMSTERDAM AVENUE | NEW YORK | 40.800498 | -73.96568 |
| 3 | 67707975 | 2026-01-30T01:50:52.000 | | NYPD | New York City Police Department | Noise - Residential | Loud Music/Party | In Progress | STATEN ISLAND | 10302.0 | 190 TRANTOR PLACE | TRANTOR PLACE | STATEN ISLAND | 40.629156 | -74.144411 |
| 4 | 67771794 | 2026-01-30T01:50:32.000 | | TLC | Taxi and Limousine Commission | Taxi Complaint | Driver Complaint - Non Passenger | In Progress | QUEENS | 11430.0 | JOHN F KENNEDY AIRPORT | JOHN F KENNEDY AIRPORT | JAMAICA | 40.64832 | -73.788281 |


In [12]:
df_csv.dtypes

unique_key            int64
created_date         object
closed_date          object
agency               object
agency_name          object
complaint_type       object
descriptor           object
status               object
borough              object
incident_zip        float64
incident_address     object
street_name          object
city                 object
latitude            float64
longitude           float64
dtype: object
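
The dtypes above show two things worth noting for later notebooks: the date columns load as `object` (plain strings), and `incident_zip` loads as `float64`, which is why ZIP codes display as `11226.0`. The sketch below shows one way to load these columns with more suitable types. It uses a tiny in-memory CSV with made-up rows in the same shape as the real file, so it runs without the download.

```python
import io

import pandas as pd

# Made-up sample rows in the same shape as the real CSV (not real records).
sample_csv = io.StringIO(
    "unique_key,created_date,closed_date,incident_zip\n"
    "1,2026-01-30T01:51:21.000,,\n"
    "2,2026-01-30T01:51:04.000,2026-01-30T03:10:00.000,11226\n"
)

# Read ZIP codes as strings so they are not turned into floats like 11226.0.
df = pd.read_csv(sample_csv, dtype={"incident_zip": "string"})

# Parse the timestamp columns; missing values become NaT.
for col in ["created_date", "closed_date"]:
    df[col] = pd.to_datetime(df[col])

print(df.dtypes)
print(df)
```

Keeping ZIP codes as strings also preserves leading zeros, which a numeric type would silently drop.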

In [13]:
df_csv.isna().mean().sort_values(ascending=False).head(10)

closed_date         0.60766
city                0.07640
street_name         0.02592
incident_address    0.02588
latitude            0.01222
longitude           0.01222
incident_zip        0.00732
descriptor          0.00680
complaint_type      0.00000
agency              0.00000
dtype: float64
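
`closed_date` is missing for about 61% of rows, far more than any other column. One plausible explanation, an assumption that is not verified here, is that many requests in a recent window are simply still open. A quick way to test that kind of hypothesis is to compute missingness per `status` group. The sketch below uses a small made-up frame so it runs on its own; the same expression can be applied to `df_csv`.

```python
import pandas as pd

# Made-up rows for illustration only (not real 311 records).
toy = pd.DataFrame(
    {
        "status": ["Closed", "Closed", "In Progress", "In Progress", "Open"],
        "closed_date": [
            "2026-01-20T10:00:00",
            "2026-01-21T09:30:00",
            None,
            None,
            None,
        ],
    }
)

# Fraction of rows with a missing closed_date, computed per status value.
missing_by_status = toy["closed_date"].isna().groupby(toy["status"]).mean()
print(missing_by_status)
```

If missingness on the real data concentrates in non-closed statuses, the gap is structural rather than a data-quality defect, which changes how it should be handled in 2.B.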