# ETL Transform

This notebook imports the core dataset so you can begin transformation steps independently of geo-specific work.

## Overview
- Load the dataset from `../data/archive.zip`.
- Compute per-site (State, County, City, Address) daily reading counts.
- Check whether daily counts are constant per site.

In [None]:
# import necessary libraries (single import cell)
import pandas as pd
from pathlib import Path
from IPython.display import display
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load dataset (reusing the same path used in etl_extract_cood.ipynb)
# - Reads compressed CSV from ../data/archive.zip
# - Stores in df; prints shape and previews head

data_path = Path('../data/archive.zip')
df = pd.read_csv(data_path, compression='zip')
print(f'Loaded df with {len(df):,} rows and {df.shape[1]} columns')
df.head()

Loaded df with 1,746,661 rows and 29 columns


Unnamed: 0.1,Unnamed: 0,State Code,County Code,Site Num,Address,State,County,City,Date Local,NO2 Units,...,SO2 Units,SO2 Mean,SO2 1st Max Value,SO2 1st Max Hour,SO2 AQI,CO Units,CO Mean,CO 1st Max Value,CO 1st Max Hour,CO AQI
0,0,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,Parts per billion,3.0,9.0,21,13.0,Parts per million,1.145833,4.2,21,
1,1,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,Parts per billion,3.0,9.0,21,13.0,Parts per million,0.878947,2.2,23,25.0
2,2,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,Parts per billion,2.975,6.6,23,,Parts per million,1.145833,4.2,21,
3,3,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,Parts per billion,2.975,6.6,23,,Parts per million,0.878947,2.2,23,25.0
4,4,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-02,Parts per billion,...,Parts per billion,1.958333,3.0,22,4.0,Parts per million,0.85,1.6,23,


## Check for variance in the count of daily readings per site

In [None]:
# Group by State, County, City, Address, Date Local: are per-site daily counts constant?
# - Parse 'Date Local' to day-level
# - Count records per (site, day)
# - Check constancy of daily counts per site

required = ['State','County','City','Address','Date Local']
missing = [c for c in required if c not in df.columns]
if missing:
    raise KeyError(f"Missing required columns: {missing}")

# Parse Date Local to day-level
_dt = pd.to_datetime(df['Date Local'], errors='coerce')
if not _dt.notna().any():
    raise RuntimeError("Could not parse any dates from 'Date Local'.")

work = df.copy()
work['_date'] = _dt.dt.date
site_key = ['State','County','City','Address']
full_key = site_key + ['_date']

# Count readings per site-day
counts = (
    work.groupby(full_key, dropna=False)
        .size()
        .reset_index(name='n_readings')
)
print(f"Computed counts for {len(counts):,} site-date groups across {counts[site_key].drop_duplicates().shape[0]:,} sites.")

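# For each site (State, County, City, Address), check if daily counts are constant across days
# dropna=False matches the grouping above, so sites with a missing key part are not silently dropped
per_site_unique = counts.groupby(site_key, dropna=False)['n_readings'].nunique()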
all_sites_constant = (per_site_unique == 1).all()
print(f"All sites have the same number of readings per day? {all_sites_constant}")

if not all_sites_constant:
    varying = per_site_unique[per_site_unique > 1]
    print(f"Sites with varying daily counts: {len(varying):,} of {per_site_unique.size:,}")
    # Show a few examples of varying sites with their daily counts
    sample_sites = (
        counts.merge(varying.rename('distinct_counts'), left_on=site_key, right_index=True)
              .sort_values(['distinct_counts','n_readings'], ascending=[False, False])
              .head(20)
    )
    print("Sample site-day counts for sites that vary:")
    display(sample_sites)

Computed counts for 412,856 site-date groups across 204 sites.
All sites have the same number of readings per day? False
Sites with varying daily counts: 154 of 204
Sample site-day counts for sites that vary:


Unnamed: 0,State,County,City,Address,_date,n_readings,distinct_counts
91765,California,Riverside,Rubidoux,"5888 MISSION BLVD., RUBIDOUX",2011-05-24,96,6
91639,California,Riverside,Rubidoux,"5888 MISSION BLVD., RUBIDOUX",2011-01-01,16,6
91640,California,Riverside,Rubidoux,"5888 MISSION BLVD., RUBIDOUX",2011-01-02,16,6
91641,California,Riverside,Rubidoux,"5888 MISSION BLVD., RUBIDOUX",2011-01-03,16,6
91642,California,Riverside,Rubidoux,"5888 MISSION BLVD., RUBIDOUX",2011-01-04,16,6
91643,California,Riverside,Rubidoux,"5888 MISSION BLVD., RUBIDOUX",2011-01-05,16,6
91644,California,Riverside,Rubidoux,"5888 MISSION BLVD., RUBIDOUX",2011-01-06,16,6
91645,California,Riverside,Rubidoux,"5888 MISSION BLVD., RUBIDOUX",2011-01-07,16,6
91646,California,Riverside,Rubidoux,"5888 MISSION BLVD., RUBIDOUX",2011-01-08,16,6
91647,California,Riverside,Rubidoux,"5888 MISSION BLVD., RUBIDOUX",2011-01-09,16,6


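The number of readings per day clearly varies per site: 154 of 204 sites have more than one distinct daily count. In the sample above, the Rubidoux site has 16 readings on most listed days but 96 on 2011-05-24.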

This, however, may not be an issue. Finn has found a consensus that we use the worst reading.
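
A minimal sketch of what a worst-reading aggregate could look like, run on a made-up toy frame rather than the real `df`. It assumes that "worst" means the highest value per site and day, which the notes do not confirm. The `toy` frame, its values, and the output column name `CO Mean (worst)` are hypothetical; only `CO Mean` and the key columns come from the dataset.

```python
import pandas as pd

# Toy stand-in for the real df; the values are invented for illustration only.
toy = pd.DataFrame({
    'State': ['Arizona'] * 4,
    'County': ['Maricopa'] * 4,
    'City': ['Phoenix'] * 4,
    'Address': ['SITE A'] * 4,
    'Date Local': ['2000-01-01', '2000-01-01', '2000-01-01', '2000-01-02'],
    'CO Mean': [1.1, 0.9, None, 0.85],
})

site_day_key = ['State', 'County', 'City', 'Address', 'Date Local']

# Assumption: "worst" = highest value. max() skips NaN by default,
# so a missing reading does not mask the other readings of that day.
worst = (
    toy.groupby(site_day_key, dropna=False)['CO Mean']
       .max()
       .reset_index(name='CO Mean (worst)')
)
print(worst)
```

Because this collapses every site-day to a single row, the varying number of readings per day no longer affects the shape of the result.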

Daniel is skeptical about faulty readings being aggregated. He suggests either using the median value as a simple aggregate, or automating outlier detection and removal before the worst-reading aggregate, if that is practical and desirable.
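
The two alternatives can be sketched on a toy frame as follows. This is an illustration under stated assumptions, not an agreed method. The 1.5 × IQR rule is one common outlier filter picked here for the example, applying it per site is also an assumption, and "worst" is again assumed to mean the maximum. The `toy` data is invented, with `50.0` standing in for a faulty reading.

```python
import pandas as pd

# Toy stand-in for the real df; 50.0 plays the role of a faulty reading.
toy = pd.DataFrame({
    'Address': ['SITE A'] * 6,
    'Date Local': ['2000-01-01'] * 3 + ['2000-01-02'] * 3,
    'CO Mean': [1.0, 1.2, 1.1, 0.9, 1.0, 50.0],
})
site_key = ['Address']
site_day_key = site_key + ['Date Local']

# Option 1: median per site-day, which is robust to a single extreme value.
median_agg = toy.groupby(site_day_key)['CO Mean'].median()

# Option 2: drop outliers per site with a 1.5 * IQR rule (an assumed choice),
# then take the maximum of what remains as the "worst" reading.
per_site = toy.groupby(site_key)['CO Mean']
q1 = per_site.transform(lambda s: s.quantile(0.25))
q3 = per_site.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1
keep = toy['CO Mean'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
filtered_worst = toy[keep].groupby(site_day_key)['CO Mean'].max()

print(median_agg)
print(filtered_worst)
```

In this toy case a plain maximum would report 50.0 for 2000-01-02, while both alternatives report 1.0. Whether such a filter would also discard genuine pollution spikes in the real data is an open question that the sketch does not answer.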