## PurpleAir acquisition: South Bronx sensors, November 7–10, 2024

This notebook focuses **only on data acquisition** from PurpleAir.

### What is PurpleAir?

PurpleAir is a network of **low-cost air-quality sensors** that are often installed by communities, schools, or individuals. Because the sensors are inexpensive, many are deployed, which lets PurpleAir provide **dense coverage** across a city (often much denser than official monitoring networks).

PurpleAir data are especially useful for:

- Seeing **neighborhood-scale differences** in air quality (block-to-block variability),
- Adding **context** around fewer, higher-quality monitoring sites (for example, regulatory stations),
- Comparing time patterns across sources (later steps will handle data cleaning and alignment).

PurpleAir sensors mainly report **particle-based PM estimates** (commonly PM$_{1}$, PM$_{2.5}$, and PM$_{10}$). These are valuable, but they should be treated as **raw measurements** that typically need quality checks (and sometimes calibration) before drawing strong conclusions. Those steps are discussed further in this lesson's `m201-air-quality-measures-integrated` notebook.

### What this notebook does

- Uses a PurpleAir API key to:
  - Query the API for all sensors in a South Bronx bounding box,
  - Save a sensor metadata table (`SouthBronxPurpleAirSensors.parquet`),
  - Download time series data for a specified event window (Nov 7–10, 2024) for each sensor.

In [None]:
import datetime
import time
from pathlib import Path
import getpass
import pytz
import requests
import pandas as pd
import geopandas as gpd
import pyarrow  # not called directly; one of the Parquet engines pandas can use in to_parquet
# -------------------------------------------------------------------
# Paths and parameters
# -------------------------------------------------------------------

BASE_DIR = Path(".")
DATA_DIR = BASE_DIR / "data"
RAW_DIR = DATA_DIR / "raw" / "purpleair"

RAW_DIR.mkdir(parents=True, exist_ok=True)

# Event window in local time (America/New_York)
tz = pytz.timezone("America/New_York")
EVENT_START_LOCAL = tz.localize(datetime.datetime(2024, 11, 7, 0, 0, 0))
EVENT_END_LOCAL = tz.localize(datetime.datetime(2024, 11, 10, 23, 59, 59))

# Convert event window to UTC timestamps (seconds since epoch) for PurpleAir API
EVENT_START_UTC_TS = EVENT_START_LOCAL.astimezone(pytz.utc).timestamp()
EVENT_END_UTC_TS = EVENT_END_LOCAL.astimezone(pytz.utc).timestamp()

EVENT_START_LOCAL, EVENT_END_LOCAL, EVENT_START_UTC_TS, EVENT_END_UTC_TS

(datetime.datetime(2024, 11, 7, 0, 0, tzinfo=<DstTzInfo 'America/New_York' EST-1 day, 19:00:00 STD>),
 datetime.datetime(2024, 11, 10, 23, 59, 59, tzinfo=<DstTzInfo 'America/New_York' EST-1 day, 19:00:00 STD>),
 1730955600.0,
 1731301199.0)
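
As a sanity check on the values above, the epoch seconds can be converted back to readable timestamps. The following is a minimal sketch using only the standard library. It assumes a fixed UTC-5 offset, which holds for this window because daylight saving time in New York ended on November 3, 2024. The helper name `describe_epoch` is illustrative and not part of the notebook.

```python
import datetime

# Fixed offset for Eastern Standard Time (assumption: the whole event window
# falls after the November 3, 2024 switch back from daylight saving time).
EST = datetime.timezone(datetime.timedelta(hours=-5), "EST")


def describe_epoch(epoch_seconds):
    """Return (UTC, EST) ISO-8601 strings for an epoch timestamp in seconds."""
    utc_dt = datetime.datetime.fromtimestamp(epoch_seconds, tz=datetime.timezone.utc)
    return utc_dt.isoformat(), utc_dt.astimezone(EST).isoformat()


# Epoch values copied from the cell output above.
for label, ts in [("start", 1730955600), ("end", 1731301199)]:
    utc_str, est_str = describe_epoch(ts)
    print(f"{label}: {ts} -> {utc_str} (UTC) / {est_str} (EST)")
```

If the printed local times do not match the intended window (midnight on Nov 7 through 23:59:59 on Nov 10), the time zone handling upstream deserves a second look before any requests are made.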

## PurpleAir Data
Data are available via PurpleAir's API. You will need a Gmail account to create an API key via [this dashboard](https://develop.purpleair.com/sign-in?redirectURL=%2Fdashboards%2Fkeys). Create a "Read" key with a status of "Enabled". Adding a label is a good idea. Host restrictions limit use of the key to certain machines; you do not need to set them.

Once you have generated your key, you can "read" the key value to use in making requests. First, run the cell below and enter the key you generated when prompted. 

If you do not have a Gmail account, or you do not want to set up an API key, the sample data shown here are saved in the repository and can be used directly. In that case, proceed to the `m201-air-quality-measures-integrated` notebook.
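
For repeated runs, one alternative to typing the key each time is to read it from an environment variable and fall back to the prompt only when the variable is unset. This is a sketch, not the notebook's method: the variable name `PURPLEAIR_API_KEY` and the helper `load_api_key` are assumptions chosen for illustration.

```python
import getpass
import os


def load_api_key(env_var="PURPLEAIR_API_KEY", environ=None, prompt=getpass.getpass):
    """Return the key from an environment variable, or prompt for it if unset.

    env_var is a hypothetical variable name; use whatever name suits your setup.
    """
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        # Nothing in the environment: ask interactively without echoing input.
        key = prompt("Enter your PurpleAir API key: ").strip()
    if not key:
        raise ValueError("No PurpleAir API key provided.")
    return key
```

With this helper, `api_key = load_api_key()` would replace the direct `getpass.getpass(...)` call, and the key still never appears in the notebook itself.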

In [2]:
"""
Prompt for the PurpleAir API key (not stored in the notebook) and test it.

You must have created an API key on the PurpleAir developer portal:
https://develop.purpleair.com/
"""

api_key = getpass.getpass("Enter your PurpleAir API key: ")

headers = {"X-API-Key": api_key}

# Simple "who am I" test
test_url = "https://api.purpleair.com/v1/keys"
resp = requests.get(test_url, headers=headers)
print("Status code:", resp.status_code)
print("Response snippet:", str(resp.text)[:200])


Status code: 200
Response snippet: {
  "api_version" : "V1.2.0-1.1.45",
  "time_stamp" : 1766175416,
  "api_key_type" : "READ"
}


If the API key is valid (the test prints `Status code: 200`), a bounding box can be used to search for sensors. The coordinates in the cell below give the longitude and latitude of the northwest and southeast corners of a box that encloses the South Bronx. The API request returns the identifiers and requested metadata fields of the sensors within that bounding box.
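
The bounding box is stored as `(min_lon, min_lat, max_lon, max_lat)`, while the request uses the parameters `nwlng`, `nwlat`, `selng`, and `selat`, so the corners have to be mapped carefully. The sketch below shows that mapping with a basic ordering check. The helper name `bbox_to_purpleair_params` is illustrative, and the check assumes the box does not cross the antimeridian.

```python
def bbox_to_purpleair_params(bbox):
    """Map (min_lon, min_lat, max_lon, max_lat) to nwlng/nwlat/selng/selat."""
    min_lon, min_lat, max_lon, max_lat = bbox
    if not (min_lon < max_lon and min_lat < max_lat):
        raise ValueError(
            f"Expected (min_lon, min_lat, max_lon, max_lat), got {bbox}"
        )
    return {
        "nwlng": min_lon,  # northwest corner: western edge
        "nwlat": max_lat,  # northwest corner: northern edge
        "selng": max_lon,  # southeast corner: eastern edge
        "selat": min_lat,  # southeast corner: southern edge
    }


# Same South Bronx box used in this notebook.
print(bbox_to_purpleair_params((-73.933, 40.80, -73.78, 40.9)))
```

Swapping latitude and longitude, or the two corners, is an easy mistake to make here, which is why the check raises an error instead of sending a malformed box to the API.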

In [25]:
"""
Query the PurpleAir API for sensors within the South Bronx bounding box.

We request a list of sensors, load it into a pandas DataFrame, and save it as:
- A plain Parquet file (metadata only)
- A GeoParquet file with a Point geometry column (EPSG:4326)
- A CSV file
"""

sensors_url = "https://api.purpleair.com/v1/sensors"
# South Bronx bounding box (approx.)
# (min_lon, min_lat, max_lon, max_lat)

BBOX_SOUTH_BRONX = (-73.933, 40.80, -73.78, 40.9)
params = {
    "fields": ",".join([
        "sensor_index",
        "name",
        "latitude",
        "longitude",
        "location_type",
        "private",
        "date_created",
        "last_seen",
        "position_rating"
    ]),
    "nwlng": BBOX_SOUTH_BRONX[0],
    "nwlat": BBOX_SOUTH_BRONX[3],
    "selng": BBOX_SOUTH_BRONX[2],
    "selat": BBOX_SOUTH_BRONX[1]
}
resp = requests.get(sensors_url, headers=headers, params=params)
resp.raise_for_status()
sensor_payload = resp.json()

print("Raw payload keys:", sensor_payload.keys())
print("Number of sensors:", len(sensor_payload.get("data", [])))

# Build a DataFrame from "data" and "fields"
sensor_fields = sensor_payload["fields"]
sensor_data = sensor_payload["data"]

sensor_df = pd.DataFrame(sensor_data, columns=sensor_fields)

print("Sensor metadata shape:", sensor_df.shape)
display(sensor_df.head())

# Save as plain Parquet and GeoParquet with geometry
meta_parquet_path = RAW_DIR / "SouthBronxPurpleAirSensors.parquet"
sensor_df.to_parquet(meta_parquet_path, index=False)
print(f"Saved sensor metadata to {meta_parquet_path.resolve()}")

gdf = gpd.GeoDataFrame(
    sensor_df,
    geometry=gpd.points_from_xy(sensor_df.longitude, sensor_df.latitude),
    crs="EPSG:4326"
)

meta_geo_path = RAW_DIR / "SouthBronxPurpleAirSensors_geo.parquet"
gdf.to_parquet(meta_geo_path, index=False)
print(f"Saved GeoParquet sensor metadata to {meta_geo_path.resolve()}")

# Save as CSV
meta_csv_path = RAW_DIR / "SouthBronxPurpleAirSensors.csv"
sensor_df.to_csv(meta_csv_path, index=False)
print(f"Saved sensor metadata CSV to {meta_csv_path.resolve()}")

sensor_df[["sensor_index", "name", "latitude", "longitude"]]


Raw payload keys: dict_keys(['api_version', 'time_stamp', 'data_time_stamp', 'max_age', 'firmware_default_version', 'fields', 'location_types', 'data'])
Number of sensors: 19
Sensor metadata shape: (19, 9)


| | sensor_index | date_created | last_seen | private | name | location_type | position_rating | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 77735 | 1602004813 | 1766177219 | 0 | FreshAir-I2 | 1 | 3 | 40.860535 | -73.8858 |
| 1 | 90249 | 1605560768 | 1766177201 | 0 | FreshAir-O4 | 0 | 3 | 40.861225 | -73.89016 |
| 2 | 90283 | 1605561111 | 1766164225 | 0 | SIS-roof | 0 | 5 | 40.81536 | -73.888374 |
| 3 | 90363 | 1605561282 | 1766177219 | 0 | FA-CTKi | 1 | 5 | 40.83745 | -73.91547 |
| 4 | 92169 | 1606238714 | 1766177141 | 0 | FA-CTKo | 0 | 5 | 40.837517 | -73.91544 |


Saved sensor metadata to C:\git\TOPSTSCHOOL-air-quality\data\raw\purpleair\SouthBronxPurpleAirSensors.parquet
Saved GeoParquet sensor metadata to C:\git\TOPSTSCHOOL-air-quality\data\raw\purpleair\SouthBronxPurpleAirSensors_geo.parquet


| | sensor_index | name | latitude | longitude |
|---|---|---|---|---|
| 0 | 77735 | FreshAir-I2 | 40.860535 | -73.8858 |
| 1 | 90249 | FreshAir-O4 | 40.861225 | -73.89016 |
| 2 | 90283 | SIS-roof | 40.81536 | -73.888374 |
| 3 | 90363 | FA-CTKi | 40.83745 | -73.91547 |
| 4 | 92169 | FA-CTKo | 40.837517 | -73.91544 |
| 5 | 92171 | FreshAir-O1 | 40.860455 | -73.88581 |
| 6 | 140062 | Crbx | 40.874683 | -73.904335 |
| 7 | 140064 | inwood | 40.87065 | -73.917595 |
| 8 | 140094 | home | 40.8934 | -73.83783 |
| 9 | 140378 | Fordham Hill Oval | 40.865273 | -73.90739 |

In [26]:
def get_sensor_data(sensor_id, field_names, start_date_stamp, end_date_stamp):
    """Request history for one sensor; return the JSON payload, or None on failure."""

    # History endpoint: the sensors URL followed by the sensor index and /history
    url = f"https://api.purpleair.com/v1/sensors/{sensor_id}/history"

    params = {
        "fields": ",".join(field_names),  # comma-separated list of requested fields
        "start_timestamp": start_date_stamp,
        "end_timestamp": end_date_stamp,
    }
    time.sleep(2)  # pause to avoid a "too many requests" error (429)
    with requests.get(url=url, headers=headers, params=params) as response:

        if response.status_code in (200, 201):
            print('Success')
            sensor_data = response.json()
            # Number of top-level keys in the payload, not the number of data rows
            print(len(sensor_data))
        else:
            print(f"Request failed with status code: {response.status_code}")
            print(response.text)
            return None

        return sensor_data


# Start and end of the event window.
# Timestamps are stored on the server in UTC, so convert
# the local times to UTC before making the timestamps.
tz = pytz.timezone('America/New_York')

start_date = tz.localize(datetime.datetime(2024,11,7,0,0,0))
end_date = tz.localize(datetime.datetime(2024,11,10,23,59,59))

print(start_date)
print(end_date)
## convert to UTC time zone and generate timestamp
start_timestamp = start_date.astimezone(pytz.utc).timestamp()
end_timestamp = end_date.astimezone(pytz.utc).timestamp()

fields = [
    'pm2.5_alt',        #Estimated mass concentration PM2.5 (µg/m³).
    'pm2.5_atm',        #Estimated mass concentration PM2.5 (µg/m³) (raw value).
    'humidity',         #Relative humidity inside of the sensor housing (%). This matches the "Raw Humidity" map layer and on average is 4% lower than ambient conditions.
    'temperature',      #Temperature inside of the sensor housing (F). This matches the "Raw Temperature" map layer and on average is 8°F higher than ambient conditions.
    'pressure',         #Current pressure in Millibars.
]

sensor_response_data = dict()
sensor_ids = gdf.sensor_index.values
print(sensor_ids)
for sid in sensor_ids:  # `sid` avoids shadowing the built-in id()
    sensor_response_data[sid] = get_sensor_data(
        sid,
        field_names=fields,
        start_date_stamp=int(start_timestamp),
        end_date_stamp=int(end_timestamp)
    )

2024-11-07 00:00:00-05:00
2024-11-10 23:59:59-05:00
[ 77735  90249  90283  90363  92169  92171 140062 140064 140094 140378
 172111 188617 201901 201941 208367 208369 255843 257893 257905]
Success
9
Success
9
Success
9
Success
9
Success
9
Success
9
Success
9
Success
9
Success
9
Success
9
Success
9
Success
9
Success
9
Success
9
Success
9
Success
9
Success
9
Success
9
Success
9
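
The fixed 2-second pause before each request is one way to stay under the rate limit. A complementary approach is to retry only when the server actually answers with HTTP 429, waiting longer after each failed attempt. The following is a generic sketch under that assumption. `fetch_with_backoff` is a hypothetical helper, not part of `requests` or of the PurpleAir API, and the delay values are arbitrary starting points.

```python
import time


def fetch_with_backoff(fetch, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch() until it returns a non-429 status or attempts run out.

    fetch must be a zero-argument callable returning an object with a
    .status_code attribute (for example, a requests.Response).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    response = None
    for attempt in range(max_attempts):
        response = fetch()
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, then 2x, 4x, ... of it.
            sleep(base_delay * (2 ** attempt))
    # Still rate-limited after the last attempt; hand back the 429 response.
    return response
```

A call would look like `fetch_with_backoff(lambda: requests.get(url, headers=headers, params=params))`. Passing `sleep` as a parameter keeps the helper easy to test without real waiting.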


In [29]:
"""
Fetch time series (history) for each sensor in the South Bronx for the event window
and save one Parquet file per sensor in data/raw/purpleair/.

We intentionally do not resample or plot here; we just fetch and store.
"""

import time
import pandas as pd

# fields we want to retrieve from PurpleAir
FIELDS = [
    "pm2.5_alt",   # estimated PM2.5 (µg/m³)
    "pm2.5_atm",   # "raw" PM2.5 (µg/m³)
    "humidity",
    "temperature",
    "pressure",
]

def get_sensor_history(sensor_id, fields, start_ts, end_ts, headers, pause_seconds=2.0):
    """
    Fetch historical data for a single PurpleAir sensor.
    """
    url = f"https://api.purpleair.com/v1/sensors/{sensor_id}/history"

    # PurpleAir can reject float timestamps like 1730955600.0, so force int seconds.
    start_ts = int(float(start_ts))
    end_ts   = int(float(end_ts))

    params = {
        "fields": ",".join(fields),
        "start_timestamp": start_ts,
        "end_timestamp": end_ts,
    }

    time.sleep(pause_seconds)
    resp = requests.get(url, headers=headers, params=params)

    if resp.status_code not in (200, 201):
        print(f"[{sensor_id}] Request failed: {resp.status_code}")
        print(resp.text[:300])
        return None

    return resp.json()


def records_to_dataframe(records, fields, sensor_id=None, timestamp_column="time_stamp"):
    """
    Convert PurpleAir 'history' records into a pandas DataFrame with a datetime index (UTC).
    """
    df = pd.DataFrame(records, columns=fields)

    if sensor_id is not None:
        df["sensor_id"] = str(sensor_id)

    if timestamp_column in df.columns:
        dt_col = "date_time_stamp"
        df[dt_col] = pd.to_datetime(df[timestamp_column], unit="s", utc=True)
        df = df.set_index(dt_col).sort_index()

    return df


# ---- Loop over all sensors and save one Parquet per sensor ----

all_sensor_ids = sensor_df["sensor_index"].tolist()
print("Fetching history for sensor_ids:", all_sensor_ids)

for sid in all_sensor_ids:
    payload = get_sensor_history(
        sensor_id=sid,
        fields=FIELDS,
        start_ts=EVENT_START_UTC_TS,
        end_ts=EVENT_END_UTC_TS,
        headers=headers,
        pause_seconds=2.0,
    )

    if payload is None or "data" not in payload or "fields" not in payload:
        print(f"[{sid}] No data returned.")
        continue

    df_hist = records_to_dataframe(
        records=payload["data"],
        fields=payload["fields"],
        sensor_id=sid,
        timestamp_column="time_stamp",
    )

    out_path = RAW_DIR / f"PurpleAir_sensor_{sid}_2024_11_07_to_11_10.parquet"
    if df_hist.shape[0] > 0:
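        # index=False omits the datetime index from the Parquet file;
        # the epoch-seconds time_stamp column is still written.
        df_hist.to_parquet(out_path, index=False)  # requires pyarrow or fastparquet installed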
        print(f"[{sid}] Saved {df_hist.shape[0]} rows to {out_path.name}")
        out_csv = RAW_DIR / f"PurpleAir_sensor_{sid}_2024_11_07_to_11_10.csv"
        df_out = df_hist.reset_index().rename(columns={"date_time_stamp": "datetime_utc"})
        df_out.to_csv(out_csv, index=False)
        print(f"[{sid}] Saved {df_hist.shape[0]} rows to {out_csv.name}")
    else:
        print(f"[{sid}] No rows to save.")

Fetching history for sensor_ids: [77735, 90249, 90283, 90363, 92169, 92171, 140062, 140064, 140094, 140378, 172111, 188617, 201901, 201941, 208367, 208369, 255843, 257893, 257905]
[77735] Saved 575 rows to PurpleAir_sensor_77735_2024_11_07_to_11_10.parquet
[77735] Saved 575 rows to PurpleAir_sensor_77735_2024_11_07_to_11_10.csv
[90249] Saved 575 rows to PurpleAir_sensor_90249_2024_11_07_to_11_10.parquet
[90249] Saved 575 rows to PurpleAir_sensor_90249_2024_11_07_to_11_10.csv
[90283] No rows to save.
[90363] Saved 575 rows to PurpleAir_sensor_90363_2024_11_07_to_11_10.parquet
[90363] Saved 575 rows to PurpleAir_sensor_90363_2024_11_07_to_11_10.csv
[92169] Saved 574 rows to PurpleAir_sensor_92169_2024_11_07_to_11_10.parquet
[92169] Saved 574 rows to PurpleAir_sensor_92169_2024_11_07_to_11_10.csv
[92171] Saved 572 rows to PurpleAir_sensor_92171_2024_11_07_to_11_10.parquet
[92171] Saved 572 rows to PurpleAir_sensor_92171_2024_11_07_to_11_10.csv
[140062] Saved 575 rows to PurpleAir_sensor_1