Air quality measurements are made at a handful of stations in Allegheny County.

Overdose deaths are reported by zipcode.

We want to estimate the air quality at the centroid of each zipcode in Allegheny County for every day with air quality data. (TODO: eventually use hourly data)

This is something of a fool's errand, because zipcodes are very difficult to tie to a location in space: they are based on mail carrier routes, and the USPS does not share those routes. Treat any centroid, and any value estimated at it, as an approximation.

In [1]:
import os

import pandas as pd

from air_brain.data.get_data import DATA_DIR
from air_brain.util.air import PM25
from air_brain.util.loc import distance

ZIPCODE_FILE = os.path.join(DATA_DIR, "zip2latlon.csv")

In [2]:
# zipcode to lat/lon
zip_df = pd.read_csv(ZIPCODE_FILE)
zip_df.head()

,zipcode,place,latitude,longitude
0,15006,Bairdford,40.6312,-79.8814
1,15007,Bakerstown,40.6478,-79.931
2,15014,Brackenridge,40.6082,-79.7414
3,15015,Bradfordwoods,40.6372,-80.0811
4,15017,Bridgeville,40.3472,-80.1153


In [3]:
# air quality sensor locations
# in this case, PM 2.5
loc_df = PM25().site_loc()
loc_df.head()

,site,latitude,longitude
3,Lawrenceville,40.465433,-79.960742
6,Avalon,40.499789,-80.071347
9,North Braddock,40.402267,-79.860942
10,Clairton,40.294381,-79.885303
12,Lincoln,40.308278,-79.869103


In [4]:
# air quality data, by date and site
# wide format
# in this case, PM 2.5
pm25 = PM25().by_site()
pm25.tail()

date,Avalon,Clairton,Lawrenceville,Liberty 2,Lincoln,North Braddock,Parkway East
2024-11-23,9.0,8.0,9.0,8.0,,9.0,13.0
2024-11-24,41.0,33.0,39.0,34.0,,37.0,34.0
2024-11-25,55.0,80.0,54.0,62.0,,58.0,49.0
2024-11-26,33.0,24.0,29.0,31.0,,33.0,31.0
2024-11-27,39.0,33.0,37.0,43.0,,51.0,37.0


In [5]:
# air quality data, by date and site
# long format
# in this case, PM 2.5
pm25_long = PM25().daily_air()[['date', 'site', 'index_value']]
pm25_long.sort_values(['date', 'site']).head(5)

,date,site,index_value
0,2016-01-01,Lawrenceville,25
10,2016-01-01,Liberty 2,28
4,2016-01-01,Lincoln,35
17,2016-01-01,Parkway East,30
22,2016-01-02,Lawrenceville,40
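
The long and wide tables hold the same readings in two layouts. As a rough illustration of how they relate, the sketch below pivots a tiny long-format frame (three rows copied from the output above) into one column per site. Whether `PM25().by_site()` is implemented this way is not shown in this notebook, so treat that as an assumption; `long_df` and `wide_df` are illustrative names.

```python
import pandas as pd

# Three long-format rows copied from the output above.
long_df = pd.DataFrame({
    "date": ["2016-01-01", "2016-01-01", "2016-01-02"],
    "site": ["Lawrenceville", "Lincoln", "Lawrenceville"],
    "index_value": [25, 35, 40],
})

# One row per date, one column per site.
# A site with no reading on a date ends up as NaN.
wide_df = long_df.pivot(index="date", columns="site", values="index_value")
print(wide_df)
```

The NaN cells matter later: any per-date summary across sites has to decide how to treat sites that did not report.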


## using mean or median over all sites
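This is probably not a good idea, but it is easy to do: take the mean (or median) of every site's reading for each day.

The drawback is that every zipcode gets the same value on a given day, so there is no variability across zipcodes.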

In [6]:
mean_df = pm25.mean(axis=1).rename("mean")

In [7]:
median_df = pm25.median(axis=1).rename("median")
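
One detail worth checking in the row-wise reductions above is how missing sites are handled. The sketch below rebuilds a single day (the 2024-11-24 row from the wide table, where Lincoln has no reading) and applies the same two calls; `day`, `day_mean`, and `day_median` are illustrative names, not part of the project.

```python
import pandas as pd

# One day of readings copied from the 2024-11-24 row of the wide table;
# Lincoln has no value on that date.
day = pd.DataFrame(
    {
        "Avalon": [41.0], "Clairton": [33.0], "Lawrenceville": [39.0],
        "Liberty 2": [34.0], "Lincoln": [float("nan")],
        "North Braddock": [37.0], "Parkway East": [34.0],
    },
    index=pd.Index(["2024-11-24"], name="date"),
)

# Both reductions skip NaN by default (skipna=True), so the summary is
# taken over the six sites that reported, not all seven.
day_mean = day.mean(axis=1)
day_median = day.median(axis=1)
print(day_mean.iloc[0], day_median.iloc[0])
```

Because missing sites are dropped silently, the number of sites behind each daily value can change from day to day.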

## using inverse distance weighting
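Inverse distance weighting (IDW) estimates the value at each zipcode centroid as a weighted average of the site readings, with each site weighted by 1/distance so that nearer sites count more. It is a reasonable approach, though not as good as kriging.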
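
To make the weighting concrete, here is a minimal standalone sketch of IDW for a single zipcode on a single day, using plain Python rather than the dataframe operations in this notebook. The function name `idw`, its `power` parameter, and the example distances are all hypothetical; `power=1` corresponds to the 1/distance weights used here.

```python
import math

def idw(values, distances, power=1):
    """Inverse-distance-weighted average of `values`, skipping missing readings.

    Assumes every distance is strictly positive.
    """
    num = 0.0
    denom = 0.0
    for value, dist in zip(values, distances):
        if value is None or math.isnan(value):
            continue  # a site with no reading contributes no weight
        weight = 1.0 / dist ** power
        num += weight * value
        denom += weight
    # If no site reported, there is nothing to average.
    return num / denom if denom else float("nan")

# Hypothetical centroid 2 km from site A, 4 km from site B,
# and 8 km from site C, which has no reading that day.
estimate = idw([40.0, 10.0, float("nan")], [2.0, 4.0, 8.0])
print(estimate)
```

In this made-up case the estimate lands closer to site A's reading than a plain average of A and B would, because A is half as far away and so gets twice the weight.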

In [8]:
# find the distance between each measurement station and each zipcode
idw_df = zip_df.copy()
for row in loc_df.itertuples():
    # distance from this site to every zipcode centroid
    idw_df["{}_dist".format(row.site)] = [
        distance(row.latitude, row.longitude, lat, lon)
        for lat, lon in zip(idw_df.latitude, idw_df.longitude)
    ]

# keep only the zipcode and one distance column per site
idw_df = idw_df[["zipcode"] + ["{}_dist".format(x) for x in loc_df.site]]
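
The implementation of `air_brain.util.loc.distance` is not shown in this notebook. A common choice for point-to-point distance from latitude and longitude is the haversine (great-circle) formula, sketched below as an assumption about what such a helper might look like. The name `haversine_km`, the kilometer units, and the Earth radius constant are illustrative and may not match the project's function.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a conventional approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Lawrenceville monitor to the 15006 (Bairdford) centroid,
# using coordinates from the tables above.
d = haversine_km(40.465433, -79.960742, 40.6312, -79.8814)
print(round(d, 1))
```

Since IDW only uses ratios of the weights, the choice of units (kilometers or miles) does not change the estimates, as long as the same units are used for every site.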

In [9]:
# generate the dataframe of zipcode x date for each measurement
idw_df = idw_df.merge(pm25.reset_index(), how="cross")
# compute the IDW: weight each site's reading by 1 / distance,
# skipping sites with no reading on that date
idw_df["num"] = 0.0
idw_df["denom"] = 0.0
for site in loc_df.site:
    weight = 1 / idw_df["{}_dist".format(site)]
    idw_df["num"] += idw_df[site].fillna(0) * weight
    idw_df["denom"] += idw_df[site].notna() * weight
# NaN when no site reported on that date (0 / 0)
idw_df["idw"] = idw_df.num / idw_df.denom

idw_df = idw_df[["date", "zipcode", "idw"]]

## save for later

In [10]:
# merge in mean and median just for comparison
idw_df = idw_df.merge(mean_df, left_on="date", right_index=True)
idw_df = idw_df.merge(median_df, left_on="date", right_index=True)
idw_df.to_csv(os.path.join(DATA_DIR, "pm25_zipcode.csv"), index=False)