# 2. Aligning ACS and EPA Data

Our historical air quality data from the EPA is reported per monitoring station, and each station's position is given as coordinates. To make useful comparisons, we need to align the two datasets: for each year, decide which air quality data points fall within each CBSA GeoID/region in ACS.

We'll refer to these as `epa_point` and `acs_region` respectively.
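To make "align" concrete, here is a minimal sketch of the containment test behind it: checking whether an `epa_point` lies inside the outline of an `acs_region`. It uses a plain ray-casting test on planar longitude/latitude pairs, which is only an illustration and not how the notebook does it (the queries further down rely on BigQuery's `ST_WITHIN`). The helper name `point_in_ring`, the square region, and the station coordinates are all made up for the example, and holes and multi-part geometries are ignored.

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the ring of (lon, lat) vertices?"""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # longitude at which the edge crosses that horizontal line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# hypothetical square "acs_region" and two hypothetical "epa_point" positions
square = [(-157.0, 20.0), (-156.0, 20.0), (-156.0, 21.0), (-157.0, 21.0)]
print(point_in_ring(-156.5, 20.5, square))  # point within the square
print(point_in_ring(-155.0, 20.5, square))  # point east of the square
```

Real CBSA boundaries are multipolygons on a sphere, so a geography-aware function is the safer choice for the actual alignment.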

In [2]:
# set up path to app credentials - see exploration/README.md
%env GOOGLE_APPLICATION_CREDENTIALS=../google_app_credentials.json

# set up bigquery client
from google.cloud import bigquery
bq = bigquery.Client()

env: GOOGLE_APPLICATION_CREDENTIALS=../google_app_credentials.json


In [21]:
# set up some dependencies
import geopandas as gp
import json

First we'll load up the relevant data, which should already be set up.

In [10]:
# set up acs data
resp = bq.query('''
    SELECT DISTINCT do_date AS year
    FROM `eosc410-project.data.acs_cbsa_*`
    ORDER BY do_date ASC
''')
acs_years = [row["year"] for row in resp]

# load the CBSA region boundaries for one ACS year from the local geojson export
def load_geojson(y) -> gp.GeoDataFrame:
    print('=> loading %s' % y)
    geo = gp.read_file('../_data/tmp/acs_cbsa_%s/geojson.json' % y)
    print('loaded %s' % y)
    print(geo)
    return geo
acs_regions = [load_geojson(y) for y in acs_years]

# take a peek at the regions for the first year
print(acs_regions[0])

3                            Kingston, NY Metro Area   M1    1  G3110   
4                             Astoria, OR Micro Area   M2    2  G3110   
..                                               ...  ...  ...    ...   
950  Riverside-San Bernardino-Ontario, CA Metro Area   M1    1  G3110   
951  Los Angeles-Long Beach-Santa Ana, CA Metro Area   M1    1  G3110   
952                           Redding, CA Metro Area   M1    1  G3110   
953                          Show Low, AZ Micro Area   M2    2  G3110   
954                          Danville, IL Metro Area   M1    1  G3110   

           ALAND      AWATER     INTPTLAT      INTPTLON  \
0    10433603617  2739477738  +19.5977643  -155.5024434   
1     6740993584    76055435  +45.5355586  -111.1734431   
2     1984070931     1809509  +33.2230377  -093.2328433   
3     2911755313    94599649  +41.9472156  -074.2654583   
4     2147298720   661241625  +46.0245092  -123.7050140   
..           ...         ...          ...           ...   
95

In [46]:
# look at what our geojson looks like
acs_regions_2007 = acs_regions[0]      # first year
test_year = '2007'
test_region = acs_regions_2007.loc[0]  # first region in this year
test_region_name = test_region['NAME'] # this region has a name
test_region_geo = test_region.geometry # this region has a geometry
print(test_region_name)
print(json.dumps(test_region_geo.__geo_interface__)[:100] + "...")

Kahului-Wailuku, HI
{"type": "MultiPolygon", "coordinates": [[[[-156.70540599999998, 20.82632], [-156.711227, 20.832228]...


In [47]:
# for each year and acs region, we want to get associated station numbers
# try with the test region: find all EPA points that fall within this region's geometry in that year
resp = bq.query('''
SELECT DISTINCT site_num, address, cbsa_name
FROM `eosc410-project.data.epa_air_quality_annual` as epa
WHERE
  ST_WITHIN(ST_GEOGPOINT(epa.longitude, epa.latitude), ST_GEOGFROMGEOJSON('%s'))
  AND year = %s
''' % (json.dumps(test_region_geo.__geo_interface__), test_year))
print('Stations in region "%s" in %s:' % (test_region_name, test_year))
for row in resp:
    print('site %s (%s, %s)' % (row['site_num'], row['address'], row['cbsa_name']))

Stations in region "Kahului-Wailuku, HI" in 2007:
site 9001 (Haleakala National Park, Kahului-Wailuku-Lahaina, HI)
site 0006 (KAIHOI ST AND KAIOLOHIA ST, Kahului-Wailuku-Lahaina, HI)
site 9000 (Haleakala National Park, HI 96768, Kahului-Wailuku-Lahaina, HI)


In [None]:
# TODO: do this for all years, all regions
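
One possible way to approach this TODO is sketched below; it is not the notebook's eventual implementation. It reuses the query from the test region and loops over every year and region. The helper names `build_station_query` and `stations_by_region` are hypothetical, and the query runner is passed in as a callable so that the loop can be tried without BigQuery credentials.

```python
import json


def build_station_query(region_geojson: dict, year) -> str:
    """Build the same station lookup used for the test region above."""
    return '''
SELECT DISTINCT site_num, address, cbsa_name
FROM `eosc410-project.data.epa_air_quality_annual` AS epa
WHERE
  ST_WITHIN(ST_GEOGPOINT(epa.longitude, epa.latitude), ST_GEOGFROMGEOJSON('%s'))
  AND year = %s
''' % (json.dumps(region_geojson), year)


def stations_by_region(years, regions_per_year, run_query):
    """Map (year, region name) to the sorted site numbers found in that region.

    regions_per_year: one list of (name, geojson dict) pairs per entry in years.
    run_query: callable that takes a SQL string and returns an iterable of rows
               supporting row['site_num'] (an assumption about the runner).
    """
    result = {}
    for year, regions in zip(years, regions_per_year):
        for name, geojson in regions:
            rows = run_query(build_station_query(geojson, year))
            result[(year, name)] = sorted({row['site_num'] for row in rows})
    return result
```

In this notebook, `run_query` could be `bq.query`, and each year's list could be built from the loaded frames with something like `[(r['NAME'], r['geometry'].__geo_interface__) for _, r in regions.iterrows()]`, assuming the geometry column keeps its default name. This issues one query per region per year, so batching or a single spatial join may turn out to be cheaper.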