# 2b. Spatial features

The focus now shifts to ridership at the bus route and "L" station level.
Since routes and stations exist in physical space, the spatial features surrounding them may affect their individual ridership.
In particular, we are interested in the characteristics of areas where transit ridership has recovered closest to pre-pandemic levels.

To that end, we select demographic characteristics of Chicago's 77 community areas and make a two-period cross-section to compare demographics before and after COVID.
We seek to identify how spatial heterogeneity in these characteristics relates to differences in how fully individual routes and stations have recovered their pre-pandemic ridership.

In [None]:
from dotenv import load_dotenv
load_dotenv()

import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from final_project.config import FEATURES_DIR, RAW_DIR
from final_project.data import acs
from final_project.features import spatial

In [None]:
sns.set_theme(style='white', palette='Set1')

## Getting ACS data

Demographic data is queried from the American Community Survey (ACS) 5-year estimates.
We use the 2018 release that covers 2014–18 for the pre-COVID period and the 2023 release that covers 2019–23 for the post-COVID period.

The 2023 release is the most recent, but note that it includes one year of pre-COVID data (2019), which may bias estimates of post-COVID demographics.
The 2024 release would be more appropriate for post-COVID estimates because it begins in 2020, but it does not come out until January 2026.

We select Census variables that summarize the following features of a community area:

- Age
- Race
- Socioeconomic status
    - Mean household income
    - Poverty rate
    - Educational attainment
- Work commuting behavior
    - Households with no vehicle
    - Share of workers who commute to work
    - Share of workers who work from home
- Built environment
    - Housing unit stock
    - Share of housing that is single-family housing
    - Share of housing that is multi-unit/family housing

In [None]:
# ACS 5-year estimates for 2014-18.
acs_18_df = acs.get_acs_data(year=2018)
acs_18_df = acs.clean_acs_data(acs_18_df)

# ACS 5-year estimates for 2019-23.
acs_23_df = acs.get_acs_data(year=2023)
acs_23_df = acs.clean_acs_data(acs_23_df)

acs_18_df.head()

These are tract-level aggregates.
For each variable of interest, the count in each category is reported alongside the total sample size for that variable.

## Merging with census tract boundaries

The most straightforward way to get data for all census tracts in Chicago is to query the Census API for all of Cook County in one request.
Chicago is entirely within Cook County (except for a tiny portion of O'Hare that is, for practical purposes, negligible), but the county also includes suburbs such as Evanston, Skokie, and Cicero.

The GEOID is the Census Bureau's unique identifier for every geographic census unit.
Chicago's open data portal provides the GEOIDs and geometries of all census tracts in the city.
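
For census tracts, the GEOID is an 11-digit string: a 2-digit state FIPS code, a 3-digit county code, and a 6-digit tract code. The sketch below shows that layout and is not part of the project's `spatial` or `acs` modules; `parse_tract_geoid` and `is_cook_county` are hypothetical helper names, and the sample GEOID is included only to illustrate the format.

```python
from typing import NamedTuple


class TractGeoid(NamedTuple):
    state: str   # 2-digit state FIPS code
    county: str  # 3-digit county FIPS code
    tract: str   # 6-digit census tract code


def parse_tract_geoid(geoid: str) -> TractGeoid:
    """Split an 11-digit census tract GEOID into its components."""
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f"expected an 11-digit tract GEOID, got {geoid!r}")
    return TractGeoid(state=geoid[:2], county=geoid[2:5], tract=geoid[5:])


def is_cook_county(geoid: str) -> bool:
    """Cook County, Illinois has state code 17 and county code 031."""
    parts = parse_tract_geoid(geoid)
    return parts.state == "17" and parts.county == "031"


# Illustrative GEOID only; it is not taken from the notebook's data.
example = parse_tract_geoid("17031839100")
print(example, is_cook_county("17031839100"))
```

GEOIDs are best kept as strings rather than integers: a merge on GEOID only matches when both tables store the key with the same type and the same zero-padding.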

In [None]:
ct_gdf = spatial.load_ct_boundaries(RAW_DIR / 'census_tract_boundaries.geojson')
ct_gdf.head()

In [None]:
_, ax = plt.subplots(figsize=(8, 8))

ct_gdf.plot(ax=ax)

ax.set_title("Census tract boundaries in Chicago")
ax.set_xticklabels([])
ax.set_yticklabels([])

plt.tight_layout()
plt.show()

Merge the geometric and demographic data on GEOID.
Because this is a left join onto the Chicago tract boundaries, any Cook County tract outside Chicago is dropped.

In [None]:
ct_18_gdf = pd.merge(ct_gdf, acs_18_df, how='left', on='geoid')
ct_23_gdf = pd.merge(ct_gdf, acs_23_df, how='left', on='geoid')

print("Number of census tracts in Chicago:", len(ct_18_gdf))
ct_18_gdf.head()

We now have demographic data assigned to each of the 801 census tracts in the city.

## Spatial joining into community areas

Chicago's community areas are defined for planning purposes, and their boundaries have been stable for decades.
However, these boundaries were drawn from historical ethnic and economic divisions within the city, so they do not align perfectly with modern census tract boundaries.

In [None]:
ca_gdf = spatial.load_ca_boundaries(RAW_DIR / 'community_area_boundaries.geojson')
ca_gdf.head()

In [None]:
_, ax = plt.subplots(figsize=(8, 8))

ca_gdf.plot(ax=ax)

ax.set_title("Community area boundaries in Chicago")
ax.set_xticklabels([])
ax.set_yticklabels([])

plt.tight_layout()
plt.show()

We use areal-weighted aggregation of the tract-level data to impute estimates of community area-level demographics.
For every variable, a census tract's contribution to a community area is its value for that variable weighted by the proportion of the tract's area that lies within the community area.
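
The weighting rule can be sketched with toy geometry. This is not the implementation of `spatial.interpolate_ca_aggregates`, whose internals are not shown here; it is a minimal standard-library illustration that assumes axis-aligned rectangles in place of real polygons and assumes each count is spread uniformly over its tract. The names `Box`, `overlap_area`, and `areal_weighted_sum` are hypothetical.

```python
from typing import Dict, NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle standing in for a real polygon."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    @property
    def area(self) -> float:
        return (self.xmax - self.xmin) * (self.ymax - self.ymin)


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes (0 if they do not overlap)."""
    width = min(a.xmax, b.xmax) - max(a.xmin, b.xmin)
    height = min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
    return max(width, 0.0) * max(height, 0.0)


def areal_weighted_sum(
    tracts: Dict[str, Box],
    counts: Dict[str, float],
    areas: Dict[str, Box],
) -> Dict[str, float]:
    """Allocate each tract's count to areas in proportion to overlapping area."""
    totals = {}
    for area_id, area_box in areas.items():
        total = 0.0
        for tract_id, tract_box in tracts.items():
            # Share of the tract's own area that falls inside this area.
            weight = overlap_area(tract_box, area_box) / tract_box.area
            total += weight * counts[tract_id]
        totals[area_id] = total
    return totals


# Two toy tracts; the second one straddles the boundary between two areas.
tracts = {"t1": Box(0, 0, 2, 2), "t2": Box(2, 0, 4, 2)}
households = {"t1": 100, "t2": 300}
areas = {"west": Box(0, 0, 3, 2), "east": Box(3, 0, 4, 2)}

totals = areal_weighted_sum(tracts, households, areas)
print(totals)
```

Because the weights for each tract sum to one when the areas tile the tract completely, the total count is preserved. This scheme only makes sense for extensive variables such as counts; shares and rates should be derived after aggregation.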

In [None]:
ca_agg_18_gdf = spatial.interpolate_ca_aggregates(ct_18_gdf, ca_gdf)
ca_agg_23_gdf = spatial.interpolate_ca_aggregates(ct_23_gdf, ca_gdf)
ca_agg_18_gdf.head()

Now convert the extensive variables (counts) to intensive measures by computing ratios and shares.
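
The order of operations matters here: dividing aggregated counts gives a pooled share, which generally differs from averaging tract-level shares. The sketch below illustrates this with made-up numbers. It is not `acs.compute_ratios`; the `share` helper and the field names `households` and `households_no_vehicle` are hypothetical.

```python
def share(count: float, total: float) -> float:
    """Return count / total, or NaN when the denominator is not positive."""
    return count / total if total > 0 else float("nan")


# Made-up counts for two tracts that lie fully inside one community area.
tract_counts = [
    {"households": 100, "households_no_vehicle": 10},
    {"households": 300, "households_no_vehicle": 90},
]

# Aggregate the counts first, then divide (ratio of sums).
pooled_households = sum(t["households"] for t in tract_counts)
pooled_no_vehicle = sum(t["households_no_vehicle"] for t in tract_counts)
pooled_share = share(pooled_no_vehicle, pooled_households)

# Dividing first and then averaging weights both tracts equally,
# regardless of how many households each one contains.
mean_of_shares = sum(
    share(t["households_no_vehicle"], t["households"]) for t in tract_counts
) / len(tract_counts)

print(pooled_share, mean_of_shares)
```

The pooled share reflects every household in the community area equally, which is why the counts are aggregated before any ratio is taken.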

In [None]:
ca_18_gdf = acs.compute_ratios(ca_agg_18_gdf)
ca_23_gdf = acs.compute_ratios(ca_agg_23_gdf)
ca_18_gdf.head()

## Save data

Save these spatial feature matrices.

In [None]:
# Re-wrap as GeoDataFrames in case the previous step returned plain DataFrames.
ca_18_gdf = gpd.GeoDataFrame(ca_18_gdf, geometry='geometry')
ca_23_gdf = gpd.GeoDataFrame(ca_23_gdf, geometry='geometry')

# Write one feature matrix per period.
ca_18_gdf.to_file(FEATURES_DIR / 'X_ca_2018.geojson', driver='GeoJSON')
ca_23_gdf.to_file(FEATURES_DIR / 'X_ca_2023.geojson', driver='GeoJSON')