# 2b. Bus route analysis

We now examine how spatial heterogeneity relates to ridership recovery by bus route.
This is a two-period cross-sectional regression where the unit of observation is a bus route before and after COVID.
The dependent variable is the ridership recovery ratio, and the covariates are:

- Pre-COVID community area demographics weighted by the proportion of the route passing through that area.
- The change in post-COVID demographics, similarly weighted.
- Route-specific features (total length, number of stops, number of nearby "L" stations, and whether the route serves downtown).

In [None]:
from dotenv import load_dotenv
load_dotenv()

import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from mpl_toolkits.axes_grid1 import make_axes_locatable
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from final_project.config import CA_TO_REGION, FEATURES_DIR, RAW_DIR
from final_project.data import acs, cta
from final_project.features import spatial
from final_project.utils import read_geojson, save_figure

In [None]:
sns.set_theme(style='white', palette='Set1')

## Selecting routes

The ridership data go back to 2001, and several bus routes have been discontinued since then.
First, we group the bus ridership data by route.

In [None]:
# Load bus ridership and group by route.
bus_df = cta.load_ridership(RAW_DIR / 'CTA_bus_routes_daily_ridership.csv')
bus_dict = cta.group_ridership(bus_df, by='route')

# Load bus route geometries.
all_routes_gdf = read_geojson(RAW_DIR / 'CTA_bus_routes.geojson')
all_routes_gdf.head()

We define two periods of equal length to be able to compare two stable operating regimes:

- Pre-COVID: Jan 2018 &ndash; Dec 2019
- Post-COVID: Jan 2023 &ndash; Dec 2024

These windows capture the most recent and relevant part of the pre-COVID trend, avoid the pandemic noise of 2020&ndash;21, and avoid the ambiguity of 2022, when recovery was still partial and pandemic restrictions were only being lifted.
We select only routes that were operational over both periods.

In [None]:
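# Pre-COVID: every day from Jan 1, 2018 through Dec 31, 2019.
# A month-only end such as '2019-12' resolves to Dec 1 and would cut the window short.
pre_start = '2018-01-01'
pre_end = '2019-12-31'
pre_dates = pd.date_range(start=pre_start, end=pre_end, freq='D')

# Post-COVID: every day from Jan 1, 2023 through Dec 31, 2024.
post_start = '2023-01-01'
post_end = '2024-12-31'
post_dates = pd.date_range(start=post_start, end=post_end, freq='D')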

routes_dict = {
    route: data for route, data in bus_dict.items()
    if pre_dates.isin(data.index).all() and post_dates.isin(data.index).all()
}
print("Number of bus routes:", len(routes_dict))
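
The filter keeps a route only if its ridership index contains every calendar day in both windows. A minimal sketch of that check on toy data follows; `covers_window` and the toy indices are hypothetical names for illustration, not part of the project code.

```python
import pandas as pd


def covers_window(index: pd.DatetimeIndex, start: str, end: str) -> bool:
    """Return True if `index` contains every calendar day from start to end."""
    window = pd.date_range(start=start, end=end, freq='D')
    return bool(window.isin(index).all())


# A route with complete daily records over 2018-19.
full = pd.date_range('2018-01-01', '2019-12-31', freq='D')

# A hypothetical route that is missing all of June 2018.
gappy = full.difference(pd.date_range('2018-06-01', '2018-06-30', freq='D'))

full_ok = covers_window(full, '2018-01-01', '2019-12-31')
gappy_ok = covers_window(gappy, '2018-01-01', '2019-12-31')
print(full_ok, gappy_ok)
```

Under this rule a route with even one missing day is dropped, so a route without daily service would be excluded unless the ridership table records it with zero rides on those days. Note also that pandas resolves a month-only string such as `'2019-12'` to the first day of that month, which is why the sketch spells out full end dates.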

Get the geometries of these routes.

In [None]:
routes_gdf = all_routes_gdf[all_routes_gdf['route'].isin(routes_dict.keys())]
routes_gdf = routes_gdf[['route', 'name', 'geometry']]
routes_gdf.head()

## Ridership recovery

The dependent variable is the ratio of post-COVID ridership to pre-COVID ridership.

In [None]:
y_df = pd.DataFrame([{
    'route': route,
    'pre_ridership': data.loc[pre_dates, 'rides'].sum(),
    'post_ridership': data.loc[post_dates, 'rides'].sum()
} for route, data in routes_dict.items()])
y_df['recovery_ratio'] = y_df['post_ridership'] / y_df['pre_ridership']
y_df = y_df.set_index('route')
y = y_df['recovery_ratio']
y 

In [None]:
fig, ax = plt.subplots()

sns.histplot(y, ax=ax)

ax.set_title("Histogram of bus route recovery ratios")
ax.set_xlabel("Ratio of post-COVID (2023-24) to pre-COVID (2018-19) ridership")

plt.tight_layout()
plt.show()

In [None]:
save_figure(fig, 'recovery_ratios_histogram')

Although the recovery ratios are close to normally distributed, they are unevenly distributed across the city, as the route map below shows.

In [None]:
# Clip routes at the city limits.
chi_gdf = read_geojson(RAW_DIR / 'city_boundaries.geojson')
y_routes_gdf = routes_gdf.merge(y, on='route')
y_routes_gdf = y_routes_gdf.overlay(chi_gdf, how='intersection')

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))

divider = make_axes_locatable(ax)
cax = divider.append_axes('bottom', size='2%', pad=0.1)

chi_gdf.plot(facecolor='none', edgecolor='black', ax=ax)
y_routes_gdf.plot(
    column='recovery_ratio',
    cmap='RdBu',
    linewidth=2.5,
    legend=True,
    legend_kwds={'orientation': 'horizontal', 'label': 'Recovery ratio'},
    ax=ax,
    cax=cax
)

ax.set_title("Spatial heterogeneity of bus route ridership recovery")
ax.set_axis_off()

plt.tight_layout()
plt.show()

In [None]:
save_figure(fig, 'bus_route_recovery_ratios')

## Demographic features

We select six core demographic dimensions that we expect to affect bus ridership recovery:

1. Work-from-home exposure
2. Socioeconomic status
3. Transit dependence
4. Labor market attachment
5. Urban form
6. Demographic lifecycle

In [None]:
X_ca_agg_18 = read_geojson(FEATURES_DIR / 'X_ca_agg_2018.geojson')
X_ca_agg_23 = read_geojson(FEATURES_DIR / 'X_ca_agg_2023.geojson')

### Socioeconomic status index

We build a socioeconomic status (SES) index from the first principal component (PC) of the following measurements:

- (Log) average household income
- Poverty rate
- Undergraduate degree rate
- Graduate degree rate

Because PC loadings are data-specific, separate decompositions of the 2018 and 2023 data would not share a common scale, and scores from one could not be compared with scores from the other.
The index must therefore be constructed from a PCA of the pooled data.

In [None]:
X_ca_18 = acs.compute_ratios(X_ca_agg_18)
X_ca_23 = acs.compute_ratios(X_ca_agg_23)
X_ca_18['year'] = 2018
X_ca_23['year'] = 2023
X = pd.concat([X_ca_18, X_ca_23])

# Select the features for the index.
X = X[['avg_hh_income', 'poverty', 'has_undergrad_degree', 'has_grad_degree']].copy()
X = X.assign(log_income=np.log(X['avg_hh_income']))
X = X.drop(columns='avg_hh_income')

# PCA of standardized features.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

Principal components are linear combinations of the input features that capture successive, mutually orthogonal directions of maximal variance.
The first principal component explains almost 80% of the total variance of the standardized income, poverty, and higher educational attainment measures.
We use it as a simple but effective SES index.

In [None]:
print("Ratio of variance explained by each PC:")
print(pca.explained_variance_ratio_)
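
To make the construction concrete, here is a minimal sketch of a first-PC index on synthetic data. It uses NumPy's SVD in place of scikit-learn's `PCA`, and the synthetic features, the latent variable, and the sign convention are all assumptions for illustration. The sign step matters because the direction of a principal component is arbitrary: without it, a higher index could just as well mean lower SES.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for the four inputs, driven by one latent "status" factor.
latent = rng.normal(size=n)
features = np.column_stack([
    latent + 0.3 * rng.normal(size=n),   # log income rises with status
    -latent + 0.3 * rng.normal(size=n),  # poverty falls with status
    latent + 0.5 * rng.normal(size=n),   # undergraduate degree rate
    latent + 0.5 * rng.normal(size=n),   # graduate degree rate
])

# Standardize columns (population standard deviation, as StandardScaler does).
scaled = (features - features.mean(axis=0)) / features.std(axis=0)

# PCA via SVD: rows of `components` are the principal directions.
_, singular_values, components = np.linalg.svd(scaled, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)

# Project onto the first component and orient it so that a higher
# score goes with higher income (assumed convention).
pc1 = scaled @ components[0]
if components[0, 0] < 0:
    pc1 = -pc1

# Standardize the score so that one unit is one standard deviation.
index = (pc1 - pc1.mean()) / pc1.std()
print(explained.round(3))
```

The same orientation check could be applied to `pca.components_[0]` in the notebook before interpreting the sign of the SES coefficient.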

The upper half of the score matrix holds the projections of the 2018 data, and the lower half holds the projections of the 2023 data.
We standardize the first PC to make the index more interpretable.

In [None]:
ses = (StandardScaler()
    .fit_transform(X_pca[:, 0].reshape(-1, 1))
    .reshape(-1))
ses_18 = pd.Series(ses[:len(ses)//2], name='ses')
ses_23 = pd.Series(ses[len(ses)//2:], name='ses')

### Route-weighted assignment

Demographic features are recorded at the community area level.
Since we do not have microdata about who rides on each route or where, we assume that the distance a route passes through a community area can proxy for an area's population exposure to that route.

To that end, a bus route is assigned a community area's demographic feature weighted by the proportion of the route's total length that intersects that community area.
This includes the SES index.
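
The weighting can be sketched on a toy table as follows. This is an illustration of the idea only, not the implementation of `spatial.route_weighted_demographics`; the `segments` table of per-area intersection lengths, the route labels, and the `wfh` values are hypothetical.

```python
import pandas as pd

# Hypothetical lengths (in metres) of each route inside each community area.
segments = pd.DataFrame({
    'route': ['A', 'A', 'B'],
    'ca': [1, 2, 2],
    'length_m': [3000.0, 1000.0, 2000.0],
})

# Hypothetical community-area feature: share of workers who work from home.
demographics = pd.DataFrame(
    {'wfh': [0.10, 0.30]},
    index=pd.Index([1, 2], name='ca'),
)

# Weight = share of the route's total length that lies in each area.
route_totals = segments.groupby('route')['length_m'].transform('sum')
segments['weight'] = segments['length_m'] / route_totals

# Attach the area feature to each segment and take the weighted sum per route.
merged = segments.join(demographics, on='ca')
route_wfh = (merged['weight'] * merged['wfh']).groupby(merged['route']).sum()
print(route_wfh)
```

Route A spends three quarters of its length in area 1 and one quarter in area 2, so its value is 0.75 × 0.10 + 0.25 × 0.30 = 0.15; route B lies entirely in area 2 and inherits 0.30.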

In [None]:
X_ca_agg_18['ses'] = ses_18
X_ca_agg_23['ses'] = ses_23

X_18 = spatial.route_weighted_demographics(routes_gdf, X_ca_agg_18)
X_23 = spatial.route_weighted_demographics(routes_gdf, X_ca_agg_23)

### Demographic change

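Demographics in 2018 are taken as the baseline.
For 2023, we regress on changes from those baseline levels: log differences for population, housing units, and average household income, and simple differences for everything else.
Note that this reflects changes _after_ COVID, not necessarily changes due to COVID.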

In [None]:
dX_23 = pd.DataFrame(columns=X_23.columns, index=X_23.index)
for col in X_23.columns:
    if col in ['population_total', 'housing_total', 'avg_hh_income']:
        dX_23[col] = np.log(X_23[col]) - np.log(X_18[col])
    else:
        dX_23[col] = X_23[col] - X_18[col]
dX_23.head()
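
The loop above takes log differences for the level variables. As a reminder of why that scale is convenient, a log difference is the log of the ratio, and it approximates the proportional change when the change is small:

$$
\log x_{2023} - \log x_{2018} = \log \frac{x_{2023}}{x_{2018}} \approx \frac{x_{2023} - x_{2018}}{x_{2018}}
$$

For example, growth from 100 to 105 gives a log difference of about 0.0488, close to the 5% proportional change. Shares and rates are already on a comparable scale, so they are differenced directly.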

### Mechanism proxies

For a parsimonious model, we will select or create a proxy for each of the aforementioned demographic dimensions.

1. **Work-from-home exposure** is directly coded as the proportion of workers who work from home.
2. We have an **SES** index.
3. **Transit dependence** is measured by the proportion of workers who commute via public transportation.
4. **Labor market attachment** is proxied by unemployment.
5. For **urban form**, we emphasize residential density and proxy for it with the share of the housing stock that is **multifamily housing**.
6. Both young adults and senior citizens are more likely to take public transit, but we choose **young adults** to represent the demographic lifecycle because younger riders are more likely to rely on transit to commute to work or school, and are therefore more regular riders.

In [None]:
# Create the final demographic feature matrices.
proxy_cols = ['wfh', 'ses', 'commuter', 'unemployment', 'mf_home', 'young_adult']
X = X_18[proxy_cols]
dX = dX_23[proxy_cols]

## Route-specific features

We also consider the fixed characteristics of a bus route, which do not change over time&mdash;or, at least, are not expected to change in five years.
They are:

- Total route length
- Number of stops
- Number of nearby "L" stations
- Whether the route passes through downtown

Total route length is computed directly from the geometry.

In [None]:
total_length = pd.Series(routes_gdf.geometry.length.values, index=routes_gdf['route'])

Counting the number of stops that service each route is straightforward from the provided CTA data.

In [None]:
stops_gdf = read_geojson(RAW_DIR / 'CTA_bus_stops.geojson')

# Explode a comma-separated string of routes to a list of routes.
routes_serviced = stops_gdf['routesstpg'].dropna()
route_lists = routes_serviced.str.split(',').apply(lambda x: [route.strip() for route in x])

# Count how many stops serve our routes.
stop_counts = {route: 0 for route in routes_dict.keys()}
for route_list in route_lists:
    for route in route_list:
        if route in stop_counts:
            stop_counts[route] += 1
stop_counts = pd.Series(stop_counts, name='stop_count')
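
The same count can also be written with pandas string methods. The sketch below runs on a toy `routesstpg` column; the stop strings and route labels are made up for illustration.

```python
import pandas as pd

# Toy version of the comma-separated routes column, including a missing value.
stops = pd.Series(['9, 49', '49', None, 'X9,9'], name='routesstpg')
routes_of_interest = ['9', '49', '66']

counts = (stops
    .dropna()
    .str.split(',')    # one list of route labels per stop
    .explode()         # one row per (stop, route) pair
    .str.strip()
    .value_counts()
    .reindex(routes_of_interest, fill_value=0))  # keep our routes, zero if unseen
print(counts)
```

As in the loop above, labels are matched exactly after stripping whitespace, so `'X9'` is not counted toward route `'9'`.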

Some bus routes directly feed "L" stations, but this information is messy and inconsistent in the data.
Instead, we create a 100 m buffer around each route and count the "L" stations that lie within it.

In [None]:
stations_gdf = read_geojson(RAW_DIR / 'CTA_L_stations.geojson')

# Take a 100 m buffer around the length of the route.
routes_buffered_gdf = routes_gdf.copy()
routes_buffered_gdf['geometry'] = routes_buffered_gdf.geometry.buffer(100)

# Spatial join the stations within the buffers and count.
joined = gpd.sjoin(
    stations_gdf,
    routes_buffered_gdf,
    how='left',
    predicate='within')
station_counts = (joined
    .groupby('route')
    .size()
    .rename('num_L_stations')
    .reset_index())
station_counts = station_counts.set_index('route')['num_L_stations']
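
Geometrically, a station falls inside a route's 100 m buffer when its distance to the route polyline is at most about 100 m. The sketch below makes that test explicit in plain Python instead of GeoPandas; the helper names and the coordinates (in metres, in an assumed projected coordinate system) are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def stations_near_route(route, stations, radius=100.0):
    """Count stations whose distance to the route polyline is at most radius."""
    count = 0
    for station in stations:
        nearest = min(
            point_segment_distance(station, a, b)
            for a, b in zip(route, route[1:])
        )
        if nearest <= radius:
            count += 1
    return count


# An L-shaped toy route and four toy stations.
route = [(0.0, 0.0), (1000.0, 0.0), (1000.0, 500.0)]
stations = [(500.0, 80.0), (500.0, 150.0), (1090.0, 250.0), (1000.0, 620.0)]
n_near = stations_near_route(route, stations)
print(n_near)
```

The buffer-and-join approach in the notebook does the same job for all routes at once. Both only make sense when the coordinates are in a projected system measured in metres.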

Downtown service is proxied by whether the route passes through the region of Central Chicago, which comprises the Loop, the Near North Side, and the Near South Side.

In [None]:
ca_gdf = spatial.load_ca_boundaries(RAW_DIR / 'community_area_boundaries.geojson')
central_chi_gdf = ca_gdf[ca_gdf['ca_number'].apply(lambda x: CA_TO_REGION[x]) == 'CENTRAL']
central_chi_gdf = central_chi_gdf.dissolve()

central_chi = routes_gdf.geometry.intersects(central_chi_gdf.geometry.iloc[0])
central_chi = central_chi.astype(int)
central_chi.index = routes_gdf['route']

We now have our final route characteristics matrix.

In [None]:
# Create the final route-characteristics matrix.
Z = pd.DataFrame({
    'total_length': total_length,
    'stop_count': stop_counts,
    'station_count': station_counts,
    'central_chicago': central_chi
})
Z = Z.fillna(0)
Z.head()

## Regression

The linear regression model, with error term $\varepsilon$, is specified as

$$
r = \beta_0 + X \beta_1 + \Delta X \beta_2 + Z \gamma + \varepsilon
$$

where $r$ is the ridership recovery ratio, $X$ is a matrix of route-weighted community area demographics in 2018, $\Delta X$ is the change in those demographics from 2018 to 2023, and $Z$ is a matrix of fixed route characteristics.
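
A minimal sketch of this specification on synthetic data, using NumPy's least-squares solver rather than statsmodels; the dimensions, coefficients, and noise level are made up for illustration. It shows how the blocks $X$, $\Delta X$, and $Z$ are stacked side by side with a constant column into one design matrix.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 120  # hypothetical number of routes

# Synthetic blocks: baseline demographics, their changes, route characteristics.
X = rng.normal(size=(n, 2))
dX = rng.normal(size=(n, 2))
Z = rng.normal(size=(n, 1))

# Block design matrix with a leading constant column.
design = np.column_stack([np.ones(n), X, dX, Z])

# Made-up coefficients: intercept, two for X, two for dX, one for Z.
true_coefs = np.array([0.8, 0.05, -0.02, 0.03, 0.0, 0.01])
r = design @ true_coefs + 0.01 * rng.normal(size=n)

# Ordinary least squares estimate of the stacked coefficient vector.
coefs, _, rank, _ = np.linalg.lstsq(design, r, rcond=None)
print(coefs.round(3))
```

The point estimates match what an OLS fit in statsmodels would return for the same data. statsmodels additionally reports the standard errors and tests that the discussion of significance relies on.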

In [None]:
dX.columns = [f'change_{col}' for col in dX.columns]

# Create a block feature matrix from X, dX, and Z.
data = pd.concat([X, dX, Z], axis=1)
data = sm.add_constant(data)
data.head()

The OLS estimates are as follows.

In [None]:
linreg = sm.OLS(y, data).fit()
linreg.summary()

None of the estimated coefficients are significant at any conventional level.
What does that mean for ridership recovery?

This is as far as the project could go, but a null result still points somewhere interesting.
The selected demographic characteristics, all standard sources of spatial heterogeneity in urban economics, do not meaningfully explain post-COVID bus ridership recovery in this specification.
That does not rule them out: the usual characteristics, whether considered in this project or not, may still help explain the CTA's recovery successes and shortcomings.

Further research should explore possible structural changes that are consistent with the observed return to the pre-COVID temporal structure.