# 0. EDA

We first explore the daily bus and "L" ridership data provided by the CTA.
There are three things we anticipate seeing:

- The weekly and monthly seasonality in transit usage.
- The long-run trends in bus and "L" ridership.
- The structural break in both systems due to COVID.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from statsmodels.tsa.seasonal import MSTL
from final_project.config import RAW_DIR, RIDERSHIP_DIR
from final_project.data import cta
from final_project.utils import save_figure

In [None]:
sns.set_theme(style='white', palette='Set1')

## Aggregate system profiles

We aggregate the data systemwide and analyze total bus and "L" ridership.
Right now, we want to understand the time-varying behavior of ridership at a macro level.
We will visit individual routes and stations later.
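
As a rough illustration of what systemwide aggregation amounts to, the sketch below sums a toy long-format table over routes to get one total per day. The column names (`date`, `route`, `rides`) and the numbers are assumptions made up for this example; this is not the implementation of `cta.aggregate_ridership`.

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, route).
# Column names and values are illustrative, not the actual CTA schema.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01', '2024-01-01',
                            '2024-01-02', '2024-01-02']),
    'route': ['9', '66', '9', '66'],
    'rides': [1000, 2000, 1500, 2500],
})

# Sum over routes to get one systemwide total per day.
toy_total = toy.groupby('date')['rides'].sum()
print(toy_total)
```

The result is a date-indexed series, which is the shape the resampling and slicing steps later in this notebook rely on.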

In [None]:
# Bus ridership.
bus_df = cta.load_ridership(RAW_DIR / 'CTA_bus_routes_daily_ridership.csv')
bus_total = cta.aggregate_ridership(bus_df)
bus_total.head()

In [None]:
# "L" ridership.
L_df = cta.load_ridership(RAW_DIR / 'CTA_L_station_daily_entries.csv')
L_total = cta.aggregate_ridership(L_df)
L_total.head()

Overall trends are more easily seen by plotting annual totals over the years.
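
A minimal sketch of the annual aggregation on a toy series, assuming daily data whose final calendar year is only partially observed. The series here is a placeholder of ones, so each annual total is just a count of days.

```python
import numpy as np
import pandas as pd

# Toy daily series: two full years plus a partial third year (assumed data).
idx = pd.date_range('2022-01-01', '2024-03-31', freq='D')
toy = pd.Series(np.ones(len(idx)), index=idx)

# 'YS' bins by calendar year and labels each bin with the year start.
annual = toy.resample('YS').sum()

# The last bin covers only January through March, so drop it before plotting.
annual_complete = annual.iloc[:-1]
print(annual_complete)
```

Leaving the partial year in would draw a spurious drop at the right edge of the annual plot.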


In [None]:
fig, ax = plt.subplots()

# Drop the final year, which is only partially observed.
sns.lineplot(data=bus_total.resample('YS').sum().iloc[:-1] / 1e6,
             label='Bus ridership',
             ax=ax)
sns.lineplot(data=L_total.resample('YS').sum().iloc[:-1] / 1e6,
             label='"L" ridership',
             ax=ax)

ax.set_title("CTA annual systemwide ridership, 2001-24")
ax.set_xlabel("")
ax.set_ylabel("Ridership (millions)")

plt.tight_layout()
plt.show()

In [None]:
save_figure(fig, 'cta_annual_ridership')

Transit usage in general trended downward in the years leading up to 2020, when ridership plummeted at the onset of the pandemic, as expected.
The post-2020 trajectories are interesting; it appears that bus ridership has recovered faster than "L" ridership has.

In [None]:
# Save the data.
y = pd.DataFrame({
    'bus': bus_total,
    'L': L_total
})
y.to_csv(RIDERSHIP_DIR / 'y.csv')
y.head()

## Snapshots of seasonality

We can also isolate one month of the daily series and two years of monthly aggregates to better see the seasonality.

In [None]:
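def daily(x, start_date, end_date):
    """Slice a daily series by date; drop the index so windows align by position."""
    return x[start_date:end_date].reset_index(drop=True)


def monthly(x, start_year, end_year, how='mean'):
    """Aggregate daily values by calendar month over start_year..end_year."""
    start_year = str(start_year)
    end_year = str(end_year)
    return (x[start_year:end_year]
            .resample('MS')
            .agg(how)
            .reset_index(drop=True))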

This is not a formal seasonal decomposition&mdash;that comes in the next section&mdash;but the snapshots illustrate the weekday/weekend and month-to-month seasonalities that are typical in transit ridership data.

Right now, we are interested in any structural changes to the seasonal components of the post-pandemic series.
For a fair comparison of daily ridership before and after COVID, we look at the same 28-day window, starting on the same day of the week, in 2019 and in 2024.
For a fair comparison of monthly levels, we look at two 24-month windows, each starting in January: 2018&ndash;19 and 2023&ndash;24.
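
As a quick sanity check on the daily windows (2019-09-30 to 2019-10-27 and 2024-09-30 to 2024-10-27), the sketch below confirms that each starts on the same weekday and spans 28 days. The `windows` dictionary is a helper introduced only for this check.

```python
import pandas as pd

# The two windows used for the daily comparison.
windows = {
    2019: ('2019-09-30', '2019-10-27'),
    2024: ('2024-09-30', '2024-10-27'),
}

for year, (start, end) in windows.items():
    days = pd.date_range(start, end, freq='D')
    # Each window should start on the same weekday and span exactly 28 days.
    print(year, days[0].day_name(), len(days))
```

Both windows start on a Monday and contain 28 days, so position 1 in one window lines up with position 1 in the other.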

In [None]:
# These start/end dates correspond to the same 28-day period in 2019 and 2024.
start_2019 = '2019-09-30'
end_2019 = '2019-10-27'
start_2024 = '2024-09-30'
end_2024 = '2024-10-27'

# Day-to-day comparison, Oct 2019 vs. Oct 2024.
daily_comparison = pd.DataFrame({
    'bus_oct_2019': daily(bus_total, start_2019, end_2019),
    'bus_oct_2024': daily(bus_total, start_2024, end_2024),
    'L_oct_2019': daily(L_total, start_2019, end_2019),
    'L_oct_2024': daily(L_total, start_2024, end_2024)
})
daily_comparison.index = range(1, len(daily_comparison) + 1)

# Month-to-month comparison, 2018-19 vs. 2023-24.
monthly_comparison = pd.DataFrame({
    'bus_2018_2019': monthly(bus_total, 2018, 2019, how='median'),
    'bus_2023_2024': monthly(bus_total, 2023, 2024, how='median'),
    'L_2018_2019': monthly(L_total, 2018, 2019, how='median'),
    'L_2023_2024': monthly(L_total, 2023, 2024, how='median')
})
monthly_comparison.index = range(1, len(monthly_comparison) + 1)

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(12, 4))

# Bus.
sns.lineplot(daily_comparison['bus_oct_2019'] / 1e5,
             label='Oct 2019',
             ax=axes[0])
sns.lineplot(daily_comparison['bus_oct_2024'] / 1e5,
             label='Oct 2024',
             ax=axes[0])
axes[0].set_title("Bus ridership")
axes[0].set_xlabel("Days into a 28-day cycle, starting Monday")
axes[0].set_ylabel("Ridership (100s of thousands)")

# L.
sns.lineplot(daily_comparison['L_oct_2019'] / 1e5,
             label='Oct 2019',
             ax=axes[1])
sns.lineplot(daily_comparison['L_oct_2024'] / 1e5,
             label='Oct 2024',
             ax=axes[1])
axes[1].set_title('"L" ridership')
axes[1].set_xlabel("Days into a 28-day cycle, starting Monday")
axes[1].set_ylabel("Ridership (100s of thousands)")

fig.suptitle("Monthly snapshot of daily CTA ridership, pre- and post-COVID")
plt.tight_layout()
plt.show()

In [None]:
save_figure(fig, 'cta_monthly_snapshot')

The weekday/weekend pattern is clear, and weekend ridership is notably close to pre-pandemic levels.
The heightened "L" ridership in the middle of this cycle coincides with the Columbus Day weekend (Columbus Day is a federal holiday).

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(12, 4))

# Bus.
sns.lineplot(monthly_comparison['bus_2018_2019'] / 1e5,
             label='2018-19',
             ax=axes[0])
sns.lineplot(monthly_comparison['bus_2023_2024'] / 1e5,
             label='2023-24',
             ax=axes[0])
axes[0].set_title("Bus ridership")
axes[0].set_xlabel("Months into a 24-month cycle, starting January")
axes[0].set_ylabel("Ridership (100s of thousands)")

# L.
sns.lineplot(monthly_comparison['L_2018_2019'] / 1e5,
             label='2018-19',
             ax=axes[1])
sns.lineplot(monthly_comparison['L_2023_2024'] / 1e5,
             label='2023-24',
             ax=axes[1])
axes[1].set_title('"L" ridership')
axes[1].set_xlabel("Months into a 24-month cycle, starting January")
axes[1].set_ylabel("Ridership (100s of thousands)")

fig.suptitle("Month-to-month median daily CTA ridership, pre- and post-COVID")
plt.tight_layout()
plt.show()

In [None]:
save_figure(fig, 'cta_monthly_comparison')

Here, we plot the median daily ridership in each month to visualize the month-to-month seasonality.
Taking the mean reveals the same pattern, but the pattern is smoother and easier to see with the median.
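
One plausible reason the median gives a cleaner picture is that it is less sensitive to isolated low-ridership days such as holidays. The sketch below illustrates this on a made-up month of data; the ridership numbers are assumptions, not CTA values.

```python
import pandas as pd

# Toy month of daily ridership (assumed numbers): steady days plus one
# holiday on which ridership collapses.
idx = pd.date_range('2024-07-01', '2024-07-31', freq='D')
toy = pd.Series(100_000.0, index=idx)
toy.loc[pd.Timestamp('2024-07-04')] = 20_000.0

# Same monthly resampling as in the notebook, with both aggregations.
monthly_stats = toy.resample('MS').agg(['mean', 'median'])
print(monthly_stats)
```

The median stays at the typical daily level, while the single holiday pulls the mean below it.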

Bus and "L" ridership appear to follow the same monthly pattern, but bus ridership is more volatile than "L" ridership.
Both series start in January, so we see ridership increase as the weather warms up, dip at the end of the school year, and increase again over the summer.
It then decreases as fall turns to winter and the holidays begin.

## MSTL model

Now we formalize our findings with Multiple Seasonal-Trend decomposition using LOESS (MSTL).
We break the date index at the date on which the World Health Organization declared COVID a global pandemic: March 11, 2020.
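
A small sketch of the split on a toy series, assuming a daily `DatetimeIndex`. Label-based slices in pandas include both endpoints, so ending the first segment on March 10 and starting the second on March 11 yields two segments with no shared dates.

```python
import pandas as pd

# Toy daily series spanning the break date (values are placeholders).
idx = pd.date_range('2020-03-01', '2020-03-20', freq='D')
toy = pd.Series(range(len(idx)), index=idx)

# Date-label slices are inclusive at both ends.
pre = toy['2020-03-01':'2020-03-10']
post = toy['2020-03-11':'2020-03-20']

# Check that the two segments cover the series without overlapping.
print(pre.index.max(), post.index.min(), len(pre) + len(post) == len(toy))
```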

In [None]:
# Pre-COVID.
pre_start = '2001-01-01'
pre_end = '2020-03-10'
# Post-COVID.
post_start = '2020-03-11'
post_end = '2025-09-30'

# Pre- and post-COVID bus splits.
y_bus_pre = y['bus'][pre_start:pre_end]
y_bus_post = y['bus'][post_start:post_end]
# Pre- and post-COVID L splits.
y_L_pre = y['L'][pre_start:pre_end]
y_L_post = y['L'][post_start:post_end]

Since the data frequency is daily, we set the seasonal periods to 7 and 365 days to capture weekly (weekday/weekend) and annual (month-of-the-year) effects.

In [None]:
def mstl_daily_decomposition(y):
    """Fit a robust MSTL with weekly (7) and annual (365) seasonal periods."""
    mstl = MSTL(y, periods=(7, 365), stl_kwargs={'robust': True})
    return mstl.fit()

We decompose four series: daily bus ridership pre- and post-COVID, and the same for "L" ridership.

In [None]:
# Bus MSTL.
bus_pre = mstl_daily_decomposition(y_bus_pre)
bus_post = mstl_daily_decomposition(y_bus_post)
# L MSTL.
L_pre = mstl_daily_decomposition(y_L_pre)
L_post = mstl_daily_decomposition(y_L_post)

First, we present the trend components.
The trends suggest that bus ridership is recovering faster post-pandemic than "L" ridership is.
In contrast with the aggregate ridership plot, however, the secular decline in "L" ridership from 2015 to 2020 is less pronounced.

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(12, 4))

# Bus.
sns.lineplot(bus_pre.trend / 1e5, label='Pre-COVID', ax=axes[0])
sns.lineplot(bus_post.trend / 1e5, label='Post-COVID', ax=axes[0])
axes[0].set_title("Bus ridership")
axes[0].set_xlabel("")
axes[0].set_ylabel("Ridership (100s of thousands)")

# L.
sns.lineplot(L_pre.trend / 1e5, label='Pre-COVID', ax=axes[1])
sns.lineplot(L_post.trend / 1e5, label='Post-COVID', ax=axes[1])
axes[1].set_title('"L" ridership')
axes[1].set_xlabel("")
axes[1].set_ylabel("Ridership (100s of thousands)")

fig.suptitle("Pre- and post-COVID trends in CTA ridership")
plt.tight_layout()
plt.show()

In [None]:
save_figure(fig, 'mstl_trend')

Next, we present the raw seasonal components.
The decreased amplitudes clearly reflect the reduced ridership after COVID.

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, sharey=True, figsize=(12, 8))

# Weekly seasonality, bus.
sns.lineplot(bus_pre.seasonal['seasonal_7'] / 1e5, linewidth=0.1, ax=axes[0,0])
sns.lineplot(bus_post.seasonal['seasonal_7'] / 1e5, linewidth=0.1, ax=axes[0,0])
# Weekly seasonality, L.
sns.lineplot(L_pre.seasonal['seasonal_7'] / 1e5, linewidth=0.1, ax=axes[0,1])
sns.lineplot(L_post.seasonal['seasonal_7'] / 1e5, linewidth=0.1, ax=axes[0,1])
# Monthly seasonality, bus.
sns.lineplot(bus_pre.seasonal['seasonal_365'] / 1e5, linewidth=0.5, ax=axes[1,0])
sns.lineplot(bus_post.seasonal['seasonal_365'] / 1e5, linewidth=0.5, ax=axes[1,0])
# Monthly seasonality, L.
sns.lineplot(L_pre.seasonal['seasonal_365'] / 1e5, linewidth=0.5, ax=axes[1,1])
sns.lineplot(L_post.seasonal['seasonal_365'] / 1e5, linewidth=0.5, ax=axes[1,1])

for ax in axes.flatten():
    ax.set_xlabel("")
    ax.set_ylabel("Ridership (100s of thousands)")
axes[0,0].set_title("Bus, weekly seasonal component")
axes[0,1].set_title('"L", weekly seasonal component')
axes[1,0].set_title("Bus, monthly seasonal component")
axes[1,1].set_title('"L", monthly seasonal component')

fig.suptitle("Pre- and post-COVID seasonality in CTA ridership")
plt.tight_layout()
plt.show()

In [None]:
save_figure(fig, 'mstl_seasonal_amplitudes')

It is important to remember, however, that MSTL seasonal components are additive, so they are measured in ridership units.
The amplitudes are smaller because the trends in ridership fell sharply after COVID, not necessarily because the seasonal structure has changed.
Instead, we should look at how _relative_ seasonality might have changed by expressing each seasonal component as a percentage of the trend.
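
To make the distinction concrete, here is a toy sketch in which the weekly pattern is a fixed fraction of the trend level. The shape and the two trend levels are assumptions chosen for illustration, not estimates from the CTA data.

```python
import numpy as np

# Assumed weekly pattern, Monday through Sunday, as a fraction of trend.
weekly_shape = np.array([0.10, 0.10, 0.10, 0.10, 0.05, -0.20, -0.25])


def additive_seasonal(trend_level):
    # In an additive decomposition the seasonal term is in ridership units.
    return trend_level * weekly_shape


pre = additive_seasonal(1_000_000)   # higher trend level before the break
post = additive_seasonal(500_000)    # trend level halves after the break

# The absolute amplitude halves...
print(pre.max() - pre.min(), post.max() - post.min())
# ...but the seasonal component as a percent of trend is unchanged.
print(100 * pre / 1_000_000)
print(100 * post / 500_000)
```

In this toy case the raw amplitudes differ by a factor of two while the percent-of-trend profiles are identical, which is why the raw seasonal plots alone cannot show whether the seasonal structure changed.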

In [None]:
def percent_of_trend(trend, seasonal):
    """Express an additive seasonal component as a percentage of the trend."""
    return 100 * seasonal / trend

In [None]:
bus_pct = {
    'pre_7': percent_of_trend(bus_pre.trend, bus_pre.seasonal['seasonal_7']),
    'pre_365': percent_of_trend(bus_pre.trend, bus_pre.seasonal['seasonal_365']),
    'post_7': percent_of_trend(bus_post.trend, bus_post.seasonal['seasonal_7']),
    'post_365': percent_of_trend(bus_post.trend, bus_post.seasonal['seasonal_365'])
}
L_pct = {
    'pre_7': percent_of_trend(L_pre.trend, L_pre.seasonal['seasonal_7']),
    'pre_365': percent_of_trend(L_pre.trend, L_pre.seasonal['seasonal_365']),
    'post_7': percent_of_trend(L_post.trend, L_post.seasonal['seasonal_7']),
    'post_365': percent_of_trend(L_post.trend, L_post.seasonal['seasonal_365'])
}

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(12, 4))

# Weekly seasonality, bus.
sns.lineplot(bus_pct['pre_7'], linewidth=0.1, ax=axes[0])
sns.lineplot(bus_pct['post_7'], linewidth=0.1, ax=axes[0])
# Weekly seasonality, L.
sns.lineplot(L_pct['pre_7'], linewidth=0.1, ax=axes[1])
sns.lineplot(L_pct['post_7'], linewidth=0.1, ax=axes[1])

for ax in axes.flatten():
    ax.set_xlabel("")
    ax.set_ylabel("Percent of trend")
axes[0].set_title("Bus")
axes[1].set_title('"L"')

fig.suptitle("Pre- and post-COVID weekly seasonality (percent-of-trend)")
plt.tight_layout()
plt.show()

In [None]:
save_figure(fig, 'mstl_seasonal_weekly_pct')

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(12, 4))

# Monthly seasonality, bus.
sns.lineplot(bus_pct['pre_365'], linewidth=0.5, ax=axes[0])
sns.lineplot(bus_pct['post_365'], linewidth=0.5, ax=axes[0])
# Monthly seasonality, L.
sns.lineplot(L_pct['pre_365'], linewidth=0.5, ax=axes[1])
sns.lineplot(L_pct['post_365'], linewidth=0.5, ax=axes[1])

for ax in axes.flatten():
    ax.set_xlabel("")
    ax.set_ylabel("Percent of trend")
axes[0].set_title("Bus")
axes[1].set_title('"L"')

fig.suptitle("Pre- and post-COVID monthly seasonality (percent-of-trend)")
plt.tight_layout()
plt.show()

In [None]:
save_figure(fig, 'mstl_seasonal_monthly_pct')

The weekday/weekend pattern is stable in relative terms, both pre- and post-COVID.
The pandemic did not fundamentally alter the relative weekly structure of aggregate transit demand.