# Additional MBTA Commuter Rail Stats, 2024 Q2

This is a follow-up to [the on-time percentage graphs](q2_on_time_percentage.ipynb) I made earlier for a blog post. The initial data processing is the same, but to understand the data better I'll also calculate summary statistics (like the mean and median).

In [1]:
%%capture
%pip install pandas

In [2]:
import pandas as pd

raw_data = pd.read_csv("cr_2024_q2_sched_adherance.csv")

# Clean up columns that we won't use
cleaned_data = raw_data.drop(columns=[
  "gtfs_route_id",
  "gtfs_route_short_name",
  "gtfs_route_desc",
  "route_category",
  "mode_type",
  "peak_offpeak_ind",
  "metric_type",
  "cancelled_numerator",
  "ObjectId"
])

# Clean up timestamps (we only care about the date; the time of day is the same on every row)
cleaned_data["service_date"] = pd.to_datetime(raw_data["service_date"]).dt.date

# There are sometimes duplicate date entries... combine them by sum for now
cleaned_data = cleaned_data.groupby(["service_date", "gtfs_route_long_name"]).agg("sum")

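# Calculate the on-time share from numerator and denominator.
# Despite the name, this returns a fraction between 0 and 1, rounded to 4 places.
def to_percent(row):
  # Avoid dividing by zero; rows with a zero denominator are reported as 0
  if row.otp_denominator == 0:
    return 0

  return round(row.otp_numerator / row.otp_denominator, 4)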

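# Work on a copy so that cleaned_data itself is not modified
with_percent = cleaned_data.copy()
with_percent["on_time_percentage"] = with_percent.apply(to_percent, axis=1)

# Drop the raw counts and group by line for the per-line summaries
with_percent = (
  with_percent
  .drop(columns=["otp_numerator", "otp_denominator"])
  .reset_index()
  .groupby(["gtfs_route_long_name"])
)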
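
As an aside, the row-wise `apply` above could probably also be written as plain column arithmetic. The sketch below is only an illustration on a small made-up table (the rows are hypothetical, not taken from the MBTA file): it sums duplicate date entries the same way, then divides the columns directly, treating a zero denominator as 0 like `to_percent` does.

```python
import pandas as pd

# Hypothetical rows: one duplicated (date, line) pair and one day with a zero denominator
sample = pd.DataFrame({
    "service_date": ["2024-04-01", "2024-04-01", "2024-04-02"],
    "gtfs_route_long_name": ["Fitchburg Line"] * 3,
    "otp_numerator": [10, 8, 0],
    "otp_denominator": [12, 8, 0],
})

# Combine duplicate entries by summing the counts, as in the cell above
daily = sample.groupby(["service_date", "gtfs_route_long_name"]).sum()

# Replace zero denominators with NaN so the division yields NaN instead of an error or inf
safe_denominator = daily["otp_denominator"].where(daily["otp_denominator"] != 0)

# Divide whole columns at once, then map the NaN results back to 0 and round
daily["on_time_percentage"] = (
    (daily["otp_numerator"] / safe_denominator).fillna(0).round(4)
)

print(daily["on_time_percentage"])
```

Summing the counts before dividing matters here: the two duplicate rows combine to 18 on-time trips out of 20, which is not the same as averaging the two separate fractions.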

The next cell prints summary statistics for each line. Note that each data point is the on-time fraction for a single day, so the statistics describe the distribution of daily values (not, for example, a potentially more interesting metric like time to wait).

In [3]:
for line, data in with_percent:
    print(line[0])
    print(data["on_time_percentage"].describe())
    print("===============================")

Fairmount Line
count    88.000000
mean      0.957783
std       0.063305
min       0.636400
25%       0.939400
50%       0.975000
75%       1.000000
max       1.000000
Name: on_time_percentage, dtype: float64
Fitchburg Line
count    90.000000
mean      0.948556
std       0.070929
min       0.631600
25%       0.937500
50%       0.973700
75%       1.000000
max       1.000000
Name: on_time_percentage, dtype: float64
Framingham/Worcester Line
count    90.000000
mean      0.884640
std       0.103424
min       0.510600
25%       0.850000
50%       0.907400
75%       0.959750
max       1.000000
Name: on_time_percentage, dtype: float64
Franklin Line
count    90.000000
mean      0.917532
std       0.131052
min       0.050000
25%       0.884600
50%       0.961500
75%       1.000000
max       1.000000
Name: on_time_percentage, dtype: float64
Greenbush Line
count    90.000000
mean      0.919063
std       0.081259
min       0.692300
25%       0.875000
50%       0.937500
75%       1.000000
max       

I'm particularly interested in the Fitchburg Line, as it's the one I usually take. It looks like on the average day, trains are on time about 95%-97% of the time (the mean is 94.9% and the median is 97.4%, with some low outliers pulling the mean down compared to the median). To be honest, I'm surprised it's that high, as I've almost always had a delay of 10-15 minutes in Q3. If and when the Q3 numbers are released, it will be interesting to see whether they reflect that experience (and the construction that took place as part of the [Commuter Rail Safety and Resiliency Program](https://www.mbta.com/projects/commuter-rail-safety-and-resiliency-program)).
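
To illustrate how a few bad days can pull the mean below the median, here is a minimal sketch using only the standard library. The daily values are made up for the example and are not the real Fitchburg numbers.

```python
from statistics import mean, median

# Hypothetical daily on-time fractions: mostly good days plus one very bad day
daily_otp = [0.97, 0.98, 1.0, 0.96, 0.99, 0.63]

print(f"mean:   {mean(daily_otp):.4f}")
print(f"median: {median(daily_otp):.4f}")

# Remove the single bad day and compare again
without_outlier = [p for p in daily_otp if p > 0.9]

print(f"mean without the bad day:   {mean(without_outlier):.4f}")
print(f"median without the bad day: {median(without_outlier):.4f}")
```

In this toy data, dropping the one bad day moves the mean much more than the median, which is the same pattern the `describe()` output hints at for the real lines.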

If you're reading this and also use the Commuter Rail, how does your line measure up? Are you surprised?