-
Notifications
You must be signed in to change notification settings - Fork 0
/
2019-09-03-transit.Rmd
308 lines (273 loc) · 9.41 KB
/
2019-09-03-transit.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
---
title: "Understanding Transit Ridership Trends"
author: "Kyle Ward"
date: 2019-09-03
categories: ["transportation"]
tags: ["R", "transit"]
image:
placement: 2
caption: ""
focal_point: ""
preview_only: false
projects: ""
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, out.width = "100%")
library(tidyverse)
library(lubridate)
library(readxl)
library(here)
library(plotly)
library(RColorBrewer)
```
```{r}
xlsx <- here::here(
"static", "data", "2019-09-03-transit", "June 2019 Adjusted Database.xlsx"
)
upt_tbl_raw <- read_excel(xlsx, sheet = "UPT")
# We will use this table as a starting point for several charts below
upt_tbl <- upt_tbl_raw %>%
filter(Active == "Active") %>%
gather(key = "month_year", value = "trips", JAN02:JUN19) %>%
mutate(
year = substr(
month_year, start = str_length(month_year) - 1, stop = str_length(month_year)
),
month = substr(month_year, start = 1, stop = 3),
date = parse_date_time2(month_year, orders = "my")
) %>%
arrange(Agency, date)
# Assign a color to each agency for consistent charts
color_pal <- colorRampPalette(brewer.pal(8, "Set2"))
color_tbl <- upt_tbl %>%
filter(month_year == "JUN19") %>%
group_by(Agency) %>%
summarize(trips = sum(trips)) %>%
arrange(desc(trips)) %>%
mutate(color = color_pal(n())) %>%
select(-trips)
upt_tbl <- upt_tbl %>%
left_join(color_tbl, by = "Agency")
```
```{r}
# convenience functions
# shortens the agency name to the first 4 characters. Improves legend.
shorten_agency <- function(Agency){
Agency = gsub("[^a-zA-Z0-9 _]", "", Agency)
word_count = str_count(Agency, "\\W+") + 1
agency_short = word(Agency, 1, pmin(word_count, 4))
return(agency_short)
}
# Create a function that performs seasonal decomposition
extract_trend <- function(x, start_year) {
myts <- ts(x, start = c(start_year, 1), end = c(2019, 6), frequency = 12)
fit <- stl(myts, s.window="period")
df <- as.data.frame(fit$time.series)
return(df$trend)
}
```
Before talking about national transit trends, it's important to grasp the
relative size of the NYC market. The chart below shows the top 10 transit
agencies by monthly ridership (June 2019 unlinked trips). NYC dwarfs the other
transit markets, and for this post, is set aside as an outlier.
```{r, out.width="100%", warning=FALSE}
data1 <- upt_tbl %>%
filter(month_year == "JUN19") %>%
group_by(Agency) %>%
summarize(Trips = sum(trips, na.rm = TRUE)) %>%
arrange(desc(Trips)) %>%
head(10) %>%
arrange(Trips) %>%
mutate(Agency = shorten_agency(Agency))
plot_ly(
data1, y = ~Agency, x = ~Trips, type = "bar", orientation = "h",
text = ~Agency, textposition = "auto", hoverinfo = 'x'
) %>%
layout(
title = "Top 10 Transit Agencies (June 2019 trips)",
yaxis = list(
categoryorder = "array",
categoryarray = ~Agency,
showticklabels = FALSE
)
)
```
So what's going on outside of NYC? Here is the raw data of transit ridership
over time by agency. We'll need to do some further processing before any
useful information can be extracted.
```{r, out.width="100%", warning=FALSE}
data2 <- upt_tbl %>%
filter(Agency != "MTA New York City Transit") %>%
group_by(Agency, date) %>%
summarize(
trips = sum(trips, na.rm = TRUE),
color = first(color)
) %>%
arrange(Agency, date) %>%
ungroup() %>%
mutate(Agency = shorten_agency(Agency))
plot_ly(
data2, x = ~date, y = ~trips, type = "scatter", mode = "lines",
color = ~Agency, colors = ~unique(color),
hoverinfo = "x+y+text", hovertext = ~Agency
) %>%
layout(
showlegend = FALSE,
xaxis = list(title = "Date"),
yaxis = list(title = "Trips")
)
```
## Collapsing Agencies
A histogram of June 2019 ridership tells us that there are a few
agencies between 10 and 40 million monthly trips, several between 2 and 10
million, and then the large majority of agencies are below 2 million monthly
trips.
```{r}
data3 <- upt_tbl %>%
filter(month_year == "JUN19", Agency != "MTA New York City Transit") %>%
group_by(Agency) %>%
summarize(Trips = sum(trips, na.rm = TRUE))
plot_ly(
data3, x = ~Trips, type = "histogram", nbinsx = 30
) %>%
layout(
title = "Histogram of Agencies by Number of Trips",
yaxis = list(title = "Count of Agencies")
)
```
To make the time series charts more legible, we'll collapse agencies under 2
million into a single agency and display their median ridership. We'll create
another median agency representing agencies between 2 and 10 million.
## Seasonality
Understanding transit ridership is also complicated by the seasonal nature of
it. For most agencies, ridership is low in December and January and peaks around
October. Removing these seasonal variations helps us understand the underlying
trends. The chart below shows the monthly ridership for the Chicago Transit
Authority. The actual ridership is shown by the jagged line while the trend is
highlighted in orange. The trend line makes it clear that (seasonally adjusted)
ridership peaked in April of 2012. As a point of reference, the vertical green
line marks when Uber began service in Chicago.
```{r}
data4 <- upt_tbl %>%
filter(Agency == "Chicago Transit Authority") %>%
group_by(Agency, date) %>%
summarize(
trips = sum(trips, na.rm = TRUE),
color = first(color)
) %>%
mutate(trend = extract_trend(trips, 2002))
plot_ly(data4) %>%
add_trace(
x = ~date, y = ~trips, type = "scatter", mode = "lines",
alpha = .33, name = "seasonal", color = ~Agency, colors = ~unique(color),
hoverinfo = "none"
) %>%
add_trace(
x = ~date, y = ~trend, type = "scatter", mode = "lines",
name = "trend", alpha = 1, hoverinfo = "x+y"
) %>%
add_trace(
x = c(ymd("2011-09-01"), ymd("2011-09-01")),
y = c(0, 52000000),
type = "scatter", mode = "lines", name = "Uber in CHI",
hoverinfo = "none",
showlegend = FALSE
) %>%
layout(
title = "Chicago Transit Authority Ridership",
xaxis = list(title = "Date"),
yaxis = list(title = "Trips", rangemode = "tozero"),
annotations = list(
x = ymd("2011-09-01"), y = 35000000, text = "Uber Starts",
xref = "x", yref = "y",
showarrow = TRUE, ax = 40, ay = 25
)
)
```
## Results
With the above modifications in place, the chart below tells a fairly concise
story about what is happening to transit ridership. Medium markets like
Chicago began seeing declines in ridership between 2012 and 2014, and this trend
has has continued. This is the same time frame when Uber and Lyft began showing
up in these cities. While not enough to say that these companies caused the
decline, it is certainly interesting.
```{r}
data5 <- upt_tbl %>%
filter(
Agency != "MTA New York City Transit",
date >= ymd("2010-01-01")
) %>%
group_by(Agency, date) %>%
summarize(
trips = sum(trips, na.rm = TRUE),
color = first(color)
) %>%
arrange(Agency, date) %>%
mutate(current_ridership = last(trips)) %>%
ungroup() %>%
mutate(
Agency = case_when(
current_ridership <= 2000000 ~ "Under 2M",
current_ridership <= 10000000 ~ "2M to 10M",
TRUE ~ Agency
)
) %>%
group_by(Agency, date) %>%
summarize(
trips = median(trips),
color = first(color)
) %>%
mutate(trend = extract_trend(trips, 2010)) %>%
ungroup()
plot_ly(data5) %>%
add_trace(
x = ~date, y = ~trend, type = "scatter", mode = "lines",
color = ~Agency, colors = ~unique(color),
text = ~Agency, hoverinfo = "x+y+text",
showlegend = FALSE
) %>%
layout(
title = "Adjusted Ridership by Agency",
xaxis = list(title = "Date"),
yaxis = list(title = "Trips")
)
```
The small (<10 million) and tiny (<2 million) agencies look relatively stable in
the chart above, but that is mainly due to the scale of the graph. The charts
below zoom in to provide a more accurate picture. The median ridership in these
markets is also falling. The decline in the tiny markets starts noticeably later.
One theory could be that Uber/Lyft simply arrived in these regions later.
```{r}
data6 <- data5 %>%
filter(Agency %in% c("Under 2M", "2M to 10M"))
p1 <- plot_ly(data5 %>% filter(Agency == "2M to 10M")) %>%
add_trace(
x = ~date, y = ~trend, type = "scatter", mode = "lines",
color = ~Agency, colors = ~unique(color),
hoverinfo = "x+y"
) %>%
layout(yaxis = list(rangemode = "tozero"))
p2 <- plot_ly(data5 %>% filter(Agency == "Under 2M")) %>%
add_trace(
x = ~date, y = ~trend, type = "scatter", mode = "lines",
color = ~Agency, colors = ~unique(color),
hoverinfo = "x+y"
) %>%
layout(yaxis = list(rangemode = "tozero"))
subplot(p1, p2) %>%
layout(
title = "Median Ridership for Small Markets",
yaxis = list(title = "Trips")
)
```
The Uber/Lyft connection seems obvious, but this post has not done the work
required to say anything for sure. In a future post, I hope to look at the link
in more detail. The goal of this post was primarily to provide insight into
what is happening with transit ridership.
## Reproducibility
Reproduce what you see above and take it further! If you do, send me a link and
I'll post it here. Everything you need to get started is on GitHub:
Data: [FTA xlsx](https://github.com/dkyleward/blog/blob/master/static/data/June%202019%20Adjusted%20Database.xlsx) ([original source](https://www.transit.dot.gov/ntd/data-product/monthly-module-adjusted-data-release) is always the latest month)
Code: [Rmd](https://github.com/dkyleward/blog/blob/master/content/post/2019-09-03-transit/2019-09-03-transit.Rmd)