-
Notifications
You must be signed in to change notification settings - Fork 0
/
02_using_more_than_two_data_points.Rmd
427 lines (355 loc) · 18.9 KB
/
02_using_more_than_two_data_points.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
---
title: "Why Two Years Won't Suffice"
author: "Craig A. Sloss"
date: "2023-06-15"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction and Setup
This notebook is a supplement to the article "Why Two Years Won't Suffice: Unveiling the Drawbacks of Limited Data" in *Questioning the Numbers*, showing the code used to generate the graphics and statistical results appearing in the article. (See https://questioning-the-numbers.ghost.io/) Section titles in this document correspond to the sections in the article. This notebook also contains some additional discussion of technical details that are not in the article.
The following packages are used in this notebook:
```{r cars}
require(tidyverse)
require(curl)
```
Download the full Uniform Crime Reporting Survey data from https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3510017701:
```{r}
curl_download("https://www150.statcan.gc.ca/n1/tbl/csv/35100177-eng.zip", "ucr_full.zip")
unzip("ucr_full.zip")
ucr_full = read_csv("./35100177.csv")
```
Preview the data:
```{r}
ucr_full %>% head()
```
Keep data only for Kitchener-Waterloo-Cambridge, and drop columns that are not needed for this analysis. To ensure that the results appearing in the newsletter can be replicated, limit the data to 2021 and earlier so that this gives the same results when additional years of data are added in the future:
```{r}
reported_crime_data = ucr_full %>%
filter(GEO == "Kitchener-Cambridge-Waterloo, Ontario [35541]" &
REF_DATE <= 2021) %>%
select(REF_DATE, Violations, Statistics, VALUE)
reported_crime_data %>% head()
```
Download the full Crime Severity Index data from https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3510002601:
```{r}
curl_download("https://www150.statcan.gc.ca/n1/en/tbl/csv/35100026-eng.zip?st=b0kCP-3L", "csi_full.zip")
unzip("csi_full.zip")
csi_full = read_csv("./35100026.csv")
```
Preview the data:
```{r}
csi_full %>% head()
```
Keep data only for Kitchener-Waterloo-Cambridge, and drop columns that are not needed for this analysis:
```{r}
csi_data = csi_full %>%
filter(GEO == "Kitchener-Cambridge-Waterloo, Ontario [35541]" & Statistics == "Crime severity index" & REF_DATE <= 2021) %>%
select(REF_DATE, VALUE)
csi_data
```
The trend_sensitivity_test function from the notebook 01_subjectivity_in_data_analysis.Rmd will be re-used in this notebook. (See the previous notebook for a detailed discussion.) I added a "log-poisson" option to this function, to provide an option for discrete distributions, since in this notebook I look at some statistics that are counts rather than rates.
```{r}
trend_sensitivity_test = function(x, y, method = "log-gamma") {
x_min = min(x)
x_max = max(x) - 2
x_range = x_min:x_max
x_length = length(x_range)
upper_ci = rep(NA, x_length)
middle = rep(NA, x_length)
lower_ci = rep(NA, x_length)
data = data.frame(x, y)
for (i in 1:x_length) {
truncated_data = data %>%
filter(x >= x_range[i])
if (method == "log-gamma") {
trend_glm = glm(y ~ x, data = truncated_data, family = Gamma(link = "log"))
}
else if (method == "linear") {
trend_glm = glm(y ~ x, data = truncated_data, family = gaussian(link = "identity"))
}
else if (method == "log-normal") {
trend_glm = glm(y ~ x, data = truncated_data, family = gaussian(link = "log"))
}
else if (method == "log-inverse-gaussian") {
trend_glm = glm(y ~ x, data = truncated_data, family = inverse.gaussian(link = "log"))
}
else if (method == "log-poisson") {
trend_glm = glm(y ~ x, data = truncated_data, family = poisson(link = "log"))
}
else{
print("Error: valid methods are log-gamma, linear, log-normal, or log-inverse-gaussian")
stop()
}
se = summary(trend_glm)$coefficients[2,2]
beta = trend_glm$coefficients[2]
if (method == "linear") {
upper_ci[i] = beta + 2 * se
middle[i] = beta
lower_ci[i] = beta - 2 * se
}
else {
upper_ci[i] = (exp(beta + 2 * se) - 1) * 100
middle[i] = (exp(beta) - 1) * 100
lower_ci[i] = (exp(beta - 2 * se) - 1) * 100
}
}
trend_result_data = data.frame(x_range, upper_ci, middle, lower_ci)
return(trend_result_data)
}
```
# Year-over-year changes have no predictive value
This section looks at changes in the number of incidents of "Total firearms, use of, discharge, pointing" over time in order to demonstrate the limitations of relying on year-over-year changes for budget planning purposes. Two classes of firearm incidents -- using a firearm in commission of an offence and pointing a firearm -- were effective April 1, 2008 (see https://www.statcan.gc.ca/en/statistical-programs/document/3302_D15_V10). To avoid the change in definition impacting the analysis, I decided only to use data from 2009 and later, since 2009 is the first full year under the new definition. Note also that the UCR classifies incidents according to the most serious violation to occur as part of the incident, so this data excludes situations in which a firearm was used in an incident that was recorded as a more serious type of crime.
First, produce the graph of the number of incidents over time:
```{r message=FALSE, warning=FALSE}
firearms_data = reported_crime_data %>% filter(REF_DATE >= 2009 &
Violations == "Total firearms, use of, discharge, pointing [150]" &
Statistics == "Actual incidents")
ggplot(data = firearms_data, aes(x = REF_DATE, y = VALUE)) +
geom_line() +
geom_point() +
coord_cartesian(ylim = c(0, 55)) +
scale_x_continuous(breaks = seq(2009, 2021, 1)) +
labs(title = "Reported incidents - Total firearms use, discharge, pointing",
subtitle = "Kitchener-Cambridge-Waterloo",
x = "Year",
y = "Number of Incidents Reported",
caption = "Statistics Canada, Uniform Crime Reporting Survey")
ggsave("./firearms_incidents_2009_2021.jpg", device = "jpeg")
ggsave("./firearms_incidents_2009_2021_feature.jpg", device = "jpeg", width = 2400, height = 1200, units = "px")
ggsave("./firearms_incidents_2009_2021_twitter.jpg", device = "jpeg", width = 2400, height = 1254, units = "px")
```
To better understand the volatility, look at some summary statistics in the year-over-year percentage change. This function calculates year-over-year changes in the number of incidents, as a percentage increase or decrease, and then produces some distributional statistics related to the year-over-year changes. The inputs to this function are:
* data is a subset of the UCR data, containing the columns "Violations", "Statistics", and "VALUE". It should be pre-filtered for the geographical area and years of interest.
* violation is a character string specifying the type of violation for which the statistics will be computed.
The output is a list with four items:
* mean is the average value of the year-over-year percentage change
* quantiles is a list containing the minimum, first quartile, median, third quartile, and maximum year-over-year percentage change
* mean_abs is the average size of the year-over-year change, ignoring direction
* quantiles_abs is a list containing the minimum, first quartile, median, third quartile, and maximum year-over-year percentage change, ignoring direction
```{r}
yoy_count_change_summary_statistics = function(data, violation) {
base_data = data %>%
filter(Violations == violation & Statistics == "Actual incidents")
change_data = base_data %>%
left_join(base_data %>%
mutate(REF_DATE = REF_DATE + 1) %>%
select(REF_DATE, PREV_VALUE = VALUE)) %>%
mutate(yoy_pct_change = (VALUE / PREV_VALUE - 1) * 100,
size_of_change = abs(yoy_pct_change))
return(list(mean = change_data %>% pull(yoy_pct_change) %>% mean(na.rm = TRUE),
quantiles = quantile(change_data %>% pull(yoy_pct_change), na.rm = TRUE),
mean_abs = change_data %>% pull(size_of_change) %>% mean(na.rm = TRUE),
quantiles_abs = quantile(change_data %>% pull(size_of_change), na.rm = TRUE)))
}
```
Assess typical distribution of year-over-year percentage changes:
```{r}
firearms_change_summary = yoy_count_change_summary_statistics(data = reported_crime_data %>% filter(REF_DATE >= 2009 & REF_DATE <= 2021),
violation = "Total firearms, use of, discharge, pointing [150]")
```
```{r}
firearms_change_summary$quantiles
```
```{r}
firearms_change_summary$quantiles_abs
```
Note that half of the time, the size of the change is bigger than 22%.
Check the percentage increase in 2021:
```{r}
((reported_crime_data %>% filter(REF_DATE == 2021 & Violations == "Total firearms, use of, discharge, pointing [150]" & Statistics == "Actual incidents") %>% pull(VALUE)) / (reported_crime_data %>% filter(REF_DATE == 2020 & Violations == "Total firearms, use of, discharge, pointing [150]" & Statistics == "Actual incidents") %>% pull(VALUE)) - 1) * 100
```
Next, check the sensitivity of the trend analysis to the start point of the trend line:
```{r}
trend_result_data = trend_sensitivity_test(x = firearms_data %>% pull(REF_DATE),
y = firearms_data %>% pull(VALUE),
method = "log-poisson") %>%
rename(years = x_range)
trend_result_data
```
```{r message=FALSE}
ggplot(trend_result_data, aes(x = years, y = middle)) +
geom_hline(yintercept = 0, colour = "black", linetype = "dashed") +
geom_line(colour = "blue") +
geom_line(aes(y = upper_ci), colour = "red", linetype = "dashed") +
geom_line(aes(y = lower_ci), colour = "red", linetype = "dashed") +
scale_x_continuous(breaks = seq(2009, 2019, 2)) +
labs(title = "Trend analysis results based on first year included in analysis",
subtitle = "Reported incidents - Total firearms use, discharge, pointing - Kitchener-Cambridge-Waterloo",
x = "First year included in trend calculation",
y = "Percentage trend calculated",
caption = "Based on Statistics Canada, Uniform Crime Reporting Survey")
ggsave("./firearms_trend_sensitivity_analysis.jpg", device = "jpeg")
```
There are potentially three interpretations from this:
* Using the full range of data is consistent with an interpretation that this statistic is not trending, and instead is oscillating around an average vaule of 33.
* If the trend line starts in the range 2013-2017, and we assume it continues to the present, we would conclude that there is an increasing trend.
* There was an increasing medium-term trend, but the number of incidents has since stabilized. It now oscillates around an average that is higher than it was historically (e.g. 40 incidents, using data from 2018 onward).
# Using only two years of data can overestimate the size of a trend
## Luring incidents
Visualize the data on number of luring incidents in the UCR survey:
```{r}
luring_data = reported_crime_data %>% filter(REF_DATE >= 2009 &
Violations == "Luring a child via a computer [1370]" &
Statistics == "Actual incidents")
ggplot(data = luring_data, aes(x = REF_DATE, y = VALUE)) +
geom_line() +
geom_point() +
geom_smooth(method = "glm", method.args = list(family = poisson(link = "log"))) +
coord_cartesian(ylim = c(0, 35)) +
scale_x_continuous(breaks = seq(2009, 2021, 1)) +
labs(title = "Reported incidents - Luring a Child via a Computer",
subtitle = "Kitchener-Cambridge-Waterloo",
x = "Year",
y = "Number of Incidents Reported",
caption = "Statistics Canada, Uniform Crime Reporting Survey")
ggsave("./luring_incidents_2009_2021.jpg", device = "jpeg")
```
Check sensitivity of the trend analysis to the choice of starting point:
```{r}
trend_result_data = trend_sensitivity_test(x = luring_data %>% pull(REF_DATE),
y = luring_data %>% pull(VALUE),
method = "log-poisson") %>%
rename(years = x_range)
trend_result_data
```
```{r message=FALSE}
ggplot(trend_result_data, aes(x = years, y = middle)) +
geom_hline(yintercept = 0, colour = "black", linetype = "dashed") +
geom_line(colour = "blue") +
geom_line(aes(y = upper_ci), colour = "red", linetype = "dashed") +
geom_line(aes(y = lower_ci), colour = "red", linetype = "dashed") +
scale_x_continuous(breaks = seq(2009, 2019, 2)) +
labs(title = "Trend analysis results based on first year included in analysis",
subtitle = "Reported incidents - Luring a Child via a Computer - Kitchener-Cambridge-Waterloo",
x = "First year included in trend calculation",
y = "Percentage trend calculated",
caption = "Based on Statistics Canada, Uniform Crime Reporting Survey")
ggsave("./luring_trend_sensitivity_analysis.jpg", device = "jpeg")
```
The conclusion of an increasing trend is fairly consistent regardless of the starting point, with the exception of very recent years which could be due to low data volume.
Switch to using a linear model to obtain a trend that can be interperted additively.
```{r}
trend_result_data = trend_sensitivity_test(x = luring_data %>% pull(REF_DATE),
y = luring_data %>% pull(VALUE),
method = "linear") %>%
rename(years = x_range)
trend_result_data
```
Assess typical distribution of year-over-year percentage changes:
```{r}
luring_change_summary = yoy_count_change_summary_statistics(data = reported_crime_data %>% filter(REF_DATE >= 2009 & REF_DATE <= 2021),
violation = "Luring a child via a computer [1370]")
```
```{r}
luring_change_summary$quantiles
```
```{r}
luring_change_summary$quantiles_abs
```
## Crime Severity Index
Check the sensitivity of the "year-to-2021" percentage change in the CSI, to the selection of the comparison year.
```{r}
ggplot(data = csi_data, aes(x = REF_DATE, y = VALUE)) +
geom_point() +
geom_line() +
geom_smooth(method = "glm", method.args = list(family = Gamma(link = "log"))) +
coord_cartesian(ylim = c(0, 105)) +
scale_x_continuous(breaks = seq(1998, 2021, 2)) +
labs(title = "Total Crime Severity Index",
subtitle = "Kitchener-Cambridge-Waterloo",
x = "Year",
y = "Total CSI",
caption = "Based on Statistics Canada, Crime Severity Index and Weighted Clearance Rates")
```
For calculating the percentage change relative to a reference year, check the sensitivity of the choice of reference year:
```{r}
csi_2021 = csi_data %>% filter(REF_DATE == 2021) %>% pull(VALUE)
csi_percentage_change = csi_data %>%
filter(REF_DATE <= 2020) %>%
mutate(change_to_2021 = csi_2021 / VALUE,
change_to_2021_annualized = change_to_2021^(1 / (2021 - REF_DATE)),
pct_change_to_2021 = 100 * (change_to_2021 - 1),
Direction = ifelse(pct_change_to_2021 > 0, "Increase", "Decrease"))
csi_percentage_change
```
```{r}
ggplot(csi_percentage_change, aes(x = REF_DATE, y = pct_change_to_2021, fill = Direction)) +
geom_bar(stat = "identity", width = 0.25) +
geom_point(size = 3, shape = 21, colour = "Black") +
geom_hline(yintercept = 0, linetype = "dashed") +
scale_x_continuous(breaks = seq(1998, 2020, 2)) +
scale_fill_manual(values = c("Green", "Purple")) +
labs(title = "Impact of selection of reference year on percentage change in CSI",
subtitle = "Total Crime Severity Index - Kitchener-Cambridge-Waterloo",
x = "Reference year",
y = "Percentage change by 2021",
caption = "Based on Statistics Canada, Crime Severity Index and Weighted Clearance Rates")
ggsave("./csi_reference_year_sensitivity_test.jpg", device = "jpeg")
```
```{r}
quantile(csi_percentage_change$pct_change_to_2021, na.rm = TRUE)
```
For comparison, what do the results of fitting a trend line look like?
```{r}
trend_result_data = trend_sensitivity_test(x = csi_data %>% pull(REF_DATE),
y = csi_data %>% pull(VALUE),
method = "log-gamma") %>%
rename(years = x_range)
trend_result_data
```
```{r message=FALSE}
ggplot(trend_result_data, aes(x = years, y = middle)) +
geom_hline(yintercept = 0, colour = "black", linetype = "dashed") +
geom_line(colour = "blue") +
geom_line(aes(y = upper_ci), colour = "red", linetype = "dashed") +
geom_line(aes(y = lower_ci), colour = "red", linetype = "dashed") +
scale_x_continuous(breaks = seq(1998, 2019, 2)) +
labs(title = "Trend analysis results based on first year included in analysis",
subtitle = "Total Crime Severity Index - Kitchener-Cambridge-Waterloo",
x = "First year included in trend calculation",
y = "Percentage trend calculated",
caption = "Based on Statistics Canada, Uniform Crime Reporting Survey")
```
# When are Year-Over-Year Changes Meaningful?
This function extracts the "Percentage Change in Rate" statistic for a specified violation from the USR data, and then computes various summary statistics. (It is similar to the function yoy_count_change_summary_statistics described above, except that there is no need to calculate the change in rate because it is already in the dataset.)
```{r}
yoy_rate_change_summary_statistics = function(data, violation) {
yoy_data = data %>%
filter(Violations == violation & Statistics == "Percentage change in rate") %>%
mutate(size_of_change = abs(VALUE))
return(list(mean = yoy_data %>% pull(VALUE) %>% mean(na.rm = TRUE),
quantiles = quantile(yoy_data %>% pull(VALUE), na.rm = TRUE),
mean_abs = yoy_data %>% pull(size_of_change) %>% mean(na.rm = TRUE),
quantiles_abs = quantile(yoy_data %>% pull(size_of_change), na.rm = TRUE)))
}
```
Create a graph of "Total theft over $5,000 (non-motor vehicle) [230]":
```{r}
theft_over_5k_data = reported_crime_data %>%
filter(Violations == "Total theft over $5,000 (non-motor vehicle) [230]" & Statistics == "Rate per 100,000 population")
ggplot(data = theft_over_5k_data, aes(x = REF_DATE, y = VALUE)) +
geom_line() +
geom_point() +
geom_smooth(method = "glm", method.args = list(family = Gamma(link = "log"))) +
scale_x_continuous(breaks = seq(1998, 2021, 2)) +
coord_cartesian(ylim = c(0, 125)) +
labs(title = "Theft over $5000 (non-motor vehicle) -- Rate per 100,000 population",
subtitle = "Kitchener-Cambridge-Waterloo",
x = "Year",
y = "Incidents per 100,000 population",
caption = "Statistics Canada, Uniform Crime Reporting Survey")
ggsave("./theft_over_5k_1998_2021.jpg", device = "jpeg")
```
```{r}
theft_change_summary = yoy_rate_change_summary_statistics(data = reported_crime_data %>% filter(REF_DATE <= 2019), violation = "Total theft over $5,000 (non-motor vehicle) [230]")
```
```{r}
theft_change_summary$quantiles_abs
```
For 2020 and 2021, the year-over-year changes are:
```{r}
reported_crime_data %>%
filter(REF_DATE > 2019 & Violations == "Total theft over $5,000 (non-motor vehicle) [230]" & Statistics == "Percentage change in rate")
```