-
Notifications
You must be signed in to change notification settings - Fork 0
/
my-vignette.Rmd
232 lines (177 loc) · 8.73 KB
/
my-vignette.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
---
title: "NYC OpenData - NYC 311 Phone Call Information"
author: "Wencong (Priscilla) Li: liwencong1995@gmail.com - Smith College '18"
date: "12/13/2016"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Using etl to connect to a MySQL database & Using NYC311 data}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include=FALSE, message=FALSE}
library(knitr)
opts_chunk$set(fig.width = 6, fig.height = 4)
```
## NYC 311 Open Data
[3-1-1](https://en.wikipedia.org/wiki/3-1-1) is a telephone number used in the United States for citizens to get access to non-emergency municipal services. The `nyc311` package provides an interface to NYC 311 Phone Call information from [NYC OpenData](https://nycopendata.socrata.com). Because these are **medium** data, the `nyc311` packages leverages the `etl` package for creating and maintaining an SQL database.
## Getting started:
### Install packages
The `etl` package provides the generic framework for the `nyc311` package. Since the `nyc311` package currently lives on GitHub and not on CRAN, you have to install it using `devtools`.
```{r, eval=FALSE, message=FALSE, warning=FALSE}
install.packages("devtools")
devtools::install_github("beanumber/nyc311")
```
### Load the package
This command loads both `etl` and `dplyr`.
```{r, message=TRUE, warning=FALSE}
library(nyc311)
```
Note:In order to initialize an etl object above using a `src_sqlite` database connection, you need to have [`RSQLite`](https://cran.r-project.org/web/packages/RSQLite/index.html) installed in your `System Library`.
## Examples
### Create an `etl` object:
First, you must create an `etl` object.
The `etl` function has three key attributes -- `object`, `dir`, and `db`.
```{r, warning=FALSE}
calls <- etl("nyc311")
summary(calls)
```
By default, a [SQLite](http://www.github.com/rstats-db/RSQLite) database is created in a temporary directory. Please read the [etl doucumentation](https://github.com/beanumber/etl/blob/master/README.Rmd) for more information.
```{r, eval=FALSE, warning=FALSE}
help(etl)
```
### Populate the database:
First, we will populate the database with data from January and February of 2010 and 2011. These data will be downloaded directly from the NYC OpenData portal.
```{r, message=FALSE, warning=FALSE}
calls %>%
etl_extract(years = 2010:2011, months = 1:2, num_calls = 100) %>%
etl_transform(years = 2010:2011, months = 1:2) %>%
etl_load(years = 2010:2011, months = 1:2)
```
The SQLite database that has just been created is populated with 100 calls from each of these four months.
```{r, warning=FALSE}
calls
```
Now you can pull the data that you are interested in into your R Environment.
```{r, message=FALSE, warning=FALSE}
my_calls <- calls %>%
tbl("calls") %>%
collect()
```
And take a look at it.
```{r, message=FALSE, warning=FALSE}
glimpse(my_calls)
```
Since all of the `Date` variables have been converted into `character` vectors, it would be helpful to transform the data types.
### Transform data types:
```{r, message=FALSE, warning=FALSE}
library(lubridate)
calls_cleaned <- my_calls %>%
mutate(closed_date = ymd_hms(closed_date)) %>%
mutate(created_date = ymd_hms(created_date)) %>%
mutate(resolution_action_updated_date =
ymd_hms(resolution_action_updated_date))
```
Now, take a look at it again.
```{r, message=FALSE, warning=FALSE}
glimpse(calls_cleaned)
```
### How long does it take for NYC agencies to solve an issue?
One obvious question is how many days it takes for NYC agencies to resolve each case. We can compute this by subtracting `closed_date` from `created_date`.
```{r, warning=FALSE}
calls_new <- calls_cleaned %>%
mutate(difference = as.numeric(closed_date-created_date))%>%
arrange(difference)
```
Here, we get the number of seconds it takes to resolve each case under a colunmn called `difference`.
Let's take a look at the first 5 rows of `calls_new`.
```{r, warning=FALSE}
calls_new %>%
select(created_date, closed_date, difference) %>%
head(5)
```
As you can see, some of the `difference` cells have negative entries. Since a case must be created before it is closed, some of the NYC agencies must have made some mistakes in the process of tracking and recording Phone Call information.
Therefore, we probably want to filter out the readings that have negative `difference` values. Then, we can the average number of days by dividing the average number of seconds by `24*60*60` seconds.
```{r, warning=FALSE}
calls_new %>%
filter(difference >= 0) %>%
summarize(N = n(), mean_hrs = mean(difference) / (24 * 60 * 60))
```
For these calls, it took the agencies approximately 13 hours and 24 minutes to solve and close an issue.
### How long does it take for different NYC agencies to solve an issue?
Does the average number of hours it takes to solve an issue vary among multiple NYC departments?
```{r, warning=FALSE}
agency_diff <- calls_new %>%
filter(difference >= 0) %>%
group_by(agency_name) %>%
summarize(number_hrs = mean(difference) / (60 * 60 * 24)) %>%
arrange(number_hrs)
agency_diff
```
It seems that `Department of Housing Preservation and Development` is doing a better job on quickly solving problems. It takes it on average less than 13 minutes to solve each issue.
### How long does it take for NYC agencies to solve differnt types of issue?
Are there differences in the average number of hours it takes to solve each type of complaint?
```{r, warning=FALSE}
complaint_diff <- calls_new %>%
filter(difference >= 0) %>%
group_by(complaint_type) %>%
summarize(number_hrs = mean(difference)/(60*60*24)) %>%
arrange(number_hrs)
complaint_diff %>%
head(10)
```
As you can see from the list above, `HEATING` takes the least time to resolve!
### Where did `General Construction` issues occur in NYC?
Most of the calls have a latitude and longitude coordinate record, so we can contextualize these observations using a map. In this example, we illustrate where general construction issues occurred.
```{r, warning=FALSE}
complaint_type <- calls_new %>%
group_by(complaint_type) %>%
summarize(N = n()) %>%
arrange(desc(N))
general_construction_location <- my_calls %>%
filter(complaint_type == "GENERAL CONSTRUCTION") %>%
select(latitude, longitude) %>%
mutate(latitude = as.numeric(latitude), longitude = as.numeric(longitude))
```
```{r}
class(general_construction_location$latitude)
if (require(leaflet)) {
leaflet(data = general_construction_location) %>%
addTiles() %>%
addCircles()
}
```
The blue dots shown in the map above represent NYC construction sites that had been complained about in the months specified.
## More Examples: 2014 New York City New Year Celebration
The 2014 New York City New Year Celebration took place in Times Square, and let's take a look at what the issue that annoys people living in New York City the most is on New Year's Day.
Here, I am using `etl_update` which is equivalent to a combination of `etl_extract, etl_transform, and etl_load`. Instead of having different arguments for each one of the three functions, you can enter all arguments needed in `etl_update`.
```{r}
calls2 <- etl("nyc311")
calls2 %>%
etl_update(years = 2014, months = 1, num_calls = 100 )
calls2 <- calls2 %>%
tbl("calls") %>%
collect()
```
```{r}
new_year <- calls2 %>%
select(latitude, longitude, complaint_type)
new_year %>%
group_by(complaint_type) %>%
summarise(N = n()) %>%
arrange(desc(N)) %>%
head(3)
if (require(leaflet)) {
leaflet(data = new_year) %>%
addTiles() %>%
addCircles()
}
```
As shown in the table and the map above, most of the first 100 calls reported `HEATING` problems happened `NEW YORK` country.
###An Interesting Article about NYC311
Steven Johnson wrote an interesting article about NYC311 dataset, and this article can be found by clicking the link below.
Steven Johnson, ["What A Hundred Million Calls to 311 Reveal About New York”](http://www.wired.com/magazine/2010/11/ff_311_new_york/all/1) November 10, 2010.
Steven started his article by introducing the "new aroma" event which took place in New York City about 15 years ago. The city officials realized that the NYC311 phone call tracking system data actually provided clues about where the arma was from.
By using this packge to analyze NYC311 data, you will be able to find more interesting "clues" about what is happening in New York City.
###Acknowledgement
I want to thank Professor [Benjamin Baumer](https://www.smith.edu/academics/faculty/ben-baumer), Assistant Professor in Statistical & Data Sciences at Smith College, for teaching me all the techniques that I need for the completion of this package.
I also want to thank Shuqi Mao, Connie Zhang, and Crystal Zang for helping with the testing of this package. They gave me very useful comments on how to make this vignette more straightforward.