-
Notifications
You must be signed in to change notification settings - Fork 8
/
16-dates-and-times.Rmd
471 lines (365 loc) · 15 KB
/
16-dates-and-times.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
```{r setup16, include=FALSE, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = TRUE)
library(ggplot2)
library(dplyr)
library(tidyr)
library(nycflights13)
library(forcats)
library(lubridate)
library(hms)
```
# Ch. 16: Dates and times
```{block2, type='rmdimportant'}
**Key questions:**
* 16.2.4. #3
* 16.3.4. #1, 4, 5
* 16.4.5. #4
```
```{block2, type='rmdtip'}
**Functions and notes:**
```
* `today` get current date
* `now` get current date-time
* `ymd_hms` one example of straight-forward set-of of functions that take either strings or unquoted numbers and output dates or date-times
* `make_datetime` create date-time from individual components, e.g. make_datetime(year, month, day, hour, minute)
* `as_date_time` and `as_date` let you switch between date-time and dates, e.g. `as_datetime(today())` or `as_date(now())`
* Accessor functions let you pull out components from an existing date-time:
+ `year`, `month`, `mday`, `yday`, `wday`, `hour`, `minute`, `second`
+ `month` and `wday` have `label = TRUE` to pull the abbreviated name rather than the number, and pull full name with `abbr = FALSE`
+ You can also use these to set particular components `year(datetime) <- 2020`
* `update` allows you to specify multiple values at one time, e.g. `update(datetime, year = 2020, month = 2, mday = 2, hour = 2)`
+ When values are too big they roll-over e.g. `update(ymd("2015-02-01"), mday = 30)` will become '2015-03-02'
* Rounding functions to nearest unit of time
+ `floor_date`, `round_date`, `ceiling_date`
* `as.duration` convert diff-time to a duration
* Durations (can add and multiply):
+ `dseconds`, `dhours`, `ddays`, `dweeks`, `dyears`
* Periods (can add and multiply), more likely to do what you expect than duration:
+ `seconds`, `minutes`, `hours`, `days`, `weeks`, `months`
* Interval is a duration with a starting point, making it precise and possible to determine EXACT length
+ e.g. `(today() %--% next_year) / ddays(1)` to find exact duration
* `Sys.timezone` to see what R thinks your current time zone is
* `tz = ` arg in `ymd_hms` let's you change printing behavior (not underlying value, as assumes UTC unless changed)
* `with_tz` allows you to print an existing date-time object to a specific other timezone
* `force_tz` when have an object that's been labeled with wrong time-zone and need to fix it
## 16.2: Creating date/times
Note that 1 in date-times is treated as 1 - second in numeric contexts, so example below sets `binwidth = 86400` to specify 1 day
```{r}
make_datetime_100 <- function(year, month, day, time) {
make_datetime(year, month, day, time %/% 100, time %% 100)
}
flights_dt <- flights %>%
filter(!is.na(dep_time), !is.na(arr_time)) %>%
mutate_at(c("dep_time", "arr_time", "sched_dep_time", "sched_arr_time"), ~make_datetime_100(year, month, day, .)) %>%
select(origin, dest, ends_with("delay"), ends_with("time"))
flights_dt %>%
ggplot(aes(dep_time)) +
geom_freqpoly(binwidth = 86400)
```
### 16.2.4
1. What happens if you parse a string that contains invalid dates?
```{r}
ymd(c("2010-10-10", "bananas"))
```
* Outputs an NA and sends warning of number that failed to parse
1. What does the `tzone` argument to `today()` do? Why is it important?
* Let's you specify timezones, may be different days depending on location
```{r}
today(tzone = "MST")
now(tzone = "MST")
```
1. Use the appropriate lubridate function to parse each of the following dates:
```{r}
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014
```
```{r}
mdy(d1)
ymd(d2)
dmy(d3)
mdy(d4)
mdy(d5)
```
## 16.3: Date-time components
This allows you to plot the number of flights per week
```{r}
flights_dt %>%
count(week = floor_date(dep_time, "week")) %>%
ggplot(aes(week, n)) +
geom_line()
```
### 16.3.4
1. How does the distribution of flight times within a day change over the
course of the year?
*Median flight time by day*
```{r}
flights_dt %>%
transmute(quarter_dep = quarter(dep_time) %>% factor(),
day_dep = as_date(dep_time),
dep_time = as.hms(dep_time)) %>%
group_by(quarter_dep, day_dep) %>%
summarise(day_median = median(dep_time)) %>%
ungroup() %>%
ggplot(aes(x = day_dep, y = day_median)) +
geom_line(aes(colour = quarter_dep, group = 1)) +
labs(title = "Median flight times by day, coloured by quarter", subtitle = "Typical flight times change with daylight savings times")+
geom_vline(xintercept = ymd("20130310"), linetype = 2)+
geom_vline(xintercept = ymd("20131103"), linetype = 2)
```
* First couple and last couple months tend to have slightly earlier start times
*Quantiles of flight times by month*
```{r}
flights_dt %>%
transmute(month_dep = month(dep_time, label = TRUE),
quarter_dep = quarter(dep_time) %>% factor(),
wk_dep = week(dep_time),
dep_time = as.hms(dep_time)) %>%
group_by(month_dep, wk_dep) %>%
ungroup() %>%
ggplot(aes(x = month_dep, y = dep_time, group = month_dep)) +
geom_boxplot()
```
* Reinforces prior plot, shows that first couple and last couple months of year tend to have slightly higher proportion of flights earlier in day
* Last week of the year have a lower proportion of late flights, and a higher proportion of morning flights
See [16.3.4.1] for a few other plots I looked at.
1. Compare `dep_time`, `sched_dep_time` and `dep_delay`. Are they consistent?
Explain your findings.
```{r}
flights_dt %>%
mutate(dep_delay_check = (dep_time - sched_dep_time) / dminutes(1),
same = dep_delay == dep_delay_check,
difference = dep_delay_check - dep_delay) %>%
filter(abs(difference) > 0)
```
* They are except in the case when it goes over a day, the day is not pushed forward so it counts it as being 24 hours off
1. Compare `air_time` with the duration between the departure and arrival.
Explain your findings. (Hint: consider the location of the airport.)
```{r}
flights_dt %>%
mutate(air_time_check = (arr_time - dep_time) / dminutes(1)) %>%
select(air_time_check, air_time, dep_time, arr_time, everything())
```
* Initial check is off, so need to take into account the time-zone and difference from NYC, so join timezone document
```{r}
flights_dt %>%
left_join(select(nycflights13::airports, dest = faa, tz), by = "dest") %>%
mutate(arr_time_new = arr_time - dhours(tz + 5)) %>%
mutate(air_time_tz = (arr_time_new - dep_time) / dminutes(1),
diff_Airtime = air_time_tz - air_time) %>%
select( origin, dest, tz, contains("time"), -(contains("sched")))
```
* Is closer but still off. In chapter 5, problem 5.5.2.1 I go further into this
* In [Appendix] section [16.3.4.3] filter to NAs
1. How does the average delay time change over the course of a day?
Should you use `dep_time` or `sched_dep_time`? Why?
```{r}
flights_dt %>%
mutate(sched_dep_time = as.hms(floor_date(sched_dep_time, "30 mins"))) %>%
group_by(sched_dep_time) %>%
summarise(delay_mean = mean(arr_delay, na.rm = TRUE),
n = n(),
n_na = sum(is.na(arr_delay)) / n,
delay_median = median(arr_delay, na.rm = TRUE)) %>%
ggplot(aes(x = sched_dep_time, y = delay_mean, size = n)) +
geom_point()
```
* It goes-up throughout the day
* Use `sched_dep_time` because it has the correct day
1. On what day of the week should you leave if you want to minimise the
chance of a delay?
```{r}
flights_dt %>%
mutate(weekday = wday(sched_dep_time, label = TRUE)) %>%
group_by(weekday) %>%
summarise(prop_delay = sum(dep_delay > 0) / n())
```
* wknd has a slightly lower proportion of flights delayed (Thursday has the worst)
1. What makes the distribution of `diamonds$carat` and
`flights$sched_dep_time` similar?
```{r, fig.align = "default", fig.show='hold', out.width = "50%"}
ggplot(diamonds, aes(x = carat)) +
geom_histogram(bins = 500)+
labs(title = "Distribution of carat in diamonds dataset")
ggplot(flights, aes(x = as.hms(sched_dep_time))) +
geom_histogram(bins = 24*6)+
labs(title = "Distribution of scheduled departure times in flights dataset")
```
* Both have gaps and peaks at 'attractive' values
1. Confirm my hypothesis that the early departures of flights in minutes
20-30 and 50-60 are caused by scheduled flights that leave early.
Hint: create a binary variable that tells you whether or not a flight
was delayed.
```{r}
mutate(flights_dt,
mins_dep = minute(dep_time),
mins_sched = minute(sched_dep_time),
delayed = dep_delay > 0) %>%
group_by(mins_dep) %>%
summarise(prop_delayed = sum(delayed) / n()) %>%
ggplot(aes(x = mins_dep, y = prop_delayed)) +
geom_line()
```
* Consistent with above hypothesis
## 16.4: Time spans
* **durations**, which represent an exact number of seconds.
* **periods**, which represent human units like weeks and months.
* **intervals**, which represent a starting and ending point.
![Permitted arithmetic operations between different data types](datetimes-arithmetic.png)
Periods example, using durations to fix oddity of problem when flight arrives overnight
```{r}
flights_dt <- flights_dt %>%
mutate(
overnight = arr_time < dep_time,
arr_time = arr_time + days(overnight * 1),
sched_arr_time = sched_arr_time + days(overnight * 1)
)
```
Intervals example to get precise number of days dependent on specific time
```{r}
next_year <- today() + years(1)
(today() %--% next_year) / ddays(1)
```
To find out how many periods fall in an interval, need to use integer division
```{r}
(today() %--% next_year) %/% days(1)
```
### 16.4.5
1. Why is there `months()` but no `dmonths()`?
* the duration varies from month to month
1. Explain `days(overnight * 1)` to someone who has just started
learning R. How does it work?
* this used in the example above makes it such that if `overnight` is TRUE, it will return the same time period but one day ahead, if false, does not change (as is adding 0 days)
1. a. Create a vector of dates giving the first day of every month in 2015.
```{r}
x <- ymd("2015-01-01")
mons <- c(0:11)
(x + months(mons)) %>% wday(label = TRUE)
```
b. Create a vector of dates giving the first day of every month
in the _current_ year.
```{r}
x <- today() %>% update(month = 1, mday = 1)
mons <- c(0:11)
(x + months(mons)) %>% wday(label=TRUE)
```
1. Write a function that given your birthday (as a date), returns
how old you are in years.
```{r}
birthday_age <- function(birthday) {
(ymd(birthday) %--% today()) %/% years(1)
}
birthday_age("1989-09-07")
```
1. Why can't `(today() %--% (today() + years(1)) / months(1)` work?
* Can't add and subtract intervals
## Appendix
### 16.3.4.1
*Weekly flight proportions by 4 hour blocks*
```{r}
flights_dt %>%
transmute(month_dep = month(dep_time, label = TRUE),
wk_dep = week(dep_time),
dep_time_4hrs = floor_date(dep_time, "4 hours"),
hour_dep_4hrs = hour(dep_time_4hrs) %>% factor) %>%
count(wk_dep, hour_dep_4hrs) %>%
group_by(wk_dep) %>%
mutate(wk_tot = sum(n),
wk_prop = round(n / wk_tot, 3)) %>%
ungroup() %>%
ggplot(aes(x = wk_dep, y = wk_prop)) +
geom_col(aes(fill = hour_dep_4hrs))
```
*Weekly median fight time*
```{r}
flights_dt %>%
transmute(quarter_dep = quarter(dep_time) %>% factor(),
day_dep = as_date(dep_time),
wk_dep = floor_date(dep_time, "1 week") %>% as_date,
dep_time = as.hms(dep_time)) %>%
group_by(quarter_dep, wk_dep) %>%
summarise(wk_median = median(dep_time)) %>%
ungroup() %>%
mutate(wk_median = as.hms(wk_median)) %>%
ggplot(aes(x = wk_dep, y = wk_median)) +
geom_line(aes(colour = quarter_dep, group = 1))
```
*Proportion of flights in each hour, by quarter*
```{r}
flights_dt %>%
transmute(quarter_dep = quarter(dep_time) %>% factor(),
hour_dep = hour(dep_time)) %>%
count(quarter_dep, hour_dep) %>%
group_by(quarter_dep) %>%
mutate(quarter_tot = sum(n),
quarter_prop = round(n / quarter_tot, 3)) %>%
ungroup() %>%
ggplot(aes(x = hour_dep, y = quarter_prop)) +
geom_line(aes(colour = quarter_dep))
```
* Q1 seems to be a little more extreme at the local maximas
*Look at proportion of flights by hour faceted by each month*
```{r}
flights_dt %>%
transmute(month_dep = month(dep_time, label = TRUE),
hour_dep = hour(dep_time)) %>%
count(month_dep, hour_dep) %>%
group_by(month_dep) %>%
mutate(month_tot = sum(n),
month_prop = round(n / month_tot, 3)) %>%
ungroup() %>%
ggplot(aes(x = hour_dep, y = month_prop)) +
geom_line() +
facet_wrap( ~ month_dep)
```
### 16.3.4.3
* Perhaps these are flights where landed in different location...
```{r}
flights_dt %>%
mutate(arr_delay_test = (arr_time - sched_arr_time) / dminutes(1)) %>%
select( origin, dest, dep_delay, arr_delay, arr_delay_test, contains("time")) %>%
filter(is.na(arr_delay))
```
### 16.3.4.4
Below started looking at proportions...
```{r}
mutate(flights_dt,
dep_old = dep_time,
sched_old = sched_dep_time,
dep_time = floor_date(dep_time, "5 minutes"),
sched_dep_time = floor_date(sched_dep_time, "5 minutes"),
mins_dep = minute(dep_time),
mins_sched = minute(sched_dep_time),
delayed = dep_delay > 0) %>%
group_by(mins_dep, mins_sched) %>%
summarise(num_delayed = sum(delayed),
num = n(),
prop_delayed = num_delayed / num) %>%
group_by(mins_dep) %>%
mutate(num_tot = sum(num),
prop_sched = num / num_tot,
sched_dep_diff = mins_dep - mins_sched) %>%
ungroup() %>%
ggplot(aes(x = mins_dep, y = prop_sched, fill = factor(mins_sched))) +
geom_col()+
labs(title = "Proportion of early flights by minute scheduled v. minute departed")
mutate(flights_dt,
dep_old = dep_time,
sched_old = sched_dep_time,
# dep_time = floor_date(dep_time, "5 minutes"),
# sched_dep_time = floor_date(sched_dep_time, "5 minutes"),
mins_dep = minute(dep_time),
mins_sched = minute(sched_dep_time),
early_less10 = dep_delay >= -10) %>%
filter(dep_delay < 0) %>%
group_by(mins_dep) %>%
summarise(num = n(),
sum_recent10 = sum(early_less10),
prop_recent10 = sum_recent10 / num) %>%
ungroup() %>%
ggplot(aes(x = mins_dep, y = prop_recent10)) +
geom_line()+
labs(title = "proportion of early flights that were scheduled to leave within 10 mins of when they did")
```