-
Notifications
You must be signed in to change notification settings - Fork 8
/
13-relational-data.Rmd
684 lines (540 loc) · 26.8 KB
/
13-relational-data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
```{r setup13, include=FALSE, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = TRUE)
library(ggplot2)
library(dplyr)
library(tidyr)
library(nycflights13)
library(babynames)
library(nasaweather)
library(lubridate)
```
# Ch. 13: Relational data
```{block2, type='rmdimportant'}
**Key questions:**
* 13.4.6 , #1, 3, 4
* 13.5.1, #2, 4
```
```{block2, type='rmdtip'}
**Functions and notes:**
```
> "The relations of three or more tables are always a property of the relations between each pairs."
*Three families of verbs in relational data:*
* __Mutating joins__, which add new variables to one data frame from matching
observations in another.
+ `inner_join`: match when equal
+ `left_join`: keep all observations in table in 1st arg
+ `right_join`: keep all observations in table in 2nd arg
+ `full_join`: keep all observations in table in 1st and 2nd arg
* __Filtering joins__, which filter observations from one data frame based on
whether or not they match an observation in the other table.
+ `semi_join(x, y)` __keeps__ all observations in `x` that have a match in `y`.
+ `anti_join(x, y)` __drops__ all observations in `x` that have a match in `y`.
* __Set operations__, which treat observations as if they were set elements.
+ `intersect(x, y)`: return only observations in both `x` and `y` (when inputs are a df, is comparing across all values in a row).
+ `union(x, y)`: return unique observations in `x` and `y`.
+ `setdiff(x, y)`: return observations in `x`, but not in `y`.
`base::merge()` can perform all four types of mutating join:
dplyr | merge
-------------------|-------------------------------------------
`inner_join(x, y)` | `merge(x, y)`
`left_join(x, y)` | `merge(x, y, all.x = TRUE)`
`right_join(x, y)` | `merge(x, y, all.y = TRUE)`,
`full_join(x, y)` | `merge(x, y, all.x = TRUE, all.y = TRUE)`
SQL is the inspiration for dplyr's conventions, so the translation is straightforward:
dplyr | SQL
-----------------------------|-------------------------------------------
`inner_join(x, y, by = "z")` | `SELECT * FROM x INNER JOIN y USING (z)`
`left_join(x, y, by = "z")` | `SELECT * FROM x LEFT OUTER JOIN y USING (z)`
`right_join(x, y, by = "z")` | `SELECT * FROM x RIGHT OUTER JOIN y USING (z)`
`full_join(x, y, by = "z")` | `SELECT * FROM x FULL OUTER JOIN y USING (z)`
## 13.2 nycflights13
```{r, eval = FALSE}
flights
airlines
airports
planes
weather
```
### 13.2.1
1. *Imagine you wanted to draw (approximately) the route each plane flies from*
*its origin to its destination. What variables would you need? What tables*
*would you need to combine?*
To draw a line from origin to destination, I need the lat lon points from airports` as well as the dest and origin variables from `flights`.
1. *I forgot to draw the relationship between `weather` and `airports`.*
*What is the relationship and how should it appear in the diagram?*
`origin` from `weather connects to `faa` from `airports` in a many to one relationship
1. *`weather` only contains information for the origin (NYC) airports. If*
*it contained weather records for all airports in the USA, what additional*
*relation would it define with `flights`?*
It would connect to `dest`.
1. *We know that some days of the year are "special", and fewer people than*
*usual fly on them. How might you represent that data as a data frame?*
*What would be the primary keys of that table? How would it connect to the*
*existing tables?*
Make a set of days that are less popular and have these dates connect by month and day
## 13.3 Keys
### 13.3.1
1. *Add a surrogate key to `flights`.*
```{r}
flights %>%
mutate(surrogate_key = row_number()) %>%
glimpse()
```
1. *Identify the keys in the following datasets*
1. `Lahman::Batting`: player, year, stint
```{r}
Lahman::Batting %>%
count(playerID, yearID, stint) %>%
filter(n > 1)
```
2. `babynames::babynames`: name, sex, year
```{r}
babynames::babynames %>%
count(name, sex, year) %>%
filter(n > 1)
```
3. `nasaweather::atmos`: lat, long, year, month
```{r}
nasaweather::atmos %>%
count(lat, long, year, month) %>%
filter(n > 1)
```
4. `fueleconomy::vehicles`: id
```{r}
fueleconomy::vehicles %>%
count(id) %>%
filter(n > 1)
```
5. `ggplot2::diamonds`: needs surrogate
```{r}
diamonds %>%
count(x, y, z, depth, table, carat, cut, color, price, clarity ) %>%
filter(n > 1)
diamonds %>%
mutate(surrogate_id = row_number())
```
1. *Draw a diagram illustrating the connections between the `Batting`,`Master`, and `Salaries` tables in the Lahman package. Draw another diagram that shows the relationship between `Master`, `Managers`, `AwardsManagers`.*
* `Lahman::Batting` and `Lahman::Master` combine by `playerID`
* `Lahman::Batting` and `Lahman::Salaries` combine by `playerID`, `yearID`
* `Lahman::Master` and `Lahman::Salaries` combine by `playerID`
* `Lahman::Master` and `Lahman::Managers` combine by `playerID`
* `Lahman::Master` and `Lahman::AwardsManagers` combine by `playerID`
*How would you characterise the relationship between the `Batting`, `Pitching`, and `Fielding` tables?*
* All connect by `playerID`, `yearID`, `stint`
## 13.4 Mutating joins
> The most commonly used join is the left join: you use this whenever you look up additional data from another table, because it preserves the original observations even when there isn't a match.
### 13.4.6
1. *Compute the average delay by destination, then join on the `airports`*
*data frame so you can show the spatial distribution of delays. Here's an*
*easy way to draw a map of the United States:*
```{r}
flights %>%
semi_join(airports, c("dest" = "faa")) %>%
group_by(dest) %>%
summarise(delay = mean(arr_delay, na.rm=TRUE)) %>%
left_join(airports, by = c("dest"="faa")) %>%
ggplot(aes(lon, lat)) +
borders("state") +
geom_point(aes(colour = delay)) +
coord_quickmap()+
# see chapter 28 for information on scales
scale_color_gradient2(low = "blue", high = "red")
```
1. *Add the location of the origin _and_ destination (i.e. the `lat` and `lon`)*
*to `flights`.*
```{r}
flights %>%
left_join(airports, by = c("dest" = "faa")) %>%
left_join(airports, by = c("origin" = "faa"), suffix = c("_dest", "_origin")) %>%
select(flight, carrier, dest, lat_dest, lon_dest, origin, lat_origin, lon_origin)
```
Note that the suffix allows you to tag names onto first and second table, hence why vector is length 2
1. *Is there a relationship between the age of a plane and its delays?*
```{r}
group_by(flights, tailnum) %>%
summarise(avg_delay = mean(arr_delay, na.rm=TRUE),
n = n()) %>%
left_join(planes, by="tailnum") %>%
mutate(age = 2013 - year) %>%
filter(n > 50, age < 30) %>%
ggplot(aes(x = age, y = avg_delay))+
ggbeeswarm::geom_quasirandom()+
geom_smooth()
```
Looks as though planes that are roughly 5 to 10 years old have higher delays... Let's look at same thing using boxplots.
```{r}
group_by(flights, tailnum) %>%
summarise(avg_delay = mean(arr_delay, na.rm=TRUE),
n = n()) %>%
left_join(planes, by="tailnum") %>%
mutate(age = 2013 - year) %>%
filter(n > 50, age <= 30, age >= 0) %>%
ggplot()+
geom_boxplot(aes(x = cut_width(age, 2, boundary = 0), y = avg_delay))+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
Perhaps there is not an overall trend association between age and delays, though it seems that the particular group of planes in that time range seem to have delays than either newer or older planes. On the other hand, there does almost look to be a seasonality pattern -- though this may just be me seeing things... perhaps worth exploring more...
A simple way to test for a non-linear relationship would be to discretize age and then pass it through an anova...
```{r}
nycflights13::flights %>%
select(arr_delay, tailnum) %>%
left_join(planes, by="tailnum") %>%
filter(!is.na(arr_delay)) %>%
mutate(age = 2013 - year,
age_round_5 = (5 * age %/% 5) %>% as.factor()) %>%
with(aov(arr_delay ~ age_round_5)) %>%
summary()
```
* There are weaknesses to using anova, but the low p-value above suggests test arrival delay is not randomly distributed across age
* The reason for such a difference may be trivial or may be confounded by a more interesting pattern... but these are deeper questions
1. *What weather conditions make it more likely to see a delay?*
There are a lot of ways you could have approached this problem. Below, I look at the average weather value for each of the groups `FALSE`, `TRUE` and `Canceled` -- `FALSE` corresponding with non-delayed flights, `TRUE` with delayed flights and `Canceled` with flights that were canceled. If I were feeling fancy, I would have also added the standard errors on these...
```{r}
flights_weath <- mutate(flights, delay_TF = dep_delay > 0) %>%
separate(sched_dep_time,
into = c("hour_sched", "min_sched"),
sep = -3,
remove = FALSE,
convert = TRUE) %>%
left_join(weather, by = c("origin", "year","month", "day", "hour_sched"="hour"))
flights_weath_gath <- flights_weath %>%
select(sched_dep_time, delay_TF, sched_dep_time, temp:visib) %>%
mutate(key = row_number()) %>%
gather(temp, dewp, humid, wind_dir, wind_speed, wind_gust, precip, pressure, visib,
key="weather", value="values")
flights_summarized <- flights_weath_gath %>%
group_by(weather, delay_TF) %>%
summarise(median_weath = median(values, na.rm = TRUE),
mean_weath = mean(values, na.rm = TRUE),
sum_n = sum(!is.na(values))) %>%
ungroup() %>%
mutate(delay_TF = ifelse(is.na(delay_TF), "Canceled", delay_TF),
delay_TF = forcats::as_factor(delay_TF, c(FALSE, TRUE, "Canceled")))
flights_summarized %>%
ggplot(aes(x = delay_TF, y = mean_weath, fill = delay_TF))+
geom_col()+
facet_wrap(~weather, scales = "free_y")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
While precipitation is the largest difference, my guess is that the standard error on this would be much greater day to day because as you can see the values are very low, so it could be that a few cases with a lot of rain may tick it up, but it may be tough to actually use this as a predictor...
1. *What happened on June 13 2013? Display the spatial pattern of delays,*
*and then use Google to cross-reference with the weather.*
```{r, eval = FALSE, include = FALSE}
worst <- filter(flights, month == 6, day == 13)
worst %>%
group_by(dest) %>%
summarise(delay = mean(arr_delay, na.rm = TRUE),
n = sum(!is.na(arr_delay))) %>%
filter(n > 5) %>%
inner_join(airports, by = c("dest" = "faa")) %>%
ggplot(aes(lon, lat)) +
borders("state") +
geom_point(aes(size = n, colour = delay)) +
coord_quickmap()
```
Looks like East coast is getting hammered and flights arriving from Atlanta an similar locations were very delayed. Guessing either weather issue, or problem in Atl or delta.
## 13.5 Filtering joins
### 13.5.1
1. *What does it mean for a flight to have a missing `tailnum`?*
All flights with a missing tailnum in the `flights` table were cancelled as you can see below.
```{r}
flights %>%
count(is.na(tailnum), is.na(arr_delay))
```
*What do the tail numbers that don't have a matching record in `planes` have in common?*
*(Hint: one variable explains ~90% of the problems.)*
```{r}
flights %>%
anti_join(planes, by="tailnum") %>%
count(carrier, sort = TRUE)
flights %>%
mutate(in_planes = tailnum %in% planes$tailnum) %>%
group_by(carrier) %>%
summarise(flights_inPlanes = sum(in_planes),
n = n(),
perc_inPlanes = flights_inPlanes / n) %>%
ungroup()
```
Some carriers do not have many of their tailnums data in the `planes` table. (Come back.)
1. *Filter flights to only show flights with planes that have flown at least 100 flights.*
```{r}
planes_many <- flights %>%
count(tailnum, sort=TRUE) %>%
filter(n > 100)
semi_join(flights, planes_many)
```
* `add_count()` is another helpful function that could have been used here
1. *Combine `fueleconomy::vehicles` and `fueleconomy::common` to find only the records for the most common models.*
```{r}
fueleconomy::vehicles %>%
semi_join(fueleconomy::common, by=c("make", "model"))
```
1. *Find the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the `weather` data. Can you see any patterns?*
First: Create two variables that together capture all 48 hour time windows across the year, at the day window of granularity (e.g. the time of day the flight takes off does not matter in establishing time windows for this example, only the day).
Second: Gather these time windows into a single dataframe (note that this will increase the length of your data by ~364/365 * 100 %)
Third: Group by `window_start_date` and calculate average `arr_delay` and related metrics.
```{r}
delays_windows <- flights %>%
#First
mutate(date_flight = lubridate::as_date(time_hour)) %>%
mutate(startdate_window1 = cut.Date(date_flight, "2 day")) %>%
mutate(date_flight2 = ifelse(!(date_flight == min(date_flight, na.rm = TRUE)), date_flight, NA),
date_flight2 = lubridate::as_date(date_flight2),
startdate_window2 = cut.Date(date_flight2, "2 day")) %>%
select(-date_flight, -date_flight2) %>%
#Second
gather(startdate_window1, startdate_window2, key = "start_window", value = "window_start_date") %>%
filter(!is.na(window_start_date)) %>%
#Third
group_by(window_start_date) %>%
summarise(num = n(),
perc_cancelled = sum(is.na(arr_delay)) / n(),
mean_delay = mean(arr_delay, na.rm = TRUE),
perc_delay = mean(arr_delay > 0, na.rm = TRUE),
total_delay_mins = sum(arr_delay, na.rm = TRUE)) %>%
ungroup()
#don't worry about warning of 'attributes are not identical...', that is
#because the cut function assigns attributes to the value, it's fine if
#these are dropped here.
```
Create tibble of worst 2-day period for mean `arr_delay`
```{r}
WorstWindow <- delays_windows %>%
mutate(mean_delay_rank = dplyr::min_rank(-mean_delay)) %>%
filter(mean_delay_rank <= 1)
WorstDates <- tibble(dates = c(lubridate::as_date(WorstWindow$window_start_date), lubridate::as_date(WorstWindow$window_start_date) + lubridate::duration(1, "days")))
```
Ammend weather data so that weather is an average across three NY locations rather than seperate for each ^[note that a weighted average based on traffic would be more appropriate here because the `mean_delay` values will be weighted by number of flights going through each -- hopefully lack of substantial difference between locatioins means this won't be too impactful...]
```{r}
weather_ammended <- weather %>%
mutate(time_hour = lubridate::make_datetime(year, month, day, hour)) %>%
select(-one_of("origin", "year", "month", "day", "hour")) %>%
group_by(time_hour) %>%
summarise_all(mean, na.rm = TRUE) %>%
ungroup()
```
Filtering join to just times weather for worst 2 days
```{r}
weather_worst <- weather_ammended %>%
mutate(dates = as_date(time_hour)) %>%
semi_join(WorstDates)
```
Plot of hourly weather values across 48 hour time windows.
```{r}
weather_worst %>%
select(-dates) %>%
gather(temp:visib, key = "weather_type", value = "weather_value") %>%
ggplot(aes(x = time_hour, y = weather_value))+
geom_line()+
facet_grid(weather_type ~ ., scales = "free_y")+
labs(title = 'Hourly weather values across worst 48 hours of delays')
```
Patterns:
* `wind_gust` and `wind_speed` are the same.
* See high level of colinearity in spikes and changes, e.g. increase in `precip` corresponds with decrease in `visib` and perhaps uptick in `wind_spee`
Perhaps, we want to view how the average hourly weather values compare on the worst days to average weather days. Create summary of average hourly weather values for worst 48 hour period, for average period, and then append these and plot.
```{r}
bind_rows(
weather_worst %>%
summarise_at(vars(temp:visib), mean, na.rm = TRUE) %>%
mutate(category = "weather on worst 48") %>%
gather(temp:visib, key = weather_type, value = weather_val)
,
weather_ammended %>%
summarise_at(vars(temp:visib), mean, na.rm = TRUE) %>%
mutate(category = "weather on average") %>%
gather(temp:visib, key = weather_type, value = weather_val)
) %>%
ggplot(aes(x = category, y = weather_val, fill = category))+
geom_col()+
facet_wrap(~weather_type, scales = "free_y")+
labs(title = "Hourly average weather values on worst 48 hour window of delays vs. hourly average weather across year",
caption = "Note that delays are based on mean(arr_delay, na.rm = TRUE)")
```
For this to be the worst 48 hour period, the weather doesn't actually seem to be as extreme as I would have guessed.
Let's add-in average `arr_delay` by planned departure time to this to see how the delay times throughout the day varied, to see if there was a surge or change in weather that led to the huge change in delays.
```{r}
flights %>%
mutate(dates = as_date(time_hour)) %>%
semi_join(WorstDates) %>%
group_by(time_hour) %>%
summarise(value = mean(arr_delay, na.rm = TRUE)) %>%
ungroup() %>%
mutate(value_type = "Mean_ArrDelay") %>%
bind_rows(
weather_worst %>%
select(-dates) %>%
gather(temp:visib, key = "value_type", value = "value")
) %>%
mutate(weather_attr = !(value_type == "Mean_ArrDelay"),
value_type = forcats::fct_relevel(value_type, "Mean_ArrDelay")) %>%
ggplot(aes(x = time_hour, value, colour = weather_attr))+
geom_line()+
facet_grid(value_type ~ ., scales = "free_y")+
labs(title = 'Hourly weather and delay values across worst 48 hours of delays')
```
Maybe that first uptick in precipitation corresponded with the increase in delay... but still, looks extreme like an incident caused this. I cheched the news and it looks like a plane was crash landed onto the tarmac at one of the airports on this day https://en.wikipedia.org/wiki/Southwest_Airlines_Flight_345#cite_note-DMN_Aircraft_Totaled_20160808-4 , I checked the incident time and it occurred at 17:45 Jul 22, looks like it overlaps with the time we see the uptick in delays.
I show plots and models of 48 hour time windows in a variety of other contexts and detail in [Appendix]
1. *What does `anti_join(flights, airports, by = c("dest" = "faa"))` tell you? What does `anti_join(airports, flights, by = c("faa" = "dest"))` tell you?*
* `anti_join(flights, airports, by = c("dest" = "faa"))` -- tells me the flight dests missing an airport
* `anti_join(airports, flights, by = c("faa" = "dest"))` -- tells me the airports with no flights coming to them
1. *You might expect that there's an implicit relationship between plane and airline, because each plane is flown by a single airline. Confirm or reject this hypothesis using the tools you've learned above.*
```{r}
tail_carr <- flights %>%
filter(!is.na(tailnum)) %>%
distinct(carrier, tailnum) %>%
count(tailnum, sort=TRUE)
tail_carr %>%
filter(n > 1)
```
You should reject that hypothesis, you can see that 17 `tailnum`s are duplicated on multiple carriers.
Below is code to show those 17 tailnums
```{r}
flights %>%
distinct(carrier, tailnum) %>%
filter(!is.na(tailnum)) %>%
group_by(tailnum) %>%
mutate(n_tail = n()) %>%
ungroup() %>%
filter(n_tail > 1) %>%
arrange(desc(n_tail), tailnum)
```
## Appendix
### 13.5.1.4
Graph all of these metrics at once using roughly the same method as used on 13.4.6 #4.
```{r}
delays_windows %>%
gather(perc_cancelled, mean_delay, perc_delay, key = value_type, value = val) %>%
mutate(window_start_date = lubridate::as_date(window_start_date)) %>%
ggplot(aes(window_start_date, val))+
geom_line()+
facet_wrap(~value_type, scales = "free_y", ncol = 1)+
scale_x_date(date_labels = "%b %d")+
labs(title = 'Measures of delay across 48 hour time windows')
```
Create 48 hour windows for weather data. Follow exact same steps as above.
```{r}
weather_windows <- weather_ammended %>%
mutate(date_flight = lubridate::as_date(time_hour)) %>%
mutate(startdate_window1 = cut.Date(date_flight, "2 day")) %>%
mutate(date_flight2 = ifelse(!(date_flight == min(date_flight, na.rm = TRUE)), date_flight, NA),
date_flight2 = lubridate::as_date(date_flight2),
startdate_window2 = cut.Date(date_flight2, "2 day")) %>%
select(-date_flight, -date_flight2) %>%
#Second
gather(startdate_window1, startdate_window2, key = "start_window", value = "window_start_date") %>%
filter(!is.na(window_start_date)) %>%
#Third
group_by(window_start_date) %>%
summarise_at(vars(temp:visib), mean, na.rm = TRUE) %>%
ungroup() %>%
select(-wind_gust)
```
Graph using same method as above...
```{r}
weather_windows %>%
gather(temp:visib, key = weather_type, value = val) %>%
mutate(window_start_date = lubridate::as_date(window_start_date)) %>%
ggplot(aes(x = window_start_date, y = val))+
geom_line()+
facet_wrap(~weather_type, ncol = 1, scales = "free_y")+
scale_x_date(date_labels = "%b %d")+
labs(title = 'Measures of weather across 48 hour time windows')
```
Connect delays and weather data
```{r}
weather_delay_joined <- left_join(delays_windows, weather_windows, by = "window_start_date") %>%
select(mean_delay, temp:visib, window_start_date) %>%
select(-dewp) %>% #is almost completely correlated with temp so removed one of them...
na.omit()
```
Plot of 48 hour window of weather scores against mean delay keeping intact order of observations
```{r}
weather_delay_joined %>%
gather(mean_delay, temp:visib, key = value_type, value = val) %>%
mutate(window_start_date = lubridate::as_date(window_start_date),
value_type = forcats::fct_relevel(value_type, "mean_delay")) %>%
ggplot(aes(x = window_start_date, y = val, colour = ! value_type == "mean_delay"))+
geom_line()+
facet_wrap(~value_type, scales = "free_y", ncol = 1)+
labs(colour = "Weather value", title = "Mean delay and weather value in 2-day rolling window")
```
Plot of mean_delay against weather type, each point representing a different 'window'
```{r}
weather_delay_joined %>%
gather(temp:visib, key = weather_type, value = weather_val) %>%
ggplot(aes(x = weather_val, y = mean_delay))+
geom_point()+
geom_smooth()+
facet_wrap(~weather_type, scales = "free_x")
```
In a sense, these plots are not really valid as they obscure the fact that each point is not an independent observation (because there is a high level of association with w/e the value was on a single day with what it was in the previous day). E.g. mean_delay has a correlation of ~ 0.68 with prior days value as shown below... This is often ignored and we can also ignore it for now as it gets into time series and things we don't need to worry about for now... but somthing to be aware...
```{r}
weather_delay_joined %>%
mutate(mean_delay_lag = lag(mean_delay)) %>%
select(mean_delay, mean_delay_lag) %>%
na.omit() %>%
cor()
```
Data is not Independent (as mentioned above) and many problems associated with this... but let's ignore this for now and just look at a few statisitics...
Can see below that raw correlation of `mean_delay` is highest with humid.
```{r}
weather_delay_joined %>%
select(-window_start_date) %>%
cor()
```
When accounting for other variables, see relationship with windspeed seems to emerge as important...
```{r}
weather_delay_joined %>%
select(-window_start_date) %>%
lm(mean_delay ~ ., data = .) %>%
summary()
```
There a variety of reasons^[Especially in cases where your observations are not independent] you may want to evaluate how the change in an attribute relates to the change in another attribute. In the cases below I plot the diffs for example:
*(average value on 2013-02-07 to 2013-02-08) - (average value on 2013-02-08 to 2013-02-09)*
Note that the time windows are not distinct but overlap by 24 hours.
If doing a thorough account of time-series you would do a lot more than I show below...
```{r}
weather_delay_joined %>%
gather(mean_delay, temp:visib, key = value_type, value = val) %>%
mutate(window_start_date = lubridate::as_date(window_start_date),
value_type = forcats::fct_relevel(value_type, "mean_delay")) %>%
group_by(value_type) %>%
mutate(value_diff = val - lag(val)) %>%
ggplot(aes(x = window_start_date, y = value_diff, colour = !value_type == "mean_delay"))+
geom_line()+
facet_wrap(~value_type, scales = "free_y", ncol = 1)+
labs(colour = "Weather value", title = "Plot of diffs in value")
```
Let's plot these diffs as a scatter plot now (no longer looking at the order in which the observations emerged)
```{r}
weather_delay_joined %>%
gather(temp:visib, key = weather_type, value = val) %>%
group_by(weather_type) %>%
mutate(weather_diff = val - lag(val),
delay_diff = mean_delay - lag(mean_delay)) %>%
ungroup() %>%
ggplot(aes(x = weather_diff, y = delay_diff))+
geom_point()+
geom_smooth()+
facet_wrap(~weather_type, scales = "free_x")+
labs(title = "scatter plot of diffs in value")
```
Let's look at the correlatioin and regression against these diffs
```{r}
diff_data <- weather_delay_joined %>%
gather(mean_delay, temp:visib, key = value_type, value = val) %>%
group_by(value_type) %>%
mutate(diff = val - lag(val)) %>%
ungroup() %>%
select(-val) %>%
spread(key = value_type, value = diff)
diff_data %>%
select(-window_start_date) %>%
na.omit() %>%
cor()
diff_data %>%
select(-window_start_date) %>%
lm(mean_delay ~ ., data = .) %>%
summary()
```