This repository has been archived by the owner on Apr 18, 2018. It is now read-only.
---
title: "Runway Usage at SFO"
author: "Jens Preussner"
date: "Wednesday, September 16, 2015"
output: html_document
---
In this exercise, we will use the R packages `dplyr`, `reshape2` and `ggplot2` to tidy and visualize the [Late Night Preferential Runway Use Data](http://www.flysfo.com/media/noise-abatement-data) from San Francisco Airport (SFO).
## Introduction
SFO’s Nighttime Preferential Runway Use program was developed in 1988. Although the program cannot be used 100% of the time because of winds, weather, and other operational factors, the Airport, the Community Roundtable, the FAA, and the Airlines have all worked together to maximize its use when conditions permit. The main focus of this program is to maximize flights over water and minimize flights over land and populated areas between 1 am and 6 am. Fortunately, because airport activity levels are lower late at night, it is feasible to use over-water departure procedures more frequently than would be possible during the day.
## Data download and extraction
Use the Shell to download and extract the data from the web:
```shell
wget http://media.flysfo.com/media/sfo/Late_Night_Preferential_Runway_use.zip
unzip Late_Night_Preferential_Runway_use.zip
```
The accompanying PDF file inside the `Late_Night_Preferential_Runway_use` directory explains the data semantics:
Variable                       Definition
--------                       ----------
Year                           The year of the aircraft departure
Month                          The month of the aircraft departure
01L/R                          The number of aircraft departing on specified runway
10L/R                          The number of aircraft departing on specified runway
19L/R                          The number of aircraft departing on specified runway
28L/R                          The number of aircraft departing on specified runway
01L/R Percent of Departures    Percentage of monthly departures on specified runway
10L/R Percent of Departures    Percentage of monthly departures on specified runway
19L/R Percent of Departures    Percentage of monthly departures on specified runway
28L/R Percent of Departures    Percentage of monthly departures on specified runway
Now open the Excel file `Late_Night_Preferential_Runway_Use_Data_200501-201412.xlsx` and switch to the worksheet titled *Raw LNPRU Data*. Export the data as CSV.
*Hint*: tbd.
## Exercises
1. Identify variables and observations in the LNPRU data set.
2. Draw an outline of a tidy table from Ex. 1.
3. Write an R script to tidy the table in the CSV file you created above.
4. Create visualizations from the tidy dataset:
    + Plot departure counts for all runways over the course of the year (Jan-Dec). *Hint: Think of a proper way to summarize counts over years.*
## Solution to exercises
### Identifying variables and observations
A tidy dataset is a collection of **values** organised into **variables** and **observations**. Let's start with variables: a **variable** contains all values that measure the same underlying attribute across units. Five attributes can be spotted easily:
* Year
* Month
* Runway
* Departures
* Percent of Departures
**Observations** contain all values measured on the same unit. The LNPRU data set contains monthly observations ranging from January 2005 to December 2014, so each observation is one runway in one month. This leads to the following layout of a tidy LNPRU dataset:
Year  Month  Runway  Departures  Percent of Departures
----  -----  ------  ----------  ---------------------
2005  1      01L/R   14          6
2005  1      10L/R   164         71
2005  1      19L/R   0           0
2005  1      28L/R   49          21
### Tidy the LNPRU data set in R
Now open RStudio, load the packages we'll use and navigate to the LNPRU data folder:
```{r message=FALSE}
library("dplyr")
library("reshape2")
library("ggplot2")
```
```{r eval=FALSE}
setwd("Late_Night_Preferential_Runway_use/")
```
Start by reading in the CSV file you created from the *Raw LNPRU Data* Excel worksheet and have a look at its structure. The call below assumes a semicolon-separated export; adjust `sep` if your export uses commas.
```{r eval=FALSE}
raw_lnpru = read.csv(file = "Late_Night_Preferential_Runway_Use_Data_200501-201412.csv", header = TRUE, sep = ";")
```
```{r include = FALSE}
source("00-set-data-dir.R")
# Download the CSV once if it is not already in the data directory
if (!file.exists(file.path(data_dir, "Late_Night_Preferential_Runway_Use_Data_200501-201412.csv"))) {
  download.file(paste0("https://raw.githubusercontent.com/jenzopr/",
                       "R-tidy-data-LoR/sf-runway-example/04-runway-usage_files/Late_Night_Preferential_Runway_Use_Data_200501-201412.csv"),
                destfile = file.path(data_dir, "Late_Night_Preferential_Runway_Use_Data_200501-201412.csv"),
                method = "curl")
}
raw_lnpru = read.csv(file = file.path(data_dir, "Late_Night_Preferential_Runway_Use_Data_200501-201412.csv"), header = TRUE, sep = ";")
```
```{r}
names(raw_lnpru)
```
The `names` command reveals that the column headers are actually values, not variable names. Since we don't need the *precalculated* departure percentages, we exclude them by selecting only the columns containing count values prior to melting:
```{r}
lnpru = select(raw_lnpru, Year, Month, X01L.R, X10L.R, X19L.R, X28L.R)
```
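The column names used in `select` differ from the worksheet headers because `read.csv` converts headers into syntactically valid R names by default. The following sketch illustrates that conversion with `make.names`; the header strings are taken from the variable table above, and the assumption is that the exported CSV uses exactly those headers.

```r
# Headers as listed in the variable table (assumed to match the CSV export)
headers <- c("Year", "Month", "01L/R", "01L/R Percent of Departures")

# read.csv applies this conversion when check.names = TRUE (the default)
mangled <- make.names(headers)
print(mangled)

# Keep the id columns and the count columns, drop the percentage columns
keep <- mangled[!grepl("Percent", mangled)]
print(keep)
```

Names that start with a digit get an `X` prefix, and characters such as `/` or spaces are replaced by dots, which is why runway `01L/R` shows up as `X01L.R`.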
Now we can **melt** the dataset into a tidy version, keeping `Year` and `Month` as id variables and using `Runway` and `Departures` as variable and value names:
```{r}
lnpru = melt(lnpru, id.vars = c("Year", "Month"), variable.name = "Runway", value.name = "Departures")
```
The last command gives us the tidy version of the LNPRU dataset:
```{r echo=FALSE}
head(lnpru)
```
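To see what `melt` does on a small scale, here is a minimal sketch on a one-row wide table. The counts are the January 2005 values from the tidy layout shown earlier; the `wide` and `tidy` names and the final relabeling step are illustrative additions, not part of the original solution.

```r
library(reshape2)

# Toy wide table: one row per month, one column per runway
wide <- data.frame(
  Year = 2005, Month = 1,
  X01L.R = 14, X10L.R = 164, X19L.R = 0, X28L.R = 49
)

# Each runway column becomes one row, identified by Year and Month
tidy <- melt(wide, id.vars = c("Year", "Month"),
             variable.name = "Runway", value.name = "Departures")

# Optional: turn the converted column names back into runway labels
tidy$Runway <- sub("^X(.*)\\.R$", "\\1/R", as.character(tidy$Runway))
print(tidy)
```

The single wide row turns into four rows, one per runway, which matches the tidy layout of one observation per runway and month.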
### Create visualizations from a tidy dataset
#### 1. Departure counts for all runways over the course of a year
The `aggregate` function can be used to summarize counts for each month across all years. We can think of three functions to use along with `aggregate`:
* `mean` results in the mean departure count per month
* `median` results in the median departure count per month
* `sum` results in the sum of all departures in a given month
```{r}
per_month = aggregate(Departures ~ Month * Runway, data = lnpru, FUN = median)
```
This gives us a data frame with aggregated departures per month and runway:
```{r echo=FALSE}
head(per_month)
```
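The choice of summary function matters when individual years deviate strongly. The sketch below compares `mean`, `median` and `sum` on a small, entirely hypothetical data frame (`toy`, with made-up counts for one runway, two months and three years each); it is not derived from the LNPRU data.

```r
# Hypothetical toy data: one runway, two months, three yearly counts each
toy <- data.frame(
  Month = rep(c(1, 2), each = 3),
  Runway = "10L/R",
  Departures = c(100, 120, 200, 90, 90, 150)
)

by_mean   <- aggregate(Departures ~ Month + Runway, data = toy, FUN = mean)
by_median <- aggregate(Departures ~ Month + Runway, data = toy, FUN = median)
by_sum    <- aggregate(Departures ~ Month + Runway, data = toy, FUN = sum)

# Put the three summaries side by side
comparison <- data.frame(
  Month  = by_mean$Month,
  mean   = by_mean$Departures,
  median = by_median$Departures,
  sum    = by_sum$Departures
)
print(comparison)
```

In this toy data the single high count in each month pulls the mean above the median, which is one reason to prefer `median` when a few unusual years should not dominate the picture.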
**Creating a barplot**
```{r tidy=TRUE}
ggplot(per_month, aes(x=factor(Month), y=Departures, color=Runway, fill=Runway)) + geom_bar(stat="identity",position="dodge")
```
**Creating a smoothed line plot**
```{r tidy=TRUE}
ggplot(per_month, aes(x = factor(Month), y = Departures, group = Runway, color = Runway)) +
  geom_point(shape = 1) +
  geom_smooth(se = TRUE, method = "loess", level = 0.95) +
  scale_x_discrete(labels = as.character(sort(unique(per_month$Month)))) +
  theme(axis.title.x = element_blank(), plot.title = element_text(face = "bold")) +
  ggtitle("Monthly SFO runway usage at night")
```
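The plots above label the x axis with month numbers. As a possible refinement, the sketch below maps the numbers to abbreviated month names using R's built-in `month.abb`. It assumes `Month` holds the integers 1 to 12; the `per_month_toy` data frame and its counts are hypothetical, and the plot is only drawn if `ggplot2` is installed.

```r
# Hypothetical aggregated data for a single runway
per_month_toy <- data.frame(
  Month = 1:12,
  Runway = "10L/R",
  Departures = c(160, 150, 170, 165, 180, 175, 190, 185, 170, 160, 155, 150)
)

# Ordered factor so the axis runs Jan to Dec instead of alphabetically
per_month_toy$MonthName <- factor(per_month_toy$Month,
                                  levels = 1:12, labels = month.abb)
print(levels(per_month_toy$MonthName))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(per_month_toy, aes(x = MonthName, y = Departures, fill = Runway)) +
    geom_col(position = "dodge")
  print(p)
}
```

Setting the factor levels explicitly keeps the months in calendar order, which a plain character column would not.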