04-runway-usage.Rmd

---
title: "Runway Usage at SFO"
author: "Jens Preussner"
date: "Wednesday, September 16, 2015"
output: html_document
---

In this exercise, we will use the R packages `reshape2` and `ggplot2` to tidy and visualize the [Late Night Preferential Runway Use Data](http://www.flysfo.com/media/noise-abatement-data) from San Francisco Airport (SFO).

## Introduction

SFO’s Nighttime Preferential Runway Use program was developed in 1988. Although the program cannot be used 100% of the time because of winds, weather, and other operational factors, the Airport, the Community Roundtable, the FAA, and the Airlines have all worked together to maximize its use when conditions permit. The main focus of this program is to maximize flights over water and minimize flights over land and populated areas between 1am and 6am. Fortunately, because airport activity levels are lower late at night, it is feasible to use over water departure procedures more frequently than would be possible during the day.

## Data download and extraction

Use the Shell to download and extract the data from the web:

```shell
wget http://media.flysfo.com/media/sfo/Late_Night_Preferential_Runway_use.zip
unzip Late_Night_Preferential_Runway_use.zip
```

The accompanying PDF file inside the `Late_Night_Preferential_Runway_use` directory explains the data semantics:

Variable                      Definition
--------                      ----------
Year                          The year of the aircraft departure
Month                         The month of the aircraft departure
01L/R                         The number of aircraft departing on specified runway
10L/R                         The number of aircraft departing on specified runway
19L/R                         The number of aircraft departing on specified runway
28L/R                         The number of aircraft departing on specified runway
01L/R Percent of Departures   Percentage of monthly departures on specified runway
10L/R Percent of Departures   Percentage of monthly departures on specified runway
19L/R Percent of Departures   Percentage of monthly departures on specified runway
28L/R Percent of Departures   Percentage of monthly departures on specified runway

Now open the Excel file `Late_Night_Preferential_Runway_Use_Data_200501-201412.xlsx` and switch to the worksheet entitled with *Raw LNPRU Data*. Export the data as CSV.

*Hint*: tbd.

## Exercises

1. Identify variables and observations in the LNPRU data set
2. Draw an outline of a tidy table from Ex. 1.
3. Write an R script to tidy the table in the CSV file you created above.
4. Create visulaizations from the tidy dataset:
    + Plot departure counts for all runways for the course of the year (Jan-Dec). *Hint: Think of a proper way to summarize counts over years.*


## Solution to exercises

### Identifying variables and observations

A tidy datasets is a collection of **values** organised into **variables** and **observations**. Lets start with variables: A **variable** contains all values that measure the same underlying attribute across units. Obviously, there are five attributes that can be spotted easily:

* Year
* Month
* Runway
* Departures
* Percent of Departures

**Observations** contain all values measured on the same unit. The LNPRU data set contains monthly observations ranging from beginning of 2005 to late 2014. This knowledge ultimately leads to the layout of a tidy LNPRU dataset: 

Year   Month   Runway   Departures   Percent of Departures
----   -----   ------   ----------   ---------------------
2005   1       01L/R    14           6
2005   1       10L/R    164          71
2005   1       19L/R    0            0
2005   1       28L/R    49           21

### Tidy the LNPRU data set in R

Now open RStudio, load the packages we'll use and navigate to the LNPRU data folder:

```{r message=FALSE}
library("dplyr")
library("reshape2")
library("ggplot2")
```

```{r eval=FALSE}
setwd("Late_Night_Preferential_Runway_use/")
```

Start with reading in the CSV file you created from the *Raw LNPRU Data* Excel worksheet and have a look at its structure.

```{r eval=FALSE}
raw_lnpru = read.csv(file = "Late_Night_Preferential_Runway_Use_Data_200501-201412.csv",header = T, sep = ";" )
```

```{r include = FALSE}
source("00-set-data-dir.R")
if(!file.exists(file.path(data_dir, "Late_Night_Preferential_Runway_Use_Data_200501-201412.csv"))) {
  download.file(paste0("https://raw.githubusercontent.com/jenzopr/",
                       "R-tidy-data-LoR/sf-runway-example/04-runway-usage_files/Late_Night_Preferential_Runway_Use_Data_200501-201412.csv"), 
                destfile = file.path(data_dir, "Late_Night_Preferential_Runway_Use_Data_200501-201412.csv"),
                method = "curl")
}
raw_lnpru = read.csv(file = file.path(data_dir, "Late_Night_Preferential_Runway_Use_Data_200501-201412.csv"),header = T, sep = ";" )
```

```{r}
names(raw_lnpru)
```

The `names` command reveals that the column headers are actually values, not variable names. Since we don't need the *precalculated* departure percentages, we exclude them by selecting only the columns containing count values prior to melting:

```{r}
lnpru = select(raw_lnpru, Year, Month, X01L.R, X10L.R, X19L.R, X28L.R)
```

Now we can **melt** the dataset into a tidy version, keeping `Year` and `Month` as id variables and using `Runway` and `Departures` as variable and value names:

```{r}
lnpru = melt(lnpru, id.vars = c("Year", "Month"), variable.name = "Runway", value.name = "Departures")
```

The last command gives us the tidy version of the LNPRU dataset:

```{r echo=FALSE}
head(lnpru)
```

### Create visualizations from a tidy dataset

#### 1. Departure counts for all runways for the course of a year

The `aggregate` function ca be used to summarize counts for each month. We can think of three functions to use along with `aggregate`:

* `mean` results in the mean departure count per month
* `median`results in the median departure count per month
* `sum` results in the sum of all dpeartures in a given month

```{r}
per_month = aggregate(Departures ~ Month * Runway, data = lnpru, FUN = median)
```

This gives us a datafram with aggregated departures per month and runway:

```{r echo=FALSE}
head(per_month)
```

**Creating a barplot**

```{r tidy=TRUE}
ggplot(per_month, aes(x=factor(Month), y=Departures, color=Runway, fill=Runway)) + geom_bar(stat="identity",position="dodge")
```

**Creating a smoothed line plot**

```{r tidy=TRUE}
ggplot(per_month, aes(x=factor(Month),y=Departures,group=Runway,color=Runway)) +
  geom_point(shape=1) +
  geom_smooth(se=T,method="loess",level=0.95) +
  scale_x_discrete(labels=as.character(per_month$Month)) +
  theme(axis.title.x = element_blank(), plot.title = element_text(face="bold")) +
  ggtitle("Monthly SFO runway usage at night")
```