This repository has been archived by the owner on Feb 20, 2022. It is now read-only.

Commit

add data and cleaning scripts
allanbreyes committed Jan 31, 2015
1 parent b937bcb commit 28cf5ae
Showing 6 changed files with 50,498 additions and 0 deletions.
20 changes: 20 additions & 0 deletions p5/README.md
@@ -0,0 +1,20 @@
## [Something] Data Visualization
by Allan Reyes, in fulfillment of Udacity's [Data Analyst Nanodegree](https://www.udacity.com/course/nd002), Project 5

### Summary

`NotYetImplemented`

### Design

`NotYetImplemented`

### Feedback

`NotYetImplemented`

### Resources

- `NotYetImplemented`
- `NotYetImplemented`
- `NotYetImplemented`
50,008 changes: 50,008 additions & 0 deletions p5/data/334221194_112014_3544_airline_delay_causes.csv

Large diffs are not rendered by default.

103 changes: 103 additions & 0 deletions p5/data/data.Rmd
@@ -0,0 +1,103 @@
---
title: "Exploratory Data Analysis and Cleaning for RITA Flight Data"
author: "Allan Reyes"
date: "01/30/2015"
output: html_document
---

### About
This data set contains information on United States airline flight delays and performance. It includes all domestic flights from all carriers to and from major airports from June 2003 through November 2014.

[Download the data set from RITA.](http://www.transtats.bts.gov/OT_Delay/ot_delaycause1.asp?display=download&pn=0&month=11&year=2014)


### Load and inspect data
```{r}
# setwd('~/Dropbox/moocs/udacity-data-science/p5/data/')
df <- read.csv('334221194_112014_3544_airline_delay_causes.csv')
str(df)
summary(df)
head(df)
```
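
The cleaning step below refers to a column called `X.month`, which is not a name that appears in a typical CSV header. A likely explanation, sketched here under the assumption that the raw header is `" month"` with a leading space, is that `read.csv` repairs invalid names through `make.names()` when `check.names = TRUE` (its default). The three-column excerpt is hypothetical and only mimics the shape of the real file.

```r
# Hypothetical excerpt; the leading space in " month" is an assumption
# about the raw header, inferred from the `X.month` column used later.
raw <- '"year"," month","carrier"\n2003,6,"AA"\n2014,11,"DL"'

# read.csv applies make.names() to the header because check.names = TRUE
sample_df <- read.csv(text = raw)
print(names(sample_df))

# The same repair, applied directly to the header strings
fixed <- make.names(c("year", " month", "carrier"))
print(fixed)
```

Passing `check.names = FALSE` would keep the header as written, at the cost of needing backticks to reference the column.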

### Clean data
```{r}
library(dplyr)
# clean up date
df$date <- as.Date(paste(df$year, df$X.month, 1, sep='-'), format="%Y-%m-%d")
summary(df$date)
nrow(table(df$carrier))
# summarize arrivals, delays, cancellations, and diversions per carrier per month
ef <- df %>%
  group_by(date, year, carrier_name) %>%
  summarize(arrivals = sum(arr_flights),
            delayed = sum(arr_del15),
            cancelled = sum(arr_cancelled),
            diverted = sum(arr_diverted)) %>%
  transform(on_time = 1 - delayed/arrivals)
# drop rows with NA values
ef <- ef[complete.cases(ef),]
```
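
As a small worked check of the cleaning logic, the sketch below rebuilds a first-of-month date the same way the chunk above does and recomputes `on_time` for one row copied from `data.csv` (American Airlines, 2003). The `toy` data frame is hypothetical and only illustrates how `complete.cases` drops rows containing `NA`.

```r
# First-of-month date built from separate year and month values
first_of_month <- as.Date(paste(2014, 11, 1, sep = "-"), format = "%Y-%m-%d")

# One row from data.csv: American Airlines Inc., 2003
arrivals <- 309712
delayed  <- 60722
on_time  <- 1 - delayed / arrivals   # share of arrivals not counted as delayed

# Hypothetical rows: the second has a missing count and is dropped
toy  <- data.frame(arrivals = c(100, NA), delayed = c(20, 5))
kept <- toy[complete.cases(toy), ]

print(first_of_month)
print(on_time)
print(nrow(kept))
```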

### Exploratory Plots
```{r}
library(ggplot2)
ggplot(data = ef,
       aes(x = date, y = on_time)) +
  geom_line(aes(color = carrier_name))
```

### Find Airlines to Subset Data
```{r}
# aggregate by carrier name
agg <- ef %>%
  group_by(carrier_name) %>%
  summarize(monthly_avg = mean(arrivals),
            arrivals = sum(arrivals))
# keep carriers at or above the 0.81 quantile of monthly average arrivals
selected_carriers <- subset(agg, monthly_avg >= quantile(monthly_avg, 0.81))$carrier_name
selected_carriers
```
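
The probability passed to `quantile()` controls how many carriers survive the filter, so a small change in the cutoff can change the selection. The sketch below uses five made-up carriers (`A` through `E`, not from the data set) to compare a 0.75 cutoff with the 0.81 cutoff used in the chunk above.

```r
# Hypothetical carriers and monthly averages, for illustration only
agg_demo <- data.frame(
  carrier_name = c("A", "B", "C", "D", "E"),
  monthly_avg  = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

# Thresholds at two different probabilities (default quantile type)
cut_75 <- quantile(agg_demo$monthly_avg, 0.75)
cut_81 <- quantile(agg_demo$monthly_avg, 0.81)

# Carriers at or above each threshold
keep_75 <- subset(agg_demo, monthly_avg >= cut_75)$carrier_name
keep_81 <- subset(agg_demo, monthly_avg >= cut_81)$carrier_name

print(keep_75)
print(keep_81)
```

With these toy values the 0.75 cutoff keeps two carriers and the 0.81 cutoff keeps one, which is why the comment and the probability in the code should agree.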

### Reshape Data
```{r}
ff <- subset(ef, is.element(carrier_name, selected_carriers)) %>%
  group_by(year, carrier_name) %>%
  summarize(arrivals = sum(arrivals),
            delayed = sum(delayed),
            cancelled = sum(cancelled),
            diverted = sum(diverted)) %>%
  transform(on_time = 1 - delayed/arrivals)
ff <- ff[complete.cases(ff),]
```
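
The reshape step sums the monthly counts and then recomputes `on_time` from the yearly totals, instead of averaging the monthly `on_time` values. The two approaches can give different answers when monthly traffic varies, as this sketch with two hypothetical months suggests.

```r
# Two hypothetical months with very different traffic volumes
monthly <- data.frame(arrivals = c(100, 300), delayed = c(50, 30))

# Averaging the monthly rates weights both months equally
mean_of_rates <- mean(1 - monthly$delayed / monthly$arrivals)

# Pooling the counts first weights each month by its arrivals,
# which mirrors the summarize-then-transform order used above
pooled_rate <- 1 - sum(monthly$delayed) / sum(monthly$arrivals)

print(mean_of_rates)
print(pooled_rate)
```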

### Replot
```{r}
summary(df$year)
library(gridExtra)
p1 <- ggplot(data = ff,
             aes(x = year, y = on_time)) +
  geom_line(aes(color = carrier_name)) +
  scale_x_continuous(limits = c(2003, 2014), breaks = 2003:2014)
p2 <- ggplot(data = ff,
             aes(x = year, y = arrivals)) +
  geom_line(aes(color = carrier_name)) +
  scale_x_continuous(limits = c(2003, 2014), breaks = 2003:2014)
grid.arrange(p1, p2, ncol=1)
```

### Export New CSV
```{r}
write.csv(ff, file="data.csv", row.names=FALSE)
```
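
A quick way to confirm that the export is usable downstream is to write a small frame and read it back. The sketch below is only an illustration: it uses two rows copied from `data.csv` and a temporary file (`path` is a stand-in, not the project's `data.csv`).

```r
# Two rows copied from data.csv (2003)
demo <- data.frame(
  year         = c(2003, 2003),
  carrier_name = c("American Airlines Inc.", "Delta Air Lines Inc."),
  arrivals     = c(309712, 267529),
  delayed      = c(60722, 47922),
  stringsAsFactors = FALSE
)
demo$on_time <- 1 - demo$delayed / demo$arrivals

# Round trip through a temporary CSV file
path <- tempfile(fileext = ".csv")
write.csv(demo, file = path, row.names = FALSE)
back <- read.csv(path, stringsAsFactors = FALSE)

print(names(back))
print(all.equal(back$on_time, demo$on_time))
```

Keeping `row.names = FALSE` avoids an extra unnamed index column, which `read.csv` would otherwise load as `X`.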

61 changes: 61 additions & 0 deletions p5/data/data.csv
@@ -0,0 +1,61 @@
"year","carrier_name","arrivals","delayed","cancelled","diverted","on_time"
2003,"American Airlines Inc.",309712,60722,5158,821,0.80394043498476
2003,"Delta Air Lines Inc.",267529,47922,2738,463,0.820871755959167
2003,"Southwest Airlines Co.",232412,29846,1876,512,0.871581501815741
2003,"United Air Lines Inc.",265570,44876,2713,498,0.831020070038031
2003,"US Airways Inc.",166652,32277,2547,325,0.806320956244149
2004,"American Airlines Inc.",535501,115374,10273,1328,0.78454942194319
2004,"Delta Air Lines Inc.",478836,108709,7532,897,0.772972374675254
2004,"Southwest Airlines Co.",419488,78774,4044,831,0.812213936989854
2004,"United Air Lines Inc.",463654,86079,5945,861,0.814346473879229
2004,"US Airways Inc.",296370,57538,4988,660,0.805857542936195
2005,"American Airlines Inc.",516641,113270,7958,1429,0.780756850501606
2005,"Delta Air Lines Inc.",460446,100624,12786,962,0.781464058760419
2005,"Southwest Airlines Co.",462515,84881,3655,664,0.816479465530848
2005,"United Air Lines Inc.",410410,83389,5536,749,0.796815379742209
2005,"US Airways Inc.",315210,69314,6359,612,0.780102154119476
2006,"American Airlines Inc.",495581,114656,8059,1450,0.768643269213307
2006,"Delta Air Lines Inc.",375490,83154,6078,891,0.778545367386615
2006,"Southwest Airlines Co.",509294,96129,4051,959,0.811250476149336
2006,"United Air Lines Inc.",423296,99553,8958,949,0.764814692319323
2006,"US Airways Inc.",403203,89505,4673,922,0.778015044530919
2007,"American Airlines Inc.",488059,139629,14115,1859,0.713909588799715
2007,"Delta Air Lines Inc.",361758,79277,5435,861,0.780856263026664
2007,"Southwest Airlines Co.",551668,99522,4699,923,0.819598019098443
2007,"United Air Lines Inc.",410397,110145,10367,853,0.731613535186661
2007,"US Airways Inc.",389957,113568,7526,833,0.708767889792977
2008,"American Airlines Inc.",469834,127912,13817,1693,0.727750652358067
2008,"Delta Air Lines Inc.",348921,77282,5626,999,0.778511468212002
2008,"Southwest Airlines Co.",582546,106532,5980,1422,0.817126887833751
2008,"United Air Lines Inc.",374531,96491,8962,833,0.742368455481656
2008,"US Airways Inc.",369920,68498,5796,812,0.814830233564014
2009,"American Airlines Inc.",435079,89537,7546,1616,0.794205190321758
2009,"Delta Air Lines Inc.",340353,70198,4173,891,0.793749430738087
2009,"Southwest Airlines Co.",561302,91714,4240,1016,0.836604893622328
2009,"United Air Lines Inc.",293241,49269,5173,592,0.831984613338517
2009,"US Airways Inc.",345672,61923,4626,620,0.820861973200028
2010,"American Airlines Inc.",427624,77048,7636,1662,0.819823022094176
2010,"Delta Air Lines Inc.",569124,116511,12047,1289,0.795280114702596
2010,"Southwest Airlines Co.",564941,105611,6212,1163,0.81305835476625
2010,"United Air Lines Inc.",294674,38325,4539,651,0.869941019567386
2010,"US Airways Inc.",340420,51614,5686,573,0.848381411197932
2011,"American Airlines Inc.",428270,80606,11063,1727,0.811786956826301
2011,"Delta Air Lines Inc.",565635,92728,8406,1127,0.836063892793056
2011,"Southwest Airlines Co.",590383,102154,6219,1288,0.826969950015498
2011,"United Air Lines Inc.",268909,48796,4369,508,0.818540844672361
2011,"US Airways Inc.",316900,56474,5525,593,0.821792363521616
2012,"American Airlines Inc.",421678,84397,7906,1523,0.799854391265373
2012,"Delta Air Lines Inc.",561333,72865,3170,1018,0.870192915791518
2012,"Southwest Airlines Co.",584780,93970,5199,1045,0.839307089845754
2012,"United Air Lines Inc.",455573,96177,6819,937,0.78888784014856
2012,"US Airways Inc.",344817,45286,3807,426,0.86866656806364
2013,"American Airlines Inc.",432100,84058,7673,1540,0.805466327239065
2013,"Delta Air Lines Inc.",579437,87503,2036,1071,0.848986171059149
2013,"Southwest Airlines Co.",573008,129990,4521,1317,0.773144528523162
2013,"United Air Lines Inc.",437935,86050,4421,1027,0.803509653259045
2013,"US Airways Inc.",351975,61817,3834,518,0.82437104908019
2014,"American Airlines Inc.",395256,84299,6503,1618,0.786723035197442
2014,"Delta Air Lines Inc.",556903,88229,4821,1062,0.841572051147148
2014,"Southwest Airlines Co.",558117,139551,8348,1671,0.749961029676573
2014,"United Air Lines Inc.",393274,85653,6195,967,0.782205281813697
2014,"US Airways Inc.",325894,56811,6014,627,0.825676446942871
306 changes: 306 additions & 0 deletions p5/data/data.html

Large diffs are not rendered by default.

Binary file added p5/data/target_plots.pdf
Binary file not shown.
