# Lesson 02 - Reading Data
## Presentation of flight data

For these lessons, we will use flight data from the [Airline On-Time Statistics](http://stat-computing.org/dataexpo/2009/the-data.html) data set.

We are studying data on flights within the USA, restricting ourselves to all flights departing from Chicago O'Hare airport (ORD) in 2007. The dataset is stored as a `csv` file: each row holds information for a
single flight, and the columns represent:

| Column           | Description                        |
|------------------|------------------------------------|
|**Year**          | year of flight                     |
|**Month**         | month of flight                    |
|DayofMonth        | day of flight                      |
|**DayOfWeek**     | week day of flight                 |
|DepTime           | actual departure time (local hhmm) |
|**CRSDepTime**    | scheduled departure time           |
|ArrTime           | actual arrival time (local hhmm)   |
|CRSArrTime        | scheduled arrival time             |
|UniqueCarrier     | airline code                       |
|FlightNum         | flight number                      |
|TailNum           | plane tail number                  |
|ActualElapsedTime | elapsed time (minutes)             |
|CRSElapsedTime    | scheduled elapsed time (minutes)   |
|AirTime           | time in air (minutes)              |
|ArrDelay          | arrival delay (minutes)            |
|**DepDelay**      | departure delay (minutes)          |
|Origin            | origin airport                     |
|**Dest**          | destination airport                |
|Distance          | distance (miles)                   |
|TaxiIn            | taxi in time (minutes)             |
|TaxiOut           | taxi out time (minutes)            |
|Cancelled         | cancelled (1=yes/0=no)             |
|CancellationCode  | reason (A = carrier, B = weather, C = NAS, D = security)|
|Diverted          | diverted (1=yes/0=no)              |
|CarrierDelay      | delay caused by airline (minutes)  |
|WeatherDelay      | delay caused by weather (minutes)  |
|NASDelay          | delay caused by NAS (minutes)      |
|SecurityDelay     | delay caused by security (minutes) |
|LateAircraftDelay | delay caused by late aircraft (minutes) |
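
The time columns marked `hhmm` pack hours and minutes into a single number (for example, `730` for 07:30). As a minimal sketch, assuming such values are read as integers, they can be split with integer division and remainder. The vector `dep_time` below is made-up illustration data, not taken from the flight file.

```r
# Hypothetical departure times in hhmm form (illustration only)
dep_time <- c(5L, 730L, 1545L, 2359L)

hours   <- dep_time %/% 100L   # integer division keeps the hour part
minutes <- dep_time %% 100L    # remainder keeps the minute part

# Zero-padded HH:MM strings
dep_clock <- sprintf("%02d:%02d", hours, minutes)
print(dep_clock)
```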

## How to load data from a CSV file
* Use the `read.csv()` function; it returns a data frame
  * Use the `str()` function to get information about each column
  * Use the `summary()` function to get statistics for each column

In [None]:
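# Since R 4.0.0, read.csv() keeps text columns as character vectors by default;
# stringsAsFactors = TRUE reads them as factors, which the rest of this lesson relies on
delays07 <- read.csv('data/2007_ORD.csv', stringsAsFactors = TRUE)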
head(delays07)

In [None]:
str(delays07)
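
The same three functions can be tried without the flight file. The sketch below writes a tiny CSV to a temporary file and reads it back; the column values and the name `flights` are invented for illustration.

```r
# Write a tiny hypothetical CSV to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("Dest,DepDelay",
             "LGA,12",
             "DFW,-3",
             "LGA,45",
             "SFO,0"), tmp)

# stringsAsFactors = TRUE turns text columns into factors
# (the default is FALSE since R 4.0.0)
flights <- read.csv(tmp, stringsAsFactors = TRUE)

str(flights)       # type and first values of each column
summary(flights)   # counts for the factor column, quantiles for the numeric one
```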

## Factors
* Factors hold categorical data: every value belongs to a limited set of "levels"
  * Get levels with `levels()`
  * Get number of levels with `nlevels()`
  * Example: from the [Data Carpentry workshop](http://www.datacarpentry.org/R-ecology-lesson/02-starting-with-data.html#factors)

In [None]:
food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))

sprintf("There are %d levels: %s", nlevels(food), paste(levels(food), collapse=", "))
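
One detail worth seeing in practice: when no order is given, `factor()` sorts the levels alphabetically. A small sketch using the same values shows how the `levels` argument sets a more meaningful order; the name `food_ordered` is introduced here only for illustration.

```r
# By default the levels are sorted alphabetically
food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
print(levels(food))

# Passing `levels` explicitly fixes the order of the categories
food_ordered <- factor(food, levels = c("low", "medium", "high"))
print(levels(food_ordered))

# Counts are reported in level order
print(table(food_ordered))
```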

* Note: the `summary()` function will compute quantiles for numeric data, but frequencies for factors
  * Use the `as.factor()` function to convert numeric values to a factor
  * Example: in the `airquality` data, `Month` is stored as a number but is really a category

In [None]:
library(datasets)
summary(airquality$Month)

In [None]:
airquality$Month <- as.factor(airquality$Month)

sprintf("There are %d levels: %s", nlevels(airquality$Month), paste(levels(airquality$Month), collapse=" "))
summary(airquality$Month)

## Exercise 1 - Most Problematic Destinations
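* Select the `Dest` column from `delays07`
* Use the `head()` function to retrieve the top-10 destinations
  * Hint: the `summary()` function sorts a factor's levels in descending order of frequency only when the factor has more levels than `maxsum` (100 by default); otherwise, wrap the result in `sort(..., decreasing = TRUE)`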

## Tables and Bar-plot
* The function `table()` counts the occurrences of each level in a factor and returns a table
* The function `as.table()` can convert a named numeric vector (such as the summary of a factor) or a matrix to a table
* The function `barplot()` can display the corresponding data

In [None]:
table(airquality$Month)
barplot(as.table(summary(airquality$Month)), main="Days in months")
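
The same functions combine to keep only the most frequent levels before plotting. This is a sketch on made-up data: the factor `dest` and the cut-off of two levels are illustrative assumptions, not values from the flight data.

```r
# Hypothetical destination codes standing in for a factor column
dest <- factor(c("LGA", "DFW", "LGA", "SFO", "LGA", "DFW", "BOS"))

counts <- table(dest)                              # occurrences per level
top <- head(sort(counts, decreasing = TRUE), 2)    # two most frequent levels
print(top)

barplot(top, main = "Most frequent destinations")
```

Sorting explicitly with `sort()` avoids relying on the order in which `summary()` happens to return the levels.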

## Exercise 2 - Most Problematic Destinations
* From `Dest` in `delays07`, show the top-10 most problematic destinations in a bar-plot
  * Hint: convert the `head()` of the `summary()` to a table with `as.table()`