## The Wonderful World of ML - Session 6: Data Wrangling in R - Part 1

### Why focus on data wrangling?

In the first of these sessions, we jumped right into a few of the basic ML algorithms: linear regression, logistic regression, LDA, QDA, and decision trees.  In this session we'll step back and look at data wrangling.  It's an important topic because the literature suggests [[1]](https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf) that it consumes roughly 80% of the time spent doing an analysis.

If it takes this much time just to get the data into a usable form, it makes sense to spend a little time building some skills in this area.




### Connect to example database

We'll use the data from the dplyr tutorial given by Hadley Wickham here: [[2]](https://www.youtube.com/watch?v=8SGif63VW6E), [[3]](https://www.youtube.com/watch?v=Ue08LVuk790).  The database has been downloaded to the project so we can connect to it locally.  Here's a bit of code to make the connection and list the tables in the database:

In [6]:
suppressMessages(suppressWarnings(library(DBI)))
# install RSQLite only if it is missing, so the cell can be re-run safely
if (!requireNamespace("RSQLite", quietly = TRUE)) install.packages("RSQLite")
suppressMessages(suppressWarnings(library(RSQLite)))
# connect to the sqlite file - the notebook lives in the notebooks dir of the project,
# so we need to go up 2 levels to get to the data dir
con <- dbConnect(SQLite(), dbname = "../../data/2014_dplyr_flight_data/hflights.sqlite3")
# get a list of all tables
alltables <- dbListTables(con)
alltables

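If the SQLite file is not at hand, the same DBI calls can be tried against a throwaway database. This is only a sketch: the in-memory database and the three-row `toy_flights` table are made up for illustration and are not part of the tutorial data.

```r
library(DBI)

# Assumption: an in-memory database stands in for hflights.sqlite3
con_demo <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")

# A made-up three-row table, only for illustration
toy_flights <- data.frame(
  dest      = c("SFO", "OAK", "DFW"),
  dep_delay = c(5, 75, -2),
  stringsAsFactors = FALSE
)
dbWriteTable(con_demo, "flights", toy_flights)

# names of all tables in the database
tables_demo <- dbListTables(con_demo)
tables_demo

# release the connection when finished
dbDisconnect(con_demo)
```

Closing the connection with `dbDisconnect()` is a good habit for the real database too.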


### All roads lead to Rome - Many ways to get the same result

A typical scenario I run into when starting an analysis boils down to a series of choices that fall somewhere between the following two extremes:

+ Write (what often turns out to be) a complex SQL query that gives you exactly what you need: both content and structure.
+ Write a simple SQL query that gets you the basic data content, but not in the right structure.  Then do a bunch of wrangling with dplyr to get it into the right form.

This trade-off isn't apparent until you have to do one or more joins to get the kind of dataframe you need to work on.  We'll show such an example, but for now, we'll keep it simple while we cover the basics.
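Even without joins, the two extremes can be sketched on a toy table. The in-memory database and its four rows are assumptions made for illustration; both routes should produce the same per-destination averages.

```r
library(DBI)
suppressMessages(library(dplyr))

# Assumption: a made-up table in an in-memory database
con_demo <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(con_demo, "flights", data.frame(
  dest      = c("SFO", "SFO", "OAK", "DFW"),
  dep_delay = c(10, 30, 60, 5),
  stringsAsFactors = FALSE
))

# Route 1: let SQL deliver both content and structure
via_sql <- dbGetQuery(con_demo, paste0(
  "SELECT dest, AVG(dep_delay) AS mean_delay ",
  "FROM flights GROUP BY dest ORDER BY dest"))

# Route 2: pull the raw rows with a simple query, then reshape with dplyr
raw       <- dbGetQuery(con_demo, "SELECT dest, dep_delay FROM flights")
grouped   <- group_by(raw, dest)
via_dplyr <- arrange(summarise(grouped, mean_delay = mean(dep_delay)), dest)

dbDisconnect(con_demo)

via_sql
via_dplyr
```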

### SQL Basics

Let's make a simple SQL call from R to our database and take a look at the resulting dataframe.

In [None]:
query_string <- "SELECT * FROM flights"  # get all the records in the flights table
# Fetch all query results into a data frame:
flight_data <- dbGetQuery(con, query_string)
head(flight_data)
nrow(flight_data)

# in case we need the df from the csv
#library(dplyr)
#path_flight_data <- "D:\\dev\\WonderfulML\\data\\2014_dplyr_flight_data\\flights.csv"
#flights_csv <- tbl_df(read.csv(path_flight_data, stringsAsFactors = FALSE))
#flights_csv$date <- as.Date(flights_csv$date)

Let's fix those dates.  This is the kind of thing you'll typically run into when you start working with a new dataset.  After reading a little of the SQLite docs on date types [[7]](https://www.sqlite.org/datatype3.html#date_and_time_datatype), we discover that there is no explicit "date" type.  Instead, dates are stored as TEXT, INTEGER, or REAL values.

So how do we find out what type each field has?  Again, you need some database-specific info, which you can get from the docs [[8]](http://www.sqlite.org/pragma.html#pragma_table_info).  To save some time, you can submit a query structured like this:

In [None]:
data_types_flights <- dbGetQuery(con, "Pragma table_info (flights)")
data_types_flights
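To see what `PRAGMA table_info` returns without the real database, here is a small sketch on a throwaway table. The table definition is hypothetical and is not the schema of the tutorial's `flights` table.

```r
library(DBI)

con_demo <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")

# Hypothetical schema, for illustration only
dbExecute(con_demo,
          "CREATE TABLE flights (date REAL, dest TEXT, dep_delay INTEGER)")

# One row per column; the declared type is in the `type` field
info <- dbGetQuery(con_demo, "PRAGMA table_info(flights)")
info[, c("name", "type")]

dbDisconnect(con_demo)
```

The result also carries `cid`, `notnull`, `dflt_value`, and `pk` columns, which help when checking keys and defaults.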

In [None]:
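# The offsets '6682 years' and '38 days' shift the stored number into a calendar date.
# (Likely reason: the column holds days since 1970-01-01, while SQLite's date()
# reads a bare number as a Julian day.)
query_string <- paste0("SELECT date(date, '6682 years', '38 days') as DATE, ",
                       "hour, minute, dep, arr, dep_delay, arr_delay, carrier, ",
                       "flight, dest, plane, cancelled, time, dist FROM flights")
flight_data <- dbGetQuery(con, query_string)
head(flight_data)
msg <- paste0("type before conversion: ", class(flight_data$DATE))
msg
suppressMessages(suppressWarnings(library(dplyr)))
flight_data <- mutate(flight_data, DATE = as.Date(DATE))
msg <- paste0("type AFTER conversion: ", class(flight_data$DATE))
msg
# The dplyr examples below call this data frame `flights` and use a lowercase `date` column
flights <- rename(flight_data, date = DATE)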
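An alternative is to pull the raw number and convert it in R. The sketch below assumes the stored value counts days since 1970-01-01, which is what the offsets passed to `date()` suggest but which the notebook does not state; `raw_days` is made-up input.

```r
# Assumption: the raw values count days since 1970-01-01
raw_days <- c(14975, 14976)

# as.Date() needs the origin spelled out when converting plain numbers
converted <- as.Date(raw_days, origin = "1970-01-01")
converted
class(converted)
```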

Single table dplyr verbs:

+ **filter**: keep rows matching criteria
+ **select**: pick columns by name
+ **arrange**: reorder rows
+ **mutate**: add new variables
+ **summarise**: reduce variables to values (typically used with **group_by**)

Let's look at how to use **filter**.  From page 18 in Hadley's slides [4]:

In [None]:
df <- data.frame(color = c("blue", "black", "blue", "blue", "black"), value = 1:5)
suppressMessages(suppressWarnings(library(dplyr)))
filter(df, color == "blue")

Find all flights:
+ To SFO or OAK
+ In January
+ Delayed by more than an hour
+ That departed between midnight and five am.
+ Where the arrival delay was more than twice the departure delay

In [None]:
# In dplyr: step by step

p1 <- filter(flights, dest %in% c("SFO", "OAK"))
p2 <- filter(flights, date < "2011-02-01")  # the hflights data covers 2011; if ANDing together: p2 <- filter(p1, date < "2011-02-01")
p3 <- filter(flights, dep_delay > 60)
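The last two conditions from the list can be sketched the same way. The `toy` data frame below is made up so the example runs on its own; its column names mirror the ones used above.

```r
suppressMessages(library(dplyr))

# Hypothetical rows; column names follow the flights table
toy <- data.frame(
  dest      = c("SFO", "OAK", "DFW", "SFO"),
  hour      = c(1, 14, 3, 23),
  dep_delay = c(10, 70, 5, 90),
  arr_delay = c(25, 60, 20, 100),
  stringsAsFactors = FALSE
)

# Departed between midnight and five am
p4 <- filter(toy, hour >= 0, hour <= 5)

# Arrival delay more than twice the departure delay
p5 <- filter(toy, arr_delay > 2 * dep_delay)

# Conditions separated by commas are ANDed together
combined <- filter(toy, dest %in% c("SFO", "OAK"), dep_delay > 60)
```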


In [None]:
# In SQL

query1 <- paste0("SELECT * FROM flights ",
                 "WHERE dest IN ('SFO', 'OAK')") # be careful with quotes here!
p1_sql <- dbGetQuery(con, query1)
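One way to sidestep the quoting problem is to pass the values as query parameters instead of pasting them into the SQL string. A minimal sketch, using a made-up in-memory table rather than the tutorial database:

```r
library(DBI)

con_demo <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(con_demo, "flights", data.frame(
  dest      = c("SFO", "OAK", "DFW"),
  dep_delay = c(5, 75, -2),
  stringsAsFactors = FALSE
))

# Each ? is filled, in order, from the params list; no manual quoting needed
p1_param <- dbGetQuery(con_demo,
                       "SELECT * FROM flights WHERE dest IN (?, ?)",
                       params = list("SFO", "OAK"))
p1_param

dbDisconnect(con_demo)
```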

Notice how similar this is to using SQL WHERE clauses!

Let's take a quick look at **select**.  From pages 23-24 in Hadley's slides [4]:

In [8]:
select(df, color)
select(df, -color)

color
blue
black
blue
blue
black


value
1
2
3
4
5


So it's like SQL SELECT, but more powerful because we can do things like:

In [9]:
select(flights, ends_with("delay"))
select(flights, contains("delay"))



How would you do the above in SQL?  Hint: It's not trivial!
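One possible approach, sketched here on a made-up in-memory table, is to ask the database for the column names first and build the SELECT list in R. This is an illustration, not the only way to do it.

```r
library(DBI)

con_demo <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(con_demo, "flights", data.frame(
  dest = "SFO", dep_delay = 5, arr_delay = 12,
  stringsAsFactors = FALSE
))

# Step 1: list the columns and keep those whose name ends in "delay"
cols       <- dbListFields(con_demo, "flights")
delay_cols <- grep("delay$", cols, value = TRUE)

# Step 2: build the query text from the matching names
query  <- paste0("SELECT ", paste(delay_cols, collapse = ", "), " FROM flights")
delays <- dbGetQuery(con_demo, query)
delays

dbDisconnect(con_demo)
```

For column names that contain spaces or reserved words, wrapping each name with `dbQuoteIdentifier()` before pasting would be safer.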

The third dplyr verb from our list is **arrange**, which is sort of like an **ORDER BY** clause in SQL:

In [10]:
arrange(df, color)
arrange(df, desc(color))

color,value
black,2
black,5
blue,1
blue,3
blue,4


color,value
blue,1
blue,3
blue,4
black,2
black,5


Exercise from page 29 in Hadley's slides [4]:

+ Order the flights by departure date and time.
+ Which flights were most delayed?
+ Which flights caught up the most time during the flight?

In [None]:
r1 <- arrange(flights, date, hour, minute)
r2 <- arrange(flights, desc(dep_delay))
r3 <- arrange(flights, desc(arr_delay))
r4 <- arrange(flights, desc(dep_delay - arr_delay))

The next dplyr verb is **mutate**, which we had a peek at earlier in this notebook.  It lets us create new columns or transform existing ones.  As a simple example from page 31 in Hadley's slides [4]:

In [11]:
mutate(df, double = 2 * value)

color,value,double
blue,1,2
black,2,4
blue,3,6
blue,4,8
black,5,10


How would you create a column for the average speed?

In [None]:
# In SQL
query2 <- "SELECT (dist * 1.0 / time) as AVG_SPEED FROM flights"  # * 1.0 guards against integer division if both columns are INTEGER
r2_sql <- dbGetQuery(con, query2)

# In dplyr
r2_dplyr <- mutate(flights, AVG_SPEED = dist/time)
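Two details are easy to miss here: SQLite truncates the result when both operands of `/` are integers, and dividing distance by time gives a speed in whatever units the columns use. A minimal sketch, assuming (the notebook does not confirm this) that `time` is in minutes and `dist` in miles; the toy values are made up.

```r
library(DBI)
suppressMessages(library(dplyr))

con_demo <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
# Both operands are integers, so SQLite performs integer division
int_div  <- dbGetQuery(con_demo, "SELECT 7 / 2 AS q")$q
# Multiplying by 1.0 makes one operand real, so the fraction is kept
real_div <- dbGetQuery(con_demo, "SELECT 7 * 1.0 / 2 AS q")$q
dbDisconnect(con_demo)

# Assumption: dist in miles, time in minutes, so time / 60 is hours
toy       <- data.frame(dist = c(300, 1200), time = c(60, 180))
toy_speed <- mutate(toy, speed_mph = dist / (time / 60))
toy_speed
```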

The last dplyr verb on our list is **summarise** (if you're Australian) or **summarize** (if you prefer US English).  Either one works exactly the same.  The most basic way to use this verb is without a group_by, like this:

In [12]:
summarise(df, total = sum(value))

total
15


If you wanted to get the totals for each color, you'll need to do a group_by like this:

In [14]:
by_color <- group_by(df, color)
summarise(by_color, total = sum(value))

color,total
black,7
blue,8


A nice list of functions to use inside summarise is on page 39 of the slides [4]:

min(x), median(x), max(x),
quantile(x, p)
n(), n_distinct(), sum(x), mean(x)
sum(x > 10), mean(x > 10)
sd(x), var(x), IQR(x), mad(x)
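A few of these can be combined in a single summarise call. The sketch below reuses the small `df` from earlier; the output column names are arbitrary choices.

```r
suppressMessages(library(dplyr))

df <- data.frame(color = c("blue", "black", "blue", "blue", "black"), value = 1:5)

by_color <- group_by(df, color)
stats_by_color <- summarise(
  by_color,
  n            = n(),              # rows per group
  smallest     = min(value),
  middle       = median(value),
  average      = mean(value),
  share_over_2 = mean(value > 2)   # proportion of values above 2
)
stats_by_color
```

Taking the mean of a logical vector, as in `mean(value > 2)`, gives the proportion of rows where the condition holds; `sum(value > 2)` would give the count instead.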

### Pipes - I didn't like these at first, but they grow on you!

First a little motivation... Can you tell what this is doing?

In [None]:
# Downside of functional interface is that it's hard to read multiple operations:
hourly_delay <- filter(
    summarise(
        group_by(filter(flights, !is.na(dep_delay)),
                 date, hour),
        delay = mean(dep_delay),
        n = n()),
    n > 10
)

### A more readable way to do the above

In [None]:
# Solution: the pipe operator from magrittr
# x %>% f(y) -> f(x, y)

hourly_delay <- flights %>%
                filter(!is.na(dep_delay)) %>%
                group_by(date, hour) %>%
                summarise(delay = mean(dep_delay), n = n()) %>%
                filter(n > 10)

# Hint: pronounce %>% as "then"
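To check that the two styles really are equivalent, here is a small sketch on the toy `df` from earlier, so it runs without the flights data. The threshold of 7 is arbitrary.

```r
suppressMessages(library(dplyr))

df <- data.frame(color = c("blue", "black", "blue", "blue", "black"), value = 1:5)

# Nested calls: read from the inside out
nested <- filter(
  summarise(group_by(df, color), total = sum(value)),
  total > 7
)

# Piped calls: read top to bottom, each step feeding the next
piped <- df %>%
  group_by(color) %>%
  summarise(total = sum(value)) %>%
  filter(total > 7)

all.equal(nested, piped)
```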

### References

1. Wickham, Hadley, *Tidy Data*, Journal of Statistical Software, August 2014, Volume 59, Issue 10. [https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf](https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf)
2. Hadley Wickham's "dplyr" tutorial at useR 2014 (1/2) [https://www.youtube.com/watch?v=8SGif63VW6E](https://www.youtube.com/watch?v=8SGif63VW6E)
3. Hadley Wickham's "dplyr" tutorial at useR 2014 (2/2) [https://www.youtube.com/watch?v=Ue08LVuk790](https://www.youtube.com/watch?v=Ue08LVuk790)
4. Slides for 2. and 3. are in the repo at **WonderfulML\docs\slides\2014 Wickham-dplyr-tutorial.pdf**
5. Data sets for the 2014 dplyr tutorial, [https://www.dropbox.com/sh/i8qnluwmuieicxc/AAAgt9tIKoIm7WZKIyK25lh6a](https://www.dropbox.com/sh/i8qnluwmuieicxc/AAAgt9tIKoIm7WZKIyK25lh6a)
6. RSQLite documentation: [https://cran.r-project.org/web/packages/RSQLite/RSQLite.pdf](https://cran.r-project.org/web/packages/RSQLite/RSQLite.pdf)
7. SQLite documentation regarding date types: [https://www.sqlite.org/datatype3.html#date_and_time_datatype](https://www.sqlite.org/datatype3.html#date_and_time_datatype)
8. SQLite documentation regarding table structures (pragma): [http://www.sqlite.org/pragma.html#pragma_table_info](http://www.sqlite.org/pragma.html#pragma_table_info) 
9. SQL Zoo Tutorial: [http://sqlzoo.net](http://sqlzoo.net)
