# group_by and working with databases

## Section 9 - Get group-wise insights: group_by

### group_by

* Usage: group_by(tbl, column to group by)
* With the pipe: df %>% group_by(Group)
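
A minimal sketch of what grouping changes, using a small made-up data frame (the names toy, Group, and Value are illustrative and not part of hflights). The same summarise() call returns one row for ungrouped data and one row per group after group_by().

```r
library(dplyr)

# Toy data (hypothetical), standing in for a real tbl
toy <- data.frame(Group = c("a", "a", "b", "b", "b"),
                  Value = c(1, 3, 10, 20, 30))

# Without grouping: one summary row for the whole table
overall <- toy %>%
  summarise(avg = mean(Value))

# With grouping: one summary row per distinct value of Group
per_group <- toy %>%
  group_by(Group) %>%
  summarise(avg = mean(Value))

overall
per_group
```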
      
### Unite and conquer using group_by

* Use group_by() to group hflights by UniqueCarrier.
* summarise() the grouped tbl with two summary variables:
    - p_canc, the percentage of cancelled flights
    - avg_delay, the average arrival delay, ignoring flights whose arrival delay is NA.
* Finally, order the carriers in the summary from low to high by their average arrival delay. Use percentage of flights cancelled to break any ties.

In [1]:
library(hflights)
library(dplyr)

# hflights is loaded from the hflights package; UniqueCarrier holds the raw carrier codes

# Make an ordered per-carrier summary of hflights
hflights %>%
  group_by(UniqueCarrier) %>%
  summarise(p_canc = mean(Cancelled == 1) * 100,
            avg_delay = mean(ArrDelay, na.rm = TRUE)) %>%
  arrange(avg_delay, p_canc)


UniqueCarrier,p_canc,avg_delay
US,1.1268986,-0.6307692
AA,1.8495684,0.8917558
FL,0.9817672,1.8536239
AS,0.0,3.1923077
YV,1.2658228,4.0128205
DL,1.5903067,6.0841374
CO,0.6782614,6.0986983
MQ,2.904475,7.1529751
EV,3.4482759,7.2569543
WN,1.5504047,7.587143
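
The two summaries rely on small idioms that can be checked in isolation. A sketch on made-up vectors (cancelled and delay are illustrative values, not hflights data): the mean of a logical vector is the proportion of TRUE values, and na.rm = TRUE drops missing values before averaging.

```r
# Made-up values for illustration only
cancelled <- c(0, 1, 0, 0)       # 1 = cancelled
delay     <- c(5, NA, -3, 10)    # NA = no arrival delay recorded

# cancelled == 1 is TRUE/FALSE; its mean is the share of TRUE values
p_canc <- mean(cancelled == 1) * 100

# Without na.rm = TRUE a single NA makes the whole mean NA
avg_delay_raw <- mean(delay)
avg_delay     <- mean(delay, na.rm = TRUE)
```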


### Combine group_by with mutate

* rank(): takes a group of values and calculates the rank of each value within the group
    - rank(c(21, 22, 24, 23)) returns 1 2 4 3
    - As with arrange(), rank() ranks values from the smallest to the largest
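
A short sketch of how rank() behaves, including the tie handling that matters when filtering on a rank. The input vectors are made up; desc() and min_rank() come from dplyr.

```r
library(dplyr)

# The smallest value gets rank 1
r <- rank(c(21, 22, 24, 23))

# Wrapping the values in desc() makes the largest value rank 1
r_desc <- rank(desc(c(21, 22, 24, 23)))

# Ties: base rank() averages the tied positions by default,
# while dplyr's min_rank() gives every tied value the lowest rank
r_ties    <- rank(c(10, 20, 20, 30))
r_minrank <- min_rank(c(10, 20, 20, 30))
```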


* filter() the hflights tbl to keep only observations for which ArrDelay is not NA and is positive.
* Use group_by() on the result to group by UniqueCarrier.
* Next, use summarise() to calculate the average ArrDelay per carrier. Call this summary variable avg.
* Feed the result into a mutate() call: create a new variable, rank, calculated as rank(avg).
* Finally, arrange by this new rank variable.

In [5]:
# Both the dplyr and hflights packages are loaded into workspace
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental", 
         "DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways", 
         "WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier", 
         "FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")

# Add the Carrier column to hflights
hflights$Carrier <- lut[hflights$UniqueCarrier]

hflights %>%
    filter(!is.na(ArrDelay) & ArrDelay > 0) %>%
    group_by(UniqueCarrier) %>%
    summarise(avg = mean(ArrDelay)) %>%
    mutate(rank = rank(avg)) %>%
    arrange(rank)

UniqueCarrier,avg,rank
YV,18.67568,1
F9,18.68683,2
US,20.70235,3
CO,22.13374,4
AS,22.91195,5
OO,24.14663,6
XE,24.19337,7
WN,25.2775,8
FL,27.85693,9
AA,28.4974,10


### Advanced group_by exercises

* How many airplanes flew to only one destination? The tbl you print out should have a single column, named nplanes and a single row.
* Find the most visited destination for each carrier. The tbl you print out should contain four columns:
    - UniqueCarrier and Dest,
    - n, how often a carrier visited a particular destination,
    - rank, how each destination ranks per carrier. rank should be 1 for every row, as you want to find the most visited destination for each carrier.

In [23]:
# dplyr and hflights are already loaded

# How many airplanes only flew to one destination?
hflights %>%
    group_by(TailNum) %>%
    summarise(ndest = n_distinct(Dest)) %>%
    filter(ndest == 1) %>%
    summarise(nplanes = n())


# Find the most visited destination for each carrier
hflights %>%
    group_by(UniqueCarrier, Dest) %>%
    summarise(n = n()) %>%
    mutate(rank = rank(desc(n))) %>%
    filter(rank == 1)

nplanes
1526


UniqueCarrier,Dest,n,rank
AA,DFW,2105,1
AS,SEA,365,1
B6,JFK,695,1
CO,EWR,3924,1
DL,ATL,2396,1
EV,DTW,851,1
F9,DEN,837,1
FL,ATL,2029,1
MQ,DFW,2424,1
OO,COS,1335,1
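
The second query works because summarise() removes only the last grouping variable: after counting, the result is still grouped by UniqueCarrier, so rank() is computed within each carrier. A sketch on made-up data (the carrier codes and destinations below are illustrative, not taken from hflights):

```r
library(dplyr)

# Made-up flights: carrier X visits A three times and B once,
# carrier Y visits A once and C twice
toy <- data.frame(carrier = c("X", "X", "X", "X", "Y", "Y", "Y"),
                  dest    = c("A", "A", "A", "B", "A", "C", "C"))

counts <- toy %>%
  group_by(carrier, dest) %>%
  summarise(n = n())      # drops the dest level; still grouped by carrier

group_vars(counts)        # the grouping variable(s) that remain

top <- counts %>%
  mutate(rank = rank(desc(n))) %>%   # ranked within each carrier
  filter(rank == 1)

top
```

If two destinations tied for first place within a carrier, rank() would give both 1.5 and filter(rank == 1) would drop them; min_rank(desc(n)) would keep the tied leaders.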


## Section 10 - dplyr and databases

### databases

* Structures that dplyr can be applied to
    - data frame
    - data table
    - database
    - tbl_dt    
    - tbl

### dplyr deals with different types

In [24]:
library(data.table)

# Convert hflights to a data.table
hflights2 <- as.data.table(hflights)

# Use summarise to calculate n_carrier
hflights2 %>%
  summarise(n_carrier = n_distinct(UniqueCarrier))


n_carrier
15
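
As a sketch of the same idea without the data.table dependency, the call below runs unchanged on a plain data frame and on a tibble. The toy data is made up for illustration; only the column name mirrors hflights.

```r
library(dplyr)

# Made-up carrier codes
toy_df  <- data.frame(UniqueCarrier = c("AA", "AA", "DL", "WN", "DL"))
toy_tbl <- as_tibble(toy_df)

# Identical dplyr code for both structures
res_df  <- toy_df  %>% summarise(n_carrier = n_distinct(UniqueCarrier))
res_tbl <- toy_tbl %>% summarise(n_carrier = n_distinct(UniqueCarrier))
```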


### dplyr and MySQL databases

In [30]:
# Set up a connection to the MySQL database
my_db <- src_mysql(dbname = "dplyr", 
                   host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com", 
                   port = 3306, 
                   user = "dplyr",
                   password = "dplyr")

ERROR: Error: Condition message must be a string
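
The connection attempt above ended in an error, so no remote query output is available here. As a hedged sketch of the same workflow that does not need the course server, the example below uses an in-memory SQLite database in place of MySQL. It assumes the DBI, RSQLite, and dbplyr packages are installed; the table name flights and its contents are made up for illustration.

```r
library(dplyr)
library(DBI)

# In-memory SQLite database (assumption: stands in for the MySQL server)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Made-up table for illustration
dbWriteTable(con, "flights",
             data.frame(carrier = c("AA", "AA", "DL"),
                        delay   = c(10, 20, 5)))

# tbl() creates a reference to the database table; no rows are pulled yet
flights_db <- tbl(con, "flights")

# dplyr verbs are translated to SQL; collect() brings the result into R
per_carrier <- flights_db %>%
  group_by(carrier) %>%
  summarise(avg_delay = mean(delay, na.rm = TRUE)) %>%
  arrange(carrier) %>%
  collect()

dbDisconnect(con)

per_carrier
```

The same group_by() and summarise() calls used on hflights apply here; the difference is that the work happens in the database until collect() is called.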
