# Course Description

In this interactive tutorial, you will learn how to perform sophisticated dplyr techniques to carry out your data manipulation with R. First you will master the five verbs of R data manipulation with dplyr: select, mutate, filter, arrange and summarise. Next, you will learn how you can chain your dplyr operations using the pipe operator of the magrittr package. In the final section, the focus is on practicing how to subset your data using the group_by function, and how you can access data stored outside of R in a database. All said and done, you will be familiar with data manipulation tools and techniques that will allow you to efficiently manipulate data.


# A) Introduction to dplyr and tbls
Introduction to the dplyr package and the tbl class. Learn the philosophy that guides dplyr, discover some useful applications of the dplyr package, and meet the data structures that dplyr uses behind the scenes.

dplyr introduces a grammar of data manipulation. Along the way you'll also learn to use dplyr's tbl structure and the piping operator, two features that can save you tons of time.

## 1) Load the dplyr and hflights package
Welcome to the interactive exercises part of your dplyr course. Here you will learn the ins and outs of working with dplyr. dplyr is an R package, a collection of functions and data sets that enhance the R language.

Throughout this course you will use dplyr to analyze a data set of airline flight data containing flights that departed from Houston. This data is stored in a package called hflights.

Both dplyr and hflights are already installed on DataCamp's servers, so loading them with library() will get you up and running.

    # Load the dplyr package
    library("dplyr")

    # Load the hflights package. A variable called hflights becomes available:
    # a data.frame representing the data set.
    library("hflights")

    # Call both head() and summary() on hflights
    head(hflights)
    summary(hflights)

## 2) tbl, a special type of data.frame
If we start working directly with the hflights data frame, we may run into trouble because it is very long: printing it floods the console.

dplyr provides a new data structure for R, the **tbl**. A tbl is a special kind of data frame that R knows how to display properly. To create one, convert the data frame:

    # as_tibble() replaces tbl_df(), which is deprecated in recent dplyr versions
    hflights2 <- as_tibble(hflights)
    
With this, printing the object displays only as much data as fits in your console window.

You can also use the function **glimpse()**, which shows the data type and the first values of each column in the dataset:

    glimpse(hflights2)

And if you don't like the tbl format, you can change the structure back with **as.data.frame(dataset)**.
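
The round trip described above can be sketched on a small made-up data frame. The name `flights_df` and its columns are illustrative and not part of hflights; the sketch assumes dplyr is installed and uses `as_tibble()` for the conversion.

```r
library(dplyr)

# Hypothetical toy data standing in for hflights
flights_df <- data.frame(
  carrier  = c("AA", "AS", "B6"),
  distance = c(224, 1874, 1428)
)

# Convert to a tbl; the data itself is unchanged
flights_tbl <- as_tibble(flights_df)
class(flights_tbl)        # includes "tbl_df" and "data.frame"

# Compact overview: one line per column with its type and first values
glimpse(flights_tbl)

# Convert back to a plain data.frame
flights_back <- as.data.frame(flights_tbl)
class(flights_back)       # "data.frame"
```

Because the tbl still inherits from data.frame, functions written for data frames keep working on it.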


### 2.1) Convert data.frame to tibble
As Garrett explained, a tbl (pronounced tibble) is just a special kind of data.frame. They make your data easier to look at, but also easier to work with. On top of this, it is straightforward to derive a tbl from a data.frame structure using as_tibble().

The tbl format changes how R displays your data, but it does not change the data's underlying data structure. A tbl inherits the original class of its input, in this case, a data.frame. This means that you can still manipulate the tbl as if it were a data.frame. In other words, you can do anything with the hflights tbl that you could do with the hflights data.frame.

    # Both the dplyr and hflights packages are loaded

    # Convert the hflights data.frame into a tbl and store it under the same name
    hflights <- as_tibble(hflights)

    # Display the hflights tbl
    hflights

    # Create the object carriers from the UniqueCarrier column
    carriers <- hflights$UniqueCarrier
    
### 2.2) Changing labels of hflights, part 1 of 2

A bit of cleaning would be a good idea since the UniqueCarrier variable of hflights uses a confusing code system.

To do this, let's work with a lookup table that comes in the form of a named vector. When you subset the lookup table with a character vector (like the character strings in UniqueCarrier), R returns the values of the lookup table whose names match those strings. To see how this works, run the following code in the console:

    two <- c("AA", "AS")
    lut <- c("AA" = "American", 
             "AS" = "Alaska", 
             "B6" = "JetBlue")
    two <- lut[two]
    two    


Exercise: 

1. Add a new Carrier column to hflights by combining lut with the UniqueCarrier column of hflights.
2. It's rather hard to see if you did things right, since the Carrier variable does not appear when you print hflights. Use the glimpse() function on hflights instead.


    # Both the dplyr and hflights packages are loaded into workspace
    lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental", 
             "DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways", 
             "WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier", 
             "FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")

    # Add the Carrier column to hflights
    hflights$Carrier <- lut[hflights$UniqueCarrier]

    # Glimpse at hflights
    glimpse(hflights)
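
One thing to watch for with this technique is what happens to codes that are missing from the lookup table. The base R sketch below uses a made-up code `"ZZ"` to show that unmatched entries come back as NA rather than raising an error.

```r
# Lookup table: names are the codes, values are the labels
lut <- c("AA" = "American", "AS" = "Alaska")

# "ZZ" is a made-up code that is not in the lookup table
codes <- c("AS", "AA", "ZZ", "AA")

labels <- lut[codes]
labels                 # a named vector; the unmatched code gives NA
unname(labels)         # drop the names to keep only the labels

# Count how many codes had no match
sum(is.na(labels))     # 1
```

Checking `sum(is.na(...))` after building a new column this way is a cheap test that the lookup table covers every code.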

### 2.3) Changing labels of hflights, part 2 of 2

Let's try a similar thing, but this time to change the labels in the CancellationCode column. This column lists reasons why a flight was cancelled using a non-informative alphabetical code. Execute

    unique(hflights$CancellationCode)
    
A lookup table lut has already been created for you, that converts the alphabetical codes into more meaningful strings.

For example, in the same way as in the last exercise, we add a new column through a lookup table:

    # The hflights tbl you built in the previous exercise is available in the workspace.

    # The lookup table
    lut <- c("A" = "carrier", "B" = "weather", "C" = "FAA", "D" = "security", "E" = "not cancelled")

    # Add the Code column
    hflights$Code <- lut[hflights$CancellationCode]

    # Glimpse at hflights
    glimpse(hflights)

# B) Select and mutate
Get familiar with dplyr's manipulation verbs. Meet the five verbs and then practice using the mutate and select verbs.

## 1) The five verbs and select in more detail
dplyr does more than provide a new data structure; it provides a complete grammar for data manipulation, built around five verbs:

    a. select: removes columns from a dataset, i.e. it returns a subset of the columns
    b. filter: removes rows, i.e. it returns a subset of the rows
    c. arrange: reorders the rows in a dataset
    d. mutate: uses the data to build new columns of values
    e. summarize: calculates summary statistics

### 1.1) The select verb
To answer the simple question whether flight delays tend to shrink or grow during a flight, we can safely discard a lot of the variables of each flight. To select only the ones that matter, we can use select().

As an example, take the following call, which selects the variables var1 and var2 from the data frame df.

    select(df, var1, var2)
    
You can also use **:** to select a range of variables and **-** to exclude some variables, similar to indexing a data.frame with square brackets. You can use variable names as well as integer indexes. This call selects the first four variables of a data frame df, except for the second one:

    select(df, 1:4, -2)
    
select() does not change the data frame it is called on; you have to explicitly assign the result of select() to a variable to store the result.   
 
    # hflights is pre-loaded as a tbl, together with the necessary libraries.
    # Print out a tbl with the four columns of hflights related to delay
    select(hflights, ActualElapsedTime, AirTime, ArrDelay, DepDelay)

    # Print out the columns Origin up to Cancelled of hflights
    select(hflights, 14:19)

    # Answer to last question: be concise! Find the most concise way to select
    # columns Year up to and including DayOfWeek, and columns ArrDelay up to and
    # including Diverted. names(hflights) shows the order of the variables.

    names(hflights)
    select(hflights, 1:4, 12:21)
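
The range and exclusion syntax can be checked on a small made-up data frame. In this sketch, `df` and its column names are illustrative only.

```r
library(dplyr)

# Hypothetical toy data frame
df <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12, e = 13:15)

# Columns 1 to 4, except the second one
by_position <- select(df, 1:4, -2)
names(by_position)     # "a" "c" "d"

# The same selection by name
by_name <- select(df, a:d, -b)
names(by_name)         # "a" "c" "d"

# select() returns a new object; df itself is unchanged
names(df)              # "a" "b" "c" "d" "e"
```

Selecting by name is usually more robust than selecting by position, because positions shift when columns are added to the data.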

### 1.2) Helper functions for variable selection
dplyr comes with a set of helper functions that can help you select groups of variables inside a select() call:

    starts_with("X"): every name that starts with "X",
    ends_with("X"): every name that ends with "X",
    contains("X"): every name that contains "X",
    matches("X"): every name that matches "X", where "X" can be a regular expression,
    num_range("x", 1:5): the variables named x1, x2, x3, x4 and x5 (add width = 2 to match x01, x02, ...),
    one_of(x): every name that appears in x, which should be a character vector (recent versions also offer all_of() and any_of() for this).

Pay attention here: When you refer to columns directly inside select(), you don't use quotes. If you use the helper functions, you do use quotes.

    # As usual, hflights is pre-loaded as a tbl, together with the necessary libraries.

    # Print out a tbl containing just ArrDelay and DepDelay
    select(hflights, ends_with("Delay"))

    # Print out a tbl containing UniqueCarrier, FlightNum, TailNum, Cancelled and
    # CancellationCode, using both helper functions and variable names
    select(hflights, UniqueCarrier, ends_with("Num"), starts_with("Cancel"))

    # Print out a tbl containing DepTime, ArrTime, ActualElapsedTime, AirTime,
    # ArrDelay and DepDelay, using only helper functions
    select(hflights, contains("Tim"), contains("Del"))
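
To see each helper in isolation, the following sketch applies them to a one-row toy data frame whose column names are made up to resemble hflights. It assumes dplyr is installed; the comments show the selected names.

```r
library(dplyr)

# Hypothetical toy data frame with patterned column names
df <- data.frame(
  x1 = 1, x2 = 2, x3 = 3,
  DepDelay = 4, ArrDelay = 5,
  DepTime = 6, TailNum = "N1"
)

names(select(df, starts_with("Dep")))    # "DepDelay" "DepTime"
names(select(df, ends_with("Delay")))    # "DepDelay" "ArrDelay"
names(select(df, contains("Tim")))       # "DepTime"
names(select(df, matches("^x[12]$")))    # "x1" "x2"
names(select(df, num_range("x", 2:3)))   # "x2" "x3"

# one_of() takes a character vector of column names
wanted <- c("TailNum", "x1")
names(select(df, one_of(wanted)))        # "TailNum" "x1"
```

Note that the helper arguments are quoted strings, while a column named directly inside select() is written without quotes.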
    
### 1.3) Comparison to base R
To see the added value of the dplyr package, it is useful to compare its syntax with base R. Up to now, you have only considered functionality that is also available without the use of dplyr. The elegance and ease-of-use of dplyr is a great plus though.

    # both hflights and dplyr are available

    # Finish select call so that ex1d matches ex1r
    ex1r <- hflights[c("TaxiIn", "TaxiOut", "Distance")]
    ex1d <- select(hflights, contains("Taxi"), Distance)

    # Finish select call so that ex2d matches ex2r
    ex2r <- hflights[c("Year", "Month", "DayOfWeek", "DepTime", "ArrTime")]
    ex2d <- select(hflights, Year:ArrTime, -DayofMonth)

    # Finish select call so that ex3d matches ex3r
    ex3r <- hflights[c("TailNum", "TaxiIn", "TaxiOut")]
    ex3d <- select(hflights, starts_with("T"))
    
## 2) Mutate
mutate() adds new variables to a data set, building them from data that already exists in the data set. Its syntax is very similar to that of select():

    mutate(dataset, new_column_name = expression )
    
Example:

    # hflights and dplyr are loaded and ready to serve you.

    # Add the new variable ActualGroundTime to a copy of hflights and save the result as g1.
    g1 <- mutate(hflights, ActualGroundTime = ActualElapsedTime - AirTime)

    # Add the new variable GroundTime to g1. Save the result as g2.
    g2 <- mutate(g1, GroundTime = TaxiIn + TaxiOut)

    # Add the new variable AverageSpeed to g2. Save the result as g3.
    g3 <- mutate(g2, AverageSpeed = 60 * Distance / AirTime)

    # Print out g3
    g3
    
### 2.1) Add multiple variables using mutate
So far you've added variables to hflights one at a time, but you can also use mutate() to add multiple variables at once. To create more than one variable, place a comma between each variable that you define inside mutate().

mutate() even allows you to use a new variable while creating a next variable in the same call. In this example, the new variable x is directly reused to create the new variable y:

    mutate(my_df, x = a + b, y = x + c)
    
Example:

    # Add the variables loss and loss_ratio to the dataset: m1
    m1 <- mutate(hflights, loss = ArrDelay - DepDelay, loss_ratio = loss / DepDelay)

    # Add three variables, the last of which reuses the first two: m2
    m2 <- mutate(hflights,
                 TotalTaxi = TaxiIn + TaxiOut,
                 ActualGroundTime = ActualElapsedTime - AirTime,
                 Diff = TotalTaxi - ActualGroundTime)
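
The reuse of a freshly created variable within one mutate() call can be verified by hand on a tiny example. The data frame `delays` below is made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: delays in minutes
delays <- data.frame(DepDelay = c(10, 20, 5), ArrDelay = c(4, 30, 5))

result <- mutate(
  delays,
  loss       = ArrDelay - DepDelay,   # defined first
  loss_ratio = loss / DepDelay        # reuses loss from the line above
)

result$loss          # -6 10 0
result$loss_ratio    # -0.6 0.5 0.0

# mutate() returns a new object; the input keeps its two columns
ncol(delays)         # 2
```

The order of the definitions matters: `loss_ratio` can only refer to `loss` because `loss` appears earlier in the same call.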

# C) Filter and arrange
Learn how to search through the observations in your data set (and extract useful observations) with the filter function. Rearrange the observations in your data set with the arrange verb.

## 1) The third of five verbs: filter
To use this verb we need to know its structure:

    filter(source, logical)

The logical test is built from comparison operators such as `>`, `>=`, `<`, `<=`, `==` and `!=` (see ?Comparison), together with `is.na()` and `!is.na()`.

### 1.1) Logical operators
R comes with a set of logical operators that you can use inside filter():

    x < y, TRUE if x is less than y
    x <= y, TRUE if x is less than or equal to y
    x == y, TRUE if x equals y
    x != y, TRUE if x does not equal y
    x >= y, TRUE if x is greater than or equal to y
    x > y, TRUE if x is greater than y
    x %in% c(a, b, c), TRUE if x is in the vector c(a, b, c)

The following example filters df so that only the observations for which a is positive are kept:

    filter(df, a > 0)
    
For example:

    # hflights is at your disposal as a tbl, with clean carrier names
    str(hflights)
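    # All flights that traveled 3000 miles or more
    filter(hflights, Distance >= 3000)

    # All flights flown by one of JetBlue, Southwest, or Delta
    # (this assumes UniqueCarrier already holds the translated carrier names;
    # with the raw codes, test the Carrier column or use c("B6", "WN", "DL"))
    filter(hflights, UniqueCarrier %in% c("JetBlue", "Southwest", "Delta"))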

    # All flights where taxiing took longer than flying
    filter(hflights, (TaxiIn + TaxiOut) > AirTime)
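
The comparison and `%in%` tests can be tried on a small made-up data frame. The `flights` object below is illustrative only, and its `carrier` column holds raw codes rather than full names.

```r
library(dplyr)

# Hypothetical toy data
flights <- data.frame(
  carrier  = c("AA", "B6", "WN", "DL", "AA"),
  distance = c(224, 3904, 1428, 3100, 689)
)

# Comparison: keep long flights only
long_flights <- filter(flights, distance >= 3000)
long_flights$carrier              # "B6" "DL"

# Set membership: keep rows whose carrier is in the given vector
some_carriers <- filter(flights, carrier %in% c("B6", "WN", "DL"))
nrow(some_carriers)               # 3

# Values that never occur in the column simply match nothing
nrow(filter(flights, carrier %in% c("JetBlue")))   # 0
```

The last call shows why it matters whether a column stores codes or translated names: a mismatch gives an empty result, not an error.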

### 1.2) Combining tests using boolean operators
R also comes with a set of boolean operators that you can use to combine multiple logical tests into a single test. These include & (and), | (or), and ! (not). Instead of using the & operator, you can also pass several logical tests to filter(), separated by commas. The following two calls are completely equivalent:

    filter(df, a > 0 & b > 0)
    filter(df, a > 0, b > 0)

Next, is.na() will also come in handy. This example keeps the observations in df for which the variable x is not NA:

    filter(df, !is.na(x))
    
Examples:

    # hflights is at your service as a tbl!
    str(hflights)

    # All flights that departed before 5:00 am (500) or arrived after 10:00 pm (2200)
    filter(hflights, DepTime < 500 | ArrTime > 2200)

    # All flights that departed late but arrived ahead of schedule
    filter(hflights, DepDelay > 0 & ArrDelay < 0)

    # All flights that were cancelled after being delayed
    filter(hflights, DepDelay > 0 & Cancelled == 1)
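
A small sketch of how the boolean operators behave, including what happens when a test evaluates to NA. The data frame `df` is made up for illustration.

```r
library(dplyr)

# Hypothetical toy data with a missing value in b
df <- data.frame(a = c(1, -1, 2, 3), b = c(5, 6, NA, -2))

# & and the comma form select the same rows
with_and   <- filter(df, a > 0 & b > 0)
with_comma <- filter(df, a > 0, b > 0)
identical(with_and, with_comma)      # TRUE

# Rows where the test evaluates to NA are dropped, not kept
nrow(with_and)                       # 1 (only a = 1, b = 5)

# | keeps a row when at least one test is TRUE
nrow(filter(df, a > 2 | b > 5))      # 2

# ! negates a test; here it keeps the rows where b is present
nrow(filter(df, !is.na(b)))          # 3
```

Because filter() silently drops rows whose test is NA, an explicit `is.na()` test is the way to keep or inspect the missing values.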
    
### 1.3) Exercise - Blend together what you've learned!
So far, you have learned three data manipulation functions in the dplyr package. Time for a summarizing exercise. You will generate a new dataset from the hflights dataset that contains some useful information on flights that had JFK airport as their destination. You will need select(), mutate() and filter().

    # hflights is already available in the workspace
    str(hflights)
    # Select the flights that had JFK as their destination: c1
    c1 <- filter(hflights, Dest == "JFK")

    # Combine the Year, Month and DayofMonth variables to create a Date column: c2
    c2 <- mutate(c1, Date = paste(Year, Month, DayofMonth, sep = "-"))

    # Print out a selection of columns of c2
    select(c2, Date, DepTime, ArrTime, TailNum)
    
## 2) Almost there: the arrange verb
To use arrange() we only need to do:

    arrange(source, columns to arrange)
    
### 2.1) Arranging your data
arrange() can be used to rearrange rows according to any type of data. If you pass arrange() a character variable, for example, R will rearrange the rows in alphabetical order according to values of the variable. If you pass a factor variable, R will rearrange the rows according to the order of the levels in your factor (running levels() on the variable reveals this order).

dtc is defined in the code below. It's up to you to write some arrange() expressions to display its contents appropriately.

    # dplyr and the hflights tbl are available

    # Definition of dtc
    dtc <- filter(hflights, Cancelled == 1, !is.na(DepDelay))
    str(dtc)
    # Arrange dtc by departure delays
    arrange(dtc, DepDelay)

    # Arrange dtc so that cancellation reasons are grouped
    arrange(dtc, CancellationCode)

    # Arrange dtc according to carrier and departure delays
    arrange(dtc, UniqueCarrier, DepDelay)
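
The difference between sorting a character column and sorting a factor column can be sketched as follows. The toy data frame and its size labels are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: the same labels as character and as factor
df <- data.frame(size_chr = c("small", "large", "medium"),
                 stringsAsFactors = FALSE)
df$size_fct <- factor(df$size_chr, levels = c("small", "medium", "large"))

# Character column: alphabetical order
arrange(df, size_chr)$size_chr                  # "large" "medium" "small"

# Factor column: order of the levels
as.character(arrange(df, size_fct)$size_fct)    # "small" "medium" "large"
levels(df$size_fct)                             # "small" "medium" "large"
```

Setting the factor levels explicitly is therefore a simple way to get a custom sort order out of arrange().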
    
### 2.2) Reverse the order of arranging
By default, arrange() arranges the rows from smallest to largest. Rows with the smallest value of the variable will appear at the top of the data set. You can reverse this behavior with the desc() function. arrange() will reorder the rows from largest to smallest values of a variable if you wrap the variable name in desc() before passing it to arrange().    

    # dplyr and the hflights tbl are available
    str(hflights)
    # Arrange according to carrier and decreasing departure delays
    arrange(hflights, UniqueCarrier, desc(DepDelay))


    # Arrange flights by total delay (normal order).
    arrange(hflights, ArrDelay + DepDelay)
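
Both ideas, reversing the order with desc() and sorting on a computed value, can be checked on a toy data frame. The column names below are illustrative.

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(
  carrier   = c("AA", "AA", "B6", "B6"),
  dep_delay = c(5, 40, 12, -3),
  arr_delay = c(0, 35, 20, -10)
)

# Ascending by carrier, then descending by dep_delay within each carrier
arrange(df, carrier, desc(dep_delay))$dep_delay    # 40 5 12 -3

# Sort on a computed value: total delay, smallest first
arrange(df, dep_delay + arr_delay)$dep_delay       # -3 5 12 40
```

Later arguments to arrange() only break ties left by the earlier ones, which is why `desc(dep_delay)` acts within each carrier.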


# D) Summarize and the pipe operator
Master the data manipulation verb summarize, and practice combining the five verbs to solve advanced data manipulation tasks. Learn to chain the operators together with the piping operator.

## 1) Summarize
summarise() uses your data to create a new dataset of summary statistics that describe the data. The syntax of summarise() follows that of mutate():

    summarise(df, new_column_name = expression)

    # Example
    summarise(df, sum = sum(A),
                  avg = mean(B),
                  var = var(B))

### 1.1) The syntax of summarize
summarize(), the last of the 5 verbs, follows the same syntax as mutate(), but the resulting dataset consists of a single row instead of an entire new column in the case of mutate().

In contrast to the four other data manipulation functions, summarize() does not return an altered copy of the dataset it is summarizing; instead, it builds a new dataset that contains only the summarizing statistics.

    # hflights and dplyr are loaded in the workspace
    str(hflights)
    # Print out a summary with variables min_dist and max_dist
    summarize(hflights, min_dist = min(Distance), max_dist = max(Distance))

    # Print out a summary with variable max_div
    summarize(filter(hflights, Diverted == 1), max_div = max(Distance))
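
The contrast between mutate() and summarize() shows up directly in the dimensions of the result. The sketch below uses a made-up data frame `df`.

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(distance = c(224, 3904, 1428), diverted = c(0, 1, 0))

# mutate() keeps every row and adds a column
dim(mutate(df, km = distance * 1.609))       # 3 3

# summarize() collapses the data to one row of statistics
s <- summarize(df, min_dist = min(distance), max_dist = max(distance))
dim(s)                                       # 1 2
s$min_dist                                   # 224
s$max_dist                                   # 3904
```

The original columns do not appear in the summary; only the statistics you name are returned.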

### 1.2) Aggregate functions
You can use any function you like in summarize() so long as the function can take a vector of data and return a single number. R contains many aggregating functions, as dplyr calls them:

    min(x) - minimum value of vector x.
    max(x) - maximum value of vector x.
    mean(x) - mean value of vector x.
    median(x) - median value of vector x.
    quantile(x, p) - pth quantile of vector x.
    sd(x) - standard deviation of vector x.
    var(x) - variance of vector x.
    IQR(x) - Inter Quartile Range (IQR) of vector x.
    diff(range(x)) - total range of vector x. 
    

Example

    # hflights is available

    # Remove rows that have NA ArrDelay: temp1
    temp1 <- filter(hflights, !is.na(ArrDelay))

    # Generate summary about ArrDelay column of temp1
    summarize(temp1, 
              earliest = min(ArrDelay), 
              average = mean(ArrDelay), 
              latest = max(ArrDelay), 
              sd = sd(ArrDelay))

    # Keep rows that have no NA TaxiIn and no NA TaxiOut: temp2
    temp2 <- filter(hflights, !is.na(TaxiIn) & !is.na(TaxiOut))

    # Print the maximum taxiing difference of temp2 with summarize(): one variable,
    # max_taxi_diff, the biggest absolute difference in time between TaxiIn and
    # TaxiOut for a single flight
    summarize(temp2, max_taxi_diff = max(abs(TaxiIn - TaxiOut)))
    
### 1.3) dplyr aggregate functions
dplyr provides several helpful aggregate functions of its own, in addition to the ones that are already defined in R. These include:

    first(x) - The first element of vector x.
    last(x) - The last element of vector x.
    nth(x, n) - The nth element of vector x.
    n() - The number of rows in the data.frame or group of observations that summarize() describes.
    n_distinct(x) - The number of unique values in vector x.

Next to these dplyr-specific functions, you can also turn a logical test into an aggregating function with sum() or mean(). A logical test returns a vector of TRUEs and FALSEs. When you apply sum() or mean() to such a vector, R coerces each TRUE to a 1 and each FALSE to a 0. sum() then represents the total number of observations that passed the test; mean() represents the proportion.

    # hflights is available with full names for the carriers

    # Generate summarizing statistics for hflights
    summarize(hflights, 
              n_obs = n(), 
              n_carrier = n_distinct(UniqueCarrier), 
              n_dest = n_distinct(Dest))

    # All American Airlines flights
    # (assumes UniqueCarrier holds the translated names; with the raw codes use "AA")
    aa <- filter(hflights, UniqueCarrier == "American")

    # Print out a summary of aa with the following variables:
    #   n_flights: the total number of flights (each observation is a flight),
    #   n_canc: the total number of cancelled flights,
    #   avg_delay: the average arrival delay of flights whose delay is not NA (na.rm = TRUE).
    
    summarize(aa, 
              n_flights = n(), 
              n_canc = sum(Cancelled == 1),
              avg_delay = mean(ArrDelay, na.rm = TRUE))
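
The coercion of TRUE and FALSE to 1 and 0 is easy to verify by hand on a few rows. The data frame below is made up: five flights, two of them cancelled.

```r
library(dplyr)

# Hypothetical toy data: 5 flights, 2 of them cancelled
df <- data.frame(
  dest      = c("JFK", "LAX", "JFK", "SFO", "LAX"),
  cancelled = c(0, 1, 0, 0, 1),
  arr_delay = c(10, NA, -5, 25, NA)
)

s <- summarize(
  df,
  n_flights  = n(),                            # number of rows: 5
  n_dest     = n_distinct(dest),               # unique destinations: 3
  n_canc     = sum(cancelled == 1),            # TRUE counts as 1: 2
  p_canc     = mean(cancelled == 1),           # proportion of TRUE: 0.4
  avg_delay  = mean(arr_delay, na.rm = TRUE),  # NA values are ignored: 10
  first_dest = first(dest),                    # "JFK"
  last_dest  = last(dest)                      # "LAX"
)
s
```

Without `na.rm = TRUE`, the mean of `arr_delay` would itself be NA, because two of the delays are missing.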

## 2) The pipe operator
The pipe is an operator that you place between an object and a function:

    object %>% function(___, arg2, arg3, ...)

The pipe takes the object on its left and passes it to the function on its right as the first argument of that function.


### 2.1) Overview of syntax
As another example of the %>%, have a look at the following two commands that are completely equivalent:

    mean(c(1, 2, 3, NA), na.rm = TRUE)
    c(1, 2, 3, NA) %>% mean(na.rm = TRUE)
    
The %>% operator allows you to extract the first argument of a function from the arguments list and put it in front of it, thus solving *the Dagwood sandwich problem*.
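
To make the contrast concrete, the sketch below writes the same three-step computation twice, once as nested calls and once with the pipe. The data frame and its column names are made up for illustration.

```r
library(dplyr)   # dplyr re-exports the %>% operator from magrittr

# Hypothetical toy data
df <- data.frame(taxi_in = c(5, 7, NA), taxi_out = c(15, 10, 12))

# Nested calls: read from the inside out
nested <- summarize(
  filter(
    mutate(df, diff = taxi_out - taxi_in),
    !is.na(diff)
  ),
  avg = mean(diff)
)

# Piped calls: read from top to bottom
piped <- df %>%
  mutate(diff = taxi_out - taxi_in) %>%
  filter(!is.na(diff)) %>%
  summarize(avg = mean(diff))

nested$avg    # 6.5
piped$avg     # 6.5
```

Both versions give the same result; the piped one lists the steps in the order in which they are carried out.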

For example:

    # Use dplyr functions and the pipe operator to transform the following English sentences into R code:
    # Take the hflights data set and then ...
    # Add a variable named diff that is the result of subtracting TaxiIn from TaxiOut, and then ...
    # Pick all of the rows whose diff value does not equal NA, and then ...
    # Summarize the data set with a value named avg that is the mean diff value.

    # hflights and dplyr are both loaded and ready to serve you
    str(hflights)

    # Write the 'piped' version of the English sentences.
    hflights %>%
      mutate(diff = TaxiOut - TaxiIn) %>%
      filter(!is.na(diff)) %>%
      summarize(avg = mean(diff))
    
Another example:

    # mutate() the hflights dataset and add two variables:
    #   RealTime: the actual elapsed time plus 100 minutes (for the overhead that flying involves) and
    #   mph: calculated as 60 times Distance divided by RealTime, then
    # filter() to keep observations that have an mph that is not NA and that is below 70, finally
    # summarize() the result by creating four summary variables:
    #   n_less, the number of observations,
    #   n_dest, the number of destinations,
    #   min_dist, the minimum distance and
    #   max_dist, the maximum distance.

    hflights %>%
      mutate(RealTime = ActualElapsedTime + 100, mph = Distance * 60 / RealTime) %>%
      filter(!is.na(mph) & mph < 70) %>%
      summarize(n_less = n(), n_dest = n_distinct(Dest), min_dist = min(Distance), max_dist = max(Distance))
    
Another example:

    # filter() the result of mutate() to keep observations that have an mph under 105,
    # or for which Cancelled equals 1, or for which Diverted equals 1.
    # summarize() the result by creating four summary variables:
    #   n_non, the number of observations,
    #   n_dest, the number of destinations,
    #   min_dist, the minimum distance and
    #   max_dist, the maximum distance.
        
      
      hflights %>%
      mutate(
        RealTime = ActualElapsedTime + 100, 
        mph = 60 * Distance / RealTime
      ) %>%
      filter(mph < 105 | Cancelled == 1 | Diverted == 1)  %>% 
      summarize(
        n_non = n(),
        n_dest = n_distinct(Dest),
        min_dist = min(Distance),
        max_dist = max(Distance)
      )
      
Another example:

    # filter() the hflights tbl to keep only observations whose DepTime is not NA, whose ArrTime is not NA and for which DepTime exceeds ArrTime.
    # Pipe the result into a summarize() call to create a single summary variable: num, that simply counts the number of observations.
    
        # hflights and dplyr are loaded
        #str(hflights)
        # Count the number of overnight flights
        hflights %>%
          filter(!is.na(DepTime) & !is.na(ArrTime) & DepTime > ArrTime) %>%
          summarize(num= n())

# E) Group_by and working with databases

Complete your mastery of data manipulation with group-wise operations and databases. Learn to use group_by to group your data into subsets of observations, and use dplyr to access data stored outside of R in a database.

## 1) group_by
The syntax of group_by is:

    group_by(df, group)

or, with the pipe:

    df %>%
      group_by(group)

### 1.1) Unite and conquer using group_by
As we explained, group_by() lets you define groups within your data set. Its influence becomes clear when calling summarize() on a grouped dataset: summarizing statistics are calculated for the different groups separately.

Instructions:

1. Use group_by() to group hflights by UniqueCarrier.
2. summarize() the grouped tbl with two summary variables:
    * p_canc, the percentage of cancelled flights.
    * avg_delay, the average arrival delay of flights whose delay does not equal NA.
3. Finally, order the carriers in the summary from low to high by their average arrival delay. Use the percentage of flights cancelled to break any ties.


    # hflights is in the workspace as a tbl, with translated carrier names
    str(hflights)
    # Make an ordered per-carrier summary of hflights
    hflights %>%
      group_by(UniqueCarrier) %>%
      summarize(
        p_canc = 100 * mean(Cancelled == 1),
        avg_delay = mean(ArrDelay,  na.rm = TRUE)
      ) %>%
      arrange(avg_delay , p_canc)
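
The effect of grouping on summarize() can be seen on a toy data frame with two carriers. The data below is made up so that the results are easy to check by hand.

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(
  carrier   = c("AA", "AA", "B6", "B6", "B6"),
  cancelled = c(0, 1, 0, 0, 0),
  arr_delay = c(10, NA, 30, 20, 10)
)

# Ungrouped: one row for the whole data set
nrow(summarize(df, avg_delay = mean(arr_delay, na.rm = TRUE)))   # 1

# Grouped: one row per carrier
per_carrier <- df %>%
  group_by(carrier) %>%
  summarize(
    p_canc    = 100 * mean(cancelled == 1),
    avg_delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  arrange(avg_delay, p_canc)

per_carrier$carrier     # "AA" "B6"
per_carrier$p_canc      # 50 0
per_carrier$avg_delay   # 10 20
```

The summary expressions are identical in both calls; only the preceding group_by() changes how many rows come back.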
      
### 1.2) Combine group_by with mutate
You can also combine group_by() with mutate(). When you mutate grouped data, mutate() will calculate the new variables independently for each group. This is particularly useful when mutate() uses the rank() function, that calculates within-group rankings. rank() takes a group of values and calculates the rank of each value within the group, e.g.

    rank(c(21, 22, 24, 23))

has output

    [1] 1 2 4 3

As with arrange(), rank() ranks values from the smallest to the largest.
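
A minimal sketch of within-group ranking with group_by() and mutate(). The data frame and the column names `rank_asc` and `rank_desc` are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(
  carrier = c("AA", "AA", "AA", "B6", "B6"),
  delay   = c(30, 10, 20, 5, 50)
)

ranked <- df %>%
  group_by(carrier) %>%
  mutate(
    rank_asc  = rank(delay),         # 1 = smallest delay in the group
    rank_desc = rank(desc(delay))    # 1 = largest delay in the group
  ) %>%
  ungroup()

ranked$rank_asc     # 3 1 2 1 2
ranked$rank_desc    # 1 3 2 2 1
```

The ranks restart at 1 for each carrier because mutate() evaluates rank() separately per group.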

Example instructions:

1. filter() the hflights tbl to only keep observations for which ArrDelay is not NA and positive.
2. Use group_by() on the result to group by UniqueCarrier.
3. Next, use summarize() to calculate the average ArrDelay per carrier. Call this summary variable avg.
4. Feed the result into a mutate() call: create a new variable, rank, calculated as rank(avg).
5. Finally, arrange by this new rank variable


        #dplyr is loaded, hflights is loaded with translated carrier names
        str(hflights)
        #Ordered overview of average arrival delays per carrier
        hflights %>%
        filter(!is.na(ArrDelay) & ArrDelay > 0) %>%
        group_by(UniqueCarrier) %>%
        summarize(
          avg = mean(ArrDelay)
        ) %>%
        mutate(rank=rank(avg)) %>%
        arrange(rank)

Last Exercise:

1. How many airplanes flew to only one destination? The tbl you print out should have a single column, named nplanes and a single row.
2. Find the most visited destination for each carrier. The tbl you print out should contain four columns:
  * UniqueCarrier and Dest,
  * n, how often a carrier visited a particular destination,
  * rank, how each destination ranks per carrier. rank should be 1 for every row, as you want to find the most visited destination for each carrier.
 
        #dplyr and hflights (with translated carrier names) are pre-loaded

        #How many airplanes only flew to one destination?
        hflights %>%
          group_by(TailNum) %>%
          summarize(ndest = n_distinct(Dest)) %>%
          filter(ndest == 1) %>%
          summarize(nplanes = n())


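        # Find the most visited destination for each carrier.
        # summarize() drops the last grouping level (Dest), so the result is still
        # grouped by UniqueCarrier and rank() is computed within each carrier.
        hflights %>%
          group_by(UniqueCarrier, Dest) %>%
          summarize(n = n()) %>%
          mutate(rank = rank(desc(n))) %>%
          filter(rank == 1)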
          
## 2) dplyr and databases


### 2.1) dplyr deals with different types
hflights2 is a copy of hflights that is saved as a data table. hflights2 was made available in the background using the following code:

    library(data.table)
    hflights2 <- as.data.table(hflights)
    
hflights2 contains all of the same information as hflights, but the information is stored in a different data structure. You can see this structure by typing hflights2 at the command line.

Even though hflights2 is a different data structure, you can use the same dplyr functions to manipulate hflights2 as you used to manipulate hflights.

### 2.2) dplyr and MySQL databases
DataCamp hosts a MySQL database with data about flights that departed from New York City in 2013. The data is similar to the data in hflights, but it does not contain information about cancellations or diversions. With the tbl() function, we already created a reference to a table in this database.

Although nycflights is a reference to data that lives outside of R, you can use the dplyr commands on it as usual. Behind the scenes, dplyr converts the commands to the database's native language (in this case, SQL) and returns the results. This allows you to work with data that is too large to fit in R: only the fraction of the data that you need is actually downloaded into R, which will usually fit without memory issues.
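
The translation to SQL can be inspected locally without a remote server. The sketch below is an assumption-laden stand-in for the course setup: it uses an in-memory SQLite database instead of MySQL, and it assumes the DBI, RSQLite and dbplyr packages are installed. The table name `flights` and its contents are made up.

```r
library(dplyr)

# Assumption: DBI, RSQLite and dbplyr are installed. An in-memory SQLite
# database stands in for the remote MySQL server used in the course.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy table
DBI::dbWriteTable(con, "flights", data.frame(
  carrier   = c("AA", "AA", "B6"),
  arr_delay = c(10, 30, 5)
))

# A lazy reference to the table; no rows are pulled into R yet
flights_db <- tbl(con, "flights")

query <- flights_db %>%
  group_by(carrier) %>%
  summarize(n_flights = n(), avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(avg_delay)

# Print the SQL that dplyr generates for the pipeline
show_query(query)

# collect() runs the query and downloads only the result
result <- collect(query)
result

DBI::dbDisconnect(con)
```

Only `collect()` brings data into R; until then the pipeline is just a description of a query that the database will run.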

Example:

1. Try to understand the code that creates nycflights, a reference to a MySQL table.
2. Use glimpse() to check out nycflights. Although nycflights is a reference to a tbl in a remote database, there is no difference in syntax. Look carefully: the variable names in nycflights differ from the ones in hflights!
3. Group nycflights data by carrier, then summarize() with two variables: n_flights, the number of flights flown by each carrier and avg_delay, the average arrival delay of flights flown by each carrier. Finally, arrange the carriers by average delay from low to high.


    # Set up a connection to the MySQL database
    # (newer dplyr versions recommend DBI::dbConnect() plus tbl() instead of src_mysql())
    my_db <- src_mysql(dbname = "dplyr", 
                       host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com", 
                       port = 3306, 
                       user = "student",
                       password = "datacamp")

    #Reference a table within that source: nycflights
    nycflights <- tbl(my_db, "dplyr")

    #glimpse at nycflights
    glimpse(nycflights)

    # Ordered, grouped summary of nycflights
    nycflights %>%
      group_by(carrier) %>%
      summarize(n_flights = n(), avg_delay = mean(arr_delay)) %>%
      arrange(avg_delay)
