# Welcome to the Introduction to R Workshop 

This workshop is intended to serve as an introduction to the R programming language for those with little to no experience in its use. 

## In this workshop you will learn:
- Basics of R (objects, variables, data classes, vectors)
- How to write R functions 
- How to import and export your own files 
- How to install and load R and Bioconductor packages

### This workshop uses JupyterHub, which provides an environment to run Jupyter notebooks for Python, Julia, R, and other languages without the need to install any software or packages. A few helpful shortcuts:
- Press Shift + Enter to run a cell and move to the next cell.
- Press 'b' to create a cell below (this won't work if your cursor is inside the cell itself).
- See other shortcuts here: https://towardsdatascience.com/jypyter-notebook-shortcuts-bf0101a98330.

### If you want to continue using R outside of a Jupyter notebook, you'll need to install R - https://www.r-project.org/ and optionally RStudio - https://rstudio.com/.

# What is R?
- R is a language and environment for statistical computing and graphics (https://www.r-project.org/about.html). 
- It is free and can be tailored to your needs by installing and using specific packages (https://cran.r-project.org/web/packages/available_packages_by_name.html).
- You can use R interactively by entering commands into the R console, RStudio (https://rstudio.com/), or new cells in your Jupyter notebook.
- You can write and run R scripts or generate reports and documents using R notebooks (https://blog.rstudio.com/2016/10/05/r-notebooks/) and markdown files (https://rmarkdown.rstudio.com/).

# Basics - Interacting with R
- You can use R interactively and have it do some simple math:

In [None]:
2 - 1

- It is generally more useful to assign your values to **variables**, which are R objects.
- You can assign values to R variables using the assignment operator **'<-'**, which assigns the value on the right to the variable on the left.
- There are other operators as well (like the `=` sign) but I would suggest you stick with the `<-` operator for now.
- Everything in R (including variables) is an object (https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects)

<div class="alert alert-block alert-info">
<b>Tip:</b> We will use the words `object` and `variable` interchangeably in this workshop.
</div>

Let's assign a variable called `a`:

In [None]:
a <- 1

R won't print anything when you assign a value to a variable. We can look at the output of our assignment by typing `a`.

In [None]:
a

- R also comes with a `print()` function that we can use to look at our variables.
- We will talk more about functions later in this workshop, but a **function** is a series of statements that work together to perform a specific task.
- Functions take pieces of information (or **arguments**) to perform their task. These arguments can be required or optional.
- `print()` takes a single required argument -- the thing you want to print.

<div class="alert alert-block alert-info">
<b>Tip:</b> You can use `?function` in R to learn more about a particular function (for example: `?print`).
</div>

- Let's print variable `a`:

In [None]:
print(a)

We can name our variables using any combination of letters, numbers, or underscores (`_`), with a few exceptions:
- R has a few reserved words that can't be used as variable names:
    - `if`, `else`, `repeat`, `while`, `function`, `for`, `in`
    - See the whole list of reserved words here: https://stat.ethz.ch/R-manual/R-devel/library/base/html/Reserved.html
- Variable names can't start with a number or an underscore.
- You can technically use `.` in your variable names, but this is best avoided for now.

For example:

In [None]:
bar <- 11
cat_1 <- 'cat' 
dog_ <- TRUE
egg <- (1L)
foo <- 2i
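One way to check the naming rules above is the base R function `make.names()`, which rewrites any string into a syntactically valid name. A name is valid when `make.names()` leaves it unchanged. This is only a sketch, and the `candidate_names` vector is made up for illustration:

```r
# Hypothetical candidate names: one valid, three that break a rule
candidate_names <- c('cat_1', '1cat', '_dog', 'if')

# make.names() repairs invalid names, so "unchanged" means "already valid"
is_valid <- make.names(candidate_names) == candidate_names
names(is_valid) <- candidate_names
print(is_valid)
```

Only `cat_1` passes: `1cat` starts with a number, `_dog` starts with an underscore, and `if` is a reserved word.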

Now that we have assigned some values to variables, we can start using them:

In [None]:
print(bar)

In [None]:
print(bar * 2)

In [None]:
print(a)

In [None]:
print(a + bar)

# Data classes

- The variables you assign have some sort of data class associated with them.
- Data classes impact how functions will interact with your variables.
- R has 5 basic data classes, including:

1. `character`, which is a text string (one or more characters in quotes)
2. `numeric`, which is a real number stored with double precision; it can hold whole numbers (like `11`) or numbers with a decimal point (like `11.5`)
3. `integer`, which is a whole number with no decimal part, written with an `L` suffix (like `1L`)
4. `logical`, which can be either `TRUE` or `FALSE`
5. `complex`, which allows you to use imaginary numbers 

- You can also have missing data, which we will also talk about later in this workshop.
- The difference between an integer and a whole-number numeric is subtle and probably isn't worth worrying about at this point.

Let's look at the data classes of our variables we have just assigned:

In [None]:
class(a)
class(bar)
class(cat_1)
class(dog_)
class(egg)
class(foo)

- The `L` we included when we assigned `egg` tells R that this object is an integer.
- The `i` we included when we assigned `foo` indicates an imaginary number, making `foo` an object of the complex data class.
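Here is a minimal sketch of how the `L` and `i` suffixes change the class of a value, using `class()` and the base R test function `is.integer()`. The variable names are chosen only for this illustration:

```r
whole_numeric <- 1    # no suffix: stored as numeric (double precision)
whole_integer <- 1L   # L suffix: stored as integer
imaginary <- 2i       # i suffix: stored as complex

print(class(whole_numeric))
print(class(whole_integer))
print(class(imaginary))

# The two whole numbers compare as equal even though their classes differ
print(whole_numeric == whole_integer)
print(is.integer(whole_numeric))
print(is.integer(whole_integer))
```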

In [1]:
#?class

- The `#` before `?class` in the above cell means that this piece of code is actually a comment and won't be run.
- You can remove the `#` to run the cell and learn more about `class()`.
- The `#` comment characters are helpful -- you can use them to explain your code to future you (or other users of your code).

# R data structures
- R objects can also contain more than one element.
- Objects that contain more than one element are organized into different data structures.
- Data structures in R include vectors (also referred to as 'atomic vectors' in R), lists, matrices, arrays, and data frames.

### Vectors
- Probably the simplest R object that contains more than one element is a **vector**. 
- You can create a vector using the concatenate function, `c()`, or with shortcuts such as the `:` operator. 
- The `c()` function will coerce all of the arguments to a common data type and combine them to form a vector. 
- Here are a few examples of how you can assign vectors:

In [None]:
numeric_vector <- c(1,2,3,4,5) 
character_vector <- c('one', 'two', 'three', 'four', 'five') 
integer_vector <- (6:12) 
logical_vector <- c(TRUE, TRUE, FALSE)
character_vector_2 <- c('a', 'pug', 'is', 'not', 'a', 'big', 'dog')

Note that I used `:` when assigning `integer_vector`, which generates the sequence of integers from 6 through 12 (an integer vector, not a list).

In [None]:
print(numeric_vector)
print(character_vector)
print(integer_vector)
print(logical_vector) 
print(character_vector_2)

Vectors also have class:

In [None]:
print(class(numeric_vector))
print(class(character_vector))
print(class(integer_vector))
print(class(logical_vector)) 
print(class(character_vector_2))
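Because `c()` coerces its arguments to a common type, mixing classes in one vector silently changes them. The sketch below (with made-up variable names) shows two common cases: a logical combined with a number becomes numeric, and anything combined with a character becomes character.

```r
mixed_numeric <- c(1, TRUE)            # TRUE is coerced to the number 1
mixed_character <- c(1, 'two', TRUE)   # every element is coerced to text

print(mixed_numeric)
print(class(mixed_numeric))

print(mixed_character)
print(class(mixed_character))
```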

You can combine vectors using `c()`

In [None]:
combined_vector <- c(numeric_vector, integer_vector)

In [None]:
print(combined_vector)

You can use the `length()` function to see how long your vectors are:

In [None]:
print(length(combined_vector))

You can also access elements of the vector based on the index (or its position in the vector):

In [None]:
print(combined_vector[2])

You can combine these operations, but note that R code evaluates from the inside out:

In [None]:
print(combined_vector[length(combined_vector)])

Here, R is reading `length(combined_vector)` first. The value returned by the `length()` function is then used to access the last entry in the `combined_vector` vector.

You can also name vector elements and then access them by their names:

In [None]:
names(numeric_vector) <- c('one', 'two', 'three', 'four', 'five')
print(numeric_vector)

In [None]:
print(numeric_vector['three'])

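We can use a negative index (for example `-c(4)`, or simply `-4`) to drop elements by position. This returns a copy without the fourth element; `combined_vector` itself is unchanged: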

In [None]:
print(combined_vector[-c(4)])
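Positive and negative indices also accept more than one position at a time. A small sketch, using a made-up `example_vector` so the positions are easy to follow:

```r
example_vector <- c(10, 20, 30, 40, 50)

kept <- example_vector[c(2, 4)]       # keep only positions 2 and 4
dropped <- example_vector[-c(1, 3)]   # drop positions 1 and 3

print(kept)
print(dropped)

# Indexing returns a new vector; the original is not modified
print(length(example_vector))
```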

If a vector is numerical, we can also perform some math operations on the entire vector. Here, we can calculate the sum of a vector:

In [None]:
print(combined_vector)
print(sum(combined_vector))

In [None]:
print(combined_vector/sum(combined_vector))

Use the `round()` function to round the results to 3 decimal places and assign them to a variable called `rounded`:

In [None]:
rounded <- round((combined_vector/sum(combined_vector)), digits = 3)
print(rounded)

You can also perform math operations on two vectors...

In [None]:
print(rounded + combined_vector)

but you'll get weird results if the vectors are different lengths:

In [None]:
print(combined_vector)
print(numeric_vector)

In [None]:
print(combined_vector + numeric_vector)

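R recycles the shorter vector: when it runs out of elements, it goes back to the start of the shorter vector. Because the length of `combined_vector` (12) is not a multiple of the length of `numeric_vector` (5), R also prints a warning message (not an error), and the calculation still completes.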
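To make the recycling behavior concrete, here is a sketch with two made-up vectors whose lengths divide evenly (6 and 3), so R recycles without a warning. Writing the repetition out with `rep_len()` gives the same result:

```r
long_vector <- c(1, 2, 3, 4, 5, 6)
short_vector <- c(10, 20, 30)

# short_vector is reused from its start once its elements run out
recycled_sum <- long_vector + short_vector
print(recycled_sum)

# The same calculation with the recycling written out explicitly
explicit_sum <- long_vector + rep_len(short_vector, length(long_vector))
print(identical(recycled_sum, explicit_sum))
```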

**Coercing between classes**

Let's say you're trying to import some data into R, maybe a vector of measurements:

In [None]:
your_data <- c('6','5','3','2','11','0','9','9')
class(your_data)

Your vector is a character vector because the elements of the vector are in quotes. You can coerce them back into numeric values using `as.numeric()`:

In [None]:
your_new_data <- as.numeric(your_data)
print(your_new_data)
class(your_new_data)

What happens if we try to `as.numeric` things that aren't numbers?

In [None]:
as.numeric(character_vector_2)

<div class="alert alert-block alert-warning">
<b>Warning:</b> <b>NA</b> marks a missing value. R returns <b>NA</b> (and prints a warning) for every element it cannot convert, so be careful when converting between classes.
</div>
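When a conversion produces `NA` values, it helps to find out which inputs caused them. The sketch below uses a made-up `raw_values` vector; `suppressWarnings()` is used only to keep the output quiet, and the approach assumes the input had no missing values to begin with:

```r
raw_values <- c('6', 'five', '3', '2.5')

# Elements that cannot be read as numbers become NA
converted <- suppressWarnings(as.numeric(raw_values))
print(converted)

# Use the NA positions to look up the original entries that failed
failed <- raw_values[is.na(converted)]
print(failed)
```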

# Missing values
Missing values can result from things like inappropriate coercion, Excel turning everything into a date, encoding format problems, etc.

In [None]:
here_is_a_vector <- as.numeric(c(4/61, 35/52, '19-May', 3/40))

We can use the `is.na()` function to see if our vector has any `NA` values in it:

In [None]:
is.na(here_is_a_vector)

You can combine this with the `table()` function to see some tabulated results from `is.na()`:

In [None]:
table(is.na(here_is_a_vector))
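Once you know a vector contains missing values, you usually need to count them or work around them. A short sketch with a made-up `measurements` vector, using only base R:

```r
measurements <- c(4, NA, 10, 7)

# TRUE counts as 1, so summing is.na() counts the missing values
n_missing <- sum(is.na(measurements))
print(n_missing)

# Most summary functions return NA unless told to ignore missing values
print(mean(measurements))
mean_without_na <- mean(measurements, na.rm = TRUE)
print(mean_without_na)

# Keep only the values that are not missing
complete_values <- measurements[!is.na(measurements)]
print(complete_values)
```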

You might also encounter a `NaN`, which means 'not a number' and is the result of invalid math operations:

In [None]:
0/0

`NULL` is another one you might encounter, and it is the result of trying to query a parameter that is undefined for a specific object. For example, you can use the `names()` function to retrieve names assigned to an object. What happens when you try to use this function on an object you haven't named?

In [None]:
names(here_is_a_vector)

You might also see `Inf` or `-Inf`, which are positive or negative infinity. These result from dividing a non-zero number by zero or from operations that do not converge:

In [None]:
1/0
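Base R has a test function for each of these special values. The sketch below applies them to a made-up vector so you can compare the results side by side; note that `is.na()` is also `TRUE` for `NaN`:

```r
special_values <- c(1, NA, NaN, Inf, -Inf)

print(is.na(special_values))        # TRUE for both NA and NaN
print(is.nan(special_values))       # TRUE only for NaN
print(is.infinite(special_values))  # TRUE for Inf and -Inf
print(is.finite(special_values))    # TRUE only for ordinary numbers

# names() returns NULL for an unnamed vector, which is.null() detects
print(is.null(names(special_values)))
```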

# Matrices
- A matrix in R is a collection of elements organized into rows and columns.
- All elements must be the same data type, and all columns must be the same length.
- Generate a matrix using the following general format:

```
my_matrix <- matrix(
    vector, 
    nrow = r, 
    ncol = c, 
    byrow = FALSE)
```

For example:

In [None]:
my_matrix <- matrix(
    c(1:12), 
    nrow = 3, 
    ncol = 4, 
    byrow = FALSE)

print(my_matrix)

In the above code, we created `my_matrix` from the vector `c(1:12)`, with 3 rows (`nrow = 3`) and 4 columns (`ncol = 4`). Because `byrow = FALSE`, the values fill the matrix column by column rather than row by row.
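To see what the `byrow` argument changes, this sketch builds two small matrices from the same values (the names `by_column` and `by_row` are made up for the comparison):

```r
# Fill down each column first (the default)
by_column <- matrix(1:6, nrow = 2, ncol = 3, byrow = FALSE)
print(by_column)

# Fill across each row first
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
print(by_row)

# The first row shows the difference most clearly
print(by_column[1, ])
print(by_row[1, ])
```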

We can access the rows and columns by their numerical index using a `[row, column]` format.
For example, here's how we access row 3 and column 4:

In [None]:
my_matrix[3,4]

Access entire row 3:

In [None]:
print(my_matrix[3,])

Access entire column 4:

In [None]:
print(my_matrix[,4])

You can also name the rows and columns and then access them by name. 
For example, let's name the rows and columns of `my_matrix`:

In [None]:
dimnames(my_matrix) <- list(
    c('row_1', 'row_2', 'row_3'), 
    c('column_1', 'column_2', 'column_3', 'column_4'))

You can also name the rows and columns separately using `rownames()` and `colnames()` 

In [None]:
rownames(my_matrix) <- c('row_1', 'row_2', 'row_3')
colnames(my_matrix) <- c('column_1', 'column_2', 'column_3', 'column_4')

In [None]:
print(my_matrix)

In [None]:
print(my_matrix['row_2',])

In [None]:
print(my_matrix[,'column_2'])

# Arrays
- An array is similar to a matrix, but it can have more than 2 dimensions (e.g., more than just rows and columns).
- So, an array with 1 dimension is similar to a vector and an array with 2 dimensions is similar to a matrix.
- Generate an array with the following generic format:

```
my_array <- array(vector, dim = c(rows, columns, other_dims))
```

In [None]:
my_col_array <- array(
    c(1:12),
    dim = c(12,1,1))
print(my_col_array)

In [None]:
my_row_array <- array(
    c(1:12),
    dim = c(1,12,1))
print(my_row_array)

In [None]:
my_array <- array(
    c(1:12),
    dim = c(3,4,1))
print(my_array)

In [None]:
another_array <- array(
    c(1:24),
    dim = c(3,4,2))
print(another_array)

Access elements of arrays like this `[row, column, other_dims]`

In [None]:
print(another_array[3,2,1])

In [None]:
print(another_array[3,2,2])

You can also give your array some `dimnames()`:

In [None]:
dimnames(another_array) <- list(
    c('row_1', 'row_2', 'row_3'), 
    c('column_1', 'column_2', 'column_3', 'column_4'),
    c('matrix_1', 'matrix_2'))
print(another_array)

Then access your array elements by name:

In [None]:
print(another_array['row_3', 'column_2', 'matrix_1'])

In [None]:
print(another_array['row_3',,'matrix_1'])

# Lists

- Lists in R are very flexible: they are collections of elements that can have different classes and structures. You can even have lists of lists.
- You make lists using the `list()` function (or by coercion using `as.list()`).

In [None]:
my_list <- list(character_vector, my_array, my_matrix)
print(my_list)

Use `[[]]` to access list elements:

In [None]:
print(my_list[[3]])

Add more brackets to access sub-elements of a list:

In [None]:
print(my_list[[3]][1])

In [None]:
print(my_list[[3]][1,])

Name the list elements:

In [None]:
names(my_list) <- c('character_vector', 'my_array', 'my_matrix')
print(my_list)
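Once a list has names, you can access elements by name as well as by position. The sketch below uses a small made-up list, `pet_record`, to show the difference between single brackets (which return a shorter list) and double brackets or `$` (which return the element itself):

```r
pet_record <- list(name = 'Boo', visits = c(1, 3, 5), vaccinated = TRUE)

single_bracket <- pet_record['visits']     # a list containing one element
double_bracket <- pet_record[['visits']]   # the numeric vector itself
dollar <- pet_record$visits                # shorthand for [['visits']]

print(class(single_bracket))
print(class(double_bracket))
print(identical(double_bracket, dollar))
```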

Use `unlist()` if you want to convert a list to a vector. Let's make a new list (`list_1`):

In [None]:
list_1 <- list(1:5)
print(list_1)

Use `str()` to look at the structure

In [None]:
str(list_1)

Then `unlist()` and look at the structure

In [None]:
print(unlist(list_1))
str(unlist(list_1))

# Data Frames

- A data frame is another way to organize a collection of rows and columns.
- It is a list of equal-length vectors, with each vector forming one column.
- It is similar to a matrix, except data frames allow different data types in different columns.
- We can use the `data.frame()` function to create a data frame from vectors using the following format:

```
dataframe <- data.frame(column_1, column_2, column_3)
```

In [None]:
example_df <- data.frame(
    c('a','b','c'), 
    c(1, 3, 5), 
    c(TRUE, TRUE, FALSE))

print(example_df)

Use `names()` or `colnames()` to name columns,  `rownames()` to name rows, or `dimnames()` to assign both column and row names to the data frame:

In [None]:
colnames(example_df) <- c('letters', 'numbers', 'boolean')
rownames(example_df) <- c('first', 'second', 'third')
print(example_df)

In [None]:
names(example_df) <- c('_letters_', '_numbers_', '_boolean_')
print(example_df)

In [None]:
dimnames(example_df) <- list(c('__first', '__second', '__third'), c('__letters', '__numbers', '__boolean'))
print(example_df)

We can use the `attributes()` and `str()` functions to get some information about our data frame:

In [None]:
attributes(example_df)

In [None]:
str(example_df)
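As an alternative to naming columns after the fact, `data.frame()` also accepts `name = vector` pairs, which sets the column names when the data frame is created. A minimal sketch (the name `named_df` is made up here):

```r
named_df <- data.frame(
    letters = c('a', 'b', 'c'),
    numbers = c(1, 3, 5),
    boolean = c(TRUE, TRUE, FALSE))

print(named_df)
print(colnames(named_df))
print(dim(named_df))
```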

# Adding columns to a data frame

Let's make a new example dataframe to work with:

In [None]:
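# stringsAsFactors = TRUE stores text columns as factors; this was the
# default before R 4.0.0 and is what the factor examples below rely on
patients_1 <- data.frame(
    c('Boo','Rex','Chuckles'), 
    c(1, 3, 5), 
    c('dog', 'dog', 'dog'),
    stringsAsFactors = TRUE)
print(patients_1)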

Use `names()` or `colnames()` to name columns,  `rownames()` to name rows, or `dimnames()` to assign both column and row names to the data frame.
Here we will use `names()` to name the columns:

In [None]:
names(patients_1) <- c('name', 'number_of_visits', 'type')
print(patients_1)

We can use the column names to extract a single column using the notation `dataframe$column`, e.g.:

In [None]:
print(patients_1$name)

The `cbind()` function can be used to add more columns to a dataframe:

In [None]:
column_4 <- c(4, 2, 6)
patients_1 <- cbind(patients_1, column_4)
print(patients_1)

We can also rename individual columns of the dataframe using index notation. Let's rename the 4th column we just added:

In [None]:
colnames(patients_1)[4] <- 'age_in_years'
print(patients_1)

We can also use the `dataframe$column` notation to add a new column and name it at the same time:

In [None]:
patients_1$weight_in_pounds <- c(35, 75, 15)
print(patients_1)

Let's use `str()` and `attributes()` functions to look at the structure and attributes of this data frame:

In [None]:
str(patients_1)

In [None]:
attributes(patients_1$name)

**Notice that `patients_1$name` is a factor with three levels...**

# Factors
- In some situations, you might be dealing with a categorical variable, which is known as a factor variable in R. 
- A factor is a type of variable that has a set number of distinct categories (the levels) into which all observations fall.

*Factor variables are important because, in versions of R before 4.0.0, the default behavior of `data.frame()` and `read.table()` was to convert text into a factor variable rather than a character variable (`stringsAsFactors = TRUE`). This can lead to unexpected behavior if the user is trying to, e.g., search that text. Since R 4.0.0 the default is `stringsAsFactors = FALSE`, so text stays as character unless you ask for factors.*

- In addition to `cbind()` for adding columns, there is another function in R called `rbind()`, which adds new rows to a data frame.
- Let's see what happens when we try to add a new row to our data frame:

In [None]:
patients_1_rbind <- rbind(patients_1, c('Fluffy', 2, 'dog', 8, 105))
print(patients_1_rbind)

- The `patients_1$name` column is classed as a `factor`, and the factor's levels are `Boo`, `Chuckles`, and `Rex`. 
- Recall that a factor is a type of variable that has a **set number of distinct categories into which all observations fall, which are the levels.**
- R can't place the new value (`Fluffy`) into any existing level, so it inserts `NA` with a warning. To add the row, we first have to turn the factor into character strings.

We can convert the `patients_1$name` column to a character as follows:

In [None]:
patients_1$name <- as.character(patients_1$name)
str(patients_1)

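Now we can use `rbind()` to add a new row. Passing the row as a `list()` rather than with `c()` keeps the numbers numeric; `c()` would coerce every value in the row to character first:

In [None]:
patients_1 <- rbind(patients_1, list('Fluffy', 2, 'dog', 8, 105))
print(patients_1)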
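Converting to character is one way around a missing level. Another option, sketched here on a small made-up factor, is to extend the levels before assigning the new value. `suppressWarnings()` is used only to keep the first, failing assignment quiet:

```r
pet_names <- factor(c('Boo', 'Rex', 'Chuckles'))
print(levels(pet_names))

# Assigning a value that is not an existing level produces NA
with_na <- pet_names
suppressWarnings(with_na[4] <- 'Fluffy')
print(with_na)

# Adding the level first lets the assignment succeed
extended <- pet_names
levels(extended) <- c(levels(extended), 'Fluffy')
extended[4] <- 'Fluffy'
print(extended)
```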

# Re-ordering factor levels
- You might have ordinal data, like the following:

In [None]:
sizes <- factor(c('extra small', 'small', 'large', 'extra large', 'large', 'small', 'medium', 'medium', 'medium', 'medium', 'medium'))

Use the `table()` function to look at the vector:

In [None]:
table(sizes)

We might not necessarily want the factor levels in alphabetical order. You can re-order them like so:

In [None]:
sizes_sorted <- factor(sizes, levels = c('extra small', 'small', 'medium', 'large', 'extra large'))
table(sizes_sorted)

You can also use the `relevel()` function to specify a single level you'd like to use as the reference level, which will now be the first level:

In [None]:
sizes_releveled <- relevel(sizes, 'medium')
table(sizes_releveled)

You can also coerce a factor to a character:

In [None]:
character_vector <- as.character(sizes)
class(character_vector)
print(character_vector)

Notice that print doesn't return the `Levels` and each element of the vector is now in quotes.
It is also possible to convert a factor into a numeric vector if you want to:

In [None]:
print(sizes)
numeric_vector <- as.numeric(sizes)
print(numeric_vector)

This assigns numerical values based on alphabetical order of `sizes`

In [None]:
print(sizes_sorted)
ordered_numeric_vector <- as.numeric(sizes_sorted)
print(ordered_numeric_vector)

This assigns numerical values based on the levels you set when you created `sizes_sorted`
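Because `as.numeric()` returns the level codes of a factor, it can give surprising results when the factor labels are themselves numbers. The sketch below uses a made-up factor; converting to character first recovers the original values:

```r
number_factor <- factor(c('10', '5', '20'))

# These are the positions of each value among the levels, not the values
codes <- as.numeric(number_factor)
print(codes)

# Convert to character first to get the numbers written in the labels
values <- as.numeric(as.character(number_factor))
print(values)
```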

# Data frame merging
- Data is often spread across more than one file, and reading each file into R will result in more than one data frame. 
- If the data frames have some common identifying column, we can use that common ID to combine the data frames. 

For example:

In [None]:
print(patients_1)

Let's make another data frame:

In [None]:
patients_2 <- data.frame(
    c('Fluffy', 'Smokey', 'Kitty'), 
    c(1, 1, 2), 
    c('cat', 'dog', 'cat'),
    c(1, 3, 5))
colnames(patients_2) <- c('name', 'number_of_visits', 'type', 'age_in_years')
print(patients_2)

We can use the `merge()` function to combine them:

In [None]:
patients_df <- merge(patients_1, patients_2, all = TRUE)
print(patients_df)

- Using `all = TRUE` will fill in blank values if needed (for example, the weight of any of the animals in `patients_2`).
- Using the `all.x = TRUE` argument will return all values in the `patients_1` dataframe, as well as any entries with the same ID column(s) from `patients_2`.

In [None]:
patients_df <- merge(patients_1, patients_2, all.x = TRUE)
print(patients_df)

- Using the `all.y = TRUE` argument will return all values in the `patients_2` dataframe, as well as any entries with the same ID column(s) from `patients_1`.

In [None]:
patients_df <- merge(patients_1, patients_2, all.y = TRUE)
print(patients_df)

You can also specify which columns to join on:

In [None]:
patients_df <- merge(patients_1, patients_2, by = c('name', 'type', 'number_of_visits', 'age_in_years'), all = TRUE)
print(patients_df)
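The effect of `all`, `all.x`, and `all.y` is easier to see on two tiny data frames that share a single ID column. This is only a sketch with made-up data (`visits` and `weights`); note that without any `all` argument, `merge()` keeps only the rows whose ID appears in both data frames:

```r
visits <- data.frame(
    name = c('Boo', 'Rex', 'Fluffy'),
    number_of_visits = c(1, 3, 2))
weights <- data.frame(
    name = c('Rex', 'Fluffy', 'Kitty'),
    weight_in_pounds = c(75, 105, 9))

# Default: only names found in both data frames
in_both <- merge(visits, weights, by = 'name')
print(in_both)

# all.x = TRUE: every row of visits, with NA where weights has no match
all_visits <- merge(visits, weights, by = 'name', all.x = TRUE)
print(all_visits)

# all = TRUE: every row from both data frames
everything <- merge(visits, weights, by = 'name', all = TRUE)
print(everything)
```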

# Built-in functions
- We have already used a few functions in this workshop (like `print()`). 
- Functions are a series of statements that work together to perform a specific task.
- Functions take pieces of information (or arguments) to perform their task. 
- Sometimes arguments are required, sometimes arguments are optional -- for example, `print()` requires only one argument -- the thing you want to print.
- R comes with some pre-loaded data sets -- you can see the list by typing `print(data())`, but it is quite long.

Load the `DNase` data and turn it into a data frame:


In [None]:
data(DNase)
DNase <- data.frame(DNase)

Let's use the `dim()`, `nrow()`, and `ncol()` functions to get the number of rows (`nrow()`), number of columns (`ncol()`), and number of both rows and columns (`dim()`):

In [None]:
dim(DNase)

In [None]:
nrow(DNase)

In [None]:
ncol(DNase)

We can use the `head()` function to look at the first few lines of the data frame:

In [None]:
head(DNase)

You can use the `n` argument to look at a different number of lines

In [None]:
head(DNase, n = 3)

We can use the `tail()` function to look at the last few lines of the data frame:

In [None]:
tail(DNase, n = 5)

The summary function, which can be applied to either a vector or a data frame (in the latter case, R applies it separately to each column in the data frame) yields a variety of summary statistics about each variable. 

In [None]:
summary(DNase)

`summary()` is informative for numerical data, but not so helpful for factor data, as in the `Run` column. 
Let's make a smaller subset of the `DNase` data to work with:

In [None]:
DNase_subset <- DNase[1:20, ]
DNase_subset

We can also sort our data. Let's look at the `conc` column:

In [None]:
print(DNase_subset$conc)

Use the `order()` function to figure out the ascending rankings of the values

In [None]:
order(DNase_subset$conc)

We can assign this ordering to a vector:

In [None]:
reorder_vector <- order(DNase_subset$conc)

And use it to reorder our data frame:

In [None]:
DNase_subset[reorder_vector, ]
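It can be easy to confuse `order()` with sorting. `order()` returns positions, not values, and those positions are then used as an index. A small sketch with a made-up `scores` vector:

```r
scores <- c(30, 10, 20)

# Positions of the smallest, next smallest, ... value
ascending_index <- order(scores)
print(ascending_index)

# Using the positions as an index sorts the values
sorted_scores <- scores[ascending_index]
print(sorted_scores)

# decreasing = TRUE reverses the ordering
descending_scores <- scores[order(scores, decreasing = TRUE)]
print(descending_scores)
```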

Data frames can be classified into two broad categories: wide format and long format. All data frames shown so far have been presented in wide format. A wide format data frame has each row describe a sample and each column describe a feature. Here is a short example of a data frame in wide format, tabulating counts for three genes in three patients:

In [None]:
wide_df <- data.frame(c("A", "B", "C"), c(1, 1, 2), c(5, 6, 7), c(0, 1, 0))
colnames(wide_df) <- c("id", "gene.1", "gene.2", "gene.3")
wide_df

Long format stacks features on top of one another; each row is the combination of a sample and a feature. One column exists to denote the feature in question, and another column exists to denote that feature's value:

In [None]:
long_df <- data.frame(c("A", "A", "A", "B", "B", "B", "C", "C", "C"), c("gene.1", "gene.2", "gene.3", "gene.1", "gene.2", "gene.3", "gene.1", "gene.2", "gene.3"), c(1, 5, 0, 1, 6, 1, 2, 7, 0))
colnames(long_df) <- c("id", "gene", "count")
long_df

These formats both contain the exact same data but represent it in different ways. Various functions exist to convert between wide and long format but these are beyond the scope of today's discussion. You can look up the `reshape2` or `tidyr` packages if you're interested in learning more about converting between long and wide formats -- alternatively, check out our tidyverse workshop.

# User defined functions
- In addition to the already available functions in R, you can also create your own functions. 
- Generally, if you find yourself re-writing the same pieces of code over and over again, it might be time to write a function. 

Functions take the following basic format:

```
myfunction <- function(argument_name){
  stuff <- this is the body of the function(
    it contains statements that use argument_names
    to do things and make stuff)
  return(stuff)
}
```

More formally, R functions are broken up into 3 pieces:
1. formals() - the list of arguments
2. body() - the code inside the function
3. environment() - determines how the function finds the values associated with the names it uses

Here's an example of a function called `roll()` that rolls any number of 6-sided dice:

In [None]:
roll <- function(number_of_dice){
    rolled_dice <- sample(
        x = 6, 
        size = number_of_dice, 
        replace = TRUE)
    return(rolled_dice)
}

- The built-in R function `sample()` is nested inside our `roll()` function.
- `roll()` passes its argument `number_of_dice` to `sample()` as `size`. `x` is the number of sides on the die, which we have hard-coded as `6`, and `replace = TRUE` means that we are sampling the space of all potential die roll outcomes with replacement (so the same number can come up more than once).
- Lastly, we tell the function what it should return (`rolled_dice`).

To call that function and print the output:

In [None]:
print(roll(number_of_dice = 10))

Let's look at the `formals()`:

In [None]:
formals(roll)

What about `body()`?

In [None]:
body(roll)

What about `environment()`? 

In [None]:
environment(roll)

So, the function itself is called `roll`, it takes one argument (its formals), `number_of_dice`, and the body of the function uses the built-in `sample()` function in R to simulate dice rolls (use `?sample` to learn more about the `sample()` function). 

## Anonymous functions

- You can also have something called an **anonymous function**, where you write the function but don't assign it to an object.
- The general format is:
```
(function(
     argument_name) 
     statements that use argument_name to create an object
  )(
    argument_name = argument)
```

In [None]:
(function(
    anonymous_dice) 
    sample(
        x = 6, 
        size = anonymous_dice, 
        replace = TRUE)
 )(
    anonymous_dice = 5)
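Anonymous functions are most often passed straight to another function. As a sketch, the base R function `sapply()` applies a function to each element of a vector, so a short anonymous function can be written in place without naming it:

```r
# Square each number from 1 to 5 with an unnamed function
squares <- sapply(1:5, function(x) x^2)
print(squares)
```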

## More on user defined functions
- We can also have functions that take more than one argument. 
- Let's say we want to roll different numbers of dice (`number_of_dice`) and we want to change the size of the dice we roll (`number_of_sides`).

In [None]:
roll <- function(
    number_of_dice, 
    number_of_sides){
    rolled_dice <- sample(
        x = number_of_sides, 
        size = number_of_dice, 
        replace = TRUE)
    return(rolled_dice)
}

- The new `roll()` uses the `sample()` function again, but this time it uses the `number_of_dice` and `number_of_sides`

In [None]:
print(roll(number_of_dice = 5, number_of_sides = 20))
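Arguments can also be made optional by giving them a default value in the function definition. The sketch below is a variation on `roll()` under a different, made-up name (`roll_with_default`) so it does not overwrite the workshop's function; `set.seed()` is used only to make the random rolls repeatable:

```r
roll_with_default <- function(
    number_of_dice, 
    number_of_sides = 6){   # used when the caller does not supply a value
    rolled_dice <- sample(
        x = number_of_sides, 
        size = number_of_dice, 
        replace = TRUE)
    return(rolled_dice)
}

set.seed(1)
six_sided <- roll_with_default(number_of_dice = 10)
twenty_sided <- roll_with_default(number_of_dice = 10, number_of_sides = 20)
print(six_sided)
print(twenty_sided)
```

Here `number_of_dice` is still required, while `number_of_sides` falls back to `6` unless you override it.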

- Let's say we want to roll different numbers of dice (`number_of_dice`) and we want to change the size of the dice we roll (`number_of_sides`) as well as tweak the number of times we roll the dice (`number_of_rolls`).
- We can use `replicate()` and `sample()`

In [None]:
roll <- function(
    number_of_rolls, 
    number_of_sides, 
    number_of_dice){
    rolled_dice <- replicate(
        number_of_rolls, 
        sample(
            x = number_of_sides, 
            size = number_of_dice, 
            replace = TRUE))
    return(rolled_dice)
}

- So, in the above function we use `number_of_dice`, `number_of_sides`, and `number_of_rolls` as arguments. 
- The `sample()` function takes the arguments `number_of_sides` and `number_of_dice`
- The `replicate()` function takes `number_of_rolls` as an argument.

In [None]:
rolled_dice <- roll(number_of_rolls = 10, number_of_sides = 20, number_of_dice = 5)
print(rolled_dice)

You can use `colSums()` or `rowSums()` to calculate the sum of the columns and rows:

In [None]:
print(colSums(rolled_dice))

In [None]:
print(rowSums(rolled_dice))

We can also write `roll()` as an **anonymous function**:

In [None]:
print(
    (function(
        number_of_dice, 
        number_of_sides, 
        number_of_rolls) 
        replicate(
            number_of_dice, 
            sample(
                1:number_of_sides, 
                number_of_rolls, 
                replace = TRUE))
     )(
        number_of_dice = 5, 
        number_of_rolls = 10, 
        number_of_sides = 20))

Let's make another anonymous function that makes a boxplot of our dice rolls:

In [None]:
(function(number_of_dice, 
          number_of_sides, 
          number_of_rolls) 
    boxplot((
        replicate(
            number_of_dice, 
            sample(
                1:number_of_sides, 
                number_of_rolls, 
                replace = TRUE))))
 )(
    number_of_dice = 5, 
    number_of_rolls = 10, 
    number_of_sides = 20)

We can give the boxplot a title:

In [None]:
(function(number_of_dice, 
          number_of_sides, 
          number_of_rolls) 
    boxplot((
        replicate(
            number_of_dice, 
            sample(
                1:number_of_sides, 
                number_of_rolls, 
                replace = TRUE))), 
            main = 'here is a boxplot of some dice rolls')
 )(
    number_of_dice = 5, 
    number_of_rolls = 10, 
    number_of_sides = 20)

We can use `paste()` to pass the function arguments as parts of the title for the figure, by adding `main = paste('the ' , number_of_dice, ' ', number_of_sides, '-sided dice were rolled ', number_of_rolls, ' times', sep='')`

In [None]:
(function(number_of_dice, 
          number_of_sides, 
          number_of_rolls) 
    boxplot((
        replicate(
            number_of_dice, 
            sample(
                1:number_of_sides, 
                number_of_rolls, 
                replace = TRUE))), 
            main = paste(
        'the ' , number_of_dice, ' ', number_of_sides, '-sided dice were rolled ', number_of_rolls, ' times', 
                sep=''))
 )(
    number_of_dice = 5, 
    number_of_rolls = 10, 
    number_of_sides = 20)

We can add some colors to the figure by adding `col = c(1:number_of_dice)`. This generates enough colors so that each box has a different color:

In [None]:
(function(number_of_dice, 
          number_of_sides, 
          number_of_rolls) 
    boxplot((
        replicate(
            number_of_dice, 
            sample(
                1:number_of_sides, 
                number_of_rolls, 
                replace = TRUE))), 
            main = paste(
        'the ' , number_of_dice, ' ', number_of_sides, '-sided dice were rolled ', number_of_rolls, ' times', 
                sep=''), 
           col = c(1:number_of_dice))
 )(
    number_of_dice = 5, 
    number_of_rolls = 10, 
    number_of_sides = 20)

# Importing and Exporting files
- There are a few different ways to read and write files in R.
- We will use `read.table()` and `write.table()`.
- Let's use some of the pre-loaded data that comes with R. 
- First, let's import the `iris` data as a data frame and use `head()` to look at the first few lines.

In [None]:
iris <- data.frame(iris)
head(iris)

You can write the output to a file using `write.table`:

In [None]:
write.table(iris, file = '~/iris_table.txt')

Use `read.table()` to pull data into R:

In [None]:
iris_table_2 <- read.table('~/iris_table.txt')

In [None]:
head(iris_table_2)

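Look at the type shown under the `Species` column name. In R versions before 4.0, `read.table()` converts text columns to factors by default, so `Species` is a factor (`<fct>`). From R 4.0 onward the default is `stringsAsFactors = FALSE`, so it is already a character column (`<chr>`). To make sure text strings are imported as characters regardless of the R version, set `stringsAsFactors = FALSE` explicitly: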

In [None]:
iris_table_3 <- read.table('~/iris_table.txt', stringsAsFactors = FALSE)

In [None]:
head(iris_table_3)

Notice that the `Species` column is a character (`<chr>`).

To convert it back into a factor:

In [None]:
iris_table_3$Species <- as.factor(iris_table_3$Species)

In [None]:
head(iris_table_3)
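
The same round trip can be written with the import and export options spelled out. This is a minimal sketch rather than the workshop's own code: it writes to a temporary file created with `tempfile()` (an assumption, to avoid touching your home directory), and the tab separator, `row.names = FALSE`, and `quote = FALSE` are choices made for this example.

```r
# Write iris to a temporary tab-separated file (the path is illustrative)
iris_path <- tempfile(pattern = 'iris_', fileext = '.txt')
write.table(iris, file = iris_path, sep = '\t', row.names = FALSE, quote = FALSE)

# Read it back, once keeping strings as characters and once as factors
iris_chr <- read.table(iris_path, header = TRUE, sep = '\t', stringsAsFactors = FALSE)
iris_fct <- read.table(iris_path, header = TRUE, sep = '\t', stringsAsFactors = TRUE)

class(iris_chr$Species)  # "character"
class(iris_fct$Species)  # "factor"
dim(iris_chr)            # 150 rows, 5 columns

# Remove the temporary file when done
unlink(iris_path)
```

Stating `header`, `sep`, and `stringsAsFactors` explicitly makes the result independent of the defaults of the R version you happen to be running.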

Another convenient function is `list.files()`, which returns the names of the files in a directory (specified in `path =`). The `pattern` argument is a regular expression rather than a shell wildcard, so use `'^iris_'` to match all files whose names start with `iris_`:

In [None]:
list.files(path = '~', pattern = '^iris_')
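
A quick way to see how `pattern` is interpreted is to try it on a few known file names. The sketch below uses a scratch directory inside `tempdir()` and made-up file names (both are assumptions for illustration), and `glob2rx()` from the built-in `utils` package to translate a shell-style wildcard into a regular expression.

```r
# Create a scratch directory with a few empty files (names are illustrative)
scratch <- file.path(tempdir(), 'list_files_demo')
dir.create(scratch, showWarnings = FALSE)
file.create(file.path(scratch, c('iris_table.txt', 'iris_notes.txt', 'my_iris.txt')))

# As a regular expression, 'iris_*' means "iris followed by zero or more
# underscores", and it can match anywhere in the file name
loose <- list.files(path = scratch, pattern = 'iris_*')
print(loose)     # all three files

# Anchoring with ^ keeps only names that start with 'iris_'
anchored <- list.files(path = scratch, pattern = '^iris_')
print(anchored)  # iris_notes.txt and iris_table.txt

# glob2rx() converts a shell-style wildcard into a regular expression
wildcard_as_regex <- glob2rx('iris_*')
print(wildcard_as_regex)  # "^iris_"
from_glob <- list.files(path = scratch, pattern = wildcard_as_regex)

# Clean up the scratch directory
unlink(scratch, recursive = TRUE)
```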

# R packages
- Although R comes with many built-in functions, you will probably want to install and use various R packages.
- You can install the packages using `install.packages('package_name_here')` (where you would replace 'package_name_here' with your package of choice, in quotes). 
- This will download the package and any additional required dependencies. 
- Run the next cell to install the `ggplot2` package:

In [None]:
install.packages('ggplot2')

Before you can actually use the package, you have to load it as follows:

In [None]:
library('ggplot2')
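
Installing a package every time a notebook runs is slow, so a common pattern is to check first whether it is already available. The sketch below uses `requireNamespace()` from base R. The helper name `is_installed` is made up for this example, and the install line is left commented out so that nothing is downloaded when you run the cell.

```r
# Hypothetical helper: TRUE if the package can be loaded, FALSE otherwise
is_installed <- function(package_name) {
    requireNamespace(package_name, quietly = TRUE)
}

is_installed('stats')  # 'stats' ships with R, so this is TRUE

# Install only when the package is missing (uncomment to use)
# if (!is_installed('ggplot2')) install.packages('ggplot2')
```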

### Most R packages are found in CRAN - the central repository for R packages. However, packages can be found in different places. Many of the packages of interest for biologists will be in Bioconductor. 

There are two steps to downloading a package from Bioconductor -- first, install BiocManager.

In [None]:
install.packages("BiocManager")

Then, load `BiocManager` and use `BiocManager::install()` to install a package.

In [None]:
library('BiocManager')
BiocManager::install("org.Hs.eg.db")

Use the `sessionInfo()` function to see more information about your loaded R packages and namespace:

In [None]:
print(sessionInfo())

# If we have time, we can talk about some `apply()` functions: (Appendix) 

## The `apply()` functions
- R uses a family of `apply()` functions to repetitively manipulate objects while avoiding for loops. 
- How you use them will depend on the format of your data and what operations you're trying to perform.
- We will talk about `apply()`, `lapply()`, and `sapply()`.
- There are also `mapply()`, `vapply()`, `rapply()`, and `tapply()`, but we won't talk about those today.

- **`apply()`** Applies a function to an array (or matrix) and returns an array (or matrix)
- **`lapply()`** Applies a function to each element of a list or vector and returns a list
- **`sapply()`** Applies a function to each element of a list or vector and simplifies the result to a vector (or matrix) when possible


### `apply()` 
- `apply()` applies a function to an array (or matrix) and returns an array (or matrix)
- The general format of an `apply()` call is as follows:

```
apply(X, MARGIN, FUN, ...)

```
- `X` is the array or matrix the function is applied to
- `MARGIN` is where the function should be applied: `1` for rows, `2` for columns, and `c(1,2)` for rows and columns (that is, every cell). It can also be a character vector of dimension names if `X` has named dimnames.
- `FUN` is the function to be applied

Let's go back to the dice rolling function:

In [None]:
rolled_dice <- roll(number_of_rolls = 10, number_of_sides = 20, number_of_dice = 5)

In [None]:
print(rolled_dice)

In [None]:
class(rolled_dice)

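I'm going to name the rows and columns using `paste()` and `dimnames()`:

In [None]:
# one name per row and one per column, whatever the dimensions of the matrix are
dimnames(rolled_dice) <- list(
    paste('roll', seq_len(nrow(rolled_dice)), sep = '_'),
    paste('die', seq_len(ncol(rolled_dice)), sep = '_'))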

print(rolled_dice)

Let's try using `apply()` to increase every value by 1:

In [None]:
add_one <- apply(rolled_dice, c(1,2), function(element) element + 1)
class(add_one)
print(add_one)

- The `c(1,2)` argument to `apply()` means that the function should apply to all rows and columns.
- What if we use `apply()` to calculate sums for each row and column?

If we use `1` it will apply the function to each row:

In [None]:
row_sums <- apply(rolled_dice, 1, function(element) sum(element))
print(row_sums)       

If we use `2` it will apply the function to each column:

In [None]:
col_sums <- apply(rolled_dice, 2, function(element) sum(element))
print(col_sums) 
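
Because the dice rolls are random, the sums above change on every run. To check your understanding of `MARGIN`, here is a small sketch with a fixed matrix (the object name `small_matrix` is made up for this example), compared against the built-in `rowSums()` and `colSums()`.

```r
# A 2 x 3 matrix filled column by column: rows are (1, 3, 5) and (2, 4, 6)
small_matrix <- matrix(1:6, nrow = 2)
print(small_matrix)

# MARGIN = 1: one result per row
row_totals <- apply(small_matrix, 1, sum)
print(row_totals)  # 9 12

# MARGIN = 2: one result per column
col_totals <- apply(small_matrix, 2, sum)
print(col_totals)  # 3 7 11

# For sums, the dedicated functions give the same numbers
all(row_totals == rowSums(small_matrix))
all(col_totals == colSums(small_matrix))
```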

### `lapply()`

- `lapply()` works on lists and returns a list.
- Since a data frame is a list of equal-length columns, if you apply it to a data frame it will execute the function on each column of the data frame.
- The general format is as follows:

```
lapply(X, FUN)

```
- `X` A vector (atomic or list) or a list-like object such as a data frame
- `FUN` Function applied to each element of X


In [None]:
rolled_dice_df <- as.data.frame(rolled_dice)
class(rolled_dice_df)

In [None]:
print(rolled_dice_df)

In [None]:
col_sums_df <- lapply(rolled_dice_df, sum)

In [None]:
str(col_sums_df)

In [None]:
rolled_dice

However, `lapply()` treats a matrix as a plain vector of its cells. If you use it to calculate sums on the `rolled_dice` matrix, you get back a very long list with one element per cell (since `lapply()` always returns a list).

In [None]:
class(rolled_dice)

In [None]:
col_sums <- lapply(rolled_dice, sum)
str(col_sums)
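
The length of the result makes the difference easy to see. In this minimal sketch (the object names are made up for the example), the same numbers are stored once as a matrix and once as a data frame, and `lapply()` is run on both.

```r
small_matrix <- matrix(1:6, nrow = 2)    # 2 rows, 3 columns, 6 cells
small_df <- as.data.frame(small_matrix)  # columns V1, V2, V3

# On a matrix: one list element per cell
per_cell <- lapply(small_matrix, sum)
length(per_cell)    # 6

# On a data frame: one list element per column
per_column <- lapply(small_df, sum)
length(per_column)  # 3
str(per_column)
```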

### `sapply()`

- `sapply()` is similar to `lapply()`, but it simplifies the result to a vector (or matrix) when possible rather than returning a list.
- The general format for an `sapply()` call is as follows:

```
sapply(X, FUN)

```
- `X` A vector (atomic or list) or a list-like object such as a data frame
- `FUN` Function applied to each element of `X`

In [None]:
col_sums_df <- sapply(rolled_dice_df, sum)

In [None]:
class(col_sums_df)
is.vector(col_sums_df)

In [None]:
print(col_sums_df)
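
What `sapply()` returns depends on what the function returns for each element. The sketch below uses a small made-up data frame: with one value per column the result is a named vector, and with two values per column (here from `range()`) it is simplified to a matrix instead.

```r
small_df <- data.frame(die_1 = c(3, 18, 7), die_2 = c(12, 1, 20))

# One number per column: simplified to a named numeric vector
column_sums <- sapply(small_df, sum)
print(column_sums)        # die_1 = 28, die_2 = 33

# Two numbers per column (min and max): simplified to a 2 x 2 matrix
column_ranges <- sapply(small_df, range)
print(column_ranges)
is.matrix(column_ranges)  # TRUE

# lapply() on the same input always returns a list
is.list(lapply(small_df, sum))  # TRUE
```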

# Data Wrangling and Tidyverse (Different notebook!) 

In this section you'll learn principles behind exploratory data analysis and visualization, including tidying and transforming data to answer questions you might want to ask. 

## Some useful notes

With Jupyter Notebook you can get a nice popup of function documentation just like you can in RStudio. Simply navigate to a cell or start a new one, and enter `?function_name` (for example `?mean`) like you would normally. A popup will appear.

You should see an Insert dropdown menu and Run button at the top which lets you add cells as well as run code or render Markdown in the cells, but these are very useful keyboard shortcuts for the same functions: 

- Shift+Enter: Run code or render Markdown in the current cell you're on
- Esc+a: Add a cell above
- Esc+b: Add a cell below
- Esc+dd: Delete a cell

## Package prerequisites 

The packages required in this workshop are **tidyverse** (which includes ggplot2, dplyr, purrr, and others), **gridExtra** (which helps with arranging plots next to each other), **ggrepel** (which helps with plot labels), and **maps** (which provides map data). 

In [None]:
library(tidyverse)
library(gridExtra)
library(ggrepel)
library(maps)

If you get an error message “there is no package called ‘xyz’” then you need to install the packages first. (They should have been preloaded on your notebooks but if not it's ok, it won't take long.)

In [10]:
#install.packages('tidyverse')
#install.packages('pillar')
#install.packages('gridExtra')
#install.packages('ggrepel')
#install.packages('maps')

# Tidying Data

Most datasets are data frames made up of rows and columns. However, describing a data frame just in terms of the rows and columns it has is not enough.

 * **Variable:** quantity, quality, property that can be measured.
 * **Value:** State of variable when measured.
 * **Observation:** Set of measurements made under similar conditions
 * **Tabular data:** Set of values, each associated with a variable and an observation.

Tidy data:
 * Each variable is its own column
 * Each observation is its own row
 * Each value is in a single cell
 
Benefits:
 * Easy to manipulate
 * Easy to model
 * Easy to visualize
 * Has a specific and consistent structure
 * Structure makes it easy to tidy other data
 
Cons:
 * Data frame is not as easy to look at

Consider the following tables:

In [13]:
table1 <- data.frame(makemodel=c("audi a4","audi a4","chevrolet corvette","chevrolet corvette","honda civic","honda civic"),
                    year=rep(c(1999,2008),3),
                    cty=c(18,21,15,15,24,25),
                    hwy=c(29,30,23,25,32,36))
table1

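| makemodel | year | cty | hwy |
| --- | --- | --- | --- |
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` |
| audi a4 | 1999 | 18 | 29 |
| audi a4 | 2008 | 21 | 30 |
| chevrolet corvette | 1999 | 15 | 23 |
| chevrolet corvette | 2008 | 15 | 25 |
| honda civic | 1999 | 24 | 32 |
| honda civic | 2008 | 25 | 36 |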


This is tidy data, because each column is a variable, each observation is a row, and each value is in a single cell.

Next we will look at some non-tidy data and operations from the **tidyr** package (part of **tidyverse**) to make the data tidy. Many of you might be more used to operations from **reshape2**, like melting and casting. It's a very useful package with more functionality, including aggregating data, but the **tidyr** syntax is simpler and more intuitive for the purposes of tidying data.

## Gathering

In [14]:
table2a <- data.frame(makemodel=c("audi a4","chevrolet corvette","honda civic"),`1999`=c(18,15,24),'2008'=c(21,15,25),check.names=FALSE)
table2b <- data.frame(makemodel=c("audi a4","chevrolet corvette","honda civic"),`1999`=c(29,23,32),'2008'=c(30,25,36),check.names=FALSE)
table2a
table2b

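| makemodel | 1999 | 2008 |
| --- | --- | --- |
| `<chr>` | `<dbl>` | `<dbl>` |
| audi a4 | 18 | 21 |
| chevrolet corvette | 15 | 15 |
| honda civic | 24 | 25 |


| makemodel | 1999 | 2008 |
| --- | --- | --- |
| `<chr>` | `<dbl>` | `<dbl>` |
| audi a4 | 29 | 30 |
| chevrolet corvette | 23 | 25 |
| honda civic | 32 | 36 |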


In `table2a`, the column names `1999` and `2008` are values of the `year` variable, and the same is true of `table2b`. Each row represents 2 observations, not 1, so we need to gather these columns into a new pair of variables.

Parameters:
 * Set of columns that represent values, not variables.
 * `key`: name of variable whose values are currently column names.
 * `value`: name of variable whose values are currently spread out across multiple columns.

Experiments often report data in the format of `table2a` and `table2b`. One reason is presentation: this layout is very easy to look at. Another is that storage is efficient for completely crossed designs and the layout allows matrix operations.

In [None]:
tidy2a <- gather(table2a,`1999`,`2008`,key="year",value="cty")
tidy2a

In [None]:
tidy2b <- gather(table2b, `1999`, `2008`, key = "year", value = "hwy")
tidy2b

Merge tables using `left_join()` (many other types of [table joins](https://dplyr.tidyverse.org/reference/join.html) as well)

In [None]:
left_join(tidy2a, tidy2b)
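
When no `by` argument is given, the join uses every column name the two tables share and prints a message saying which ones it picked. Below is a minimal sketch with the keys written out explicitly. It assumes **dplyr** is installed, and the two small data frames are made up for the example (one car is deliberately missing from the second table).

```r
library(dplyr)

# Illustrative tables; 'honda civic' has no row in `highway`
city <- data.frame(makemodel = c('audi a4', 'honda civic'),
                   year = c('1999', '1999'),
                   cty = c(18, 24))
highway <- data.frame(makemodel = 'audi a4',
                      year = '1999',
                      hwy = 29)

# Keep every row of `city`; rows without a match in `highway` get NA for hwy
joined <- left_join(city, highway, by = c('makemodel', 'year'))
print(joined)
```

Naming the key columns makes the intent visible and guards against joining on a column that only happens to share a name.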

## Spreading

In [None]:
table3 <- data.frame(makemodel=c(rep("audi a4",4),rep("chevrolet corvette",4),rep("honda civic",4)),
                    year=rep(c(1999,1999,2008,2008),3),
                    type=rep(c("cty","hwy"),6),
                     mileage=c(18,29,21,30,15,23,15,25,24,32,25,36))
table3

`table3` has each observation in two rows. Need to spread observations across columns with appropriate variable names instead.

Parameters:
 * `key`: Column that contains variable names.
 * `value`: Column that contains values for each variable.

In [None]:
spread(table3, key=type,value=mileage)

### Exercises

1. Consider the example below, where we spread and then gather observations from the same columns in `stocks`. Why are `gather()` and `spread()` not perfectly symmetrical? (Hint: look at the variable types and column names)

In [15]:
stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half  = c(   1,    2,     1,    2),
  return = c(1.88, 0.59, 0.92, 0.17)
)

stocks

stocks %>% 
  spread(year, return) %>% 
  gather("year", "return", `2015`:`2016`)


2. Why does the code below fail?

In [None]:
table4a %>% 
  gather(1999, 2000, key = "year", value = "cases")

3. Why does spreading this tibble fail? How could you fix it?

In [None]:
people <- tribble(
  ~name,             ~key,    ~value,
  #-----------------|--------|------
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Phillip Woods",   "age",       50,
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156
)

## Separating

In [None]:
table4 <- data.frame(makemodel=c("audi a4","audi a4","chevrolet corvette","chevrolet corvette","honda civic","honda civic"),
                     year=rep(c(1999,2008),3),
                    mileages=c('18/29','21/30','15/23','15/25','24/32','25/36'))
table4

`table4` has `mileages` column that actually contains two variables (`cty` and `hwy`). Need to separate into two columns.

Parameters:
 * column/variable that needs to be separated.
 * `into`: columns to split into
 * `sep`: separator value. Can be regexp or positions to split at. If not provided then splits at non-alphanumeric characters.

In [17]:
separate(table4, mileages, into = c("cty", "hwy"), sep="/")


In [None]:
sep <- separate(table4, makemodel, into = c("make", "model"), sep = ' ')
sep
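
Note that `separate()` returns the new columns as character vectors, even when the pieces look like numbers. One way to handle this, sketched below under the assumption that **tidyr** is installed, is the `convert = TRUE` argument. The two-row data frame is made up for the example.

```r
library(tidyr)

mileage_demo <- data.frame(makemodel = c('audi a4', 'honda civic'),
                           mileages = c('18/29', '24/32'))

# Default: the new columns are character
as_text <- separate(mileage_demo, mileages, into = c('cty', 'hwy'), sep = '/')
class(as_text$cty)     # "character"

# convert = TRUE guesses a more suitable type for each new column
as_numbers <- separate(mileage_demo, mileages, into = c('cty', 'hwy'),
                       sep = '/', convert = TRUE)
class(as_numbers$cty)  # "integer"

# Arithmetic now works on the new columns
as_numbers$hwy - as_numbers$cty
```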

## Uniting

Now `sep` has `make` and `model` columns that can be combined into a single column. In other words, we want to unite them.

Parameters:
 * Name of united column/variable
 * Names of columns/variables to be united
 * `sep`: Separator value. Default is '_'



In [None]:
unite(sep, new, make, model)

In [None]:
unite(sep, makemodel, make, model, sep=' ')

## Piping

Loading **dplyr** (or **tidyverse**) makes the 'pipe' (`%>%`, originally from the **magrittr** package) available. It lets you chain multiple operations, passing the output of one function directly as the first argument of the next. This avoids naming intermediate objects and can make code easier to read. You can think of it this way: `x %>% f(y)` becomes `f(x,y)`, and `x %>% f(y) %>% g(z)` becomes `g(f(x,y),z)`, etc.

In [None]:
unite(sep, makemodel, make, model, sep=' ') %>%
    separate(mileages, into=c("cty","hwy"))
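
To convince yourself that a pipeline and the nested call do the same thing, you can compare the two directly. This is a minimal sketch on plain numbers, assuming **dplyr** is installed so that `%>%` is available once it is loaded.

```r
library(dplyr)

x <- c(4, 9, 16)

# x %>% f(y) is f(x, y)
piped_round <- 2.345 %>% round(digits = 1)
nested_round <- round(2.345, digits = 1)
identical(piped_round, nested_round)  # TRUE

# x %>% f() %>% g() is g(f(x)): the square roots are 2, 3, 4 and their sum is 9
piped_sum <- x %>% sqrt() %>% sum()
nested_sum <- sum(sqrt(x))
identical(piped_sum, nested_sum)      # TRUE
print(piped_sum)                      # 9
```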

### Exercises

1. What do the `extra` and `fill` arguments do in `separate()`? Experiment with the various options for the following two toy datasets.

In [None]:
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
  separate(x, c("one", "two", "three"))

tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"))

## Not all data should be tidy

Not every data set is best stored in tidy form. Matrices and phylogenetic trees are common examples (although `ggtree` and `treeio` have tidy representations that help with annotating trees).

# Transforming (Tidy) Data

Now we know how to get tidy data. At this point we can already start visualizing our data. However in many cases we will need to further transform our data to narrow down variables and observations we are really interested in or to create new variables that are functions of our existing variables and data. This is known as **transforming** data.

 * `filter()` to pick observations (rows) by their values
 * `arrange()` to reorder rows, default is by ascending value
 * `select()` to pick variables (columns) by their names
 * `mutate()` to create new variables with functions of existing variables
 * `summarise()` to collapse many values down to a single summary
 * `group_by()` to set up functions to operate on groups rather than the whole data set
 * `%>%` passes the output of one function as input to the next, e.g. `x %>% f(y)` becomes `f(x,y)`, and `x %>% f(y) %>% g(z)` becomes `g(f(x,y),z)`.
 
All functions have similar structure:
 1. First argument is data frame
 2. Next arguments describe what to do with data frame using variable names
 3. Result is new data frame
 
We will be working with the data frame **mpg** for the rest of the workshop. It comes with **ggplot2**, which is loaded as part of **tidyverse**.

In [None]:
head(mpg)

## `filter()` rows/observations

As the name suggests, `filter()` filters rows: it keeps the rows for which every condition is `TRUE`. The first argument is the name of the data frame; the next arguments are the expressions used to filter it.

In [None]:
# filter out 2seater cars
no_2seaters <- filter(mpg, class != "2seater")
head(no_2seaters)

In [None]:
# filter out audis, chevys, and hondas
mpg %>% filter(!manufacturer %in% c("audi","chevrolet","honda")) %>% head

## `arrange()` rows/observations

Changes order of rows. First argument is name of data frame, next arguments are column names (or more complicated expressions) to order by. Default column ordering is by ascending order, can use `desc()` to do descending order. Missing values get sorted at the end regardless of what column ordering is chosen.

In [None]:
# arrange/reorder mpg by class
arrange(mpg, class) %>% head

In [None]:
# arrange/reorder data frame with 2seaters filtered out by class
# 2seaters does not appear which is as it should be
arrange(no_2seaters, class) %>% head

What kinds of cars have the best highway and city gas mileage?

In [19]:
# arrange mpg so that first hwy mileage is by descending order, then cty mileage is by descending order
arrange(mpg, desc(hwy), desc(cty)) %>% head


Example of missing data getting placed at bottom.

In [20]:
df <- data.frame(x=c(5,2,NA,6))
df

| x |
| --- |
| `<dbl>` |
| 5 |
| 2 |
| NA |
| 6 |


In [21]:
# arrange df by ascending order, NA will be at bottom
arrange(df, x)


In [22]:
# arrange df by descending order, NA will be at bottom
arrange(df, desc(x))


In [23]:
# sorting by !is.na(x) puts the NA row first (FALSE sorts before TRUE);
# the other rows keep their original order because they are all TRUE
arrange(df,!is.na(x))


In [24]:
# add desc(x) to also sort the non-missing values, here in descending order
arrange(df,!is.na(x),desc(x))


## `select()` columns/variables

Selects columns, which can be useful when you have hundreds or thousands of variables in order to narrow down to what variables you're actually interested in. First argument is name of data frame, subsequent arguments are columns to select. Can use `a:b` to select all columns between `a` and `b`, or use `-a` to select all columns *except* a.

In [25]:
# select manufacturer, model, year, cty, hwy
select(mpg, manufacturer, model, year, cty, hwy) %>% head


In [26]:
# select all columns model thru hwy
select(mpg, model:hwy) %>% head
head(mpg)


In [27]:
# select all columns except cyl thru drv and class
select(mpg, -(cyl:drv), -class) %>% head


## `mutate()` to add new variables or `transmute()` to keep only new variables

Adds new columns that are functions of existing columns. First argument is name of data frame, next arguments are of the form `new_column_name = f(existing columns)`.

In [28]:
# add a new column that takes average mileage between city and highway
mutate(mpg, avg_mileage = (cty+hwy)/2) %>% head


In [29]:
# keep only cty and the new average mileage column
transmute(mpg,cty,avg_mileage=(cty+hwy)/2) %>% head


## `summarise()` and `group_by()` for grouped summaries

`summarise()` collapses a data frame into a single row, and `group_by()` changes analysis from entire data frame into individual groups.

In [30]:
# get average mileage grouped by engine cylinder
m <- mutate(mpg, avg_mileage=(cty+hwy)/2)
# grouped output is displayed differently in R/RStudio than in notebooks
m %>% group_by(cyl) %>%
    summarise(avg=mean(avg_mileage)) %>%
    head


**Note:** If you look at the output of `group_by` in R/RStudio you will actually be able to see what your groupings are as well as how many of them you have. For example if we did `group_by(mpg, cyl)` the output would include `cyl [4]` which shows that our grouping is by `cyl` and there are 4 groups. Jupyter notebook doesn't display this for reasons having to do with [how data frames are outputted](https://github.com/IRkernel/repr/issues/113). Some other differences exist between how certain objects from **tidyverse** are displayed as well.

In [31]:
group_by(m, drv) %>%
    summarise(avg=mean(avg_mileage))


In [32]:
# df after group_by would show that we have 9 groups
drv_cyl <- group_by(m, drv, cyl) %>%
    summarise(avg=mean(avg_mileage)) %>%
    arrange(desc(avg))
drv_cyl


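By default `summarise()` drops only the last grouping variable, so `drv_cyl` is still grouped by `drv` and a further `summarise()` returns one row per `drv`. You can run `ungroup()` first to summarise over all observations instead.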

In [33]:
drv_cyl %>% summarise(max=max(avg))


In [34]:
ungroup(drv_cyl) %>% summarise(max=max(avg))

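
For comparison, grouped summaries can also be written in base R. The sketch below is not from the workshop material: it uses the built-in `mtcars` data set (so that it runs without any packages) and `aggregate()`, with the `mpg` column of `mtcars` standing in for the `avg_mileage` column used above.

```r
# Average miles per gallon for each number of cylinders, base R only
by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(by_cyl)

# Grouping by two variables: one row per combination that occurs in the data
by_cyl_gear <- aggregate(mpg ~ cyl + gear, data = mtcars, FUN = mean)

# Sort by descending average, like arrange(desc(avg))
by_cyl_gear <- by_cyl_gear[order(-by_cyl_gear$mpg), ]
head(by_cyl_gear)
```

With **dplyr** the first step would read `mtcars %>% group_by(cyl) %>% summarise(mpg = mean(mpg))`, which keeps the grouping, the calculation, and the sorting as separate, named steps.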


### Exercises (Appendix - also add a base R example to compare it to Tidyverse and show how clumsy it is) 

1. Find all cars that:
    * Have an average gas mileage greater than 20mpg
    * Have an automatic transmission
    * Are 4-wheel drive
    
2. Earlier we computed the average mileage with an explicit formula by taking city mileage + highway mileage and dividing the sum by two. How can you do this without typing out the exact formula? What happens if there are `NA`s? Feel free to experiment on the dataframe `ex2_df` to arrive at an answer.
    
3. Find all cars grouped by manufacturer, model, cylinder, and auto/manual transmission that improved on gas mileage (either city, highway, or both, you choose) by at least 1mpg between 1999 and 2008. (This one might take some time, if you just look at city mileage you should end up with 26 rows in your data frame.)

In [35]:
ex2_df <- data.frame(x=c(5,2,NA,6),y=c(NA,5,10,3))