## Prelab - Plotting with ggplot2 (R - III)

Now that we can easily manipulate our data, let's start making graphs! ggplot2 lets you build graphs in a modular fashion. The basic format for a ggplot2 command is a call to `ggplot()`, followed by the type of graph or feature you want to add.

In this module, you will get a first introduction to ggplot2: a quick run through the syntax and some of its capabilities. If this looks strange and unusual, don't panic! In subsequent modules, we will work with ggplot2 in a more directed way so that you can see how it works and get familiar with the syntax.

## Loading packages and data

As you will remember from our previous R modules, we need to load the libraries we will be using before we begin any analysis.

In [None]:
library(tidyverse)
options(repr.plot.width=10, repr.plot.height=3) #set size for plots in this notebook

For this prelab, we will again use some cancer incidence statistics from 2014 (obtained from https://www.cdc.gov/cancer/). This dataset contains statistics for a set of seven types of cancer, stratified by year, race, and sex. 

In [None]:
data = read.table("Cancer_Incidence.txt",header=TRUE,sep="\t") #read the tab-delimited file; the first row holds the column names

To get started, the code below makes a data frame called `rates`, filtered to include only the rates for all races with males and females combined.

Then we call `ggplot(rates)` to say what data we want to plot, and call `geom_point()` to add points to our plot by specifying variables for the X and Y coordinates. 

Anytime we want to plot something where each row is a data point, as in this plot, we put the variables we are using in the aesthetic call (`aes()`). So here, we are telling ggplot to plot a point for each row, with the X value as the YEAR and the Y value as the AGE_ADJUSTED_RATE.

In [None]:
rates = data %>% filter(RACE=="All Races",SEX=="Male and Female") #filter data to all races, male and female
head(rates)
ggplot(rates) + geom_point(aes(x=YEAR,y=AGE_ADJUSTED_RATE)) #make a plot of the year and adjusted rates!

Looking at the data in rates, we have multiple types of cancers (SITE), but we plotted them all together without telling ggplot to differentiate them in any way. Let's add an additional `color` aesthetic, which will color the points by whatever variable we provide.

In [None]:
ggplot(rates) + geom_point(aes(x=YEAR,y=AGE_ADJUSTED_RATE,color=SITE)) #same plot, now with the points colored by cancer site

Now we can see each site separately! And ggplot has automatically added a nice legend for us. This will happen any time you use `color` or another aesthetic (in the `aes()` call) to differentiate by a variable.

Notice what happens when we put the `color` call outside `aes()`, below. We get an error because anything specified by a variable within the data must be called within `aes()`.

In [None]:
ggplot(rates) + geom_point(aes(x=YEAR,y=AGE_ADJUSTED_RATE),color=SITE) #this fails: SITE is a column in the data, so it must go inside aes()
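If you would like to see the failure without stopping the notebook, one option is to wrap the call in `tryCatch()`. The sketch below uses a tiny made-up data frame (`toy`, with invented values) in place of the course data, and assumes that no object named `SITE` exists in your workspace.

```r
library(ggplot2)

# Hypothetical toy data standing in for `rates` (values are invented)
toy <- data.frame(
  YEAR = c(2010, 2011, 2010, 2011),
  AGE_ADJUSTED_RATE = c(12.1, 12.4, 6.8, 6.5),
  SITE = c("Pancreas", "Pancreas", "Stomach", "Stomach")
)

# color=SITE sits outside aes(), so R looks for a standalone object called SITE
# rather than the SITE column; tryCatch() captures the error message as text
msg <- tryCatch(
  {
    ggplot(toy) + geom_point(aes(x = YEAR, y = AGE_ADJUSTED_RATE), color = SITE)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg
```

The captured message should mention `SITE`, which is the hint that the variable was looked up outside the data.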

If you just want to make everything the same color regardless of its variables, you can use color outside `aes()`, and assign a particular color.

In [None]:
ggplot(rates) + geom_point(aes(x=YEAR,y=AGE_ADJUSTED_RATE),color="red") #color every point red

Now let's connect the points with lines. ggplot is "buildable", meaning if you want to add something you can just add an extra command.

In [None]:
ggplot(rates) + geom_point(aes(x=YEAR,y=AGE_ADJUSTED_RATE,color=SITE)) + geom_line(aes(x=YEAR,y=AGE_ADJUSTED_RATE,color=SITE))
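Because both layers above repeat the same `aes()` call, one common way to shorten the code is to place the shared aesthetics in `ggplot()` itself, where every layer inherits them. A minimal sketch, using a small invented data frame (`toy`) in place of `rates`:

```r
library(ggplot2)

# Hypothetical toy data standing in for `rates` (values are invented)
toy <- data.frame(
  YEAR = c(2010, 2011, 2012, 2010, 2011, 2012),
  AGE_ADJUSTED_RATE = c(12.1, 12.4, 12.6, 6.8, 6.5, 6.3),
  SITE = rep(c("Pancreas", "Stomach"), each = 3)
)

# Aesthetics given in ggplot() are inherited by every layer added afterwards
p <- ggplot(toy, aes(x = YEAR, y = AGE_ADJUSTED_RATE, color = SITE)) +
  geom_point() +
  geom_line()

p  # typing the object's name draws the plot
```

Storing the plot in a variable such as `p` also lets you keep adding layers later with `p + ...`.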

The rates reported in our dataset include confidence intervals. Let's try plotting the confidence intervals in our point and line plot. To do this, we add two more calls to `geom_point()`, one for the lower bound and one for the upper bound. To keep our plot from getting too busy, we'll filter the data down to just lymphomas in females. Notice how we have changed the size of our points and lines using the `size` argument. (In ggplot2 3.4.0 and later, line thickness is set with `linewidth`; `size` still works for lines but produces a deprecation warning.)

In [None]:
lymphFem = data %>% filter(SITE=="Lymphomas",SEX=="Female",RACE!="All Races")
ggplot(lymphFem) + geom_point(aes(x=YEAR,y=AGE_ADJUSTED_RATE,color=RACE),size=3) + geom_line(aes(x=YEAR,y=AGE_ADJUSTED_RATE,color=RACE),size=1) + geom_point(aes(x=YEAR,y=AGE_ADJUSTED_CI_LOWER,color=RACE),size=1)  + geom_point(aes(x=YEAR,y=AGE_ADJUSTED_CI_UPPER,color=RACE),size=1)

Now, let's add line segments to connect our confidence intervals. For line segments we need to specify where the line starts and stops on both the `x` and `y` axes, thus there are now 4 variables going into the aesthetic call in addition to `color`.

In [None]:
ggplot(lymphFem) + geom_point(aes(x=YEAR,y=AGE_ADJUSTED_RATE,color=RACE),size=3) + geom_line(aes(x=YEAR,y=AGE_ADJUSTED_RATE,color=RACE),size=1) + geom_point(aes(x=YEAR,y=AGE_ADJUSTED_CI_LOWER,color=RACE),size=1)  + geom_point(aes(x=YEAR,y=AGE_ADJUSTED_CI_UPPER,color=RACE),size=1) + geom_segment(aes(x=YEAR,xend=YEAR,y=AGE_ADJUSTED_CI_LOWER,yend=AGE_ADJUSTED_CI_UPPER,color=RACE))
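ggplot2 also provides `geom_errorbar()`, which draws the same kind of vertical interval from `ymin` and `ymax` aesthetics. The following is a sketch on invented numbers (the `toy_ci` data frame and its group labels are hypothetical, not taken from the course file), offered as an alternative rather than a replacement for the approach above.

```r
library(ggplot2)

# Hypothetical data with a rate and its confidence bounds (values are invented)
toy_ci <- data.frame(
  YEAR = c(2010, 2011, 2012, 2010, 2011, 2012),
  RACE = rep(c("Group A", "Group B"), each = 3),
  AGE_ADJUSTED_RATE = c(15.0, 15.4, 15.1, 11.2, 11.0, 11.5),
  AGE_ADJUSTED_CI_LOWER = c(14.2, 14.6, 14.3, 10.1, 9.9, 10.4),
  AGE_ADJUSTED_CI_UPPER = c(15.8, 16.2, 15.9, 12.3, 12.1, 12.6)
)

# x and color are shared by all layers; each layer adds the y values it needs
p <- ggplot(toy_ci, aes(x = YEAR, color = RACE)) +
  geom_point(aes(y = AGE_ADJUSTED_RATE)) +
  geom_line(aes(y = AGE_ADJUSTED_RATE)) +
  geom_errorbar(aes(ymin = AGE_ADJUSTED_CI_LOWER, ymax = AGE_ADJUSTED_CI_UPPER),
                width = 0.2)  # width controls the length of the horizontal caps

p
```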

ggplot can make many types of plots. Let's try another type, a boxplot. Let's plot the range of rates for each site/sex combination in a boxplot:

In [None]:
ggplot(data) + geom_boxplot(aes(x=SITE,y= AGE_ADJUSTED_RATE,color=SEX))

Different plot types use different aesthetics. For boxplots, if we want to color the whole box, we use `fill`. We can also combine a `filter()` command with our ggplot:

In [None]:
ggplot(data %>% filter(RACE!="All Races")) + geom_boxplot(aes(x=SITE,y= AGE_ADJUSTED_RATE,fill=RACE))
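Another way to write the same idea is to pipe the filtered data into `ggplot()`. Note that the data steps are joined with `%>%`, while the plot layers are still joined with `+`. A sketch on a small invented data frame (`toy_box`, with placeholder group labels):

```r
library(dplyr)
library(ggplot2)

# Hypothetical data: 2 sites x 3 RACE labels x 2 observations (values are invented)
toy_box <- data.frame(
  SITE = rep(c("Pancreas", "Stomach"), each = 6),
  RACE = rep(c("All Races", "Group A", "Group B"), times = 4),
  AGE_ADJUSTED_RATE = c(12.0, 13.1, 11.2, 12.3, 13.4, 11.0,
                        6.5, 7.2, 5.9, 6.4, 7.0, 6.1)
)

# Filter first, then hand the result to ggplot() as its data
p <- toy_box %>%
  filter(RACE != "All Races") %>%
  ggplot() +
  geom_boxplot(aes(x = SITE, y = AGE_ADJUSTED_RATE, fill = RACE))

p
```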

What if we want to use our own colors instead of the ones ggplot chooses automatically? We can do that too, by adding another command to our code:

In [None]:
myColors = c("red","orange","purple","chartreuse","magenta")
ggplot(data %>% filter(RACE!="All Races")) + geom_boxplot(aes(x=SITE,y= AGE_ADJUSTED_RATE,fill=RACE)) + scale_fill_manual(values=myColors)
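The colors in `myColors` are matched to the levels of RACE by position. If you would rather tie each color to a specific level, `scale_fill_manual()` also accepts a named vector. The group labels below are placeholders, not the actual RACE values in the dataset, and the numbers are invented.

```r
library(ggplot2)

# Hypothetical data with two placeholder groups (values are invented)
toy_box <- data.frame(
  SITE = rep(c("Pancreas", "Stomach"), each = 4),
  RACE = rep(c("Group A", "Group B"), times = 4),
  AGE_ADJUSTED_RATE = c(13.1, 11.2, 13.4, 11.0, 7.2, 5.9, 7.0, 6.1)
)

# Each name must match a value of the variable mapped to fill
namedColors <- c("Group A" = "orange", "Group B" = "purple")

p <- ggplot(toy_box) +
  geom_boxplot(aes(x = SITE, y = AGE_ADJUSTED_RATE, fill = RACE)) +
  scale_fill_manual(values = namedColors)

p
```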

We might also want to change the axis labels instead of using the names of the variables being plotted. To do that we can use `xlab()` and `ylab()`, again simply adding on to our existing code.

In [None]:
ggplot(data %>% filter(RACE!="All Races")) + geom_boxplot(aes(x=SITE,y= AGE_ADJUSTED_RATE,fill=RACE)) + scale_fill_manual(values=myColors) + xlab("Cancer Site") + ylab("Rate per 100,000")
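`labs()` can set several labels in one call, including the legend title and a plot title. A sketch on invented data (the `toy_box` data frame, its group labels, and the title text are all hypothetical):

```r
library(ggplot2)

# Hypothetical data with two placeholder groups (values are invented)
toy_box <- data.frame(
  SITE = rep(c("Pancreas", "Stomach"), each = 4),
  RACE = rep(c("Group A", "Group B"), times = 4),
  AGE_ADJUSTED_RATE = c(13.1, 11.2, 13.4, 11.0, 7.2, 5.9, 7.0, 6.1)
)

# x and y rename the axes; fill renames the legend for the fill aesthetic
p <- ggplot(toy_box) +
  geom_boxplot(aes(x = SITE, y = AGE_ADJUSTED_RATE, fill = RACE)) +
  labs(x = "Cancer Site", y = "Rate per 100,000", fill = "Race",
       title = "Example boxplot on invented data")

p
```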

**Filtering with AND (&), OR (|) and %in% operators**

As you plot (and filter), a few 'operators' are useful for requiring that the data meet the conditions you want.

For example, perhaps we want data to meet a criterion in one column *AND* (`&`) a criterion in a second column.

Or perhaps a single column has multiple labels, and we want rows with one value *OR* (`|`) another value.

Alternatively, we have a set of values in a vector and want to include a row if its value is *in* that set (`%in%`).

Each of these logical 'operations' can be achieved symbolically in R:

- "AND" --> &
- "OR" --> |
- "is an element of" --> %in%

You can think of these as a 'test' applied to each element of a vector: if the element meets the condition (`TRUE`), it is passed forward. If the condition is not met (`FALSE`), the element is not passed forward. (Note that it is possible to generate the *logical vector* of TRUE/FALSE values itself, rather than the subset of data that meets the "TRUE" condition, if you wanted.)
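To make the idea of a logical vector concrete, here is a small base R sketch. The vectors `site` and `sex` are made up for illustration and are not part of the dataset.

```r
# Two hypothetical vectors of equal length (one element per "row")
site <- c("Pancreas", "Stomach", "Lung", "Pancreas")
sex  <- c("Female", "Female", "Male", "Male")

# AND: TRUE only where both conditions hold
both <- (site == "Pancreas") & (sex == "Female")
both                  # TRUE FALSE FALSE FALSE

# OR: TRUE where at least one condition holds
either <- (site == "Pancreas") | (site == "Stomach")
either                # TRUE TRUE FALSE TRUE

# %in%: TRUE where the element appears in the given set
in_set <- site %in% c("Stomach", "Lung")
in_set                # FALSE TRUE TRUE FALSE

# Using a logical vector to keep only the elements that passed the test
site[in_set]          # "Stomach" "Lung"
```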

In most cases, however, you simply want the resulting data. Let's try some examples to demonstrate.

Let's say that we would like to take the subset of data for black women for pancreatic cancer. 

We can use the `&` operator to achieve that.

In [None]:
data_filter = data %>% filter((RACE=="Black") & (SEX=="Female") & (SITE=="Pancreas"))
data_filter

Now what if we want both Pancreas and Stomach cancer data? We can add the OR (`|`) operator to the previous line of code, with some additional parentheses:

"Black" AND "Female" AND ("Pancreas" OR "Stomach")

In [None]:
data_filter <- data %>% filter((RACE=="Black") & (SEX=="Female") & ((SITE=="Pancreas") | (SITE=="Stomach")))
data_filter

If you were filtering for many entries, chaining `|` quickly gets cumbersome. For example, let's say you want to filter for Pancreas, Stomach, and Gallbladder cancer. You could write:

In [None]:
data_filter <- data %>% filter(SITE=="Pancreas" | SITE=="Stomach" | SITE=="Gallbladder")
data_filter

But as you can see, if you have many entries, the code for this might get a bit awkward. 

Instead, you can create sets and use `%in%` to achieve the same result, with code that is much more readable:

In [None]:
cancers <- c("Pancreas", "Stomach", "Gallbladder")
data_filter <- data %>% filter(SITE %in% cancers)
data_filter

You can see that the "specification" of data is placed ahead of the data processing, so that when you are reading the code, you are separating out those two steps (and what they mean) clearly, rather than merging them into a single (longer) step. 

It is also more readable because there are fewer operations: above, you had to chain multiple logical `|` steps. Imagine if you had a set of 200 genes that you wanted to select! It is much easier and cleaner to make a vector of gene names and use `%in%` than to type out every condition to check!
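The same set can also be used to exclude rows by negating the test with `!`. A short sketch on an invented data frame (`toy`, with made-up values), assuming dplyr is available:

```r
library(dplyr)

# Hypothetical data standing in for the cancer dataset (values are invented)
toy <- data.frame(
  SITE = c("Pancreas", "Stomach", "Gallbladder", "Lung", "Liver"),
  AGE_ADJUSTED_RATE = c(12.5, 6.7, 1.1, 60.2, 8.3),
  stringsAsFactors = FALSE
)

cancers <- c("Pancreas", "Stomach", "Gallbladder")

# !(... %in% ...) keeps the rows whose SITE is NOT in the set
others <- toy %>% filter(!(SITE %in% cancers))
others$SITE   # "Lung" "Liver"
```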