# Multiple Hypothesis Testing and You: Just do it

## Instructions

In this module, you are going to analyze gene expression data from an experiment aimed at finding genes that change abundance in response to ER stress. Human fibroblast cells were treated with tunicamycin, a drug that inhibits N-linked glycosylation and thereby causes unfolded proteins to accumulate in the ER. We have already provided the results of an analysis pipeline that used three biological replicates to estimate the fold change and p-value for each gene.

First, let's load the file `data.csv` and take a look at the first few lines.

**Execute the following code below.**

In [None]:
library(tidyverse)
data = read.csv('data.csv')
head(data)

**Q1.** Let's take a look at these results in more detail.

- Sort the table based on `pvalue`. 
- How many genes returned `pvalue` < 0.05?

**Provide and Execute your code below.**

**Q2.** What are the top 5 most significant genes?

**Q3.** Above, you indicated that many genes returned a `pvalue` < 0.05. 
- Do you think that all of these genes are *truly* differentially expressed in response to treatment with the drug tunicamycin? Why or why not? 
- If the drug had absolutely *zero effect* on *any and all genes*, how many genes would you expect to return a `pvalue < 0.05`? 

**Q4.** Let's examine the distribution of p-values that you found, and compare it to the distribution of p-values in a situation where we know that H0 (the null hypothesis) holds in every case. We can easily generate such a distribution with the 'random' number generating functions in R. 

Below, we do this using a t-test comparing two groups that we know were drawn from the same distribution. 

Take a look at the code below and try to understand it (group effort!). In the pre-lab for today, we introduced you to loops -- so take a look and make sure you understand what we're doing here! Note that loops are super useful, and we will be using these later in the course too...

In [None]:
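# Simulate 10,000 t-tests in which the null hypothesis is true by construction
sims = vector("numeric", 10000)   # pre-allocate storage for the p-values
for (i in 1:10000) {
  x = rnorm(50, 0, 2)   # group 1: 50 draws from Normal(mean = 0, sd = 2)
  y = rnorm(50, 0, 2)   # group 2: 50 draws from the same distribution
  sims[i] = t.test(x, y)$p.value   # store the p-value of this comparison
}
hist(sims)   # plot the distribution of the simulated p-values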
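
For comparison, here is a minimal sketch of the same kind of simulation written without an explicit loop, using base R's `replicate()`. The seed, the smaller number of simulations, and the name `p_null` are assumptions chosen only to keep the sketch quick and reproducible; they are not part of the lab.

```r
set.seed(1)      # assumed seed, only so the sketch is reproducible
n_sims <- 1000   # assumed: fewer simulations than the loop above, to run quickly

# replicate() evaluates the expression n_sims times and collects the results.
# Both groups come from Normal(mean = 0, sd = 2), so H0 is true in every test.
p_null <- replicate(n_sims, t.test(rnorm(50, 0, 2), rnorm(50, 0, 2))$p.value)

length(p_null)   # one p-value per simulated experiment
range(p_null)    # all values lie between 0 and 1
```

Whether you prefer the loop or `replicate()` is largely a matter of taste; the loop makes each step explicit, which is why we use it in this course.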

**Q5.** How many of the 10000 t.tests you ran above gave you a `p.value` < 0.05? How many would you have expected? What does the p-value distribution look like? Is this what you expected?

**Q6.** Now, plot the distribution of `pvalue` in the data provided to you in **Q1**, above.

**Provide and Execute your code below.**

**Q7.** Does it look like there are significant differentially expressed genes? Why or why not?

**Q8.** Rather than skewing toward P=0, there are sometimes cases where the p-value distribution skews in the *other* direction, toward P=1.

What situation(s) might you imagine could result in this behavior?

Next, we will determine which genes, if any, are differentially expressed and mark them on a `volcano plot`: a scatter plot of each gene's -log10(p-value) against its log2 fold change in expression. 

First, let's do this without any multiple hypothesis correction.

**Execute the code below.**

In [None]:
data = read.csv('data.csv')
data <- data %>% mutate(significant = pvalue < 0.05)
print(paste('number of significant genes is',length(which(data$pvalue < 0.05))))
ggplot(data,aes(x=log2FoldChange,y=-log10(pvalue),col=significant)) + geom_point()
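
To help read the y-axis of the volcano plot, here is a small sketch of how a few p-values map onto the `-log10()` scale. The three p-values are illustrative assumptions, not values taken from `data.csv`.

```r
# Illustrative p-values (assumed, not from the dataset)
p <- c(0.05, 0.01, 0.001)

# Smaller p-values give larger values of -log10(p),
# so the most significant genes sit highest on the plot
neg_log_p <- -log10(p)
round(neg_log_p, 2)
```

Each step of 1 on this axis corresponds to a tenfold smaller p-value.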

**Q9.** It turns out that R has a handy function that allows you to adjust for multiple testing, `p.adjust()`. Here's a quick link to a [manual page](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/p.adjust) about how to use it. Using this function, generate:

- p-value adjustments based on [Bonferroni correction](https://www.youtube.com/watch?v=HLzS5wPqWR0)
- p-value adjustments based on a [False Discovery Rate (FDR) correction](https://www.youtube.com/watch?v=K8LQSvtjcEo)

The links are provided in case you need a quick "stats refresher" on the intuition behind these approaches.

(Hint: you could use `mutate()` in tidyverse to add these columns to your table `data` very easily!)
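
If the behavior of `p.adjust()` is unclear, here is a minimal sketch on a made-up table of five p-values. The table `toy` and its column names are hypothetical and exist only for illustration. The sketch uses base R's `transform()` to stay dependency-free; the same `p.adjust()` calls can be placed inside `mutate()`.

```r
# Hypothetical toy table (assumed values, not from data.csv)
toy <- data.frame(
  gene   = c("g1", "g2", "g3", "g4", "g5"),
  pvalue = c(0.001, 0.01, 0.02, 0.04, 0.5)
)

# Add one adjusted column per method.
# "bonferroni" multiplies each p-value by the number of tests (capped at 1);
# "BH" is the Benjamini-Hochberg FDR procedure ("fdr" is an alias for it).
toy <- transform(
  toy,
  p_bonf = p.adjust(pvalue, method = "bonferroni"),
  p_fdr  = p.adjust(pvalue, method = "BH")
)

toy
```

In this toy table the Bonferroni values are never smaller than the FDR values, which reflects that Bonferroni is the more conservative of the two corrections.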

**Provide and Execute your code below.**

**Q10.** Next, recreate the volcano plot created above using:

- The Bonferroni-corrected p-values
- The FDR-corrected p-values

**Provide and Execute your code below.**