Commit: finish advanced inference

mikelove committed Mar 26, 2016
1 parent dd686d1 commit 3d8f3ce
Showing 4 changed files with 20 additions and 18 deletions.
8 changes: 4 additions & 4 deletions advinference/eda_for_highthroughput.Rmd
@@ -35,7 +35,7 @@ randomData <- matrix(rnorm(n*m),m,n)
nullpvals <- rowttests(randomData,g)$p.value
```

As we described earlier, reporting only p-values is a mistake when we can also report effect sizes. With high-throughput data, we can visualize the results by making a _volcano plot_. The idea behind a volcano plot is to show these for all features. In the y-axis we plot -log (base 10) p-values and on the x-axis we plot the effect size. By using - log (base 10), the "highly significant" features appear at the top of the plot. Using log also permits us to better distinguish between small and very small p-values, for example 0.01 and $10^6$. Here is the volcano plot for our results above:
As we described earlier, reporting only p-values is a mistake when we can also report effect sizes. With high-throughput data, we can visualize the results by making a _volcano plot_. The idea behind a volcano plot is to show both quantities for all features. On the y-axis we plot -log (base 10) p-values and on the x-axis we plot the effect size. By using -log (base 10), the "highly significant" features appear at the top of the plot. Using log also permits us to better distinguish between small and very small p-values, for example 0.01 and $10^{-6}$. Here is the volcano plot for our results above:

```{r volcano_plot, fig.cap="Volcano plot. Negative log (base 10) p-values plotted against effect size estimates."}
plot(results$dm,-log10(results$p.value),
@@ -46,7 +46,7 @@ Many features with very small p-values, but small effect sizes as we see here, a

#### p-value Histograms

Another plot we can create to get an overall idea of the results is to make histograms of p-values. When we generate completely null data the histogram follows a uniform distribution. With our original data set we see a higher frequency of smaller p-values.
Another way to get an overall idea of the results is to make histograms of p-values. When we generate completely null data, the histogram follows a uniform distribution. With our original dataset we see a higher frequency of smaller p-values.

```{r pval-hist, fig.cap="P-value histogram. We show a simulated case in which all null hypotheses are true (left) and p-values from the gene expression described above.",fig.width=10.5,fig.height=5.25}
library(rafalib)
@@ -55,7 +55,7 @@ hist(nullpvals,ylim=c(0,1400))
hist(pvals,ylim=c(0,1400))
```

When we expect most hypothesis to be null and don't see a uniform p-value distribution, it might be indicative of unexpected properties, such as correlated samples.
When we expect most hypotheses to be null and don't see a uniform p-value distribution, it might be indicative of unexpected properties, such as correlated samples.

If we permute the outcomes and calculate p-values then, if the samples are independent, we should see a uniform distribution. With these data we do not:
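The chunk that follows is collapsed in this diff. A minimal sketch of the permutation check, assuming `geneExpression` and `g` are the expression matrix and group factor used in the chunks above:

```{r}
library(genefilter)        ## provides rowttests
set.seed(1)
permg <- sample(g)         ## permute the group labels
permresults <- rowttests(geneExpression, permg)
hist(permresults$p.value)  ## with correlated samples, still not uniform
```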

@@ -76,7 +76,7 @@ library(Biobase)
library(GSE5859)
data(GSE5859)
ge <- exprs(e) ##ge for gene expression
ge[,49] <- ge[,49]/log2(exp(1)) ##immitate error
ge[,49] <- ge[,49]/log2(exp(1)) ##imitate error
```

A quick look at a summary of the distribution using boxplots immediately highlights the mistake:
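That chunk is also collapsed here; a plausible sketch, with the plotting options being our assumption:

```{r}
## one box per sample; the rescaled column 49 stands out immediately
boxplot(ge, range=0, names=1:ncol(ge), las=2)
```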
2 changes: 1 addition & 1 deletion advinference/inference_for_highthroughput.Rmd
@@ -96,7 +96,7 @@ The qq-plots show that the data is well approximated by the normal approximation
t.test(e[g==1],e[g==0])$p.value
```

To answer the question for each gene, we simply do repeat the above for each gene. Here we will define our own function and use `apply`:
To answer the question for each gene, we simply repeat the above for each gene. Here we will define our own function and use `apply`:

```{r}
myttest <- function(x) t.test(x[g==1],x[g==0],var.equal=TRUE)$p.value
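## The rest of this chunk is collapsed in the diff. A sketch of the
## `apply` step described above, assuming `geneExpression` is the
## matrix loaded earlier in this file:
pvals <- apply(geneExpression, 1, myttest)
```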
6 changes: 3 additions & 3 deletions advinference/intro_to_highthroughput_data.Rmd
@@ -11,11 +11,11 @@ library(knitr)
opts_chunk$set(fig.path=paste0("figure/", sub("(.*).Rmd","\\1",basename(knitr:::knit_concord$get('infile'))), "-"))
```

# Inference For High Dimensional Data
# Inference for High Dimensional Data

## Introduction

High-throughput technologies have changed basic biology and the biomedical sciences from data poor disciplines to data intensive ones. A specific example comes from research fields interested in understanding gene expression. Gene expression is the process in which DNA, the blueprint for life, is copied into RNA, the templates for the synthesis of proteins, the building blocks for life. In the 1990s, the analysis of gene expression data amounted to spotting black dots on a piece of paper or extracting a few numbers from standard curves. With high-throughput technologies, such as microarrays, this suddenly changed to sifting through tens of thousands of numbers. More recently, RNA-Sequencing has further increased data complexity. Biologists went from using their eyes or simple summaries to categorize results, to having thousands (and now millions) of measurements per sample to analyze. In this chapter, we will focus on statistical inference in the context of high-throughput measurements. Specifically, we focus on the problem of detecting differences in groups using statistical tests and quantifying uncertainty in a meaningful way. We also introduce exploratory data analysis techniques that should be used in conjunction with inference when analyzing high-throughput data. In later chapters, we will study the statistics behind clustering, machine learning, factor analysis and multi-level modeling.
High-throughput technologies have changed basic biology and the biomedical sciences from data-poor disciplines to data-intensive ones. A specific example comes from research fields interested in understanding gene expression. Gene expression is the process in which DNA, the blueprint for life, is copied into RNA, the templates for the synthesis of proteins, the building blocks for life. In the 1990s, the analysis of gene expression data amounted to spotting black dots on a piece of paper or extracting a few numbers from standard curves. With high-throughput technologies, such as microarrays, this suddenly changed to sifting through tens of thousands of numbers. More recently, RNA sequencing has further increased data complexity. Biologists went from using their eyes or simple summaries to categorize results, to having thousands (and now millions) of measurements per sample to analyze. In this chapter, we will focus on statistical inference in the context of high-throughput measurements. Specifically, we focus on the problem of detecting differences in groups using statistical tests and quantifying uncertainty in a meaningful way. We also introduce exploratory data analysis techniques that should be used in conjunction with inference when analyzing high-throughput data. In later chapters, we will study the statistics behind clustering, machine learning, factor analysis, and multi-level modeling.

Since there is a vast number of available public datasets, we use several gene expression examples. Nonetheless, the statistical techniques you will learn have also proven useful in other fields that make use of high-throughput technologies. Technologies such as microarrays, next-generation sequencing, fMRI, and mass spectrometry all produce data to answer questions for which the methods we learn here will be indispensable.

@@ -34,7 +34,7 @@ install_github("genomicsclass/GSE5859Subset")

#### The three tables

Most of the data we use as examples in this book are created with high-throughput technologies. These technologies measure thousands of _features_. Examples of feature are genes, single base locations of the genome, genomic regions, or image pixel intensities. Each specific measurement product is defined by a specific set of features. For example, a specific gene expression microarray product is defined by the set of genes that it measures.
Most of the data we use as examples in this book are created with high-throughput technologies. These technologies measure thousands of _features_. Examples of features are genes, single base locations of the genome, genomic regions, or image pixel intensities. Each specific measurement product is defined by a specific set of features. For example, a specific gene expression microarray product is defined by the set of genes that it measures.

A specific study will typically use one product to make measurements on several experimental units, such as individuals. The most common experimental unit will be the individual, but experimental units can also be defined by other entities, for example different parts of a tumor. Following experimental jargon, we often call the experimental units _samples_. It is important not to confuse these with samples as referred to in previous chapters, as in "random sample".
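As a concrete sketch of the three-table layout, using the `GSE5859Subset` objects installed above (the object names are those provided by that package):

```{r}
library(GSE5859Subset)
data(GSE5859Subset)
dim(geneExpression)  ## features x samples: the measurements
dim(sampleInfo)      ## one row per sample (the experimental units)
dim(geneAnnotation)  ## one row per feature
## the measurement columns line up with the sample annotation table:
all(colnames(geneExpression) == sampleInfo$filename)
```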

22 changes: 12 additions & 10 deletions advinference/multiple_testing.Rmd
@@ -33,16 +33,14 @@ In the context of high-throughput data we can make several type I errors and sev
|                 | Called significant | Not called significant | Total |
|-----------------|--------------------|------------------------|-------|
|Null True        | $V$                | $m_0-V$                | $m_0$ |
|Alternative True | $S$                | $m_1-S$                | $m_1$ |
|Total            | $R$                | $m-R$                  | $m$   |

To describe the entries in the table, let's use as an example a dataset representing measurements from 10,000 genes, which means that the total number of tests that we are conducting is: $m=10,000$. The number of genes for which the null hypothesis is true, which in most cases represent the "non-interesting" genes, is $m_0$, while the number of genes for which the null hypothesis is false is $m_1$. For this we can also say that the _alternative hypothesis_ is true. In general, we are interested in _detecting_ as many as the cases for which the alternative hypothesis is true (true positives), without incorrectly detecting cases for which the null hypothesis is true (false positives). For most high-throughput experiments, we assume that $m_0$ is much greater than $m_1$. For example, we test 10,000 expecting 100 genes or less to be _interesting_. This would imply that $m_1 \leq 100$ and $m_0 \geq 19,900$.
To describe the entries in the table, let's use as an example a dataset representing measurements from 10,000 genes, which means that the total number of tests that we are conducting is $m=10,000$. The number of genes for which the null hypothesis is true, which in most cases represent the "non-interesting" genes, is $m_0$, while the number of genes for which the null hypothesis is false is $m_1$. For these we can also say that the _alternative hypothesis_ is true. In general, we are interested in _detecting_ as many as possible of the cases for which the alternative hypothesis is true (true positives), without incorrectly detecting cases for which the null hypothesis is true (false positives). For most high-throughput experiments, we assume that $m_0$ is much greater than $m_1$. For example, we test 10,000 genes expecting 100 or fewer to be _interesting_. This would imply that $m_1 \leq 100$ and $m_0 \geq 9,900$.

Throughout this chapter we refer to _features_ as the units being tested. In genomics, examples of features are genes, transcripts, binding sites, CpG sites, and SNPs. In the table, $R$ represents the total number of features that we call significant after applying our procedure, while $m-R$ is the total number of genes we don't call significant. The rest of the table contains important quantities that are unknown in practice.

* $V$ represents the number of type I errors or false positives. Specifically, $V$ is the number of features for which the null hypothesis is true, but that we call significant.
* $S$ represents the number of true positives. Specifically, $S$ is the number of features for which the alternative is true, and that we call significant.

This implies that there are $m_1-S$ type II errors or _false negatives_ and $m_0-V$ true negatives.

Keep in mind that if we only ran one test, a p-value is simply the probability that $V=1$ when $m=m_0=1$. Power is the probability of $S=1$ when $m=m_1=1$. In this very simple case, we wouldn't bother making the table above, but now we show how defining the terms in the table helps in practice the high-dimensional context.
This implies that there are $m_1-S$ type II errors or _false negatives_ and $m_0-V$ true negatives. Keep in mind that if we only ran one test, a p-value is simply the probability that $V=1$ when $m=m_0=1$. Power is the probability of $S=1$ when $m=m_1=1$. In this very simple case, we wouldn't bother making the table above, but now we show how defining the terms in the table helps for the high-dimensional setting.
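As a quick numerical illustration of this single-test case (our own sketch, with simulated normal data):

```{r}
set.seed(1)
B <- 1000
## type I error rate: Pr(V=1) when the null is true; should be near 0.05
mean(replicate(B, t.test(rnorm(12), rnorm(12))$p.value < 0.05))
## power: Pr(S=1) when the alternative is true (here a shift of 1 SD)
mean(replicate(B, t.test(rnorm(12, mean=1), rnorm(12))$p.value < 0.05))
```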


#### Data example
@@ -75,11 +73,13 @@ Although in practice we do not know the fact that no diet works, in this simulat
```{r}
sum(pvals < 0.05) ##This is R
```

Such a large number of false positives is not acceptable in most contexts.
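The count is in line with what the table predicts: each of the $m = m_0$ null tests has probability $\alpha = 0.05$ of yielding a false positive, so we expect about

```{r}
0.05 * 10000
```

false positives even though no diet works.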

Here is more complicated code showing results where 10% of the diets are effective with an average effect size of $\Delta = 3$ ounces.
Studying this code carefully will help understand the meaning of the table above.
Studying this code carefully will help us understand the meaning of the table above.
First let's define _the truth_:

```{r}
alpha <- 0.05
N <- 12
Expand All @@ -92,6 +92,7 @@ delta <- 3
```

Now we are ready to simulate 10,000 tests, perform a t-test on each, and record whether we rejected the null hypothesis:

```{r}
set.seed(1)
calls <- sapply(1:m, function(i){
@@ -197,12 +198,13 @@ pvals <- sapply(1:m, function(i){
```{r}
sum(pvals < 0.05/10000)
```
are called significant after applying the Bonferroni procedure, despite having 1000 diets that work.

are called significant after applying the Bonferroni procedure, despite having 1,000 diets that work.
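As an aside (our own note, not in the original text), base R's `p.adjust` gives the same count, since the Bonferroni adjustment simply multiplies each p-value by $m$ (capping at 1):

```{r}
sum(p.adjust(pvals, method = "bonferroni") < 0.05)
```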


## False Discovery Rate

There are many situations for which requiring an FWER of 0.05 does not make sense as it is much too strict. For example, consider the very common exercise of running a preliminary small study to determine a handful of candidate genes. This is referred to as a _discovery_ driven project or experiment. We may be in search of an unknown causative gene and more than willing to perform follow up studies with many more samples on just the candidates. If we develop a procedure that produces, for example, a list of 10 genes of which 1 or 2 pan out as important, the experiment is a resounding success. With a small sample size, the only way to achieve a FWER $\leq$ 0.05 is with an empty list of genes. We already saw in the previous section that despite 1,000 diets being effective, we ended up with a list with just 2. Change the sample size to 6 and you very likely get 0:
There are many situations for which requiring an FWER of 0.05 does not make sense as it is much too strict. For example, consider the very common exercise of running a preliminary small study to determine a handful of candidate genes. This is referred to as a _discovery_ driven project or experiment. We may be in search of an unknown causative gene and more than willing to perform follow-up studies with many more samples on just the candidates. If we develop a procedure that produces, for example, a list of 10 genes of which 1 or 2 pan out as important, the experiment is a resounding success. With a small sample size, the only way to achieve a FWER $\leq$ 0.05 is with an empty list of genes. We already saw in the previous section that despite 1,000 diets being effective, we ended up with a list with just 2. Change the sample size to 6 and you very likely get 0:

```{r}
set.seed(1)
@@ -282,7 +284,7 @@ polygon(c(0,0.05,0.05,0),c(0,0,h$counts[1],h$counts[1]),col="grey")
abline(h=m0/20)
```

The first bar (grey) on the left represents cases with p-values smaller than 0.05. From the horizontal line we can infer that about 1/2 are false positives. This is in agreement with an FDR of 0.50. If we look at the bar for 0.01, we can see a lower FDR, as expected, but would call less features significant.
The first bar (grey) on the left represents cases with p-values smaller than 0.05. From the horizontal line we can infer that about half are false positives. This is in agreement with an FDR of 0.50. If we look at the bar for 0.01, we can see a lower FDR, as expected, but we would call fewer features significant.
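The height of the reference line follows from a quick calculation: the $m_0$ null p-values are approximately uniform, so a bin of width 0.05 should contain about $m_0/20$ of them:

```{r}
m0/20  ## expected number of null p-values per 0.05-wide bin
```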

```{r pval_hist2, fig.cap="Histogram of p-values with breaks at every 0.01. Monte Carlo simulation was used to generate data with m_1 genes having differences between groups."}
h <- hist(pvals,breaks=seq(0,1,0.01))
@@ -352,7 +354,7 @@ FDR=mean(Qs)
print(FDR)
```

The FDR is lower than 0.05. This is to be expected because we need to be conservative to assure the FDR $\leq$ 0.05 for any value of $m_0$, such as for the extreme case where every hypothesis tested is null: $m=m_0$. If you re-do the simulation above for this case, you will find that the FDR increases.
The FDR is lower than 0.05. This is to be expected because we need to be conservative to ensure the FDR $\leq$ 0.05 for any value of $m_0$, such as for the extreme case where every hypothesis tested is null: $m=m_0$. If you re-do the simulation above for this case, you will find that the FDR increases.
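As a practical aside (again our own sketch, assuming `pvals` still holds the simulated p-values from above), the Benjamini-Hochberg adjustment is also available in base R via `p.adjust`:

```{r}
fdr <- p.adjust(pvals, method = "BH")
sum(fdr < 0.05)  ## features called significant with FDR controlled at 0.05
```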

We should also note that in ...
```{r}
@@ -411,7 +413,7 @@ However, instead of doing this, we compute a _q-value_ for each test. If a featu

In R, this can be computed with the `qvalue` function in the `qvalue` package:

```{r qval_vs_pval,fig.cap="Q-values versus p-values.",warning=FALSE}
```{r qval_vs_pval,fig.cap="q-values versus p-values.",warning=FALSE}
library(qvalue)
res <- qvalue(pvals)
qvals <- res$qvalues
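## the remainder of this chunk is collapsed in the diff; presumably it
## plots the two against each other, e.g.:
plot(pvals, qvals, xlab="p-values", ylab="q-values")
```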
