Join GitHub today
GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.
Sign up| --- | |
| title: "Enrichment Analyses" | |
| date: "`r Sys.Date()`" | |
| output: rmarkdown::html_vignette | |
| vignette: > | |
| %\VignetteIndexEntry{Enrichment Analyses} | |
| %\VignetteEngine{knitr::rmarkdown} | |
| %\usepackage[utf8]{inputenc} | |
| --- | |
| ## Performing `Phylostratum` and `Divergence Stratum` Enrichment Analyses | |
| `Phylostratum` and `Divergence Stratum` enrichment analyses have been performed by several | |
| studies to correlate organ or metabolic pathway evolution with the origin of organ or pathway specific genes (Sestak and Domazet-Loso, 2015). | |
| In detail, `Phylostratum` and `Divergence Stratum` enrichment analyses can be performed analogously to [Gene Ontology](http://geneontology.org/page/go-enrichment-analysis) and [Kegg](http://www.genome.jp/kegg/) enrichment analyses to study the enrichment of evolutionary age or sequence divergence in a set of selected genes against the entire genome/transcriptome. In case specific age categories are significantly over- or underrepresented in the selected gene set, assumptions or potential correlations between the evolutionary origin of a particular organ or metabolic pathway can be implied. | |
| In this vignette we will use the data set published by Sestak and Domazet-Loso, 2015 to demonstrate how to perform enrichment analyses using `myTAI`. | |
| ## Enrichment Analyses using `PlotEnrichment()` | |
| The `PlotEnrichment()` function implemented in `myTAI` computes and visualizes the significance of enriched (over- or underrepresented) `Phylostrata` or `Divergence Strata` within an input set of process/tissue specific genes. In detail this function takes the `Phylostratum` or `Divergence Stratum` distribution of all genes stored in the input `ExpressionSet` as background set and the `Phylostratum` or `Divergence Stratum` distribution of the specific gene set and performs a Fisher's exact test for each `Phylostratum` or `Divergence Stratum` to quantify the statistical significance of over- or under-represented `Phylostrata` or `Divergence Strata` within the set of selected genes. In other words, the frequency distribution of `Phylostrata` or `Divergence Strata` in the complete sample is compared with the frequency distribution of `Phylostrata` or `Divergence Strata` in the set of selected genes and over- or under-representation is visualized by log-odds (or odds), where a log-odd of zero means that both frequency distributions are equal (see also Sestak and Domazet-Loso, 2015). | |
| ### Example Data Set Retrieval | |
| Before using the `PlotEnrichment()` function, we need to download the example data set from | |
| Sestak and Domazet-Loso, 2015. | |
| Download the `Phylostratigraphic Map` of _D. rerio_: | |
| ```r | |
| # download the Phylostratigraphic Map of Danio rerio | |
| # from Sestak and Domazet-Loso, 2015 | |
| download.file( url = "http://mbe.oxfordjournals.org/content/suppl/2014/11/17/msu319.DC1/TableS3-2.xlsx", | |
| destfile = "MBE_2015a_Drerio_PhyloMap.xlsx" ) | |
| ``` | |
| Read the `*.xlsx` file storing the `Phylostratigraphic Map` of _D. rerio_ and format it for the use with `myTAI`: | |
| ```r | |
| # install the readxl package | |
| install.packages("readxl") | |
| # load package readxl | |
| library(readxl) | |
| # read the excel file | |
| DrerioPhyloMap.MBEa <- read_excel("MBE_2015a_Drerio_PhyloMap.xlsx", sheet = 1, skip = 4) | |
| # format Phylostratigraphic Map for use with myTAI | |
| Drerio.PhyloMap <- DrerioPhyloMap.MBEa[ , 1:2] | |
| # have a look at the final format | |
| head(Drerio.PhyloMap) | |
| ``` | |
| ``` | |
| Phylostrata ZFIN_ID | |
| 1 1 ZDB-GENE-000208-13 | |
| 2 1 ZDB-GENE-000208-17 | |
| 3 1 ZDB-GENE-000208-18 | |
| 4 1 ZDB-GENE-000208-23 | |
| 5 1 ZDB-GENE-000209-3 | |
| 6 1 ZDB-GENE-000209-4 | |
| ``` | |
| Now, `Drerio.PhyloMap` stores the `Phylostratigraphic Map` of _D. rerio_ which is used | |
| as background set to perform enrichment analyses with `PlotEnrichment()`. | |
| ### Enrichment Analyses | |
| Now, the `PlotEnrichment()` function visualizes the over- and underrepresented `Phylostrata` of brain specific genes when compared with the total number of genes stored in the `Phylostratigraphic Map` of _D. rerio_. | |
| ```{r,eval=FALSE} | |
| # read expression data (organ specific genes) from Sestak and Domazet-Loso, 2015 | |
| Drerio.OrganSpecificExpression <- read_excel("MBE_2015a_Drerio_PhyloMap.xlsx", sheet = 2, skip = 3) | |
| # select only brain specific genes | |
| Drerio.Brain.Genes <- unique(na.omit(Drerio.OrganSpecificExpression[ , "brain"])) | |
| # visualize enriched Phylostrata of genes annotated as brain specific | |
| PlotEnrichment(Drerio.PhyloMap, | |
| test.set = Drerio.Brain.Genes, | |
| measure = "log-foldchange", | |
| use.only.map = TRUE, | |
| legendName = "PS") | |
| ``` | |
| Here, the first argument is either a standard `ExpressionSet` object (in case `use.only.map = FALSE`: default) or a `Phylostratigraphic Map` or `Divergence Map` (in case `use.only.map = TRUE`; see [Introduction](Introduction.html) for details). The second argument `test.set` specifies the gene ids also stored in the corresponding `ExpressionSet` or `Phylostratigraphic Map/ Divergence Map` for which enrichment shall be quantified and visualized. | |
| To visualize the odds or log-odds of over- or underrepresented genes within the `test.set` the following procedure is performed: | |
| - $N_{ij}$ denotes the number of genes in group j and deriving from PS $i$, with $i = 1, .. , n$ and where $j = 1$ denotes the background set and $j = 2$ denotes the `test.set` | |
| - $N_{i.}$ denotes the total number of genes within PS $i$ | |
| - $N_{.j}$ denotes the total number of genes within group $j$ | |
| - $N_{..}$ is the total number of genes within all groups $j$ and all PS $i$ | |
| - $f_{ij}$ = $N_{ij}$ / $N_{..}$ and $g_{ij}$ = $f_{ij}$ / $f_{.j}$ denote relative frequencies between groups | |
| - $f_{i.}$ denotes the between group sum of $f_{ij}$ | |
| The result is the __fold-change value__ (odds; `measure = "foldchange"`) denoted as $C_2 = g_{i2} / f_{i.}$ which is visualized above and below zero or the __log fold-change__ value (log-odds; `measure = "log-foldchange"`), where $log_2$(C) = $log_2$($g_{i2}$) - $log_2$($f_{i.}$) which is visualized symmetrically above and below zero by `PlotEnrichment()`. Analogously, $C_1 = g_{i1} / f_{i.}$ but is not visualized by this function. | |
| Internally, `PlotEnrichment()` performs a Fisher's exact test for each `Phylostratum` or `Divergence Stratum` separately, to quantify the significance of over- or under-representation of corresponding `Phylostrata` or `Divergence Strata` within the `test.set` when compared with the entire `ExpressionSet`. `PlotEnrichment()` visualizes significantly enriched (over- or underrepresented) `Phylostrata` or `Divergence Strata` with asterisks '*'. | |
| Notation: | |
| - '*' = P-Value $\leq$ 0.05 | |
| - '**' = P-Value $\leq$ 0.005 | |
| - '***' = P-Value $\leq$ 0.0005 | |
| Users will notice that when performing the `PlotEnrichment()` function, the p-values and the | |
| enrichment matrix (storing $C_1$ and $C_2$) will be returned. | |
| ```{r,eval = FALSE} | |
| PlotEnrichment(Drerio.PhyloMap, | |
| test.set = Drerio.Brain.Genes, | |
| measure = "log-foldchange", | |
| use.only.map = TRUE, | |
| legendName = "PS") | |
| ``` | |
| ``` | |
| $p.values | |
| PS1 PS2 PS3 PS4 PS5 PS6 | |
| 8.283490e-01 8.362880e-05 6.778981e-02 1.373816e-02 7.946309e-13 6.017041e-01 | |
| PS7 PS8 PS9 PS10 PS11 PS12 | |
| 2.185021e-03 2.281194e-03 8.943147e-01 5.699612e-01 4.717058e-02 9.367759e-01 | |
| PS13 PS14 | |
| 3.929949e-03 1.593834e-05 | |
| $enrichment.matrix | |
| BG_Set Test_Set | |
| PS1 -0.001132832 0.007668216 | |
| PS2 0.023733936 -0.172380714 | |
| PS3 -0.040879607 0.250587496 | |
| PS4 -0.048920465 0.294399729 | |
| PS5 -0.114888949 0.603817643 | |
| PS6 0.008678915 -0.060350168 | |
| PS7 -0.062948352 0.367240944 | |
| PS8 0.115630474 -1.206210187 | |
| PS9 -0.007353969 0.048964218 | |
| PS10 -0.031971192 0.200141519 | |
| PS11 0.039742253 -0.303363314 | |
| PS12 -0.002418079 0.016311853 | |
| PS13 0.101449988 -0.984621732 | |
| PS14 0.098211044 -0.938724783 | |
| ``` | |
| In case users are only interested in the p-values of the Fisher test and the enrichment matrix without illustrating the bar plot, | |
| they can specify the `plot.bars = FALSE` argument to only retrieve the numeric results. | |
| ```{r,eval = FALSE} | |
| # specify plot.bars = FALSE to retrieve only numeric results | |
| EnrichmentResult <- PlotEnrichment(Drerio.PhyloMap, | |
| test.set = Drerio.Brain.Genes, | |
| measure = "log-foldchange", | |
| use.only.map = TRUE, | |
| legendName = "PS", | |
| plot.bars = FALSE) | |
| # access p-values quantifying the enrichment for each Phylostratum | |
| EnrichmentResult$p.values | |
| ``` | |
| ``` | |
| PS1 PS2 PS3 PS4 PS5 PS6 | |
| 8.283490e-01 8.362880e-05 6.778981e-02 1.373816e-02 7.946309e-13 6.017041e-01 | |
| PS7 PS8 PS9 PS10 PS11 PS12 | |
| 2.185021e-03 2.281194e-03 8.943147e-01 5.699612e-01 4.717058e-02 9.367759e-01 | |
| PS13 PS14 | |
| 3.929949e-03 1.593834e-05 | |
| ``` | |
| ```{r, eval = FALSE} | |
| # access enrichment matrix storing C_1 and C_2 | |
| EnrichmentResult$enrichment.matrix | |
| ``` | |
| ``` | |
| BG_Set Test_Set | |
| PS1 -0.001132832 0.007668216 | |
| PS2 0.023733936 -0.172380714 | |
| PS3 -0.040879607 0.250587496 | |
| PS4 -0.048920465 0.294399729 | |
| PS5 -0.114888949 0.603817643 | |
| PS6 0.008678915 -0.060350168 | |
| PS7 -0.062948352 0.367240944 | |
| PS8 0.115630474 -1.206210187 | |
| PS9 -0.007353969 0.048964218 | |
| PS10 -0.031971192 0.200141519 | |
| PS11 0.039742253 -0.303363314 | |
| PS12 -0.002418079 0.016311853 | |
| PS13 0.101449988 -0.984621732 | |
| PS14 0.098211044 -0.938724783 | |
| ``` | |
| ### Defining the Background Set | |
| The Fisher test which is performed inside `PlotEnrichment()` assumes | |
| that all genes stored in the input `ExpressionSet` or `Phylostratigraphic Map`/`Divergence Map` are used to define the background set for constructing the test statistic. However, | |
| since in most cases the `test.set` is an subset of the input `ExpressionSet` or `Phylostratigraphic Map`/`Divergence Map` one could also specify the `complete.bg` argument to remove all `test.set` genes from the background set when performing the Fisher test and visualization. | |
| The following two examples allow users to compare the results when retaining all genes as background set compared with the option to remove `test.set` genes from the background set. | |
| ```{r,eval = FALSE} | |
| # complete.bg = TRUE (default) -> retain test.set genes in background set | |
| PlotEnrichment(Drerio.PhyloMap, | |
| test.set = Drerio.Brain.Genes, | |
| measure = "log-foldchange", | |
| complete.bg = TRUE, | |
| use.only.map = TRUE, | |
| legendName = "PS") | |
| ``` | |
| ```{r,eval = FALSE} | |
| # complete.bg = FALSE -> remove test.set genes from background set | |
| PlotEnrichment(Drerio.PhyloMap, | |
| test.set = Drerio.Brain.Genes, | |
| measure = "log-foldchange", | |
| complete.bg = FALSE, | |
| use.only.map = TRUE, | |
| legendName = "PS") | |
| ``` | |
| Users will notice that although some p-values change, the qualitative result did not change. | |
| In border line cases however, the results might influence whether or not some `Phylostrata` | |
| or `Divergence Strata` are denoted as significantly enriched or not. So always be aware of | |
| the interpretation when retaining or removing the `test.set` from the background set, | |
| because both options are valid and have advantages and disadvantages and depend on a valid interpretation. | |
| ## Interpretation of Enrichment Results | |
| For the _D. rerio_ brain genes example you can see that PS4, PS5, and PS7 are significantly | |
| over-represented in the set of brain specific genes. | |
| ```{r, eval = FALSE} | |
| PlotEnrichment(Drerio.PhyloMap, | |
| test.set = Drerio.Brain.Genes, | |
| measure = "foldchange", | |
| complete.bg = TRUE, | |
| use.only.map = TRUE, | |
| legendName = "PS") | |
| ``` | |
| Again, we retrieve the _D. rerio_ specific taxonomy represented by PS1-14 using the `taxonomy()` funtion (see [Introduction](Introduction.html) and [Taxonomy](Taxonomy.html) for details). | |
| ```{r, eval=FALSE} | |
| # retrieve the taxonomy of D. rerio | |
| taxonomy(organism = "Danio rerio") | |
| ``` | |
| ``` | |
| id name rank | |
| 1 cellular organisms no rank 131567 | |
| 2 Eukaryota superkingdom 2759 | |
| 3 Opisthokonta no rank 33154 | |
| 4 Metazoa kingdom 33208 | |
| 5 Eumetazoa no rank 6072 | |
| 6 Bilateria no rank 33213 | |
| 7 Deuterostomia no rank 33511 | |
| 8 Chordata phylum 7711 | |
| 9 Craniata subphylum 89593 | |
| 10 Vertebrata no rank 7742 | |
| 11 Gnathostomata no rank 7776 | |
| 12 Teleostomi no rank 117570 | |
| 13 Euteleostomi no rank 117571 | |
| 14 Actinopterygii superclass 7898 | |
| 15 Actinopteri class 186623 | |
| 16 Neopterygii subclass 41665 | |
| 17 Teleostei infraclass 32443 | |
| 18 Osteoglossocephalai no rank 1489341 | |
| 19 Clupeocephala no rank 186625 | |
| 20 Otomorpha no rank 186634 | |
| 21 Ostariophysi no rank 32519 | |
| 22 Otophysa no rank 186626 | |
| 23 Cypriniphysae superorder 186627 | |
| 24 Cypriniformes order 7952 | |
| 25 Cyprinoidea superfamily 30727 | |
| 26 Cyprinidae family 7953 | |
| 27 Danio genus 7954 | |
| 28 Danio rerio species 7955 | |
| ``` | |
| Sestak and Domazet-Loso, 2015 collapsed these 28 taxonomic nodes into 14 taxonomic nodes | |
| (see `Figure 2` in Sestak and Domazet-Loso, 2015) and labelled them as phylostrata 1 to phylostrata 14, where PS1 represents `cellular organisms` and PS14 represents `D. rerio` specific genes. | |
| Based on the phylostratum categorization of Sestak and Domazet-Loso, 2015, PS4 represents `Holozoa (= Metazoa + allies)`, PS5 represents `Metazoa`, and PS7 represents `Bilateria`. | |
| Now, the over-representation results of brain specific genes returned by `PlotEnrichment()` | |
| provide evidence, that brain specific genes might indeed have originated during the emergence of the nervous system at the metazoan-eumetazoan transition leading to the interpretation that the vertebrate brain has a step wise adaptive history where most of its extant organization was already present in the chordate ancestor as argued by Sestak and Domazet-Loso, 2015. | |
| This example shall illustrate how the `PlotEnrichment()` function can be used to trace | |
| the evolutionary origin of tissue or process specific genes by investigating their age enrichment. | |
| In case users have an `ExpressionSet` storing the `Phylostratigraphic Map` of _D. rerio_ | |
| as well as an expression set, they can furthermore use the `PlotGeneSet()` function implemented in `myTAI` to visualize the expression levels of brain specific genes which | |
| have been shown to be significantly enriched in `Metazoa` specific phylostrata. | |
| Example: | |
| ```{r, eval = FALSE} | |
| # the best parameter setting to visualize this plot: | |
| # png("DrerioBrainSpecificGeneExpression.png",700,400) | |
| PlotGeneSet(ExpressionSet = DrerioPhyloExpressionSet, | |
| gene.set = Drerio.Brain.Genes, | |
| plot.legend = FALSE, | |
| type = "l", | |
| lty = 1, | |
| lwd = 4, | |
| xlab = "Ontogeny", | |
| ylab = "Expression Level") | |
| # dev.off() | |
| ``` | |
| Here `DrerioPhyloExpressionSet` denotes a hypothetical `ExpressionSet` of _D. rerio_ development. | |
| Additionally, the `SelectGeneSet()` function allows users to obtain the `ExpresisonSet` subset of selected genes (`gene.set`) for subsequent analyses. | |
| ```{r, eval = FALSE} | |
| # select the ExpressionSet subset of Brain specific genes | |
| Brain.PhyloExpressionSet <- SelectGeneSet( ExpressionSet = DrerioPhyloExpressionSet, | |
| gene.set = Drerio.Brain.Genes ) | |
| head(Brain.PhyloExpressionSet) | |
| ``` | |
| ### Adjust P-values for Multiple Comparisons | |
| In case a large number of Phylostrata or Divergence Strata is included in the input `ExpressionSet`, p-values returned by `PlotEnrichment()` should be adjusted for multiple comparisons. For this purpose `PlotEnrichment()` includes the argument `p.adjust.method`. | |
| Here, all methods implemented in `?p.adjust` can be specified: | |
| ```{r,eval = FALSE} | |
| # adjust p-values for multiple comparisons with Benjamini & Hochberg (1995) | |
| PlotEnrichment(Drerio.PhyloMap, | |
| test.set = Drerio.Brain.Genes, | |
| measure = "log-foldchange", | |
| complete.bg = FALSE, | |
| use.only.map = TRUE, | |
| legendName = "PS", | |
| p.adjust.method = "BH") | |
| ``` | |
| Please consult these reviews ([Biostatistics Handbook](http://www.biostathandbook.com/multiplecomparisons.html), [Gelman et al., 2008](http://www.stat.columbia.edu/~gelman/research/published/multiple2f.pdf), and [Slides](http://www.gs.washington.edu/academics/courses/akey/56008/lecture/lecture10.pdf)) to decide whether or not to apply p-value adjustment | |
| to your own dataset. | |
| ## References | |
| Sestak MS and Domazet-Loso T. __Phylostratigraphic Profiles in Zebrafish Uncover Chordate Origins of the Vertebrate Brain__. _Mol. Biol. Evol._ (2015) 32 (2): 299-312. | |