# What is Bioconductor?
Bioconductor is an ecosystem of R packages for analysing genomics data. According to bioconductor.org:

```
"Bioconductor provides tools for the analysis and comprehension of high-throughput genomic data. Bioconductor uses the R statistical programming language, and is open source and open development. It has two releases each year, and an active user community."
```

Bioconductor is an excellent source for annotation resources. I'll walk through using two of the most popular packages here -- `AnnotationHub` and `biomaRt`. `AnnotationHub` is a convenient interface to annotations from many different resources and in many different formats, including annotations that are gene-centric (like OrgDbs) or genome-centric (like TxDbs). `biomaRt` is mostly geared toward annotations from Ensembl. We will dig into these packages in more detail, and these distinctions should make more sense as we move along.

For the purposes of this tutorial, we will be using the `airway` data. Four human airway smooth muscle cell lines were left untreated or treated with dexamethasone for 18 hours (Himes et al. 2014). I have already started processing this data using DESeq2. We will load these objects and use them to look at a few different ways to query public annotation data and run some downstream analyses.

Much of this workshop is pulled from these resources:

https://www.bioconductor.org/packages/release/workflows/vignettes/annotation/inst/doc/Annotation_Resources.html                   
http://yulab-smu.top/clusterProfiler-book/           
https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf     
https://bioconductor.org/packages/release/bioc/vignettes/GenomicFeatures/inst/doc/GenomicFeatures.pdf            


Let's load some packages:

In [None]:

suppressPackageStartupMessages(library("tidyverse"))
suppressPackageStartupMessages(library("ggplot2"))
suppressPackageStartupMessages(library("BiocManager"))
suppressPackageStartupMessages(library("gridExtra"))
suppressPackageStartupMessages(library("ggrepel"))
suppressPackageStartupMessages(library("airway"))
suppressPackageStartupMessages(library("AnnotationHub"))
suppressPackageStartupMessages(library("clusterProfiler"))
suppressPackageStartupMessages(library("enrichplot"))
suppressPackageStartupMessages(library("biomaRt"))
suppressPackageStartupMessages(library("DESeq2"))
suppressPackageStartupMessages(library("dplyr"))
suppressPackageStartupMessages(library("tidyr"))
suppressPackageStartupMessages(library("GenomicRanges"))
suppressPackageStartupMessages(library("GenomicFeatures"))
# BSgenome and the hg19 genome package are needed for the BSgenome section later on
suppressPackageStartupMessages(library("BSgenome"))
suppressPackageStartupMessages(library("BSgenome.Hsapiens.UCSC.hg19"))

#BiocManager::install("GenomicRanges")
#BiocManager::install("GenomicFeatures")
#BiocManager::install("BSgenome.Hsapiens.UCSC.hg19")
#BiocManager::install("BSgenome")



In [None]:
#BiocManager::install(c('GenomicFeatures','Rhtslib', 'Rsamtools', 'GenomicAlignments', 'Biostrings', 'rtracklayer'))

Let's import some data to work with -- `res` is the `results` object and `rld` is the rlog-transformed counts from a DESeq2 differential expression analysis run on the `airway` data. The comparison is dexamethasone-treated versus untreated samples.


In [None]:
res <- readRDS("res.rds")

In [None]:
res

In [None]:
rld <- readRDS("rld.rds")

In [None]:
rld

Let's work with the `res` results table. We can see that each row is a gene (`ENSG...`) and each column gives us some information about the differential expression analysis. These gene IDs are not particularly informative, but we can use biomaRt to fix that.

# Using biomaRt

The [biomaRt](https://bioconductor.org/packages/release/bioc/html/biomaRt.html) package makes it easy to query public repositories of biological data. We can use biomaRt to query Ensembl for annotations so that we can look for 'housekeeping genes' which are typically considered to be stably expressed and shouldn't show large variations across different samples. We have selected a list of genes based on two publications that queried public cancer genome data to find housekeeping genes for use with RNA-seq from cancer cell lines (https://doi.org/10.1186/s12859-019-2809-2, https://doi.org/10.3389/fgene.2019.00097). 

biomaRt is already loaded, so let's make a vector of the gene symbols from the published data:

In [None]:
housekeeping <- c('PCBP1','RER1', 'RPN1', 'PUM1', 'IPO8')

Then we can see what BioMarts are available:

In [None]:
listMarts()

Let's use `ENSEMBL_MART_ENSEMBL`. You might get an error that says `Ensembl site unresponsive, trying uswest mirror`; run `?useEnsembl` for more information about the available options.

In [None]:
ensembl <- useEnsembl(biomart = 'ENSEMBL_MART_ENSEMBL', mirror = 'uswest')

You can see a list of all available datasets within the mart if you run `listDatasets(ensembl)` -- there are many (~200 of them), so let's narrow it down a little and look only for human data.

In [None]:
searchDatasets(mart = ensembl, pattern = 'hsapiens')

Now we can put it all together to create a BioMart object (again, you might get an error that says `Ensembl site unresponsive, trying uswest mirror`):

In [None]:
ensembl <- useEnsembl(biomart = 'ENSEMBL_MART_ENSEMBL', dataset='hsapiens_gene_ensembl', mirror = 'uswest')

Later, we will use the `getBM()` function to query BioMart (this is the main function of biomaRt). This function takes the following arguments:

`attributes`: the attributes you want to retrieve                     
`filters`: the filters that should be used in the query                    
`values`: the values of the filters                    
`mart`: the mart object you want to use.   

We can use the `listAttributes` function to see what information is available in `ensembl` (limiting it here to the first 5):

In [None]:
attributes = listAttributes(ensembl)
attributes[1:5,]

Note that there are ~3000 attributes for this mart! We only care about two -- `ensembl_gene_id` and `hgnc_symbol`.

We can use the `listFilters` function to see what our filtering options are (limiting it here to the first 5):

In [None]:
filters = listFilters(ensembl)
filters[1:5,]

We can use `getBM` to query the BioMart object:

In [None]:
ensembl_bm <- getBM(
    attributes = c('ensembl_gene_id','hgnc_symbol'),
    filters = 'hgnc_symbol',
    values = housekeeping, 
    mart = ensembl)
ensembl_bm

Let's look at the `rlog` normalized counts for our housekeeping genes:

In [None]:
housekeeping_rld <- data.frame(assay(rld)[ensembl_bm$ensembl_gene_id, ])
head(housekeeping_rld)

The `ensembl_gene_id` is currently stored as the rownames. Let's go ahead and turn it into a column in the data frame:

In [None]:
housekeeping_rld$ensembl_gene_id <- rownames(housekeeping_rld)
head(housekeeping_rld)

Then we use the `gather` function to convert the data to a long format.

In [None]:
housekeeping_rld_tidy <- gather(housekeeping_rld, key = 'sample', value = 'rlog_counts', SRR1039508:SRR1039521)
head(housekeeping_rld_tidy)
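
`gather()` still works, but the tidyr documentation describes it as superseded by `pivot_longer()`. Here is a minimal sketch of the same wide-to-long reshape on a made-up data frame. The gene IDs, sample names, and values below are invented for illustration and are not from the `airway` data.

```r
library(tidyr)

# Toy wide table: one row per gene, one column per sample (all values invented)
toy_wide <- data.frame(
    sample_1 = c(10.1, 8.2),
    sample_2 = c(10.3, 8.0),
    ensembl_gene_id = c("toy_gene_A", "toy_gene_B"))

# Every column except the ID column is stacked into sample / rlog_counts pairs
toy_long <- pivot_longer(
    toy_wide,
    cols = -ensembl_gene_id,
    names_to = "sample",
    values_to = "rlog_counts")
toy_long
```

Selecting columns with `-ensembl_gene_id` avoids hard-coding the first and last sample names, which can be convenient if the set of samples changes.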

Let's add the annotation information we pulled from biomaRt:

In [None]:
housekeeping_rld_tidy <- inner_join(ensembl_bm, housekeeping_rld_tidy, by = 'ensembl_gene_id')
head(housekeeping_rld_tidy)

Let's look at the expression of our housekeeping genes to see if they look stably expressed in our data:

In [None]:
options(repr.plot.width=10, repr.plot.height=5)

ggplot(housekeeping_rld_tidy, aes(x=sample, y=rlog_counts)) + 
geom_bar(stat="identity") +
facet_wrap(~hgnc_symbol, nrow = 1) +
theme(axis.text.x = element_text(angle = 90))

These housekeeping genes look stably expressed across each sample.
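
Eyeballing a bar plot works for five genes, but a per-gene summary can make "stable" a bit more concrete. The sketch below uses an invented toy table with the same column names as `housekeeping_rld_tidy`; the gene names and values are made up, and the choice of range and standard deviation as summaries is just one reasonable option, not a rule.

```r
# Toy long-format table (values invented): one stable gene, one variable gene
toy_tidy <- data.frame(
    hgnc_symbol = rep(c("TOY_STABLE", "TOY_VARIABLE"), each = 4),
    rlog_counts = c(10.0, 10.1, 9.9, 10.0,
                    8.0, 9.5, 7.2, 8.8))

# Spread of rlog values per gene: max minus min, and standard deviation
toy_range <- tapply(toy_tidy$rlog_counts, toy_tidy$hgnc_symbol,
                    function(x) diff(range(x)))
toy_sd <- tapply(toy_tidy$rlog_counts, toy_tidy$hgnc_symbol, sd)

toy_range
toy_sd
```

On the real data, the same `tapply` calls could be pointed at `housekeeping_rld_tidy$rlog_counts` and `housekeeping_rld_tidy$hgnc_symbol`.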

# Using AnnotationHub

Now we can try using AnnotationHub to do something similar to what we just did with `biomaRt`.
Many of the data types we will work with from AnnotationHub are based on the `AnnotationDb` object class -- including OrgDb, TxDb, and many others. This means that they have many functions and methods in common (http://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/AnnotationDbi/html/AnnotationDb-class.html).

First, let's connect to the hub using `AnnotationHub` and look at the output.

In [None]:
ah <- AnnotationHub()
ah

This is one of the very nice things about using AnnotationHub -- there are many data providers, data classes, and organisms represented in the hub. You can access these elements using the `$` accessor:

In [None]:
head(unique(ah$dataprovider))
length(unique(ah$dataprovider))

In [None]:
unique(ah$rdataclass)


## OrgDb objects

One of the options you can see here is `OrgDb`, which is an organism-specific, genome-wide annotation. We can use it to map between different gene ID types using a central identifier (usually Entrez gene ID).

OrgDb names are always of the form `org.<Ab>.<id>.db` (e.g. `org.Sc.sgd.db`), where `<Ab>` is a 2-letter abbreviation of the organism and `<id>` is a lower-case abbreviation describing the type of central identifier (`eg` for Entrez Gene IDs).

Let's see what our options are for `Homo sapiens` and `OrgDb`:

In [None]:
AnnotationHub::query(ah, pattern = c("Homo sapiens", "OrgDb"))

So you can see here that there is an OrgDb for Homo sapiens that uses Entrez gene ID as the central identifier.

In [None]:
orgdb <- AnnotationHub::query(ah, pattern = c("Homo sapiens", "OrgDb"))[[1]]

In [None]:
orgdb

What types of data can we retrieve from the OrgDb? Let's use `keytypes()` to find out. 

The likely use case is that you are hoping to convert between different ID types (like we did with biomaRt). One way to do this is the `select()` function. AnnotationHub imports this function from AnnotationDbi, so you can run `?AnnotationDbi::select` to view the help. As I said before, OrgDbs are based on the AnnotationDb base class, and the `select`, `columns`, `keys`, and `keytypes` methods are used together to query AnnotationDb objects.

`select` retrieves the data as a data.frame based on the `keys`, `columns`, and `keytype` arguments.

`columns` shows which kinds of data can be returned for the AnnotationDb object.

`keys` returns keys for the database contained in the AnnotationDb object. 

`keytypes` allows the user to discover which keytypes can be passed in to select or keys and the keytype argument.

We can view columns and keytypes -- note that these can be the same but are not always the same.

In [None]:
columns(orgdb)

In [None]:
keytypes(orgdb)

Let's look at a few examples of what the key entries look like:

In [None]:
head(keys(orgdb, keytype="SYMBOL"))

We can try running `select` to look for the housekeeping genes in the OrgDb to retrieve their ENSEMBL and ENTREZIDs:

In [None]:
ens_entr_orgdb <- select(orgdb, keys=housekeeping, 
       columns=c("ENSEMBL","ENTREZID"), 
       keytype="SYMBOL")
ens_entr_orgdb

As you can see, this returned a 1:1 mapping between keys and columns, but this might not always be the case. What happens if we use "GO" as one of the columns?

In [None]:
go_orgdb <- select(orgdb, keys=housekeeping, 
       columns=c("ENSEMBL","GO"), 
       keytype="SYMBOL")
head(go_orgdb)

This might not be the ideal outcome for you. Another approach is to use the `mapIds` function. `mapIds` is similar to `select` in that it uses `keys` and `keytype`, but it takes `column` instead of `columns` and can only return one column type:

In [None]:
mapped_go <- mapIds(orgdb, keys=housekeeping, 
       column="GO", 
       keytype="SYMBOL")
head(mapped_go)

By default, `mapIds` will return the first match. If you really want all of the GO terms, you can specify the `multiVals` argument. Here are the options for `multiVals`:

- `first`: when there are multiple matches, only the first thing that comes back is returned. This is the default behavior.
- `list`: returns a list object to the end user.
- `filter`: removes all elements that contain multiple matches, and therefore returns a shorter vector than what came in whenever some of the keys match more than one value.
- `asNA`: returns an NA value whenever there are multiple matches.
- `CharacterList`: returns a SimpleCharacterList object.
- `FUN`: you can also supply a function to the `multiVals` argument for custom behaviors. The function must take a single argument and return a single value. It is applied to all the elements and serves as a 'rule' for which thing to keep when there is more than one element. For example, this function will always grab the last element in each result: `last <- function(x){x[[length(x)]]}`
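
To make these options more concrete, here is a toy imitation in base R. It does not call AnnotationDbi at all; it only mimics what each `multiVals` setting does to a one-to-many mapping. The gene symbols and GO IDs are invented placeholders.

```r
# Invented one-to-many mapping: TOY1 has 2 terms, TOY2 has 1, TOY3 has 3
toy_map <- data.frame(
    SYMBOL = c("TOY1", "TOY1", "TOY2", "TOY3", "TOY3", "TOY3"),
    GO = c("GO:0000001", "GO:0000002", "GO:0000003",
           "GO:0000004", "GO:0000005", "GO:0000006"),
    stringsAsFactors = FALSE)

# Like multiVals = "list": every match, grouped by key
as_list <- split(toy_map$GO, toy_map$SYMBOL)

# Like multiVals = "first": keep only the first match per key
as_first <- vapply(as_list, function(x) x[[1]], character(1))

# Like multiVals = "filter": drop keys with more than one match
as_filter <- unlist(as_list[lengths(as_list) == 1])

# Like multiVals = "asNA": keys with more than one match become NA
as_na <- vapply(as_list,
                function(x) if (length(x) > 1) NA_character_ else x[[1]],
                character(1))

# Like supplying a function: keep the last match per key
last <- function(x){x[[length(x)]]}
as_last <- vapply(as_list, last, character(1))
```

Comparing `as_first`, `as_filter`, and `as_na` side by side shows how much information each option keeps or drops for keys with several matches.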


Let's specify that we want `multiVals="list"`:

In [None]:
mapped_go <- mapIds(orgdb, keys=housekeeping, 
       column="GO", 
       keytype="SYMBOL",
       multiVals="list")
head(mapped_go)

## TxDb objects

One of the other options in AnnotationHub is `TxDb`. They are also based on the AnnotationDb class and use similar methods.

A TxDb object connects a set of genomic coordinates to transcript-oriented features. It also contains feature IDs for transcripts and genes, so TxDb objects can be used to link gene IDs and transcript IDs.
Let's work with the human TxDb object:

In [None]:
AnnotationHub::query(ah, pattern = c("Homo sapiens", "TxDb", "hg19"))

We can query the AnnotationHub and specify which record we'd like to use:

In [None]:
txdb <- AnnotationHub::query(ah, pattern = c("Homo sapiens", "TxDb", "hg19"))[['AH52258']]

In [None]:
txdb

Just as we did with the OrgDb, we can look at what keytypes are available to us:

In [None]:
keytypes(txdb)

We can also use `select` in a similar way:

In [None]:
select(txdb, keys = c("2597"), columns=c("TXNAME", "TXID", "CDSNAME"), keytype="GENEID")

Or `mapIds`:

In [None]:
mapIds(txdb, keys = c("2597"), column="TXNAME", keytype="GENEID", multiVals="list")

We can look at all the transcripts available in the txdb using the `transcripts()` function:

In [None]:
transcripts(txdb)

We get back a GRanges object with the location of each transcript, as well as its `tx_name` and `tx_id`. GRanges objects are just a way to show genomic locations (or Genomic Ranges) (https://www.bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.html).
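
If GRanges objects are new to you, it can help to build a tiny one by hand. This is a minimal sketch with invented coordinates and transcript names; it is not derived from `txdb`.

```r
library(GenomicRanges)

# Two made-up ranges with a metadata column, loosely mimicking transcripts()
toy_gr <- GRanges(
    seqnames = c("chr1", "chr15"),
    ranges = IRanges(start = c(100, 500), end = c(200, 700)),
    strand = c("+", "-"),
    tx_name = c("toy_tx1", "toy_tx2"))

# Width of each range (start and end are both inclusive)
width(toy_gr)

# Subset to a single chromosome
toy_chr15 <- toy_gr[seqnames(toy_gr) == "chr15"]

# Metadata columns are stored alongside the ranges
mcols(toy_gr)$tx_name
```

The same accessors (`width()`, `seqnames()`, `mcols()`) should work on the GRanges returned by `transcripts(txdb)`.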

              
We can also look at `exons()`, `cds()`, `genes()` and `promoters()`.         
You can also look at transcripts grouped by the genes that they are associated with:

In [None]:
txby <- transcriptsBy(txdb, by="gene")

In [None]:
txby

Similar functions include `exonsBy()`, `cdsBy()`, `intronsByTranscript()`, `fiveUTRsByTranscript()`, and `threeUTRsByTranscript()`. 

We can also use the `seqlevelsStyle` function (exported from `GenomeInfoDb`) to get the current seqlevels style of an object and to rename its seqlevels according to a given style.

In [None]:
seqlevelsStyle(txdb)
seqinfo(txdb)

We can convert to 'NCBI' style:

In [None]:
seqlevelsStyle(txdb) <- "NCBI"
seqinfo(txdb)

We can see what styles are supported using `genomeStyles`.

In [None]:
head(genomeStyles("Homo_sapiens"))

Let's convert back to `UCSC` format:

In [None]:
seqlevelsStyle(txdb) <- "UCSC"

You could restrict the object to a particular chromosome if you wanted to (running `seqlevels(txdb) <- seqlevels0(txdb)` afterwards restores the full set):

In [None]:
seqlevels(txdb) <- "chr15"

# BSgenome

BSgenome is one option if you want to use R to work with actual sequence data. BSgenome data packages are `Biostrings`-based genomes, meaning that they use the `Biostrings` package to organize the data and facilitate access (https://bioconductor.org/packages/release/bioc/html/Biostrings.html).

We can see which genomes are available:

In [None]:
head(available.genomes())

The `BSgenome.Hsapiens.UCSC.hg19` package provides the `Hsapiens` object. Once it is loaded, we can quickly confirm that `txdb` and the `Hsapiens` BSgenome use the same genome assembly (they are both hg19).

In [None]:
Hsapiens
txdb

We can extract the exon ranges from `txdb` grouped by transcript:

In [None]:
transcripts <- exonsBy(txdb, by="tx", use.names=TRUE)

Then we can extract the transcript sequences from the genome (we'll just use the first transcript to make it faster).

In [None]:
tx_seqs <- extractTranscriptSeqs(Hsapiens, transcripts[1])

Then we can look and see that we have a `DNAStringSet` as the output:

In [None]:
tx_seqs
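
A `DNAStringSet` behaves like a named vector of DNA sequences with sequence-aware helpers from `Biostrings`. Here is a small sketch on two invented sequences (not real transcripts) showing a few things you might do with `tx_seqs`.

```r
library(Biostrings)

# Two made-up sequences standing in for extracted transcripts
toy_seqs <- DNAStringSet(c(toy_tx1 = "ATGCAA", toy_tx2 = "GGGCAT"))

# Length of each sequence
width(toy_seqs)

# Reverse complement of each sequence
toy_rc <- reverseComplement(toy_seqs)
toy_rc

# Fraction of G or C bases per sequence
letterFrequency(toy_seqs, letters = "GC", as.prob = TRUE)
```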

# Ontology Analysis

Once we are at the step where we have genes that are differentially expressed, we can see if there is any enrichment in any functional gene groups. Two commonly used methods to look for enrichment are overrepresentation analysis (ORA) or gene set enrichment analysis (GSEA).          
- **Over Representation Analysis (ORA)** looks for functions or processes that are over-represented (= enriched) in an experimentally-derived gene list. The background used by default is all of the genes that have an annotation. This will find genes where the difference is large, but will not detect a situation where the difference is small but coordinated across a set of genes.      

- **Gene Set Enrichment (GSEA)** aggregates per-gene statistics across genes in a set. It takes a ranked list of genes and determines whether members of a gene set are randomly distributed throughout that list or if they are found primarily at the top or bottom of the list. GSEA will calculate an enrichment score based on whether a gene set is over-represented at the top or bottom of the list, estimate the significance of the enrichment, and adjust for multiple hypothesis testing.
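
To give a feel for how a GSEA-style enrichment score comes out of a ranked list, here is a toy running-sum calculation in base R. It is a simplified, unweighted sketch (every hit counts equally), so it will not reproduce clusterProfiler's numbers, and the gene names and scores are invented.

```r
# Invented ranked list, already sorted from most up to most down
toy_ranked <- c(g1 = 2.5, g2 = 1.8, g3 = 1.1, g4 = 0.4,
                g5 = -0.2, g6 = -0.9, g7 = -1.6, g8 = -2.3)

# Invented gene set, mostly sitting near the top of the list
toy_set <- c("g1", "g2", "g4")

# Walk down the list: step up on a hit, step down on a miss
in_set <- names(toy_ranked) %in% toy_set
step <- ifelse(in_set, 1 / sum(in_set), -1 / sum(!in_set))
running_sum <- cumsum(step)

# The enrichment score is the largest deviation of the running sum from zero
toy_es <- running_sum[which.max(abs(running_sum))]
toy_es
```

A set clustered at the top of the list gives a large positive score, one clustered at the bottom a large negative score, and a set scattered evenly keeps the running sum close to zero.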

There are many packages for running these types of analyses ([gage](https://www.bioconductor.org/packages/release/bioc/html/gage.html), [EnrichmentBrowser](https://www.bioconductor.org/packages/release/bioc/html/EnrichmentBrowser.html)) and many of them will use similar approaches to test for enrichment. We will use [clusterProfiler](https://www.bioconductor.org/packages/release/bioc/html/clusterProfiler.html).          

We will use [gene ontologies](http://geneontology.org/docs/ontology-documentation/) to organize the genes into groups based on their role in an organism. Gene Ontology loosely organizes genes into three hierarchical graphs that correspond to three large umbrella categories -- **Molecular Function, Cellular Component, and Biological Process**. You can read the formal descriptions of these categories in the documentation linked above. A quote from the documentation illustrates an example of how these categories are related:

```
In an example of GO annotation, the gene product “cytochrome c” can be described by the molecular function oxidoreductase activity, the biological process oxidative phosphorylation, and the cellular component mitochondrial matrix.
```

We can use our previously made `orgdb` object to run the enrichment analysis on `res`, the `results` object from the DESeq2 differential expression analysis of the `airway` data (dexamethasone-treated versus untreated).

We will use the functions `gseGO` and `enrichGO` from clusterProfiler.      

- `gseGO` is a GSEA method. It takes an ordered, ranked `geneList` as input and uses a Kolmogorov-Smirnov-like running-sum statistic to run Gene Set Enrichment Analysis (GSEA) [Subramanian et al. 2005](https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/16199517/). GSEA is useful in scenarios where the fold changes are subtle but modules of genes are regulated in a coordinated way.
- `enrichGO` is an ORA method. It takes a list of genes (which does not need to be ranked) and uses a one-sided Fisher's exact test, based on the hypergeometric distribution, to run the enrichment analysis [Boyle et al. 2004](https://academic.oup.com/bioinformatics/article/20/18/3710/202612).
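
The ORA test itself is small enough to do by hand. The sketch below uses invented counts to show that the hypergeometric upper tail and a one-sided Fisher's exact test give the same p-value for a single GO term. It illustrates the statistics only and is not how `enrichGO` is implemented internally.

```r
# Invented counts for one GO term
universe_size <- 1000  # genes detected in the experiment
term_size <- 50        # detected genes annotated to the term
de_size <- 100         # differentially expressed genes
overlap <- 15          # DE genes annotated to the term

# P(overlap >= 15) when drawing de_size genes at random from the universe
p_hyper <- phyper(overlap - 1, term_size, universe_size - term_size,
                  de_size, lower.tail = FALSE)

# The same question as a 2x2 table: rows are DE / not DE, columns are in term / not in term
toy_table <- matrix(
    c(overlap, term_size - overlap,
      de_size - overlap, universe_size - term_size - de_size + overlap),
    nrow = 2)
p_fisher <- fisher.test(toy_table, alternative = "greater")$p.value

c(hypergeometric = p_hyper, fisher = p_fisher)
```

Changing `universe_size` while keeping the other counts fixed is a quick way to see why the choice of background matters for ORA.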


You might need to reload some packages at this point:

In [1]:
suppressPackageStartupMessages(library("tidyverse"))
suppressPackageStartupMessages(library("ggplot2"))
suppressPackageStartupMessages(library("BiocManager"))
suppressPackageStartupMessages(library("gridExtra"))
suppressPackageStartupMessages(library("ggrepel"))
suppressPackageStartupMessages(library("airway"))
suppressPackageStartupMessages(library("AnnotationHub"))
suppressPackageStartupMessages(library("clusterProfiler"))
suppressPackageStartupMessages(library("enrichplot"))
suppressPackageStartupMessages(library("biomaRt"))
suppressPackageStartupMessages(library("DESeq2"))
suppressPackageStartupMessages(library("dplyr"))
suppressPackageStartupMessages(library("tidyr"))
suppressPackageStartupMessages(library("GenomicRanges"))
suppressPackageStartupMessages(library("GenomicFeatures"))
res <- readRDS("res.rds")
ah <- AnnotationHub()
orgdb <- AnnotationHub::query(ah, pattern = c("Homo sapiens", "OrgDb"))[[1]]

snapshotDate(): 2020-10-27

downloading 1 resources

retrieving 1 resource

loading from cache



Let's make sure we don't have any NA entries in the `res` object:

In [2]:
res <- tidyr::drop_na(data.frame(res))

We'll run `gseGO` first. This is a GSEA method and it needs a ranked gene list as input. Let's make that list now. First, get the log2FoldChange -- this is what we will use to rank the genes.

In [3]:
gene_list <- res$log2FoldChange

Now we add names to the gene list:

In [4]:
names(gene_list) <- c(rownames(res))

Then we can sort the gene list

In [5]:
gene_list <- sort(gene_list, decreasing = TRUE)
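
The steps above (drop missing values, name the vector, sort it) can be sketched on a toy table in base R. The row names and values are invented. The last line computes the proportion of tied values, which is worth a glance because ranking-based methods have to break ties somehow.

```r
# Invented results table with one NA and one tied fold change
toy_res <- data.frame(
    log2FoldChange = c(0.5, -2.1, 3.0, NA, 0.5),
    row.names = c("toy_A", "toy_B", "toy_C", "toy_D", "toy_E"))

# Drop rows with missing values
toy_res <- toy_res[complete.cases(toy_res), , drop = FALSE]

# Named numeric vector, sorted from largest to smallest
toy_list <- setNames(toy_res$log2FoldChange, rownames(toy_res))
toy_list <- sort(toy_list, decreasing = TRUE)
toy_list

# Proportion of entries whose value also appears earlier in the list
tie_fraction <- mean(duplicated(toy_list))
tie_fraction
```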

Then we can run gseGO. We are setting a seed and using the `seed = TRUE` argument because we want gseGO to deal with ties consistently -- otherwise we might get different results every time we run the analysis, since gseGO will arbitrarily break ties in the rankings. The ties shouldn't present a huge issue as long as the percentage of ties in your data is low.

In [None]:
set.seed(42)
gsea_out <- gseGO(
    geneList = gene_list,
    OrgDb = orgdb,
    ont = 'ALL',
    keyType = 'ENSEMBL',
    seed = TRUE)

- By using `keyType = 'ENSEMBL'` we are telling the function that our gene IDs are in `ENSEMBL` format, and by setting `ont = 'ALL'` we are indicating we want to look at all three of the ontologies -- `Biological Process`, `Cellular Component`, and `Molecular Function`. Run `?gseGO` for a full account of the function and its arguments.

- There are many options for visualizing the enrichment; you can see more details [here](http://yulab-smu.top/clusterProfiler-book/chapter12.html) -- let's start with a dotplot:

In [None]:
dotplot(gsea_out)

The size of the dot indicates how many members of the group are represented in the enrichment, and the adjusted p-value is the Benjamini-Hochberg corrected p-value. `GeneRatio` is `k/n`: for a given category (e.g. 'receptor regulator activity'), `k` is the number of genes in `gene_list` that are annotated to that category in the OrgDb, and `n` is the number of genes in `gene_list` that have any annotation in the OrgDb.
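
As a worked example of the `k/n` definition above, here is the arithmetic on a toy gene list and a toy annotation. The IDs and term names are invented; this only illustrates the ratio as described in this tutorial, not the plotting code.

```r
# Invented gene list and invented term-to-gene annotation
toy_genes <- c("g1", "g2", "g3", "g4", "g5", "g6")
toy_annotation <- list(
    toy_term_A = c("g1", "g2", "g9"),
    toy_term_B = c("g3", "g10"))

# n: genes in the list that have any annotation at all (g1, g2, g3)
n <- sum(toy_genes %in% unique(unlist(toy_annotation)))

# k: genes in the list annotated to the category of interest (g1, g2)
k <- sum(toy_genes %in% toy_annotation$toy_term_A)

gene_ratio <- k / n
gene_ratio
```

Note that `g4`, `g5`, and `g6` have no annotation here, so they do not count toward `n`.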

We can also use `enrichGO`, which takes a list of genes that are not ranked. We will separate out the up and down regulated genes from `res` first.

In [None]:
up_genes <- data.frame(res) %>% dplyr::filter(padj < 0.1 & log2FoldChange > 0)
down_genes <- data.frame(res) %>% dplyr::filter(padj < 0.1 & log2FoldChange < 0)

Then we can run `enrichGO` on the up and down regulated genes and make dotplots.

In [None]:
up_ego <- enrichGO(gene = rownames(up_genes),
          keyType = 'ENSEMBL',
          ont = 'BP',
          universe = rownames(res),
          OrgDb = orgdb,
          readable = TRUE)
dotplot(up_ego, showCategory = 5) + ggtitle('Up regulated in dexamethasone')

In [None]:
down_ego <- enrichGO(gene = rownames(down_genes),
          keyType = 'ENSEMBL',
          ont = 'BP',
          universe = rownames(res),
          OrgDb = orgdb,
          readable = TRUE)
dotplot(down_ego, showCategory = 5) + ggtitle('Down regulated in dexamethasone')

Note that in each of the calls to `enrichGO` above, I have specified the `universe` argument so that we are taking into consideration which genes were actually detected in our experiment. We also used the `ont = 'BP'` argument to tell `enrichGO` that we want to look at genes in the Biological Process category, and we set `showCategory = 5` in the call to `dotplot` to tell it to only show us the top 5 categories. The enrichment of `response to peptide hormone` in the up regulated genes makes sense, as dexamethasone is a corticosteroid hormone.

In this example, we are running `enrichGO` on the up and down regulated genes separately, but it is also valid to run all of the differentially expressed genes together, depending on your research question (see https://royalsocietypublishing.org/doi/10.1098/rsif.2013.0950).

<div class="alert alert-block alert-success"><b>Exercise:</b> Try running `enrichGO` without setting the `universe` argument. How does this change your results? </div>

<div class="alert alert-block alert-success"><b>Exercise:</b> Try running `enrichGO` on all of the differentially expressed genes without pre-splitting into up and down regulated genes. How does this change your results? </div>