# Gene Expression Analysis: Annotations and Ontology Analysis

### I. Overview and Objectives
In this prelab, we will continue to think about analysis of gene expression on microarrays, covering two additional topics.

* Annotations: libraries, how to load and attach them, outputting content to files
* Ontology Analysis: question, statistics, key ideas, resources, tools


### Annotating your results: Back to your pipeline

After you've performed your analysis, you might have been thinking: These Ensembl IDs are interesting, but I don't speak *Ensembl ID*. Turns out not many do. You'd really like to know what gene names, positions, protein domains, and more are attached to your top hits.

There are a *lot* of different kinds of annotations that one can associate with transcripts. It also turns out there is an R package, *biomaRt*, that lets you obtain a great deal of information from [Ensembl](https://useast.ensembl.org/index.html), which is the __go-to__ place to obtain these sorts of data.

You can even go to Ensembl directly to look up annotations: https://useast.ensembl.org/info/data/biomart/how_to_use_biomart.html

This package is available from Bioconductor - so if you wanted to use it for your own purposes on your own computer, you'd need to set that up for yourself (see https://www.bioconductor.org/install/) 
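
On your own machine, a quick check like the one below can tell you whether biomaRt is already available before you try to load it. This is only a minimal sketch: it reports what it finds and prints the usual Bioconductor install commands rather than running them, and the variable name `has_biomart` is just for illustration.

```r
# Check whether biomaRt is installed, without attaching it.
has_biomart <- requireNamespace("biomaRt", quietly = TRUE)

if (has_biomart) {
  message("biomaRt is installed; library(biomaRt) should work.")
} else {
  # Printed rather than executed, so nothing gets installed by surprise.
  message(
    "biomaRt was not found. On your own computer you could run:\n",
    "  install.packages(\"BiocManager\")\n",
    "  BiocManager::install(\"biomaRt\")"
  )
}
```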
    
However, in CoCalc, we have already loaded this package from Bioconductor for you!

So you only need to invoke the correct `library()` function to include this functionality in your analysis pipeline.

biomaRt can do a lot of things. Check out this [User Guide](https://www.bioconductor.org/packages/3.7/bioc/vignettes/biomaRt/inst/doc/biomaRt.html), which gives you a range of examples. We'll walk through one of those and give you some helper functions to 'trawl' through this information, then some additional functions to help connect those data to your association results.

Copy, paste, and execute the following code in the cell below:

    library(biomaRt)

### Task One: Obtaining the biomart of interest

The first thing we need to do is to identify the biomart of information that we want to access. We can list marts using

    listMarts()

Copy, paste, and execute the above code in the cell below:

In most cases, we'll be accessing Ensembl. But, we still need to find the organism we want to study. To get a list of those, we can use the function useMart(), followed by listDatasets():

    mymart <- useMart("ensembl")
    listDatasets(mymart)[,1]
        
This will give us a list of biomarts for all the organisms that have been curated by Ensembl, e.g. `hsapiens_gene_ensembl` for humans, or `mmusculus_gene_ensembl` for the mouse genome. Note the __VERSION__ of the database you are accessing: genome builds can change the interpretation and specific positions of things, so be mindful when making comparisons with other databases if the anchoring to a genome build could influence your results or interpretation.
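
The full dataset table is long, so it can help to filter it for the organism you care about and read off the version at the same time. The sketch below uses a small made-up table in place of the real `listDatasets(mymart)` output (which has the same three columns); the helper name `find_datasets` and the version strings are illustrative assumptions, not part of biomaRt. biomaRt also provides a `searchDatasets()` function that you may prefer for the same job.

```r
# Toy stand-in for listDatasets(mymart); the real table has the same three
# columns but many more rows. The version strings are illustrative only.
datasets <- data.frame(
  dataset = c("hsapiens_gene_ensembl", "mmusculus_gene_ensembl", "drerio_gene_ensembl"),
  description = c("Human genes", "Mouse genes", "Zebrafish genes"),
  version = c("GRCh38", "GRCm39", "GRCz11"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose dataset name or description
# matches a pattern (case-insensitive).
find_datasets <- function(datasets, pattern) {
  hit <- grepl(pattern, datasets$dataset, ignore.case = TRUE) |
    grepl(pattern, datasets$description, ignore.case = TRUE)
  datasets[hit, , drop = FALSE]
}

human <- find_datasets(datasets, "hsapiens")
print(human)  # the version column tells you which genome build you are using

# With a live connection you would pass the real table instead:
# human <- find_datasets(listDatasets(mymart), "hsapiens")
```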

Once we know the name of the dataset we want, we can pass it to useMart() to obtain that biomart specifically.

Execute the following code in the cell below:

    mymart <- useMart("ensembl", "hsapiens_gene_ensembl")

### Task Two: Obtaining Ensembl IDs to search for

In order to make use of your biomart, you will need to obtain a list of Ensembl IDs to 'look up' in this database. These can be found in the various objects that you create from your actual expression analysis. This is quite easily done with `select()`. For example, if the Ensembl IDs were contained in a column named `ensembl_gene_id` in a variable named `results`, you could try:

    mylist <- results %>% select(ensembl_gene_id)

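Alternatively, you can use dollar-sign notation to refer to the column name. Unlike `select()`, which returns a one-column table, this gives you a plain vector of IDs, which is the form `getBM()` documents for its `values` argument, e.g.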

    mylist <- results$ensembl_gene_id
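
To make the difference between the two approaches concrete, here is a small sketch using a made-up `results` table (the fold-change values are invented, and the variable names `as_table` and `as_vector` are just for illustration). It also drops repeated IDs, on the assumption that you only want to look each gene up once.

```r
library(dplyr)

# Toy stand-in for a results table from an expression analysis.
results <- tibble(
  ensembl_gene_id = c("ENSG00000141510", "ENSG00000012048", "ENSG00000141510"),
  logFC = c(1.8, -2.1, 1.8)
)

# select() keeps the table structure: this is still a one-column tibble.
as_table <- results %>% select(ensembl_gene_id)

# pull() extracts the column as a plain character vector,
# the same thing results$ensembl_gene_id gives you.
as_vector <- results %>% pull(ensembl_gene_id)

# Drop repeated IDs so each gene is only looked up once.
mylist <- unique(as_vector)

class(as_table)
class(mylist)
length(mylist)
```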

### Task Three: Obtaining annotations of interest

Next, we might want to obtain a complete listing of all the annotations that are available from the given mart. To do this, use the listAttributes() function.

Copy, paste, and execute the code in the cell below:

    attrs <- listAttributes(mymart)

Then access `attrs` in various ways, for example by looking at its first five rows:

    > attrs[1:5,]
    
                       name            description
    1       ensembl_gene_id        Ensembl Gene ID
    2 ensembl_transcript_id  Ensembl Transcript ID
    3    ensembl_peptide_id     Ensembl Protein ID
    4       ensembl_exon_id        Ensembl Exon ID
    5           description            Description
    
There are hundreds of pieces of information: the name gives you the key you want to access, the description is just the human-readable description for it.
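
With hundreds of rows, scrolling is not a great way to find a key. One option is to search the descriptions with `grepl()`. The sketch below runs on a small stand-in for `attrs`: the `name` values are real biomaRt keys, but the table itself and the exact description text are assumptions made for illustration. biomaRt also offers `searchAttributes()` if you would rather not do this by hand.

```r
# Toy stand-in for listAttributes(mymart); the real table is much longer.
attrs <- data.frame(
  name = c("ensembl_gene_id", "ensembl_transcript_id", "description",
           "hgnc_symbol", "chromosome_name"),
  description = c("Ensembl Gene ID", "Ensembl Transcript ID", "Description",
                  "HGNC symbol", "Chromosome/scaffold name"),
  stringsAsFactors = FALSE
)

# Keep the rows whose human-readable description mentions a symbol
# or a chromosome (case-insensitive).
wanted <- attrs[grepl("symbol|chromosome", attrs$description, ignore.case = TRUE), ]
print(wanted)

# The name column holds the keys you would hand to getBM().
keys <- wanted$name
```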

Once we have our list of keys, we can use another function, `getBM()`, to query Ensembl for the annotations we are interested in. This function takes 4 main arguments:

* attributes: a list of keys that we want to look up (use the `c()` function to make that list)
* filters: the key naming what kind of ID the query terms are (here, `ensembl_gene_id`)
* values: the list of query terms that we are looking for. In our case, this is our list of Ensembl gene IDs
* mart: the biomart to query

So for example: 

    lookup <- getBM(attributes=c('ensembl_gene_id','ensembl_exon_id'), filters='ensembl_gene_id', values=mylist, mart=mymart)
    
which would store our information into a variable called lookup.    
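
If gene names and positions are what you are after, the same kind of query can be wrapped in a small function. This is a sketch, not the course's official helper: `annotate_genes` is a hypothetical name, the chosen attributes are just one reasonable set, and `filters` is set on the assumption that your IDs are Ensembl gene IDs. Defining the function does not contact Ensembl; calling it does.

```r
# Hypothetical wrapper around getBM(): look up symbol and position
# for a vector of Ensembl gene IDs.
annotate_genes <- function(ids, mart) {
  biomaRt::getBM(
    attributes = c("ensembl_gene_id", "hgnc_symbol", "chromosome_name",
                   "start_position", "end_position"),
    filters = "ensembl_gene_id",  # the kind of ID that `values` contains
    values = unique(ids),         # query each gene only once
    mart = mart
  )
}

# Usage, once mymart and mylist exist and Ensembl is reachable:
# lookup <- annotate_genes(mylist, mymart)
# head(lookup)
```

The table that comes back is not guaranteed to be in the same order as your IDs, and IDs that Ensembl no longer recognizes may simply be absent, which is one more reason to join on the ID rather than rely on row order.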

### Task Four: Joining annotations with our results.

Now we have a set of annotations stored in a new table. But those values are not connected to our table of results! We need a way to *join* those tables together.

Of course, tidyverse gives us a great many tools with which to join two tables together, assuming there is a key shared between them. And (say) `ensembl_gene_id` might be a perfect one to use:

    results_annot <- results %>% left_join(lookup, by="ensembl_gene_id")

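With this join, the "key" that will be used to merge the tables is `ensembl_gene_id`. Both tables must have that column name, or else the join will give you a bad time! Also note that if `lookup` has several rows for one gene (e.g. one per exon), the joined table will repeat that gene's result row once per match.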

If something was named differently but you knew it was a key you wanted to do a lookup on, you could use `rename()` of course to rename the column so that the tables have matching columns for that key...
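
Here is a small sketch of both ways to deal with mismatched key names, followed by one way to write the annotated table out to a file. Everything in it is made up for illustration: the toy tables, the placeholder exon labels, and the names `annot_renamed`, `annot_mapped`, and `out_file`.

```r
library(dplyr)

# Toy results table whose key column is named differently from the lookup's.
results <- tibble(
  gene_id = c("ENSG00000141510", "ENSG00000012048"),
  logFC = c(1.8, -2.1)
)

# Toy lookup table with more than one row for the first gene.
lookup <- tibble(
  ensembl_gene_id = c("ENSG00000141510", "ENSG00000141510", "ENSG00000012048"),
  ensembl_exon_id = c("exon_A", "exon_B", "exon_C")  # placeholder labels
)

# Option 1: rename the key so both tables share the same column name.
annot_renamed <- results %>%
  rename(ensembl_gene_id = gene_id) %>%
  left_join(lookup, by = "ensembl_gene_id")

# Option 2: leave the names alone and tell the join how they correspond.
annot_mapped <- results %>%
  left_join(lookup, by = c("gene_id" = "ensembl_gene_id"))

# One row per gene-exon pair, so more rows than `results` had.
nrow(annot_renamed)

# Save the annotated table. tempfile() is used here so nothing in your
# project gets overwritten; swap in a real file name for your own work.
out_file <- tempfile(fileext = ".csv")
write.csv(annot_renamed, out_file, row.names = FALSE)
```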

### Ontology Analysis

For in-depth background on the scientific rationale, approaches, and issues, you can head to [Canvas](http://canvas.upenn.edu) and watch the pre-recorded lecture on the subject.

For class, we'll be using the tool [WebGestalt](http://www.webgestalt.org/), which offers a number of versatile analyses, background comparisons, and statistical controls. There are several other good tools out there as well.