# Table of Contents

- [1 PEB Belgrade - Bioconductor workshop](#PEB-Belgrade---Bioconductor-workshop)
    - [1.1 Requirements](#Requirements)
    - [1.2 Which libraries are we installing?](#Which-libraries-are-we-installing?)
- [2 The Annotation packages in Bioconductor](#The-Annotation-packages-in-Bioconductor)
- [3 The Homo.sapiens package](#The-Homo.sapiens-package)
    - [3.1 Gene symbols and IDs: the org.Hs.eg.db package](#Gene-symbols-and-IDs:-the-org.Hs.eg.db-package)
        - [3.1.1 Converting Entrez Ids to symbols](#Converting-Entrez-Ids-to-symbols)
    - [3.2 Getting gene coordinates: the TxDB packages](#Getting-gene-coordinates:-the-TxDB-packages)
- [4 Calculating Enrichment](#Calculating-Enrichment)
- [5 Annotation Hub](#Annotation-Hub)

# PEB Belgrade - Bioconductor workshop

Giovanni M. Dall'Olio, GSK. 10/09/2016. http://bioinfoblog.it

Welcome to the Bioconductor / Data Integration workshop.

This workshop is heavily inspired by the Coursera Bioconductor course. See here for materials: http://kasperdanielhansen.github.io/genbioconductor/


## Requirements

This workshop requires several Bioconductor libraries, which take a while to install.

Please start the installation by copying and pasting the commands below. We'll continue the lecture while they install:

```
# dplyr (from CRAN)
install.packages(c("dplyr"))

# Bioconductor: load the biocLite() installer, then install the packages
source("https://bioconductor.org/biocLite.R")
biocLite("Homo.sapiens")
biocLite("rtracklayer")
biocLite("AnnotationHub")
```
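
Once the installation finishes, it can be handy to confirm that everything is in place. The following is a minimal sketch using only base R; the `pkgs` vector repeats the package names from the commands above, and the variable names are illustrative.

```r
# Packages requested in the installation commands above
pkgs <- c("dplyr", "Homo.sapiens", "rtracklayer", "AnnotationHub")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Names of any packages that still need to be installed
missing_pkgs <- names(installed)[!installed]
if (length(missing_pkgs) > 0) {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
}
```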

## Which libraries are we installing?

- **Homo.sapiens**: wrapper that bundles several *H. sapiens*-related packages:
    - **TxDb**: coordinates for genes, transcripts, exons...
    - **org.Hs.eg.db**: gene symbols and identifiers
    - **GenomicRanges**: lets you work with gene coordinates
- **rtracklayer**: imports BED files and other formats
- **DOSE** and **clusterProfiler**: ontology enrichment (GO, Disease Ontology, Reactome)
- **AnnotationHub**: downloads data from UCSC and many other sources
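
To make the rtracklayer entry more concrete, here is a small sketch of importing a BED file. The file content is made up for illustration and written to a temporary file, so nothing here refers to real workshop data.

```r
library(rtracklayer)

# Write a tiny, made-up BED file (tab-separated: chrom, start, end, name, score, strand)
bed_path <- tempfile(fileext = ".bed")
writeLines(c("chr1\t100\t200\tfeatureA\t0\t+",
             "chr2\t500\t650\tfeatureB\t0\t-"), bed_path)

# import() infers the format from the file extension and returns a GRanges object
regions <- import(bed_path)
regions

# BED starts are 0-based while GRanges starts are 1-based, so 100 becomes 101
start(regions)
width(regions)
```

The coordinate shift is worth keeping in mind whenever BED intervals are compared with coordinates taken from other sources.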

# The Annotation packages in Bioconductor

Bioconductor contains several annotation packages (https://www.bioconductor.org/packages/release/data/annotation/).

Most of these are updated regularly and contain datasets from public sources for multiple organisms.

Some examples:

- **TxDb** objects: coordinates for genes, transcripts, exons...
- **BSgenome**: genome sequences
- **microarray ids** (e.g. hgu133): probe-to-gene conversions for Affymetrix and Illumina arrays
- **org.\*.eg.db**: gene symbol to ID conversion (Entrez, Ensembl, GO, ...)

In addition, two packages give access to large dataset repositories:

- **biomaRt**: any BioMart installation, e.g. Ensembl or HGNC (see http://www.biomart.org/)
- **AnnotationHub**: access to several resources, e.g. any track in the UCSC browser, and more

In this tutorial we will look at some of these (TxDb, org.\*.eg.db, AnnotationHub).

# The Homo.sapiens package

Let's load the Homo.sapiens package. Note that it loads several other packages along with it:

```
library(Homo.sapiens)
```

Today we are going to focus on two of them: org.Hs.eg.db and TxDb.

## Gene symbols and IDs: the org.Hs.eg.db package

The org.\*.eg.db packages let you retrieve gene symbols and IDs for a given species (see [list of all packages](https://www.bioconductor.org/packages/release/data/annotation/)). The data is updated with each Bioconductor release, i.e. roughly twice a year.

To see which data is included in this package, we can open its help page:
```
library(help=org.Hs.eg.db)
```

Each of these datasets is stored as a Bimap object. Type the following to get information on Bimaps:
```
?AnnDbBimap
```
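
As a rough illustration of how a Bimap can be used, the sketch below works with `org.Hs.egSYMBOL`, the Entrez-ID-to-symbol map from the same package. The variable names are illustrative, and the calls shown (`mappedkeys`, `as.list`, `mget`, `revmap`) are only a small part of the Bimap interface.

```r
library(org.Hs.eg.db)

# org.Hs.egSYMBOL is a Bimap from Entrez gene IDs to gene symbols
x <- org.Hs.egSYMBOL

# Entrez IDs that have at least one symbol assigned
mapped_ids <- mappedkeys(x)
head(mapped_ids)

# Subset the map by key, then convert it to an ordinary named list
as.list(x[c("1", "2")])

# revmap() flips the direction, so symbols become the keys
symbol_to_entrez <- mget("A1BG", revmap(x))
symbol_to_entrez
```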

### Converting Entrez Ids to symbols

The org.Hs.egGENENAME map contains mappings between Entrez IDs and full gene names (the corresponding map for gene symbols is org.Hs.egSYMBOL):

```
head(as.data.frame(org.Hs.egGENENAME))
```

| gene_id | gene_name |
|---|---|
| 1 | alpha-1-B glycoprotein |
| 2 | alpha-2-macroglobulin |
| 3 | alpha-2-macroglobulin pseudogene 1 |
| 9 | N-acetyltransferase 1 (arylamine N-acetyltransferase) |
| 10 | N-acetyltransferase 2 (arylamine N-acetyltransferase) |
| 11 | N-acetyltransferase pseudogene |
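
For converting a handful of Entrez IDs directly, the `mapIds()` and `select()` functions from AnnotationDbi offer an alternative to the Bimap objects. The sketch below uses three of the Entrez IDs from the listing above; `entrez_ids` and `symbols` are illustrative names. `select()` is called with its package prefix because dplyr, which this workshop also installs, exports a function of the same name.

```r
library(org.Hs.eg.db)

entrez_ids <- c("1", "2", "10")   # Entrez IDs taken from the listing above

# mapIds() returns a named character vector: names are the keys, values the symbols
symbols <- mapIds(org.Hs.eg.db, keys = entrez_ids,
                  column = "SYMBOL", keytype = "ENTREZID")
symbols

# select() returns a data.frame and can fetch several columns at once
AnnotationDbi::select(org.Hs.eg.db, keys = entrez_ids,
                      columns = c("SYMBOL", "GENENAME"), keytype = "ENTREZID")
```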


## Getting gene coordinates: the TxDB packages

# Calculating Enrichment
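
As a rough idea of what an over-representation ("enrichment") test computes, here is a minimal base R sketch. All counts are made up for illustration, and this is not the DOSE or clusterProfiler interface; it only shows the hypergeometric calculation that tests of this general kind are built on.

```r
# Made-up numbers for one annotation term
universe_size <- 20000  # genes in the background
term_size     <- 200    # background genes annotated to the term
list_size     <- 100    # genes in our list of interest
overlap       <- 5      # genes in our list annotated to the term

# Expected overlap if the list were a random draw from the background
expected <- list_size * term_size / universe_size

# P(overlap >= observed) under the hypergeometric distribution
p_hyper <- phyper(overlap - 1, term_size, universe_size - term_size,
                  list_size, lower.tail = FALSE)

# The same test expressed as a one-sided Fisher's exact test on a 2x2 table
contingency <- matrix(c(overlap, term_size - overlap,
                        list_size - overlap,
                        universe_size - term_size - list_size + overlap),
                      nrow = 2)
p_fisher <- fisher.test(contingency, alternative = "greater")$p.value

c(expected = expected, p_hyper = p_hyper, p_fisher = p_fisher)
```

In a real analysis the test is repeated for many terms, so the resulting p-values need a multiple-testing correction (for example with `p.adjust()`).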

# Annotation Hub

```
library(AnnotationHub)
ahub <- AnnotationHub()
ahub
```

```
Attaching package: ‘AnnotationHub’

The following object is masked from ‘package:Biobase’:

    cache

snapshotDate(): 2016-08-15

AnnotationHub with 44404 records
# snapshotDate(): 2016-08-15 
# $dataprovider: BroadInstitute, UCSC, Ensembl, EncodeDCC, NCBI, ftp://ftp.n...
# $species: Homo sapiens, Mus musculus, Bos taurus, Pan troglodytes, Danio r...
# $rdataclass: GRanges, BigWigFile, FaFile, OrgDb, TwoBitFile, ChainFile, In...
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
#   sourcetype 
# retrieve records with, e.g., 'object[["AH2"]]' 

            title                                                 
  AH2     | Ailuropoda_melanoleuca.ailMel1.69.dna.toplevel.fa     
  AH3     | Ailuropoda_melanoleuca.ailMel1.69.dna_rm.toplevel.fa  
  AH4     | Ailuropoda_melanoleuca.ailMel1.69.dna_sm.toplevel.fa  
  AH5     | Ailuropoda_melanoleuca.ailMel1.69.ncrna.fa            
  AH6     | Ailuropoda_melanoleuca.ailMel1.69.pep.all.fa          
  ...       ...                                                   
  AH51456 | Xiphophorus_maculatus.Xipmac4.4.2.cdna.all.2bit       
  AH51457 | Xiphophoru
```
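
With tens of thousands of records, the hub is usually narrowed down before anything is downloaded. The following sketch shows one possible way to do that with `subset()` and `query()`; the search terms are arbitrary examples, the variable names are illustrative, and the records returned depend on the hub snapshot in use.

```r
library(AnnotationHub)
ahub <- AnnotationHub()   # needs an internet connection on first use

# Metadata columns listed in the printed summary can be inspected with $
head(unique(ahub$dataprovider))

# subset() filters on metadata columns
human <- subset(ahub, species == "Homo sapiens")

# query() keeps records whose metadata match all of the given patterns
human_granges <- query(human, c("UCSC", "GRanges"))
human_granges

# Individual records are downloaded (and cached) with [[ ]], e.g.
# record <- human_granges[[names(human_granges)[1]]]
```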

```
library(Homo.sapiens)
genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
```

```
GRanges object with 23056 ranges and 1 metadata column:
        seqnames                 ranges strand |     gene_id
           <Rle>              <IRanges>  <Rle> | <character>
      1    chr19 [ 58858172,  58874214]      - |           1
     10     chr8 [ 18248755,  18258723]      + |          10
    100    chr20 [ 43248163,  43280376]      - |         100
   1000    chr18 [ 25530930,  25757445]      - |        1000
  10000     chr1 [243651535, 244006886]      - |       10000
    ...      ...                    ...    ... .         ...
   9991     chr9 [114979995, 115095944]      - |        9991
   9992    chr21 [ 35736323,  35743440]      + |        9992
   9993    chr22 [ 19023795,  19109967]      - |        9993
   9994     chr6 [ 90539619,  90584155]      + |        9994
   9997    chr22 [ 50961997,  50964905]      - |        9997
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome
```
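
The gene ranges returned by `genes()` can be combined with the org.Hs.eg.db lookups. The sketch below is one possible way to do this: it picks a gene by its Entrez ID, attaches gene symbols as an extra metadata column, and finds the genes overlapping a region. The chr19 window is an arbitrary example chosen around gene 1 from the output above, and the variable names are illustrative.

```r
library(Homo.sapiens)   # attaches the TxDb and org.Hs.eg.db packages
library(GenomicRanges)

gene_ranges <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)

# The ranges are named by Entrez ID, so a gene can be picked by its ID
a1bg <- gene_ranges["1"]
a1bg
width(a1bg)   # gene length in base pairs

# Attach gene symbols by looking up each Entrez ID in org.Hs.eg.db
gene_ranges$symbol <- unname(mapIds(org.Hs.eg.db, keys = gene_ranges$gene_id,
                                    column = "SYMBOL", keytype = "ENTREZID"))

# Genes overlapping an arbitrary example region on chr19
region <- GRanges("chr19", IRanges(start = 58850000, end = 58880000))
subsetByOverlaps(gene_ranges, region)
```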