analysis/genomic_ranges.Rmd

---
title: "Using the Bioconductor GenomicsRanges package"
date: "`r Sys.Date()`"
output:
  workflowr::wflow_html:
    toc: true
editor_options:
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

From the [introductory article](https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.html) for GenomicRanges:

> The GenomicRanges package serves as the foundation for representing genomic locations within the Bioconductor project. In the Bioconductor package hierarchy, it builds upon the IRanges (infrastructure) package and provides support for the BSgenome (infrastructure), Rsamtools (I/O), ShortRead (I/O & QA), rtracklayer (I/O), GenomicFeatures (infrastructure), GenomicAlignments (sequence reads), VariantAnnotation (called variants), and many other Bioconductor packages.
>
> This package lays a foundation for genomic analysis by introducing three classes (GRanges, GPos, and GRangesList), which are used to represent genomic ranges, genomic positions, and groups of genomic ranges. This vignette focuses on the GRanges and GRangesList classes and their associated methods.

## Installation

To being, install [GenomicRanges](https://www.bioconductor.org/packages/release/bioc/html/GenomicRanges.html).

```{r install_fgsea, message=FALSE, warning=FALSE}
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

if (!require("GenomicRanges", quietly = TRUE))
  BiocManager::install("GenomicRanges")

library(GenomicRanges)
```

## Getting started

The introduction article starts with creating a `GRanges` object:

> The GRanges class represents a collection of genomic features that each have
> a single start and end location on the genome. This includes features such as
> contiguous binding sites, transcripts, and exons. These objects can be
> created by using the GRanges constructor function.

```{r gr}
gr <- GRanges(
  seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
  ranges = IRanges(101:110, end = 111:120, names = letters[1:10]),
  strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
  score = 1:10,
  GC = seq(1, 0, length=10)
)
 
class(gr)
```

The `GRanges()` function was used to specify a list of sequence names, their
ranges, strand, score, and GC content. For the `seqnames` and `strand`, the
`Rle()` function was used; RLE is an abbreviation for [run-length
encoding](https://en.wikipedia.org/wiki/Run-length_encoding), which is a way of
representing and compressing data. The function saves us from typing `c("chr1",
"chr2", "chr2", "chr2", "chr1", "chr1", "chr3", "chr3", "chr3", "chr3")`, which
would also work.

The `IRanges()` function was used to create a vector representation of sequence
ranges; 10 ranges were created and named using the first ten letters of the
alphabet.

`Rle()` was also used to indicate the strandedness of the ranges.

Metadata can also be added to a `GRanges` object and in the example, a score
and the GC content were included.

The genomic coordinates (seqnames, ranges, and strand) are displayed on the left-hand side and the metadata columns (annotations) are displayed on the right-hand side.

```{r gr_display}
gr
```

Each component of the genomic coordinates can be extracted using the accessor functions that have the same name as the column names. For example to retrieve the ranges we use the `ranges()` function.

```{r ranges}
ranges(gr)
```

Metadata is extracted using the `mcols()` function.

```{r mcols}
mcols(gr)
```

## BED to GRanges

The [Browser Extensible
Data](https://genome.ucsc.edu/FAQ/FAQformat.html#format1) (BED) format is quite
ubiquitous in bioinformatics, so it is useful to know how a `GRanges` object
can be created from a BED file.

We first create a data frame from a small BED file hosted on my web server and then we create a `GRanges` object using the data frame. Since BED is 0-based, we add one to make it 1-based.

```{r bed_to_granges}
data <- read.table(
  "https://davetang.org/file/granges.bed",
  col.names = c('chr','start','end','id','score','strand')
)
 
bed <- with(data, GRanges(chr, IRanges(start+1L, end), strand, score, refseq=id))
bed
```

The `elementMetadata` function can be used to fetch metadata.

```{r bed_metadata}
elementMetadata(bed)
```

The `with()` function used to create the `GRanges` object is nice because it
[saves us some
typing](https://stackoverflow.com/questions/42283479/when-to-use-with-function-and-why-is-it-good); we can directly refer to the column names.

```{r with}
bed <- with(
  data,
  GRanges(
    chr,
    IRanges(start+1, end),
    strand,
    score,
    refseq=id
  )
)
 
bed2 <- GRanges(
  data$chr,
  IRanges(data$start+1L, data$end),
  data$strand,
  score = data$score,
  refseq = data$id
)
 
identical(bed, bed2)
```

Let's load another BED file to demonstrate how we can [intersect
ranges](https://stackoverflow.com/questions/5200946/coverage-by-intersecting-smaller-genomic-interval-data-over-larger-genomic-inter).

```{r bed_to_granges2}
data2 <- read.table(
  "https://davetang.org/file/granges2.bed",
  col.names = c('chr','start','end','id','score','strand')
)
 
bed2 <- with(data2, GRanges(chr, IRanges(start+1, end), strand, score, refseq = id))
bed2
```

We can use the `intersect` function to perform an intersect on the two `GRanges` objects; note that `ignore.strand=FALSE` is the default, which means the strand information is taken into account (but not illustrated in my example).

```{r intersect}
GenomicRanges::intersect(bed, bed2)
```

Note that the metadata from both BED files is lost. Thanks to a [suggestion by
Daniel](https://davetang.org/muse/2013/01/02/iranges-and-genomicranges/#comment-2900),
we can use the `subsetByOverlaps()` function to find overlapping genomic ranges and
returns results that retain the metadata. This function returns an additional result
because when considering the `refseq` metadata, there are two unique ranges.

```{r subset_by_overlaps}
subsetByOverlaps(bed, bed2)
```

## Matching the overlaps

In the previous example, `subsetByOverlaps()` does not indicate which features overlapped, only that there is an overlap. In this section, we will demonstrate the use of the `GenomicRanges` functions `countOverlaps()`, `findOverlaps()`, `queryHits()`, and `subjectHits()`.

```{r create_examples}
my_ranges <- GRanges(
  seqnames = rep('chr1',6),
  ranges = IRanges(
    start = c(66999824,33546713,48998526,16767166,16767166,8384389),
    end = c(67210768,33585995,50489626,16786584,16786584,8404227)
  ),
  strand = c('+','+','-','+','+','+'),
  score = rep(0,6),
  refseq = c('NM_032291','NM_052998','NM_032785','NM_001145278','NM_001145277','NM_001080397')
)
 
my_snps <- GRanges(
  seqnames = rep('chr1',3),
  ranges = IRanges(
    start = c(66999900,33546793,8384389),
    end = c(66999901,33546794,8384390)
  ),
  strand = c('+','+','+'),
  id = c('snp1','snp2','snp3'),
  score = rep(0,3)
)
```

The function `countOverlaps()` counts the overlaps with respect to the first `GRanges` object.

```{r count_overlaps}
countOverlaps(my_ranges, my_snps)
```

The function `findOverlaps()` returns the matching indexes between two `GRanges` objects; a subject and a query.

```{r find_overlaps}
my_overlaps <- findOverlaps(
  subject = my_ranges,
  query = my_snps
)
my_overlaps
```

Finally, we create a data frame that displays the matches by their metadata; the functions `queryHits()` and `subjectHits()` are used to retrieve the match indexes from the `findOverlaps()` result and these indexes are used to retrieve the metadata. This process is similar to using the base R function `match()`.

```{r data_frame_match}
my_query <- queryHits(my_overlaps)
my_subject <- subjectHits(my_overlaps)
 
data.frame(
  subject = my_ranges[my_subject]$refseq,
  query = my_snps[my_query]$id
)
```

## Further reading

* [subsetByOverlaps to keep info from both GRanges objects?](https://stat.ethz.ch/pipermail/bioconductor/2013-August/054504.html)