-
Notifications
You must be signed in to change notification settings - Fork 1
/
genomic_ranges.Rmd
230 lines (176 loc) · 7.83 KB
/
genomic_ranges.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
---
title: "Using the Bioconductor GenomicsRanges package"
date: "`r Sys.Date()`"
output:
workflowr::wflow_html:
toc: true
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
From the [introductory article](https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.html) for GenomicRanges:
> The GenomicRanges package serves as the foundation for representing genomic locations within the Bioconductor project. In the Bioconductor package hierarchy, it builds upon the IRanges (infrastructure) package and provides support for the BSgenome (infrastructure), Rsamtools (I/O), ShortRead (I/O & QA), rtracklayer (I/O), GenomicFeatures (infrastructure), GenomicAlignments (sequence reads), VariantAnnotation (called variants), and many other Bioconductor packages.
>
> This package lays a foundation for genomic analysis by introducing three classes (GRanges, GPos, and GRangesList), which are used to represent genomic ranges, genomic positions, and groups of genomic ranges. This vignette focuses on the GRanges and GRangesList classes and their associated methods.
## Installation
To being, install [GenomicRanges](https://www.bioconductor.org/packages/release/bioc/html/GenomicRanges.html).
```{r install_fgsea, message=FALSE, warning=FALSE}
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!require("GenomicRanges", quietly = TRUE))
BiocManager::install("GenomicRanges")
library(GenomicRanges)
```
## Getting started
The introduction article starts with creating a `GRanges` object:
> The GRanges class represents a collection of genomic features that each have
> a single start and end location on the genome. This includes features such as
> contiguous binding sites, transcripts, and exons. These objects can be
> created by using the GRanges constructor function.
```{r gr}
gr <- GRanges(
seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
ranges = IRanges(101:110, end = 111:120, names = letters[1:10]),
strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
score = 1:10,
GC = seq(1, 0, length=10)
)
class(gr)
```
The `GRanges()` function was used to specify a list of sequence names, their
ranges, strand, score, and GC content. For the `seqnames` and `strand`, the
`Rle()` function was used; RLE is an abbreviation for [run-length
encoding](https://en.wikipedia.org/wiki/Run-length_encoding), which is a way of
representing and compressing data. The function saves us from typing `c("chr1",
"chr2", "chr2", "chr2", "chr1", "chr1", "chr3", "chr3", "chr3", "chr3")`, which
would also work.
The `IRanges()` function was used to create a vector representation of sequence
ranges; 10 ranges were created and named using the first ten letters of the
alphabet.
`Rle()` was also used to indicate the strandedness of the ranges.
Metadata can also be added to a `GRanges` object and in the example, a score
and the GC content were included.
The genomic coordinates (seqnames, ranges, and strand) are displayed on the left-hand side and the metadata columns (annotations) are displayed on the right-hand side.
```{r gr_display}
gr
```
Each component of the genomic coordinates can be extracted using the accessor functions that have the same name as the column names. For example to retrieve the ranges we use the `ranges()` function.
```{r ranges}
ranges(gr)
```
Metadata is extracted using the `mcols()` function.
```{r mcols}
mcols(gr)
```
## BED to GRanges
The [Browser Extensible
Data](https://genome.ucsc.edu/FAQ/FAQformat.html#format1) (BED) format is quite
ubiquitous in bioinformatics, so it is useful to know how a `GRanges` object
can be created from a BED file.
We first create a data frame from a small BED file hosted on my web server and then we create a `GRanges` object using the data frame. Since BED is 0-based, we add one to make it 1-based.
```{r bed_to_granges}
data <- read.table(
"https://davetang.org/file/granges.bed",
col.names = c('chr','start','end','id','score','strand')
)
bed <- with(data, GRanges(chr, IRanges(start+1L, end), strand, score, refseq=id))
bed
```
The `elementMetadata` function can be used to fetch metadata.
```{r bed_metadata}
elementMetadata(bed)
```
The `with()` function used to create the `GRanges` object is nice because it
[saves us some
typing](https://stackoverflow.com/questions/42283479/when-to-use-with-function-and-why-is-it-good); we can directly refer to the column names.
```{r with}
bed <- with(
data,
GRanges(
chr,
IRanges(start+1, end),
strand,
score,
refseq=id
)
)
bed2 <- GRanges(
data$chr,
IRanges(data$start+1L, data$end),
data$strand,
score = data$score,
refseq = data$id
)
identical(bed, bed2)
```
Let's load another BED file to demonstrate how we can [intersect
ranges](https://stackoverflow.com/questions/5200946/coverage-by-intersecting-smaller-genomic-interval-data-over-larger-genomic-inter).
```{r bed_to_granges2}
data2 <- read.table(
"https://davetang.org/file/granges2.bed",
col.names = c('chr','start','end','id','score','strand')
)
bed2 <- with(data2, GRanges(chr, IRanges(start+1, end), strand, score, refseq = id))
bed2
```
We can use the `intersect` function to perform an intersect on the two `GRanges` objects; note that `ignore.strand=FALSE` is the default, which means the strand information is taken into account (but not illustrated in my example).
```{r intersect}
GenomicRanges::intersect(bed, bed2)
```
Note that the metadata from both BED files is lost. Thanks to a [suggestion by
Daniel](https://davetang.org/muse/2013/01/02/iranges-and-genomicranges/#comment-2900),
we can use the `subsetByOverlaps()` function to find overlapping genomic ranges and
returns results that retain the metadata. This function returns an additional result
because when considering the `refseq` metadata, there are two unique ranges.
```{r subset_by_overlaps}
subsetByOverlaps(bed, bed2)
```
## Matching the overlaps
In the previous example, `subsetByOverlaps()` does not indicate which features overlapped, only that there is an overlap. In this section, we will demonstrate the use of the `GenomicRanges` functions `countOverlaps()`, `findOverlaps()`, `queryHits()`, and `subjectHits()`.
```{r create_examples}
my_ranges <- GRanges(
seqnames = rep('chr1',6),
ranges = IRanges(
start = c(66999824,33546713,48998526,16767166,16767166,8384389),
end = c(67210768,33585995,50489626,16786584,16786584,8404227)
),
strand = c('+','+','-','+','+','+'),
score = rep(0,6),
refseq = c('NM_032291','NM_052998','NM_032785','NM_001145278','NM_001145277','NM_001080397')
)
my_snps <- GRanges(
seqnames = rep('chr1',3),
ranges = IRanges(
start = c(66999900,33546793,8384389),
end = c(66999901,33546794,8384390)
),
strand = c('+','+','+'),
id = c('snp1','snp2','snp3'),
score = rep(0,3)
)
```
The function `countOverlaps()` counts the overlaps with respect to the first `GRanges` object.
```{r count_overlaps}
countOverlaps(my_ranges, my_snps)
```
The function `findOverlaps()` returns the matching indexes between two `GRanges` objects; a subject and a query.
```{r find_overlaps}
my_overlaps <- findOverlaps(
subject = my_ranges,
query = my_snps
)
my_overlaps
```
Finally, we create a data frame that displays the matches by their metadata; the functions `queryHits()` and `subjectHits()` are used to retrieve the match indexes from the `findOverlaps()` result and these indexes are used to retrieve the metadata. This process is similar to using the base R function `match()`.
```{r data_frame_match}
my_query <- queryHits(my_overlaps)
my_subject <- subjectHits(my_overlaps)
data.frame(
subject = my_ranges[my_subject]$refseq,
query = my_snps[my_query]$id
)
```
## Further reading
* [subsetByOverlaps to keep info from both GRanges objects?](https://stat.ethz.ch/pipermail/bioconductor/2013-August/054504.html)