Permalink
Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
213 lines (162 sloc) 6.74 KB
layout title
page
Visualizing NGS data
library(knitr)
opts_chunk$set(fig.path=paste0("figure/", sub("(.*).Rmd","\\1",basename(knitr:::knit_concord$get('infile'))), "-"))
suppressWarnings({
suppressPackageStartupMessages({
library(pasillaBamSubset)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
library(org.Dm.eg.db)
})
})

We will use the following libraries to demonstrate visualization of NGS data.

#biocLite("pasillaBamSubset")
#biocLite("TxDb.Dmelanogaster.UCSC.dm3.ensGene")
library(pasillaBamSubset)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
fl1 <- untreated1_chr4()
fl2 <- untreated3_chr4()

We will try four ways to look at NGS coverage: using the standalone Java program IGV, using simple plot commands, and using the Gviz and ggbio packages in Bioconductor.

IGV

Copy these files from the R library directory to the current working directory. First set the working directory to the source file location. We need to use the Rsamtools library to index the BAM files for using IGV.

file.copy(from=fl1,to=basename(fl1))
file.copy(from=fl2,to=basename(fl2))
library(Rsamtools)
indexBam(basename(fl1))
indexBam(basename(fl2))

IGV is freely available for download here: https://www.broadinstitute.org/igv/home

You will need to provide an email, and then you will get a download link.

Using IGV, look for gene lgs.

Note that if you have trouble downloading IGV, another option for visualization is the UCSC Genome Browser: http://genome.ucsc.edu/cgi-bin/hgTracks

The UCSC Genome Browser is a great resource, having many tracks involving gene annotations, conservation over multiple species, and the ENCODE epigenetic tracks already available. However, the UCSC Genome Browser requires that you upload your genomic files to their server, or put your data on a publicly available server. This is not always possible if you are working with confidential data.

Simple plot

Next we will look at the same gene using the simple plot function in R.

library(GenomicRanges)

Note: if you are using Bioconductor version 14, paired with R 3.1, you should also load this library. You do not need to load this library, and it will not be available to you, if you are using Bioconductor version 13, paired with R 3.0.x.

library(GenomicAlignments)

We read in the alignments from the file fl1. Then we use the coverage function to tally up the basepair coverage. We then extract the subset of coverage which overlaps our gene of interest, and convert this coverage from an RleList into a numeric vector. Remember from Week 2, that Rle objects are compressed, such that repeating numbers are stored as a number and a length.

x <- readGAlignments(fl1)
xcov <- coverage(x)
z <- GRanges("chr4",IRanges(456500,466000))
# Bioconductor 2.14
xcov[z]
# Bioconductor 2.13
xcov$chr4[ranges(z)]
xnum <- as.numeric(xcov$chr4[ranges(z)])
plot(xnum)

We can do the same for another file:

y <- readGAlignmentPairs(fl2)
ycov <- coverage(y)
ynum <- as.numeric(ycov$chr4[ranges(z)])
plot(xnum, type="l", col="blue", lwd=2)
lines(ynum, col="red", lwd=2)

We can zoom in on a single exon:

plot(xnum, type="l", col="blue", lwd=2, xlim=c(6200,6600))
lines(ynum, col="red", lwd=2)

Extracting the gene of interest using the transcript database

Suppose we are interested in visualizing the gene lgs. We can extract it from the transcript database TxDb.Dmelanogaster.UCSC.dm3.ensGene on Bioconductor, but first we need to look up the Ensembl gene name. We will use the functions that we learned in Week 7 to find the name.

# biocLite("biomaRt")
#library(biomaRt)
#m <- useMart("ensembl", dataset = "dmelanogaster_gene_ensembl")
#lf <- listFilters(m)
#lf[grep("name", lf$description, ignore.case=TRUE),]
#map <- getBM(mart = m,
#  attributes = c("ensembl_gene_id", "flybasename_gene"),
#  filters = "flybasename_gene", 
#  values = "lgs")
#map
library(org.Dm.eg.db)
map = select(org.Dm.eg.db, keys="lgs", keytype="SYMBOL", columns=c("ENTREZID", "ENSEMBL"))
map

Now we extract the exons for each gene, and then the exons for the gene lgs.

library(GenomicFeatures)
grl <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, by="gene")
gene <- grl[[map[1,]$ENSEMBL]]
gene

Finally we can plot these ranges to see what it looks like:

rg <- range(gene)
plot(c(start(rg), end(rg)), c(0,0), type="n", xlab=seqnames(gene)[1], ylab="")
arrows(start(gene),rep(0,length(gene)),
       end(gene),rep(0,length(gene)),
       lwd=3, length=.1)

But actually, the gene is on the minus strand. We should add a line which corrects for minus strand genes:

plot(c(start(rg), end(rg)), c(0,0), type="n", xlab=seqnames(gene)[1], ylab="")
arrows(start(gene),rep(0,length(gene)),
       end(gene),rep(0,length(gene)),
       lwd=3, length=.1, 
       code=ifelse(as.character(strand(gene)[1]) == "+", 2, 1))

Gviz

We will briefly show two packages for visualizing genomic data in Bioconductor. Note that each of these have extensive vignettes for plotting many kinds of data. We will show here how to make the coverage plots as before:

#biocLite("Gviz")
library(Gviz)
gtrack <- GenomeAxisTrack()
atrack <- AnnotationTrack(gene, name = "Gene Model")
plotTracks(list(gtrack, atrack))

Extract the coverage. Gviz expects that data will be provided as GRanges objects, so we convert the RleList coverage to a GRanges object:

xgr <- as(xcov, "GRanges")
ygr <- as(ycov, "GRanges")
dtrack1 <- DataTrack(xgr[xgr %over% z], name = "sample 1")
dtrack2 <- DataTrack(ygr[ygr %over% z], name = "sample 2")
plotTracks(list(gtrack, atrack, dtrack1, dtrack2))
plotTracks(list(gtrack, atrack, dtrack1, dtrack2), type="polygon")

ggbio

#biocLite("ggbio")
library(ggbio)
autoplot(gene)
autoplot(fl1, which=z)
autoplot(fl2, which=z)

Footnotes