Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
265 lines (209 sloc) 9.27 KB
layout title
A few quick examples of genome-scale data with Bioconductor
opts_chunk$set(fig.path=paste0("figure/", sub("(.*).Rmd","\\1",basename(knitr:::knit_concord$get('infile'))), "-"))

Motivation and techniques

The videos in "What we measure and why" provide schematic illustrations of the basic biological processes that can now be studied computationally. We noted that recipes for all the proteins that are fundamental to life processes of an organism are coded in the organism's genomic DNA. Studies of differences between organisms, and certain changes within organisms (for example, development of tumors), often rely on computations involving genomic DNA sequence.

Bioconductor provides tools for working directly with genomic DNA sequence for many organisms.
One basic approach uses computations on a "reference sequence", another focuses on differences between the genomic sequence of a given individual, and the reference.

Reference sequence access

It is very easy to use Bioconductor to work with the reference sequence for Homo sapiens. Here we'll have a look at chromosome 17.


Of note:

  • the sequence is provided through an R package
  • the name of the package indicates the curating source (UCSC) and reference version (hg19)
  • familiar R syntax $ for selecting a list element is reused to select a chromosome

Representing DNA variants

A standard representation for individual departures from reference sequence is Variant Call Format. The VariantAnnotation package includes an example. We have two high-level representations of some DNA variants -- a summary of the VCF content in the example, and the genomic addresses of the sequence variants themselves.

fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") 
vcf <- readVcf(fl, "hg19")

Of note:

  • the example data are "built-in" to the package, for illustration and testing
  • the variable vcf has a concise display to the user
  • the variant locations, extracted using rowRanges, are shown with a tag indicating their context in the hg19 reference build

Measures of gene expression: microarrays elucidate the transcriptional program of the cell cycle in yeast

A highly influential experiment (Spellman PT et al., Mol. Biol. Cell v9, 1998, PMID 9843569) undertook to use genome-wide measurement of mRNA abundance over a series of time points in the reproductive cycle of Sacchomyces cerevisiae, baker's yeast.
Again we use an R package to manage the data, and we use a special Bioconductor-defined data structure to provide access to information about the experiment and the results.


After a bit of massaging, a topic on which you will become expert in the next few weeks, we can visualize the time course of a cell-cycle regulated gene.

alp = spYCCES[, spYCCES$syncmeth=="alpha"]
with(pData(alp), {
  plot(exprs(alp)["YAR007C",]~time, ylab="YAR007C expression",
   xlab="minutes from alpha synchronization")
  lines(exprs(alp)["YAR007C",]~time, lty=2)

Of note:

  • Informative metadata about the experiment are bound right to the data (pubmed ID and abstract accessible through experimentData)
  • Simple syntax can be used to select components of complex experimental designs; in this case spYCCES[, spYCCES$syncmeth=="alpha"] picks out just the colonies whose cell cycling was controlled using alpha pheromone
  • R's plotting tools support general plot annotation and enhancement
  • Statistical modeling tools to help distinguish cycling and non-cycling genes can be used immediately

Measuring gene expression with RNA-seq

A 2013 paper of Zarnack and colleagues studies the role of the heterogeneous nuclear ribonucleoproteins C1 and C2 (coded for by gene HNRNPC) in the prevention of inclusion of cryptic Alu elements in spliced transcripts. One part of the study involves knockdown of HNRNPC in HeLa cells, followed by genome-wide observation of frequency of cryptic Alu element inclusion.

The Bioconductor package RNAseqData.HNRNPC.bam.chr14 includes an extract from the RNA-seq data generated in this study. There are 8 BAM files that have been filtered and indexed to include information on mRNA molecules mapped to the sequence of chromosome 14. The protocol creates "paired-end" reads. We can have a quick look at the short reads of mRNA sequence as follows:

bf1 = RNAseqData.HNRNPC.bam.chr14_BAMFILES[1] # first of 8 files in package
r1 = readGappedReads(bf1)

We can demonstrate the detectability of mRNA coded for by HNRNPC using the plotting package ggbio. We use our knowledge of the location of the gene to set up a scanning parameter to focus the data extract.

hnrnpc_param = 
   ScanBamParam(which=GRanges("chr14", IRanges(21.67e6,21.74e6)))
limr1 = readGappedReads(bf1, param=hnrnpc_param, use.names=TRUE)
autoplot(limr1, geom="line", stat="coverage") + ylim(0,900)

In exercises you will verify that the knockdown experiment succeeded.


You're about to engage with a few high-level lectures on genome structures and molecular biological techniques for measuring them. As you encounter these concepts, keep in mind what sorts of computations you consider relevant to understanding the structures and processes being studied. Find the tools to perform these computations in Bioconductor, and become expert in their use. And if you don't find them, let us know, and perhaps we can point them out, or, if they don't exist, build them together.