Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
417 lines (336 sloc) 13.6 KB
title author layout
Management of genome-scale data
Vince Carey
opts_chunk$set(fig.path=paste0("figure/", sub("(.*).Rmd","\\1",basename(knitr:::knit_concord$get('infile'))), "-"))

Introduction (revised July 2017)

It is very common for life scientists to use spreadsheets to record and manage experimental data. In these problems we will build up a discipline of data management for genome-scale experiments using R and Bioconductor.

It may seem unusual to recommend R-based software for managing data. Ordinarily R is used for programming and computing statistical analyses. However, Bioconductor has taken very seriously the need to carefully annotate, in a unified way, many aspects of complex biological experiments. This approach allows us to import assay outputs (such as multiple microarray scans, or multiple sequencing alignments), combining them with relevant metadata about the experiment and the samples assayed, saving all the information in a single R variable. By carefully designing the structure of this variable, programs can operate reliably on various aspects of the data. Ultimately a more efficient management and analysis environment is obtained.

We will come to understand this enhancement of efficiency by working from some very basic representations of experimental data, proceeding stepwise to more advanced integrative representations that are the hallmark of Bioconductor's approach to genome-scale data.

Illustrative exercises (worked through in video)

1.0 Distinct components from an experiment.

.oldls = ls()
.newls = ls()

The objects obtained from the data() command are

newstuff = setdiff(.newls, .oldls)

and they have classes

cls = lapply(newstuff, function(x) class(get(x)))
names(cls) = newstuff

It is easy to make an informative plot of hybridization date by ethnicity of sample source using standard R:

boxplot(date~factor(ethnicity), data=sampleInfo, ylab="chip date")

What if we want to look at expression of the gene BRCA2 by ethnicity? We have to do several things:

  • We need to determine which row of geneExpression corresponds to BRCA2, and to do this we need to know which column of geneAnnotation provides the information;
  • We need to ensure that the gene expression data in the matrix geneExpression are ordered consistently with the ordering of data in sampleInfo, and to do this we need to know which column of sampleInfo identifies the samples using the same tags as geneExpression.

Therefore, to relate gene expression to a sample characteristic we need to manipulate three different entities (expression, sample annotation, and gene annotation), ensuring that they are properly coordinated.

all.equal(colnames(geneExpression), sampleInfo$filename)
all.equal(rownames(geneExpression), geneAnnotation$PROBEID)
ind = which(geneAnnotation$SYMBOL=="BRCA2")
boxplot(geneExpression[ind,]~sampleInfo$ethnicity, ylab="BRCA2 by hgfocus")

It works, but it is complicated. To reduce complexity, Bioconductor defines ExpressionSet.

1.1 The ExpressionSet container.

There are high-level commands that simplify ExpressionSet construction.

es1 = ExpressionSet(geneExpression)
pData(es1) = sampleInfo
fData(es1) = geneAnnotation

When we mention es1 we get a compact representation of lots of information about the experiment. Use es1 to visualize the confounding of hybridization date with ethnicity:


Now you've been introduced to the ExpressionSet container design and you've even constructed an ExpressionSet instance. We'll use a more comprehensive representation of the data in the underlying experiment in the next task. We'll use


1.2 Metadata management. It is useful to have information about an experiment's intentions and methods ready-to-hand. Some of the arrays used for GSE5859 were produced for a 2004 paper with Pubmed ID 15269782. We can, if connected to the internet, import information about that paper into R.

p = pmid2MIAME("15269782")

Check the display of e in your session. Then Use the command experimentData(e) <- p. How has the display of e changed? What is the result of experimentData(e)?

ExpressionSet: Why? and How?

We have moved fairly quickly through the creation and use of an ExpressionSet instance to illustrate some of its virtues. We will now provide some details on why and how this structure is designed and used.

Motivations underlying the ExpressionSet design are many. A few of the most important are

  • to reduce complexity of programming with complex assay, annotation, and sample-level information,
  • to support simple and idiomatic filtering of features and samples,
  • to control memory consumption as assay data are passed to analytical functions.

So much for "why", to this point. "How" takes us for an excursion in software design strategy: object-oriented programming in R with the class system known as S4.

Object-oriented programming and S4 classes

Object-oriented programming (OOP) is a discipline of computer programming that helps to reduce program complexity and maintenance requirements, and to enhance program interoperability. There are various approaches to OOP, even within R, and we focus on the one known as S4. The basic idea is that an object

  • brings together data of different types in a unified structure,
  • is identified as a member of a formally specified class, and may be formally situated in a system of related objects,
  • "responds to" generic method calls in a class-specific way.

We can see the class structure for an ExpressionSet with getClass:


This shows that there are r length(getSlots(getClass("ExpressionSet"))) 'slots' comprising the class definition. Only the annotation slot is occupied by a simple R data type. All the other slots are to be occupied by instances of other S4 classes (or more basic R data types).

As an example, the MIAME class has the following definition:


So we see that S4 allows us to define new data structures that encapsulate diverse types of information. We could do this with the very general "list" structure provided by R. But S4 allows us to guarantee properties of our new structures so that programs defined for these structures do not have to "check" on what they are operating on. The checks are built in to the class system.

DataFrame: an enhanced structure for tabular data

We are familiar with the data.frame structure from base R. Bioconductor introduced the DataFrame class to provide some additional capabilities that will be heavily exploited later.

sa2 = DataFrame(sampleInfo)

The show method depicts top and bottom table row data, provides information on column datatype, and has the capacity to manage complex data types in columns more effectively than data.frame. Here is the class definition:


Representations for mature archives from multisample NGS experiments

Short read sequencing leads to data collections that are more voluminous than gene-oriented microarray outputs.
Sequencing experiments may also be analyzed in more varied ways, so that it is valuable to provide efficient access to assay outputs at various levels of granularity.

A typical data flow for RNA-seq studies has phases

  1. chromatographic display of base calls per cycle (seldom provided after, say, 2008), 2) base calls over the read length, with quality values, in FASTQ format,
  2. feature quantification through statistical processing of genomic or transcriptomic alignments,
  3. genome-scale archiving of multiple samples for inference on biologic questions of interest. FASTQ files are often developed in pairs for each sample.

A set of BAM files in Bioconductor

The r Biocexptpkg("RNAseqData.HNRNPC.bam.chr14") package includes BAM files derived from paired-end sequencing of 8 samples of HeLa cell lines, 4 of which were subjected to RNA interference to "knock down" the expression of the heterogeneous nuclear ribonucleoproteins C1 and C2 (denoted hnRNPC). This family of nuclear proteins is involved in pre-mRNA processing. We will demonstrate some approaches to managing the data from BAM to archived counts related to HNRNPC.

The BAM file pathnames are available in a character vector stored in the RNAseqData... package.

bfp = RNAseqData.HNRNPC.bam.chr14_BAMFILES

We use the r Biocpkg("Rsamtools") package to manage these using a single variable.

bfl = BamFileList(file=bfp)

The BAM files include a limited amount of metadata about the genome to which the reads were aligned. This is propagated to the BamFileList instance.


We see that the genome is "unspecified", but we happen to know that the reference build 'hg19' was used to define the genomic sequence for alignment. [genome(bfl) <- "hg19" should rectify this but at present there is a bug. This should be fixed so that attempts to compare alignments using different references can be flagged or throw errors.]

In order to verify the success of the knockdown we will check the alignments in the vicinity of the genomic sequence coding for HNRNPC. Later we will show how to obtain the gene address using Bioconductor annotation tools; for now we just assert that for hg19

hnrnpcLoc = GRanges("chr14", IRanges(21677296, 21737638))

This identifies a region of chromosome 14 to which the HNRNPC coding gene is annotated for hg19. A simple way of assessing the expression quantifications uses summarizeOverlaps from the r Biocpkg("GenomicAlignments") package. This counting method will implicitly use multicore computation.

hnse = summarizeOverlaps(hnrnpcLoc,bfl)


5.1 Examine the contents of the count component of hnse. What are the column identifiers for quantifications corresponding to the HNRNPC knockdown?

5.2 Add a variable condition with values knockdown and wildtype to the colData component of hnse.

5.3 Samples with identifier suffixes 306 and 307 are technical replicates of a single control sample; likewise for 308 and 309. Consequently there are six different biological samples in use. Add a variable to colData distinguishing these six samples.

Some intricacies of the ExpressionSet design (advanced)

Bioinformaticians work in an environment that is constantly changing. Computer capacity and throughput has increased dramatically since Bioconductor began (around 2000CE). The R language has also changed considerably. In the early days, it was important to be very economical about memory consumption, and R's semantics at that time led to some inefficiencies through copying of objects in memory as they are passed to functions. Copying does not occur for R objects of class 'environment', and this led to the adoption of environments for storage of voluminous assay data.

class(slot(es1, "assayData"))
ae = slot(es1, "assayData")

This is a very special type of object for R. It can be thought of as a container for a reference to the matrix of expression values.


Exercise (optional)

4.1. Explain the following findings.

e1 = new.env()
l1 = list()
e1$x = 5
l1$x = 5
all.equal(e1$x, l1$x)
f = function(z) {z$x = 4; z$x}
all.equal(f(e1), f(l1))
all.equal(e1$x, l1$x)  # different from before