Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
185 lines (150 sloc) 6.18 KB
title author layout
Managing and processing large numbers of BED files
Vince Carey
opts_chunk$set(fig.path=paste0("figure/", sub("(.*).Rmd","\\1",basename(knitr:::knit_concord$get('infile'))), "-"))


We have seen some very basic and relatively small scale data management problems solved fairly straightforwardly with the ExpressionSet class and with BamFileList for a collection of BAM alignment files.

  • S4 classes help control complexity and ensure coherence among components.
  • R packages are further used to coordinate and distribute data, software utilities, and documentation. Later in the course we will demonstrate how to build such packages.

To conclude this section on data management we address a problem that does not have an obviously optimal solution. We consider the dilemma between packaging serialized instances of GRanges, and packaging indexed compressed text files that can naturally be processed with parallel computing using GenomicFiles.

Why is this a dilemma? In certain contexts, GRanges instances for large numbers of features will be very large, can take significant time to load, and can consume significant quantities of RAM. This can lead to difficulties in using these representations on cluster nodes with limited RAM. If the data are kept on disk in an indexed format allowing random access, transforming targeted extracts of file content into GRanges on an as-needed basis may produce better performance for certain applications. One reason for creating the erma package is to acquire data on the tradeoffs between these different approaches.

erma: epigenomic road map adventures

To foster improved interpretation of effects of DNA variants, we are interested in epigenomic variation among cell types. Information on analysis of ChIP-seq experiments has been collected in the NIH epigenomic road map, with hundreds of different cell-type-specific and mark-specific files archived for public use.

What is an efficient way to manage such data? One experimental approach is provided in the erma package.

A collection of BED files

Statistical 'imputation' of chromatin states has been performed as reported by Ernst and Kellis using data on a large number of cell lines.

ef = dir(system.file("bed_tabix", package="erma"), patt="bed.gz$")

GenomicFiles for managing; improving the naming

We regard names of files as immutable, but the vector of paths to the bed files can be named more informatively. The metadata can be obtained through the makeErmaSet function.

mm = makeErmaSet()
names(files(mm)) = colData(mm)$Epigenome.Mnemonic

Tabulating chromatin state assignments

An overview of the states and marks used to define them

The following table indicates how configurations of various types of chromatin marks are combined to infer the state of a chromosomal segment. Abbreviations used indicate inferred existence of an active TSS (TssA), upstream and downstream promoters of two types, (PromU, PromD1, PromD2), Transcription that is 5' or 3' preferential or simply strong or weak (Tx5', Tx3', Tx, TxWk), jointly transcribed and regulatory (TxReg), transcribed with preference and enhancer activity, possibly weak (TxEnh5', TxEnh3', TxEnhW), active enhancers of two types, possibly flanking or weak (EnhA1, EnhA2, EnhAF, EnhW1, EnhW2) or facilitated by H3K27ac (EnhAc), primary DNase (DNase), ZNF genes and repeats (ZNF/Rpts), heterochromatin (Het), promoters of bivalent or poised type (PromP, PromBiv), repressed polycomb (ReprPC) or quiescent (Quies).

im = readPNG(system.file("pngs/emparms.png", package="erma"))

The files in the erma package contain tilings of genomes for 31 different cell lines, with assignments of tiles to chromatin states according to the state map shown above.

Here's an example of state labeling in the promoter regions for two genes, BRCA2 and EOMES.

stateProfile(mm, "BRCA2")
stateProfile(mm, "EOMES")

The two genes differ both in terms of prevalent state in the promoter region, and in terms of cell-type to cell-type consistency.

Parallelized targeted queries

The visualizations are nice, but we want to be able to program over the files to extract information for downstream use. Here we simply tabulate the states present for each cell type. reduceByFile assigns files to available cores or hosts for parallel processing.

gm = promoters(range(genemodel("BRCA2"))) # 2000 upstream, 200 down by default
ans = reduceByFile( gm, files(mm), MAP=function(range,file) {
   table( import(file, genome="hg19", which=range)$name ) } )
# alternate approach; reboot macbook if above hangs, or use:
#ans = lapply(  files(mm), function(x) {
#   table( import(x, genome="hg19", which=gm)$name ) } )
ans= unlist(ans, recursive=FALSE)

So we see that there is cell-type specific variation in the numbers and types of states imputed to genomic sequence in the promoter region of a given gene. In exercises we will consider how to develop other summaries of potential interest in this manner.


We have shown how we can

  • "leave on disk" large numbers of indexed textual representations of genomic annotation or experimental result,
  • use R packaging to simplify identification of paths to these resources,
  • use GenomicFiles to manage metadata about the files,
  • use reduceByFile to process file contents in parallel.

Such technique provides a certain level of scalability by avoiding construction, saving and loading of large R object serializations. In certain applications with large numbers of clusters with limited RAM, such strategy may be highly cost-effective.