Permalink
Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
283 lines (241 sloc) 11.5 KB
title date output vignette
Thresholded Correlation of RNASeq Gene Expression Data
`r Sys.Date()`
html_document
highlight toc toc_depth mathjax
kate
false
4
%\VignetteIndexEntry{Correlation} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc}

Thresholded Correlation of RNASeq Gene Expression Data

We illustrate the use of the tcor package to compute a thresholded gene expression correlation matrix of gene expression data from the Cancer Genome Atlas (TCGA).

TCGA (http://cancergenome.nih.gov/) is a joint effort of the National Cancer Institute and the National Human Genome Research Institute. TCGA provides curated data and analyses, including large-scale genome sequencing, for many cancer tumor types.

The example proceeds in two parts:

  1. Downloading RNASeq gene expression data from TCGA and tricks for efficiently reading the data into R
  2. Computing a thresholded gene by gene correlation matrix with tcor

Obtaining and reading the gene expression data

The data for this example are obtained from the Broad Institute's GDAC Firehose http://firebrowse.org/?cohort=BRCA&download_dialog=true. The GDAC provides a convenient way to download versioned and standardized TCGA data organized as sample by measurement tables in tab delimited text form.

Some of the following steps use Unix-like pipeline processing with shell utilities and R's pipe function. Windows users may need to install the Rtools suite (https://cran.r-project.org/bin/windows/Rtools/).

Download and decompress the data

We select breast invasive carcinoma gene expression data, one of the larger available datasets. The GDAC dashboard is available at http://gdac.broadinstitute.org/. Data may be browsed and manually downloaded directly from the dashboard, or downloaded and uncompressed using the download.file and untar lines in the script below.

url = "http://gdac.broadinstitute.org/runs/stddata__2015_11_01/data/BRCA/20151101/gdac.broadinstitute.org_BRCA.Merge_rnaseq__illuminahiseq_rnaseq__unc_edu__Level_3__gene_expression__data.Level_3.2015110100.0.0.tar.gz"
destfile = "gdac.broadinstitute.org_BRCA.Merge_rnaseq__illuminahiseq_rnaseq__unc_edu__Level_3__gene_expression__data.Level_3.2015110100.0.0.tar.gz"

download.file(url, destfile)  # (about 300 MB)
untar(destfile)               # (about 600 MB)

# data file name:
fn = dir(gsub("\\.tar\\.gz", "", destfile), pattern="*.data.txt", full.names=TRUE)

Efficiently reading the data into R

The data file is a tab-delimited text file with two header lines followed by 20,532 data lines. Each data line specifies a gene followed by sample measurements for that gene in the columns. The measurements include raw counts, normalized, and reads per kilobase of transcript per million mapped reads (RPKM) values. We'll use the RPKM-normalized values in this example.

The header lines look like:

# Hybridiz...   ID1        ID1                       ID1  ID2        ID2                      ID2        ...
# gene          raw_counts median_length_normalized  RPKM raw_counts median_length_normalized raw_counts ...

where, IDn indicates the nth TCGA barcode ID.

We could simply read these data into R using read.table, for example with:

# Simple but not particularly efficient way to read the data...
# (Don't run this, continue reading instead...)
id   = unlist(read.table(fn, sep="\t", stringsAsFactors=FALSE, header=FALSE, nrows=1))
brca = read.table(fn, sep="\t", stringsAsFactors=FALSE, header=FALSE, skip=2)
# ... now filter out just the gene and RPKM columns...

But since we're only interested in the RPKM-normalized values that approach reads too much from the file. It can be more efficient to skip the columns we're not interested in and read in just the gene and RPKM columns.

Instead we can use some the simple but effective shell tool cut and the idea of pipelines to process the data file to remove all but the gene and RPKM columns on the fly as we read it into R. There are two distinct advantages to this approach: we read in only what we're interested in (cutting processing time and memory consumption), and we employ pipeline-style parallel processing to further speed things up, running the column-skipping cut process in parallel with the data parsing R read.table function.

The pipelined processing as described in the last paragraph can use at most two CPU cores to process the data (in practice on average somewhat less due to I/O and other overhead). Lots of even cheap PCs today have more than two cores, and often quite fast storage systems (for example, solid state disk drives). We can wring even more performance out of such systems by combining the pipeline parallelism with explicit parallel processing using R's myriad available parallel processing functions.

The example code below uses Steve Weston's elegant foreach framework for explicit parallel processing to read the data file in chunks. Chunks are processed concurrently using the pipelined parallelism described above: on a four-CPU computer this yields four R and four cat worker processes plus the controlling R process.

Any foreach parallel back end can be used for this task. We use the Unix-specific doMC backend below but Windows users can equivalently use the doParallel backend. The work can even be distributed across more than one computer with doSNOW, doMPI, or doRedis backends (without changing the code). The example below runs in under 30 seconds on my inexpensive quad-core Athlon home PC.

There are 2,635 columns (878 unique samples) of tab-separated data. If we're interested only in RPKM values, then we want columns 1, 4, 7, 10, ..., 2635. One efficient way to get just the columns of interest uses an external shell pipeline.

h = 2       # total header lines in the data file
N = 20534   # total number of lines in the data file

# The argument to cut -d is a TAB symbol inside of single quotes.
# You can generate that by typing CTRL+V followed by TAB. For some
# reason it often does not copy right (coming over as spaces instead),
# so beware here...
command = sprintf("cat %s | cut -d '	' -f %s", fn, paste(c(1,seq(from=4, to=2635, by=3)), collapse=","))

# Read the first header line of sample IDs:
f = pipe(sprintf("%s | head -n 1", command), open="r")
id = unlist(read.table(f, sep="\t", stringsAsFactors=FALSE, header=FALSE, nrows=1))
id[1] = "gene"
close(f)

# Read the rest of the file in parallel.
library(doMC)
cores = 4
registerDoMC(cores)
block = floor((N - h)/cores)

brca = foreach(j=1:cores, .combine=rbind) %dopar%
{
  skip = block * (j - 1) + h + 1
  nrows = ifelse(j == cores, -1, block)
  f = pipe(sprintf("%s | tail -n +%.0f", command, skip), open="r")
  on.exit(close(f))
  read.table(f, sep="\t", stringsAsFactors=FALSE, header=FALSE, nrows=nrows)
}

# Finally, label the variables we've just read in using the TCGA sample IDs.
names(brca) = id

Once finished, we have a data frame named brca with 20,532 rows and 879 columns. The first column contains gene names, the rest contain sample RPKM values. The data frame column names include the TCGA barcode sample IDs.

See https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode for help understanding the TCGA barcode, a sequence of dash separated identifiers. In particular, the fourth identifier (sample/vial) indicates if the sample comes from normal tissue, solid tumor, or elsewhere, as described in https://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm?codeTable=sample%20type. We can identify columns associated with tumor, metastatic, and normal tissue samples by:

tumor      = grep("^....-..-....-01.-...-....-..", names(brca))
normal     = grep("^....-..-....-11.-...-....-..", names(brca))
metastatic = grep("^....-..-....-06.-...-....-..", names(brca))

The next step of our example computes thresholded gene correlation matrices and works with data in matrix form, not data frames. The final step in this section assembles two matrices corresponding to tumor and normal samples:

brca_tumor = t(as.matrix(brca[, tumor]))
brca_normal = t(as.matrix(brca[, normal]))
colnames(brca_tumor) = brca$gene   # gene names for reference
colnames(brca_normal) = brca$gene  # gene names for reference

print(dim(brca_tumor))
print(dim(brca_normal))
[1]   775 20532
[1]   100 20532

Efficient computation of thresholded correlation matrices with tcor

The tcor package (https://github.com/bwlewis/tcor, and companion preprint paper http://arxiv.org/abs/1512.07246) provides an implementation of the a new algorithm for fast and efficient thresholded correlation. You can install the development version of the R package directly from GitHub with

devtools::install_github("bwlewis/tcor")
library(tcor)

Because we're interested in correlation among the columns (gene expression), we need to filter out constant-valued columns (including, for example, columns of all zeros):

brca_tumor_filtered = brca_tumor[, apply(brca_tumor, 2, sd) > 0]
brca_normal_filtered = brca_normal[, apply(brca_normal, 2, sd) > 0]

Let's find all pairs of gene expression vectors among the filtered tumor data with correlation values at least 0.99:

tumor_cor = tcor(brca_tumor_filtered, t=0.99)
str(tumor_cor)
List of 6
 $ indices    : num [1:529, 1:3] 16749 8316 4320 4319 4320 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:3] "i" "j" "val"
 $ n          : num 195369
 $ longest_run: num 5467
 $ t          : num 0.99
 $ svd_time   : num 5.48
 $ total_time : num 34.3

The tcor function found 529 such correlated gene expression vectors (out of a total 20522^2 = 421,152,484 possible gene pairs) in about 35 seconds on my quad-core home PC. We can translate the listed matrix column indices into more easily readable gene names with, for example:

tumor = data.frame(i=colnames(brca_tumor_filtered)[tumor_cor$indices[,1]],
                   j=colnames(brca_tumor_filtered)[tumor_cor$indices[,2]],
                   val=tumor_cor$indices[,3])
head(tumor)
                     i                j       val
1 SNORD115-2|100033437 KRTAP20-3|337985 1.0000000
2      INS-IGF2|723961        IGF2|3481 0.9999759
3       CSN1S2A|286828        CSN2|1447 0.9999411
4              GC|2638   CSN1S2A|286828 0.9998493
5              GC|2638        CSN2|1447 0.9997900
6            NRAP|4892       CSRP3|8048 0.9997411

We can verify the result by explicitly computing a full correlation matrix, but this takes a lot longer and uses much more memory (270 seconds and 6 GB on my PC):

# Uncomment the following lines if you want to run the full correlation
# for comparison...

# C = cor(brca_tumor_filtered)
# sum(C[upper.tri(C)] >= 0.99)
# (You will get 529, same as computed above with tcor.)

We can similarly identify the 20,331 pairs of correlated gene expression vectors for the normal samples (in about 17 seconds on my PC):

normal_cor = tcor(brca_normal_filtered, t=0.99)
str(normal_cor)
normal = data.frame(i=colnames(brca_normal_filtered)[normal_cor$indices[,1]],
                    j=colnames(brca_normal_filtered)[normal_cor$indices[,2]],
                    val=normal_cor$indices[,3])
head(normal)
                      i                     j val
1          OR1S1|219959          LELP1|149018   1
2          LELP1|149018          OR9K2|441639   1
3          OR5M3|219482 SNORD115-10|100033447   1
4 SNORD115-10|100033447         OR4C46|119749   1
5          OR5T2|219464          OR8D4|338662   1
6          OR8D4|338662          OR2M5|127059   1