Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
240 lines (202 sloc) 7.95 KB
layout title
Some examples of integrative analysis
opts_chunk$set(fig.path=paste0("figure/", sub("(.*).Rmd","\\1",basename(knitr:::knit_concord$get('infile'))), "-"))

Integrative analysis examples

In this document we'll review a few approaches to using genome-scale data of different types to reason about certain focused questions.

TF binding and expression co-regulation in yeast

An example of integrative analysis can be found in a paper of Lee and Rinaldi in connection with the regulatory program of the yeast cell cycle. There are two key experimental components:

  • Protein binding patterns: based on ChIP-chip experiments, we can determine the gene promoter regions to which transcription factors bind.

  • Expression patterns: based on timed observations of gene expression in a yeast colony we can identify times at which groups of genes reach maximal expression.

Figure 5 of the paper indicates that the Mbp1 transcription factor played a role in regulating expression in the transition from G1 to S phases of the cell cycle. The ChIP-chip data is in the harbChIP package.


This is a well-documented data object, and we can read the abstract of the paper directly.


Let's find MBP1 and assess the distribution of reported binding affinity measures. The sample names of the ExpressionSet (structure used for convenience even though the data are not expression data) are the names of the proteins "chipped" onto the yeast promoter array.

mind = which(sampleNames(harbChIP)=="MBP1")
qqnorm(exprs(harbChIP)[,mind], main="MBP1 binding")

The shape of the qq-normal plot is indicative of a strong departure from Gaussianity in the distribution of binding scores, with a very long right tail. We'll focus on the top five genes.

topb = featureNames(harbChIP)[ order(
  exprs(harbChIP)[,mind], decreasing=TRUE)[1:5] ]
select(org.Sc.sgd.db, keys=topb, keytype="ORF",

Our conjecture is that these genes will exhibit similar expression trajectories, peaking well within the first half of cell cycle for the yeast strain studied.

We will subset the cell cycle expression data from the yeastCC package to a colony whose cycling was synchronized using alpha pheromone.

alp = spYCCES[, spYCCES$syncmeth=="alpha"]
plot(exprs(alp)[ topb[1], ]~alp$time, lty=1,
   type="l", ylim=c(-1.5,1.5), lwd=2, ylab="Expression",
    xlab="Minutes elapsed")
for (i in 2:5) lines(exprs(alp)[topb[i],]~alp$time, lty=i, lwd=2)
legend(75,-.5, lty=1:10, legend=topb, lwd=2, cex=.6, seg.len=4)

We have the impression that at least three of these genes reach peak expression roughly together near times 20 and 80 minutes. There is considerable variability. A data filtering and visualization pattern is emerging by which genes bound by a given transcription factor can be assessed for coregulation of expression. We have not entered into the assessment of statistical significance, but have focused on how the data types are brought together.

TF binding and genome-wide DNA-phenotype associations in humans

Genetic epidemiology has taken advantage of high-throughput genotyping (mostly using genotyping arrays supplemented with model-based genotype imputation) to develop the concept of "genome-wide association study" (GWAS). Here a cohort is assembled and individuals are distinguished in terms of disease status or phenotype measurement, and the genome is searched for variants exhibiting statistical association with disease status or phenotypic class or value. An example of a GWAS result can be seen with the gwascat package, which includes selections from the NHGRI GWAS catalog, which has recently moved to EBI-EMBL.


This shows the complexity involved in recording information about a replicated genome-wide association finding. There are many fields recorded, by the key elements are the name and location of the SNP, and the phenotype to which it is apparently linked. In this case, we are talking about rheumatoid arthritis.

We will now consider the relationship between ESRRA binding in B-cells and phenotypes for which GWAS associations have been reported.

It is tempting to proceed as follows. We simply compute overlaps between the binding peak regions and the catalog GRanges.

fo = findOverlaps(GM12878, gwrngs19)
    subjectHits(fo) ]), decreasing=TRUE)[1:5]

The problem with this is that gwrngs19 is a set of records of GWAS hits. There are cases of SNP that are associated with multiple phenotypes, and there are cases of multiple studies that find the same result for a given SNP. It is easy to get a sense of the magnitude of the problem using reduce.


So our strategy will be to find overlaps with the reduced version of gwrngs19 and then come back to enumerate phenotypes at unique SNPs occupying binding sites.

fo = findOverlaps(GM12878, reduce(gwrngs19))
ovrngs = reduce(gwrngs19)[subjectHits(fo)]
phset = lapply( ovrngs, function(x)
  unique( gwrngs19[ which(gwrngs19 %over% x) ]$Disease.Trait ) )
sort(table(unlist(phset)), decreasing=TRUE)[1:5]

What can explain this observation? We see that there are commonly observed DNA variants in locations where ESRRA tends to bind. Do individuals with particular genotypes of SNPs in these areas have higher risk of disease because the presence of the variant allele interferes with ESRRA function and leads to arthritis or abnormal cholesterol levels? Or is this observation consistent with the play of chance in our work with these data? We will examine this in the exercises.

Harvesting GEO for families of microarray archives

The NCBI Gene Expression Omnibus is a basic resource for integrative bioinformatics. The Bioconductor GEOmetadb package helps with discovery and characterization of GEO datasets.

The GEOmetadb database is a 240MB download that decompresses to 3.6 GB of SQLite. Once you have acquired the GEOmetadb.sqlite file using the getSQLiteFile function, you can create a connection and start interrogating the database locally. Here we use an environment variable to establish the location of the database. Use your operating system environment variables to emulate this.

lcon = dbConnect(SQLite(), Sys.getenv("GEOMETADB_SQLITE_PATH"))

We will build a query that returns all the GEO GSE entries that have the phrase "pancreatic cancer" in their titles. Because GEO uses uninformative labels for array platforms, we will retrieve a field that records the Bioconductor array annotation package name so that we know what technology was in use. We'll tabulate the various platforms used.

vbls = "gse.gse, gse.title, gpl.gpl, gpl.bioc_package"
req1 = " from gse join gse_gpl on gse.gse=gse_gpl.gse"
req2 = " join gpl on gse_gpl.gpl=gpl.gpl"
goal = " where gse.title like '%pancreatic%cancer%'"
quer = paste0("select ", vbls, req1, req2, goal)
lkpc = dbGetQuery(lcon, quer)

We won't insist that you take the GEOmetadb.sqlite download/expansion, but if you do, variations on the query string constructed above can assist you with targeted identification of GEO datasets for analysis and reinterpretation.