Permalink
Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
174 lines (127 sloc) 8.04 KB

An Introduction to the multiseq package

library(knitr)
opts_chunk$set(out.extra='style="display:block; margin: auto"', fig.align="center")

The multiseq package is an R package for multiscale sequence analysis and is ongoing work in the Stephens lab at the University of Chicago. Multiseq has two main modes of operation: smoothing and effect estimation....

In this vignette you will learn how to install the package, run multiseq in its two different modes of operation, and visualize its output (and input data) either in R or in the UCSC Genome Browser as a Track Hub. In the Genome Browser you will be able to compare your track with any of the available tracks (already uploaded) and to fine-tune the display (e.g. zoom and scroll the tracks display, highlight a region, change the order of the displayed tracks, etc).


Running multiseq

In this section you will learn how to use multiseq to smooth a signal and to estimate an effect given a covariate g.

First load the package then load the example data included in the package.

    library(multiseq)
    #load example data - R object
    #type ?example1 to get more information
    data(example1, package="multiseq")   
    x          <- dat$x
    g          <- dat$g
    read.depth <- dat$read.depth

Then select a subset of the signal and apply multiseq to perform smoothing.

    #smoothing
    par(mfrow=c(2,1), oma = c(5,4,0,0) + 0.1, mar = c(0,0,1,1) + 0.1)
    for (i in which(g==1)) plot(log(x[i,]), type="l", xlab="Position", ylab="log(x)")
    unique(sort(log(x[1,])))
    res0         <- multiseq(x=x[g==1,], minobs=1, lm.approx=FALSE, read.depth=read.depth[g==1])
    #plot baseline mean +/- 2 posterior standard deviations
    invisible(dev.off())
    plot(res0, fra=2, what="baseline")

Now, given a covariate g, you can estimate an effect:

    #estimating an effect
    res           <- multiseq(x=x, g=g, minobs=1, lm.approx=FALSE, read.depth=read.depth)
    #plot estimated effect mean +/- 2 posterior standard deviations
    plot(res, fra=2)

    #print intervals where `multiseq` found a strong effect (zero is outside of +/- fra posterior standard deviations
    res$intervals <- get.effect.intervals(res, fra=2)
    res$intervals

Function get.effect.intervals outputs intervals where multiseq found strong effect (zero is outside of +/- a specified multiple of the posterior standard deviations). Output interval is in bed format (start is 0-based, end is 1-based).

Running multiseq on sequencing data

Special tools are required to handle next generation sequencing data because of their high throughput nature. Multiseq's function get.counts uses samtools, the UCSC tools, or the R package rhdf5 to read data in bam, bigWig, or hdf5 format.

To run on sequencing data, multiseq requires a samplesheet with the following format:

SampleID Type Replicate Peaks ReadDepth bigWigPath
A1 Control 1 peakA1.bb 16335812 A1.bw
A2 Control 2 peakA2.bb 18197248 A2.bw
B1 Test 1 peakB1.bb 24225586 B1.bw
B2 Test 2 peakB2.bb 12378544 B2.bw

The following fields are required: "SampleID" containing sample IDs, "ReadDepth" specifying sequencing depth for each sample, "bigWigPath" and/or "h5FullPath" and/or "bamReads" specifying the path to count data files in bam, bigWig, or hdf5 format, respectively. Field "Peaks" is not required but can be used to specify the path to a bed or bigBed file (e.g., the path to a bigBed file with ChipSeq peaks). If "Peaks" is specified then also field "Type" is required.

You can load data using the samplesheet as follows:

    setwd(file.path(path.package("multiseq"),"extdata","sim"));
    samplesheet <- file.path(path.package("multiseq"),"extdata","sim","samplesheet.sim.txt")
    samples     <- read.table(samplesheet, stringsAsFactors=F, header=T)
    g <- factor(samples$Type)
    g <- match(g, levels(g))-1
    if (noExecutable("wigToBigWig")){
       data(example2, package="multiseq")
    }else{
       region      <- "chr1:154206209-154214400"
       x           <- get.counts(samples, region)
    }

To smooth a subset of the data:

    par(mfrow=c(2,1), oma = c(5,4,0,0) + 0.1, mar = c(0,0,1,1) + 0.1)
    for (i in which(g==0)) plot(log(x[i,]), type="l", xlab="Position", ylab="log(x)")
    res0 <- multiseq(x=x[g==0,], minobs=1, lm.approx=FALSE, read.depth=samples$ReadDepth[g==0])
    #plot estimated log baseline +- 2 s.d.
    invisible(dev.off())
    res0$region <- region
    plot(res0, fra=2, what="baseline")

To estimate an effect given the covariate g:

    res <- multiseq(x=x, g=g, minobs=1, lm.approx=FALSE, read.depth=samples$ReadDepth)
    #plot estimated effect and s.d. 
    par(mfrow=c(2,1), oma = c(5,4,0,0) + 0.1, mar = c(0,0,1,1) + 0.1)
    res$region <- region
    plot(res, fra=2, is.xaxis=FALSE) 
    transcripts   <- get.transcripts(file.path(path.package("multiseq"),"extdata","sim","hg19.ensGene.part.gp"), region)
    plot(transcripts, region) 

Visualizing input data using samplesheetToTrackHub

With function samplesheetToTrackHub you can create a Track Hub and visualize your input data in the UCSC Genome Browser. Before you use this function, follow installation instructions in the "Optional steps" paragraph above.

    setwd(file.path(path.package("multiseq"),"extdata","sim"))
    hub_name <- "testMultiseq/sim"
    samplesheetToTrackHub(samplesheet, hub_name, chr="chr1")

Function samplesheetToTrackHub will create a Track Hub in folder /some/path/testMultiseq/sim/ and will print the following message:

go to http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&hubUrl=https:some/address/testMultiseq/sim/hub.txt
center the genome browser on the region of interest
and make track visible

If the read tracks or the bed files are large, make sure enough memory is available to run simulationToTrackHub.

This is a screenshot of the Track Hub in the UCSC Genome Browser: Image

Visualizing output data using multiseqToTrackHub

After running multiseq, you can use function multiseqToTrackHub to create a Track Hub that can be visualized in the UCSC Genome Browser to display

  • the effect +- a specified multiple (fra) of the posterior standard deviation
  • the intervals where multiseq found strong effect (zero is outside of +/- a multiple of the posterior standard deviation ' e.g. 2 or 3).
    res$region    <- region
    res$intervals <- get.effect.intervals(res, fra=2)
    multiseqToTrackHub(res, fra=2, hub_name="testMultiseq/multiseq_sim")

Function multiseqToTrackHub will create a Track Hub named multiseq_sim in the https:some/address/testMultiseq/ folder and will print the following message:

go to http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&hubUrl=https:some/address/testMultiseq/multiseq_sim/hub.txt
center your genome browser around chr1:154206209-154214400 
and make track visible

By default multiseqToTrackHub uses the human genome hg19 but default settings can be changed by specifying additional arguments (see help). This is a screenshot of the Track Hub in the UCSC Genome Browser: Image

Note

If the Genome Browser doesn't show the Track Hub try going to http://genome.ucsc.edu/cgi-bin/hgHubConnect/MyHubs and checking the Track Hub.