Citation: High Resolution Modeling of Chromatin Interactions. C. Reeder, D. Gifford. Research in Computational Molecular Biology, 186-198, 2013.
Abstract
Sprout is a novel generative model for ChIA-PET data that characterizes physical chromatin interactions and points of contact at high spatial resolution. Sprout improves upon other methods by learning empirical distributions for pairs of reads that reflect ligation events between genomic locations that are bound by a protein of interest. Using these learned empirical distributions Sprout is able to accurately position interaction anchors, infer whether read pairs were created by self-ligation or inter-ligation, and accurately assign read pairs to anchors which allows for the identification of high confidence interactions. When Sprout is run on CTCF ChIA-PET data it identifies more interaction anchors that are supported by CTCF motif matches than other approaches with competitive positional accuracy. Sprout rejects interaction events that are not supported by pairs of reads that fit the empirical model for inter-ligation read pairs, producing a set of interactions that are more consistent across CTCF biological replicates than established methods.
Sprout is an approach to identifying pairs of spatially proximal binding events for a protein from ChIA-PET data.
How To Run
Applying Sprout to ChIA-PET data requires several steps. Sprout is implemented to run on a cluster of machines through the Sun Grid Engine (SGE) queuing system. It is assumed that the ChIA-PET sequence data have already been processed to remove chimeric read pairs and that linker sequences have been removed from the remaining reads. The remaining non-chimeric read pairs should be aligned to the appropriate reference genome. The two reads of each pair should be aligned independently, because no assumptions should be made about where the reads in a pair fall relative to each other. Input files should contain all read pairs in which both reads align to a unique location in the reference genome. Input files are expected to be tab-delimited, with each line holding the pair of genomic locations to which a read pair aligned. For example:
11:22793448:+	13:56522051:-
2:75705251:+	5:53998331:-
11:106929428:-	11:106929538:+
15:99392393:-	15:99393022:+
12:104434000:-	3:96247693:-
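Before starting a cluster run it can be useful to check that an input file matches this layout. The following is a minimal sketch, not part of Sprout itself: it assumes each location has the form chromosome:position:strand, as the example suggests, and the names Location, parse_location and parse_read_pair are hypothetical helpers introduced here for illustration.

```python
import re
from typing import NamedTuple, Tuple

# Assumed field layout, inferred from the example: <chromosome>:<position>:<strand>
_LOCATION = re.compile(r"^([^:\s]+):(\d+):([+-])$")


class Location(NamedTuple):
    chrom: str
    pos: int
    strand: str


def parse_location(field: str) -> Location:
    """Parse one aligned read location such as '11:22793448:+'."""
    match = _LOCATION.match(field)
    if match is None:
        raise ValueError(f"malformed location: {field!r}")
    return Location(match.group(1), int(match.group(2)), match.group(3))


def parse_read_pair(line: str) -> Tuple[Location, Location]:
    """Parse one tab-delimited line holding the two locations of a read pair."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        raise ValueError(f"expected 2 tab-delimited fields, got {len(fields)}")
    return parse_location(fields[0]), parse_location(fields[1])
```

Calling parse_read_pair on every line of a read file and catching ValueError gives a quick report of lines that are space-delimited or otherwise malformed.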
Running all of the following commands with sproutseed.jar on the classpath should include all necessary dependencies. The first stage of Sprout identifies the locations of binding events. A set of initial binding event locations is generated by running MuTauFileGenerator.java:
edu.mit.csail.cgs.reeder.sproutseed.MuTauFileGenerator --species "Mus musculus;mm9" --spacing 500 --buffer 2000 --readfile --outfile
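As a rough sketch of how such an invocation might be assembled with sproutseed.jar on the classpath, the helper below builds the argument list for the java launcher. The function sprout_command is hypothetical, and the file names pairs.txt and mutau_seed.txt are placeholders chosen for illustration; substitute the read file and output path for your own data.

```python
import shlex
from typing import Dict, List

SPROUT_PACKAGE = "edu.mit.csail.cgs.reeder.sproutseed"


def sprout_command(class_name: str, options: Dict[str, str],
                   jar: str = "sproutseed.jar") -> List[str]:
    """Build a 'java -cp <jar> <class> --flag value ...' argument list."""
    argv = ["java", "-cp", jar, f"{SPROUT_PACKAGE}.{class_name}"]
    for flag, value in options.items():
        argv += [f"--{flag}", str(value)]
    return argv


cmd = sprout_command("MuTauFileGenerator", {
    "species": "Mus musculus;mm9",
    "spacing": "500",
    "buffer": "2000",
    "readfile": "pairs.txt",       # placeholder: path to the read pair file
    "outfile": "mutau_seed.txt",   # placeholder: path for the initial event locations
})

# Shell-quoted form, suitable for pasting into a terminal or a job script.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the command as a list avoids quoting problems with values such as "Mus musculus;mm9", which contains both a space and a semicolon. The same helper can be reused for the later stages by changing the class name and options.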
BreakUpMuTauFile.java breaks up the file containing initial binding event locations into a number of smaller files to make event location detection more efficient:
edu.mit.csail.cgs.reeder.sproutseed.BreakUpMuTauFile --species "Mus musculus;mm9" --buffer 4000 --numregions 100 --mutaufile --outbase
SubmitSeedMuFile.java generates sets of commands that each submit a job to SGE. The following is an example of a set of arguments for SubmitSeedMuFile.java. The effects of the parameter settings are described in the Sprout manuscript.
edu.mit.csail.cgs.reeder.sproutseed.SubmitSeedMuFile --species "Mus musculus;mm9" --genome "mm9_1.txt" --rho 0.7 --alpha 5 --beta 1 --a 1 --b 1 --readfile --dumpfile --outfile --directory --stage 1 --maxiters 2000 --mutaubase --mutaunum --wd --submitfile
The basic Sprout workflow skips stage 2 and continues with what is called stage 3 in the code. First, the results from stage 1 must be consolidated by chromosome in order to be able to identify interactions between regions that were broken up to make binding event identification more efficient:
edu.mit.csail.cgs.reeder.sproutseed.ConsolidateMuFileStage1Results --species "Mus musculus;mm9" --filebase <prefix for the files that contain results from stage 1> --outbase --numfiles <number of files containing results from stage 1> --readfile --numreads
SubmitSeedMuFile3.java generates another set of commands that submit jobs to SGE:
edu.mit.csail.cgs.reeder.sproutseed.SubmitSeedMuFile3 --species "Mus musculus;mm9" --genome "mm9_1.txt" --rho 0.7 --alpha 5 --beta 1 --a 1 --b 1 --readfile --dumpfile --outfile --directory --stage 3 --maxiters 1000 --stage2file <prefix for the files containing the results from the previous stage, in this case stage 1> --eventout --interactionout --wd --submitfile
reeder.c at gmail