jobTree based python wrapper to run the genome simulation tool suite Evolver

evolverSimControl

(c) 2009 - 2012 The Authors, see LICENSE.txt for details.

Authors

Dent Earl, Benedict Paten, Mark Diekhans

The Evolver team is responsible for the items in external/: George Asimenos, Robert C. Edgar, Serafim Batzoglou and Arend Sidow.

Summary

A jobTree based simulation manager for the Evolver genome evolution simulation tool suite.

evolverSimControl (eSC) simulates multi-chromosome genome evolution on an arbitrary phylogeny given in Newick format. In addition to running Evolver, eSC automatically creates statistical summaries of the simulation as it runs, as both text and image files. Convenience scripts are also included to: check on a running simulation and see detailed status and logging information; extract fasta sequence files from the leaf nodes of a completed simulation; and extract pairwise alignment files in multiple alignment format (.maf) from the leaf and branch nodes of a completed simulation and, with the help of mafJoin, join them into a single maf covering the entire simulation.

Because eSC is built on jobTree, you can run it on a cluster with a jobTree-supported batch system, on a multi-core server or on your laptop.

Dependencies

Requirements

  • Linux on x86 (Intel). This is because the core Evolver executables are distributed as pre-compiled binaries.

Installation

  1. Download the package. Consider making it a sibling directory to jobTree/ and sonLib/.
  2. cd into the directory.
  3. Type make.
  4. Edit your PYTHONPATH environment variable so that it contains the parent directory of the evolverSimControl/ directory.
  5. Type make test.
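
Step 4 is the one that tends to go wrong, since it is the parent of evolverSimControl/ that must be on PYTHONPATH, not the directory itself. Here is a minimal sketch of a check for that. The function name and the /home/user/src path are made up for illustration and are not part of eSC.

```python
import os


def parent_on_pythonpath(package_dir, pythonpath=None):
    """Return True if the parent directory of package_dir is listed in PYTHONPATH."""
    if pythonpath is None:
        pythonpath = os.environ.get("PYTHONPATH", "")
    parent = os.path.dirname(os.path.abspath(package_dir))
    entries = [os.path.abspath(p) for p in pythonpath.split(os.pathsep) if p]
    return parent in entries


# Hypothetical layout: the package lives in /home/user/src/evolverSimControl
print(parent_on_pythonpath("/home/user/src/evolverSimControl",
                           pythonpath="/home/user/src"))
```

Called without the pythonpath argument, the function reads the value from your current environment.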

Example

This example walks you through a small simulation using the toy test example available at http://soe.ucsc.edu/~dearl/software/evolverSimControl/. If you want to create your own infile set you can use evolverInfileGeneration.

  1. Download and expand the toy archive. For simplicity I'll assume that both root/ and params/ are in the working directory, i.e. ./ .
  2. Next we run the runSim program:
    • $ simCtrl_runSim.py --inputNewick '(Knife:0.004, (Fork:0.003, (Ladle:0.002, (Spoon:0.001, Teaspoon:0.001)S-TS:.001)S-TS-L:.001)S-TS-L-F:0.001);' --outDir toyExampleSim --rootDir root/ --rootName hg18 --paramsDir params/ --jobTree jobTreeToyExampleSim --maxThreads 32 --seed 3571
    • You can check on a running simulation with simCtrl_checkSimStatus.py. Use --help for options.
  3. Post simulation you can run simCtrl_postSimFastaExtractor.py to extract fasta sequence files from the genomes.
  4. You may also wish to run simCtrl_postSimAnnotDistExtractor.py which will use the ggplot2 package for R to display the length distributions of some of the annotations.
  5. You may also wish to construct a single maf for the simulation using simCtrl_postSimMafExtractor.py, which uses mafJoin to join the pairwise maf output from Evolver into a single simulation-wide maf. This process is extremely memory intensive: the 120Mb Mammal simulation eventually requires approximately 250Gb of memory.
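
If you would rather launch the toy run from a script than type the command by hand, the invocation in step 2 can be assembled as an argument list, which avoids shell quoting problems with the Newick string. This is only a sketch: build_run_sim_command is a hypothetical helper, not part of eSC, and the flag names and values are simply those from the example above.

```python
import shlex

# Tree copied from the toy example above.
TOY_NEWICK = ("(Knife:0.004, (Fork:0.003, (Ladle:0.002, (Spoon:0.001, "
              "Teaspoon:0.001)S-TS:.001)S-TS-L:.001)S-TS-L-F:0.001);")


def build_run_sim_command(newick, out_dir, root_dir, root_name,
                          params_dir, job_tree, max_threads, seed):
    """Return the simCtrl_runSim.py invocation as a list of arguments."""
    return [
        "simCtrl_runSim.py",
        "--inputNewick", newick,
        "--outDir", out_dir,
        "--rootDir", root_dir,
        "--rootName", root_name,
        "--paramsDir", params_dir,
        "--jobTree", job_tree,
        "--maxThreads", str(max_threads),
        "--seed", str(seed),
    ]


cmd = build_run_sim_command(TOY_NEWICK, "toyExampleSim", "root/", "hg18",
                            "params/", "jobTreeToyExampleSim", 32, 3571)
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in cmd))
# To actually launch the simulation: subprocess.check_call(cmd)
```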

Use

Initiating a simulation

In order to run eSC you will need an infile set, a parameter set, a phylogenetic tree and optionally a mobile element library and mobile element parameter set. Infile sets can be created using evolverInfileGenerator or from scratch. Parameter sets can be generated by reading primary literature and coming up with reasonable values. Phylogenetic trees need to be in Newick format.
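
Since the tree is passed as a single string, it can be worth a quick sanity check before starting a long run. The sketch below is not a full Newick parser and is not part of eSC. It only checks for the terminating semicolon and balanced parentheses, then lists the leaf names, and it assumes unquoted labels. The function name check_newick is made up.

```python
import re


def check_newick(newick):
    """Do basic checks on a Newick string and return its leaf names in order."""
    text = newick.strip()
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced ')' in Newick string")
    if depth != 0:
        raise ValueError("unbalanced '(' in Newick string")
    # A leaf label directly follows '(' or ','; internal labels follow ')'.
    return re.findall(r"[(,]\s*([^():,;\s]+)", text)


tree = ("(Knife:0.004, (Fork:0.003, (Ladle:0.002, (Spoon:0.001, "
        "Teaspoon:0.001)S-TS:.001)S-TS-L:.001)S-TS-L-F:0.001);")
print(check_newick(tree))
```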

Available options for running a simulation are listed below.

$ bin/simCtrl_runSim.py --help

Usage: simCtrl_runSim.py --rootName=name --rootDir=/path/to/dir --paramsDir=/path/to/dir --tree=newickTree --stepLength=stepLength --outDir=/path/to/dir --jobTree=/path/to/dir [options]

simCtrl_runSim.py is used to initiate an evolver simulation using jobTree/scriptTree.

Options:

  • -h, --help show this help message and exit
  • --rootDir=ROOTINPUTDIR Input root directory.
  • --rootName=ROOTNAME name of the root genome, to differentiate it from the input Newick. default=root
  • --inputNewick=INPUTNEWICK Newick tree. http://evolution.genetics.washington.edu/phylip/newicktree.html
  • --stepLength=STEPLENGTH stepLength for each cycle. default=0.001
  • --paramsDir=PARAMSDIR Parameter directory.
  • --outDir=OUTDIR Out directory.
  • --seed=SEED Random seed, either an int or "stochastic". default=stochastic
  • --noMEs Turns off all mobile element and RPG modules in the sim. default=False
  • --noBurninMerge Turns off checks for an aln.rev file in the root dir. default=False
  • --noGeneDeactivation Turns off the gene deactivation step. default=False
  • --maxThreads=MAXTHREADS The maximum number of threads to use when running in single machine mode. default=4
  • ... and all other jobTree standard options.
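
To get a feel for how --stepLength relates to the size of a run, the sketch below estimates the number of cycles per branch. It assumes that each branch is evolved in successive cycles of at most stepLength, which the option description suggests but this README does not spell out, so treat the numbers as a rough guide only. The function name is made up. Decimal is used to avoid floating point surprises when dividing values like 0.003 by 0.001.

```python
import math
from decimal import Decimal


def cycles_for_branch(branch_length, step_length="0.001"):
    """Estimate cycles for one branch, ASSUMING one cycle per stepLength of branch."""
    ratio = Decimal(str(branch_length)) / Decimal(str(step_length))
    return int(math.ceil(ratio))


# Branch lengths taken from the toy example tree, default stepLength of 0.001.
for name, length in [("Knife", "0.004"), ("Fork", "0.003"), ("Spoon", "0.001")]:
    print(name, cycles_for_branch(length))
```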

Simulation Status

To check on a running simulation you can use the simCtrl_checkSimStatus.py script.

$ bin/simCtrl_checkSimStatus.py --help

Usage: simCtrl_checkSimStatus.py --simDir path/to/dir [options]

simCtrl_checkSimStatus.py can be used to check on the status of a running or completed evolverSimControl simulation.

Options:

  • -h, --help show this help message and exit
  • --simDir=SIMDIR Parent directory.
  • --drawText, --drawTree prints an ASCII representation of the current tree status. default=False
  • --curCycles prints out the list of currently running cycles. default=False
  • --stats prints out the statistics for cycle steps. default=False
  • --cycleStem prints out a stem and leaf plot for completed cycle runtimes, in seconds. default=False
  • --cycleStemHours prints out a stem and leaf plot for completed cycle runtimes, in hours. default=False
  • --printChrTimes prints a table of chromosome lengths (bp) and times (sec) for intra chromosome evolution step (CycleStep2).
  • --cycleList prints out a list of all completed cycle runtimes. default=False
  • --html prints output in HTML format for use as a cgi. default=False
  • --htmlDir=HTMLDIR prefix for html links.
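
If you have not met stem and leaf plots before, the sketch below shows the idea behind the --cycleStem output: each value is split into a stem (here the tens) and a leaf (the units), and values that share a stem are printed on one line. This is only an illustration with made-up runtimes. The actual layout printed by simCtrl_checkSimStatus.py is not shown in this README and may differ.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return stem-and-leaf lines for non-negative integers (stem = tens, leaf = units)."""
    stems = defaultdict(list)
    for value in sorted(values):
        stems[value // 10].append(value % 10)
    return ["%d | %s" % (stem, " ".join(str(leaf) for leaf in leaves))
            for stem, leaves in sorted(stems.items())]


# Hypothetical cycle runtimes in seconds.
for line in stem_and_leaf([12, 15, 21, 27, 27, 34]):
    print(line)
```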

Sequence Extraction

To extract fasta sequences from a completed simulation you can use the simCtrl_postSimFastaExtractor.py script.

$ bin/simCtrl_postSimFastaExtractor.py --help

Usage: simCtrl_postSimFastaExtractor.py --simDir path/to/dir [options]

simCtrl_postSimFastaExtractor.py takes in a simulation directory and then extracts the sequences of leaf nodes in fasta format and stores them in the respective step's directory.

Options:

  • -h, --help show this help message and exit
  • --simDir=SIMDIR the simulation directory.
  • --allCycles extract fastas from all cycles, not just leafs. default=False
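
Once the fasta files are extracted, a quick way to sanity check them is to tally the sequence lengths. The sketch below reads fasta records from any open file handle. The name of the file written by the extractor is not stated in this README, so the example reads from an in-memory string instead. The function name is made up.

```python
import io


def fasta_lengths(handle):
    """Return a dict mapping each fasta record name to its sequence length."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the record name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


example = io.StringIO(">chr1 toy record\nACGT\nAC\n>chr2\nGG\n")
print(fasta_lengths(example))
```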

Simulation maf creation

To create a single maf reflecting the evolutionary history of the entire simulation you can use the simCtrl_postSimMafExtractor.py script.

$ bin/simCtrl_postSimMafExtractor.py --help

Usage: simCtrl_postSimMafExtractor.py --simDir path/to/dir [options]

simCtrl_postSimMafExtractor.py requires mafJoin which is part of mafTools and is available at https://github.com/dentearl/mafTools/ .

Options:

  • -h, --help show this help message and exit
  • --simDir=SIMDIR Simulation directory.
  • --maxBlkWidth=MAXBLKWIDTH Maximum mafJoin maf block output size. May be reduced towards 250 for complicated phylogenies. default=10000
  • --maxInputBlkWidth=MAXINPUTBLKWIDTH Maximum mafJoin maf block input size. mafJoin will cut inputs to size, may result in long runs for very simple joins. May be reduced towards 250 for complicated phylogenies. default=1000
  • --noBurninMerge Will not perform a final merge of simulation to the burnin. default=False
  • --maxThreads=MAXTHREADS The maximum number of threads to use when running in single machine mode. default=4
  • ... and all other jobTree standard options.
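
After the join finishes, you may want a quick summary of the resulting maf. The sketch below counts alignment blocks (lines starting with "a") and collects the source names from sequence lines (lines starting with "s"), following the standard maf layout. It is not part of eSC, the function name is made up, and the block in the example is invented for illustration.

```python
import io


def summarize_maf(handle):
    """Return (number of alignment blocks, sorted list of source names) for a maf."""
    blocks = 0
    sources = set()
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "a":
            blocks += 1
        elif fields[0] == "s" and len(fields) >= 7:
            # s lines: src start size strand srcSize text
            sources.add(fields[1])
    return blocks, sorted(sources)


example = io.StringIO(
    "##maf version=1\n"
    "a score=0\n"
    "s Spoon.chr1 0 4 + 100 ACGT\n"
    "s Teaspoon.chr1 0 4 + 100 AC-T\n"
    "\n"
    "a score=0\n"
    "s Spoon.chr1 10 2 + 100 GG\n"
)
print(summarize_maf(example))
```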