PhaseDancerSimulator - Simulator of collapsed segmental duplications

To account for a complex history of the formation of segmental duplications PhaseDancerSimulator simulates such regions based on the tree topology. Fragment of the reference genome is assigned to the root of the tree. Child sequences are created by copying a parent node sequence and mutating each base at a fixed per base rate. Then, long-reads (PacBio or Oxford Nanopore) are simulated using PBSIM2.

PhaseDancerSimulator outputs:

image presenting tree topology.
XML file storing tree topology (gexf).
information about the mutations for every node of the tree (in vcf format).
reads simulated for leaves or all nodes (depending on the type setting).
reads mapped on the root sequence grouped by node (RG tag)

Configuration

Step 1: Cloning the PhaseDancerSimulator repository and installing dependencies

To clone PhaseDancerSimulator repository and installing dependencies, the following lines should be executed:

git clone --recurse-submodules https://github.com/bposzewiecka/PhaseDancerSimulator.git
cd PhaseDancerSimulator
python3 -m venv PhaseDancerSimulator_venv
source PhaseDancerSimulator_venv/bin/activate
pip install -r requirements.txt
./install.sh

Step 2: Creating the configuration file

Configuration file config.yaml must include two dictionaries:

simulations, that contains parameters of each simulation(s) by the name of each simulation.
regions, that contains coordinates of regions to simulate from by the name of each region.

Each entry from the simulations dictionary must have the following properties:

Property	Description	Values
topology	Type of the tree topology	See: topologies
simulations-number	Number of the simulations	integer
region	List of names of the region from the regions dictionary	List of strings - keys from regions dictionary
mutation-rates	List of probabilities	list of probabilities
chemistry	HMM model of quality code for chemistry. See: chemistries	P4C2, P5C3, P6C4, R103, R94, R95
coverages	List of the simulated coverages	list of integers
accuracies	List of the simulated reads accuracies	list of numbers from range 70-100
length-mean	Mean of the simulated reads length	integer
length-sd	Standard deviation of simulated reads length	integer
type	Simulate from all tree nodes or only leaves?	all, leaves
random-prefix-size	Size of random sequence at the beginning of the simulated sequence	integer (default 0)
random-suffix-size	Size of random sequence at the end of the simulated sequence	integer (default 0)

Chemistries

Reads are simulated using FIC-HMM model for chemistries of PacBio and Nanopore.

Name	Technology
P6C4	PacBio
P5C3	PacBio
P4C2	PacBio
R103	Nanopore
R95	Nanopore
R94	Nanopore

Topologies

Tree topologies can be simulated using 4 topology types: flat, cascading, bifurcating and random.

	Flat	Cascading
Topology
Property format	flat number_of_leaves	cascading number_of_leaves
Example value	flat 6	cascading 6

	Bifurcating	Random
Topology
Property format	bifurcating nodes_multiplier_level_1 nodes_multiplier_level_2 nodes_multiplier_level_3 ...	random number_of_leaves
Example value	bifurcating 2 3 2	random 6

Each entry from the regions dictionary must have the following properties:

Property	Description	Values
reference	Reference name, from which region will be extracted. If reference name is ref the file ref.fa should be in the data/refs directory.	string
chrom	Region chromosome	string
start	Region start coordinate	number
end	Region end coordinate	number

Example configuration file

The following yaml file can be created to simulate duplication of two different topology types. First set of parameters myflat simulate one topology tree with root and 10 leaves. Second set of parameters myrandom simulate 4 random topology trees with 10 leaves.

In both simulations the same input region is used. For each simulation reference parameter is equal to "chr1_30". In the dictionary references the name ('chr1_30') is used to specify the region from the hg38 genome build used for the simulations.

simulations:
    myflat:
        topology: 'flat 10'
        simulations-number: 1
        regions: [ region1, region2 ]
        mutation-rates: [0.001, 0.005]
        chemistries: ['P6C4', 'R103']
        coverages: [40, 60, 80]
        accuracies: [85, 90, 95]
        length-mean: 18000
        length-sd: 3000
        type: 'leaves'
    myrandom:
        topology: 'random 10'
        simulations-number: 4
        regions: [ region2 ]
        mutation-rates: [0.001, 0.005]
        chemistries: ['P6C4', 'R103']
        coverages: [40, 60, 80]
        accuracies: [85, 90, 95]
        length-mean: 18000
        length-sd: 3000
        type: 'all'
regions:
    region1:
       chrom: 'chr1'
       start: 30000000
       end: 30500000
       reference: 'hg38'
    region2:
       chrom: 'chr2'
       start: 60000000
       end: 60500000
       reference: 'hg38'

Step 3: Setting environment variables PHASEDANCER_SIMULATOR_DIR and PHASEDANCER_SIMULATOR_DATA_DIR

Environment variable PHASEDANCER_SIMULATOR_DIR should point to a directory PhaseDancerSimulator cloned from github repository.

export PHASEDANCER_SIMULATOR_DIR=/path/to/PhaseDancerSimulator

Environment variable PHASEDANCER_SIMULATOR_DATA_DIR should point to a directory where reference genomes and simulation will be stored.

export PHASEDANCER_SIMULATOR_DATA_DIR=/directory/where/simulations/will/be/stored

To set PHASEDANCER_SIMULATOR_DATA_DIR environment variable to the current directory, following command should be executed:

export PHASEDANCER_SIMULATOR_DATA_DIR=`pwd`

Step 4: Placing the reference genomes (or symbolic link to them) used for simulations in the appropriate directory

Reference genomes used for simulations should be placed in the data/refs directory. Symbolic links to reference genomes can be also used. Fasta file or symbolic link to it should have a .fa extension.

The following command can be executed to download hg38 human genome build.

mkdir -p $PHASEDANCER_SIMULATOR_DATA_DIR/data/refs
cd $PHASEDANCER_SIMULATOR_DATA_DIR/data/refs
wget --directory-prefix=$PHASEDANCER_SIMULATOR_DATA_DIR  https://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gunzip $PHASEDANCER_SIMULATOR_DATA_DIR/hg38.fa.gz

Step 5: Starting the Snakemake workflow

To start the Snakemake workflow, the following line of code should be executed with the number_of_threads replaced by the maximum number of threads that can be used for simulations.

snakemake --cores number_of_threads

Note

The PhaseDancerSimulator simulator is based on the idea of simulating segmental duplications derived from tree topologies proposed in ¹. Simulator was used to validate and benchmark PhaseDancer ² assembler.

Embedded software:

Name		Name	Last commit message	Last commit date
Latest commit History 116 Commits
images		images
minimap2 @ fcaadc2		minimap2 @ fcaadc2
pbsim2 @ e71f789		pbsim2 @ e71f789
src		src
.gitignore		.gitignore
.gitmodules		.gitmodules
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
README.md		README.md
Snakefile		Snakefile
config.yaml		config.yaml
install.sh		install.sh
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

License

bposzewiecka/PhaseDancerSimulator

Folders and files

Latest commit

History

Repository files navigation

PhaseDancerSimulator - Simulator of collapsed segmental duplications

Configuration

Step 1: Cloning the PhaseDancerSimulator repository and installing dependencies

Step 2: Creating the configuration file

Chemistries

Topologies

Example configuration file

Step 3: Setting environment variables PHASEDANCER_SIMULATOR_DIR and PHASEDANCER_SIMULATOR_DATA_DIR

Step 4: Placing the reference genomes (or symbolic link to them) used for simulations in the appropriate directory

Step 5: Starting the Snakemake workflow

Note

Footnotes

About

Resources

License

Stars

Watchers

Forks

Languages