Snakemake pipeline for user-friendly use of the DeCoSTAR software
This repository contains a pipeline that produces input data for the DeCoSTAR software, so that the ARt-DeCo, ADseq and DeClone algorithms can be applied to a dataset. The required input files are described in the section INPUT.
- Python2 (≥v.2.6) -> v.2.7:
- Python3 (≥v.3.5) -> v.3.5.2:
- Snakemake -> v.3.5.5 (newer versions break the pipeline because of changes in how rules manage their input/output)
- C++ (≥c++11):
- Gene trees inference pipeline:
- Scaffolding adjacencies pipeline:
- DeCoSTAR (GitHub repository)
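Because the pipeline is tied to a specific Snakemake version, it can help to check the environment before running anything. The following is a minimal sketch, not part of the repository: the helper names `parse_version` and `check_versions` are hypothetical, and it assumes plain dotted numeric version strings such as the one printed by `snakemake --version`.

```python
import sys

PINNED_SNAKEMAKE = (3, 5, 5)  # version reported as working in the list above
MIN_PYTHON3 = (3, 5)          # minimum Python 3 version from the list above


def parse_version(text):
    """Turn a dotted numeric version string such as '3.5.5' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def check_versions(snakemake_version, python_version=sys.version_info[:2]):
    """Return a list of human-readable problems (empty when everything matches)."""
    problems = []
    if parse_version(snakemake_version) != PINNED_SNAKEMAKE:
        problems.append("Snakemake %s found, 3.5.5 expected" % snakemake_version)
    if tuple(python_version) < MIN_PYTHON3:
        problems.append("Python >= 3.5 required")
    return problems
```

Passing the output of `snakemake --version` as `snakemake_version` returns an empty list when the installed version is the pinned one.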
Download the Snakemake pipeline with the following commands:
git clone https://github.com/YoannAnselmetti/DeCoSTAR_pipeline.git
cd DeCoSTAR_pipeline
The first step is then to install all the software and the Bio++ library required by the different steps of the pipeline (other libraries must be installed manually):
Dataset example on 18 Anopheles species
The repository contains a dataset composed of 18 Anopheles genomes, corresponding to the dataset used in Anselmetti et al. (2018). The input data files for this dataset are in the directory 18Anopheles_dataset/, and two configuration files are in the directory config_files/snakemake ("config_18Anopheles_WGtopo.yaml" and "config_18Anopheles_Xtopo.yaml") to apply the pipeline to the 18 Anopheles dataset with two species tree topologies. To reproduce the DeCoSTAR input data used in Anselmetti et al. (2018), go to commit 572d5a5 of this repository.
Adapt the pipeline to your dataset:
- Produce the data files described in section INPUT from your dataset
- Create a Snakemake configuration file adapted to your dataset, starting from the example configuration file
- Set the correct configuration file path at the top of each "*.snakefile" file (configfile: "config_files/snakemake/your_config_file.yaml")
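Since the configuration path has to be set in every snakefile, one of them is easily forgotten. The sketch below is an illustration only: the helper names `declared_configfile` and `check_snakefiles` are hypothetical, and it assumes each snakefile declares its configuration on a single line of the form `configfile: "path"`.

```python
import re

# Matches a line such as: configfile: "config_files/snakemake/your_config_file.yaml"
CONFIGFILE_RE = re.compile(r'^\s*configfile:\s*["\']([^"\']+)["\']', re.MULTILINE)


def declared_configfile(snakefile_text):
    """Return the path given in the first configfile directive, or None."""
    match = CONFIGFILE_RE.search(snakefile_text)
    return match.group(1) if match else None


def check_snakefiles(paths, expected):
    """Return {snakefile: declared path} for files that do not point at `expected`."""
    mismatches = {}
    for path in paths:
        with open(path) as handle:
            declared = declared_configfile(handle.read())
        if declared != expected:
            mismatches[path] = declared
    return mismatches
```

Calling `check_snakefiles` on the four snakefiles with your configuration path returns an empty dictionary when they all agree.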
Command lines for the different steps of the Snakemake pipeline (N: #CPUs):
snakemake --snakefile preprocessing.snakefile -j N
snakemake --snakefile input_decostar.snakefile -j N
snakemake --snakefile run_decostar.snakefile -j N
snakemake --snakefile create_adjacencies_graph.snakefile -j N
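The commands above have to run in order, each one only after the previous one succeeded. A small wrapper can chain them, as in this sketch; the function names `build_commands` and `run_pipeline` are hypothetical and not part of the repository, and the script is assumed to be launched from the repository root.

```python
import subprocess

# Snakefiles in the order given by the command lines above.
SNAKEFILES = [
    "preprocessing.snakefile",
    "input_decostar.snakefile",
    "run_decostar.snakefile",
    "create_adjacencies_graph.snakefile",
]


def build_commands(n_cpus):
    """Build one snakemake command (as an argument list) per pipeline step."""
    return [
        ["snakemake", "--snakefile", snakefile, "-j", str(n_cpus)]
        for snakefile in SNAKEFILES
    ]


def run_pipeline(n_cpus, runner=subprocess.check_call):
    """Run the steps in order; check_call raises on the first failing step."""
    for command in build_commands(n_cpus):
        runner(command)
```

`run_pipeline(8)` would then execute the four steps with 8 CPUs and stop at the first step that exits with an error.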
The pipeline to execute DeCoSTAR on a dataset is divided into 6 parts:
1. Input data preprocessing to produce the GENE file and discard gene families/trees containing included genes (script preprocessing.snakefile)
2. Gene trees inference with MUSCLE, GBlocks, RAXML and profileNJ (script pipeline_trees_inference.snakefile - TO DO: optional step)
3. Generation of scaffolding adjacencies with BESST (script pipeline_scaffolding.snakefile - TO DO: optional step)
4. Generation of input files for DeCoSTAR (script input_decostar.snakefile)
5. Execution of DeCoSTAR and linearization of the predicted adjacencies (script run_decostar.snakefile)
6. Generation of the adjacencies graph (script create_adjacencies_graph.snakefile)
Steps 2 and 3 are optional. If you skip step 2 (the gene trees inference pipeline), you need to provide a gene trees file in the Snakemake configuration file (families: path_to_your_gene_trees_file) and to remove the line: gene_trees: path_to_gene_trees_after_inference_pipeline
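The configuration change for skipping step 2 can also be scripted. The sketch below works on the text of the configuration file and makes two assumptions that should be checked against your own file: `families` and `gene_trees` are top-level keys, and each is written on a single line. The function name `use_precomputed_gene_trees` is hypothetical.

```python
def use_precomputed_gene_trees(config_text, gene_trees_path):
    """Point `families` at an existing gene trees file and drop `gene_trees`."""
    kept = []
    families_set = False
    for line in config_text.splitlines():
        key = line.split(":", 1)[0].strip()
        if key == "gene_trees":
            continue  # remove the line produced for the inference pipeline
        if key == "families":
            line = "families: %s" % gene_trees_path
            families_set = True
        kept.append(line)
    if not families_set:
        kept.append("families: %s" % gene_trees_path)
    return "\n".join(kept) + "\n"
```

The function leaves every other line of the configuration file as it is, including comments.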
The pipeline requires 6 inputs:
- A species tree file in Newick format
- A tab-separated file composed of 2 columns (chromosome file):
  - Species name
  - Expected chromosome number
- A gene trees file in Newick or NHX format OR a gene families tab-separated file composed of 2 columns (required to use step 2 - script in progress):
  - Gene family ID
  - Gene ID
- GFF files containing exon positions on the reference genome assembly of every species present in the species tree (see GFF file format)
  => Column 9 ('attribute' in the GFF file format) corresponds to the gene ID (it must match the gene IDs present in the gene trees/families)
- The reference genome assemblies of extant species in FASTA file format (file name format: $(species_name).*)
- If scaffolding data are available (used in step 3: optional), the user has to provide a directory following the SRA architecture, e.g.:
FASTQ/RAW/Anopheles_albimanus/
├── SRX084279
│   ├── SRR314655
│   ├── SRR314656
│   └── SRR314659
├── SRX111456
│   ├── SRR389778
│   ├── SRR389781
│   ├── SRR390324
│   └── SRR390326
└── SRX200219
    └── SRR606148
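Among the inputs listed above, the two-column chromosome file is easy to get wrong (spaces instead of tabs, a missing column). The following sketch shows one way to validate it before launching the pipeline. It assumes the file has no header line and that the expected chromosome number is an integer; the function name `parse_chromosome_file` is hypothetical.

```python
def parse_chromosome_file(lines):
    """Read 'species<TAB>expected chromosome number' lines into a dict."""
    expected = {}
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError("line %d: expected 2 tab-separated columns" % number)
        species, count = fields
        expected[species] = int(count)  # raises ValueError if not an integer
    return expected
```

The function accepts any iterable of lines, so an open file handle can be passed directly, for example `parse_chromosome_file(open(path))` with `path` pointing at your chromosome file.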
In the gene trees/families and in the gene sequences FASTA file, each gene ID must be written in the format $(species_name)$separator$(gene_ID). By default, '@' is used as the separator character between the species name and the gene ID. The separator is set in the Snakemake configuration file (line: separator: '@'). For example, in the gene ID Anopheles_albimanus@AALB003958, Anopheles_albimanus is the species name and AALB003958 is the ID of the gene. Species names must not contain spaces and are commonly written in the format $genus_$species (in this case, '_' cannot be used as the separator character).
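The gene ID convention can be illustrated with a few lines of Python. This is a sketch of the naming rule only, not the code used by the pipeline, and the function name `split_gene_id` is hypothetical.

```python
def split_gene_id(full_id, separator="@"):
    """Split '$(species_name)$separator$(gene_ID)' into (species_name, gene_ID)."""
    # partition() cuts at the first occurrence of the separator.
    species, found, gene = full_id.partition(separator)
    if not found or not species or not gene:
        raise ValueError("%r does not match species%sgene" % (full_id, separator))
    return species, gene
```

With the default separator, `split_gene_id("Anopheles_albimanus@AALB003958")` returns `("Anopheles_albimanus", "AALB003958")`. With `separator="_"`, the ID `Anopheles_albimanus_AALB003958` would be cut after `Anopheles`, which is why '_' cannot serve as separator when species names contain it.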
 Duchemin, W., Anselmetti, Y., Patterson, M., Ponty, Y., Bérard, S., Chauve, C., Scornavacca, C., Daubin, V., Tannier, E. (2017). DeCoSTAR: Reconstructing the Ancestral Organization of Genes or Genomes Using Reconciled Phylogenies. Genome Biology and Evolution, 9(5), 1312–1319.
 Anselmetti, Y., Berry, V., Chauve, C., Chateau, A., Tannier, E., & Bérard, S. (2015). Ancestral gene synteny reconstruction improves extant species scaffolding. BMC Genomics, 16(Suppl 10), S11.
 Anselmetti, Y., Duchemin, W., Tannier, E., Chauve, C., & Bérard, S. (2018). Phylogenetic signal from rearrangements in 18 Anopheles species by joint scaffolding extant and ancestral genomes. BMC Genomics, 19(S2), 96.
 Chauve, C., Ponty, Y., & Zanetti, J. P. P. (2014). Evolution of genes neighborhood within reconciled phylogenies: an ensemble approach. In Lecture Notes in Computer Science (Advances in Bioinformatics and Computational Biology) (Vol. 8826 LNBI, pp. 49–56).