Snakemake pipeline for user-friendly use of the DeCoSTAR software

This repository contains a pipeline that produces the input data for the DeCoSTAR software [1], so that the ARt-DeCo [2], ADseq [3] and DeClone [4] algorithms can be applied to a dataset. The format of the required input files is described in the section INPUT.

Requirements

Software/Tools

Quickstart

Download the Snakemake pipeline with the following commands:

git clone https://github.com/YoannAnselmetti/DeCoSTAR_pipeline.git
cd DeCoSTAR_pipeline

The first step is to install the software and the Bio++ library required by the different steps of the pipeline (other libraries must be installed manually):

./install_dependencies.sh

Dataset example on 18 Anopheles species

The repository contains a dataset of 18 Anopheles genomes, corresponding to the dataset used in [3]. The input data files for this dataset are in the directory 18Anopheles_dataset/, and two configuration files are provided in the directory config_files/snakemake ("config_18Anopheles_WGtopo.yaml" and "config_18Anopheles_Xtopo.yaml") to apply the pipeline to the 18 Anopheles dataset with two species tree topologies. To reproduce the DeCoSTAR input data used in [3], check out commit 572d5a5 of this repository.

Adapting the pipeline to your dataset:

  1. Produce the data files described in the section INPUT from your dataset
  2. Create a Snakemake configuration file adapted to your dataset, starting from the example configuration file
  3. Set the correct configuration file path at the top of each "*.snakefile" file (configfile: "config_files/snakemake/your_config_file.yaml")

Command lines for the different steps of the Snakemake pipeline (N: number of CPUs):

snakemake --snakefile preprocessing.snakefile -j N
snakemake --snakefile input_decostar.snakefile -j N
snakemake --snakefile run_decostar.snakefile -j N
snakemake --snakefile create_adjacencies_graph.snakefile -j N
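The four commands above can also be chained from a small Python wrapper. This is only a sketch and not part of the repository: the function names build_commands and run_pipeline are hypothetical, and it assumes snakemake is on the PATH and that the script is started from the DeCoSTAR_pipeline directory.

```python
import shlex
import subprocess

# Snakefiles in the order listed above (names taken from this README).
SNAKEFILES = [
    "preprocessing.snakefile",
    "input_decostar.snakefile",
    "run_decostar.snakefile",
    "create_adjacencies_graph.snakefile",
]


def build_commands(n_cpus):
    """Return one snakemake command (as an argument list) per pipeline step."""
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    return [
        ["snakemake", "--snakefile", snakefile, "-j", str(n_cpus)]
        for snakefile in SNAKEFILES
    ]


def run_pipeline(n_cpus):
    """Run the steps in order, stopping at the first one that fails."""
    for command in build_commands(n_cpus):
        print("Running:", shlex.join(command))
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(command, check=True)


# Print the commands without executing them.
for command in build_commands(4):
    print(shlex.join(command))
```

Calling run_pipeline(4) would execute the steps for real. Stopping at the first failure keeps later steps from running on incomplete outputs.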

Pipeline

The pipeline to execute DeCoSTAR on a dataset is divided into 6 parts:

  1. Input data preprocessing to produce the GENE file and discard gene families/trees containing included genes (script preprocessing.snakefile)
  2. Gene tree inference with MUSCLE, Gblocks, RAxML and profileNJ (script pipeline_trees_inference.snakefile - TO DO: optional step)
  3. Generation of scaffolding adjacencies with BESST (script pipeline_scaffolding.snakefile - TO DO: optional step)
  4. Generation of input files for DeCoSTAR (script input_decostar.snakefile)
  5. Execution of DeCoSTAR and linearization of the predicted adjacencies (script run_decostar.snakefile)
  6. Generation of the adjacency graph (script create_adjacencies_graph.snakefile)

Steps 2 and 3 are optional. If you skip step 2 (the gene tree inference pipeline), you need to provide a gene trees file in the Snakemake configuration file (families: path_to_your_gene_trees_file) and remove the line: gene_trees: path_to_gene_trees_after_inference_pipeline
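The configuration change for skipping step 2 can be illustrated on a plain Python dictionary standing in for the loaded configuration. This is a sketch only: the keys families and gene_trees come from the text above, while the function name use_precomputed_gene_trees and the example paths are hypothetical. The real configuration files contain other entries that are not shown here.

```python
def use_precomputed_gene_trees(config, gene_trees_path):
    """Return a copy of `config` set up to skip the gene tree inference step.

    `config` is a plain dict standing in for the Snakemake configuration.
    """
    updated = dict(config)
    # Point 'families' at the gene trees file supplied by the user.
    updated["families"] = gene_trees_path
    # Drop the entry that is only meaningful when step 2 is run.
    updated.pop("gene_trees", None)
    return updated


# Hypothetical paths, for illustration only.
example = {
    "families": "data/gene_families.tsv",
    "gene_trees": "results/gene_trees_after_inference.nwk",
    "separator": "@",
}
print(use_precomputed_gene_trees(example, "data/my_gene_trees.nwk"))
```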

INPUT

The pipeline requires 6 inputs:

  • A species tree file in newick format
  • A tab-separated file composed of 2 columns (chromosome file):
    1. Species name
    2. Expected chromosome number
  • A gene trees file in newick or NHX format OR a gene families tab-separated file composed of 2 columns (required to use step 2 - script in progress):
    1. Gene family ID
    2. Gene ID
  • GFF files containing the exon positions on the reference genome assembly of every species present in the species tree (see GFF file format)
    => Column 9 ('attribute' in GFF file format) corresponds to the gene ID (it must match the gene ID present in the gene trees/families)
  • The reference genome assemblies of extant species in FASTA file format (file name format: $(species_name).*)
  • If scaffolding data are available (used in step 3: optional), the user has to provide a directory that follows the SRA architecture, e.g.:
FASTQ/RAW/Anopheles_albimanus/
├── SRX084279
│   ├── SRR314655
│   ├── SRR314656
│   └── SRR314659
├── SRX111456
│   ├── SRR389778
│   ├── SRR389781
│   ├── SRR390324
│   └── SRR390326
└── SRX200219
    └── SRR606148
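
As an illustration of the two-column chromosome file described in the list above, the following Python sketch reads such a file into a dictionary. It is not taken from the pipeline scripts: the function name read_chromosome_file is hypothetical, the species names and chromosome numbers in the example are made up, and it assumes the file has no header line.

```python
import csv
import io


def read_chromosome_file(handle):
    """Map each species name to its expected chromosome number.

    `handle` is an open text file with two tab-separated columns:
    species name, expected chromosome number.
    """
    expected = {}
    for line_number, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        # Skip empty or whitespace-only lines.
        if not row or all(not field.strip() for field in row):
            continue
        if len(row) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 columns, got {len(row)}"
            )
        species, count = row[0].strip(), row[1].strip()
        # int() raises ValueError if the second column is not a whole number.
        expected[species] = int(count)
    return expected


# Made-up content, for illustration only.
example = io.StringIO("Genus_speciesA\t3\nGenus_speciesB\t4\n")
print(read_chromosome_file(example))
```

For a file on disk, the same function can be called on open(path, newline="").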

It is important that, in the gene trees/families and gene sequence FASTA files, the gene ID is given in the format $(species_name)$separator$(gene_ID). By default, '@' is used as the separator character between the species name and the gene ID. The separator character is set in the Snakemake configuration file (line: separator: '@'). Here is an example of a gene ID: Anopheles_albimanus@AALB003958, where Anopheles_albimanus is the species name and AALB003958 is the ID of the gene. Species names must not contain spaces and are commonly written in the format $genus_$species (in this case, '_' cannot be used as the separator character).
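
The gene ID convention above can be checked with a few lines of Python. This is a minimal sketch, not the pipeline's own parsing code: the function names split_gene_id and check_separator are hypothetical, and splitting at the first occurrence of the separator is an assumption.

```python
def split_gene_id(full_id, separator="@"):
    """Split '<species_name><separator><gene_ID>' into (species_name, gene_ID).

    The split is done at the first occurrence of the separator.
    """
    species, found, gene = full_id.partition(separator)
    if not found or not species or not gene:
        raise ValueError(f"{full_id!r} is not of the form species{separator}gene")
    return species, gene


def check_separator(species_names, separator="@"):
    """Raise ValueError if the separator occurs inside a species name."""
    for name in species_names:
        if separator in name:
            raise ValueError(
                f"separator {separator!r} occurs in species name {name!r}"
            )


print(split_gene_id("Anopheles_albimanus@AALB003958"))

# With '_' as separator the species name itself gets cut in two,
# which is why '_' cannot be used with $genus_$species names.
print(split_gene_id("Anopheles_albimanus_AALB003958", separator="_"))
```

Running check_separator on the species names of the species tree before launching the pipeline would catch a clashing separator early.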

References

[1] Duchemin, W., Anselmetti, Y., Patterson, M., Ponty, Y., Bérard, S., Chauve, C., Scornavacca, C., Daubin, V., & Tannier, E. (2017). DeCoSTAR: Reconstructing the Ancestral Organization of Genes or Genomes Using Reconciled Phylogenies. Genome Biology and Evolution, 9(5), 1312–1319.

[2] Anselmetti, Y., Berry, V., Chauve, C., Chateau, A., Tannier, E., & Bérard, S. (2015). Ancestral gene synteny reconstruction improves extant species scaffolding. BMC Genomics, 16(Suppl 10), S11.

[3] Anselmetti, Y., Duchemin, W., Tannier, E., Chauve, C., & Bérard, S. (2018). Phylogenetic signal from rearrangements in 18 Anopheles species by joint scaffolding extant and ancestral genomes. BMC Genomics, 19(S2), 96.

[4] Chauve, C., Ponty, Y., & Zanetti, J. P. P. (2014). Evolution of genes neighborhood within reconciled phylogenies: an ensemble approach. In Lecture Notes in Computer Science (Advances in Bioinformatics and Computational Biology) (Vol. 8826 LNBI, pp. 49–56).
