Fast phylogenetic placement on large reference trees

flu-crew/splicer

Splicer

Splicer scales up phylogenetic placement with EPA-ng and pplacer, making it applicable to datasets with millions of reference sequences. It performs placement in sub-linear time using a decomposition approach, without losing accuracy on very large datasets. Additionally, Splicer can automatically classify new sequences using a clades file that you define.

Installation

We recommend using a conda environment to install splicer together with epa-ng and pplacer.

If you haven't already, configure bioconda.

conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict

Then, create a new environment with the required dependencies and install splicer inside that environment.

git clone https://github.com/flu-crew/splicer.git
cd splicer
conda create -n splicer-env --file conda-requirements.txt
conda activate splicer-env
pip install .
# ... run splicer here to place/classify sequences ...
conda deactivate

Usage

To run splicer, you need:

  1. A reference tree (a maximum-likelihood or Bayesian tree with your reference sequences)
  2. A reference alignment file in FASTA format
  3. A substitution model file from RAxML. If you did not infer your tree with RAxML or do not have a log file saved, we provide an option in splicer to infer the substitution model over your reference tree (see "splicer model -h" for more details).
  4. (Optional) A clades definition file in tab-separated format, where each line has the form "<reference-name>\t<clade-name>", e.g., "CY040559<tab>1A.3.3". See below for the recommended clade-naming convention.
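
Before running the decomposition, it can help to sanity-check the clades file against the reference alignment. The following is a minimal sketch, not part of splicer: the helper names `read_clades` and `fasta_names` are hypothetical, and it assumes that reference names in the clades file match the first whitespace-delimited token of each FASTA header (splicer's own matching rule may differ).

```python
import io


def read_clades(handle):
    """Parse '<reference-name>\\t<clade-name>' lines into a dict (hypothetical helper)."""
    clades = {}
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = [f.strip() for f in line.split("\t")]
        if len(fields) != 2 or not fields[0] or not fields[1]:
            raise ValueError(f"line {line_no}: expected two tab-separated fields, got {line!r}")
        name, clade = fields
        if clades.get(name, clade) != clade:
            raise ValueError(f"line {line_no}: {name!r} is assigned to two different clades")
        clades[name] = clade
    return clades


def fasta_names(handle):
    """Collect sequence names, assuming the name is the first token of each '>' header."""
    names = set()
    for line in handle:
        if line.startswith(">") and line[1:].strip():
            names.add(line[1:].split()[0])
    return names


# In-memory demo; in practice pass open("myclades.tsv") and open("reference.fasta").
clades = read_clades(io.StringIO("CY040559\t1A.3.3\n"))
refs = fasta_names(io.StringIO(">CY040559\nATGAAGGCA\n"))
missing = sorted(set(clades) - refs)
print(f"{len(clades)} clade assignments, {len(missing)} names not found in the alignment")
```

Names reported as missing usually point to a typo or to a header format that differs between the two files.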

The first step is to decompose your reference dataset into smaller subtrees and create a scaffold tree. This is done with the decomp command in splicer:

splicer decomp -t reference.tre -s reference.fasta --clades myclades.tsv -n mydecomposition

The name of the decomposition specified with the -n flag will then be used in subsequent placement steps to tell splicer which decomposition to use.

Now, you are ready to place your new sequences with splicer (e.g., stored in file query.fasta).

splicer place -n mydecomposition -q query.fasta -m epa-ng -s raxml-info.log

If you prefer pplacer, you can use -m pplacer instead.
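
If you want to script the two steps, the commands above can be wrapped in a small Python driver. This is only a sketch: the function names are hypothetical, it uses just the flags shown in this README, and it assumes `splicer` is on your PATH (i.e., the conda environment is active).

```python
import subprocess

# The two placement back ends mentioned in this README.
PLACEMENT_METHODS = ("epa-ng", "pplacer")


def decomp_command(tree, alignment, name, clades=None):
    """Build the 'splicer decomp' argument list; the clades file is optional."""
    cmd = ["splicer", "decomp", "-t", tree, "-s", alignment, "-n", name]
    if clades is not None:
        cmd += ["--clades", clades]
    return cmd


def place_command(name, queries, model_file, method="epa-ng"):
    """Build the 'splicer place' argument list for an existing decomposition."""
    if method not in PLACEMENT_METHODS:
        raise ValueError(f"method must be one of {PLACEMENT_METHODS}, got {method!r}")
    return ["splicer", "place", "-n", name, "-q", queries, "-m", method, "-s", model_file]


def run_pipeline(tree, alignment, name, queries, model_file, clades=None, method="epa-ng"):
    """Run decomposition, then placement; raises CalledProcessError if a step fails."""
    subprocess.run(decomp_command(tree, alignment, name, clades), check=True)
    subprocess.run(place_command(name, queries, model_file, method), check=True)


print(" ".join(place_command("mydecomposition", "query.fasta", "raxml-info.log", "pplacer")))
```

Since the decomposition is referenced by name, `decomp_command` only needs to run once per reference dataset; later batches of queries can call `place_command` alone.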

After the placements are complete, you will see information about each query sequence in standard output, and you can find the final JPLACE file in <mydecomposition>/splicer.jplace.
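
JPLACE is a JSON-based format, so the result file can be inspected with the Python standard library. The sketch below picks, for each query, the candidate placement with the highest like_weight_ratio. It assumes the file follows the standard JPLACE layout ("fields", "placements" with "p" rows and "n" or "nm" names) and includes a like_weight_ratio column, as EPA-ng and pplacer output normally does; `best_placements` is a hypothetical helper, not a splicer function.

```python
import json


def best_placements(jplace):
    """Map each query name to its highest-LWR placement, as a field-name -> value dict."""
    fields = jplace["fields"]
    lwr = fields.index("like_weight_ratio")
    best = {}
    for placement in jplace["placements"]:
        top = max(placement["p"], key=lambda row: row[lwr])
        # Names are stored either as "n" (plain names) or "nm" (name, multiplicity pairs).
        names = list(placement.get("n", []))
        names += [name for name, _multiplicity in placement.get("nm", [])]
        for name in names:
            best[name] = dict(zip(fields, top))
    return best


# Tiny hand-written JPLACE document for illustration (not real splicer output).
example = json.loads("""
{
  "version": 3,
  "tree": "((A:0.1{0},B:0.1{1}):0.1{2},C:0.2{3}){4};",
  "fields": ["edge_num", "likelihood", "like_weight_ratio", "distal_length", "pendant_length"],
  "placements": [
    {"p": [[0, -1234.5, 0.2, 0.01, 0.02], [3, -1230.1, 0.8, 0.03, 0.01]], "n": ["query1"]}
  ]
}
""")
for name, row in best_placements(example).items():
    print(name, "edge", row["edge_num"], "LWR", row["like_weight_ratio"])
```

For a real run, load the file instead of the inline example, e.g. `json.load(open("mydecomposition/splicer.jplace"))`, replacing `mydecomposition` with the name you passed to -n.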

Sequence classification with Splicer

If you provide a clades definition file to Splicer at the decomposition step, it can classify your new sequences immediately upon placement. We recommend a PANGO-like clade naming format, where clades have names such as 1B.1.1.2 or BA.1.2. In this convention, clade BA is ancestral to clade BA.1, BA.1 is ancestral to BA.1.2, and so on. Splicer then infers clade names for the ancestral nodes based on this convention.
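
The dotted naming convention can be illustrated in a few lines of Python. This sketch only demonstrates the convention itself (a clade is ancestral to any clade whose name extends it by further dot-separated parts); it is not Splicer's inference code, and both function names are hypothetical.

```python
def is_ancestral(ancestor, descendant):
    """True if `ancestor` is a proper dotted prefix of `descendant`, e.g. BA -> BA.1.2."""
    a_parts = ancestor.split(".")
    d_parts = descendant.split(".")
    return len(a_parts) < len(d_parts) and d_parts[: len(a_parts)] == a_parts


def common_ancestor_clade(clades):
    """Longest dotted prefix shared by all clade names, or None if they share nothing."""
    shared = []
    for level in zip(*(name.split(".") for name in clades)):
        if len(set(level)) != 1:
            break
        shared.append(level[0])
    return ".".join(shared) or None


print(is_ancestral("BA", "BA.1"))                 # BA is ancestral to BA.1
print(is_ancestral("BA.1", "BA.10"))              # compared part by part, not as raw strings
print(common_ancestor_clade(["1B.1.1.2", "1B.1.2"]))
```

Comparing dot-separated parts rather than raw string prefixes keeps BA.1 from being treated as an ancestor of BA.10.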
