Vespasian performs genome-scale detection of site and branch-site signatures of positive selection by orchestrating evolutionary hypothesis tests with PAML. Given a collection of alignments of protein-coding orthologous gene families and labelled trees, Vespasian infers gene trees from a species tree and evaluates site- and lineage-specific models of evolution. Model testing is CPU-intensive but embarrassingly parallel, and can be executed on one or many machines with snakemake. Vespasian is the pure Python successor to VESPA by Webb et al. (2017).
If the conda package manager is already installed, skip this step. Otherwise:
Linux
- Install Miniconda, following instructions and accepting default options:
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
MacOS
An x86_64 Miniconda installation is required in order to install Vespasian.
- If using a Mac with an Intel processor, skip this step. Otherwise:
arch -x86_64 zsh
- Install Miniconda, following instructions and accepting default options:
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
bash Miniconda3-latest-MacOSX-x86_64.sh
- If using a Mac with an Intel processor, skip this step. Otherwise:
arch -x86_64 zsh
- Install Vespasian:
curl -OJ https://raw.githubusercontent.com/bede/vespasian/master/environment.yml
conda env create -f environment.yml
conda activate vespasian
- Test Vespasian:
vespasian version
conda create -y -n vespasian-dev python=3.11 paml==4.10.6 -c conda-forge -c bioconda
conda activate vespasian-dev
git clone https://github.com/bede/vespasian
pip install --editable './vespasian[dev]'
e.g. vespasian infer-gene-trees --warnings --progress input tree
- Required input (please read carefully):
  - input: Path to directory containing orthologous gene families as individual nucleotide alignments in fasta format with a .fasta or .fa extension. These should be in frame and free from stop codons. Fasta headers should contain a taxonomic identifier (mirroring tip labels in the tree file), optionally followed by a separator character ('|' by default). A minimum of seven taxa must be present.
  - tree: Path to species tree in Newick format. Tip labels must correspond to fasta headers before the separator character.
- Output:
  - Directory (default name gene-trees) containing minimal gene trees for each family.
$ vespasian infer-gene-trees -h
usage: vespasian infer-gene-trees [-h] [-o OUTPUT] [-s SEPARATOR] [-w] [-p]
input tree
Create gene trees by pruning a given species tree
positional arguments:
input path to directory containing gene families
tree path to newick formatted species tree
optional arguments:
-h, --help show this help message and exit
-o OUTPUT, --output OUTPUT
path to output directory (default: 'gene-trees')
-s SEPARATOR, --separator SEPARATOR
character separating taxon name and identifier(s)
(default: '|')
-w, --warnings show warnings (default: False)
-p, --progress show progress bar (default: False)
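The input requirements above are easy to get wrong, so it can help to check a gene family before running infer-gene-trees. The following is a minimal sketch, not part of Vespasian: `read_fasta` and `check_family` are hypothetical helper names, the standard genetic code is assumed for stop codons, and the caller is assumed to supply the species tree tip labels as a list.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def read_fasta(text):
    """Parse FASTA text into a dict of header -> uppercase sequence."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {h: "".join(parts) for h, parts in records.items()}


def check_family(text, tip_labels, separator="|", min_taxa=7):
    """Return a list of problems found in one alignment (empty if none)."""
    problems = []
    records = read_fasta(text)
    # The taxon name is the part of the header before the separator.
    taxa = {h.split(separator)[0] for h in records}
    if len(taxa) < min_taxa:
        problems.append(f"only {len(taxa)} taxa (minimum {min_taxa})")
    for unknown in sorted(taxa - set(tip_labels)):
        problems.append(f"taxon '{unknown}' not found in species tree")
    for header, seq in records.items():
        if len(seq) % 3:
            problems.append(f"{header}: length {len(seq)} is not a multiple of 3")
        # Only complete codons are inspected for stops.
        codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
        if any(codon in STOP_CODONS for codon in codons):
            problems.append(f"{header}: contains a stop codon")
    return problems
```

Calling `check_family(open("family.fasta").read(), tip_labels)` on each file in the input directory (the file name here is only illustrative) returns an empty list when the sketch finds nothing to complain about.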
e.g. vespasian codeml-setup --progress --warnings --branches branches.yml input gene-trees
- Required input:
  - input: Path to directory containing aligned orthologous gene families as individual fasta files.
  - gene-trees: Path to directory containing minimal gene trees.
- Optional input:
  - --branches BRANCHES: Path to a file containing a YAML mapping of lineages to be labelled for evaluation of lineage-specific evolutionary signal using branch-site tests. To label an individual leaf node taxon, specify its name followed by a colon. To label an internal node, choose a suitable name (e.g. carnivora) followed by a colon and its corresponding leaf nodes inside square brackets (a sequence in YAML), separated by commas. For internal nodes, all child nodes present in the species tree must be specified, even if they are not present in all of the gene families. For example:
    cat:
    carnivora: [cat, dog]
- Output:
  - Directory (default name codeml) containing nested directory structure of models and starting parameters for each gene family.
  - File codeml-commands.sh containing list of commands to execute the model tests.
  - File Snakefile for running the contents of codeml-commands.sh locally or using a cluster.
N.B. By default, at least two taxa must be present within a given family for a named internal node to be labelled. Use --strict to skip named internal nodes unless all child leaf nodes are present.
$ vespasian codeml-setup -h
usage: vespasian codeml-setup [-h] [-b BRANCHES] [-o OUTPUT]
[--separator SEPARATOR] [--strict] [-t THREADS]
[-w] [-p]
input gene-trees
Create suite of branch and branch-site codeml environments
positional arguments:
input path to directory containing aligned gene families
gene-trees path to directory containing gene trees
optional arguments:
-h, --help show this help message and exit
-b BRANCHES, --branches BRANCHES
path to yaml file containing branches to be labelled
(default: -)
-o OUTPUT, --output OUTPUT
path to output directory (default: 'codeml')
--separator SEPARATOR
character separating taxon name and identifier(s)
(default: '|')
--strict label only branches with all taxa present in tree
(default is >= 2) (default: False)
-t THREADS, --threads THREADS
number of parallel workers (default: 6)
-w, --warnings show warnings (default: False)
-p, --progress show progress bar (default: False)
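To make the branch file format and the labelling rule concrete, here is a small sketch. It is not Vespasian's implementation: `format_branches` and `labellable` are hypothetical names, and the default rule is read here as "at least two of the node's listed child taxa are present in the family", which is an interpretation of the note above.

```python
def format_branches(branches):
    """Render {name: None or [children]} as the YAML mapping described above."""
    lines = []
    for name, children in branches.items():
        if children:
            # Internal node: name followed by its leaf nodes as a YAML sequence.
            lines.append(f"{name}: [{', '.join(children)}]")
        else:
            # Leaf node: name followed by a colon only.
            lines.append(f"{name}:")
    return "\n".join(lines) + "\n"


def labellable(children, family_taxa, strict=False):
    """Decide whether a named internal node would be labelled in one family.

    Default: at least two of the listed child taxa are present.
    strict:  every listed child taxon is present.
    """
    present = set(children) & set(family_taxa)
    if strict:
        return present == set(children)
    return len(present) >= 2
```

For instance, `format_branches({"cat": None, "carnivora": ["cat", "dog"]})` yields the two-line mapping used as the example for --branches, which can then be written to branches.yml.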
e.g. cd codeml && snakemake --cores 8
- Ensure the codeml binary is present in $PATH
  - Using PAML version 4.9=h01d97ff_5 from Conda is recommended
- cd codeml (the directory created by codeml-setup in step 2)
- Local execution (for small jobs)
  - snakemake -k --cores 8 (recommended)
  - Or, using GNU parallel (not recommended, doesn't catch errors!): parallel --bar :::: codeml-commands.sh
- Cluster execution
  - snakemake -k --cores MAXJOBS --cluster OPTIONS
  - SGE example: snakemake -k --jobs 100 --cluster "qsub -cwd -V" --max-status-checks-per-second 0.1
  - Oxford Rescomp: qsub -cwd -V -P bag.prjc -q short.qc
  - Profiles are available for other cluster platforms
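Before launching a long run it may be worth confirming that the required executables can actually be found. A minimal sketch using only the Python standard library follows; `missing_tools` is a hypothetical helper, not a Vespasian command.

```python
import shutil


def missing_tools(tools=("codeml", "snakemake")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An empty list from `missing_tools()` means both executables were located; anything else names what still needs to be installed or added to $PATH in the active environment.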
e.g. vespasian report --progress input
- Required input (please read carefully):
  - input: path to directory (default codeml) containing models configured in step 2 and executed in step 3
- Output:
  - Directory containing per-gene tables of likelihood ratio test results, model parameters, and positively selected sites from the highest scoring models.
$ vespasian report -h
usage: vespasian report [-h] [-o OUTPUT] [--hide] [-p] input
Perform likelihood ratio tests and and report positively selected sites
positional arguments:
input path to codeml-setup output directory
optional arguments:
-h, --help show this help message and exit
-o OUTPUT, --output OUTPUT
path to output directory (default: 'report-codeml')
--hide hide gratuitous emperor portrait (default: False)
-p, --progress show progress bar (default: False)
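For readers unfamiliar with likelihood ratio tests, the calculation behind a single comparison can be sketched as follows. This illustrates the general technique only and is not taken from Vespasian's code: `lrt_pvalue` is a hypothetical name, the log-likelihoods are assumed to come from a nested null and alternative model, and the degrees of freedom must be supplied by the caller because this section does not state which model pairs are compared.

```python
import math


def lrt_pvalue(lnl_null, lnl_alt, df):
    """Return (statistic, p-value) for a likelihood ratio test.

    The statistic is 2 * (lnL_alt - lnL_null), compared with a chi-square
    distribution. Closed-form tail probabilities are used, so only
    df of 1 or 2 is handled in this sketch.
    """
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    if df == 1:
        # Chi-square (1 df) upper tail equals erfc(sqrt(x / 2)).
        return stat, math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        # Chi-square (2 df) upper tail equals exp(-x / 2).
        return stat, math.exp(-stat / 2.0)
    raise ValueError("this sketch only handles df of 1 or 2")
```

With made-up log-likelihoods of -1000.0 for the null and -997.0 for the alternative, the statistic is 6.0, and with 2 degrees of freedom the p-value is exp(-3).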
- Positively selected site visualisation
- Python API
- Specify site and/or branch-site models only
- Renaming:
  - infer-gene-trees -> infer-trees
- Consider B-H correction
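As a note on the last item, Benjamini-Hochberg correction could be applied to the per-family p-values after the fact. The sketch below shows the standard step-up adjustment in plain Python; it is not a current Vespasian feature, and `benjamini_hochberg` is a hypothetical name.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Families whose adjusted value falls below the chosen false discovery rate would then be reported as significant.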