Skip to content

bede/vespasian

Repository files navigation

DOI Tests PyPI

Vespasian

Vespasian performs genome scale detection of site and branch-site signatures of positive selection by orchestrating evolutionary hypothesis tests with PAML. Given a collection of alignments of protein-coding orthologous gene families and labelled trees, Vespasian infers gene trees from a species tree and evaluates site and lineage-specific models of evolution. Model testing is CPU-intensive but embarrassingly parallel, and can be executed on one or many machines with snakemake. Vespasian is the pure Python successor to VESPA by Webb et al. (2017).

Installation

Installing Miniconda

If the conda package manager is already installed, skip this step, otherwise:

Linux

  • Install Miniconda, following instructions and accepting default options:

    curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh

MacOS

An x86_64 Miniconda installation is required in order to install Vespasian.

  • If using a Mac with an Intel processor, skip this step. Otherwise:

    arch -x86_64 zsh
  • Install Miniconda, following instructions and accepting default options:

    curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
    bash Miniconda3-latest-MacOSX-x86_64.sh

Installing Vespasian

  • If using a Mac has an Intel processor, skip this step. Otherwise:

    arch -x86_64 zsh
  • Install Vespasian:

    curl -OJ https://raw.githubusercontent.com/bede/vespasian/master/environment.yml
    conda env create -f environment.yml
    conda activate vespasian
  • Test Vespasian:

    vespasian version

Development install

conda create -y -n vespasian-dev python=3.11 paml==4.10.6 -c conda-forge -c bioconda
conda activate vespasian-dev
git clone https://github.com/bede/vespasian
pip install --editable './vespasian[dev]'

Usage

Step 1: gene tree inference from a species tree

e.g. vespasian infer-gene-trees --warnings --progress input tree

  • Required input (please read carefully):
    • input Path to directory containing orthologous gene families as individual nucleotide alignments in fasta format with a .fasta or .fa extension. These should be in frame and free from stop codons. Fasta headers should contain a taxonomic identifier (mirroring tip labels in the tree file), optionally followed by separator character ('|' by default). A minimum of seven taxa must be present.
    • tree Path to species tree in Newick format. Tip labels must correspond to fasta headers before the separator character.
  • Output:
    • Directory (default name gene-trees) containing minimal gene trees for each family.
$ vespasian infer-gene-trees -h
usage: vespasian infer-gene-trees [-h] [-o OUTPUT] [-s SEPARATOR] [-w] [-p]
                                  input tree

Create gene trees by pruning a given species tree

positional arguments:
  input                 path to directory containing gene families
  tree                  path to newick formatted species tree

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        path to output directory (default: 'gene-trees')
  -s SEPARATOR, --separator SEPARATOR
                        character separating taxon name and identifier(s)
                        (default: '|')
  -w, --warnings        show warnings (default: False)
  -p, --progress        show progress bar (default: False)

Step 2: Configure model test environments

e.g. vespasian codeml-setup --progress --warnings --branches branches.yml input gene-trees

  • Required input:

    • input Path to directory containing aligned orthologous gene families as individual fasta files.
    • gene-trees Path to directory containing minimal gene trees.
  • Optional input:

    • —-branches BRANCHES Path to yaml file containing a YAML mapping of lineages to be labelled for evaluation of lineage-specific evolutionary signal using branch-site tests. To label an individual leaf node taxon, specify its name followed by a colon. To label an internal node, choose a suitable name (e.g. carnivora) followed by a colon and its corresponding leaf nodes inside square brackets (a sequence in yaml) and separated by commas. For internal nodes, all child nodes present in the species tree must be specified, even if they are not present in all of the gene families.

      • cat:
        carnivora: [cat, dog]
  • Output:

    • Directory (default name codeml) containing nested directory structure of models and starting parameters for each gene family.
    • File codeml-commands.sh containing list of commands to execute the model tests
    • File Snakefile for running the contents of codeml-commands.sh locally or using a cluster

    N.B. By default, at least two taxa must be present within a given family for a named internal node to be labelled. Use --strict to skip named internal nodes unless all child leaf nodes are present.

$ vespasian codeml-setup -h
usage: vespasian codeml-setup [-h] [-b BRANCHES] [-o OUTPUT]
                              [--separator SEPARATOR] [--strict] [-t THREADS]
                              [-w] [-p]
                              input gene-trees

Create suite of branch and branch-site codeml environments

positional arguments:
  input                 path to directory containing aligned gene families
  gene-trees            path to directory containing gene trees

optional arguments:
  -h, --help            show this help message and exit
  -b BRANCHES, --branches BRANCHES
                        path to yaml file containing branches to be labelled
                        (default: -)
  -o OUTPUT, --output OUTPUT
                        path to output directory (default: 'codeml')
  --separator SEPARATOR
                        character separating taxon name and identifier(s)
                        (default: '|')
  --strict              label only branches with all taxa present in tree
                        (default is >= 2) (default: False)
  -t THREADS, --threads THREADS
                        number of parallel workers (default: 6)
  -w, --warnings        show warnings (default: False)
  -p, --progress        show progress bar (default: False)

Step 3: Run models

e.g. cd codeml && snakemake --cores 8

  • Ensure codeml binary is present inside $PATH

  • Using PAML version 4.9=h01d97ff_5 from Conda is recommended

  • cd codeml (the directory created by codeml-setup in step 2)

  • Local execution (for small jobs)

    • snakemake -k --cores 8 (recommended)
    • Or, using GNU parallel (not recommended – doesn't catch errors!)
      • parallel --bar :::: codeml-commands.sh
  • Cluster execution

    • snakemake -k --cores MAXJOBS --cluster OPTIONS
    • SGE example:

Step 4: Report model tests and positively selected sites

e.g. vespasian report --progress input

  • Required input (please read carefully):
    • input path to directory (default codeml) containing models configured in step 2 and executed in step 3
  • Output:
    • Directory containing per-gene tables of likelihood ratio test results, model parameters, and positively selected sites from the highest scoring models.
$ vespasian report -h
usage: vespasian report [-h] [-o OUTPUT] [--hide] [-p] input

Perform likelihood ratio tests and and report positively selected sites

positional arguments:
  input                 path to codeml-setup output directory

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        path to output directory (default: 'report-codeml')
  --hide                hide gratuitous emperor portrait (default: False)
  -p, --progress        show progress bar (default: False)

Todo

  • Positively selected site visualisation
  • Python API
  • Specify site and/or branch-site models only
  • Renaming:
    • infer-gene-trees -> infer-trees
  • Consider B-H correction

About

Genome-scale detection of selective pressure variation

Resources

License

Stars

Watchers

Forks

Packages

No packages published