runListCompare: maximum likelihood sequence comparison, corrected for recombination

runListCompare.py and associated scripts provide a Python wrapper for generating maximum likelihood phylogenies from a list of FASTA consensus sequence files obtained by mapping to the same reference. Large numbers of samples can be handled in parallel: sequences are first clustered with similar sequences based on a SNP threshold, and a maximum likelihood tree is then calculated for each cluster using either PhyML or IQ-TREE. Recombination is corrected for using ClonalFrameML.

Requirements

Installation

  1. Download and decompress the latest release, and cd into it
  2. Install dependencies using Conda:
    • conda env create -f conda.yml
  3. Activate and test installation
    • conda activate runlistcompare
    • pytest (takes ~2 minutes)

Python 2.7 version

  1. Download and decompress the release, and cd into it
  2. conda env create -f conda_python2.yml
  3. conda activate runlistcompare2

Usage

python runListCompare.py tests/data/ec/ec.ini

Here tests/data/ec/ec.ini is an ini file containing the desired parameters. It is advisable to run the above command to check that everything works with the included demo data. Input sequences are listed in a tab-separated file; an example is provided in tests/data/ec/ec.seqlist.txt. The first column is used for the tip labels of the final trees and can be at most 8 characters long, a limit imposed by ClonalFrameML.
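
Because over-long tip labels are easy to miss, it can help to check a sequence list before starting a run. The following is a minimal sketch, not part of runListCompare: the function name check_seqlist is hypothetical, the file is assumed to have no header row, only the first column is inspected, and the check for duplicate labels is an extra assumption rather than a documented requirement.

```python
import csv
import io

MAX_LABEL_LEN = 8  # tip label limit stated above


def check_seqlist(handle):
    """Return a list of problems found in a tab-separated sequence list.

    Only the first column (the tip label) is checked; the meaning of any
    further columns is not assumed here.
    """
    problems = []
    seen = set()
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row or not row[0].strip():
            continue  # skip blank lines
        label = row[0].strip()
        if len(label) > MAX_LABEL_LEN:
            problems.append(
                f"line {line_no}: label '{label}' exceeds {MAX_LABEL_LEN} characters"
            )
        if label in seen:
            # Assumption: duplicate tip labels would be ambiguous in a tree.
            problems.append(f"line {line_no}: duplicate label '{label}'")
        seen.add(label)
    return problems


# Made-up example content; the paths in the second column are placeholders.
example = io.StringIO(
    "s1\t/path/a.fa\nsample_too_long\t/path/b.fa\ns1\t/path/c.fa\n"
)
for problem in check_seqlist(example):
    print(problem)
```

To check a real file, pass an open file handle instead of the in-memory example, e.g. open("tests/data/ec/ec.seqlist.txt", newline="").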

Important configurable parameters to consider include:

  • perACGT_cutoff: Minimum percentage of reference genome to be called in order for a sequence to be included
  • cluster_snp: Threshold for single linkage clustering by SNP distance
  • varsite_keep: Proportion of variable sites that need to be called across all sequences for the site to be retained
  • seq_keep: Proportion of variable sites that need to be called within a sequence for the sequence to be retained
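
To make the first of these parameters concrete, here is a minimal sketch of a perACGT-style filter. It is an illustration, not the code used by runListCompare: the function names are hypothetical, the cutoff is assumed to be on a 0-100 scale, and each consensus sequence is assumed to have the same length as the reference, so that the percentage of A/C/G/T calls in the sequence equals the percentage of the reference that was called.

```python
def percent_acgt(sequence):
    """Percentage of positions that are an unambiguous A, C, G or T."""
    if not sequence:
        return 0.0
    called = sum(1 for base in sequence.upper() if base in "ACGT")
    return 100.0 * called / len(sequence)


def filter_by_acgt(sequences, cutoff):
    """Keep sequences whose called percentage is at least `cutoff` (0-100)."""
    return {
        name: seq
        for name, seq in sequences.items()
        if percent_acgt(seq) >= cutoff
    }


# Made-up toy sequences; N marks an uncalled position.
sequences = {"s1": "ACGTACGTAC", "s2": "ACNNNNNNNN", "s3": "ACGTNNACGT"}
kept = filter_by_acgt(sequences, cutoff=70)
print(sorted(kept))  # s1 (100%) and s3 (80%) pass; s2 (20%) is dropped
```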

Output files

  • align_snps.fa, align_positions.txt and align-compare.txt contain the variable sites, their positions in the reference genome and the raw pairwise SNP differences between samples, respectively
  • the cluster folder contains variable site alignments for each cluster of related samples
  • the cluster_ml folder contains the output maximum likelihood phylogenies, e.g. cluster_1_phyml_tree_scaled.tree is the PhyML-generated tree and cluster_1_cf_scaled.tree is the ClonalFrameML-corrected tree, both scaled to have branch lengths in SNPs
  • the recomb_corr folder contains alignments of variable sites with recombination removed, for use as input to other software, e.g. BEAST
  • the ML_distances.txt and CF_distances.txt files contain the pairwise distances obtained from the maximum likelihood and ClonalFrameML phylogenies
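
To illustrate what a raw pairwise SNP difference and single linkage clustering at a SNP threshold mean, here is a minimal sketch on toy data. It is not the implementation used by runListCompare: the function names are hypothetical, positions where either sequence lacks an A/C/G/T call are assumed to be ignored, and two samples are assumed to be linked when their distance is less than or equal to the threshold.

```python
from itertools import combinations


def snp_distance(a, b):
    """Count positions where both sequences have a called base and differ."""
    return sum(
        1 for x, y in zip(a, b) if x != y and x in "ACGT" and y in "ACGT"
    )


def single_linkage_clusters(seqs, threshold):
    """Group samples so that any pair within `threshold` SNPs shares a cluster.

    Clusters are built with a simple union-find, so samples can be chained
    together through intermediates (the defining feature of single linkage).
    """
    parent = {name: name for name in seqs}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for a, b in combinations(seqs, 2):
        if snp_distance(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in seqs:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Made-up variable-site alignment.
seqs = {"s1": "AAAAAA", "s2": "AAAAAT", "s3": "AAAATT", "s4": "CCCCCC"}
print(snp_distance(seqs["s1"], seqs["s3"]))
print(single_linkage_clusters(seqs, threshold=1))
```

In this toy example s1 and s3 differ by 2 SNPs, which is above the threshold of 1, yet they end up in the same cluster because each is within 1 SNP of s2.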

David Eyre & Bede Constantinides
david.eyre@bdi.ox.ac.uk
17 April 2019
