pangolin

Phylogenetic Assignment of Named Global Outbreak LINeages

Requirements

Pangolin runs on macOS and Linux. The conda environment recipe may not build on Windows (I haven't tested it), but pangolin can be run using the Windows Subsystem for Linux.

Some version of conda (we use Miniconda3)
Your query fasta file

Install pangolin

1. Clone this repository and cd pangolin
2. conda env create -f environment.yml
3. conda activate pangolin
4. python setup.py install or pip install .
5. That's it

Note: we recommend using pangolin in the conda environment specified in the environment.yml file as per the instructions above. If you can't use conda for some reason, bear in mind the data files are now hosted in a separate repository at hCoV-2019/lineages and you will need to pip install that alongside the other dependencies for pangolin (details found in environment.yml).

Updating pangolin

Note: pangolin is being worked on intensively, so even if you have installed it previously, we recommend you check for updates before running.

To update:

1. conda activate pangolin
2. git pull (pulls the latest changes from GitHub)
3. python setup.py install (re-installs pangolin)
4. pip install git+https://github.com/hCoV-2019/lineages.git --upgrade (updates the data if there is a new data release)
5. conda env update -f environment.yml (updates the conda environment; you're unlikely to need to do this, but just in case!)

Usage

1. Activate the environment: conda activate pangolin
2. Run: pangolin <query>

pangolin: Phylogenetic Assignment of Named Global Outbreak LINeages

positional arguments:
  query

optional arguments:
  -h, --help                  show this help message and exit
  -o OUTDIR, --outdir OUTDIR  Output directory
  -d DATA, --data DATA        Data directory minimally containing a fasta alignment
                              and guide tree
  -n, --dry-run               Go through the motions but don't actually run
  -f, --force                 Overwrite all output
  --tempdir TEMPDIR           Specify where you want the temp stuff to go. Default:
                              $TMPDIR
  --panGUIlin                 Run web-app version of pangolin
  --max-ambig MAXAMBIG        Maximum proportion of Ns allowed for pangolin to
                              attempt assignment. Default: 0.5
  --min-length MINLEN         Minimum query length allowed for pangolin to attempt
                              assignment. Default: 10000
  -t THREADS, --threads THREADS
                              Number of threads
  -v, --version               show program's version number and exit
  -lv, --lineages-version     show lineages's version number and exit
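
When pangolin is driven from a script, the options above can be assembled programmatically. The following is a minimal sketch, assuming pangolin is installed and available on the PATH of the active conda environment. The helper name build_pangolin_command and the example file names are hypothetical; only flags listed in the help text above are used.

```python
import shlex
from typing import List, Optional


def build_pangolin_command(
    query: str,
    outdir: Optional[str] = None,
    max_ambig: Optional[float] = None,
    min_length: Optional[int] = None,
    threads: Optional[int] = None,
) -> List[str]:
    """Assemble an argument list for the pangolin CLI (hypothetical helper)."""
    cmd = ["pangolin", query]
    if outdir is not None:
        cmd += ["--outdir", outdir]
    if max_ambig is not None:
        # --max-ambig is documented as a proportion of Ns, so keep it in [0, 1].
        if not 0.0 <= max_ambig <= 1.0:
            raise ValueError("max_ambig is a proportion and must lie in [0, 1]")
        cmd += ["--max-ambig", str(max_ambig)]
    if min_length is not None:
        cmd += ["--min-length", str(min_length)]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


if __name__ == "__main__":
    # Example file and directory names are placeholders.
    cmd = build_pangolin_command(
        "my_sequences.fasta", outdir="results", max_ambig=0.3, threads=4
    )
    print(shlex.join(cmd))
    # To actually run it (requires pangolin to be installed):
    # import subprocess
    # subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.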

Output

Your output will be a csv file giving the taxon name and the lineage assigned, with one line for each sequence in the fasta file provided.

Example:

Taxon	Lineage	aLRT	UFbootstrap	lineages_version	status	note
Virus1	B.1	80	82	2020-04-27	passed_qc
Virus2	A.1	65	95	2020-04-27	passed_qc
Virus3	A.3	100	100	2020-04-27	passed_qc
Virus4	B.1.4	82	73	2020-04-27	passed_qc
Virus5	None	0	0	2020-04-27	fail	N_content:0.80
Virus6	None	0	0	2020-04-27	fail	seq_len:0

Resources for interpreting the aLRT and UFbootstrap output can be found here and here.
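
The output can be post-processed with a few lines of Python. This is a sketch only: it assumes the file is comma-delimited with the column names shown in the example above, and the support thresholds passed to low_support are illustrative values, not recommendations from pangolin. The function names and the in-memory sample are hypothetical.

```python
import csv
import io
from typing import Dict, List, Tuple

Row = Dict[str, str]

# A few rows from the example above, written as comma-separated text (assumed layout).
SAMPLE = """Taxon,Lineage,aLRT,UFbootstrap,lineages_version,status,note
Virus1,B.1,80,82,2020-04-27,passed_qc,
Virus3,A.3,100,100,2020-04-27,passed_qc,
Virus5,None,0,0,2020-04-27,fail,N_content:0.80
"""


def read_assignments(handle, delimiter: str = ",") -> List[Row]:
    """Read pangolin-style output rows into dictionaries keyed by column name."""
    return list(csv.DictReader(handle, delimiter=delimiter))


def split_by_status(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Separate rows that passed QC from those that did not."""
    passed = [r for r in rows if r["status"] == "passed_qc"]
    failed = [r for r in rows if r["status"] != "passed_qc"]
    return passed, failed


def low_support(rows: List[Row], min_alrt: float, min_ufboot: float) -> List[str]:
    """Return taxon names whose aLRT or UFbootstrap falls below the given cut-offs."""
    return [
        r["Taxon"]
        for r in rows
        if float(r["aLRT"]) < min_alrt or float(r["UFbootstrap"]) < min_ufboot
    ]


if __name__ == "__main__":
    rows = read_assignments(io.StringIO(SAMPLE))
    passed, failed = split_by_status(rows)
    print("passed:", [r["Taxon"] for r in passed])
    print("failed:", [(r["Taxon"], r["note"]) for r in failed])
    # Thresholds here are placeholders chosen for illustration.
    print("check by hand:", low_support(passed, min_alrt=80, min_ufboot=95))
```

To read a real output file, pass an open file handle to read_assignments in place of the in-memory sample.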

Recall rate

Of 9,843 GISAID sequences assigned lineages by hand (taking sequence, phylogeny and metadata into account), pangolin accurately assigns the lineage of 97.85% of those sequences. Of the sequences that were not recalled correctly, 74.5% had UFbootstrap and aLRT values of 0. We're continuing to work to improve this recall rate, but recommend interpreting the pangolin output cautiously, with due attention to the UFbootstrap and aLRT values.

Given that hCoV-2019 is relatively slow evolving for an RNA virus and there is still not a huge amount of diversity, missing or ambiguous data at key residues may lead to incorrect placement within the guide tree. We have a filter in place that by default will not call a lineage for any sequence with >50% N-content, but this can be made more conservative with the command line option --max-ambig.
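
The filter described above can be illustrated with a short sketch. This is not pangolin's own code: the function names are hypothetical, the order of the two checks is an assumption, and the note strings simply mimic the format shown in the example output. The defaults mirror the documented values for --max-ambig (0.5) and --min-length (10000).

```python
from typing import Tuple


def n_content(sequence: str) -> float:
    """Proportion of N characters in a sequence (0.0 for an empty sequence)."""
    if not sequence:
        return 0.0
    return sequence.upper().count("N") / len(sequence)


def qc_status(
    sequence: str, max_ambig: float = 0.5, min_length: int = 10000
) -> Tuple[bool, str]:
    """Return (passes, note) for a query sequence under the two thresholds."""
    # Length is checked first here; that ordering is an assumption of this sketch.
    if len(sequence) < min_length:
        return False, f"seq_len:{len(sequence)}"
    proportion = n_content(sequence)
    # Only sequences with strictly more than max_ambig Ns are rejected (">50%").
    if proportion > max_ambig:
        return False, f"N_content:{proportion:.2f}"
    return True, ""


if __name__ == "__main__":
    half_n = "N" * 5000 + "A" * 5000
    print(qc_status(half_n))                 # evaluated with the default threshold
    print(qc_status(half_n, max_ambig=0.3))  # evaluated with a stricter threshold
```

Lowering max_ambig makes the filter more conservative, which corresponds to passing a smaller value to --max-ambig on the command line.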

Source data

pangolin runs using a guide tree and alignment hosted at hCoV-2019/lineages. Some of this data is sourced from GISAID, but anonymised and encrypted to fit with guidelines. Appropriate permissions have been given and acknowledgements for the teams that have worked to provide the original SARS-CoV-2 genome sequences to GISAID are also hosted in hCoV-2019/lineages.

Authors

Pangolin was created by Áine O'Toole and JT McCrone. It uses lineages from Rambaut et al.

References

The following external software is run as part of pangolin:

iqtree

L.-T. Nguyen, H.A. Schmidt, A. von Haeseler, B.Q. Minh (2015) IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies. Mol. Biol. Evol., 32:268-274. https://doi.org/10.1093/molbev/msu300

D.T. Hoang, O. Chernomor, A. von Haeseler, B.Q. Minh, L.S. Vinh (2018) UFBoot2: Improving the ultrafast bootstrap approximation. Mol. Biol. Evol., 35:518–522. https://doi.org/10.1093/molbev/msx281

mafft

K. Katoh, D.M. Standley (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol., 30:772-780.

snakemake

J. Köster, S. Rahmann (2012) Snakemake - A scalable bioinformatics workflow engine. Bioinformatics.
