Automated detection of pseudogene candidates in prokaryotic genomes.
Table of contents
- Getting started
- Preparing your genome
- How does Pseudofinder detect pseudogene candidates?
- Versions and changes
- Wish list
- Citing Pseudofinder
Pseudofinder is a Python3 script that detects pseudogene candidates [https://en.wikipedia.org/wiki/Pseudogene] from annotated genbank files of bacterial and archaeal genomes.
It was tested mostly on genbank (.gbf/.gbk) files annotated by Prokka [https://github.com/tseemann/prokka] with the --compliant flag (i.e. including both /gene and /CDS annotations).
There are alternative programs for pseudogene finding and annotation (e.g. the NCBI Prokaryotic Genome Annotation Pipeline [https://www.ncbi.nlm.nih.gov/genome/annotation_prok/]), but to the best of our knowledge, none of them is open source and allows easy fine-tuning of parameters.
These instructions will (hopefully) get you the pipeline up and running on your local machine.
- pip3 (or any other way how to install python libraries, e.g. conda or easy_install)
- Databases: NCBI-NR (non-redundant) protein database (or similar such as SwissProt) formatted for blastP/blastX searches.
- An annotated prokaryotic genome in genbank (.gbf/.gbk) format.
A step by step series of commands to install all system dependencies:
Installation of python3, pip3, git (optional), and ncbi-blast+ on Ubuntu (as an administrator):
sudo apt-get update sudo apt-get install python3 sudo apt-get install python3-pip sudo apt-get install ncbi-blast+ sudo apt-get install git
Installation of 3rd party python3 libraries on Ubuntu (as an administrator):
sudo pip3 install biopython sudo pip3 install plotly sudo pip3 install pandas sudo pip3 install numpy sudo pip3 install reportlab
Alternative installation with conda (no root access required) is also possible:
#install Python3 miniconda https://conda.io/miniconda.html conda install -c conda-forge biopython conda install -c bioconda blast conda install plotly conda install pandas conda install numpy conda install reportlab
Clone the up to date pseudofinder.py code from github (no root access required):
git clone https://github.com/filip-husnik/pseudo-finder.git
Or download a stable release:
# No stable version released yet.
Preparing your genome
We recommend several rounds of Pilon polishing with Illumina reads to improve your consensus sequence [https://github.com/broadinstitute/pilon/wiki], particularly if you're interested to find pseudogenes in minION/PacBio assemblies. However, Pseudofinder can also help with finding sequencing/basecalling errors potentially breaking genes in MinION/PacBio-only assemblies.
Genomes closed into one or several circular-mapping molecules (chromosomes, plasmids, and phages) should be ideally oriented based on their origin of replication [e.g. by Ori-Finder 2; http://tubic.tju.edu.cn/Ori-Finder2/] to avoid broken genes on contigs randomly linearized by the genome assembler.
We recommend genbank (.gbf/.gbk) files generated by Prokka [https://github.com/tseemann/prokka] with the --compliant and --rfam flags. Annotating rRNAs, tRNAs, and other ncRNAs in Prokka is recommended to eliminate any false positive 'pseudogene' candidates. ORFs overlapping with non-coding RNAs such as rRNA can be sometimes misannotated in databases as 'hypothetical proteins'.
prokka --compliant --rfam contigs.fa
How does Pseudofinder detect pseudogene candidates?
This flowchart shows all steps of the Pseudofinder pipeline.
Annotate is Pseudofinder's core command. Calling this command will identify pseudogene candidates in the input genome annotation, and will produce various output files, explained in detail below.
As with any other python script, there are two ways how to run it:
# Call it directly with python3 (or just python if python3 is your default). python3 pseudofinder.py # Or make the file executable and then rely on its shebang line [#!/usr/bin/env python3]. chmod u+x ./pseudofinder.py
Providing input files:
# Run the full pipeline on 16 processors (for BlastX/BlastP searches). # Unless you have a $BLASTDB environmental variable set on your system, you have to provide a full path to the NR database. python3 pseudofinder.py annotate --genome GENOME.GBF --output PREFIX --database /PATH/TO/NR/nr --threads 16
All command line arguments:
Required arguments: -g GENOME, --genome GENOME Please provide your genome file in the genbank format. -db DATABASE, --database DATABASE Please provide name (if $BLASTB is set on your system) or absolute path of your blast database. -op OUTPREFIX, --outprefix OUTPREFIX Specify an output prefix. Adjustable parameters: -t THREADS, --threads THREADS Please provide total number of threads to use for blast, default is 4. -i INTERGENIC_LENGTH, --intergenic_length INTERGENIC_LENGTH Please provide length of intergenic regions to check, default is 30 bp. -l LENGTH_PSEUDO, --length_pseudo LENGTH_PSEUDO Please provide percentage of length for pseudo candidates, default is 0.60 (60%). Example: "-l 0.50" will consider genes that are less than 50% of the average length of similar genes. -s SHARED_HITS, --shared_hits SHARED_HITS Percentage of blast hits that must be shared in order to join two nearby regions, default is 0.30 (30%). Example: "-s 0.50" will merge nearby regions if they shared 50% of their blast hits. -e EVALUE, --evalue EVALUE Please provide e-value for blast searches. Default is 1e-4. -d DISTANCE, --distance DISTANCE Maximum distance between two regions to consider joining them. Default is 1000. -hc HITCAP, --hitcap HITCAP Maximum number of allowed hits for BLAST. Default is 15. -ce, --contig_ends Forces the program to include intergenic regions at contig ends. If not specified, the program will ignore any sequence before the first ORF and after the last ORF on a contig. -it INTERGENIC_THRESHOLD, --intergenic_threshold INTERGENIC_THRESHOLD Number of BlastX hits needed to annotate an intergenic region as a pseudogene. Calculated as a percentage of maximum number of allowed hits (--hitcap). Default is 0.3.
Output of Annotate:
Every run will produce the following files:
|[prefix]_functional.gff||Functional genes in GFF3 format.|
|[prefix]_functional.faa||Functional genes in fasta format.|
|[prefix]_intergenic.fasta||Intergenic regions in fasta format.|
|[prefix]_blastX_output.tsv||Tab-delimited output of BLASTX run on intergenic regions.|
|[prefix]_log.txt||Summary of all inputs, outputs, parameters and results.|
|[prefix]_map.pdf||Concatenated chromosome map. Input genes appear on the inner track in blue, and candidate pseudogenes are shown in red on the outer track.|
|[prefix]_proteome.faa||All protein sequences in fasta format.|
|[prefix]_blastP_output.tsv||Tab-delimited output of BLASTP run on proteome.|
|[prefix]_pseudos.gff||Candidate pseudogenes in GFF3 format.|
|[prefix]_pseudos.fasta||Candidate pseudogenes in fasta format.|
Reannotate will run the annotate workflow, beginning after the BLAST steps. This command can very quickly reannotate pseudogenes if you would like to change any parameters downstream of BLAST.
pseudofinder.py reannotate -g GENOME -p BLASTP -x BLASTX -log LOGFILE -op OUTPREFIX
One strength of Pseudofinder is its ability to be fine-tuned to the user's preferences. To help visualize the effects of changing the parameters of this program, we have provided the visualize command. This command will display how many pseudogenes will be detected based on any combination of '--length_pseudo' and '--shared_hits'. It is run by providing the blast files from the annotate command:
pseudofinder.py visualize -g GENOME -op OUTPREFIX -p BLASTP -x BLASTX -log LOGFILE
With a single command, the entire Pseudofinder workflow can be run on the 139 kbp genome of Candidatus Tremblaya princeps strain PCIT.
Simply enter the following command:
python3 pseudofinder.py test --database /PATH/TO/NR/nr
The workflow will begin immediately and write the results to a timestamped folder found in
Versions and changes
Read the ChangeLog.txt [https://github.com/filip-husnik/pseudo-finder/blob/master/ChangeLog.txt] for major changes or look at Github commits for everything else [https://github.com/filip-husnik/pseudo-finder/commits/master].
We appreciate any critical comments or suggestions for improvements. Please raise issues or submit pull requests.
This project is licensed under the GNU General Public License v3.0 - see the LICENSE.md file for details.
This code was inspired mostly by work on bacterial symbionts in early stages of becoming intracellular and strictly host-associated. This ecological shift releases selection pressure ('use it or loose it') on many genes considered essential for free-living bacteria, so relatively recent symbionts can have over 50% of their genes pseudogenized.
Basic information about bacterial pseudogenes:
Recognizing the pseudogenes in bacterial genomes: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1142405/
Taking the pseudo out of pseudogenes: https://www.ncbi.nlm.nih.gov/pubmed/25461580
Several examples from the Sodalis clade showing how important is pseudogene annotation for bacteria in a nascent stage of symbiosis:
Mobile genetic element proliferation and gene inactivation impact over the genome structure and metabolic capabilities of Sodalis glossinidius, the secondary endosymbiont of tsetse flies: https://www.ncbi.nlm.nih.gov/pubmed/20649993
A novel human-infection-derived bacterium provides insights into the evolutionary origins of mutualistic insect–bacterial symbioses: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3499248/
Genome degeneration and adaptation in a nascent stage of symbiosis: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3914690/
Repeated replacement of an intrabacterial symbiont in the tripartite nested mealybug symbiosis: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5027413/
Large scale and significant expression from pseudogenes in Sodalis glossinidius - a facultative bacterial endosymbiont: https://www.biorxiv.org/content/early/2017/07/23/124388
There are several additional features we'll try to include in the script in the near future.
Include an optional FPKM cut-off when there are RNA-Seq data available.
Improve logic for ORFs on contigs ends broken by assembly issues (e.g. metagenome-assembled genomes).
Check if the ORFs called as pseudogenes do not represent individual protein domains that can exists and evolve independently of the rest of the original multi-domain protein chain (PFAM?)
Include an optional analysis of cryptic pseudogenes based on dN/dS ratios (PAML?...) when there are closely related genomes available.
Include DIAMOND blastX and blastP as much faster alternatives to NCBI BlastX and BlastP.
Fine tune pseudogene finding for mobile elements such as transposases.
Visualize results by a scatter plot of all genes/pseudogenes (dN/dS, GC content, expression, length ratio, ...).
Sometimes ORFs are predicted by mistake on the opposite strand (e.g. in GC-rich genomes), check regions with ORFS with no blastP hits by blastX.
Please suggest any additional features here: [https://github.com/filip-husnik/pseudo-finder/issues].
Pseudofinder is developed by Mitch Syberg-Olsen & Filip Husnik at the Keeling Lab, University of British Columbia, Vancouver, Canada.
This project still a work in progress, so there is no official publication. If it was useful for your work, you can cite it as: M. Syberg-Olsen & F. Husnik 2018: Pseudofinder, GitHub repository: https://github.com/filip-husnik/pseudo-finder/.
Please also cite various dependencies used by Pseudofinder.