5 FuncSanity
FuncSanity uses prodigal, kofamscan, interproscan, PROKKA, VirSorter, psortb, and KEGGDecoder to structurally and functionally annotate contig data.
FuncSanity uses a config file and a type file to run its analysis. Its pipeline is generated from the following programs:
Prodigal is used to search for ORFs within the user-provided contigs. Predicted proteins are used in the downstream annotation analysis.
PROKKA provides whole-genome annotation of microbial genomes. Users may incorporate PROKKA annotations into their Prodigal gene calls, or may choose to use PROKKA gene calls in all downstream analyses. RNAmmer 1.2 can optionally be integrated to detect rRNA sequences in place of Barrnap, which is packaged with PROKKA. Please see our note about RNAmmer and installation of this program. SignalP 4.1 may also be installed as part of the PROKKA annotation pipeline.
This analysis is optional. To remove it, comment out the [PROKKA] section of the config file.
KofamScan assigns KO numbers to predicted coding sequences. This information is fed into KEGG-Decoder v.1.0.10 to determine completeness of certain predetermined metabolic pathways.
This analysis is optional. To remove it, comment out the [KOFAMSCAN] and [BIODATA] sections of the config file.
InterProScan allows for customizable protein domain identification.
This analysis is optional. To include it, un-comment the [INTERPROSCAN] section of the config file.
VirSorter predicts viral signatures. Contigs with a predicted viral signature are stored. VirSorter provides phage and prophage predictions with varying confidence levels.
This analysis is optional. To remove it, comment out the [VIRSORTER] section of the config file.
The extracellular peptidase annotation pipeline determines putative peptidases with an HMMER (v3.1b2) search against the MEROPS database. Putative peptidases are screened for predicted extracellular localization using PSORTb v.3.0. Optionally, SignalP v4.1 can be integrated to identify proteins lacking PSORTb localization. Please see our note about SignalP and installation of this program. Predicted proteins will be assigned CAZy identifiers based on an HMMER search of the dbCAN database.
This analysis is optional. To remove it entirely, ensure that the [CAZY], [MEROPS], [SIGNALP], and [PSORTB] sections are commented out.
The extracellularity prediction is also optional. To include it, ensure that the [CAZY], [MEROPS], and [PSORTB] sections are all un-commented. [SIGNALP] may be optionally included by un-commenting its section.
A complete annotation pipeline will take approx. 1-2 hours per genome to complete, with the longest annotations occurring with use of InterProScan's PANTHER database.
Some proteins may generate multiple annotations, particularly within interproscan and prokka. All overlapping regions which are identified with the same annotation are reduced to a single non-overlapping span and are stored in the annotation database.
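The reduction of overlapping hits can be pictured as a standard interval merge applied separately to each annotation label. The sketch below is illustrative only and is not FuncSanity's own code; the `(annotation, start, end)` tuple layout, the function name `merge_spans`, and the example coordinates are all assumptions.

```python
from collections import defaultdict


def merge_spans(hits):
    """Collapse overlapping (annotation, start, end) hits into merged spans.

    Hypothetical helper: hits that share an annotation and overlap in
    coordinates are merged; hits with different annotations never are.
    """
    by_annotation = defaultdict(list)
    for annotation, start, end in hits:
        by_annotation[annotation].append((start, end))

    merged = {}
    for annotation, spans in by_annotation.items():
        spans.sort()
        result = [spans[0]]
        for start, end in spans[1:]:
            last_start, last_end = result[-1]
            if start <= last_end:  # overlaps the previous span: extend it
                result[-1] = (last_start, max(last_end, end))
            else:  # no overlap: begin a new span
                result.append((start, end))
        merged[annotation] = result
    return merged


# Made-up coordinates for illustration.
hits = [
    ("PF07596", 10, 120),
    ("PF07596", 100, 180),
    ("PF07596", 300, 350),
    ("PF05569", 50, 90),
]
merged = merge_spans(hits)
print(merged)
```

Here the two overlapping `PF07596` hits become one span, while the distant third hit and the differently labeled hit are kept separate.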
The raw output from these pipelines is voluminous! However, these blog posts provide detailed instructions on using BioMetaDB to query the results, and also provide an explanation of the raw outputs of this program.
Running FuncSanity requires a configuration file, a directory of genomes, and an optional type file.
- Required arguments
  - --directory (-d): directory of fasta files
  - --config_file (-c): config.ini file matching template in Sample/Config
- Optional flags
  - --prokka (-p): Use prokka gene calls in downstream analysis
  - --cancel_autocommit (-a): Cancel creation/update of BioMetaDB project
- Optional arguments
  - --output_directory (-o): Output prefix
  - --biometadb_project (-b): Name to assign to BioMetaDB project, or name of existing project to use
  - --type_file (-t): type_file, formatted as 'file_name.fna\t[Archaea/Bacteria]\t[gram+/gram-]\n'
    - This argument is only required if running the peptidase portion of the pipeline on anything other than gram- bacteria.
Each file within the folder passed to --directory should contain a single genome/MAG/SAG in simple FASTA format:
>MAG_contig_1
ATCGAAA
>MAG_contig_2
TTTAGCAA
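Before starting a long run, it can help to confirm that each input file really is simple FASTA. The following is a minimal sketch of such a check, not part of MetaSanity; the function name `read_fasta` is hypothetical, and it assumes that the first whitespace-delimited token of each header is the contig name.

```python
def read_fasta(lines):
    """Parse simple FASTA lines into {contig_name: sequence}.

    Raises ValueError for an empty header or for sequence data that
    appears before the first '>' header line.
    """
    records = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("empty FASTA header")
            header = parts[0]
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[header] += line.upper()
    return records


example = [">MAG_contig_1", "ATCGAAA", ">MAG_contig_2", "TTTAGCAA"]
contigs = read_fasta(example)
print(contigs)
```

An open file handle can be passed in place of the list, since the function only iterates over lines.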
The default settings run searches for gram-negative bacteria, but users may also search for gram-positive bacteria and archaea.
This information for the relevant genomes should be provided in a separate file and passed to MetaSanity from the command line using the -t flag.
Pipeline searches can be run with any combination of gram +/- bacteria/archaea. The file should contain the following fields, separated by tabs, with one line per relevant fasta file passed to the pipeline:
[fasta-file]\t[Bacteria/Archaea]\t[gram+/gram-]\n
Example:
example-fasta-file.fna\tArchaea\tgram+\n
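A type file in this layout is easy to check before a run. The sketch below parses and validates the three tab-separated fields; it is an illustration rather than the parser MetaSanity uses, the name `parse_type_file` is hypothetical, and the exact-case matching of field values is an assumption.

```python
VALID_DOMAINS = {"Bacteria", "Archaea"}
VALID_GRAMS = {"gram+", "gram-"}


def parse_type_file(lines):
    """Return {fasta_file: (domain, gram)} from tab-separated type-file lines."""
    types = {}
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {number}: expected 3 tab-separated fields, got {len(fields)}"
            )
        fasta_file, domain, gram = fields
        if domain not in VALID_DOMAINS or gram not in VALID_GRAMS:
            raise ValueError(f"line {number}: unrecognized value in {fields!r}")
        types[fasta_file] = (domain, gram)
    return types


types = parse_type_file(["example-fasta-file.fna\tArchaea\tgram+\n"])
# Genomes missing from the type file fall back to the documented default
# of gram-negative bacteria.
fallback = types.get("other-genome.fna", ("Bacteria", "gram-"))
print(types, fallback)
```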
This file can be automatically generated using the generate-typefile.py script in the MetaSanity/Accessories/ folder of the MetaSanity installation.
Assuming that the directory contents are on your path, and that you have already completed PhyloSanity, generate the type file typefile.list by running the command generate-typefile.py MSResults and updating the resulting file with membrane information.
The FuncSanity default config file allows for program-level flags to be provided. Note that individual flags (e.g. those that are passed without arguments) are set using FLAGS, as a comma-separated list.
Users may select which portions of the FuncSanity pipeline they wish to run. FuncSanity determines valid sections from un-commented sections of the user-provided config file and builds its pipeline accordingly.
The FuncSanity config file is divided into sections representing available annotation steps.
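To preview which sections a given config would activate, the file can be read with Python's standard `configparser`, which skips lines beginning with `#`. This is a sketch for inspection only; FuncSanity's own config handling may differ, and the shortened `CONFIG_TEXT` here is an abbreviated stand-in for a real config file.

```python
import configparser

# Abbreviated stand-in for a FuncSanity config file.
CONFIG_TEXT = """\
[PRODIGAL]
PATH = prodigal
-p = meta
FLAGS = -m

#[INTERPROSCAN]
#PATH = /path/To/interproscan.sh

[VIRSORTER]
PATH = /path/to/virsorter
"""

parser = configparser.ConfigParser()
parser.optionxform = str  # keep option names such as PATH and -p as written
parser.read_string(CONFIG_TEXT)

# Commented-out sections are treated as comments and never appear here.
active = parser.sections()
print(active)
```

For a file on disk, `parser.read("FuncSanity.ini")` can be used in place of `read_string`.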
The Docker config file sections come pre-populated with the proper path arguments, and should only be modified with additional flags or by commenting out unwanted sections. Source code users must adjust PATH, DATA, and DATA_DICT values accordingly.
By default, the sections corresponding to SignalP, PSORTb, and InterProScan are commented out (see below) and will not run. If we wish to incorporate these analyses, we can uncomment these sections from the config file.
- Location (Conda): MetaSanity/build/Config/Conda/FuncSanity.ini
- Location (Docker): Config/Docker(or SourceCode)/FuncSanity.ini
# Conda/FuncSanity.ini
# Default config file for running the FuncSanity pipeline
# Users are recommended to edit copies of this file only
# - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# The following **MUST** be set
[PRODIGAL]
PATH = prodigal
-p = meta
FLAGS = -m
[HMMSEARCH]
PATH = hmmsearch
-T = 75
[HMMCONVERT]
PATH = hmmconvert
[HMMPRESS]
PATH = hmmpress
[BIOMETADB]
PATH = dbdm
--db_name = MSResults
FLAGS = -s
[DIAMOND]
PATH = diamond
--threads = 1
# - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# The following pipe sections may optionally be set
# Ensure that the entire pipe section is valid,
# or deleted/commented out, prior to running pipeline
# - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Peptidase annotation
[CAZY]
DATA = /path/to/peptidase/dbCAN-fam-HMMs.txt
[MEROPS]
DATA = /path/to/peptidase/MEROPS.pfam.hmm
DATA_DICT = /path/to/peptidase/merops-as-pfams.txt
#[SIGNALP]
#PATH = /path/To/signalp
#[PSORTB]
#PATH = /path/To/psortb
# - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# KEGG pathway annotation
[KOFAMSCAN]
PATH = exec_annotation
FLAGS = -p,/path/to/kofamscan/profiles,-k,/path/to/kofamscan/ko_list
--cpu = 1
[BIODATA]
PATH = KEGG-decoder
--vizoption = interactive
# - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# PROKKA
[PROKKA]
PATH = prokka
FLAGS = --addgenes,--addmrna,--usegenus,--metagenome,--rnammer
--evalue = 1e-10
--cpus = 1
# - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# InterproScan
#[INTERPROSCAN]
#PATH = /path/To/interproscan.sh
#--applications = TIGRFAM,SFLD,SMART,SUPERFAMILY,Pfam,ProDom,Hamap,CDD,PANTHER
#FLAGS = --goterms,--iprlookup,--pathways
# - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# VirSorter
[VIRSORTER]
PATH = /path/to/virsorter
--user = UID-of-user-from-etc/passwd-file
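One way to read the layout of a config section is as the pieces of a command line: `PATH` names the executable, keys beginning with `-` are options that take a value, and `FLAGS` holds comma-separated tokens passed as-is. The sketch below follows that reading. It is an assumption about the convention, not MetaSanity's implementation; the function name `build_command` is hypothetical, the argument order is a guess, and input and output file arguments are left out.

```python
def build_command(section):
    """Turn one config section (a dict of str -> str) into an argument list.

    Assumed convention: PATH is the executable, keys starting with '-' are
    options with a value, and FLAGS is a comma-separated list of bare tokens.
    Keys such as DATA or DATA_DICT are not treated as command-line options.
    """
    command = [section["PATH"]]
    for key, value in section.items():
        if key.startswith("-"):
            command.extend([key, value])
    flags = section.get("FLAGS", "")
    command.extend(token for token in flags.split(",") if token)
    return command


prokka = {
    "PATH": "prokka",
    "FLAGS": "--addgenes,--addmrna,--usegenus,--metagenome,--rnammer",
    "--evalue": "1e-10",
    "--cpus": "1",
}
command = build_command(prokka)
print(" ".join(command))
```

Under the same reading, the `[KOFAMSCAN]` value `FLAGS = -p,/path/to/kofamscan/profiles,-k,/path/to/kofamscan/ko_list` splits into four tokens, which is how an option and its argument can both be carried inside `FLAGS`.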
- General Notes
- Depending on the number of genomes, the completion time for this pipeline can vary from several hours to several days.
MetaSanity FuncSanity -d fasta_folder/ -c metagenome_annotation.ini -o annot 2>annot.err
- This command will use the fasta files in fasta_folder/ in the annotation pipeline. It will output to the folder annot and will use the config file entitled metagenome_annotation.ini to name the output database and to determine individual program arguments. Debugging and error messages will be saved to annot.err.
- This pipeline will generate a series of tables: a summary table, whose name is user-provided in the config file, as well as an individual table for each genome provided that describes annotations for each protein sequence identified from the starting contigs.
- View the summary tables using dbdm SUMMARIZE -c MSResults/. Each genome analyzed will output a version of the table below, with the Database column populated by the results of each program in the pipeline.
SUMMARIZE: View summary of all tables in database
Project root directory: MSResults
Name of database: MSResults.db
*********************************************************************************************
Table Name: gimesia-maris-uba2506
Number of Records: 5862/5862
Database Average Std Dev
phage_contig_1 0.000 0.000
phage_contig_2 0.000 0.000
phage_contig_3 0.000 0.000
prophage_1 0.000 0.000
prophage_2 0.000 0.026
prophage_3 0.000 0.000
---------------------------------------------------------------------------------------------
Database Most Frequent Number Total Count
cazy CE10 16 156
cdd cd00156 IPR001789... 77 1767
hamap MF_00167 IPR00375... 8 468
is_extracellular False 128 130
ko K02456 general se... 70 1879
merops_pfam PF05569 12 130
panther PTHR30093 141 3644
pfam PF07596 IPR011453... 144 4079
prodom PD009007 10 54
prokka hypothetical prot... 256 480
sfld SFLDS00029 IPR007... 16 55
smart SM00710 IPR006626... 291 904
superfamily SSF52540 IPR02741... 256 3675
tigrfam TIGR02532 IPR0129... 125 1099
---------------------------------------------------------------------------------------------
Below is an excerpt of the table summarizing putative metabolic functions and raw peptidase counts for the entire genome set. Large sections of it have been omitted below.
- View the summary table using dbdm SUMMARIZE -c MSResults/ -t functions
SUMMARIZE: View summary of all tables in database
Project root directory: Metagenomes
Name of database: Metagenomes.db
*******************************************************************************************************************************************
Table Name: functions
Number of Records: 10/10
Database Average Std Dev
adhesion 0.000 0.000
alcohol_oxidase 0.000 0.000
alphaamylase 0.000 0.000
alt_thiosulfate_oxidation_doxad 0.000 0.000
alt_thiosulfate_oxidation_tsda 0.100 0.316
aminopeptidase_n 0.400 0.516
ammonia_oxidation_amopmmo 0.000 0.000
anaplerotic_genes 0.475 0.275
anoxygenic_typei_reaction_center 0.000 0.000
anoxygenic_typeii_reaction_center 0.000 0.000
arsenic_reduction 0.250 0.264
bacterial_prepeptidase_cterminal_domain 0.000 0.000
basic_endochitinase_b 0.000 0.000
betacarotene_1515monooxygenase 0.300 0.483
betaglucosidase 0.100 0.316
betanacetylhexosaminidase 0.300 0.483
bifunctional_chitinaselysozyme 0.000 0.000
biofilm_pga_synthesis_protein 0.000 0.000
biofilm_regulator_bsss 0.000 0.000
ce6 0.200 0.422
ce7 0.300 0.483
ce9 1.200 1.033
cellulase 0.000 0.000
chemotaxis 0.187 0.314
chitinase 0.700 0.483
clostripain_family 0.100 0.316
cobalamin_biosynthesis 0.146 0.082
coenzyme_bcoenzyme_m_regeneration 0.040 0.084
dissimilatory_sulfite___sulfide 0.000 0.000
dms_dehydrogenase 0.000 0.000
dmso_reductase 0.000 0.000
dmsp_demethylation 0.000 0.000
dmsp_lyase_dddlqpdkw 0.000 0.000
dmsp_synthase_dsyb 0.200 0.422
dnra 0.200 0.422
dsrd_dissimilatory_sulfite_reductase 0.000 0.000
entnerdoudoroff_pathway 0.475 0.249
exopolyalphagalacturonosidase 0.000 0.000
exopolygalacturonase 0.000 0.000
ferredoxin_hydrogenase 0.000 0.000
ferrioxamine_biosynthesis 0.225 0.079
flagellum 0.513 0.399
fourhydroxybutyrate3hydroxypropionate 0.190 0.110
ftype_atpase 0.912 0.132
g02null 0.100 0.316
gh65 0.100 0.316
gh74 1.900 3.071
gh77 0.500 0.527
gh8 0.100 0.316
gh81 0.300 0.483
gh88 0.200 0.422
gh9 0.300 0.483
gh93 0.600 0.843
gh94 0.000 0.000
gh95 0.100 0.316
gh99 0.100 0.316
glucoamylase 0.000 0.000
gluconeogenesis 0.424 0.371
glycolysis 0.625 0.095
glyoxylate_shunt 0.200 0.422
hydrazine_dehydrogenase 0.000 0.000
hydrazine_synthase 0.000 0.000
hydrogenquinone_oxidoreductase 0.000 0.000
hydroxylamine_oxidation 0.000 0.000
n4null 0.100 0.316
n6null 0.600 0.516
nadhquinone_oxidoreductase 0.434 0.346
nadphquinone_oxidoreductase 0.000 0.000
nadpreducing_hydrogenase 0.000 0.000
nadreducing_hydrogenase 0.000 0.000
naphthalene_degradation_to_salicylate 0.000 0.000
nife_hydrogenase 0.000 0.000
nife_hydrogenase_hyd1 0.000 0.000
nitric_oxide_reduction 0.000 0.000
nitrite_oxidation 0.200 0.422
nitrite_reduction 0.000 0.000
nitrogen_fixation 0.000 0.000
nitrousoxide_reduction 0.100 0.316
oligoendopeptidase_f 0.800 0.422
oligogalacturonide_lyase 0.000 0.000
p1null 0.300 0.675
pectinesterase 0.000 0.000
peptidase_family_c25 0.300 0.483
peptidase_family_m28 1.000 0.000
peptidase_family_m50 1.000 0.000
peptidase_propeptide_and_ypeb_domain 0.000 0.000
peptidase_s24like 0.000 0.000
peptidase_s26 0.000 0.000
phosphoserine_aminotransferase 1.000 0.000
photosystem_i 0.000 0.000
photosystem_ii 0.000 0.000
pullulanase 0.000 0.000
retinal_biosynthesis 0.300 0.230
rhodopsin 0.300 0.483
riboflavin_biosynthesis 0.925 0.121
rtca_cycle 0.000 0.000
rubisco 0.000 0.000
secsrp 0.639 0.338
transporter_ammonia 0.300 0.483
transporter_phosphate 0.700 0.483
transporter_phosphonate 0.198 0.417
transporter_thiamin 0.033 0.104
transporter_urea 0.020 0.063
transporter_vitamin_b12 0.000 0.000
twin_arginine_targeting 0.500 0.000
type_i_secretion 0.033 0.104
type_ii_secretion 0.377 0.067
type_iii_secretion 0.007 0.021
type_iv_secretion 0.017 0.035
type_vabc_secretion 0.000 0.000
type_vi_secretion 0.000 0.000
u32null 0.300 0.483
u62null 0.700 1.160
u73null 0.700 0.823
ubiquinolcytochrome_c_reductase 0.000 0.000
vanadiumonly_nitrogenase 0.000 0.000
vtype_atpase 0.000 0.000
woodljungdahl 0.017 0.054
xaapro_aminopeptidase 1.000 0.000
zinc_carboxypeptidase 0.800 0.422
-------------------------------------------------------------------------------------------------------------------------------------------
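The Average and Std Dev columns summarize each field across the records in the table. The figures in the functions table are consistent with a mean and a sample standard deviation: a row reading `0.100 0.316` over 10 records matches one genome scoring 1 and nine scoring 0. The per-genome values below are hypothetical, and the exact calculation BioMetaDB performs is an assumption.

```python
import statistics

# Hypothetical per-genome values for one column across 10 genomes:
# one genome has the feature (1), nine do not (0).
values = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

average = statistics.mean(values)
std_dev = statistics.stdev(values)  # sample standard deviation (n - 1)
print(f"{average:.3f} {std_dev:.3f}")
```

Read this way, an average between 0 and 1 for a pathway column describes the mean completeness across genomes, and a standard deviation of 0.000 means every genome had the same value.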
In general, program flags/arguments that filter or reduce output are supported, and thus can be provided in the user-passed config file. However, flags that change the output format of individual programs may cause unexpected issues, and thus are not recommended.