5 FuncSanity

Christopher Neely edited this page Sep 6, 2021 · 13 revisions

FuncSanity uses Prodigal, KofamScan, InterProScan, PROKKA, VirSorter, PSORTb, and KEGG-Decoder to structurally and functionally annotate contig data.

Rationale

FuncSanity uses a config file and a type file to run its analysis. Its pipeline is generated from the following programs:

Prodigal v2.6.3

Prodigal is used to search for ORFs within the user-provided contigs. Predicted proteins are used in the downstream annotation analysis.

PROKKA v1.13.3(Docker)/v1.14.6(Conda)

PROKKA provides whole-genome annotation of microbial genomes. Users may incorporate PROKKA annotations into their Prodigal gene calls, or may choose to use PROKKA gene calls in all downstream analyses. RNAmmer 1.2 can optionally be integrated to detect rRNA sequences instead of Barrnap, which is packaged with PROKKA. Please see our note about RNAmmer and installation of this program. SignalP 4.1 may also be installed as part of the PROKKA annotation pipeline.

This analysis is optional. To remove it, comment out the [PROKKA] section of the config file.

KEGG Pathway annotation 1.3.0

KofamScan assigns KO numbers to predicted coding sequences. This information is fed into KEGG-Decoder v.1.0.10 to determine completeness of certain predetermined metabolic pathways.

This analysis is optional. To remove it, comment out the [KOFAMSCAN] and [BIODATA] sections of the config file.

InterProScan 5.36-75.0

InterProScan allows for customizable protein domain identification.

This analysis is optional. To include it, un-comment the [INTERPROSCAN] section of the config file.

VirSorter v1.0.5(Docker)/v2.2.3(Conda)

VirSorter predicts viral signatures. Contigs with a predicted viral signature are stored. VirSorter provides phage and prophage predictions with varying confidence levels.

This analysis is optional. To remove it, comment out the [VIRSORTER] section of the config file.

Extracellular peptidase and carbohydrate-active enzyme annotations

The extracellular peptidase annotation pipeline determines putative peptidases with an HMMER (v3.1b2) search against the MEROPS database. Putative peptidases are screened for predicted extracellular localization using PSORTb v.3.0. Optionally, SignalP v4.1 can be integrated to identify proteins lacking PSORTb localization. Please see our note about SignalP and installation of this program. Predicted proteins are assigned CAZy identifiers based on an HMMER search of the dbCAN database.

This analysis is optional. To remove it entirely, ensure that the [CAZY], [MEROPS], [SIGNALP], and [PSORTB] sections are commented out.

The extracellularity prediction is also optional. To include it, ensure that the [CAZY], [MEROPS], and [PSORTB] sections are all un-commented. [SIGNALP] may be optionally included by un-commenting its section.

Run time

A complete annotation pipeline will take approx. 1-2 hours per genome to complete, with the longest annotations occurring with use of InterProScan's PANTHER database.

Note

Some proteins may generate multiple annotations, particularly within InterProScan and PROKKA. Overlapping regions that share the same annotation are merged into a single non-overlapping span before being stored in the annotation database.
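
The merging described in this note can be illustrated with a minimal Python sketch. This is not FuncSanity's own code: the function name merge_spans, the (annotation, start, end) tuple layout, and the coordinates are assumptions made for illustration only.

```python
from collections import defaultdict


def merge_spans(hits):
    """Merge overlapping (annotation, start, end) hits, one annotation at a time."""
    by_annotation = defaultdict(list)
    for annotation, start, end in hits:
        by_annotation[annotation].append((start, end))

    merged = {}
    for annotation, spans in by_annotation.items():
        spans.sort()
        result = [list(spans[0])]
        for start, end in spans[1:]:
            if start <= result[-1][1]:
                # Overlaps the previous span: extend it instead of adding a new one.
                result[-1][1] = max(result[-1][1], end)
            else:
                result.append([start, end])
        merged[annotation] = [tuple(span) for span in result]
    return merged


# Hypothetical hits on one protein; the coordinates are invented for illustration.
example_hits = [
    ("PF07596", 10, 120),
    ("PF07596", 100, 180),
    ("PF07596", 300, 350),
    ("SM00710", 15, 60),
]
merged_example = merge_spans(example_hits)
```

Hits with different annotations are never merged with each other, even when their coordinates overlap.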

Output

The raw output from these pipelines is voluminous! However, these blog posts provide detailed instructions on using BioMetaDB to query the results, and also explain the raw outputs of this program.

FuncSanity

Running FuncSanity requires a configuration file, a directory of genomes, and an optional type file.

  • Required arguments
    • --directory (-d): directory of fasta files
    • --config_file (-c): config.ini file matching template in Sample/Config
  • Optional flags
    • --prokka (-p): Use prokka gene calls in downstream analysis
    • --cancel_autocommit (-a): Cancel creation/update of BioMetaDB project
  • Optional arguments
    • --output_directory (-o): Output prefix
    • --biometadb_project (-b): Name to assign to BioMetaDB project, or name of existing project to use
    • --type_file (-t): type_file, formatted as 'file_name.fna\t[Archaea/Bacteria]\t[gram+/gram-]\n'
      • This argument is only required if running the peptidase portion of the pipeline on non gram- bacteria.
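
For scripted runs, the arguments listed above could be assembled as in the following Python sketch. The helper name build_funcsanity_command and its parameter names are hypothetical; only the command-line flags themselves come from the list above.

```python
def build_funcsanity_command(directory, config_file, output_directory=None,
                             biometadb_project=None, type_file=None,
                             use_prokka=False, cancel_autocommit=False):
    """Return the argument list for one FuncSanity run (nothing is executed here)."""
    command = ["MetaSanity", "FuncSanity", "-d", directory, "-c", config_file]
    if output_directory is not None:
        command += ["-o", output_directory]
    if biometadb_project is not None:
        command += ["-b", biometadb_project]
    if type_file is not None:
        command += ["-t", type_file]
    if use_prokka:
        command.append("-p")
    if cancel_autocommit:
        command.append("-a")
    return command


example_command = build_funcsanity_command(
    "fasta_folder/", "metagenome_annotation.ini", output_directory="annot"
)
```

The returned list can be passed to subprocess.run(example_command, check=True) on a machine where MetaSanity is installed and on the path.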

FASTA File Format

Each file within the folder passed to --directory should contain a single genome/MAG/SAG in simple FASTA format:

>MAG_contig_1
ATCGAAA
>MAG_contig_2
TTTAGCAA
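
Before starting a long run, it may help to confirm that each input file parses as simple FASTA. The Python sketch below is one way to do that; read_fasta is a hypothetical helper and is not part of MetaSanity.

```python
def read_fasta(text):
    """Parse simple FASTA text into {contig_id: sequence}, raising ValueError on malformed input."""
    records = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("empty FASTA header")
            current = fields[0]
            if current in records:
                raise ValueError(f"duplicate contig id: {current}")
            records[current] = []
        elif current is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[current].append(line)
    # Join wrapped sequence lines into a single string per contig.
    return {name: "".join(parts) for name, parts in records.items()}


example_records = read_fasta(">MAG_contig_1\nATCGAAA\n>MAG_contig_2\nTTTAGCAA\n")
```

To check a whole directory, the same function can be applied to the contents of each file in the folder passed to --directory.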

FuncSanity type file

The default settings run searches for gram-negative bacteria, but users may also search for gram-positive bacteria and archaea. This information should be provided for the relevant genomes in a separate file and passed to MetaSanity from the command line using the -t flag. Pipeline searches can be run with any combination of gram +/- bacteria/archaea. The file should contain the following tab-separated fields, with one line per relevant fasta file passed to the pipeline:

[fasta-file]\t[Bacteria/Archaea]\t[gram+/gram-]\n

Example: example-fasta-file.fna\tArchaea\tgram+\n

This file can be automatically generated using the generate-typefile.py script in the MetaSanity/Accessories/ folder of the MetaSanity installation. Assuming that the directory contents are on your path and that you have already completed PhyloSanity, run the command generate-typefile.py MSResults to generate the type file typefile.list, then update the resulting file with membrane information (the gram+/gram- column).
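
The tab-separated layout can also be produced by hand. The following Python sketch shows the format only; it is not the generate-typefile.py script, and the function name format_type_file is an assumption.

```python
VALID_DOMAINS = ("Archaea", "Bacteria")
VALID_GRAM = ("gram+", "gram-")


def format_type_file(entries):
    """Render (file_name, domain, gram) tuples as tab-separated type-file text."""
    lines = []
    for file_name, domain, gram in entries:
        if domain not in VALID_DOMAINS:
            raise ValueError(f"unexpected domain for {file_name}: {domain}")
        if gram not in VALID_GRAM:
            raise ValueError(f"unexpected gram value for {file_name}: {gram}")
        lines.append(f"{file_name}\t{domain}\t{gram}\n")
    return "".join(lines)


example_type_file = format_type_file([("example-fasta-file.fna", "Archaea", "gram+")])
```

Writing the returned string to a file (for example with open("typefile.list", "w")) yields something that can be passed with the -t flag.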

FuncSanity config file

The FuncSanity default config file allows for program-level flags to be provided. Note that standalone flags (i.e. those that are passed without arguments) are listed under FLAGS, separated by commas.

Users may select which portions of the FuncSanity pipeline they wish to run. FuncSanity determines valid sections from un-commented sections of the user-provided config file and builds its pipeline accordingly.
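
To see which sections of a config file are active before launching a run, the file can be inspected with Python's standard configparser module, as sketched below. How FuncSanity itself turns a section into a command line is not described here, so section_to_argv is only an assumed expansion (PATH first, then key/value pairs, then the comma-separated FLAGS); both function names are hypothetical.

```python
import configparser

SAMPLE_CONFIG = """\
[PRODIGAL]
PATH = prodigal
-p = meta
FLAGS = -m

[PROKKA]
PATH = prokka
FLAGS = --addgenes,--addmrna
--cpus = 1

#[INTERPROSCAN]
#PATH = /path/To/interproscan.sh
"""


def load_sections(text):
    """Return {section: {key: value}} for every un-commented section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case, so -T and -t stay distinct
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


def section_to_argv(options):
    """Assumed expansion of one section into an argument list."""
    argv = [options["PATH"]]
    for key, value in options.items():
        if key in ("PATH", "FLAGS", "DATA", "DATA_DICT"):
            continue
        argv += [key, value]
    if "FLAGS" in options:
        argv += [flag for flag in options["FLAGS"].split(",") if flag]
    return argv


sections = load_sections(SAMPLE_CONFIG)
active_sections = sorted(sections)
```

Lines beginning with # are treated as comments by configparser, so the commented-out [INTERPROSCAN] section does not appear in active_sections.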

Configuring a pipeline

The FuncSanity config file is divided into sections representing available annotation steps. The Docker config file sections come pre-populated with the proper path arguments, and should only be modified by adding flags or by commenting out unwanted sections. Source code users must adjust PATH, DATA, and DATA_DICT values accordingly.

By default, the sections corresponding to SignalP, PSORTb, and InterProScan are commented out (see below) and will not run. If we wish to incorporate these analyses, we can uncomment these sections from the config file.

  • Location (Conda): MetaSanity/build/Config/Conda/FuncSanity.ini
  • Location (Docker): Config/Docker(or SourceCode)/FuncSanity.ini
# Conda/FuncSanity.ini
# Default config file for running the FuncSanity pipeline
# Users are recommended to edit copies of this file only

# - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# The following **MUST** be set

[PRODIGAL]
PATH = prodigal
-p = meta
FLAGS = -m

[HMMSEARCH]
PATH = hmmsearch
-T = 75

[HMMCONVERT]
PATH = hmmconvert

[HMMPRESS]
PATH = hmmpress

[BIOMETADB]
PATH = dbdm
--db_name = MSResults
FLAGS = -s

[DIAMOND]
PATH = diamond
--threads = 1


# - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# The following pipe sections may optionally be set
# Ensure that the entire pipe section is valid,
# or deleted/commented out, prior to running pipeline


# - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Peptidase annotation

[CAZY]
DATA = /path/to/peptidase/dbCAN-fam-HMMs.txt

[MEROPS]
DATA = /path/to/peptidase/MEROPS.pfam.hmm
DATA_DICT = /path/to/peptidase/merops-as-pfams.txt

#[SIGNALP]
#PATH = /path/To/signalp

#[PSORTB]
#PATH = /path/To/psortb

# - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# KEGG pathway annotation

[KOFAMSCAN]
PATH = exec_annotation
FLAGS = -p,/path/to/kofamscan/profiles,-k,/path/to/kofamscan/ko_list
--cpu = 1

[BIODATA]
PATH = KEGG-decoder
--vizoption = interactive

# - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# PROKKA

[PROKKA]
PATH = prokka
FLAGS = --addgenes,--addmrna,--usegenus,--metagenome,--rnammer
--evalue = 1e-10
--cpus = 1

# - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# InterproScan

#[INTERPROSCAN]
#PATH = /path/To/interproscan.sh
#--applications = TIGRFAM,SFLD,SMART,SUPERFAMILY,Pfam,ProDom,Hamap,CDD,PANTHER
#FLAGS = --goterms,--iprlookup,--pathways

# - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# VirSorter

[VIRSORTER]
PATH = /path/to/virsorter
--user = UID-of-user-from-etc/passwd-file
  • General Notes
    • Depending on the number of genomes, the completion time for this pipeline can vary from several hours to several days.

Simple Example

  • MetaSanity FuncSanity -d fasta_folder/ -c metagenome_annotation.ini -o annot 2>annot.err
  • This command will use the fasta files in fasta_folder/ in the annotation pipeline. It will output to the folder annot and will use the config file entitled metagenome_annotation.ini to name the output database and to determine individual program arguments. Debugging and error messages will be saved to annot.err.
  • This pipeline will generate a series of tables - a summary table, whose name is user-provided in the config file, as well as an individual table for each genome provided that describes annotations for each protein sequence identified from the starting contigs.
  • View the summary tables using dbdm SUMMARIZE -c MSResults/. Each genome analyzed will output a version of the table below, with the Database column populated by the results of each program in the pipeline.
SUMMARIZE:  View summary of all tables in database
 Project root directory:  MSResults
 Name of database:    MSResults.db

*********************************************************************************************
         Table Name:  gimesia-maris-uba2506
  Number of Records:  5862/5862
            Database  Average               Std Dev     
      phage_contig_1  0.000                 0.000       
      phage_contig_2  0.000                 0.000       
      phage_contig_3  0.000                 0.000       
          prophage_1  0.000                 0.000       
          prophage_2  0.000                 0.026       
          prophage_3  0.000                 0.000       
---------------------------------------------------------------------------------------------

            Database  Most Frequent         Number      Total Count 
                cazy  CE10                  16          156         
                 cdd  cd00156 IPR001789...  77          1767        
               hamap  MF_00167 IPR00375...  8           468         
    is_extracellular  False                 128         130         
                  ko  K02456 general se...  70          1879        
         merops_pfam  PF05569               12          130         
             panther  PTHR30093             141         3644        
                pfam  PF07596 IPR011453...  144         4079        
              prodom  PD009007              10          54          
              prokka  hypothetical prot...  256         480         
                sfld  SFLDS00029 IPR007...  16          55          
               smart  SM00710 IPR006626...  291         904         
         superfamily  SSF52540 IPR02741...  256         3675        
             tigrfam  TIGR02532 IPR0129...  125         1099        
---------------------------------------------------------------------------------------------

Below is an excerpt of the table summarizing putative metabolic functions and raw peptidase counts for the entire genome set; large sections have been omitted.

  • View the summary table using dbdm SUMMARIZE -c MSResults/ -t functions
SUMMARIZE: View summary of all tables in database
 Project root directory:  Metagenomes
 Name of database:    Metagenomes.db

*******************************************************************************************************************************************
                                                Table Name: functions   
    Number of Records:                                      10/10        

                                                  Database  Average               Std Dev     

                                                  adhesion  0.000                 0.000       
                                           alcohol_oxidase  0.000                 0.000       
                                              alphaamylase  0.000                 0.000       
                           alt_thiosulfate_oxidation_doxad  0.000                 0.000       
                            alt_thiosulfate_oxidation_tsda  0.100                 0.316       
                                          aminopeptidase_n  0.400                 0.516       
                                 ammonia_oxidation_amopmmo  0.000                 0.000       
                                         anaplerotic_genes  0.475                 0.275       
                          anoxygenic_typei_reaction_center  0.000                 0.000       
                         anoxygenic_typeii_reaction_center  0.000                 0.000       
                                         arsenic_reduction  0.250                 0.264       
                   bacterial_prepeptidase_cterminal_domain  0.000                 0.000       
                                     basic_endochitinase_b  0.000                 0.000       
                            betacarotene_1515monooxygenase  0.300                 0.483       
                                           betaglucosidase  0.100                 0.316       
                                 betanacetylhexosaminidase  0.300                 0.483       
                            bifunctional_chitinaselysozyme  0.000                 0.000       
                             biofilm_pga_synthesis_protein  0.000                 0.000       
                                    biofilm_regulator_bsss  0.000                 0.000              
                                                       ce6  0.200                 0.422       
                                                       ce7  0.300                 0.483       
                                                       ce9  1.200                 1.033       
                                                 cellulase  0.000                 0.000       
                                                chemotaxis  0.187                 0.314       
                                                 chitinase  0.700                 0.483       
                                        clostripain_family  0.100                 0.316       
                                    cobalamin_biosynthesis  0.146                 0.082       
                         coenzyme_bcoenzyme_m_regeneration  0.040                 0.084       
                           dissimilatory_sulfite___sulfide  0.000                 0.000       
                                         dms_dehydrogenase  0.000                 0.000       
                                            dmso_reductase  0.000                 0.000       
                                        dmsp_demethylation  0.000                 0.000       
                                      dmsp_lyase_dddlqpdkw  0.000                 0.000       
                                        dmsp_synthase_dsyb  0.200                 0.422       
                                                      dnra  0.200                 0.422       
                      dsrd_dissimilatory_sulfite_reductase  0.000                 0.000       
                                   entnerdoudoroff_pathway  0.475                 0.249       
                             exopolyalphagalacturonosidase  0.000                 0.000       
                                      exopolygalacturonase  0.000                 0.000       
                                    ferredoxin_hydrogenase  0.000                 0.000       
                                 ferrioxamine_biosynthesis  0.225                 0.079       
                                                 flagellum  0.513                 0.399       
                     fourhydroxybutyrate3hydroxypropionate  0.190                 0.110       
                                              ftype_atpase  0.912                 0.132       
                                                   g02null  0.100                 0.316           
                                                      gh65  0.100                 0.316       
                                                      gh74  1.900                 3.071       
                                                      gh77  0.500                 0.527       
                                                       gh8  0.100                 0.316       
                                                      gh81  0.300                 0.483       
                                                      gh88  0.200                 0.422       
                                                       gh9  0.300                 0.483       
                                                      gh93  0.600                 0.843       
                                                      gh94  0.000                 0.000       
                                                      gh95  0.100                 0.316       
                                                      gh99  0.100                 0.316       
                                              glucoamylase  0.000                 0.000       
                                           gluconeogenesis  0.424                 0.371       
                                                glycolysis  0.625                 0.095       
                                          glyoxylate_shunt  0.200                 0.422             
                                   hydrazine_dehydrogenase  0.000                 0.000       
                                        hydrazine_synthase  0.000                 0.000       
                            hydrogenquinone_oxidoreductase  0.000                 0.000       
                                   hydroxylamine_oxidation  0.000                 0.000         
                                                    n4null  0.100                 0.316       
                                                    n6null  0.600                 0.516       
                                nadhquinone_oxidoreductase  0.434                 0.346       
                               nadphquinone_oxidoreductase  0.000                 0.000       
                                  nadpreducing_hydrogenase  0.000                 0.000       
                                   nadreducing_hydrogenase  0.000                 0.000       
                     naphthalene_degradation_to_salicylate  0.000                 0.000       
                                          nife_hydrogenase  0.000                 0.000       
                                     nife_hydrogenase_hyd1  0.000                 0.000       
                                    nitric_oxide_reduction  0.000                 0.000       
                                         nitrite_oxidation  0.200                 0.422       
                                         nitrite_reduction  0.000                 0.000       
                                         nitrogen_fixation  0.000                 0.000       
                                    nitrousoxide_reduction  0.100                 0.316       
                                      oligoendopeptidase_f  0.800                 0.422       
                                  oligogalacturonide_lyase  0.000                 0.000       
                                                    p1null  0.300                 0.675       
                                            pectinesterase  0.000                 0.000       
                                      peptidase_family_c25  0.300                 0.483       
                                      peptidase_family_m28  1.000                 0.000       
                                      peptidase_family_m50  1.000                 0.000       
                      peptidase_propeptide_and_ypeb_domain  0.000                 0.000       
                                         peptidase_s24like  0.000                 0.000       
                                             peptidase_s26  0.000                 0.000       
                            phosphoserine_aminotransferase  1.000                 0.000       
                                             photosystem_i  0.000                 0.000       
                                            photosystem_ii  0.000                 0.000             
                                               pullulanase  0.000                 0.000       
                                      retinal_biosynthesis  0.300                 0.230       
                                                 rhodopsin  0.300                 0.483       
                                   riboflavin_biosynthesis  0.925                 0.121       
                                                rtca_cycle  0.000                 0.000       
                                                   rubisco  0.000                 0.000              
                                                    secsrp  0.639                 0.338              
                                       transporter_ammonia  0.300                 0.483       
                                     transporter_phosphate  0.700                 0.483       
                                   transporter_phosphonate  0.198                 0.417       
                                       transporter_thiamin  0.033                 0.104       
                                          transporter_urea  0.020                 0.063       
                                   transporter_vitamin_b12  0.000                 0.000       
                                   twin_arginine_targeting  0.500                 0.000       
                                          type_i_secretion  0.033                 0.104       
                                         type_ii_secretion  0.377                 0.067       
                                        type_iii_secretion  0.007                 0.021       
                                         type_iv_secretion  0.017                 0.035       
                                       type_vabc_secretion  0.000                 0.000       
                                         type_vi_secretion  0.000                 0.000       
                                                   u32null  0.300                 0.483       
                                                   u62null  0.700                 1.160       
                                                   u73null  0.700                 0.823       
                           ubiquinolcytochrome_c_reductase  0.000                 0.000       
                                  vanadiumonly_nitrogenase  0.000                 0.000       
                                              vtype_atpase  0.000                 0.000       
                                             woodljungdahl  0.017                 0.054       
                                     xaapro_aminopeptidase  1.000                 0.000       
                                     zinc_carboxypeptidase  0.800                 0.422       
-------------------------------------------------------------------------------------------------------------------------------------------
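
If the summary text is captured to a file or string, the Average/Std Dev rows can be pulled into a dictionary for filtering. The Python sketch below assumes the three-column layout shown in the excerpt above; parse_summary_rows is a hypothetical helper and is not part of BioMetaDB. It does not handle the "Most Frequent" tables.

```python
def parse_summary_rows(text):
    """Extract {name: (average, std_dev)} from rows with exactly three columns."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # headers, separators, and blank lines
        name, average, std_dev = fields
        try:
            rows[name] = (float(average), float(std_dev))
        except ValueError:
            continue  # three-word lines that are not numeric rows
    return rows


# A few rows copied from the excerpt above.
SAMPLE_SUMMARY = """\
                                                Table Name: functions
                                                  Database  Average               Std Dev
                                                 cellulase  0.000                 0.000
                                                 chitinase  0.700                 0.483
                                                      gh74  1.900                 3.071
"""

summary_rows = parse_summary_rows(SAMPLE_SUMMARY)
detected = sorted(name for name, (average, _) in summary_rows.items() if average > 0)
```

Here detected lists the functions whose average across the genome set is non-zero.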

A note on flags

In general, program flags/arguments that filter or reduce output are supported, and thus can be provided in the user-passed config file. However, flags that change the output format of individual programs may cause unexpected issues, and thus are not recommended.