FUGAsseM
FUGAsseM (Function predictor of Uncharacterized Gene products by Assessing high-dimensional community data in Microbiomes) is a computational tool based on “guilt by association” to predict functions of novel gene products in the context of microbial communities. It uses machine learning methods to predict functions of microbial proteins by integrating multiple types of evidence such as gene co-expression patterns from metatranscriptomics (MTX), and optionally, genomic context from metagenomic (MGX) assemblies, homology-based annotations, domain-domain interaction-based annotations, etc.
For more information about the method and its associated publication, visit our summary page and read the FUGAsseM User Manual.
Support for FUGAsseM is available via the FUGAsseM channel of the bioBakery Support Forum.
- 1. Setup
- 2. Predicting functions using FUGAsseM
- 3. Predicting functions using FUGAsseM with advanced settings
- 4. Advanced topics
- Python (version >= 3.7, requiring the numpy, pandas, multiprocessing, sklearn, matplotlib, scipy, goatools, statistics, and typing Python packages; tested with 3.7)
- AnADAMA2 (version >= 0.8.0; tested 0.8.0)
Note: If you are using a bioBakery machine image (e.g. in Google Cloud) you do not need to install FUGAsseM because the tool and its dependencies are already installed.
Choose any one of the following options to install the FUGAsseM package.
Option 1: Installing with conda
$ conda install -c biobakery fugassem
Option 2: Installing with pip
$ pip install fugassem
- If you do not have write permissions to /usr/lib/, add the option --user to the install command. This installs the Python package into subdirectories of ~/.local/. On some platforms you may also need to add ~/.local/bin/ to your $PATH, as it might not be included by default. You will know it needs to be added if you see the message "fugassem: command not found" when trying to run FUGAsseM after installing with the --user option.
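The PATH check described above can also be done from Python. This is a minimal sketch using only the standard library; the helper names find_command and user_bin_dir are illustrative and not part of FUGAsseM, and ~/.local/bin is assumed to be the user-level script directory as stated above.

```python
import shutil
from pathlib import Path


def find_command(name="fugassem"):
    """Return the full path of `name` if it is on PATH, else None."""
    return shutil.which(name)


def user_bin_dir():
    """Directory where `pip install --user` usually places scripts (assumed)."""
    return Path.home() / ".local" / "bin"


if __name__ == "__main__":
    location = find_command()
    if location:
        print(f"fugassem found at {location}")
    else:
        print(f"fugassem not on PATH; consider adding {user_bin_dir()} to $PATH")
```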
In the VM, navigate to the correct directory with:
cd ~/Tutorials/fugassem/examples/
The steps in the canonical function prediction workflow of FUGAsseM which uses MTX coexpression profiles and raw GO annotations include: (1) preparing protein families and annotations, (2) building coexpression profiles of proteins within each taxon, and (3) building a machine learning classifier for function prediction. Please see Advanced Topics for steps (1) and (2).
The input files for running coexpression-based prediction can be found in the fugassem/examples/input/ directory. If you installed FUGAsseM with pip or conda, download them here and save them to the fugassem/examples/input/ directory. If you installed from source, copy them from the fugassem source package to the fugassem/examples/input/ directory. The inputs include (1) the protein families MTX abundances file demo_proteinfamilies_rna_CPM.stratified_Species_mtx.tsv and (2) raw GO annotations for some of these protein families, demo_proteinfamilies.GO.simple.tsv.
The input TSV files include:
This file contains the MTX abundances of protein families normalized within each taxon and is used by FUGAsseM to generate the coexpression matrix. In this example, we are using a subset of gut microbiome MTX data from the HMP2 cohort of individuals with IBD and non-IBD controls. Here, a protein family cluster is equivalent to a UniRef90 cluster.
To view this file, type:
less -S input/demo_proteinfamilies_rna_CPM.stratified_Species_mtx.tsv
which yields:
ID CSM5FZ3T_P CSM5FZ46_P CSM5FZ4C_P CSM5FZ4G_P
Cluster_100559|k__Bacteria.p__Proteobacteria.c__Gammaproteobacteria.o__Enterobacterales.f__Enterobacteriaceae.g__Escherichia.s__Escherichia_coli 0 0 0 0
Cluster_100569|k__Bacteria.p__Proteobacteria.c__Gammaproteobacteria.o__Enterobacterales.f__Enterobacteriaceae.g__Escherichia.s__Escherichia_coli 0 0 0 0
Cluster_1008935|k__Bacteria.p__Proteobacteria.c__Gammaproteobacteria.o__Enterobacterales.f__Enterobacteriaceae.g__Escherichia.s__Escherichia_coli 0 0 0 0
Cluster_101048|k__Bacteria.p__Proteobacteria.c__Gammaproteobacteria.o__Enterobacterales.f__Enterobacteriaceae.g__Escherichia.s__Escherichia_coli 0 0 0 0
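Each row ID in this table combines a protein family and its taxonomic lineage, separated by "|". The sketch below shows one way to split such a row; it assumes tab-separated fields and dot-separated lineage ranks as displayed above, and the function names are illustrative rather than part of FUGAsseM.

```python
def split_feature_id(feature_id):
    """Split 'Cluster_X|k__...s__Genus_species' into (cluster, species)."""
    cluster, _, lineage = feature_id.partition("|")
    species = None
    # Assumption: ranks are separated by '.', and the species rank starts with 's__'.
    for rank in lineage.split("."):
        if rank.startswith("s__"):
            species = rank[len("s__"):]
    return cluster, species


def parse_abundance_row(line):
    """Return (cluster, species, abundances) for one tab-separated data row."""
    fields = line.rstrip("\n").split("\t")
    cluster, species = split_feature_id(fields[0])
    return cluster, species, [float(value) for value in fields[1:]]


# First data row of the example file shown above.
row = ("Cluster_100559|k__Bacteria.p__Proteobacteria.c__Gammaproteobacteria."
       "o__Enterobacterales.f__Enterobacteriaceae.g__Escherichia."
       "s__Escherichia_coli\t0\t0\t0\t0")
print(parse_abundance_row(row))
```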
This file contains the raw GO annotations of the protein families in the HMP2 gut microbiomes.
To view this file, type:
less -S input/demo_proteinfamilies.GO.simple.tsv
which yields:
Cluster_100559 GO:0009058
Cluster_100559 GO:0016788
Cluster_1008935 GO:0005985
Cluster_1008935 GO:0005737
...
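Because a protein family can appear on several lines of this file, it is often convenient to group the GO terms per family. The following is a small sketch that assumes a two-column, tab-separated file without a header, as in the excerpt above; read_annotations is an illustrative name.

```python
import csv
import io
from collections import defaultdict


def read_annotations(handle):
    """Group GO terms by protein family from a two-column TSV."""
    annotations = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        annotations[row[0]].add(row[1])
    return dict(annotations)


# The four rows shown in the excerpt above.
SAMPLE = (
    "Cluster_100559\tGO:0009058\n"
    "Cluster_100559\tGO:0016788\n"
    "Cluster_1008935\tGO:0005985\n"
    "Cluster_1008935\tGO:0005737\n"
)
annotations = read_annotations(io.StringIO(SAMPLE))
for family, terms in sorted(annotations.items()):
    print(family, sorted(terms))
```

For a real run, replace io.StringIO(SAMPLE) with an open file handle on input/demo_proteinfamilies.GO.simple.tsv.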
To run the canonical FUGAsseM function prediction workflow, type:
fugassem --basename demo_GO --input input/demo_proteinfamilies_rna_CPM.stratified_Species_mtx.tsv \
--input-annotation input/demo_proteinfamilies.GO.simple.tsv \
--output my_output/fugassem_output_mtx
Note:
- In the command, replace my_output/fugassem_output_mtx with the path to the folder where you want to write the output files.
- See the section on parallelization options to optimize the workflow run based on your computing resources.
- The workflow runs with the default settings used to run all modules. These settings will work for most data sets. If you need to modify the default settings, you can use command line parameters to customize the characterization workflow settings. For example, you can run with one or more bypass options (for information on bypass options, see the section Workflow bypass mode).
All output files will be stored in the my_output/fugassem_output_mtx directory. It has two subdirectories: main and merged. The main subdirectory contains one directory of prediction results for each taxon in the input data; in this example, Escherichia coli and Bacteroides thetaiotaomicron. The merged subdirectory contains the results from all taxa concatenated into a single result file. The main output files in this example are main/Bacteroides_thetaiotaomicron/demo_GO.Bacteroides_thetaiotaomicron.finalized_ML.prediction.tsv, main/Escherichia_coli/demo_GO.Escherichia_coli.finalized_ML.prediction.tsv, and merged/demo_GO.finalized_ML.prediction.tsv.
This file (and the analogous file in the E. coli subdirectory) contains the predicted functions of protein family clusters in B. thetaiotaomicron. Functions are GO terms, and a higher score indicates a more confident prediction. The 'raw_ann' column indicates whether the protein family cluster was originally annotated with the predicted function (here, the GO term).
To view this file, type:
less -S my_output/fugassem_output_mtx/main/Bacteroides_thetaiotaomicron/demo_GO.Bacteroides_thetaiotaomicron.finalized_ML.prediction.tsv
which yields:
feature func category score raw_ann
Cluster_1024034 GO:0000155 GO 0.75 1
Cluster_1024034 GO:0003700 GO 0.73 1
Cluster_1024034 GO:0003824 GO 0.22 0
Cluster_1024034 GO:0004673 GO 0.8 1
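A common follow-up is to keep only predictions that were not in the original annotation (raw_ann equal to 0) and that pass a score cutoff. The sketch below assumes the tab-separated columns shown above; the function name and the cutoff values are illustrative choices, not FUGAsseM recommendations.

```python
import csv
import io


def novel_predictions(handle, min_score=0.5):
    """Return (feature, func, score) for predictions absent from the raw annotation."""
    hits = []
    for row in csv.DictReader(handle, delimiter="\t"):
        score = float(row["score"])
        if row["raw_ann"] == "0" and score >= min_score:
            hits.append((row["feature"], row["func"], score))
    return hits


# The rows shown in the excerpt above.
SAMPLE = (
    "feature\tfunc\tcategory\tscore\traw_ann\n"
    "Cluster_1024034\tGO:0000155\tGO\t0.75\t1\n"
    "Cluster_1024034\tGO:0003700\tGO\t0.73\t1\n"
    "Cluster_1024034\tGO:0003824\tGO\t0.22\t0\n"
    "Cluster_1024034\tGO:0004673\tGO\t0.8\t1\n"
)
print(novel_predictions(io.StringIO(SAMPLE), min_score=0.2))
```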
This file is a merged file containing prediction results from the two taxa in this example. The columns remain the same except for the addition of the 'taxon' column.
To view this file, type:
less -S my_output/fugassem_output_mtx/merged/demo_GO.finalized_ML.prediction.tsv
which yields:
taxon feature func category score raw_ann
Bacteroides_thetaiotaomicron Cluster_1024034 GO:0000155 GO 0.75 1
Bacteroides_thetaiotaomicron Cluster_1024034 GO:0003700 GO 0.73 1
Bacteroides_thetaiotaomicron Cluster_1024034 GO:0003824 GO 0.22 0
.
.
.
Escherichia_coli Cluster_100559 GO:0000271 GO 0.45 0
Escherichia_coli Cluster_100559 GO:0003677 GO 0.35 0
Escherichia_coli Cluster_100559 GO:0003700 GO 0.61 0
Note: The output prediction files whose names contain 'coexp' instead of 'finalized' are identical to the 'finalized' files in this run, since only coexpression is used for prediction.
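The merged file can be summarized per taxon and protein family, for example by keeping the highest-scoring function for each. This is a sketch under the assumption that the merged file is tab-separated with the columns shown above; best_per_feature is an illustrative name.

```python
import csv
import io


def best_per_feature(handle):
    """Map (taxon, feature) to its highest-scoring (func, score)."""
    best = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["taxon"], row["feature"])
        score = float(row["score"])
        if key not in best or score > best[key][1]:
            best[key] = (row["func"], score)
    return best


# The rows shown in the merged excerpt above.
SAMPLE = (
    "taxon\tfeature\tfunc\tcategory\tscore\traw_ann\n"
    "Bacteroides_thetaiotaomicron\tCluster_1024034\tGO:0000155\tGO\t0.75\t1\n"
    "Bacteroides_thetaiotaomicron\tCluster_1024034\tGO:0003700\tGO\t0.73\t1\n"
    "Bacteroides_thetaiotaomicron\tCluster_1024034\tGO:0003824\tGO\t0.22\t0\n"
    "Escherichia_coli\tCluster_100559\tGO:0000271\tGO\t0.45\t0\n"
    "Escherichia_coli\tCluster_100559\tGO:0003677\tGO\t0.35\t0\n"
    "Escherichia_coli\tCluster_100559\tGO:0003700\tGO\t0.61\t0\n"
)
best = best_per_feature(io.StringIO(SAMPLE))
for key, value in sorted(best.items()):
    print(key, value)
```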
When other community-wide data are available, FUGAsseM can predict functions by integrating coexpression with other types of evidence. This workflow adds two steps: building an individual machine learning classifier for each type of evidence (including coexpression, as discussed above), and integrating these classifiers into an ensemble classifier for the final function prediction.
This optional evidence can be formatted as vector-based evidence and matrix-based evidence, where the former represents a vector of gene-over-function relationships and the latter represents a matrix of the interplay networks among genes within species (e.g. co-expression, co-annotation or co-occurrence patterns). Evidence such as homology between protein families, gene neighborhood, and domain-domain interactions may be included. Information on how to generate such additional input evidence data is included under Advanced Topics.
The input files for running integrated prediction can be found in the fugassem/examples/input/ directory. If you installed FUGAsseM with pip or conda, download them here and save them to the fugassem/examples/input/ directory. If you installed from source, copy them from the fugassem source package to the fugassem/examples/input/ directory. The input TSV files include:
- Normalized MTX abundance table stratified by taxa, e.g. demo_proteinfamilies_rna_CPM.stratified_Species_mtx.tsv (same as above)
- GO (raw) annotations for protein families, e.g. demo_proteinfamilies.GO.simple.tsv (same as above)
- Vector-based evidence 1: GO annotations based on homology between protein families, e.g. demo_proteinfamilies.GO.homology.tsv
- Matrix-based evidence 1: Domain-Domain interactions (DDIs) for building DDI networks for prediction, e.g. demo_proteinfamilies.DDI.simple.tsv
- Matrix-based evidence 2: Source MGX-based contigs of protein families for building co-contig network for prediction, e.g. demo_proteinfamilies.contig.simple.tsv
Let's see the contents of the additional evidence files:
This file provides GO annotations of protein families inferred from homology (UniRef50) between protein families and their raw GO annotations. It is used as one type of vector-based evidence.
To view this file, type:
less -S input/demo_proteinfamilies.GO.homology.tsv
which yields:
ID GO:0000155__seqSimilarity GO:0000271__seqSimilarity GO:0003677__seqSimilarity GO:0003700__seqSimilarity
Cluster_100559 0 0 0 0
Cluster_100569 0 0 0 0
Cluster_1008935 0 0 0 0
Cluster_101048 0 1 0 0
...
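The vector evidence file is a wide table with one column per GO term. The sketch below converts it into (protein family, GO term) pairs for the non-zero entries. It assumes tab-separated values and column names of the form GO:xxxxxxx__seqSimilarity as shown above; vector_evidence_pairs is an illustrative name.

```python
import csv
import io


def vector_evidence_pairs(handle):
    """List (family, GO term) pairs with a non-zero entry in a wide evidence table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # Drop the '__seqSimilarity' suffix to recover the GO identifier.
    terms = [column.split("__")[0] for column in header[1:]]
    pairs = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        for term, value in zip(terms, row[1:]):
            if float(value) != 0:
                pairs.append((row[0], term))
    return pairs


# The rows shown in the excerpt above.
SAMPLE = (
    "ID\tGO:0000155__seqSimilarity\tGO:0000271__seqSimilarity\t"
    "GO:0003677__seqSimilarity\tGO:0003700__seqSimilarity\n"
    "Cluster_100559\t0\t0\t0\t0\n"
    "Cluster_100569\t0\t0\t0\t0\n"
    "Cluster_1008935\t0\t0\t0\t0\n"
    "Cluster_101048\t0\t1\t0\t0\n"
)
pairs = vector_evidence_pairs(io.StringIO(SAMPLE))
print(pairs)
```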
This file provides the domain-domain interaction annotations of protein families that will be used to build DDI networks as one type of matrix evidence.
To view this file, type:
less -S input/demo_proteinfamilies.DDI.simple.tsv
which yields:
Cluster_100559 PF00501:PF00109
Cluster_100559 PF00501:PF00975
Cluster_100559 PF00501:PF02801
Cluster_100559 PF00668:PF00109
...
This file provides the contig annotations of protein families that will be used to build co-contig networks as one type of matrix evidence.
To view this file, type:
less -S input/demo_proteinfamilies.contig.simple.tsv
which yields:
Cluster_7792 CSM7KORG_contig_k105_16193
Cluster_10199 CSM79HI7_contig_k105_24044
Cluster_10250 HSM7J4MW_contig_k105_17216
Cluster_12531 CSM79HI7_contig_k105_3486
...
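To give an intuition for matrix-based evidence, the sketch below links protein families that occur on the same contig. This is only one plausible way to derive a co-contig network and is not claimed to reproduce how FUGAsseM builds it internally. The last sample row (Cluster_99999) is hypothetical and added so that one contig is shared; co_contig_pairs is an illustrative name.

```python
import csv
import io
from collections import defaultdict
from itertools import combinations


def co_contig_pairs(handle):
    """Return the set of family pairs that share at least one contig."""
    by_contig = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:
            by_contig[row[1]].add(row[0])
    pairs = set()
    for families in by_contig.values():
        # Sorting makes each pair appear in one canonical order.
        for first, second in combinations(sorted(families), 2):
            pairs.add((first, second))
    return pairs


SAMPLE = (
    "Cluster_7792\tCSM7KORG_contig_k105_16193\n"
    "Cluster_10199\tCSM79HI7_contig_k105_24044\n"
    "Cluster_10250\tHSM7J4MW_contig_k105_17216\n"
    "Cluster_12531\tCSM79HI7_contig_k105_3486\n"
    "Cluster_99999\tCSM79HI7_contig_k105_24044\n"  # hypothetical row, not in the demo file
)
pairs = co_contig_pairs(io.StringIO(SAMPLE))
print(pairs)
```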
To run integrated function prediction, make sure you are in the fugassem/examples/ directory and run:
fugassem --basename demo_GO --input input/demo_proteinfamilies_rna_CPM.stratified_Species_mtx.tsv \
--input-annotation input/demo_proteinfamilies.GO.simple.tsv \
--vector-list input/demo_proteinfamilies.GO.homology.tsv \
--matrix-list input/demo_proteinfamilies.DDI.simple.tsv,input/demo_proteinfamilies.contig.simple.tsv \
--output my_output/fugassem_output
Note:
- The evidence data can be supplied as vector-based evidence (via --vector-list) that will be formatted into a single evidence vector, representing gene-over-function relationships.
- Alternatively, evidence data can be supplied as matrix-based evidence (via --matrix-list) that will be formatted as an evidence matrix, representing the interplay networks among genes within species (e.g. co-expression, co-annotation, or co-occurrence patterns).
- In the command, replace my_output/fugassem_output with the path to the folder where you want to write the output files.
All output files will be stored in the my_output/fugassem_output directory. The architecture of this directory is similar to that of fugassem_output_mtx. Prediction results for each taxon are stored in the main folder and combined results in the merged folder. Further, each taxon subdirectory contains function prediction results based on integrated evidence (demo_GO.Escherichia_coli.finalized_ML.prediction.tsv) as well as on individual evidence (the other output TSV files).
To view all output files for Escherichia coli, type:
ls my_output/fugassem_output/main/Escherichia_coli/
which yields:
data demo_GO.Escherichia_coli.vector1_ML.prediction.tsv
demo_GO.Escherichia_coli.coexp_ML.prediction.tsv Escherichia_coli.fugassem.log
demo_GO.Escherichia_coli.finalized_ML.prediction.tsv feature_maps.txt
demo_GO.Escherichia_coli.matrix1_ML.prediction.tsv prediction
demo_GO.Escherichia_coli.matrix2_ML.prediction.tsv preprocessing
Here, the most important output file of each taxon is demo_GO.Escherichia_coli.finalized_ML.prediction.tsv.
To view this file, type:
less -S my_output/fugassem_output/main/Escherichia_coli/demo_GO.Escherichia_coli.finalized_ML.prediction.tsv
which yields:
feature func category score raw_ann
Cluster_100559 GO:0000271 GO 0.16 0
Cluster_100559 GO:0003677 GO 0.17 0
Cluster_100559 GO:0003700 GO 0.29 0
Cluster_100559 GO:0003979 GO 0.15 0
Note: The prediction results based on individual evidence per taxon are in the file my_output/fugassem_output/main/Escherichia_coli/demo_GO.Escherichia_coli.$EVIDENCE_TYPE_ML.prediction.tsv (where $EVIDENCE_TYPE = the basename of each evidence).
To view the merged finalized output file, type:
less -S my_output/fugassem_output/merged/demo_GO.finalized_ML.prediction.tsv
which yields:
taxon feature func category score raw_ann
Bacteroides_thetaiotaomicron Cluster_1024034 GO:0000155 GO 0.97 1
Bacteroides_thetaiotaomicron Cluster_1024034 GO:0003700 GO 0.97 1
Bacteroides_thetaiotaomicron Cluster_1024034 GO:0003824 GO 0.34 0
.
.
.
Escherichia_coli Cluster_100559 GO:0000271 GO 0.16 0
Escherichia_coli Cluster_100559 GO:0003677 GO 0.17 0
Escherichia_coli Cluster_100559 GO:0003700 GO 0.29 0
Note: The merged prediction results based on individual evidence are in the file my_output/fugassem_output/merged/demo_GO.$EVIDENCE_TYPE_ML.prediction.tsv (where $EVIDENCE_TYPE = the basename of each evidence).
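The file naming pattern in the two notes above can be expressed as small path helpers. This sketch assumes the evidence labels seen in the directory listing above (coexp, vector1, matrix1, matrix2, finalized); the function names are illustrative.

```python
from pathlib import PurePosixPath


def taxon_prediction_path(output_dir, basename, taxon, evidence):
    """Per-taxon prediction file for one evidence type."""
    name = f"{basename}.{taxon}.{evidence}_ML.prediction.tsv"
    return PurePosixPath(output_dir) / "main" / taxon / name


def merged_prediction_path(output_dir, basename, evidence):
    """Merged (all taxa) prediction file for one evidence type."""
    name = f"{basename}.{evidence}_ML.prediction.tsv"
    return PurePosixPath(output_dir) / "merged" / name


for evidence in ["coexp", "vector1", "matrix1", "matrix2", "finalized"]:
    print(taxon_prediction_path("my_output/fugassem_output", "demo_GO",
                                "Escherichia_coli", evidence))
print(merged_prediction_path("my_output/fugassem_output", "demo_GO", "finalized"))
```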
FUGAsseM provides a visualization utility to quantify the performance of prediction using a cross-validation approach (i.e. treating the original annotation as true labels).
For example, to quantify the performance of prediction based on only coexpression in E. coli, type:
fugassem_performance_vis -i my_output/fugassem_output/main/Escherichia_coli/demo_GO.Escherichia_coli.coexp_ML.prediction.tsv \
-o my_output/fugassem_output/main/Escherichia_coli/demo_GO.Escherichia_coli.coexp_ML
which yields a PDF file including the AUC figures of each GO term: my_output/fugassem_output/main/Escherichia_coli/demo_GO.Escherichia_coli.coexp_ML.each.auc.pdf
To quantify the performance of prediction based on integrated evidence in E. coli, type:
fugassem_performance_vis -i my_output/fugassem_output/main/Escherichia_coli/demo_GO.Escherichia_coli.finalized_ML.prediction.tsv \
-o my_output/fugassem_output/main/Escherichia_coli/demo_GO.Escherichia_coli.finalized_ML
which yields a PDF file including the AUC figures of each GO term: my_output/fugassem_output/main/Escherichia_coli/demo_GO.Escherichia_coli.finalized_ML.each.auc.pdf
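To make the AUC idea concrete, the sketch below computes a per-term ROC AUC directly from prediction rows, treating raw_ann as the true label and score as the ranking. It uses the pairwise (Mann-Whitney) definition of AUC and is not claimed to match the exact procedure inside fugassem_performance_vis. The feature names F1 to F4 and their values are hypothetical toy data.

```python
import csv
import io
from collections import defaultdict


def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        return None  # AUC is undefined without both classes
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def auc_per_term(handle):
    """Compute one AUC per function from a prediction TSV."""
    by_term = defaultdict(lambda: ([], []))
    for row in csv.DictReader(handle, delimiter="\t"):
        labels, scores = by_term[row["func"]]
        labels.append(int(row["raw_ann"]))
        scores.append(float(row["score"]))
    return {term: roc_auc(labels, scores) for term, (labels, scores) in by_term.items()}


# Hypothetical toy rows in the prediction file layout.
TOY = (
    "feature\tfunc\tcategory\tscore\traw_ann\n"
    "F1\tGO:0003700\tGO\t0.9\t1\n"
    "F2\tGO:0003700\tGO\t0.8\t0\n"
    "F3\tGO:0003700\tGO\t0.3\t1\n"
    "F4\tGO:0003700\tGO\t0.1\t0\n"
)
print(auc_per_term(io.StringIO(TOY)))
```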
- Preprocessing features of each taxon
  - FUGAsseM preprocesses input evidence data and prepares feature tables for machine learning per taxon. Each type of feature will be used to build an ML classifier. These input data are in the folder: $OUTPUT_DIR/main/$TAXON_NAME/data/.
  - All intermediate preprocessing results are in the folder: $OUTPUT_DIR/main/$TAXON_NAME/preprocessing/.
  - All intermediate prediction results are in the folder per taxon: $OUTPUT_DIR/main/$TAXON_NAME/prediction/.
Because the Gene Ontology is hierarchical, a very broad term provides limited information about the function of a protein. Conversely, a very specific term with few proteins assigned to it is hard to predict, because there is little existing annotation to learn from. To address this, FUGAsseM provides several approaches for selecting informative terms for prediction: terms that contain a certain number of genes and have no child term that also passes this criterion. Using informative GO terms for prediction excludes very broad terms (which represent general, non-informative functions) as well as overly specific terms (which apply to only a few genes and lack enough data for function prediction).
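The selection rule described above can be illustrated on a toy hierarchy. This is a sketch of the general idea, not FUGAsseM's implementation. It assumes gene sets are already propagated from child terms to their parents, and it treats "a certain number of genes" as at least min_genes. The term labels (root, A, A1, B) and gene names are hypothetical.

```python
def informative_terms(term_genes, children, min_genes):
    """Select terms with >= min_genes genes and no descendant that also qualifies.

    term_genes: term -> set of genes annotated to it (propagated upwards).
    children:   term -> list of direct child terms.
    """
    def is_large(term):
        return len(term_genes.get(term, ())) >= min_genes

    def has_large_descendant(term):
        stack = list(children.get(term, ()))
        seen = set()
        while stack:
            child = stack.pop()
            if child in seen:
                continue
            seen.add(child)
            if is_large(child):
                return True
            stack.extend(children.get(child, ()))
        return False

    return {t for t in term_genes if is_large(t) and not has_large_descendant(t)}


# Hypothetical toy hierarchy: root -> A -> A1, root -> B.
toy_children = {"root": ["A", "B"], "A": ["A1"]}
toy_genes = {
    "root": {f"g{i}" for i in range(10)},
    "A": {f"g{i}" for i in range(6)},
    "A1": {f"g{i}" for i in range(5)},
    "B": {"g8", "g9"},
}
print(informative_terms(toy_genes, toy_children, min_genes=5))
print(informative_terms(toy_genes, toy_children, min_genes=6))
```

With the lower cutoff the most specific term that still has enough genes (A1) is kept; raising the cutoff moves the selection one level up to A, while the very broad root and the sparse term B are never selected.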
FUGAsseM can build a "union" informative GO set that is applied to all taxa. To build this set, informative terms are first defined per taxon as terms containing k (e.g. >5) genes without any k-size children; these taxon-specific informative terms are then combined across taxa, resolving parent-child relationships, to give a "union" set of terms without any informative children. This "union" set strikes a balance between general and specific terms, and most terms are partially covered by taxa, indicating less general functions. We therefore recommend it as the default.
fugassem --go-mode union --go-level 5 --threads 2 --local-job 2 \
--basename demo_GO --input input/demo_proteinfamilies_rna_CPM.stratified_Species_mtx.tsv \
--input-annotation input/demo_proteinfamilies.GO.simple.tsv \
--vector-list input/demo_proteinfamilies.GO.homology.tsv \
--matrix-list input/demo_proteinfamilies.DDI.simple.tsv,input/demo_proteinfamilies.contig.simple.tsv \
--output my_output/fugassem_output_union
Note:
- In the command, replace my_output/fugassem_output_union with the path to the folder where you want to write the output files.
- The architecture of the output folder is similar to that of fugassem_output, with the main and merged subdirectories.
- This run will assign annotations of integrated informative GO terms across taxa to the protein families.
- See the section on parallelization options to optimize the workflow run based on your computing resources.
- The workflow runs with the default settings used to run all modules. These settings will work for most data sets. If you need to modify the default settings, you can use command line parameters to customize the characterization workflow settings. For example, you can run with one or more bypass options (for information on bypass options, see the section Workflow bypass mode).
To view the main output file type:
less -S my_output/fugassem_output_union/main/Escherichia_coli/demo_GO.Escherichia_coli.finalized_ML.prediction.tsv
which yields:
feature func category score raw_ann
Cluster_100559 GO:0000271:[BP] polysaccharide biosynthetic process BP 0.05 0
Cluster_100559 GO:0003700:[MF] DNA-binding transcription factor activity MF 0.22 0
Cluster_100559 GO:0003979:[MF] UDP-glucose 6-dehydrogenase activity MF 0.04 0
Cluster_100559 GO:0005737:[CC] cytoplasm CC 0.43 0
...
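In this output the func column combines the GO identifier, its namespace in brackets, and the term name. The sketch below splits such a value; the pattern is inferred from the rows shown above and also accepts a bare GO identifier. parse_func is an illustrative name.

```python
import re

# GO id, optionally followed by ':[NAMESPACE] term name' (format inferred from the excerpt).
FUNC_PATTERN = re.compile(r"^(GO:\d+)(?::\[(\w+)\]\s*(.*))?$")


def parse_func(value):
    """Return (go_id, namespace, name); namespace and name are None for a bare id."""
    match = FUNC_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognized func value: {value!r}")
    return match.group(1), match.group(2), match.group(3)


print(parse_func("GO:0000271:[BP] polysaccharide biosynthetic process"))
print(parse_func("GO:0000155"))
```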
Alternatively, FUGAsseM can build a "universal" informative GO set that includes informative GO terms among all taxa and will be applied to all taxa. In this set, each informative term is defined as each term containing k (e.g. >5) genes without any k-size children among all taxa. This "universal" set makes cross-taxa comparison simpler but may lose sensitivity to some taxon-specific terms.
fugassem --go-mode universal --go-level 5 --threads 2 --local-job 2 \
--basename demo_GO --input input/demo_proteinfamilies_rna_CPM.stratified_Species_mtx.tsv \
--input-annotation input/demo_proteinfamilies.GO.simple.tsv \
--vector-list input/demo_proteinfamilies.GO.homology.tsv \
--matrix-list input/demo_proteinfamilies.DDI.simple.tsv,input/demo_proteinfamilies.contig.simple.tsv \
--output my_output/fugassem_output_universal
Note:
- In the command, replace my_output/fugassem_output_universal with the path to the folder where you want to write the output files.
- This run will assign annotations of universal informative GO terms across species to the protein families.
To view the main output file type:
less -S my_output/fugassem_output_universal/main/Escherichia_coli/demo_GO.Escherichia_coli.finalized_ML.prediction.tsv
which yields:
feature func category score raw_ann
Cluster_100559 GO:0000271:[BP] polysaccharide biosynthetic process BP 0.0 0
Cluster_100559 GO:0003700:[MF] DNA-binding transcription factor activity MF 0.1 0
Cluster_100559 GO:0003979:[MF] UDP-glucose 6-dehydrogenase activity MF 0.01 0
Cluster_100559 GO:0005737:[CC] cytoplasm CC 0.45 0
...
FUGAsseM can also build "bug-specific" (taxon-specific) informative GO sets, in which each taxon is predicted with its own set of informative GO terms. In this mode, an informative term is defined per taxon as a term containing k (e.g. >5) genes without any k-size children. This approach can predict taxon-specific terms but makes cross-taxa comparison much more complex.
fugassem --go-mode "taxon-specific" --go-level 5 \
--threads 2 --local-job 2 \
--basename demo_GO --input input/demo_proteinfamilies_rna_CPM.stratified_Species_mtx.tsv \
--input-annotation input/demo_proteinfamilies.GO.simple.tsv \
--vector-list input/demo_proteinfamilies.GO.homology.tsv \
--matrix-list input/demo_proteinfamilies.DDI.simple.tsv,input/demo_proteinfamilies.contig.simple.tsv \
--output my_output/fugassem_output_bug_specific
Note:
- In the command, replace my_output/fugassem_output_bug_specific with the path to the folder where you want to write the output files.
- This run will assign annotations of informative GO terms specific to each taxon to the protein families.
To view the main output file type:
less -S my_output/fugassem_output_bug_specific/main/Escherichia_coli/demo_GO.Escherichia_coli.finalized_ML.prediction.tsv
which yields:
feature func category score raw_ann
Cluster_100559 GO:0000271:[BP] polysaccharide biosynthetic process BP 0.03 0
Cluster_100559 GO:0003979:[MF] UDP-glucose 6-dehydrogenase activity MF 0.02 0
Cluster_100559 GO:0005737:[CC] cytoplasm CC 0.61 0
Cluster_100559 GO:0005886:[CC] plasma membrane CC 0.28 0
...
Moreover, FUGAsseM provides advanced options to further select taxa for function prediction based on their detection by MTX and annotation coverage of functions for predictions.
fugassem --minimum-prevalence 0.01 --minimum-number 20 --minimum-coverage 0.1 \
--go-mode union --go-level 5 --threads 2 --local-job 2 \
--vector-list input/demo_proteinfamilies.GO.homology.tsv \
--matrix-list input/demo_proteinfamilies.DDI.simple.tsv,input/demo_proteinfamilies.contig.simple.tsv \
--basename demo_GO --input input/demo_proteinfamilies_rna_CPM.stratified_Species_mtx.tsv \
--input-annotation input/demo_proteinfamilies.GO.simple.tsv \
--output my_output/fugassem_output_union_advanced
Note:
- In the command, replace my_output/fugassem_output_union_advanced with the path to the folder where you want to write the output files.
- This process will select species that contain at least a certain fraction of annotated proteins with higher abundance and prevalence than the minimums.
- This run will assign annotations of integrated informative GO terms across selected taxa to the protein families.
Notice that only Escherichia coli has been selected for the function prediction. The list of selected taxa can be found in taxa_abunds_files.txt. To view the main output file, type:
less -S my_output/fugassem_output_union_advanced/main/Escherichia_coli/demo_GO.Escherichia_coli.finalized_ML.prediction.tsv
which yields:
feature func category score raw_ann
Cluster_100559 GO:0000271:[BP] polysaccharide biosynthetic process BP 0.02 0
Cluster_100559 GO:0003979:[MF] UDP-glucose 6-dehydrogenase activity MF 0.06 0
Cluster_100559 GO:0005737:[CC] cytoplasm CC 0.59 0
Cluster_100559 GO:0005886:[CC] plasma membrane CC 0.19 0
...
FUGAsseM takes a stratified MTX abundance table (normalized within each taxon) as input. Users may provide this table from their own analysis or use one of the two options below.
Option 1: Reference-based approach
Users can obtain this type of input using tools such as HUMAnN, which aligns MTX shotgun reads against reference proteins and quantifies MTX abundance within species. This stratified MTX abundance table should be normalized within each taxon using either HUMAnN's utility humann_renorm_table or FUGAsseM's utility fugassem_abundance_normalization.
Option 2: Assembly-based approach
- FUGAsseM includes a utility called fugassem_generate_stratified_mtx_input to prepare the required MTX abundance input file, in which protein family abundances are normalized within each taxon. Users can also obtain this type of input using other tools such as HUMAnN.
- Five inputs are required: (1) a QC'ed shotgun sequencing MTX file, (2) MGX-assembled, non-redundant gene catalogs, (3) nucleotide sequences of the representative genes in the catalogs, (4) protein family level clusters of representative genes, and (5) the taxonomy of the protein families.
- This utility (1) maps MTX shotgun reads against MGX-assembled gene catalogs, (2) sums up the quantified abundance of gene catalogs to the level of protein families, and (3) normalizes the protein-family-based MTX abundance within each taxon. Example files are provided in the raw_input directory in the VM. The stratified MTX abundance file can be created with the command:
fugassem_generate_stratified_mtx_input --taxon-level Species \
--gene-catalog raw_input/demo_genecatalogs.clstr \
--gene-catalog-seq raw_input/demo_genecatalogs.centroid.fna \
--protein-family raw_input/demo_proteinfamilies.clstr \
--family-taxonomy raw_input/demo_proteinfamilies_annotation.taxonomy.tsv \
--basename demo --input raw_input --output my_output/fugassem_input_prep
- The main output is a TSV file, my_output/fugassem_input_prep/demo.proteinfamilies.nrm.tsv, containing MTX abundances of protein families normalized within each taxon, which looks like:
ID sample1 sample2
Cluster_100569|k__Bacteria.p__Proteobacteria.c__Gammaproteobacteria.o__Enterobacterales.f__Enterobacteriaceae.g__Escherichia.s__Escherichia_coli 26585.4 5199.89
Cluster_1022788|k__Bacteria.p__Proteobacteria.c__Gammaproteobacteria.o__Enterobacterales.f__Enterobacteriaceae.g__Escherichia.s__Escherichia_coli 0 19218.2
Cluster_1022791|k__Bacteria.p__Proteobacteria.c__Gammaproteobacteria.o__Enterobacterales.f__Enterobacteriaceae.g__Escherichia.s__Escherichia_coli 0 18934.2
Cluster_1022793|k__Bacteria.p__Proteobacteria.c__Gammaproteobacteria.o__Enterobacterales.f__Enterobacteriaceae.g__Escherichia.s__Escherichia_coli 0 5432.9
...
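To illustrate what "normalized within each taxon" can mean, the sketch below rescales abundances so that, for every sample, the protein families of one taxon sum to a fixed total. The counts-per-million scale is an assumption suggested by the "CPM" in the demo file name; the exact scheme used by fugassem_generate_stratified_mtx_input may differ. The taxa and values are hypothetical toy data, and the taxon is taken as everything after "|" in the row ID.

```python
def normalize_within_taxon(table, scale=1e6):
    """Rescale abundances so each taxon sums to `scale` within every sample.

    table: feature ID ('Cluster|lineage') -> list of per-sample abundances.
    """
    totals = {}
    for feature, values in table.items():
        taxon = feature.partition("|")[2]
        sums = totals.setdefault(taxon, [0.0] * len(values))
        for index, value in enumerate(values):
            sums[index] += value
    normalized = {}
    for feature, values in table.items():
        sums = totals[feature.partition("|")[2]]
        # Samples where the taxon has no signal stay at zero.
        normalized[feature] = [
            scale * value / total if total > 0 else 0.0
            for value, total in zip(values, sums)
        ]
    return normalized


# Hypothetical toy table with two taxa and two samples.
toy = {
    "Cluster_A|s__Taxon_one": [1.0, 0.0],
    "Cluster_B|s__Taxon_one": [3.0, 0.0],
    "Cluster_C|s__Taxon_two": [5.0, 5.0],
}
result = normalize_within_taxon(toy)
for feature, values in result.items():
    print(feature, values)
```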
- For more details, please see the User Manual.
FUGAsseM provides other utilities to prepare the optional evidence data used for prediction. These utilities generate evidence files such as raw functional annotations, domain-based annotation evidence, and assembly-based annotation evidence.
- FUGAsseM provides fugassem_generate_annotation_input to prepare evidence input files, based on the outputs of MetaWIBELE and homology-based annotation, that are used by FUGAsseM for function prediction.
- Inputs may include the annotation file of MetaWIBELE (which includes Pfam-based domain-domain interactions) and information about gene neighborhood (contig) and homology (UniRef50) of protein families.
- The utility will then create the respective vector-based and matrix-based evidence files that are used in the integrated (full model) prediction. Example files are provided in the raw_input directory in the VM. The output files can be created with the command:
fugassem_generate_annotation_input --input raw_input/demo_proteinfamilies_annotation.tsv \
--clust-file raw_input/demo_proteinfamilies.clstr \
--contig --gene-info raw_input/demo_gene_info.tsv \
--homology --homology-ann raw_input/demo_map_proteinfamilies.ident50.tsv \
--basename demo_proteinfamilies --output my_output/fugassem_input_prep
- The following are the main output files (contents shown under Section 2.2.1)
my_output/fugassem_input_prep/demo_proteinfamilies.GO.simple.tsv
my_output/fugassem_input_prep/demo_proteinfamilies.DDI.simple.tsv
my_output/fugassem_input_prep/demo_proteinfamilies.contig.simple.tsv
my_output/fugassem_input_prep/demo_proteinfamilies.GO.homology.tsv
- For more details, please see the User Manual.