
FUGAsseM

Yancong Zhang edited this page Oct 5, 2023 · 3 revisions

Welcome to the FUGAsseM tutorial

FUGAsseM (Function predictor of Uncharacterized Gene products by Assessing high-dimensional community data in Microbiomes) is a computational tool based on “guilt by association” to predict functions of novel gene products in the context of microbial communities. It uses machine learning methods to predict functions of microbial proteins by integrating multiple types of evidence such as gene co-expression patterns from metatranscriptomics (MTX), and optionally, genomic context from metagenomic (MGX) assemblies, homology-based annotations, domain-domain interaction-based annotations, etc.

For more information about the method and its associated publication, visit our summary page and read the FUGAsseM User Manual.

Support for FUGAsseM is available via the FUGAsseM channel of the bioBakery Support Forum.




1. Setup

1.1 Requirements

  1. Python (version >= 3.7; tested with 3.7), with the following Python packages: numpy, pandas, multiprocessing, sklearn, matplotlib, scipy, goatools, statistics, typing
  2. AnADAMA2 (version >= 0.8.0; tested 0.8.0)

1.2 Installation

Note: If you are using a bioBakery machine image (e.g. in Google Cloud) you do not need to install FUGAsseM because the tool and its dependencies are already installed.

Choose any one of the following options to install the FUGAsseM package.

Option 1: Installing with conda

  • $ conda install -c biobakery fugassem

Option 2: Installing with pip

  • $ pip install fugassem
  • If you do not have write permissions to /usr/lib/, add the --user option to the install command. This installs the Python package into subdirectories of ~/.local/. On some platforms, you may also need to add ~/.local/bin/ to your $PATH, as it might not be included by default. You will know this is needed if you see the message fugassem: command not found when running FUGAsseM after installing with the --user option.

2. Predicting functions using FUGAsseM

In the VM, navigate to the correct directory with:

cd ~/Tutorials/fugassem/examples/

2.1 Coexpression-based prediction

The canonical function prediction workflow of FUGAsseM, which uses MTX coexpression profiles and raw GO annotations, has three steps: (1) preparing protein families and annotations, (2) building coexpression profiles of proteins within each taxon, and (3) building a machine learning classifier for function prediction. Please see Advanced Topics for steps (1) and (2).

2.1.1 Inputs

The input files for running coexpression-based prediction can be found in the fugassem/examples/input/ directory. If you installed it with pip or conda, download them here and save them to the fugassem/examples/input directory. If you installed from source, copy them from the fugassem source package to the fugassem/examples/input/ directory. The inputs include (1) the protein families MTX abundances file demo_proteinfamilies_rna_CPM.stratified_Species_mtx.tsv and (2) raw GO annotations for some of these protein families demo_proteinfamilies.GO.simple.tsv.

The input TSV files are described below.

demo_proteinfamilies_rna_CPM.stratified_Species_mtx.tsv

This file contains the MTX abundances of protein families normalized within a taxon and is used by FUGAsseM to generate the coexpression matrix. In this example, we are using a subset of gut microbiome MTX data from the HMP2 cohort of individuals with IBD and non-IBD controls. Here, a protein family cluster is equivalent to a UniRef90 cluster.

To view this file, type:

less -S input/demo_proteinfamilies_rna_CPM.stratified_Species_mtx.tsv

which yields:

ID	CSM5FZ3T_P	CSM5FZ46_P	CSM5FZ4C_P	CSM5FZ4G_P
Cluster_100559|k__Bacteria.p__Proteobacteria.c__Gammaproteobacteria.o__Enterobacterales.f__Enterobacteriaceae.g__Escherichia.s__Escherichia_coli	0	0	0	0
Cluster_100569|k__Bacteria.p__Proteobacteria.c__Gammaproteobacteria.o__Enterobacterales.f__Enterobacteriaceae.g__Escherichia.s__Escherichia_coli	0	0	0	0
Cluster_1008935|k__Bacteria.p__Proteobacteria.c__Gammaproteobacteria.o__Enterobacterales.f__Enterobacteriaceae.g__Escherichia.s__Escherichia_coli	0	0	0	0
Cluster_101048|k__Bacteria.p__Proteobacteria.c__Gammaproteobacteria.o__Enterobacterales.f__Enterobacteriaceae.g__Escherichia.s__Escherichia_coli	0	0	0	0
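Each row ID in this table joins a protein family and its taxonomic lineage with a "|" character. As a minimal sketch of how such IDs could be split when exploring the table outside FUGAsseM, the Python snippet below counts protein families per species. The function names and the toy table are illustrative assumptions and are not part of the FUGAsseM package; the snippet also assumes the species is the part of the lineage after the last "s__" prefix.

```python
import csv
import io


def split_stratified_id(row_id):
    """Split 'Cluster_X|k__...s__Genus_species' into (cluster, species)."""
    cluster, _, lineage = row_id.partition("|")
    # Assumption: the species name follows the last 's__' prefix in the lineage.
    _, sep, species = lineage.rpartition("s__")
    return cluster, (species if sep else lineage)


def count_families_per_species(handle):
    """Count protein family rows per species in a stratified abundance table."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row (ID followed by sample names)
    counts = {}
    for row in reader:
        if not row:
            continue
        _, species = split_stratified_id(row[0])
        counts[species] = counts.get(species, 0) + 1
    return counts


# Toy table in the same layout as the demo file (IDs and values are made up).
demo = (
    "ID\tS1\tS2\n"
    "Cluster_1|k__Bacteria.g__Escherichia.s__Escherichia_coli\t0\t3.5\n"
    "Cluster_2|k__Bacteria.g__Escherichia.s__Escherichia_coli\t1.0\t0\n"
    "Cluster_3|k__Bacteria.g__Bacteroides.s__Bacteroides_thetaiotaomicron\t2.0\t2.0\n"
)
counts = count_families_per_species(io.StringIO(demo))
print(counts)
```

To apply this to the demo input, pass an open file handle for input/demo_proteinfamilies_rna_CPM.stratified_Species_mtx.tsv instead of the toy table.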

demo_proteinfamilies.GO.simple.tsv

This file contains the raw GO annotations of the protein families in the HMP2 gut microbiomes.

To view this file, type:

less -S input/demo_proteinfamilies.GO.simple.tsv

which yields:

Cluster_100559  GO:0009058
Cluster_100559  GO:0016788
Cluster_1008935 GO:0005985
Cluster_1008935 GO:0005737
...
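The annotation file is a two-column listing with one protein family and one GO term per line, so a family with several terms appears on several lines. The sketch below shows one way this layout could be loaded into a dictionary for inspection. It is an illustration only: load_annotations is a hypothetical helper, not a FUGAsseM function, and it assumes the two columns are separated by whitespace.

```python
from collections import defaultdict


def load_annotations(lines):
    """Map each protein family to the set of terms annotated to it."""
    annotations = defaultdict(set)
    for line in lines:
        fields = line.split()
        # Skip blank or incomplete lines; expected layout: <family> <term>
        if len(fields) < 2:
            continue
        family, term = fields[0], fields[1]
        annotations[family].add(term)
    return dict(annotations)


# The four example rows shown above.
demo_lines = [
    "Cluster_100559\tGO:0009058",
    "Cluster_100559\tGO:0016788",
    "Cluster_1008935\tGO:0005985",
    "Cluster_1008935\tGO:0005737",
]
annotations = load_annotations(demo_lines)
for family, terms in sorted(annotations.items()):
    print(family, sorted(terms))
```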

2.1.2 Running FUGAsseM-MTX model for prediction

To run the canonical FUGAsseM function prediction workflow, type:

fugassem --basename demo_GO --input input/demo_proteinfamilies_rna_CPM.stratified_Species_mtx.tsv \
--input-annotation input/demo_proteinfamilies.GO.simple.tsv \
--output my_output/fugassem_output_mtx

Note:

  • In the command, replace my_output/fugassem_output_mtx with the path to the folder where you want to write the output files.
  • See the section on parallelization options to optimize the workflow run based on your computing resources.
  • The workflow runs with the default settings used to run all modules. These settings will work for most data sets. If you need to modify the default settings, you can use command line parameters to customize the characterization workflow settings. For example, you can run with one or more bypass options (for information on bypass options, see the section Workflow bypass mode).

2.1.3 Output

All output files will be stored in the my_output/fugassem_output_mtx directory. There are two subdirectories, main and merged. The main subdirectory contains one directory of prediction results for each taxon in the input data; in this example, Escherichia coli and Bacteroides thetaiotaomicron. The merged folder contains results from each taxon concatenated into a single result file. The main output files in this example are main/Bacteroides_thetaiotaomicron/demo_GO.Bacteroides_thetaiotaomicron.finalized_ML.prediction.tsv, main/Escherichia_coli/demo_GO.Escherichia_coli.finalized_ML.prediction.tsv, and merged/demo_GO.finalized_ML.prediction.tsv.

main/Bacteroides_thetaiotaomicron/demo_GO.Bacteroides_thetaiotaomicron.finalized_ML.prediction.tsv

This file (and the analogous file in the E. coli subdirectory) contains the predicted functions of protein family clusters in B. thetaiotaomicron. Functions are GO terms and a higher score indicates a more confident prediction. The 'raw_ann' column indicates if the protein family cluster was originally annotated with the predicted function, e.g. GO term.

To view this file, type:

less -S my_output/fugassem_output_mtx/main/Bacteroides_thetaiotaomicron/demo_GO.Bacteroides_thetaiotaomicron.finalized_ML.prediction.tsv

which yields:

feature func    category        score   raw_ann
Cluster_1024034 GO:0000155      GO      0.75    1
Cluster_1024034 GO:0003700      GO      0.73    1
Cluster_1024034 GO:0003824      GO      0.22    0
Cluster_1024034 GO:0004673      GO      0.8     1

merged/demo_GO.finalized_ML.prediction.tsv

This file is a merged file containing prediction results from the two taxa in this example. The columns remain the same except for the addition of the 'taxon' column.

To view this file, type:

less -S my_output/fugassem_output_mtx/merged/demo_GO.finalized_ML.prediction.tsv

which yields:

taxon   feature func    category        score   raw_ann
Bacteroides_thetaiotaomicron    Cluster_1024034 GO:0000155      GO      0.75    1
Bacteroides_thetaiotaomicron    Cluster_1024034 GO:0003700      GO      0.73    1
Bacteroides_thetaiotaomicron    Cluster_1024034 GO:0003824      GO      0.22    0
.
.
.
Escherichia_coli        Cluster_100559  GO:0000271      GO      0.45    0
Escherichia_coli        Cluster_100559  GO:0003677      GO      0.35    0
Escherichia_coli        Cluster_100559  GO:0003700      GO      0.61    0

Note: The output prediction files containing the terms 'coexp' instead of 'finalized' are identical since only coexpression is used for prediction.
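Because the 'raw_ann' column marks whether a family already carried the predicted term, rows with raw_ann equal to 0 and a high score are candidate new annotations. The following Python sketch filters a prediction table for such rows. The helper name novel_predictions and the 0.5 score cutoff are illustrative assumptions and not recommendations from FUGAsseM; choose a cutoff appropriate for your data.

```python
import csv
import io


def novel_predictions(handle, min_score=0.5):
    """Return predictions absent from the raw annotation, best score first."""
    hits = []
    for row in csv.DictReader(handle, delimiter="\t"):
        score = float(row["score"])
        if float(row["raw_ann"]) == 0 and score >= min_score:
            # The 'taxon' column exists only in the merged file.
            hits.append((row.get("taxon", ""), row["feature"], row["func"], score))
    hits.sort(key=lambda hit: hit[3], reverse=True)
    return hits


# Rows copied from the merged example output above.
demo = "\n".join([
    "taxon\tfeature\tfunc\tcategory\tscore\traw_ann",
    "Bacteroides_thetaiotaomicron\tCluster_1024034\tGO:0000155\tGO\t0.75\t1",
    "Bacteroides_thetaiotaomicron\tCluster_1024034\tGO:0003824\tGO\t0.22\t0",
    "Escherichia_coli\tCluster_100559\tGO:0000271\tGO\t0.45\t0",
    "Escherichia_coli\tCluster_100559\tGO:0003700\tGO\t0.61\t0",
]) + "\n"
hits = novel_predictions(io.StringIO(demo))
for hit in hits:
    print(*hit, sep="\t")
```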

2.2 Integrated prediction

When other community-wide data are available, FUGAsseM can predict functions by integrating coexpression with other types of evidence. This workflow adds two steps: building an individual machine learning classifier for each type of evidence (including coexpression, as discussed above), and integrating these classifiers into an ensemble classifier for the final function prediction.

This optional evidence can be formatted as vector-based evidence and matrix-based evidence, where the former represents a vector of gene-over-function relationships and the latter represents a matrix of the interplay networks among genes within species (e.g. co-expression, co-annotation or co-occurrence patterns). Evidence such as homology between protein families, gene neighborhood, and domain-domain interactions may be included. Information on how to generate such additional input evidence data is included under Advanced Topics.

2.2.1 Inputs

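The input files for running integrated prediction can be found in the fugassem/examples/input/ directory. If you installed it with pip or conda, download them here and save them to the fugassem/examples/input directory. If you installed from source, copy them from the fugassem source package to the fugassem/examples/input/ directory. The inputs include the two files described in Section 2.1.1 plus three additional evidence files: demo_proteinfamilies.GO.homology.tsv, demo_proteinfamilies.DDI.simple.tsv, and demo_proteinfamilies.contig.simple.tsv.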

Let's see the contents of the additional evidence files:

demo_proteinfamilies.GO.homology.tsv

This file provides GO annotations of protein families inferred from homology (UniRef50) between protein families and from the raw GO annotations. It will be used as one type of vector evidence.

To view this file, type:

less -S input/demo_proteinfamilies.GO.homology.tsv 

which yields:

ID	GO:0000155__seqSimilarity	GO:0000271__seqSimilarity	GO:0003677__seqSimilarity	GO:0003700__seqSimilarity
Cluster_100559	0	0	0	0
Cluster_100569	0	0	0	0
Cluster_1008935	0	0	0	0
Cluster_101048	0	1	0	0
...

demo_proteinfamilies.DDI.simple.tsv

This file provides the domain-domain interaction annotations of protein families that will be used to build DDI networks as one type of matrix evidence.

To view this file, type:

less -S input/demo_proteinfamilies.DDI.simple.tsv 

which yields:

Cluster_100559  PF00501:PF00109
Cluster_100559  PF00501:PF00975
Cluster_100559  PF00501:PF02801
Cluster_100559  PF00668:PF00109
...

demo_proteinfamilies.contig.simple.tsv

This file provides the contig annotations of protein families that will be used to build co-contig networks as one type of matrix evidence.

To view this file, type:

less -S input/demo_proteinfamilies.contig.simple.tsv

which yields:

Cluster_7792    CSM7KORG_contig_k105_16193
Cluster_10199   CSM79HI7_contig_k105_24044
Cluster_10250   HSM7J4MW_contig_k105_17216
Cluster_12531   CSM79HI7_contig_k105_3486
...
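To make the idea of a co-contig network concrete, the sketch below links two protein families whenever they are annotated to the same contig. This is only an illustration of the concept under that simple assumption: co_contig_pairs is a hypothetical helper, the row marked as hypothetical is made up so that the toy data contains a shared contig, and FUGAsseM's own network construction may weight or filter links differently.

```python
from collections import defaultdict
from itertools import combinations


def co_contig_pairs(records):
    """Return sorted pairs of protein families that share at least one contig."""
    by_contig = defaultdict(set)
    for family, contig in records:
        by_contig[contig].add(family)
    pairs = set()
    for families in by_contig.values():
        # Every pair of families on the same contig becomes one network edge.
        for a, b in combinations(sorted(families), 2):
            pairs.add((a, b))
    return sorted(pairs)


records = [
    ("Cluster_7792", "CSM7KORG_contig_k105_16193"),
    ("Cluster_10199", "CSM79HI7_contig_k105_24044"),
    ("Cluster_12531", "CSM79HI7_contig_k105_3486"),
    ("Cluster_A", "CSM79HI7_contig_k105_24044"),  # hypothetical row for illustration
]
pairs = co_contig_pairs(records)
print(pairs)
```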

2.2.2 Running FUGAsseM-full model for prediction

To run integrated function prediction, make sure you are in the fugassem/examples/ directory and run:

fugassem --basename demo_GO --input input/demo_proteinfamilies_rna_CPM.stratified_Species_mtx.tsv \
--input-annotation input/demo_proteinfamilies.GO.simple.tsv \
--vector-list input/demo_proteinfamilies.GO.homology.tsv \
--matrix-list input/demo_proteinfamilies.DDI.simple.tsv,input/demo_proteinfamilies.contig.simple.tsv \
--output my_output/fugassem_output

Note:

  • The evidence data can be supplied as vector-based evidence (via --vector-list) that will be formatted into a single evidence vector, representing gene-over-function relationships.
  • Alternatively, evidence data can be supplied as matrix-based evidence (via --matrix-list) that will be formatted as an evidence matrix, representing the interplay networks among genes within species (e.g. co-expression, co-annotation, or co-occurrence patterns).
  • In the command, replace my_output/fugassem_output with the path to the folder where you want to write the output files.

2.2.3 Outputs

All output files will be stored in the my_output/fugassem_output directory. The architecture of this directory is similar to that of fugassem_output_mtx. Prediction results for each taxon are stored in the main folder and combined results in the merged folder. Further, each taxon subdirectory contains function prediction results based on integrated evidence (demo_GO.Escherichia_coli.finalized_ML.prediction.tsv) as well as individual evidence (other output TSV files).

Outputs of each taxon

To view all output files for Escherichia coli, type:

ls my_output/fugassem_output/main/Escherichia_coli/

which yields:

data						      demo_GO.Escherichia_coli.vector1_ML.prediction.tsv
demo_GO.Escherichia_coli.coexp_ML.prediction.tsv      Escherichia_coli.fugassem.log
demo_GO.Escherichia_coli.finalized_ML.prediction.tsv  feature_maps.txt
demo_GO.Escherichia_coli.matrix1_ML.prediction.tsv    prediction
demo_GO.Escherichia_coli.matrix2_ML.prediction.tsv    preprocessing

Here, the most important output file of each taxon is demo_GO.Escherichia_coli.finalized_ML.prediction.tsv.

To view this file, type:

less -S my_output/fugassem_output/main/Escherichia_coli/demo_GO.Escherichia_coli.finalized_ML.prediction.tsv

which yields:

feature func    category        score   raw_ann
Cluster_100559  GO:0000271      GO      0.16    0
Cluster_100559  GO:0003677      GO      0.17    0
Cluster_100559  GO:0003700      GO      0.29    0
Cluster_100559  GO:0003979      GO      0.15    0

Note: The per-taxon prediction results based on individual evidence are in the files my_output/fugassem_output/main/Escherichia_coli/demo_GO.Escherichia_coli.${EVIDENCE_TYPE}_ML.prediction.tsv (where $EVIDENCE_TYPE is the basename of each evidence, e.g. coexp, vector1, or matrix1).

Merged outputs

To view the merged finalized output file, type:

less -S my_output/fugassem_output/merged/demo_GO.finalized_ML.prediction.tsv

which yields:

taxon   feature func    category        score   raw_ann
Bacteroides_thetaiotaomicron    Cluster_1024034 GO:0000155      GO      0.97    1
Bacteroides_thetaiotaomicron    Cluster_1024034 GO:0003700      GO      0.97    1
Bacteroides_thetaiotaomicron    Cluster_1024034 GO:0003824      GO      0.34    0
.
.
.
Escherichia_coli        Cluster_100559  GO:0000271      GO      0.16    0
Escherichia_coli        Cluster_100559  GO:0003677      GO      0.17    0
Escherichia_coli        Cluster_100559  GO:0003700      GO      0.29    0

Note: The merged prediction results based on individual evidence are in the files my_output/fugassem_output/merged/demo_GO.${EVIDENCE_TYPE}_ML.prediction.tsv (where $EVIDENCE_TYPE is the basename of each evidence).

2.2.4 Visualizing results

FUGAsseM provides a visualization utility to quantify the performance of prediction using a cross-validation approach (i.e. treating the original annotation as true labels).

For example, to quantify the performance of prediction based on only coexpression in E. coli, type:

fugassem_performance_vis -i my_output/fugassem_output/main/Escherichia_coli/demo_GO.Escherichia_coli.coexp_ML.prediction.tsv \
-o my_output/fugassem_output/main/Escherichia_coli/demo_GO.Escherichia_coli.coexp_ML

which yields the PDF file my_output/fugassem_output/main/Escherichia_coli/demo_GO.Escherichia_coli.coexp_ML.each.auc.pdf, containing the AUC figures of each GO term.

To quantify the performance of prediction based on integrated evidence in E. coli, type:

fugassem_performance_vis -i my_output/fugassem_output/main/Escherichia_coli/demo_GO.Escherichia_coli.finalized_ML.prediction.tsv \
-o my_output/fugassem_output/main/Escherichia_coli/demo_GO.Escherichia_coli.finalized_ML

which yields the PDF file my_output/fugassem_output/main/Escherichia_coli/demo_GO.Escherichia_coli.finalized_ML.each.auc.pdf, containing the AUC figures of each GO term.
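To get a feel for what a per-term AUC measures, the sketch below computes a ROC AUC for each GO term directly from the 'score' and 'raw_ann' columns of a prediction table, treating raw_ann as the true label. It uses the standard rank-sum (Mann-Whitney U) formulation with average ranks for ties. This is an independent illustration and not necessarily how fugassem_performance_vis computes its figures; the function names and the toy table are assumptions.

```python
import csv
import io
from collections import defaultdict


def auc(scores_and_labels):
    """ROC AUC via the rank-sum (Mann-Whitney U) statistic; None if undefined."""
    items = sorted(scores_and_labels, key=lambda item: item[0])
    n_pos = sum(1 for _, label in items if label == 1)
    n_neg = len(items) - n_pos
    if n_pos == 0 or n_neg == 0:
        return None  # AUC needs at least one positive and one negative
    rank_sum_pos = 0.0
    i = 0
    while i < len(items):
        j = i
        while j < len(items) and items[j][0] == items[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # average of the 1-based ranks i+1..j
        rank_sum_pos += avg_rank * sum(1 for k in range(i, j) if items[k][1] == 1)
        i = j
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)


def auc_per_term(handle):
    """Group a prediction table by 'func' and compute one AUC per term."""
    by_term = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        by_term[row["func"]].append((float(row["score"]), int(float(row["raw_ann"]))))
    return {term: auc(pairs) for term, pairs in by_term.items()}


# Toy prediction table with made-up features, terms, and scores.
demo = "\n".join([
    "feature\tfunc\tcategory\tscore\traw_ann",
    "F1\tGO:1\tGO\t0.9\t1",
    "F2\tGO:1\tGO\t0.2\t0",
    "F3\tGO:1\tGO\t0.6\t1",
    "F4\tGO:1\tGO\t0.7\t0",
    "F1\tGO:2\tGO\t0.4\t0",
    "F2\tGO:2\tGO\t0.3\t0",
]) + "\n"
per_term = auc_per_term(io.StringIO(demo))
print(per_term)
```

Terms with no annotated family (or no unannotated family) have no defined AUC, which the sketch reports as None.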

2.2.5 Intermediate files

  • Preprocessing features of each taxon
    • FUGAsseM preprocesses input evidence data and prepares feature tables for machine learning per taxon. Each type of feature will be used to build an ML classifier. These input data are in the folder: $OUTPUT_DIR/main/$TAXON_NAME/data/.
    • All intermediate preprocessing results are in the folder: $OUTPUT_DIR/main/$TAXON_NAME/preprocessing/.
    • All intermediate prediction results are in the folder per taxon: $OUTPUT_DIR/main/$TAXON_NAME/prediction/.

3. Predicting functions using FUGAsseM with advanced settings

Given the hierarchical structure of the Gene Ontology, a very broad term provides limited information about the function of a protein. Conversely, a very specific term with few proteins assigned to it is hard to predict, because there is little existing annotation to learn from. To address this, FUGAsseM provides several approaches for selecting informative terms for prediction: terms that contain at least a certain number of genes and have no child term that also passes this criterion. Predicting only informative GO terms excludes very broad terms (which represent general, non-informative functions) as well as overly specific terms (which cover too few genes to provide enough data for function prediction).
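The selection rule described above can be sketched on a toy hierarchy. In the Python snippet below, a term is kept if it covers at least k genes and none of its children also covers at least k genes. The snippet assumes that genes annotated to a child term also count toward its parents and that the hierarchy has no cycles; the term names, gene names, and helper functions are all hypothetical, and FUGAsseM's own implementation may differ in detail.

```python
def propagate(children, direct):
    """Return term -> genes annotated to the term or to any of its descendants."""
    cache = {}

    def genes(term):
        if term not in cache:
            acc = set(direct.get(term, ()))
            for child in children.get(term, ()):
                acc |= genes(child)
            cache[term] = acc
        return cache[term]

    terms = set(children) | set(direct)
    for kids in children.values():
        terms |= set(kids)
    return {term: genes(term) for term in terms}


def informative_terms(children, direct, k):
    """Terms with at least k genes and no child that also has at least k genes."""
    total = propagate(children, direct)
    selected = set()
    for term, term_genes in total.items():
        if len(term_genes) < k:
            continue  # too specific: not enough genes to learn from
        if any(len(total[child]) >= k for child in children.get(term, ())):
            continue  # too broad: a more specific child already qualifies
        selected.add(term)
    return selected


# Toy hierarchy: root -> A, B; A -> A1, A2 (all names are made up).
children = {"root": ["A", "B"], "A": ["A1", "A2"]}
direct = {"A1": {"g1", "g2", "g3"}, "A2": {"g4"}, "B": {"g5", "g6"}}
print(sorted(informative_terms(children, direct, k=3)))
print(sorted(informative_terms(children, direct, k=2)))
```

With k=3 only A1 is kept: A and root are excluded because a child already qualifies, and A2 and B are too small. Lowering k to 2 additionally keeps B.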

3.1 Run FUGAsseM with union informative GO terms

FUGAsseM can build a "union" informative GO set that is applied to all taxa. To build this set, informative terms are first defined per taxon as terms containing at least k genes (e.g. >5) without any children of size k; these taxon-specific informative terms are then combined across taxa while resolving parent-child relationships, resulting in a “union” set of terms without any informative children. This "union" set balances general and specific terms, and most of its terms are only partially covered across taxa, indicating less general functions. We therefore recommend it as the default.

fugassem --go-mode union --go-level 5 --threads 2 --local-job 2 \
--basename demo_GO --input input/demo_proteinfamilies_rna_CPM.stratified_Species_mtx.tsv \
--input-annotation input/demo_proteinfamilies.GO.simple.tsv \
--vector-list input/demo_proteinfamilies.GO.homology.tsv \
--matrix-list input/demo_proteinfamilies.DDI.simple.tsv,input/demo_proteinfamilies.contig.simple.tsv \
--output my_output/fugassem_output_union

Note:

  • In the command, replace my_output/fugassem_output_union with the path to the folder where you want to write the output files.
  • Architecture of the output folder is similar to that of fugassem_output with the main and merged subdirectories.
  • This run will assign annotations of integrated informative GO terms across taxa to the protein families.
  • See the section on parallelization options to optimize the workflow run based on your computing resources.
  • The workflow runs with the default settings used to run all modules. These settings will work for most data sets. If you need to modify the default settings, you can use command line parameters to customize the characterization workflow settings. For example, you can run with one or more bypass options (for information on bypass options, see the section Workflow bypass mode).

To view the main output file, type:

less -S my_output/fugassem_output_union/main/Escherichia_coli/demo_GO.Escherichia_coli.finalized_ML.prediction.tsv

which yields:

feature func    category        score   raw_ann
Cluster_100559  GO:0000271:[BP] polysaccharide biosynthetic process     BP      0.05    0
Cluster_100559  GO:0003700:[MF] DNA-binding transcription factor activity       MF      0.22    0
Cluster_100559  GO:0003979:[MF] UDP-glucose 6-dehydrogenase activity    MF      0.04    0
Cluster_100559  GO:0005737:[CC] cytoplasm       CC      0.43    0
...

3.2 Run FUGAsseM with universal informative GO terms

Alternatively, FUGAsseM can build a "universal" informative GO set that is defined across all taxa and applied to all taxa. In this set, an informative term is a term containing at least k genes (e.g. >5) without any children of size k, counted among all taxa. This "universal" set makes cross-taxa comparison simpler but may lose sensitivity to some taxon-specific terms.

fugassem --go-mode universal --go-level 5 --threads 2 --local-job 2 \
--basename demo_GO --input input/demo_proteinfamilies_rna_CPM.stratified_Species_mtx.tsv \
--input-annotation input/demo_proteinfamilies.GO.simple.tsv \
--vector-list input/demo_proteinfamilies.GO.homology.tsv \
--matrix-list input/demo_proteinfamilies.DDI.simple.tsv,input/demo_proteinfamilies.contig.simple.tsv  \
--output my_output/fugassem_output_universal

Note:

  • In the command, replace my_output/fugassem_output_universal with the path to the folder where you want to write the output files.
  • This run will assign annotations of universal informative GO terms across species to the protein families.

To view the main output file, type:

less -S my_output/fugassem_output_universal/main/Escherichia_coli/demo_GO.Escherichia_coli.finalized_ML.prediction.tsv

which yields:

feature func    category        score   raw_ann
Cluster_100559  GO:0000271:[BP] polysaccharide biosynthetic process     BP      0.0     0
Cluster_100559  GO:0003700:[MF] DNA-binding transcription factor activity       MF      0.1     0
Cluster_100559  GO:0003979:[MF] UDP-glucose 6-dehydrogenase activity    MF      0.01    0
Cluster_100559  GO:0005737:[CC] cytoplasm       CC      0.45    0
...

3.3 Run FUGAsseM with taxon-specific informative GO terms

Additionally, FUGAsseM can build taxon-specific ("bug-specific") informative GO sets, in which each taxon is predicted with its own set of informative GO terms. Here, an informative term is a term containing at least k genes (e.g. >5) without any children of size k within that taxon. This approach can predict taxon-specific terms but makes cross-taxa comparison much more complex.

fugassem --go-mode "taxon-specific" --go-level 5 \
--threads 2 --local-job 2 \
--basename demo_GO --input input/demo_proteinfamilies_rna_CPM.stratified_Species_mtx.tsv \
--input-annotation input/demo_proteinfamilies.GO.simple.tsv \
--vector-list input/demo_proteinfamilies.GO.homology.tsv \
--matrix-list input/demo_proteinfamilies.DDI.simple.tsv,input/demo_proteinfamilies.contig.simple.tsv \
--output my_output/fugassem_output_bug_specific

Note:

  • In the command, replace my_output/fugassem_output_bug_specific with the path to the folder where you want to write the output files.
  • This run will assign annotations of informative GO terms specific to each taxon to the protein families.

To view the main output file, type:

less -S my_output/fugassem_output_bug_specific/main/Escherichia_coli/demo_GO.Escherichia_coli.finalized_ML.prediction.tsv

which yields:

feature func    category        score   raw_ann
Cluster_100559  GO:0000271:[BP] polysaccharide biosynthetic process     BP      0.03    0
Cluster_100559  GO:0003979:[MF] UDP-glucose 6-dehydrogenase activity    MF      0.02    0
Cluster_100559  GO:0005737:[CC] cytoplasm       CC      0.61    0
Cluster_100559  GO:0005886:[CC] plasma membrane CC      0.28    0
...

3.4 Run FUGAsseM with selecting taxa

Moreover, FUGAsseM provides advanced options to further select taxa for function prediction based on their detection by MTX and annotation coverage of functions for predictions.

fugassem --minimum-prevalence 0.01 --minimum-number 20 --minimum-coverage 0.1 \
--go-mode union --go-level 5 --threads 2 --local-job 2 \
--vector-list input/demo_proteinfamilies.GO.homology.tsv \
--matrix-list input/demo_proteinfamilies.DDI.simple.tsv,input/demo_proteinfamilies.contig.simple.tsv \
--basename demo_GO --input input/demo_proteinfamilies_rna_CPM.stratified_Species_mtx.tsv \
--input-annotation input/demo_proteinfamilies.GO.simple.tsv \
--output my_output/fugassem_output_union_advanced

Note:

  • In the command, replace my_output/fugassem_output_union_advanced with the path to the folder where you want to write the output files.
  • This process will select species that contain at least a certain fraction of annotated proteins with higher abundance and prevalence than the minimums.
  • This run will assign annotations of integrated informative GO terms across selected taxa to the protein families.

Notice that only Escherichia coli has been selected for function prediction. The list of selected taxa can be found in taxa_abunds_files.txt. To view the main output file, type:

less -S my_output/fugassem_output_union_advanced/main/Escherichia_coli/demo_GO.Escherichia_coli.finalized_ML.prediction.tsv

which yields:

feature func    category        score   raw_ann
Cluster_100559  GO:0000271:[BP] polysaccharide biosynthetic process     BP      0.02    0
Cluster_100559  GO:0003979:[MF] UDP-glucose 6-dehydrogenase activity    MF      0.06    0
Cluster_100559  GO:0005737:[CC] cytoplasm       CC      0.59    0
Cluster_100559  GO:0005886:[CC] plasma membrane CC      0.19    0
...

4. Advanced topics

4.1 Preparing stratified MTX abundance input

FUGAsseM takes as input a stratified MTX abundance table that is normalized within each taxon. Users may provide this table from their own analysis or use one of the two options as follows.

Option 1: Reference-based approach

Users can obtain this type of input using other tools such as HUMAnN, which aligns MTX shotgun reads against reference proteins and quantifies MTX abundance within species. The stratified MTX abundance table should then be normalized within each taxon using either HUMAnN’s utility humann_renorm_table or FUGAsseM’s utility fugassem_abundance_normalization.

Option 2: Assembly-based approach

  • FUGAsseM includes a utility called fugassem_generate_stratified_mtx_input to prepare the required MTX abundance input file, in which protein family abundances are normalized within each taxon.
  • Five inputs are required including (1) QC'ed shotgun sequencing MTX file, (2) MGX-assembled, non-redundant gene catalogs, (3) nucleotide sequences of the representative genes in the catalogs, (4) protein family level clusters of representative genes, and (5) taxonomy of the protein families.
  • This utility (1) maps MTX shotgun reads against MGX-assembled gene catalogs, (2) sums up the quantified abundance of gene catalogs to the level of protein families, and (3) normalizes the protein-family-based MTX abundance within each taxon. Example files are provided in the raw_input directory in the VM. The stratified MTX abundance file can be created with the command:

fugassem_generate_stratified_mtx_input --taxon-level Species \
--gene-catalog raw_input/demo_genecatalogs.clstr \
--gene-catalog-seq raw_input/demo_genecatalogs.centroid.fna \
--protein-family raw_input/demo_proteinfamilies.clstr \
--family-taxonomy raw_input/demo_proteinfamilies_annotation.taxonomy.tsv \
--basename demo --input raw_input --output my_output/fugassem_input_prep

  • The main output is a TSV file, my_output/fugassem_input_prep/demo.proteinfamilies.nrm.tsv, containing MTX abundances of protein families normalized within a taxon, which looks like:

ID	sample1	sample2
Cluster_100569|k__Bacteria.p__Proteobacteria.c__Gammaproteobacteria.o__Enterobacterales.f__Enterobacteriaceae.g__Escherichia.s__Escherichia_coli	26585.4	5199.89
Cluster_1022788|k__Bacteria.p__Proteobacteria.c__Gammaproteobacteria.o__Enterobacterales.f__Enterobacteriaceae.g__Escherichia.s__Escherichia_coli	0	19218.2
Cluster_1022791|k__Bacteria.p__Proteobacteria.c__Gammaproteobacteria.o__Enterobacterales.f__Enterobacteriaceae.g__Escherichia.s__Escherichia_coli	0	18934.2
Cluster_1022793|k__Bacteria.p__Proteobacteria.c__Gammaproteobacteria.o__Enterobacterales.f__Enterobacteriaceae.g__Escherichia.s__Escherichia_coli	0	5432.9
...
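As a rough illustration of what normalizing within each taxon can mean, the sketch below rescales abundances so that, for every sample, the protein families of one taxon sum to a fixed constant. This sum-to-constant scheme, the scale value, and the function name are assumptions made for illustration; they are not taken from fugassem_abundance_normalization or humann_renorm_table, whose exact behavior should be checked in their own documentation.

```python
from collections import defaultdict


def normalize_within_taxon(table, scale=1e6):
    """Rescale each taxon's families to sum to `scale` within every sample.

    table maps 'family|taxon' row IDs to a list of per-sample abundances.
    """
    if not table:
        return {}
    n_samples = len(next(iter(table.values())))
    totals = defaultdict(lambda: [0.0] * n_samples)
    for row_id, values in table.items():
        taxon = row_id.partition("|")[2]
        for i, value in enumerate(values):
            totals[taxon][i] += value
    normalized = {}
    for row_id, values in table.items():
        taxon = row_id.partition("|")[2]
        normalized[row_id] = [
            # Samples where the taxon was not detected stay at zero.
            scale * value / totals[taxon][i] if totals[taxon][i] > 0 else 0.0
            for i, value in enumerate(values)
        ]
    return normalized


# Toy table with two taxa and two samples (all values are made up).
table = {
    "F1|s__A": [1.0, 0.0],
    "F2|s__A": [3.0, 0.0],
    "F3|s__B": [2.0, 5.0],
}
normalized = normalize_within_taxon(table, scale=100)
for row_id, values in normalized.items():
    print(row_id, values)
```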

4.2 Preparing evidence input

FUGAsseM provides a utility to prepare the optional evidence data used for prediction. It generates evidence files such as raw functional annotations, domain-based annotation evidence, and assembly-based annotation evidence.

  • FUGAsseM provides fugassem_generate_annotation_input to prepare evidence input files based on the outputs of MetaWIBELE and homology-based annotation that are used by FUGAsseM for function prediction.
  • Inputs may include the annotation file of MetaWIBELE (which includes Pfam-based domain-domain interactions), information about gene neighborhood (contig), and homology (UniRef50) of protein families.
  • The utility will then create the respective vector-based and matrix-based evidence files that are used in the integrated (full model) prediction. Example files are provided in the raw_input directory in the VM. The output files can be created with the command:

fugassem_generate_annotation_input --input raw_input/demo_proteinfamilies_annotation.tsv \
--clust-file raw_input/demo_proteinfamilies.clstr \
--contig --gene-info raw_input/demo_gene_info.tsv \
--homology --homology-ann raw_input/demo_map_proteinfamilies.ident50.tsv \
--basename demo_proteinfamilies --output my_output/fugassem_input_prep

  • The following are the main output files (contents shown under Section 2.2.1):

my_output/fugassem_input_prep/demo_proteinfamilies.GO.simple.tsv
my_output/fugassem_input_prep/demo_proteinfamilies.DDI.simple.tsv
my_output/fugassem_input_prep/demo_proteinfamilies.contig.simple.tsv
my_output/fugassem_input_prep/demo_proteinfamilies.GO.homology.tsv