Skip to content

metawibele

Yancong Zhang edited this page Jun 11, 2023 · 76 revisions

Welcome to the MetaWIBELE tutorial

MetaWIBELE (Workflow to Identify novel Bioactive Elements in the microbiome) is a workflow to efficiently and systematically identify and prioritize potentially bioactive (and often uncharacterized) gene products in microbial communities. It prioritizes candidate gene products from assembled metagenomes using a combination of sequence homology, secondary-structure-based functional annotations, phylogenetic binning, ecological distribution, and association with environmental parameters or phenotypes to target candidate bioactives.

For more information about the method and its associated publication, visit our summary page and read the MetaWIBELE User Manual.

Support for MetaWIBELE is available via the MetaWIBELE channel of the bioBakery Support Forum.


Contents


Overview

workflow.png


1. Setup

1.1 Requirements

  1. Python (version >= 3.6, requiring numpy, pandas, scipy packages; tested 3.6, 3.7)
  2. AnADAMA2 (version >= 0.7.4; tested 0.7.4, 0.8.0)
  3. CD-hit (version >= 4.7; tested 4.7)
  4. Diamond (version >= 0.9.24; tested 0.9.24)
  5. MSPminer (version >= 1.0.0; licensed software; tested version 1.0.0)
  6. MaAsLin2 (version >= 1.5.1; tested 1.5.1)
  7. Interproscan (version >= 5.31-70) (installing with activating Phobius/SignalP/TMHMM analyses; InterProScan 5.51-85.0 or later are recommended for potential simpler installation; used for domain/motif annotation; tested 5.31-70, 5.51-85.0)
  8. Signalp (version >= 4.1; licensed software; installing with integrating in interproscan; used for domain/motif annotation; tested 4.1)
  9. TMHMM (version >= 2.0; licensed software; installing with integrating in interproscan; used for domain/motif annotation; tested 2.0)
  10. Phobius (version >= 1.01; licensed software; installing with integrating in interproscan; used for domain/motif annotation; tested 1.01)
  11. PSORTb (version >= 3.0) (licensed software; used for domain/motif annotation; tested 3.0)
  12. Optional: only required if using MetaWIBELE utilities to prepare inputs for MetaWIBELE using metagenomic sequencing reads
    • MEGAHIT (version >= 1.1.3; tested 1.1.3)
    • Prokka (version >= 1.14-dev; recommend to not set '-c' parameter when running prodigal with metagenome mode; tested 1.14-dev)
    • Prodigal (version >= 2.6; tested 2.6)
    • USEARCH (version >= 9.0.2132; licensed software; tested 9.0.2132)
    • Bowtie2 (version >= 2.3.2; tested 2.3.2)
    • SAMtools (version >= 1.9; tested 1.9)
    • featureCounts (version >= 1.6.2; tested 1.6.2)

Note: Please install the required software in a location in your $PATH. If you always run with gene families (non-redundant gene catalogs), the optional softwares are not required. Also if you always run with one or more bypass options (for information on bypass options, see optional arguments to the section Workflow by bypass mode).The software required for the steps you bypass does not need to be installed.

1.2 Installation

Note: If you are using a bioBakery machine image (e.g. in Google Cloud) you do not need to install MetaWIBELE because the tool and its dependencies are already installed.

1.2.1 Install MetaWIBELE

You can choose one of the following options to install the MetaWIBELE package.

Option 1: Installing with docker

  • $ docker pull biobakery/metawibele
  • This docker image includes most of the dependent software packages.
  • Large software packages and those with licenses are NOT included in this image and needed to be installed additionally:
    • Users should review the license terms and install these packages manually.
    • Softwares with the license : MSPminer, Signalp, TMHMM, Phobius, PSORTb
    • Softwares with large size: Interproscan (Note: We recommend installing InterProScan 5.51-85.0 (requiring at least Java 11) or later for potential simpler installation, and active Phobius/SignalP/TMHMM analyses by customizing your interproscan.properties configuration, see more details from InterProScan document).

Option 2: Installing with pip

  • $ pip install metawibele
  • If you do not have write permissions to /usr/lib/, then add the option --user to the install command. This will install the python package into subdirectories of ~/.local/. Please note when using the --user install option on some platforms, you might need to add ~/.local/bin/ to your $PATH as it might not be included by default. You will know if it needs to be added if you see the following message metawibele: command not found when trying to run MetaWIBELE after installing with the --user option.

Option 3: Installing with conda

  • $ conda install -c biobakery metawibele

1.2.2 Install dependent databases

1.2.2.1 Demo database for the purposes of running demo data

Note: If you are using a bioBakery machine image (e.g. in Google Cloud) you do not need to install dependent databases that are already installed

If not installed in the bioBakery machine image, for the purposes of this tutorial alone, please download demo databases to reduce run time:

  • Download the annotation files and uncompress into demo_uniref_database folder: demo_uniref_database.tar.gz
  • Alternatively, run the following command to download the annotation files into demo_uniref_database: $ metawibele_download_database --database DEMO --build demo_uniref --install-location $UNIREF_LOCATION
1.2.2.2 Whole database for general use

UniRef database is required if you will use MetaWIBELE to do global-homology based annotation and taxonomic annotation. We have built the database based on UniProt/UniRef90 2019_01 sequences and annotations. You can use any one of the following options to install the database with providing $UNIREF_LOCATION as the location for installation.

NOTE: Please point to this location for the default uniref database in the global config file (metawibele.cfg, see details in the section 1.2.3 Prepare configuration files). Alternatively, you can either set the location with the environment variable $UNIREF_LOCATION, or move the downloaded database to the folder named uniref_database in the current working directory.

Option 1: Download and uncompress them into location $UNIREF_LOCATION

Option 2: Run the following commands to install the databases into the location $UNIREF_LOCATION

  • download the UniRef90 sequences indexed by Diamond v0.9.24 into $UNIREF_LOCATION: $ metawibele_download_database --database uniref --build uniref90_diamond --install-location $UNIREF_LOCATION
  • download the UniRef90 annotation files into $UNIREF_LOCATION: $ metawibele_download_database --database uniref --build uniref90_annotation --install-location $UNIREF_LOCATION

1.2.3 Check install

To check out the install of MetaWIBELE packages and all dependencies (tools and databases), run the command:

  • $ metawibele_check_install

    Which yields (highly abbreviated):

     #----------------- Checking the install of metawibele packages -----------------#
     ###### Checking anadama2 package install ######
     Checking anadama2 install...........OK
     
     ###### Checking metawibele package install ######
     Checking metawibele install...........OK
     
     #----------------- Checking the install of required dependent tools -----------------#
     ###### Checking CD-hit install ######
     ...
     
     #----------------- Checking the install of optional dependent tools -----------------#
     ###### Checking MEGAHIT install ######
     MEGAHIT v1.1.3
    
     Checking MEGAHIT install...........OK
     ...
    
  • By default, it will check both MetaWIBEKE packages and all dependencies.

  • Alternatively, add the option "--types {metawibele,required,optional,all}" to check the install of specific package or dependency.

1.2.4 Prepare configuration files

To run MetaWIBELE, you are required to customize the global configuration file metawibele.cfg and make sure that it's in your current working directory.

NOTE: De default, MetaWIBELE will use the global configurations from metawibele.cfg in the current working directory. Alternatively you can always provide the location of the global configuration file you would like to use with the "--global-config " option to metawibele (see more in the section How to run).

  1. Download global configuration template

    You can use one of the following options to get the configuration template.

    • Option 1: obtain copies by right-clicking the link and selecting "save link as": metawibele.cfg
    • Option 2: run this command to download global configuration file: $ metawibele_download_config --config-type global
  2. Customize the global configuration

    The path of the dependent databases is required for customization. Most of other sections can be left as defaults.

    • Customize the path of dependent databases (required):

       [database]
       # The absolute path of UniRef databases folder.
       uniref_db = 
       # The domain databases used by MetaWIBELE. [data_path] provide the absolute path of the domain databases folder; [none] use the dedault domain databases installed in the metawibele package. [ Default: none ]
       domain_db = none 
      
    • Customize basic information for outputs (e.g. prefix name of output files, etc.) (optional):

       [basic]
       # Study name. [ Default: MGX ]
       study = MGX
       # The prefix name for output results. [ Default: metawibele ]
       basename = metawibele
      
    • Customize applied computational resources (e.g. CPU cores, memory, etc.) (optional):

       [computation]
       # The number of cores that you are requesting. [ Default: 1 ]
       threads = 1
       # The amount of memory (in MB) that you will be using for your job. [ Default: 20000 ] 
       memory = 20000
       # The amount of time (in minute) that you will be using for your job. [ Default: 60 ]
       time = 60
      
    • Customize parameter settings for abundance-based and domain/motif-based annotations (optional):

       [abundance]
       # The method for normalization [Choices: cpm, relab]. [cpm] copies per million units (sum to 1 million); [relab] relative abundance (sum to 1). [ Default: cpm ]  
       normalize = cpm
       # The minimum abundance for each feature [ Default: 0 ]   
       abundance_detection_level = 0
      
       [msp]
       # The absolute path of the config file used by MSPminer. [config_file] provide the mspminer config file; [none] use the dedault config files installed in the metawibele package. [ Default: none ]
       mspminer = none
       # The minimum fraction of taxonomy classified genes in each MSP [0-1]. [Default: 0.10]
       tshld_classified = 0.10 
       # The minimum percent differences between the most and second dominant taxon for each MSP [0-1]. [Default: 0.50] 
       tshld_diff = 0.50 
      
       [interproscan]
       # Interproscan executable file, e.g. /my/path/interproscan/interproscan.sh [ Default: interproscan.sh ]
       interproscan_cmmd = interproscan.sh
       # The appls used by interroiscan: [appls] comma separated list of analyses, [ Choices: CDD,COILS,Gene3D,HAMAP,MobiDBLite,PANTHER,Pfam,PIRSF,PRINTS,ProDom,PROSITEPATTERNS,PROSITEPROFILES,SFLD,SMART,SUPERFAMILY,TIGRFAM,Phobius,SignalP,TMHMM ]; [all] use all all analyses for running. [ Default: Pfam ]
       interproscan_appl = Pfam
       # The number of spliting files which can be annotated in parallel. [ Default: 1 ]
       split_number = 1
      
    • Customize parameter settings for association with environmental/host phenotypes (optional; but if you plan to run MetaWIBELE for supervised prioritization you are required to specify your settings for MaAsLin2 and specify the main phenotype metadata used for prioritization):

       [maaslin2]
       # The absolute path of Maaslin2 executable file, e.g. /my/path/Maaslin2/R/Maaslin2.R [ Default: Maaslin2.R ]
       maaslin2_cmmd = Maaslin2.R
       # The minimum abundance for each feature. [ Default: 0 ]  
       min_abundance = 0
       # The minimum percent of samples for which a feature is detected at minimum abundance. [ Default: 0.1 ]
       min_prevalence = 0.1
       # Keep features with variance greater than. [Default: 0.0]
       min_variance = 0
       # The q-value threshold for significance. [ Default: 0.25 ]
       max_significance = 0.25
       # The normalization method to apply [ Choices: TSS, CLR, CSS, NONE, TMM ]. [ Default: TSS ]
       normalization = NONE
       # The transform to apply [ Choices: LOG, LOGIT, AST, NONE ].  [ Default: LOG ]
       transform = LOG
       # The analysis method to apply [ Choices: LM, CPLM, ZICP, NEGBIN, ZINB ]. [ Default: LM ]
       analysis_method = LM
       # The fixed effects for the model, comma-delimited for multiple effects. [ Default: all ]
       fixed_effects = all
       # The random effects for the model, comma-delimited for multiple effects. [ Default: none ]
       random_effects = none
       # The correction method for computing the q-value. [ Default: BH ]
       correction = BH
       # Apply z-score so continuous metadata are on the same scale. [ Default: TRUE ]
       standardize = TRUE
       # Generate a heatmap for the significant associations. [ Default: FALSE ]
       plot_heatmap = FALSE
       # In heatmap, plot top N features with significant associations. [ Default: FALSE ]
       heatmap_first_n = FALSE
       # Generate scatter plots for the significant associations. [ Default: FALSE ]
       plot_scatter = FALSE
       # The number of R processes to run in parallel. [ Default: 1 ]
       maaslin2_cores = 1
       # The factor to use as a reference for a variable with more than two levels provided as a string of 'variable,reference' semi-colon delimited for multiple variables. NOTE: A space between the variable and reference will not error but will cause an inaccurate result. [ Default: NA ]
       reference = NA
       # The minimum percent of case-control samples used for comparision in which a feature is detected. [ Default: 0.1 ]
       tshld_prevalence = 0.10
       # The q-value threshold for significance used as DA annotations. [ Default: 0.05 ]
       tshld_qvalue = 0.05
       # The statistic used as effect size [ Choices: coef, mean(log) ]. [coef] represents the coefficient from the model; [mean(log)] represents the difference of mean values between case and control conditions. [  Default: mean(log) ]
       effect_size = mean(log)
       # The main phenotype metadata used for prioritization, e.g. metadata1. [ Default: none ]: skip the association with environmental/phenotypic parameters
       phenotype = none
      

2. MetaWIBELE utility: preprocessing metagenomic sequencing reads

2.1 MetaWIBELE-preprocess input files

MetaWIBELE provides utilities to build gene families starting from quality-controlled shotgun sequencing reads.

This tutorial will use demo input files distributed with the MetaWIBELE installation under the examples/raw_reads/ directory. If you installed with pip or conda and don't have these handy, you can obtain copies by right-clicking these links and selecting "save link as":

These two demo samples were taken from the PRISM (Prospective Registry in IBD Study at MGH) cohort. One sample is from a subject with Crohn’s disease (CD) and the other is from a non-IBD control.

2.2 Running MetaWIBELE utility: preprocessing module

The preprocessing module is a utility in the MetaWIBELE package for preprocessing metagenomic reads. It performs the following steps: (i) metagenomic assembly, (ii) open reading frame prediction, (iii) gene family (i.e. non-redundant gene catalogs) construction, and (iv) gene abundance estimation.

If you installed from conda / pip, download them here and save them to metawibele/raw_reads directory. If you installed from source, copy them from the metawibele source package to metawibele/raw_reads.

In the VM, navigate to the correct directory with:

cd ~/Tutorials/metawibele

Use the fastq.gz files from metawibele/raw_reads to run the command:

metawibele preprocess --input raw_reads/ --output output/preprocess/ \
--output-basename demo --extension-paired "_R1.fastq.gz,_R2.fastq.gz" \
--extension ".fastq.gz" --local-jobs 4

On the terminal, you will notice tasks being initiated and completed.

(Jun 10 15:33:56) [ 2/28 -   7.14%] **Started  ** Task  7: ln__sample2.contigs.fa
(Jun 10 15:33:57) [ 4/28 -  14.29%] **Started  ** Task  9: format_contig_table
(Jun 10 15:33:58) [16/28 -  57.14%] **Started  ** Task 26: extract_protein_coding_genes
(Jun 10 15:33:59) [20/28 -  71.43%] **Started  ** Task 33: extract_cluster_CD-hit
(Jun 10 15:34:04) [24/28 -  85.71%] **Started  ** Task 43: ln__sample1.sort.bed
(Jun 10 15:33:36) [ 0/28 -   0.00%] **Started  ** Task  0: sample2__megahit
(Jun 10 15:33:57) [ 3/28 -  10.71%] **Started  ** Task  8: ln__sample1.contigs.fa
...

Note:

  • The input directory should contain quality-controlled shotgun sequencing metagenome files (fastq, fastq.gz, fasta, or fasta.gz format)
  • See the section on parallelization options to optimize the workflow run based on your computing resources.
  • The preprocessing workflow will run with the default settings for all subtasks. You can change the command line parameters to customize your settings for the preprocessing workflow.
    • For example, if your samples have a different file type extension or different paired-end labels you can modify the default settings when running the assembly task using the command line parameters --extension and --extension-paired. Using the text that follows after $SAMPLE in the input file name specify: --extension-paired "$R1_suffix,$R2_suffix", --extension "$fastq_suffix"
  • To write the log and err files in the working directory, add >preprocess.log 2>preprocess.err after the command above.

2.3 Preprocessing outputs

Main output files

The following are the two main output files of the preprocessing utility that are used for MetaWIBELE:

output/preprocess/finalized/demo_genecatalogs.centroid.faa
output/preprocess/finalized/demo_genecatalogs_counts.all.tsv

The pre-generated demo files can be directly downloaded here:

demo_genecatalogs.centroid.faa

To view this file, enter:

less -S output/preprocess/finalized/demo_genecatalogs.centroid.faa		

which yields:

>sample1_10-1
MFQDKYVFAQLASFLNRSKFNRIVAKYDGDKYVKHFTCWNQLLALMFGQLSNRESLRDLIVALEAHHSKCYHLGMGKNVS
KSSLARANQDRDYHIFEEYAYYLVSEAREKRVTHIFKLGGNVYAFDSTTIDLCLSVFWWAKFRKKKGGIKVHTLYDIETQ
IPAFFHITEASVHDSKVMNEIPYEPGSYYIFDRGYNNFKMLYKIHQIEAYFVVRAKKNLQYKSIKWKRRLPKNVFSDASI
LLTGFYPKQYYPEPLRLVKYWDEEQEREFTFITNAMHISALQVAELYKNRWQVELFFKWLKQHLKIKRFWGTTENAVRIQ
IYAAICTYCLVAIIQHDMQLDRSTYEVLQILSISLTDKTHLRDLFDKTKFQNDKERFGPNGPSLFN
  • This file provides the amino acid sequences for each gene identified as the representative of its gene family (Fasta format file).
  • Each gene is given a MetaWIBELE-specific ID (i.e. sample1_10-1) based on the sample and order in which it was identified.
demo_genecatalogs_counts.all.tsv

To view this file, enter:

less -S output/preprocess/finalized/demo_genecatalogs_counts.all.tsv		

which yields:

ID      sample1 sample2
sample1_10-1    67      5
  • This file provides the counts of the number of reads mapped to each gene family (rows) in each sample (columns) (TSV format file).

Intermediate files

  1. Assembly results:

    • output/preprocess/finalized/demo_contig_sequence.fasta: contig sequences (Fasta format file).
    • The assembly outputs for each sample are in the output/preprocess/assembly/ folder.
  2. Gene-calling results:

    • output/preprocess/finalized/demo_gene_info.tsv: all gene calling information (TSV format file).
    • output/preprocess/finalized/demo_combined_gene_protein_coding.sorted.fna: nucleotide sequences for all ORFs sorted by gene length (Fasta format file).
    • output/preprocess/finalized/demo_combined_protein.sorted.faa: protein sequences for all ORFs sorted by gene length (Fasta format file).
    • output/preprocess/finalized/demo_combined_gene_protein_coding.complete.sorted.fna: nucleotide sequences for all complete ORFs sorted by gene length (Fasta format file).
    • output/preprocess/finalized/demo_combined_protein.complete.sorted.faa: protein sequences for all complete ORFs sorted by gene length (Fasta format file).
    • The gene-calling outputs from prodigal are in the output/preprocess/gene_calls/ folder.
    • The gene-annotation outputs from prokka are in the output/preprocess/gene_annotation/ folder.
  3. Mapping results:

    • All mapping outputs for each sample are in the output/preprocess/mapping/ folder.
  4. Gene families:

    • output/preprocess/finalized/demo_genecatalogs.clstr: clustering information for non-redudant gene families (Fasta format file).
    • output/preprocess/finalized/demo_genecatalogs.centroid.fna: nucleotide sequences of gene family representatives (Fasta format file).

  • Single-sample assembly allows us to estimate the genomes per metagenomic sample using countable computational resources. What might be some pros/cons of this approach in real data? How about co-assembly by assembling metagenomes across samples?

3. Characterizing assembled metagenomic genes

3.1 MetaWIBELE-characterize input files

The MetaWIBELE-characterize module is run using gene families with protein sequence and read count information:

This tutorial will use the demo input files distributed with the MetaWIBELE installation under the examples/input/ directory. If you installed with pip or conda and don't have these handy, you can obtain copies by right-clicking these links and selecting "save link as":

Let's see the contents of these files.

demo_mgx_metadata.tsv

column -t -s $'\t' input/demo_mgx_metadata.tsv | less -S

which yields:

SID         study  sex     consent_age  antibiotic  mesalamine  steroids  immunosuppressant  diagnosis
PRISM_7122  PRISM  Female  38           No          No          No        Yes                CD
PRISM_7147  PRISM  Male    50           No          Yes         No        No                 CD
PRISM_7150  PRISM  Male    41           No          No          No        Yes                CD
PRISM_7153  PRISM  Male    51           No          Yes         No        No                 CD

The first column is the sample ID of a stool donor in the PRISM cohort which contains Crohn's disease (CD) patients and healthy controls (nonIBD). Other relevant metadata is also listed for these individuals. There are 102 samples (68 CD). In this tutorial, we are interested in prioritizing gene families associated with CD. Some of the relevant lines in the metawibele.cfg file (under [maaslin2] options) are:

# The factor to use as a reference for a variable with more than two levels provided as a string of 'variable,reference' semi-colon delimited for multiple variables. NOTE: A space between the variable and reference will not error but will cause an inaccurate result. [ Default: NA ]
reference = diagnosis,nonIBD
# The statistic used as effect size [ Choices: coef, mean(log) ]. [coef] represents the coefficient from the model; [mean(log)] represents the difference of mean values between case and control conditions. [  Default: mean(log) ]
effect_size = mean(log)
# The main phenotype metadata used for prioritization, e.g. metadata1. [ Default: none ]: skip the association with environmental/phenoty
phenotype = diagnosis

demo_genecatalogs.centroid.faa

To view this file, type:

less -S input/demo_genecatalogs.centroid.faa

Let's count the number of genes (centroids):

grep ">" input/demo_genecatalogs.centroid.faa | wc -l

which yields 604 gene families.

demo_genecatalogs_counts.all.tsv

To view this file, type:

less -S input/demo_genecatalogs_counts.all.tsv 

which yields:

ID      PRISM_7122      PRISM_7147      PRISM_7150      PRISM_7153      PRISM_7184      PRISM_7238      PRISM_7406      PRISM_7408
PRISM_7122_03875        197     16      2       15      87      0       0       80      92      0       33      0       1       90
PRISM_7122_38863        16      3       2       8       2       34      905     62      12      38      5       6       0       0
PRISM_7147_05905        0       1934    2       11      0       0       0       16      129     1       18      0       84      83
PRISM_7147_105078       0       2304    132     506     0       8       3       123     718     1       952     1       646     240
PRISM_7147_276043       0       2611    126     944     4       32      2       183     1108    3       1740    5       1037    352

3.2 Running MetaWIBELE: characterization module

A canonical MetaWIBELE run uses gene families (non-redundant gene catalogs) to build protein families and annotate them functionally and taxonomically. By default, protein families are annotated using global-homology-based, domain/motif-based, and abundance-based approaches. From the examples folder (or the location where you saved the gene families and sample metadata files above) execute the following:

If you're on the VM, use the files from metawibele/input/. If you installed with pip or conda, download them here and save them to the metawibele/raw_reads directory. If you installed from source, copy them from the metawibele source package to the metawibele/input/ directory. To run the following command, change to the metawibele/ directory and run:

metawibele characterize --input-sequence input/demo_genecatalogs.centroid.faa \
--input-count input/demo_genecatalogs_counts.all.tsv \
--input-metadata input/demo_mgx_metadata.tsv \
--output output/characterization/ --local-jobs 4 --bypass-psortb \
>characterize.log 2>characterize.err &

Note:

  • Make sure the global configuration file metawibele.cfg is in your working directory (i.e. metawibele/ directory).
  • In the command replaces output/characterization/ with the path to the folder where you want to write the output files.
  • See the section on parallelization options to optimize the workflow run based on your computing resources.
  • The workflow runs with the default settings used to run all modules. These settings will work for most data sets. If you need to modify the default settings, you can use command line parameters to customize the characterization workflow settings. For example, you can run with one or more bypass options (for information on bypass options, see the section Workflow bypass mode).

Note the characterization phases of MetaWIBELE-characterize in the above log: 1) building protein families, 2) mapping proteins to UniRef (global-homology based annotation), 3) characterizing protein domains and motifs for protein families (domain-motif based annotation), 4) profiling taxonomy by binning co-abundant gene families and propagating the taxa to unclassified genes, 5) building an abundance table for protein families and associating it with environmental parameters or phenotypes.


  • How would you customize the workflow settings to skip the domain-motif based annotation step in order to reduce computational consumption?

3.3 MetaWIBELE characterization outputs

MetaWIBELE-characterize creates four main output subfolders that contain the main outputs of each step:

output/characterization/abundance_annotation/
output/characterization/clustering/
output/characterization/domain_motif_annotation/
output/characterization/finalized/
output/characterization/global_homology_annotation/

These are the main finalized characterization output files:

output/characterization/finalized/demo_proteinfamilies_annotation.tsv
output/characterization/finalized/demo_proteinfamilies_annotation.attribute.tsv
output/characterization/finalized/demo_proteinfamilies_annotation.taxonomy.tsv
output/characterization/finalized/demo_proteinfamilies_nrm.tsv

The pre-generated demo files can be directly downloaded here:

Let's inspect the contents of the finalized annotation files (in the sub-folder finalized) and the intermediate subfolders for each annotation step:

3.3.1 finalized annotations

demo_proteinfamilies_annotation.tsv

To view this file, type:

less -S output/characterization/finalized/demo_proteinfamilies_annotation.tsv

which yields:

familyID	annotation	feature	category	method	AID
Cluster_1	good	quality	note	Quality_control	NA
Cluster_1	demo	study	project	Shotgun	NA
Cluster_1	Cluster_1	protein_family	Denovo_clustering	CD-hit	Cluster_1__Denovo_clustering
Cluster_1	UniRef90_A7B522	strong_homology	UniRef90_homology	UniRef90	Cluster_1__UniRef90_homology
Cluster_1	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	Terminal	Taxonomy_characterization	Taxonomy_annotation	Cluster_1__Taxonomy_characterization
Cluster_1	23863.237926470596	DNA-CD_abundance	Denovo_characterization	DNA	Cluster_1__DNA-CD_abundance
Cluster_1	0.9264705882352942	DNA-CD_prevalence	Denovo_characterization	DNA	Cluster_1__DNA-CD_prevalence
Cluster_1	2429.2891764705887	DNA-nonIBD_abundance	Denovo_characterization	DNA	Cluster_1__DNA-nonIBD_abundance
Cluster_1	1.0	DNA-nonIBD_prevalence	Denovo_characterization	DNA	Cluster_1__DNA-nonIBD_prevalence
Cluster_1	23863.237926470596	DNA-within-phenotype_abundance	Denovo_characterization	DNA	Cluster_1__DNA-within-phenotype_abundance
Cluster_1	1.0	DNA-within-phenotype_prevalence	Denovo_characterization	DNA	Cluster_1__DNA-within-phenotype_prevalence
Cluster_1	16718.588343137264	DNA_abundance	Denovo_characterization	DNA	Cluster_1__DNA_abundance
Cluster_1	0.9509803921568627	DNA_prevalence	Denovo_characterization	DNA	Cluster_1__DNA_prevalence
Cluster_1	PF01176:PF00164;PF01176:PF11987	DOMINE_interaction	Denovo_characterization	DOMINE	Cluster_1__DOMINE_interaction
Cluster_1	PF01176:PF00164;PF01176:PF11987	DOMINE_interaction-human	Denovo_characterization	DOMINE	Cluster_1__DOMINE_interaction-human
Cluster_1	PF01176:PF00164;PF01176:PF11987	ExpAtlas_interaction	Denovo_characterization	ExpAtlas	Cluster_1__ExpAtlas_interaction
Cluster_1	PF01176	InterProScan_PfamDomain	Denovo_characterization	InterProScan	Cluster_1__InterProScan_PfamDomain
Cluster_1	msp_unknown	MSPminer_MSP	Denovo_characterization	MSPminer	Cluster_1__MSPminer_MSP
Cluster_1	up(CD_vs_nonIBD)	MaAsLin2_DA	Denovo_characterization	MaAsLin2	Cluster_1__MaAsLin2_DA
...
  • This file details the annotation of each protein family in the community (TSV format file).
  • Protein families are groups of evolutionarily-related protein-coding sequences that often perform similar functions.
  • MetaWIBELE annotates protein families by combining global-homology similarity, local-homology similarity, and non-homology based methods.
  • The annotations for each protein family come from multiple information sources, e.g. biochemical annotation, taxonomic annotation, ecological properties and association with environmental parameters or phenotypes, etc.
demo_proteinfamilies_annotation.attribute.tsv

To view this file, type:

less -S output/characterization/finalized/demo_proteinfamilies_annotation.attribute.tsv

which yields:

TID	AID	key	value
1	Cluster_1__Denovo_clustering	repID	PRISM_7861_211956
2	Cluster_1__Denovo_clustering	rep_length	85
3	Cluster_1__Denovo_clustering	cluster_size	24
4	Cluster_1__UniRef90_homology	UniProtKB	A7B522
5	Cluster_1__UniRef90_homology	Protein_names	Translation initiation factor IF-1
6	Cluster_1__UniRef90_homology	query_cov_type	high_confidence
...
  • This is the supplementary results for characterization (TSV format file).
  • Each line in the file provides supplemental information about the corresponding annotation results. AID is the key to connect demo_proteinfamilies_annotation.tsv with demo_proteinfamilies_annotation.attribute.tsv.
demo_proteinfamilies_annotation.taxonomy.tsv

To view this file, type:

less -S output/characterization/finalized/demo_proteinfamilies_annotation.taxonomy.tsv

which yields:

familyID	study	map_type	query_type	mutual_type	identity	query_coverage	mutual_coverage	detail	TaxTaxID	Rep_Tax	Rep_TaxID	UniProtKB	unirefID	note	msp_name	msp_taxa_name	msp_taxa_id	MSP_Tax	MSP_TaxID	MSP_Rep_Tax	MSP_Rep_TaxID	taxa_id	taxa_name	taxa_rank	taxa_lineage
Cluster_1	demo	UniRef90_characterized	high_confidence	high_confidence	100	1.0	1.0	Translation initiation factor IF-1	root	1	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	411470	A7B522	UniRef90_A7B522	good	msp_unknownNA	NA	Lachnospiraceae	186803	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	411470	411470	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	Terminal	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__Blautia|s__Ruminococcus_gnavus|t__Ruminococcus_gnavus_(strain_ATCC_29149_/_VPI_C7-9)
Cluster_10	demo	UniRef90_uncharacterized	high_confidence	high_confidence	99.1	1.0	1.0	Phenylalanine--tRNA ligase beta subunit	Clostridiales	186802	[Eubacterium] rectale	39491	A0A395ZH25	UniRef90_A0A395ZH25	good	msp_1	[Eubacterium] rectale	39491	[Eubacterium] rectale	39491	[Eubacterium] rectale	39491	39491	[Eubacterium] rectale	Species	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale
Cluster_100	demo	UniRef90_uncharacterized	high_confidence	high_confidence	98.9	1.0	1.0	PTS fructose transporter subunit IIC	Clostridiales	186802	[Eubacterium] rectale	39491	A0A395UVP7	UniRef90_A0A395UVP7	good	msp_1	[Eubacterium] rectale	39491	[Eubacterium] rectale	39491	[Eubacterium] rectale	39491	39491	[Eubacterium] rectale	Species	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale
Cluster_101	demo	UniRef90_uncharacterized	high_confidence	high_confidence	93.6	1.0	1.0	RecQ family ATP-dependent DNA helicase	Bacteria	2	[Eubacterium] rectale	39491	A0A396FK20	UniRef90_A0A396FK20	good	msp_1	[Eubacterium] rectale	39491	[Eubacterium] rectale	39491	[Eubacterium] rectale	39491	39491	[Eubacterium] rectale	Species	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale
...
  • This file provides detailed information about taxonomic annotation (TSV format file).
  • demo_proteinfamilies_annotation.taxonomy.all.tsv shows the taxonomic information at each taxonomic level.
  • Main columns: taxa_id: the finalized the taxon ID of this protein family (either taxon for LCA or representative based on parameter setting, default representative taxon); taxa_name: the finalized the taxon name of this protein family; taxa_rank: the finalized the taxonomic level of this protein family; taxa_lineage: the finalized the taxonomic lineage of this protein family.
demo_proteinfamilies_nrm.tsv

To view this file, type:

less -S output/characterization/finalized/demo_proteinfamilies_nrm.tsv

which yields:

ID	PRISM_7122	PRISM_7147	PRISM_7150	PRISM_7153	PRISM_7184	PRISM_7238	PRISM_7406	PRISM_7408	PRISM_7421
Cluster_1	9165.52	562.073	3127.42	592.641	91252.4	17300.3	25623.1	8615.57	180.455
Cluster_10	80.549	2873.28	2402.93	3313.99	801.949	1655.54	5.23934	1639.27	2993.25
Cluster_100	202.251	1817.23	1632.6	2618.05	503.404	933.184	4.93331	925.316	1612.71
Cluster_101	206.759	1736.68	1729.47	1543.63	0	1084.08	8.40548	1068.26	1638.2
...
  • This file contains the relative abundance per protein family across samples (TSV format file).
  • Protein family abundance is reported in copies per million (CPM) units by default, which is "total sum scaling (TSS)"-style normalization (i.e. each sample is constrained to sum to 1 million). First, each protein family is normalized to RPK (reads per kilobase) units for gene length normalization; RPK units reflect relative gene (or transcript) copy numbers in the community. Then, RPK values are further sum-normalized (CPM) to adjust for differences in sequencing depth across samples. For further information, refer to the normalization approach in HUMAnN.

3.3.2 protein families

The clustering folder contains the outputs of clustering proteins into protein families using the same criteria for building UniRef90 clusters.

The pre-generated demo files can be directly downloaded here:

demo_proteinfamilies.clstr

To view this file, enter:

less -S output/characterization/clustering/demo_proteinfamilies.clstr		

which yields:

>PRISM_7861_211956;Cluster_1;length=85;size=24;cluster=1
PRISM_7861_211956
PRISM_7791_142577
PRISM_8794_132719
PRISM_7910_97769
...
>PRISM_7855_54982;Cluster_2;length=559;size=12;cluster=2
PRISM_7855_54982
PRISM_8776_12844
PRISM_8467_139927
PRISM_9148_54414
...
  • This file shows the clustered proteins (Fasta format file).
  • The top ID (e.g. PRISM_7861_211956) is the representative of the cluster and it is followed by a list of cluster members.
  • Each protein is given a MetaWIBELE-specific ID (i.e. PRISM_7861_211956 and PRISM_7861_54982) based on the sample and order in which it was identified.
demo_proteinfamilies.centroid.faa

To view this file, enter:

less -S output/characterization/clustering/demo_proteinfamilies.centroid.faa		

which yields:

>PRISM_7861_211956
MRSGSVKSAGGFSMSKADVIEIEGVVVEKLPNAMFKVELENGHIVLAHISGKLRMNFIKILPGDKVTLEMSPYDLSKGRI
...
>PRISM_7855_54982
MISDNARAGMQQAPARSLFNALGFTPEEMKKPMVGIVSSFNEIVPGHMNIDKIVEAVKLGVAEAGGVPVVFPAIAVCDGI
AMGHVGMKYSLVTRDLIADSTECMAIAHQFDALVMVPNCDKNVPGLLMAAARLNLPTVFVSGGPMLAGHVKGRKRSLSSM
...
  • This file provides the amino acid sequences for each protein identified as the representative of its protein family (Fasta format file).

3.3.3 global-homology based annotations

The global_homology_annotation folder contains the outputs from querying each sequence in the protein families against the UniRef90 database via a protein-level search.

The pre-generated demo files can be directly downloaded here:

demo_proteinfamilies_annotation.uniref90_annotation.tsv

To view this file, enter:

less -S output/characterization/global_homology_annotation/demo_proteinfamilies_annotation.uniref90_annotation.tsv	

which yields:

familyID	study	map_type	query_type	mutual_type	identity	query_coverage	mutual_coverage	detail	TaxTaxID	Rep_Tax	Rep_TaxID	UniProtKB	unirefID	taxa_id	taxa_name	taxa_rank	taxa_lineage	note
Cluster_1	demo	UniRef90_characterized	high_confidence	high_confidence	100	1.0	1.0	Translation initiation factor IF-1	root	1	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	411470	A7B522	UniRef90_A7B522	411470	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	Terminal	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__Blautia|s__Ruminococcus_gnavus|t__Ruminococcus_gnavus_(strain_ATCC_29149_/_VPI_C7-9)	good
Cluster_100	demo	UniRef90_uncharacterized	high_confidence	high_confidence	98.9	1.0	1.0	PTS fructose transporter subunit IIC	Clostridiales	186802	[Eubacterium] rectale	39491	A0A395UVP7	UniRef90_A0A395UVP7	39491	[Eubacterium] rectale	Species	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale	good
Cluster_106	demo	UniRef90_weak_homology	low_confidence	low_confidence	97.4	0.5571895424836601	0.5571895424836601	NA	NA	NA	NA	NA	NA	UniRef90_R6U3V8	NA	NA	Unclassified	NA	good
Cluster_285	demo	UniRef90_worse_homology	low_confidence	low_confidence	94.8	0.7584269662921348	0.14594594594594595	UniRef90_unknown	NA	NA	NA	NA	NA	NA	39491	NA	Unclassified	NA	good
...
  • This file shows the mapping results of protein families to UniRef90 (TSV format file).
  • Each row in this file corresponds to a protein family (i.e. Cluster_1, Cluster_10). Based on the sequence identity to and coverage of the UniRef90 sequence, each protein family is labeled as:
    • UniRef90_characterized: proteins with strong homology to known characterized proteins
    • UniRef90_uncharacterized: proteins with strong homology to proteins of uncharacterized function (i.e. lacking a GO term assignment)
    • UniRef90_weak_homology: potentially novel proteins that only had weak homology to any UniRef90 family
    • UniRef90_worse_homology: completely novel proteins without any significant homology
demo_UniRef90_proteinfamilies.detail.tsv

To view this file, enter:

less -S output/characterization/global_homology_annotation/demo_UniRef90_proteinfamilies.detail.tsv	

which yields:

familyID	type	detail	Protein_names	Gene_names	Tax	TaxID	Rep_Tax	Rep_TaxID	UniProtKB	unirefID
Cluster_1	UniRef90_GO	GO:0003743;GO:0005737;GO:0019843;GO:0043022	Translation initiation factor IF-1	RUMGNA_02659{ECO:0000313|EMBL:EDN77045.1}	root	1	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	411470	A7B522	UniRef90_A7B522
Cluster_1	UniRef90_PfamDomain	PF01176	Translation initiation factor IF-1	RUMGNA_02659{ECO:0000313|EMBL:EDN77045.1}	root	1	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	411470	A7B522	UniRef90_A7B522
Cluster_1	UniRef90_eggNOG	COG0361;ENOG4105K9U	Translation initiation factor IF-1	RUMGNA_02659{ECO:0000313|EMBL:EDN77045.1}	root	1	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	411470	A7B522	UniRef90_A7B522
Cluster_10	UniRef90_uncharacterized	UniRef90_uncharacterized	Phenylalanine--tRNA ligase beta subunit	DW951_09675{ECO:0000313|EMBL:RHA03586.1}	Clostridiales	186802	[Eubacterium] rectale	39491	A0A395ZH25	UniRef90_A0A395ZH25
Cluster_106	UniRef90_unknown	UniRef90_unknown	UniRef90_unknown	NA	NA	NA	NA	NA	NA	NA
...
  • This file details the annotations of the UniRef90s to other groupings, including GO terms, KEGG Orthogroups, EggNOG (including COGs), and Pfams (TSV format file).
  • Protein families may also be labeled as:
    • UniRef90_XYZ: proteins with strong homology to known characterized proteins and annotated with specific type of functions
    • UniRef90_uncharacterized: proteins with strong homology to proteins of uncharacterized function (i.e. lacking GO annotation)
    • UniRef90_unknown: potentially novel proteins that lacked functional annotations and only had weak homology or without any significant homology to any UniRef90 family

3.3.4 domain-motif annotations

The domain_motif_annotation folder contains the outputs created by using the local-homology approach (domain/motif search) to characterize the secondary structures of protein families.

The pre-generated demo files can be directly downloaded here:

demo_Denovo_proteinfamilies.signaling.detail.tsv

To view this file, enter:

less -S output/characterization/domain_motif_annotation/demo_Denovo_proteinfamilies.signaling.detail.tsv	

which yields:

familyID	type	detail	consistency
Cluster_108	Denovo_signaling	signaling	1.0
Cluster_11	Denovo_signaling	signaling	1.0
Cluster_111	Denovo_signaling	signaling	1.0
Cluster_114	Denovo_signaling	signaling	1.0
...
  • This file lists potential signaling proteins based on de novo prediction of function (TSV format file).
  • Proteins that are not predicted to be signaling proteins are excluded from the file.
  • If signal peptides were predicted by both SignalP and Phobius, the consistency score (column 4) contains the consistency scores from SignalP and Phobius predictions. When both SignalP and Phobius predict the same type of signal and have high consistency scores the prediction is believed to be true and the protein family is included.
demo_Denovo_proteinfamilies.transmembrane.detail.tsv

To view this file, enter:

less -S output/characterization/domain_motif_annotation/demo_Denovo_proteinfamilies.transmembrane.detail.tsv	

which yields:

familyID	type	detail	consistency
Cluster_100	Denovo_transmembrane	transmembrane	1.0
Cluster_102	Denovo_transmembrane	transmembrane	1.0
Cluster_105	Denovo_transmembrane	transmembrane	1.0
Cluster_108	Denovo_transmembrane	transmembrane	1.0
...
  • Similar to the above, this file lists potential transmembrane proteins based on de novo prediction of function (TSV format file).
demo_InterProScan_proteinfamilies.detail.tsv

To view this file, enter:

less -S output/characterization/domain_motif_annotation/demo_InterProScan_proteinfamilies.detail.tsv	

which yields:

familyID	type	detail	description	InterPro_accession	consistency
Cluster_257	InterProScan_PfamDomain	PF00005;PF02518	ABC transporter-like;Histidine kinase/HSP90-like ATPase	IPR003439;IPR0035941.0;1.0
Cluster_191	InterProScan_PfamDomain	PF07005;PF17042	Four-carbon acid sugar kinase, N-terminal domain;Four-carbon acid sugar kinase, C-terminal domain	IPR010737;IPR031475	1.0;1.0
Cluster_130	InterProScan_PfamDomain	PF13437;PF13533	HlyD family secretion protein;Membrane fusion protein, biotin-lipoyl like domain	NA;IPR039562	1.0;1.0
Cluster_71	InterProScan_PfamDomain	PF00343	Glycosyl transferase, family 35	IPR000811	1.0
...
  • This file lists the results of using InterProScan to predict Pfam IDs or other domain signatures for each protein family (TSV format file).
  • The consistency score is calculated as the proportion of members within the cluster that receive the same predicted Pfam ID as the protein family representative.
demo_DOMINE_proteinfamilies.detail.tsv

To view this file, enter:

less -S output/characterization/domain_motif_annotation/demo_DOMINE_proteinfamilies.detail.tsv	

which yields:

familyID	type	detail	description	consistency
Cluster_257	DOMINE_interaction	PF00005:PF00009;PF00005:PF00037;PF00005:PF00571;PF00005:PF00664;PF00005:PF03459;PF02518:PF00012;PF02518:PF00072;PF02518:PF00165;PF02518:PF00183;PF02518:PF00204;PF02518:PF00240;PF02518:PF00515;PF02518:PF00582;PF02518:PF00672;PF02518:PF00785;PF02518:PF00989;PF02518:PF01119;PF02518:PF01584;PF02518:PF01590;PF02518:PF01624;PF02518:PF01627;PF02518:PF01740;PF02518:PF01751;PF02518:PF02895;PF00005:PF00528;PF00005:PF01458;PF00005:PF08402;PF02518:PF00512;PF02518:PF04539;PF02518:PF04542;PF02518:PF04968;PF02518:PF04969;PF02518:PF05833;PF02518:PF07730;PF02518:PF08565;PF02518:PF09239;PF02518:PF10436	ABC transporter:Elongation factor Tu GTP binding domain;ABC transporter:4Fe-4S binding domain;ABC transporter:CBS domain;ABC transporter:ABC transporter transmembrane region;ABC transporter:TOBE domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Hsp70 protein;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Response regulator receiver domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Bacterial regulatory helix-turn-helix proteins, AraC family;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Hsp90 protein;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:DNA gyrase B;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Ubiquitin family;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Tetratricopeptide repeat;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Universal stress protein family;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:HAMP domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:NA;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:PAS fold;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:DNA mismatch repair protein, C-terminal domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:CheW-like domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:GAF domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:MutS domain I;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Hpt domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:STAS domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Toprim domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Signal transducing histidine kinase, homodimeric domain;ABC transporter:Binding-protein-dependent transport system inner membrane component;ABC transporter:Uncharacterized protein family (UPF0051);ABC transporter:TOBE domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:His Kinase A (phospho-acceptor) domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Sigma-70 region 3;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Sigma-70 region 2;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:CHORD;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:CS domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Fibronectin-binding protein A N-terminus (FbpA);Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Histidine kinase;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Cdc37 Hsp90 binding domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Topoisomerase VI B subunit, transducer;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Mitochondrial branched-chain alpha-ketoacid dehydrogenase kinase	1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0
Cluster_256	DOMINE_interaction	PF00293:PF00300;PF00293:PF05026;PF00293:PF09296	NUDIX domain:Histidine phosphatase superfamily (branch 1);NUDIX domain:Dcp2, box A domain;NUDIX domain:NADH pyrophosphatase-like rudimentary NUDIX domain	1.0;1.0;1.0
Cluster_16	DOMINE_interaction	PF00009:PF00005;PF00009:PF00679;PF00009:PF01008;PF00009:PF01423;PF00009:PF01583;PF00009:PF03143;PF00009:PF03144;PF00009:PF03764;PF03144:PF00679;PF03144:PF01583;PF03144:PF03143;PF03144:PF03764;PF00009:PF00129;PF00009:PF00736;PF00009:PF01507;PF00009:PF01926;PF00009:PF03431;PF00009:PF06421;PF00009:PF07686;PF00009:PF09173;PF00009:PF11987;PF03144:PF00736;PF03144:PF01507;PF03144:PF01873;PF03144:PF03431;PF03144:PF07541;PF03144:PF09173;PF03144:PF11987	Elongation factor Tu GTP binding domain:ABC transporter;Elongation factor Tu GTP binding domain:Elongation factor G C-terminus;Elongation factor Tu GTP binding domain:Initiation factor 2 subunit family;Elongation factor Tu GTP binding domain:LSM domain;Elongation factor Tu GTP binding domain:Adenylylsulphate kinase;Elongation factor Tu GTP binding domain:Elongation factor Tu C-terminal domain;Elongation factor Tu GTP binding domain:Elongation factor Tu domain 2;Elongation factor Tu GTP binding domain:Elongation factor G, domain IV;Elongation factor Tu domain 2:Elongation factor G C-terminus;Elongation factor Tu domain 2:Adenylylsulphate kinase;Elongation factor Tu domain 2:Elongation factor Tu C-terminal domain;Elongation factor Tu domain 2:Elongation factor G, domain IV;Elongation factor Tu GTP binding domain:Class I Histocompatibility antigen, domains alpha 1 and 2;Elongation factor Tu GTP binding domain:EF-1 guanine nucleotide exchange domain;Elongation factor Tu GTP binding domain:Phosphoadenosine phosphosulfate reductase family;Elongation factor Tu GTP binding domain:50S ribosome-binding GTPase;Elongation factor Tu GTP binding domain:RNA replicase, beta-chain;Elongation factor Tu GTP binding domain:GTP-binding protein LepA C-terminus;Elongation factor Tu GTP binding domain:Immunoglobulin V-set domain;Elongation factor Tu GTP binding domain:Initiation factor eIF2 gamma, C terminal;Elongation factor Tu GTP binding domain:Translation-initiation factor 2;Elongation factor Tu domain 2:EF-1 guanine nucleotide exchange domain;Elongation factor Tu domain 2:Phosphoadenosine phosphosulfate reductase family;Elongation factor Tu domain 2:Domain found in IF2B/IF5;Elongation factor Tu domain 2:RNA replicase, beta-chain;Elongation factor Tu domain 2:Eukaryotic translation initiation factor 2 alpha subunit;Elongation factor Tu domain 2:Initiation factor eIF2 gamma, C terminal;Elongation factor Tu domain 2:Translation-initiation factor 2	1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0
Cluster_298	DOMINE_interaction	PF07963:PF00114	Prokaryotic N-terminal methylation motif:Pilin (bacterial filament)	1.0
...
  • This file lists the results of using DOMINE to predict domain-domain interactions for each protein (TSV format file).
  • In this step, MetaWIBELE searches the DOMINE database of known and predicted protein domain (domain-domain) interactions in order to identify microbial proteins that interact with other microbial proteins or host proteins.
  • Looking at the first row for protein family Cluster_298, the details column contains "PF07963:PF00114". The first value is the Pfam ID for this protein family, and the second value is the Pfam ID of the protein it is known or predicted to interact with.
  • Protein families can interact with multiple other proteins and multiple interactions are separated by ";" in the details column. For example, the third row for protein family Cluster_257 has Pfam ID PF00005 and interacts with multiple other proteins including PF00009, PF00037, PF00571, and many others.

3.3.5 abundance annotations

The abundance_annotation folder contains the outputs created from utilizing a non-homology based strategy compromising (i) taxonomic annotation with phylogenetic binning, (ii) abundance profiling for protein families, and (iii) association with environmental parameters or phenotypes based on differential abundance.

The pre-generated demo files can be directly downloaded here:

demo_DNA_proteinfamilies.abundance.detail.tsv

To view this file, enter:

less -S output/characterization/abundance_annotation/demo_DNA_proteinfamilies.abundance.detail.tsv	

which yields:

familyID	type	detail	mean_abundance	mean_prevalent_abundance	prevalence	total_abundance	total_samples
Cluster_1	DNA-CD_abundance	23863.237926470596	23863.237926470596	25757.145698412707	0.9264705882352942	1622700.1790000005	68
Cluster_1	DNA-CD_prevalence	0.9264705882352942	23863.237926470596	25757.145698412707	0.9264705882352942	1622700.1790000005	68
Cluster_1	DNA_abundance	16718.588343137264	16718.588343137264	17580.37124742269	0.9509803921568627	1705296.0110000009	102
Cluster_1	DNA_prevalence	0.9509803921568627	16718.588343137264	17580.37124742269	0.9509803921568627	1705296.0110000009	102
Cluster_1	DNA-nonIBD_abundance	2429.2891764705887	2429.2891764705887	2429.2891764705887	1.0	82595.83200000001	34
Cluster_1	DNA-nonIBD_prevalence	1.0	2429.2891764705887	2429.2891764705887	1.0	82595.83200000001	34
Cluster_1	DNA-within-phenotype_abundance	23863.237926470596	23863.237926470596	25757.145698412707	0.9264705882352942	1622700.1790000005	68
Cluster_1	DNA-within-phenotype_prevalence	1.0	2429.2891764705887	2429.2891764705887	1.0	82595.83200000001	34
...

  • This file lists the results of abundance annotations for each protein (TSV format file).
  • In this file, the number of rows per protein family is equal to 4+(2*the number of levels of the phenotype variable of interest). In this case, our phenotype variable of interest contains two levels (CD and non-IBD) and so there are 8 rows per protein family.
  • We calculate the mean abundance and mean prevalence of the protein family within levels of the phenotype variable. The "within phenotype" abundance and prevalence are used later by MetaWIBELE and are equal to the largest of these values, respectively. For example, for Cluster_1 the "within phenotype" abundance is equal to the CD abundance, as this is larger than the non-IBD abundance.
  • Additionally, we calculate the mean abundance and prevalence of the protein family across all the samples ("DNA_abundance", "DNA_prevalence").
demo_MaAsLin2_proteinfamilies.DA.detail.tsv

To view this file, enter:

less -S output/characterization/abundance_annotation/demo_MaAsLin2_proteinfamilies.DA.detail.tsv	

which yields:

familyID	type	detail	cmp_type	prevalence	prevalence_case	prevalence_control	coef	stderr	pvalue	qvalue	log(FC)	Cohen's_d	mean(log)	mean_abundance	mean_abundance_case	mean_abundance_control	mean_prevalent_abundance	mean_prevalent_abundance_case	mean_prevalent_abundance_control
Cluster_253	MaAsLin2_DA	up(CD_vs_nonIBD)	CD_vs_nonIBD	0.6862745098039216	0.8382352941176471	0.38235294117647056	6.59932556080405	1.48799879543361	2.57037708535903e-05	0.000176053225024591	2.48608582454593	1.0766611399179797	4.492591935109035	4587.649978039216	6606.8036835294115	549.3425670588235	6684.861396571429	7881.800885614034	1436.7420984615385
Cluster_247	MaAsLin2_DA	up(CD_vs_nonIBD)	CD_vs_nonIBD	0.7450980392156863	0.8676470588235294	0.5	8.09168252755146	1.46613110483452	3.18585281861428e-07	3.5953503081526e-05	2.4377401279384125	1.1489923257331105	4.4243947777191845	5935.187654901962	8530.409459705887	744.7440452941177	7965.646589473685	9831.658360338986	1489.4880905882353
Cluster_252	MaAsLin2_DA	down(CD_vs_nonIBD)	CD_vs_nonIBD	0.5	0.35294117647058826	0.7941176470588235	-5.94413824963104	1.30440935252719	1.60881176921761e-05	0.00013494033537789	-1.7040901947843266	-1.230317145260927	-3.8803544049272	703.886934117647	281.32940911764706	1549.001984117647	1407.773868235294	797.0999925	1950.595091111111
Cluster_180	MaAsLin2_DA	down(CD_vs_nonIBD)	CD_vs_nonIBD	0.7745098039215687	0.6764705882352942	0.9705882352941176	-3.90013337098607	1.23666141445333	0.00218410898383837	0.00392204763002695	-0.991862676094047	-1.7721309728447108	-3.367940610001413	1749.8750674509804	1117.6126011764707	3014.4	2259.3323655696204	1652.122975652174	3105.745454545455
...
  • This file contains the results of running MaAsLin2 linear models to assess the differential abundance of the protein families across levels of the phenotype variable of interest (TSV format file).
  • The detail type of "down" indicates that the abundance of the protein family was lower in CD compared to the non-IBD control, while a detail type of "up" indicates that the abundance of the protein family was higher in CD compared to the non-IBD control.
demo_MaAsLin2_proteinfamilies.DA-sig.detail.tsv

To view this file, enter:

less -S output/characterization/abundance_annotation/demo_MaAsLin2_proteinfamilies.DA-sig.detail.tsv	

which yields:

familyID	type	detail	cmp_type	prevalence	prevalence_case	prevalence_control	coef	stderr	pvalue	qvalue	log(FC)	Cohen's_d	mean(log)	mean_abundance	mean_abundance_case	mean_abundance_control	mean_prevalent_abundance	mean_prevalent_abundance_case	mean_prevalent_abundance_control
Cluster_253	MaAsLin2_DA-sig	up(CD_vs_nonIBD)	CD_vs_nonIBD	0.6862745098039216	0.8382352941176471	0.38235294117647056	6.59932556080405	1.48799879543361	2.57037708535903e-05	0.000176053225024591	2.48608582454593	1.0766611399179797	4.492591935109035	4587.649978039216	6606.8036835294115	549.3425670588235	6684.861396571429	7881.800885614034	1436.7420984615385
Cluster_247	MaAsLin2_DA-sig	up(CD_vs_nonIBD)	CD_vs_nonIBD	0.7450980392156863	0.8676470588235294	0.5	8.09168252755146	1.46613110483452	3.18585281861428e-07	3.5953503081526e-05	2.4377401279384125	1.1489923257331105	4.4243947777191845	5935.187654901962	8530.409459705887	744.7440452941177	7965.646589473685	9831.658360338986	1489.4880905882353
Cluster_252	MaAsLin2_DA-sig	down(CD_vs_nonIBD)	CD_vs_nonIBD	0.5	0.35294117647058826	0.7941176470588235	-5.94413824963104	1.30440935252719	1.60881176921761e-05	0.00013494033537789	-1.7040901947843266	-1.230317145260927	-3.8803544049272	703.886934117647	281.32940911764706	1549.001984117647	1407.773868235294	797.0999925	1950.595091111111
Cluster_180	MaAsLin2_DA-sig	down(CD_vs_nonIBD)	CD_vs_nonIBD	0.7745098039215687	0.6764705882352942	0.9705882352941176	-3.90013337098607	1.23666141445333	0.00218410898383837	0.00392204763002695	-0.991862676094047	-1.7721309728447108	-3.367940610001413	1749.8750674509804	1117.6126011764707	3014.4	2259.3323655696204	1652.122975652174	3105.745454545455
...
  • This file contains similar data to the demo_MaAsLin2_proteinfamilies.DA.detail.tsv file above (TSV format file); however, it only includes results for protein families that were significantly (q<0.05) differentially abundant by default.

  • Under what circumstances would you want to skip the association analysis?

demo_MSPminer_proteinfamilies.detail.tsv

To view this file, enter:

less -S output/characterization/abundance_annotation/demo_MSPminer_proteinfamilies.detail.tsv	

which yields:

familyID	type	detail	gene_class	module_name	msp_description	msp_component	multi_msp	total_member	core_gene	accessory_gene	module_compoment	multi_module	taxa_id	taxa_name	taxa_rank	taxa_lineage	note
Cluster_1	MSPminer_MSP	msp_unknown	NA	NA	NA	msp_unknown	1	24	0	0	NA	0	NA	NA	Unclassified	NA	good
Cluster_10	MSPminer_MSP	msp_1	msp_1__core	msp_1__core	num_genes(460),num_unclassified_genes(91),num_hit_genes(334msp_1;msp_unknown	2	2	0.5	0	msp_1__core	1	39491	[Eubacterium] rectale	Species	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale	good
Cluster_100	MSPminer_MSP	msp_1	msp_1__core	msp_1__core	num_genes(460),num_unclassified_genes(91),num_hit_genes(334msp_1	1	1	1.0	0	msp_1__core	1	39491	[Eubacterium] rectale	Species	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale	good
Cluster_101	MSPminer_MSP	msp_1	msp_1__core	msp_1__core	num_genes(460),num_unclassified_genes(91),num_hit_genes(334msp_1	1	1	1.0	0	msp_1__core	1	39491	[Eubacterium] rectale	Species	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale	good
...
  • MetaWIBELE uses MSPminer to construct Metagenomic Species Pan-genomes (MSPs) by binning co-abundant genes across samples (TSV format file).
  • For each protein family, the detail column lists the MSP that protein family has been assigned to. For example, protein family Cluster_1 belongs to the first MSP (msp_1).
demo_proteinfamilies_annotation.taxonomy.tsv

To view this file, enter:

less -S output/characterization/abundance_annotation/demo_proteinfamilies_annotation.taxonomy.tsv	

which yields:

familyID	study	map_type	query_type	mutual_type	identity	query_coverage	mutual_coverage	detail	TaxTaxID	Rep_Tax	Rep_TaxID	UniProtKB	unirefID	note	msp_name	msp_taxa_name	msp_taxa_id	MSP_Tax	MSP_TaxID	MSP_Rep_Tax	MSP_Rep_TaxID	taxa_id	taxa_name	taxa_rank	taxa_lineage
Cluster_1	demo	UniRef90_characterized	high_confidence	high_confidence	100	1.0	1.0	Translation initiation factor IF-1	root	1	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	411470	A7B522	UniRef90_A7B522	good	msp_unknownNA	NA	Lachnospiraceae	186803	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	411470	411470	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	Terminal	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__Blautia|s__Ruminococcus_gnavus|t__Ruminococcus_gnavus_(strain_ATCC_29149_/_VPI_C7-9)
Cluster_10	demo	UniRef90_uncharacterized	high_confidence	high_confidence	99.1	1.0	1.0	Phenylalanine--tRNA ligase beta subunit	Clostridiales	186802	[Eubacterium] rectale	39491	A0A395ZH25	UniRef90_A0A395ZH25	good	msp_1	[Eubacterium] rectale	39491	[Eubacterium] rectale	39491	[Eubacterium] rectale	39491	39491	[Eubacterium] rectale	Species	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale
Cluster_106	demo	UniRef90_weak_homology	low_confidence	low_confidence	97.4	0.5571895424836601	0.5571895424836601	NA	NA	NA	NA	NA	NA	UniRef90_R6U3V8	good	msp_1	[Eubacterium] rectale	39491	[Eubacterium] rectale	39491	[Eubacterium] rectale	39491	39491	[Eubacterium] rectale	Species	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale
Cluster_285	demo	UniRef90_worse_homology	low_confidence	low_confidence	94.8	0.7584269662921348	0.14594594594594595UniRef90_unknown	NA	NA	NA	NA	NA	NA	good	msp_1	[Eubacterium] rectale	39491	[Eubacterium] rectale	39491	[Eubacterium] rectale	39491	39491	[Eubacterium] rectale	Species	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale
...
  • In this step, MetaWIBELE uses Metagenomic Species Pan-genomes (MSPs) to assign taxonomy to protein families that were not able to be identified through step 3.3.3 global-homology based annotations (TSV format file).
  • For example, protein family Cluster_106 and Cluster_285 had only weak or lacked homology with UniRef90s at the sequence level. However, using co-abundance data it was assigned to msp_1, which we know from other protein family information corresponds to Eubacterium rectale.
  • Main columns: taxa_id: the finalized the taxon ID of this protein family (either taxon for LCA or representative based on parameter setting, default representative taxon); taxa_name: the finalized the taxon name of this protein family; taxa_rank: the finalized the taxonomic level of this protein family; taxa_lineage: the finalized the taxonomic lineage of this protein family.

  • MetaWIBELE uses a non-homology based approach to perform taxonomic classification. What would be some pros/cons of this approach in real data with different levels of characterization? (Hint: consider well-characterized microbiome communities vs. poorly-characterized microbiome communities)

4. Prioritizing metagenomic protein families

4.1 MetaWIBELE-prioritize input files

MetaWIBELE can perform prioritization to identify important proteins using the following outputs of the MetaWIBELE-characterize module:

4.2 Running MetaWIBELE: prioritization module

A canonical MetaWIBELE run will prioritize protein families using unsupervised and supervised approaches. By default, protein families are 1) quantitatively ranked based on ecological properties (abundance and prevalence) and/or associations with environmental parameters or phenotypes (q-value and effect size), 2) prioritized based on specified functions of interest via binary filtering.

From the working directory where you executed the MetaWIBELE-prioritize module (metawibele/ directory), execute the following:

metawibele prioritize --input-annotation output/characterization/finalized/demo_proteinfamilies_annotation.tsv \
--input-attribute output/characterization/finalized/demo_proteinfamilies_annotation.attribute.tsv \
--output output/prioritization/ --local-jobs 4

Note:

  • Make sure the global configuration file metawibele.cfg is in your working directory.
  • In the command replaces output/prioritization/ with the name of the directory to write the output files to.
  • See the section on parallelization options to optimize the workflow run based on your computing resources.
  • The workflow runs with the default settings used to run all modules. These settings will work for most data sets. If you need to modify the default settings, you can use command line parameters to customize the characterization workflow settings. For example, you can run with one or more bypass options (for information on bypass options, see the section Workflow bypass mode).

This command yields:

(Jun 10 15:43:28) [0/8 -   0.00%] **Started  ** Task 0: quantify_prioritization__unsupervised
(Jun 10 15:43:32) [4/8 -  50.00%] **Started  ** Task 13: finalize_prioritization__selected_supervised_priority
(Jun 10 15:43:28) [0/8 -   0.00%] **Started  ** Task 5: quantify_prioritization__supervised
(Jun 10 15:43:31) [3/8 -  37.50%] **Started  ** Task 6: filter_prioritization
(Jun 10 15:43:30) [1/8 -  12.50%] **Started  ** Task 8: filter_prioritization
...
(Jun 10 15:43:33) [8/8 - 100.00%] **Completed** Task 11: finalize_prioritization__selected_unsupervised_priority
Run Finished

4.3 MetaWIBELE prioritization outputs

Three main output files will be created:

output/prioritization/demo_unsupervised_prioritization.rank.table.tsv
output/prioritization/demo_supervised_prioritization.rank.table.tsv
output/prioritization/demo_supervised_prioritization.rank.selected.table.tsv

The pre-generated demo files can be directly downloaded here:

demo_unsupervised_prioritization.rank.table.tsv To view this file, enter:

less -S output/prioritization/demo_unsupervised_prioritization.rank.table.tsv	

which yields:

TID	familyID	evidence	value	rank	description	note
1	Cluster_2	DNA_abundance	13143.781063725488	0.992	ranking based on single evidence	
2	Cluster_2	DNA_prevalence	0.9803921568627451	0.997	ranking based on single evidence	
3	Cluster_2	priority_score	0.994493715434892	0.994493715434892	meta ranking based on multiple evidences	
4	Cluster_412	DNA_abundance	20994.94988235294	0.998	ranking based on single evidence	
5	Cluster_412	DNA_prevalence	0.9607843137254902	0.988	ranking based on single evidence	
6	Cluster_412	priority_score	0.9929748237663647	0.9929748237663647	meta ranking based on multiple evidences	
7	Cluster_1	DNA_abundance	16718.588343137264	0.996	ranking based on single evidence	
8	Cluster_1	DNA_prevalence	0.9509803921568627	0.982	ranking based on single evidence	
9	Cluster_1	priority_score	0.9889504550050557	0.9889504550050557	meta ranking based on multiple evidences
...
  • These are the results of unsupervised prioritization based on ecological properties (TSV format file).
  • Each protein family has a numeric priority score based on its meta ranking.
  • demo_unsupervised_prioritization.rank.tsv is the overall ranking for all protein families.

demo_supervised_prioritization.rank.table.tsv To view this file, enter:

less -S output/prioritization/demo_supervised_prioritization.rank.table.tsv	

which yields:

TID	familyID	evidence	value	rank	description	note
1	Cluster_162	DNA_within_phenotype_abundance	3360.590970588234	0.82	ranking based on single evidence	CD_vs_nonIBD
2	Cluster_162	DNA_within_phenotype_prevalence	1.0	1.0	ranking based on single evidence	CD_vs_nonIBD
3	Cluster_162	MaAsLin2_DA__mean_log	-2.4957861926317024	0.744	ranking based on single evidence	CD_vs_nonIBD
4	Cluster_162	MaAsLin2_DA__qvalue	5.77786226411476e-05	0.957	ranking based on single evidence	CD_vs_nonIBD
5	Cluster_162	priority_score	0.8679556698228121	0.8679556698228121	meta ranking based on multiple evidences	CD_vs_nonIBD
6	Cluster_349	DNA_within_phenotype_abundance	2890.855411764706	0.646	ranking based on single evidence	CD_vs_nonIBD
7	Cluster_349	DNA_within_phenotype_prevalence	1.0	1.0	ranking based on single evidence	CD_vs_nonIBD
8	Cluster_349	MaAsLin2_DA__mean_log	-2.7772748787138024	0.848	ranking based on single evidence	CD_vs_nonIBD
9	Cluster_349	MaAsLin2_DA__qvalue	4.91142235527697e-05	0.969	ranking based on single evidence	CD_vs_nonIBD
10	Cluster_349	priority_score	0.8404730391805891	0.8404730391805891	meta ranking based on multiple evidences	CD_vs_nonIBD
...
  • These are the results of supervised prioritization combining ecological properties and environmental/phenotypic properties (TSV format file).
  • Each protein family has a numeric priority score based on its meta ranking.
  • demo_supervised_prioritization.rank.tsv is the overall ranking for all protein families.

demo_supervised_prioritization.rank.selected.table.tsv To view this file, enter:

less -S output/prioritization/demo_supervised_prioritization.rank.selected.table.tsv	

which yields:

TID	familyID	evidence	value	rank	description	note
1	Cluster_197	DNA_within_phenotype_abundance	3540.796235294117	0.866	ranking based on single evidence	CD_vs_nonIBD
2	Cluster_197	DNA_within_phenotype_prevalence	1.0	1.0	ranking based on single evidence	CD_vs_nonIBD
3	Cluster_197	MaAsLin2_DA__mean_log	-2.317348408740017	0.672	ranking based on single evidence	CD_vs_nonIBD
4	Cluster_197	MaAsLin2_DA__qvalue	0.000783512050628084	0.659	ranking based on single evidence	CD_vs_nonIBD
5	Cluster_197	priority_score	0.7751516860758866	0.7751516860758866	meta ranking based on multiple evidences	CD_vs_nonIBD
6	Cluster_42	DNA_within_phenotype_abundance	3127.4102941176475	0.722	ranking based on single evidence	CD_vs_nonIBD
7	Cluster_42	DNA_within_phenotype_prevalence	0.9705882352941176	0.7047120418848167	ranking based on single evidence	CD_vs_nonIBD
8	Cluster_42	MaAsLin2_DA__mean_log	-2.5611076578004637	0.77	ranking based on single evidence	CD_vs_nonIBD
9	Cluster_42	MaAsLin2_DA__qvalue	0.000366143186526455	0.755	ranking based on single evidence	CD_vs_nonIBD
10	Cluster_42	priority_score	0.737019409947045	0.737019409947045	meta ranking based on multiple evidences	CD_vs_nonIBD
  • This file is the result of binary filtering of the protein families based on biochemical annotations (TSV format file).
  • The settings require that each of the prioritized protein families should 1) be predicted as signaling peptides, and 2) have at least one of the following features: extracellular, cellWall, outerMembrane, transmembrane, Pfam domains, domain-domain interaction annotations.

  • Under what circumstances would skipping supervised prioritization be reasonable? Under what circumstances would skipping supervised prioritization be inadvisable?

4.4 Visualization of highly prioritized protein families

Unsupervised hierarchical clustering of the 30 protein families most prioritized in IBD which are potentially secreted proteins with signaling functions:

MetaWIBELE_demo_heatmap.png


5. Advanced MetaWIBELE topics

5.1 MetaWIBELE supervised prioritization with binary filtering

By default, MetaWIBELE will perform binary filtering using the prioritization configuration file installed in the package. Optionally, you can also make your own configuration files and provide them as optional arguments to MetaWIBELE. For example, the local prioritization configuration file can be provided with --prioritization-config $PRIORITIZE_CONF where $PRIORITIZE_CONF is the file with prioritization configurations.

5.1.1 Select subset of interest annotated with multiple specific biochemical annotations simultaneously

Setting $PRIORITIZE_CONF as following:

```
##Binary filtering for selection subset
[filtering]
# Filter for interested functional vignettes type [Choices: pilin | secreted_system | other user defined | none]
vignettes = none

# significant associations for filtering: [required] required item, [optional] optional item, [none] ignoring
MaAsLin2_DA-sig = required

# biochemical annotation for filtering: [required] required item, [optional] optional item, [none] ignoring
ExpAtlas_interaction = required
DOMINE_interaction = none
SIFTS_interaction = none
Denovo_signaling = required
Denovo_transmembrane = optional
PSORTb_extracellular = optional
PSORTb_cellWall = optional
PSORTb_outerMembrane = optional
UniRef90_extracellular = optional
UniRef90_signaling = none
UniRef90_transmembrane = optional
UniRef90_cellWall = optional
UniRef90_outerMembrane = optional
UniRef90_PfamDomain = none
InterProScan_PfamDomain = none
InterProScan_SUPERFAMILY = none
InterProScan_ProSiteProfiles = none 
InterProScan_ProSitePatterns = none
InterProScan_Gene3D = none
InterProScan_PANTHER = none
InterProScan_TIGRFAM = none
InterProScan_SFLD = none
InterProScan_ProDom = none
InterProScan_Hamap = none
InterProScan_SMART = none
InterProScan_CDD = none
InterProScan_PRINTS = none
InterProScan_PIRSF = none
InterProScan_MobiDBLite = none
```
  • Re-run prioritization workflow for filtering:
  • $ metawibele prioritize --prioritization-config $PRIORITIZE_CONF --bypass-mandatory --input-annotation output/characterization/finalized/demo_proteinfamilies_annotation.tsv --input-attribute output/characterization/finalized/demo_proteinfamilies_annotation.attribute.tsv --output output/prioritization/ --selected-output demo_prioritized.selected.tsv
  • Output file name: $OUTPUT_DIR/prioritization demo_prioritized.selected.tsv
  • This file is the results of binary filtering of the protein families based on biochemical annotations.
  • These settings require that each of prioritized protein family should 1) significantly associated with the main phenotype, 2) be annotated to domain-domain interaction with the host, 3) predicted as signal peptides, and 4) have at least one of the following features: extracellular, cellWall, outerMembrane, transmembrane

5.1.2 Select subset of interest based on specific functions

Setting $PRIORITIZE_CONF as following:

```
[filtering]
# Filter for interested functional vignettes type [Choices: pilin | secreted_system | other user defined | none]
vignettes = pilin

# Filter for significant associations: [required] required item, [optional] optional item, [none] ignoring [ Default: required ]
MaAsLin2_DA-sig = none

# biochemical annotation for filtering: [required] required item, [optional] optional item, [none] ignoring
ExpAtlas_interaction = optional
DOMINE_interaction = optional
SIFTS_interaction = optional
Denovo_signaling = optional
Denovo_transmembrane = optional
PSORTb_extracellular = optional
PSORTb_cellWall = optional
PSORTb_outerMembrane = optional
UniRef90_extracellular = optional
UniRef90_signaling = optional
UniRef90_transmembrane = optional
UniRef90_cellWall = optional
UniRef90_outerMembrane = optional
UniRef90_PfamDomain = optional
InterProScan_PfamDomain = optional
InterProScan_SUPERFAMILY = optional
InterProScan_ProSiteProfiles = optional 
InterProScan_ProSitePatterns = optional
InterProScan_Gene3D = optional
InterProScan_PANTHER = optional
InterProScan_TIGRFAM = optional
InterProScan_SFLD = optional
InterProScan_ProDom = optional
InterProScan_Hamap = optional
InterProScan_SMART = optional
InterProScan_CDD = optional
InterProScan_PRINTS = optional
InterProScan_PIRSF = optional
InterProScan_MobiDBLite = optional
```
  • Re-run prioritization workflow for filtering:
  • $ metawibele prioritize --prioritization-config $PRIORITIZE_CONF --bypass-mandatory --vignette-config $my_vignette_function_file --input-annotation output/characterization/finalized/demo_proteinfamilies_annotation.tsv --input-attribute output/characterization/finalized/demo_proteinfamilies_annotation.attribute.tsv --output output/prioritization/ --selected-output demo_prioritized_pilin.tsv
  • Provide your own vignette function file for filtering specific functions.
  • Output file name: $OUTPUT_DIR/prioritization/demo_prioritized_pilin.tsv
  • This file is the result of binary filtering of the protein families based on pilin related functions (TSV format file).

  • How would you customize your local configuration file for prioritization to select the subset of protein families annotated with domain-domain interactions?
  • How would you customize your local configuration for prioritization to select the subset of protein families which are 1) significantly associated with the phenotype or environmental parameter, 2) annotated to domain-domain interactions, and 3) predicted as signal peptides?

5.2 Building dependent databases

5.2.1: create local UniRef databases

You can also create these databases locally using MetaWIBELE utility scripts based on the latest release version of UniProt/UniRef, and provide $UNIREF_LOCATION as the location to install the database.

  • Download and obtain UniProt annotations:

    $ metawibele_prepare_uniprot_annotation --output $UNIREF_LOCATION

  • Download UniRef sequences and obtain annotations:

    $ metawibele_prepare_uniref_annotation -t uniref90 --output $UNIREF_LOCATION

  • Use diamond to index sequences

    diamond makedb --in uniref90.fasta -d uniref90.fasta

5.2.2: create local domain databases

Protein domain databases are required if you will use MetaWIBELE to perform domain-based annotation. By default, these databases have already been installed when you install the MetaWIBELE package. Alternatively, you can also create these domain databases locally and provide $DOMAIN_LOCATION as the location to install the database.

  • Download and obtain dependent protein domain information:

    $ metawibele_prepare_domain_databases -t Pfam33.0 --output $DOMAIN_LOCATION

    • Pfam domains from the specified version of Pfam database will be downloaded.
    • PDB information from the latest version of SIFT database will be downloaded.
    • Domain-domain interactions from version 2.0 of DOMINE database will be downloaded.
Clone this wiki locally