Skip to content

metawibele

Yancong Zhang edited this page Aug 31, 2022 · 68 revisions
Clone this wiki locally

Welcome to the MetaWIBELE tutorial

MetaWIBELE (Workflow to Identify novel Bioactive Elements in the microbiome) is a workflow to efficiently and systematically identify and prioritize potentially bioactive (and often uncharacterized) gene products in microbial communities. It prioritizes candidate gene products from assembled metagenomes using a combination of sequence homology, secondary-structure-based functional annotations, phylogenetic binning, ecological distribution, and association with environmental parameters or phenotypes to target candidate bioactives.

For more information about the method and its associated publication, visit our summary page and read the MetaWIBELE User Manual.

Support for MetaWIBELE is available via the MetaWIBELE channel of the bioBakery Support Forum.


Contents


Overview

workflow.png


1. Setup

1.1 Requirements

  1. Python (version >= 3.6, requiring numpy, pandas, scipy packages; tested 3.6, 3.7)
  2. AnADAMA2 (version >= 0.7.4; tested 0.7.4, 0.8.0)
  3. CD-hit (version >= 4.7; tested 4.7)
  4. Diamond (version >= 0.9.24; tested 0.9.24)
  5. MSPminer (version >= 1.0.0; licensed software; tested version 1.0.0)
  6. MaAsLin2 (version >= 1.5.1; tested 1.5.1)
  7. Interproscan (version >= 5.31-70) (installing with activating Phobius/SignalP/TMHMM analyses; InterProScan 5.51-85.0 or later are recommended for potential simpler installation; used for domain/motif annotation; tested 5.31-70, 5.51-85.0)
  8. Signalp (version >= 4.1; licensed software; installing with integrating in interproscan; used for domain/motif annotation; tested 4.1)
  9. TMHMM (version >= 2.0; licensed software; installing with integrating in interproscan; used for domain/motif annotation; tested 2.0)
  10. Phobius (version >= 1.01; licensed software; installing with integrating in interproscan; used for domain/motif annotation; tested 1.01)
  11. PSORTb (version >= 3.0) (licensed software; used for domain/motif annotation; tested 3.0)
  12. Optional: only required if using MetaWIBELE utilities to prepare inputs for MetaWIBELE using metagenomic sequencing reads
    • MEGAHIT (version >= 1.1.3; tested 1.1.3)
    • Prokka (version >= 1.14-dev; recommend to not set '-c' parameter when running prodigal with metagenome mode; tested 1.14-dev)
    • Prodigal (version >= 2.6; tested 2.6)
    • USEARCH (version >= 9.0.2132; licensed software; tested 9.0.2132)
    • Bowtie2 (version >= 2.3.2; tested 2.3.2)
    • SAMtools (version >= 1.9; tested 1.9)
    • featureCounts (version >= 1.6.2; tested 1.6.2)

Note: Please install the required software in a location in your $PATH. If you always run with gene families (non-redundant gene catalogs), the optional softwares are not required. Also if you always run with one or more bypass options (for information on bypass options, see optional arguments to the section Workflow by bypass mode).The software required for the steps you bypass does not need to be installed.

1.2 Installation

Note: If you are using a bioBakery machine image (e.g. in Google Cloud) you do not need to install MetaWIBELE because the tool and its dependencies are already installed.

1.2.1 Install MetaWIBELE

You can choose one of the following options to install the MetaWIBELE package.

Option 1: Installing with docker

  • $ docker pull biobakery/metawibele
  • This docker image includes most of the dependent software packages.
  • Large software packages and those with licenses are NOT included in this image and needed to be installed additionally:
    • Users should review the license terms and install these packages manually.
    • Softwares with the license : MSPminer, Signalp, TMHMM, Phobius, PSORTb
    • Softwares with large size: Interproscan (Note: We recommend installing InterProScan 5.51-85.0 (requiring at least Java 11) or later for potential simpler installation, and active Phobius/SignalP/TMHMM analyses by customizing your interproscan.properties configuration, see more details from InterProScan document).

Option 2: Installing with pip

  • $ pip install metawibele
  • If you do not have write permissions to /usr/lib/, then add the option --user to the install command. This will install the python package into subdirectories of ~/.local/. Please note when using the --user install option on some platforms, you might need to add ~/.local/bin/ to your $PATH as it might not be included by default. You will know if it needs to be added if you see the following message metawibele: command not found when trying to run MetaWIBELE after installing with the --user option.

Option 3: Installing with conda

  • $ conda install -c biobakery metawibele

1.2.2 Install dependent databases

1.2.2.1 Demo database for the purposes of running demo data

Note: If you are using a bioBakery machine image (e.g. in Google Cloud) you do not need to install dependent databases that are already installed

If not installed in the bioBakery machine image, for the purposes of this tutorial alone, please download demo databases to reduce run time:

  • Download the annotation files and uncompress into demo_uniref_database folder: demo_uniref_database.tar.gz
  • Alternatively, run the following command to download the annotation files into demo_uniref_database: $ metawibele_download_database --database DEMO --build demo_uniref --install-location $UNIREF_LOCATION
1.2.2.2 Whole database for general use

UniRef database is required if you will use MetaWIBELE to do global-homology based annotation and taxonomic annotation. We have built the database based on UniProt/UniRef90 2019_01 sequences and annotations. You can use any one of the following options to install the database with providing $UNIREF_LOCATION as the location for installation.

NOTE: Please point to this location for the default uniref database in the global config file (metawibele.cfg, see details in the section 1.2.3 Prepare configuration files). Alternatively, you can either set the location with the environment variable $UNIREF_LOCATION, or move the downloaded database to the folder named uniref_database in the current working directory.

Option 1: Download and uncompress them into location $UNIREF_LOCATION

Option 2: Run the following commands to install the databases into the location $UNIREF_LOCATION

  • download the UniRef90 sequences indexed by Diamond v0.9.24 into $UNIREF_LOCATION: $ metawibele_download_database --database uniref --build uniref90_diamond --install-location $UNIREF_LOCATION
  • download the UniRef90 annotation files into $UNIREF_LOCATION: $ metawibele_download_database --database uniref --build uniref90_annotation --install-location $UNIREF_LOCATION

1.2.3 Check install

To check out the install of MetaWIBELE packages and all dependencies (tools and databases), run the command:

  • $ metawibele_check_install

    Which yields (highly abbreviated):

     #----------------- Checking the install of metawibele packages -----------------#
     ###### Checking anadama2 package install ######
     Checking anadama2 install...........OK
     
     ###### Checking metawibele package install ######
     Checking metawibele install...........OK
     
     #----------------- Checking the install of required dependent tools -----------------#
     ###### Checking CD-hit install ######
     ...
     
     #----------------- Checking the install of optional dependent tools -----------------#
     ###### Checking MEGAHIT install ######
     MEGAHIT v1.1.3
    
     Checking MEGAHIT install...........OK
     ...
    
  • By default, it will check both MetaWIBEKE packages and all dependencies.

  • Alternatively, add the option "--types {metawibele,required,optional,all}" to check the install of specific package or dependency.

1.2.4 Prepare configuration files

To run MetaWIBELE, you are required to customize the global configuration file metawibele.cfg and make sure that it's in your current working directory.

NOTE: De default, MetaWIBELE will use the global configurations from metawibele.cfg in the current working directory. Alternatively you can always provide the location of the global configuration file you would like to use with the "--global-config " option to metawibele (see more in the section How to run).

  1. Download global configuration template

    You can use one of the following options to get the configuration template.

    • Option 1: obtain copies by right-clicking the link and selecting "save link as": metawibele.cfg
    • Option 2: run this command to download global configuration file: $ metawibele_download_config --config-type global
  2. Customize the global configuration

    The path of the dependent databases is required for customization. Most of other sections can be left as defaults.

    • Customize the path of dependent databases (required):

       [database]
       # The absolute path of UniRef databases folder.
       uniref_db = 
       # The domain databases used by MetaWIBELE. [data_path] provide the absolute path of the domain databases folder; [none] use the dedault domain databases installed in the metawibele package. [ Default: none ]
       domain_db = none 
      
    • Customize basic information for outputs (e.g. prefix name of output files, etc.) (optional):

       [basic]
       # Study name. [ Default: MGX ]
       study = MGX
       # The prefix name for output results. [ Default: metawibele ]
       basename = metawibele
      
    • Customize applied computational resources (e.g. CPU cores, memory, etc.) (optional):

       [computation]
       # The number of cores that you are requesting. [ Default: 1 ]
       threads = 1
       # The amount of memory (in MB) that you will be using for your job. [ Default: 20000 ] 
       memory = 20000
       # The amount of time (in minute) that you will be using for your job. [ Default: 60 ]
       time = 60
      
    • Customize parameter settings for abundance-based and domain/motif-based annotations (optional):

       [abundance]
       # The absolute path of the config file used by MSPminer. [config_file] provide the mspminer config file; [none] use the dedault config files installed in the metawibele package. [ Default: none ]
       mspminer = none
       # The method for normalization [Choices: cpm, relab]. [cpm] copies per million units (sum to 1 million); [relab] relative abundance (sum to 1). [ Default: cpm ]  
       normalize = cpm
       # The minimum abundance for each feature [ Default: 0 ]   
       abundance_detection_level = 0
      
       [msp]
       # The minimum fraction of taxonomy classified genes in each MSP [0-1]. [Default: 0.10]
       tshld_classified = 0.10 
       # The minimum percent differences between the most and second dominant taxon for each MSP [0-1]. [Default: 0.50] 
       tshld_diff = 0.50 
      
       [interproscan]
       # Interproscan executable file, e.g. /my/path/interproscan/interproscan.sh [ Default: interproscan.sh ]
       interproscan_cmmd = interproscan.sh
       # The appls used by interroiscan: [appls] comma separated list of analyses, [ Choices: CDD,COILS,Gene3D,HAMAP,MobiDBLite,PANTHER,Pfam,PIRSF,PRINTS,ProDom,PROSITEPATTERNS,PROSITEPROFILES,SFLD,SMART,SUPERFAMILY,TIGRFAM,Phobius,SignalP,TMHMM ]; [all] use all all analyses for running. [ Default: all ]
       interproscan_appl = all
       # The number of spliting files which can be annotated in parallel. [ Default: 1 ]
       split_number = 1
      
    • Customize parameter settings for association with environmental/host phenotypes (optional; but if you plan to run MetaWIBELE for supervised prioritization you are required to specify your settings for MaAsLin2 and specify the main phenotype metadata used for prioritization):

       [maaslin2]
       # The absolute path of Maaslin2 executable file, e.g. /my/path/Maaslin2/R/Maaslin2.R [ Default: Maaslin2.R ]
       maaslin2_cmmd = Maaslin2.R
       # The minimum abundance for each feature. [ Default: 0 ]  
       min_abundance = 0
       # The minimum percent of samples for which a feature is detected at minimum abundance. [ Default: 0.1 ]
       min_prevalence = 0.1
       # Keep features with variance greater than. [Default: 0.0]
       min_variance = 0
       # The q-value threshold for significance. [ Default: 0.25 ]
       max_significance = 0.25
       # The normalization method to apply [ Choices: TSS, CLR, CSS, NONE, TMM ]. [ Default: TSS ]
       normalization = NONE
       # The transform to apply [ Choices: LOG, LOGIT, AST, NONE ].  [ Default: LOG ]
       transform = LOG
       # The analysis method to apply [ Choices: LM, CPLM, ZICP, NEGBIN, ZINB ]. [ Default: LM ]
       analysis_method = LM
       # The fixed effects for the model, comma-delimited for multiple effects. [ Default: all ]
       fixed_effects = all
       # The random effects for the model, comma-delimited for multiple effects. [ Default: none ]
       random_effects = none
       # The correction method for computing the q-value. [ Default: BH ]
       correction = BH
       # Apply z-score so continuous metadata are on the same scale. [ Default: TRUE ]
       standardize = TRUE
       # Generate a heatmap for the significant associations. [ Default: FALSE ]
       plot_heatmap = FALSE
       # In heatmap, plot top N features with significant associations. [ Default: FALSE ]
       heatmap_first_n = FALSE
       # Generate scatter plots for the significant associations. [ Default: FALSE ]
       plot_scatter = FALSE
       # The number of R processes to run in parallel. [ Default: 1 ]
       maaslin2_cores = 1
       # The factor to use as a reference for a variable with more than two levels provided as a string of 'variable,reference' semi-colon delimited for multiple variables. NOTE: A space between the variable and reference will not error but will cause an inaccurate result. [ Default: NA ]
       reference = NA
       # The minimum percent of case-control samples used for comparision in which a feature is detected. [ Default: 0.1 ]
       tshld_prevalence = 0.10
       # The q-value threshold for significance used as DA annotations. [ Default: 0.05 ]
       tshld_qvalue = 0.05
       # The statistic used as effect size [ Choices: coef, mean(log) ]. [coef] represents the coefficient from the model; [mean(log)] represents the difference of mean values between case and control conditions. [  Default: mean(log) ]
       effect_size = mean(log)
       # The main phenotype metadata used for prioritization, e.g. metadata1. [ Default: none ]: skip the association with environmental/phenotypic parameters
       phenotype = none
      

2. MetaWIBELE utility: preprocessing metagenomic sequencing reads

2.1 MetaWIBELE-preprocess input files

MetaWIBELE provides utilities to build gene families starting from quality-controlled shotgun sequencing reads.

This tutorial will use demo input files distributed with the MetaWIBELE installation under the examples/raw_reads/ directory. If you installed with pip or conda and don't have these handy, you can obtain copies by right-clicking these links and selecting "save link as":

These two demo samples were taken from the PRISM (Prospective Registry in IBD Study at MGH) cohort. One sample is from a subject with Crohn’s disease (CD) and the other is from a non-IBD control.

2.2 Running MetaWIBELE utility: preprocessing module

The preprocessing module is a utility in the MetaWIBELE package for preprocessing metagenomic reads. It performs the following steps: (i) metagenomic assembly, (ii) open reading frame prediction, (iii) gene family (i.e. non-redundant gene catalogs) construction, and (iv) gene abundance estimation.

If you installed from conda / pip, download them here and save them to metawibele/raw_reads directory. If you installed from source, copy them from the metawibele source package to metawibele/raw_reads.

In the VM, navigate to the correct directory with:

cd ~/Tutorials/metawibele

Use the fastq.gz files from metawibele/raw_reads to run the command:

metawibele preprocess --input raw_reads/ --output output/preprocess/ \
--output-basename demo --extension-paired "_R1.fastq.gz,_R2.fastq.gz" --extension ".fastq.gz" --local-jobs 4

On the terminal, you will notice tasks being initiated and completed.

(May 26 00:04:46) [ 2/28 -   7.14%] **Started  ** Task 13: sample1__prodigal
(May 26 00:04:47) [ 5/28 -  17.86%] **Started  ** Task 18: ln__sample1.faa
(May 26 00:04:47) [ 9/28 -  32.14%] **Started  ** Task 14: ln__sample2.faa
(May 26 00:04:47) [12/28 -  42.86%] **Started  ** Task 22: format_protein_sequences
(May 26 00:04:47) [16/28 -  57.14%] **Started  ** Task 26: extract_protein_coding_genes

Note:

  • The input directory should contain quality-controlled shotgun sequencing metagenome files (fastq, fastq.gz, fasta, or fasta.gz format)
  • See the section on parallelization options to optimize the workflow run based on your computing resources.
  • The preprocessing workflow will run with the default settings for all subtasks. You can change the command line parameters to customize your settings for the preprocessing workflow.
    • For example, if your samples have a different file type extension or different paired-end labels you can modify the default settings when running the assembly task using the command line parameters --extension and --extension-paired. Using the text that follows after $SAMPLE in the input file name specify: --extension-paired "$R1_suffix,$R2_suffix", --extension "$fastq_suffix"
  • To write the log and err files in the working directory, add >preprocess.log 2>preprocess.err after the command above.

2.3 Preprocessing outputs

Main output files

The following are the two main output files of the preprocessing utility that are used for MetaWIBELE:

output/preprocess/finalized/demo_genecatalogs.centroid.faa
output/preprocess/finalized/demo_genecatalogs_counts.all.tsv

The pre-generated demo files can be directly downloaded here:

demo_genecatalogs.centroid.faa

To view this file, enter:

less output/preprocess/finalized/demo_genecatalogs.centroid.faa		

which yields:

>sample1_14-1
MFQDKYVFAQLASFLNRSKFNRIVAKYDGDKYVKHFTCWNQLLALMFGQLSNRESLRDLIVALEAHHSKCYHLGMGKNVS
KSSLARANQDRDYHIFEEYAYYLVSEAREKRVTHIFKLGGNVYAFDSTTIDLCLSVFWWAKFRKKKGGIKVHTLYDIETQ
IPAFFHITEASVHDSKVMNEIPYEPGSYYIFDRGYNNFKMLYKIHQIEAYFVVRAKKNLQYKSIKWKRRLPKNVFSDASI
LLTGFYPKQYYPEPLRLVKYWDEEQEREFTFITNAMHISALQVAELYKNRWQVELFFKWLKQHLKIKRFWGTTENAVRIQ
IYAAICTYCLVAIIQHDMQLDRSTYEVLQILSISLTDKTHLRDLFDKTKFQNDKERFGPNGPSLFNF
  • This file provides the amino acid sequences for each gene identified as the representative of its gene family (Fasta format file).
  • Each gene is given a MetaWIBELE-specific ID (i.e. sample1_11-1) based on the sample and order in which it was identified.
demo_genecatalogs_counts.all.tsv

To view this file, enter:

less output/preprocess/finalized/demo_genecatalogs_counts.all.tsv		

which yields:

ID	sample1	sample2
sample1_14-1	67	5
  • This file provides the counts of the number of reads mapped to each gene family (rows) in each sample (columns) (TSV format file).

Intermediate files

  1. Assembly results:

    • output/preprocess/finalized/demo_contig_sequence.fasta: contig sequences (Fasta format file).
    • The assembly outputs for each sample are in the output/preprocess/assembly/ folder.
  2. Gene-calling results:

    • output/preprocess/finalized/demo_gene_info.tsv: all gene calling information (TSV format file).
    • output/preprocess/finalized/demo_combined_gene_protein_coding.sorted.fna: nucleotide sequences for all ORFs sorted by gene length (Fasta format file).
    • output/preprocess/finalized/demo_combined_protein.sorted.faa: protein sequences for all ORFs sorted by gene length (Fasta format file).
    • output/preprocess/finalized/demo_combined_gene_protein_coding.complete.sorted.fna: nucleotide sequences for all complete ORFs sorted by gene length (Fasta format file).
    • output/preprocess/finalized/demo_combined_protein.complete.sorted.faa: protein sequences for all complete ORFs sorted by gene length (Fasta format file).
    • The gene-calling outputs from prodigal are in the output/preprocess/gene_calls/ folder.
    • The gene-annotation outputs from prokka are in the output/preprocess/gene_annotation/ folder.
  3. Gene families:

    • output/preprocess/finalized/demo_genecatalogs.clstr: clustering information for non-redudant gene families (Fasta format file).
    • output/preprocess/finalized/demo_genecatalogs_counts.all.tsv: read counts of gene families across samples (TSV format file).
    • output/preprocess/finalized/demo_genecatalogs.centroid.fna: nucleotide sequences of gene family representatives (Fasta format file).
    • output/preprocess/finalized/demo_genecatalogs.centroid.faa: protein sequences of gene family representatives (Fasta format file).
    • All mapping outputs for each sample are in the output/preprocess/mapping/ folder.

  • Single-sample assembly allows us to estimate the genomes per metagenomic sample using countable computational resources. What might be some pros/cons of this approach in real data? How about co-assembly by assembling metagenomes across samples?

3. Characterizing assembled metagenomic genes

3.1 MetaWIBELE-characterize input files

The MetaWIBELE-characterize module is run using gene families with protein sequence and read count information:

This tutorial will use the demo input files distributed with the MetaWIBELE installation under the examples/input/ directory. If you installed with pip or conda and don't have these handy, you can obtain copies by right-clicking these links and selecting "save link as":

Let's see the contents of these files.

demo_mgx_metadata.tsv

column -t -s $'\t' input/demo_mgx_metadata.tsv | less -S

which yields:

SID         study  sex     consent_age  antibiotic  mesalamine  steroids  immunosuppressant  diagnosis
PRISM_7122  PRISM  Female  38           No          No          No        Yes                CD
PRISM_7147  PRISM  Male    50           No          Yes         No        No                 CD
PRISM_7150  PRISM  Male    41           No          No          No        Yes                CD
PRISM_7153  PRISM  Male    51           No          Yes         No        No                 CD
PRISM_7184  PRISM  Male    68           No          No          No        No                 CD

The first column is the sample ID of a stool donor in the PRISM cohort which contains Crohn's disease (CD) patients and healthy controls (nonIBD). Other relevant metadata is also listed for these individuals. There are 102 samples (68 CD). In this tutorial, we are interested in prioritizing gene families associated with CD. Some of the relevant lines in the metawibele.cfg file (under [maaslin2] options) are:

reference = diagnosis,nonIBD
# The minimum percent of case-control samples used for comparison in which a feature is detected [ Default: 0.1 ]
effect_size = mean(log)
# The main phenotype metadata used for prioritization, e.g. metadata1. [ Default: none ]: skip the association with environmental/phenoty
phenotype = diagnosis

demo_genecatalogs.centroid.faa

To view this file, type:

less -S input/demo_genecatalogs.centroid.faa

Let's count the number of genes (centroids):

grep ">" input/demo_genecatalogs.centroid.faa | wc -l

which yields 1899 genes.

demo_genecatalogs_counts.all.tsv

To view this file, type:

less -S input/demo_genecatalogs_counts.all.tsv 

which yields:

ID      PRISM_7122      PRISM_7147      PRISM_7150      PRISM_7153      PRISM_7184      PRISM_7238
PRISM_7122_03545        42      2       0       4       2       22      1098    117     16      26
PRISM_7122_03875        197     16      2       15      87      0       0       80      92      0
PRISM_7122_12067        216     20      6       28      2       12      2006    258     17      102
PRISM_7122_131770       36      6       6       2       0       70      2274    24      22      58

3.2 Running MetaWIBELE: characterization module

A canonical MetaWIBELE run uses gene families (non-redundant gene catalogs) to build protein families and annotate them functionally and taxonomically. By default, protein families are annotated using global-homology-based, domain/motif-based, and abundance-based approaches. From the examples folder (or the location where you saved the gene families and sample metadata files above) execute the following:

If you're on the VM, use the files from metawibele/input/. If you installed with pip or conda, download them here and save them to the metawibele/raw_reads directory. If you installed from source, copy them from the metawibele source package to the metawibele/input/ directory. To run the following command, change to the metawibele/ directory and run:

metawibele characterize --input-sequence input/demo_genecatalogs.centroid.faa \
--input-count input/demo_genecatalogs_counts.all.tsv --input-metadata input/demo_mgx_metadata.tsv \
--output output/characterization/ --local-jobs 4 --bypass-psortb

Note:

  • Make sure the global configuration file metawibele.cfg is in your working directory (i.e. metawibele/ directory).
  • In the command replaces output/characterization/ with the path to the folder where you want to write the output files.
  • See the section on parallelization options to optimize the workflow run based on your computing resources.
  • The workflow runs with the default settings used to run all modules. These settings will work for most data sets. If you need to modify the default settings, you can use command line parameters to customize the characterization workflow settings. For example, you can run with one or more bypass options (for information on bypass options, see the section Workflow bypass mode).

Note the characterization phases of MetaWIBELE-characterize in the above log: 1) building protein families, 2) mapping proteins to UniRef (global-homology based annotation), 3) characterizing protein domains and motifs for protein families (domain-motif based annotation), 4) profiling taxonomy by binning co-abundant gene families and propagating the taxa to unclassified genes, 5) building an abundance table for protein families and associating it with environmental parameters or phenotypes.


  • How would you customize the workflow settings to skip the domain-motif based annotation step in order to reduce computational consumption?

3.3 MetaWIBELE characterization outputs

MetaWIBELE-characterize creates four main output subfolders that contain the main outputs of each step:

output/characterization/abundance_annotation/
output/characterization/clustering/
output/characterization/domain_motif_annotation/
output/characterization/finalized/
output/characterization/global_homology_annotation/

These are the main finalized characterization output files:

output/characterization/finalized/demo_proteinfamilies_annotation.tsv
output/characterization/finalized/demo_proteinfamilies_annotation.attribute.tsv
output/characterization/finalized/demo_proteinfamilies_annotation.taxonomy.tsv
output/characterization/finalized/demo_proteinfamilies_nrm.tsv

The pre-generated demo files can be directly downloaded here:

Let's inspect the contents of the finalized annotation files (in the sub-folder finalized) and the intermediate subfolders for each annotation step:

3.3.1 finalized annotations

demo_proteinfamilies_annotation.tsv
familyID    annotation  feature category    method  AID
Cluster_1   good    quality note    Quality_control NA
Cluster_1   demo    study   project Shotgun NA
Cluster_1   Cluster_1   protein_family  Denovo_clustering   CD-hit  Cluster_1__Denovo_clustering
Cluster_1   UniRef90_A7B522 strong_homology UniRef90_homology   UniRef90    Cluster_1__UniRef90_homology
Cluster_1   Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)  Terminal    Taxonomy_characterization   Taxonomy_annotation Cluster_1__Taxonomy_characterization
Cluster_1   3505.6086774509786  DNA_abundance   Denovo_characterization DNA Cluster_1__DNA_abundance
Cluster_1   0.9509803921568627  DNA_prevalence  Denovo_characterization DNA Cluster_1__DNA_prevalence
Cluster_1   PF01176:PF00164;PF01176:PF11987 DOMINE_interaction  Denovo_characterization DOMINE  Cluster_1__DOMINE_interaction
Cluster_1   PF01176:PF00164;PF01176:PF11987 ExpAtlas_interaction    Denovo_characterization ExpAtlas    Cluster_1__ExpAtlas_interaction
Cluster_1   PF01176 InterProScan_PfamDomain Denovo_characterization InterProScan    Cluster_1__InterProScan_PfamDomain
Cluster_1   msp_1   MSPminer_MSP    Denovo_characterization MSPminer    Cluster_1__MSPminer_MSP
Cluster_1   up(CD_vs_nonIBD)    MaAsLin2_DA Denovo_characterization MaAsLin2    Cluster_1__MaAsLin2_DA
...
  • This file details the annotation of each protein family in the community (TSV format file).
  • Protein families are groups of evolutionarily-related protein-coding sequences that often perform similar functions.
  • MetaWIBELE annotates protein families by combining global-homology similarity, local-homology similarity, and non-homology based methods.
  • The annotations for each protein family come from multiple information sources, e.g. biochemical annotation, taxonomic annotation, ecological properties and association with environmental parameters or phenotypes, etc.
demo_proteinfamilies_annotation.attribute.tsv
TID AID key value
1   Cluster_1__Denovo_clustering    repID   PRISM_7861_211956
2   Cluster_1__Denovo_clustering    rep_length  85
3   Cluster_1__Denovo_clustering    cluster_size    24
4   Cluster_1__UniRef90_homology    UniProtKB   A7B522 
5   Cluster_1__UniRef90_homology    Protein_names   Translation initiation factor IF-1
6   Cluster_1__UniRef90_homology    query_cov_type  high_confidence
7   Cluster_1__UniRef90_homology    mutual_cov_type high_confidence
8   Cluster_1__UniRef90_homology    identity    100.0  
9   Cluster_1__UniRef90_homology    query_coverage  1.0
10  Cluster_1__UniRef90_homology    mutual_coverage 1.0
11  Cluster_1__UniRef90_homology    taxa_id 1
12  Cluster_1__UniRef90_homology    taxa_name   Unclassified
13  Cluster_1__Taxonomy_characterization    taxa_id 411470 
14  Cluster_1__Taxonomy_characterization    taxa_lineage    k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__Blautia|s__Ruminococcus_gnavus|t__Ruminococcus_gnavus_(strain_ATCC_29149_/_VPI_C7-9)
15  Cluster_1__Taxonomy_characterization    LCA_Tax Lachnospiraceae
16  Cluster_1__Taxonomy_characterization    LCA_TaxID   186803
...
  • This is the supplementary results for characterization (TSV format file).
  • Each line in the file provides supplemental information about the corresponding annotation results. AID is the key to connect demo_proteinfamilies_annotation.tsv with demo_proteinfamilies_annotation.attribute.tsv.
demo_proteinfamilies_annotation.taxonomy.tsv
familyID    study   map_type    query_type  mutual_type identity    query_coverage  mutual_coverage detail  Tax TaxID   Rep_Tax Rep_TaxID   UniProtKB   unirefID    note    msp_name    msp_taxa_name   msp_taxa_id MSP_Tax MSP_TaxID   MSP_Rep_Tax MSP_Rep_TaxID   taxa_id taxa_name   taxa_rank   taxa_lineage   
Cluster_1   demo    UniRef90_characterized  high_confidence high_confidence 100.0   1.0 1.0 Translation initiation factor IF-1  root    1   Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)  411470  A7B522  UniRef90_A7B522 good    msp_1   Ruminococcus gnavus 33038   Lachnospiraceae 186803  Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)  411470  411470  Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)  Terminal    k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__Blautia|s__Ruminococcus_gnavus|t__Ruminococcus_gnavus_(strain_ATCC_29149_/_VPI_C7-9)
Cluster_10  demo    UniRef90_uncharacterized    high_confidence high_confidence 97.2    0.8294573643410853  0.8294573643410853  Uncharacterized protein Clostridiales   186802  [Eubacterium] rectale   39491   A0A0M6W8Q5  UniRef90_A0A0M6W8Q5 good    msp_2   [Eubacterium] rectale   39491   [Eubacterium] rectale   39491   [Eubacterium] rectale   39491   39491   [Eubacterium] rectale   Species k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale
Cluster_100 demo    UniRef90_characterized  high_confidence high_confidence 99.7    1.0 1.0 S-adenosylmethionine:tRNA ribosyltransferase-isomerase  Clostridiales   186802  [Eubacterium] rectale   39491   A0A0M6WLI9  UniRef90_A0A0M6WLI9 good    msp_2   [Eubacterium] rectale   39491   [Eubacterium] rectale   39491   [Eubacterium] rectale   39491   39491   [Eubacterium] rectale   Species k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale
Cluster_1000    demo    UniRef90_uncharacterized    high_confidence high_confidence 99.5    1.0 1.0 Domain of uncharacterized function (DUF1836)    Clostridiales   186802  [Eubacterium] rectale   39491   A0A174GTW6  UniRef90_A0A174GTW6 good    msp_2   [Eubacterium] rectale   39491   [Eubacterium] rectale   39491   [Eubacterium] rectale   39491   39491   [Eubacterium] rectale   Species k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale
Cluster_1001    demo    UniRef90_characterized  high_confidence high_confidence 97.7    1.0 1.0 DUF624 domain-containing protein    Clostridiales   186802  [Eubacterium] rectale   39491   A0A173U990  UniRef90_A0A173U990 good    msp_2   [Eubacterium] rectale   39491   [Eubacterium] rectale   39491   [Eubacterium] rectale   39491   39491   [Eubacterium] rectale   Species k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale
...
  • This file provides detailed information about taxonomic annotation (TSV format file).
  • demo_proteinfamilies_annotation.taxonomy.all.tsv shows the taxonomic information at each taxonomic level.
demo_proteinfamilies_nrm.tsv
ID            PRISM_7122  PRISM_7147  PRISM_7150  PRISM_7153  PRISM_7184  PRISM_7238  
Cluster_1     1628.57     344.191     1779.04     361.815     21274.3     2531.96     
Cluster_10    0           1775.32     1406.68     1703.73     1168.16     61.7906     
Cluster_100   0           1787.4      1288.79     1809.2      0           203.804     
Cluster_1000  0           1726.58     4340.43     1776.75     0           385.693     
Cluster_1001  0           1742.53     1174.7      1638.48     0           293.861     
Cluster_1002  0           0           0           0           0           1597.87     
...
  • This file contains the relative abundance per protein family across samples (TSV format file).
  • Protein family abundance is reported in copies per million (CPM) units by default, which is "total sum scaling (TSS)"-style normalization (i.e. each sample is constrained to sum to 1 million). First, each protein family is normalized to RPK (reads per kilobase) units for gene length normalization; RPK units reflect relative gene (or transcript) copy numbers in the community. Then, RPK values are further sum-normalized (CPM) to adjust for differences in sequencing depth across samples. For further information, refer to the normalization approach in HUMAnN.

3.3.2 protein families

The clustering folder contains the outputs of clustering proteins into protein families using the same criteria for building UniRef90 clusters.

The pre-generated demo files can be directly downloaded here:

demo_proteinfamilies.clstr
>PRISM_7861_211956;Cluster_1;length=85;size=24;cluster=1
PRISM_7861_211956
PRISM_7791_142577
PRISM_8794_132719
PRISM_7910_97769
PRISM_8774_132255
PRISM_8496_12696
PRISM_7847_96543
PRISM_9126_111168
...
>PRISM_7855_54982;Cluster_2;length=559;size=12;cluster=2
PRISM_7855_54982
PRISM_8776_12844
PRISM_8467_139927
PRISM_9148_54414
PRISM_8475_22744
PRISM_7122_03875
PRISM_8589_39036
PRISM_8776_341235
...
  • This file shows the clustered proteins (Fasta format file).
  • The top ID (ex. PRISM_7861_211956) is the representative of the cluster and it is followed by a list of cluster members.
  • Each protein is given a MetaWIBELE-specific ID (i.e. PRISM_7861_211956 and PRISM_7861_54982) based on the sample and order in which it was identified.
demo_proteinfamilies.centroid.faa
>PRISM_7938_21108
MKTRRKKQTKRVLAGTLAALMTVSAVPVSNSVVHAEESQDRSELKLRYSSAAPDSYAGWEKWSLPIGNSGIGASVFGGVQTERIQLNEKSLWSGGPSDSRPNYNGGNLEEKGKNGQTVKEIQQLFANGDNDAASSKCGELVGLSDDAGVNGYGY
...
>PRISM_8496_61239
MKKGKKGLALAMTLAVTMTSVPTAFALDGNAQPTKQPDNQLGQRDTKAISYKYTKLDQAPMKESVQASSENTPMQWSEGPANLAFDGDVNTGWHTDYNTHQGPYTIEWNLGGTYKIGKVEYTRKKTGAAGIWKEIKIEVKNGETGSWQTAYNGT
...
  • This file provides the amino acid sequences for each protein identified as the representative of its protein family (Fasta format file).

3.3.3 global-homology based annotations

The global_homology_annotation folder contains the outputs from querying each sequence in the protein families against the UniRef90 database via a protein-level search.

The pre-generated demo files can be directly downloaded here:

demo_proteinfamilies_annotation.uniref90_annotation.tsv
familyID	study	map_type	query_type	mutual_type	identity	query_coverage	mutual_coverage	detail	Tax	TaxID	Rep_Tax	Rep_TaxID	UniProtKB	unirefID	taxa_id	taxa_name	taxa_rank	taxa_lineage	note
Cluster_1	demo	UniRef90_characterized	high_confidence	high_confidence	100.0	1.0	1.0	Translation initiation factor IF-1	root	1	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	411470	A7B522	UniRef90_A7B522	411470	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	Terminal	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__Blautia|s__Ruminococcus_gnavus|t__Ruminococcus_gnavus_(strain_ATCC_29149_/_VPI_C7-9)	good
Cluster_10	demo	UniRef90_uncharacterized	high_confidence	high_confidence	97.2	0.8294573643410853	0.8294573643410853	Uncharacterized protein	Clostridiales	186802	[Eubacterium] rectale	39491	A0A0M6W8Q5	UniRef90_A0A0M6W8Q5	39491	[Eubacterium] rectale	Species	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale	good
Cluster_100	demo	UniRef90_characterized	high_confidence	high_confidence	99.7	1.0	1.0	S-adenosylmethionine:tRNA ribosyltransferase-isomerase	Clostridiales	186802	[Eubacterium] rectale	39491	A0A0M6WLI9	UniRef90_A0A0M6WLI9	39491	[Eubacterium] rectale	Species	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale	good
Cluster_1000	demo	UniRef90_uncharacterized	high_confidence	high_confidence	99.5	1.0	1.0	Domain of uncharacterized function (DUF1836)	Clostridiales	186802	[Eubacterium] rectale	39491	A0A174GTW6	UniRef90_A0A174GTW6	39491	[Eubacterium] rectale	Species	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale	good
Cluster_1001	demo	UniRef90_characterized	high_confidence	high_confidence	97.7	1.0	1.0	DUF624 domain-containing protein	Clostridiales	186802	[Eubacterium] rectale	39491	A0A173U990	UniRef90_A0A173U990	39491	[Eubacterium] rectale	Species	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale	good
...
  • This file shows the mapping results of protein families to UniRef90 (TSV format file).
  • Each row in this file corresponds to a protein family (i.e. Cluster_1, Cluster_10). Based on the sequence identity to and coverage of the UniRef90 sequence, each protein family is labeled as:
    • UniRef90_characterized: proteins with strong homology to known characterized proteins
    • UniRef90_uncharacterized: proteins with strong homology to proteins of uncharacterized function (i.e. lacking a GO term assignment)
    • UniRef90_weak_homology: potentially novel proteins that only had weak homology to any UniRef90 family
    • UniRef90_worse_homology: completely novel proteins without any significant homology
demo_UniRef90_proteinfamilies.detail.tsv
familyID	type	detail	Protein_names	Gene_names	Tax	TaxID	Rep_Tax	Rep_TaxID	UniProtKB	unirefID
Cluster_1	UniRef90_GO	GO:0003743;GO:0005737;GO:0019843;GO:0043022	Translation initiation factor IF-1	infA{ECO:0000256|HAMAP-Rule:MF_00075,;RUMGNA_02659{ECO:0000313|EMBL:EDN77045.1}	root	1	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	411470	A7B522	UniRef90_A7B522
Cluster_1	UniRef90_PfamDomain	PF01176	Translation initiation factor IF-1	infA{ECO:0000256|HAMAP-Rule:MF_00075,;RUMGNA_02659{ECO:0000313|EMBL:EDN77045.1}	root	1	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	411470	A7B522	UniRef90_A7B522
Cluster_1	UniRef90_eggNOG	COG0361;ENOG4105K9U	Translation initiation factor IF-1	infA{ECO:0000256|HAMAP-Rule:MF_00075,;RUMGNA_02659{ECO:0000313|EMBL:EDN77045.1}	root	1	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	411470	A7B522	UniRef90_A7B522
Cluster_10	UniRef90_uncharacterized	UniRef90_uncharacterized	Uncharacterized protein	T1815_02191{ECO:0000313|EMBL:CRL31850.1}	Clostridiales	186802	[Eubacterium] rectale	39491	A0A0M6W8Q5	UniRef90_A0A0M6W8Q5
Cluster_100	UniRef90_GO	GO:0005737;GO:0008616;GO:0016740;GO:0016853	S-adenosylmethionine:tRNA ribosyltransferase-isomerase	queA{ECO:0000256|HAMAP-Rule:MF_00113,;T1815_15831{ECO:0000313|EMBL:CRL37473.1}	Clostridiales	186802	[Eubacterium] rectale	39491	A0A0M6WLI9	UniRef90_A0A0M6WLI9
Cluster_100	UniRef90_PfamDomain	PF02547	S-adenosylmethionine:tRNA ribosyltransferase-isomerase	queA{ECO:0000256|HAMAP-Rule:MF_00113,;T1815_15831{ECO:0000313|EMBL:CRL37473.1}	Clostridiales	186802	[Eubacterium] rectale	39491	A0A0M6WLI9	UniRef90_A0A0M6WLI9
...
  • This file details the annotations of the UniRef90s to other groupings, including GO terms, KEGG Orthogroups, EggNOG (including COGs), and Pfams (TSV format file).
  • Protein families may also be labeled as:
    • UniRef90_characterized: proteins with strong homology to known characterized proteins
    • UniRef90_uncharacterized: proteins with strong homology to proteins of uncharacterized function (i.e. lacking a GO term assignment)
    • UniRef90_weak_homology: potentially novel proteins that only had weak homology to any UniRef90 family
    • UniRef90_worse_homology: completely novel proteins without any significant homology

3.3.4 domain-motif annotations

The domain_motif_annotation folder contains the outputs created by using the local-homology approach (domain/motif search) to characterize the secondary structures of protein families.

The pre-generated demo files can be directly downloaded here:

demo_Denovo_proteinfamilies.signaling.detail.tsv
familyID	type	detail	consistency
Cluster_1012	Denovo_signaling	signaling	1.0;1.0
Cluster_1021	Denovo_signaling	signaling	1.0;1.0
Cluster_1051	Denovo_signaling	signaling	1.0;1.0
Cluster_1089	Denovo_signaling	signaling	1.0;1.0
Cluster_1153	Denovo_signaling	signaling	1.0;1.0
Cluster_1205	Denovo_signaling	signaling	1.0;1.0
Cluster_1259	Denovo_signaling	signaling	1.0;1.0
Cluster_1279	Denovo_signaling	signaling	1.0;1.0
...
  • This file lists potential signaling proteins based on de novo prediction of function (TSV format file).
  • Proteins that are not predicted to be signaling proteins are excluded from the file.
  • The consistency score (column 4) contains the consistency scores from SignalP and Phobius predictions. When both SignalP and Phobius predict the same type of signal and have high consistency scores the prediction is believed to be true and the protein family is included in this file.
demo_Denovo_proteinfamilies.transmembrane.detail.tsv
familyID	type	detail	consistency
Cluster_1001	Denovo_transmembrane	transmembrane	1.0;1.0
Cluster_1002	Denovo_transmembrane	transmembrane	1.0;1.0
Cluster_1005	Denovo_transmembrane	transmembrane	1.0;1.0
Cluster_1008	Denovo_transmembrane	transmembrane	1.0;1.0
Cluster_1011	Denovo_transmembrane	transmembrane	1.0;1.0
Cluster_1015	Denovo_transmembrane	transmembrane	1.0;1.0
Cluster_1018	Denovo_transmembrane	transmembrane	1.0;1.0
Cluster_1020	Denovo_transmembrane	transmembrane	1.0;1.0
...
  • Similar to the above, this file lists potential transmembrane proteins based on de novo prediction of function (TSV format file).
demo_InterProScan_proteinfamilies.detail.tsv
familyID	type	detail	description	InterPro_accession	consistency
Cluster_860	InterProScan_PfamDomain	PF01226	Formate/nitrite transporter	IPR000292	1.0
Cluster_328	InterProScan_PfamDomain	PF01293	Phosphoenolpyruvate carboxykinase, ATP-utilising	IPR001272	1.0
Cluster_1523	InterProScan_PfamDomain	PF00301	Rubredoxin domain	IPR024935	1.0
Cluster_1201	InterProScan_PfamDomain	PF00583	GNAT domain	IPR000182	1.0
Cluster_1068	InterProScan_PfamDomain	PF00005;PF02518	ABC transporter-like;Histidine kinase/HSP90-like ATPase	IPR003439;IPR003594	1.0;1.0
Cluster_1249	InterProScan_PfamDomain	PF13782	Stage V sporulation protein AB	IPR020144	1.0
Cluster_784	InterProScan_PfamDomain	PF00005	ABC transporter-like	IPR003439	1.0
Cluster_200	InterProScan_PfamDomain	PF01797	Transposase IS200-like	IPR002686	1.0
...
  • This file lists the results of using InterProScan to predict Pfam IDs or other domain signatures for each protein family (TSV format file).
  • The consistency score is calculated as the proportion of members within the cluster that receive the same predicted Pfam ID as the protein family representative.
demo_DOMINE_proteinfamilies.detail.tsv
familyID	type	detail	description	consistency
Cluster_328	DOMINE_interaction	PF01293:PF00342	Phosphoenolpyruvate carboxykinase:Phosphoglucose isomerase	1.0
Cluster_1523	DOMINE_interaction	PF00301:PF07992	Rubredoxin:Pyridine nucleotide-disulphide oxidoreductase	1.0
Cluster_1201	DOMINE_interaction	PF00583:PF00439;PF00583:PF00514;PF00583:PF00696;PF00583:PF02629;PF00583:PF04055;PF00583:PF04112;PF00583:PF07479;PF00583:PF00244	Acetyltransferase (GNAT) family:Bromodomain;Acetyltransferase (GNAT) family:Armadillo/beta-catenin-like repeat;Acetyltransferase (GNAT) family:Amino acid kinase family;Acetyltransferase (GNAT) family:CoA binding domain;Acetyltransferase (GNAT) family:Radical SAM superfamily;Acetyltransferase (GNAT) family:Mak10 subunit, NatC N(alpha)-terminal acetyltransferase;Acetyltransferase (GNAT) family:NAD-dependent glycerol-3-phosphate dehydrogenase C-terminus;Acetyltransferase (GNAT) family:14-3-3 protein	1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0
Cluster_1068	DOMINE_interaction	PF00005:PF00009;PF00005:PF00037;PF00005:PF00571;PF00005:PF00664;PF02518:PF00012;PF02518:PF00072;PF02518:PF00183;PF02518:PF00204;PF02518:PF00240;PF02518:PF00515;PF02518:PF00989;PF02518:PF01119;PF02518:PF01590;PF02518:PF01624;PF02518:PF01740;PF02518:PF01751;PF02518:PF00512;PF02518:PF04968;PF02518:PF04969;PF02518:PF08565;PF02518:PF10436	ABC transporter:Elongation factor Tu GTP binding domain;ABC transporter:4Fe-4S binding domain;ABC transporter:CBS domain;ABC transporter:ABC transporter transmembrane region;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Hsp70 protein;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Response regulator receiver domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Hsp90 protein;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:DNA gyrase B;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Ubiquitin family;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Tetratricopeptide repeat;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:PAS fold;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:DNA mismatch repair protein, C-terminal domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:GAF domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:MutS domain I;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:STAS domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Toprim domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:His Kinase A (phospho-acceptor) domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:CHORD;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:CS domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Cdc37 Hsp90 binding domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Mitochondrial branched-chain alpha-ketoacid dehydrogenase kinase	1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0
Cluster_784	DOMINE_interaction	PF00005:PF00009;PF00005:PF00037;PF00005:PF00571;PF00005:PF00664	ABC transporter:Elongation factor Tu GTP binding domain;ABC transporter:4Fe-4S binding domain;ABC transporter:CBS domain;ABC transporter:ABC transporter transmembrane region	1.0;1.0;1.0;1.0
Cluster_533	DOMINE_interaction	PF00589:PF02022	Phage integrase family:Integrase Zinc binding domain	1.0
Cluster_908	DOMINE_interaction	PF00528:PF00005	Binding-protein-dependent transport system inner membrane component:ABC transporter	1.0
Cluster_1067	DOMINE_interaction	PF00293:PF00300;PF00293:PF05026;PF00293:PF09296	NUDIX domain:Histidine phosphatase superfamily (branch 1);NUDIX domain:Dcp2, box A domain;NUDIX domain:NADH pyrophosphatase-like rudimentary NUDIX domain	1.0;1.0;1.0
...
  • This file lists the results of using DOMINE to predict domain-domain interactions for each protein (TSV format file).
  • In this step, MetaWIBELE searches the DOMINE database of known and predicted protein domain (domain-domain) interactions in order to identify microbial proteins that interact with other microbial proteins or host proteins.
  • Looking at the first row for protein family Cluster_328, the details column contains "PF01293:PF00342". The first value is the Pfam ID for this protein family, and the second value is the Pfam ID of the protein it is known or predicted to interact with.
  • Protein families can interact with multiple other proteins and multiple interactions are separated by ";" in the details column. For example, the third row for protein family Cluster_1201 has Pfam ID PF00583 and interacts with multiple other proteins including PF00439, PF00514, PF00696, and many others.

3.3.5 abundance annotations

The abundance_annotation folder contains the outputs created from utilizing a non-homology based strategy compromising (i) taxonomic annotation with phylogenetic binning, (ii) abundance profiling for protein families, and (iii) association with environmental parameters or phenotypes based on differential abundance.

The pre-generated demo files can be directly downloaded here:

demo_DNA_proteinfamilies.abundance.detail.tsv
familyID	type	detail	mean_abundance	mean_prevalent_abundance	prevalence	total_abundance	total_samples
Cluster_1	DNA-CD_abundance	4734.313442647057	4734.313442647057	5110.052604761903	0.9264705882352942	321933.3140999999	68
Cluster_1	DNA-CD_prevalence	0.9264705882352942	4734.313442647057	5110.052604761903	0.9264705882352942	321933.3140999999	68
Cluster_1	DNA_abundance	3505.6086774509786	3505.6086774509786	3686.310155670101	0.9509803921568627	357572.0850999998	102
Cluster_1	DNA_prevalence	0.9509803921568627	3505.6086774509786	3686.310155670101	0.9509803921568627	357572.0850999998	102
Cluster_1	DNA-nonIBD_abundance	1048.1991470588237	1048.1991470588237	1048.1991470588237	1.0	35638.771	34
Cluster_1	DNA-nonIBD_prevalence	1.0	1048.1991470588237	1048.1991470588237	1.0	35638.771	34
Cluster_1	DNA-within-phenotype_abundance	4734.313442647057	4734.313442647057	5110.052604761903	0.9264705882352942	321933.3140999999	68
Cluster_1	DNA-within-phenotype_prevalence	1.0	1048.1991470588237	1048.1991470588237	1.0	35638.771	34
...

  • This file lists the results of abundance annotations for each protein (TSV format file).
  • In this file, the number of rows per protein family is equal to 4+(2*the number of levels of the phenotype variable of interest). In this case, our phenotype variable of interest contains two levels (CD and non-IBD) and so there are 8 rows per protein family.
  • We calculate the mean abundance and mean prevalence of the protein family within levels of the phenotype variable. The "within phenotype" abundance and prevalence are used later by MetaWIBELE and are equal to the largest of these values, respectively. For example, for Cluster_1 the "within phenotype" abundance is equal to the CD abundance, as this is larger than the non-IBD abundance.
  • Additionally, we calculate the mean abundance and prevalence of the protein family across all the samples ("DNA_abundance", "DNA_prevalence").
demo_MaAsLin2_proteinfamilies.DA.detail.tsv
familyID	type	detail	cmp_type	prevalence	prevalence_case	prevalence_control	coef	stderr	pvalue	qvalue	log(FC)	Cohen's_d	mean(log)	mean_abundance	mean_abundance_case	mean_abundance_control	mean_prevalent_abundance	mean_prevalent_abundance_case	mean_prevalent_abundance_control
Cluster_501	MaAsLin2_DA	down(CD_vs_nonIBD)	CD_vs_nonIBD	0.7745098039215687	0.6764705882352942	0.9705882352941176	-5.58372315590506	1.5310181090399	0.000441653383784318	0.000831504617354514	-1.204565604476706	-1.9886542663994446	-4.5777634174796376	959.4644850392153	539.4822569705882	1799.428941176471	1238.8022465063286	797.4955103043478	1853.9570909090912
Cluster_1250	MaAsLin2_DA	down(CD_vs_nonIBD)	CD_vs_nonIBD	0.6862745098039216	0.5441176470588235	0.9705882352941176	-7.1718818072073	1.40264174413871	1.74393840488004e-06	1.87992698862794e-05	-1.0859781754829547	-1.7675293091074744	-4.52091979609633	870.8175474705881	526.3853064999998	1559.6820294117651	1268.9055691714284	967.4108335675674	1606.9451212121216
Cluster_1320	MaAsLin2_DA	down(CD_vs_nonIBD)	CD_vs_nonIBD	0.6274509803921569	0.4411764705882353	1.0	-6.20349834654513	1.15147197796565	5.56499760870556e-07	9.83186205942098e-06	-1.0694276667956981	-1.3164688998078924	-4.430593814114001	283.20678581372556	172.8054846029412	504.0093882352942	451.3608148906251	391.69243176666674	504.0093882352942
Cluster_1418	MaAsLin2_DA	down(CD_vs_nonIBD)	CD_vs_nonIBD	0.6666666666666666	0.5147058823529411	0.9705882352941176	-6.28435889075268	1.34243910968233	9.90530740415313e-06	5.13580581611423e-05	-1.0969135855462964	-1.781512985703419	-4.3588123134077055	894.3034949607845	537.0061542058824	1608.8981764705882	1341.4552424411768	1043.3262424571428	1657.6526666666666
Cluster_1463	MaAsLin2_DA	down(CD_vs_nonIBD)	CD_vs_nonIBD	0.6666666666666666	0.5147058823529411	0.9705882352941176	-6.58191770300358	1.32440076206318	3.13169816154589e-06	2.62153442877793e-05	-1.1512547063916818	-1.8899914209991773	-4.337871755508877	870.681370784314	505.8497326470589	1600.344647058824	1306.0220561764709	982.7937662857144	1648.8399393939399
Cluster_1109	MaAsLin2_DA	down(CD_vs_nonIBD)	CD_vs_nonIBD	0.7156862745098039	0.5882352941176471	0.9705882352941176	-6.68871156016538	1.47548693289107	1.76343110261956e-05	7.66082133675763e-05	-0.9897351577021917	-1.610820456043556	-4.3292683292822165	1023.5636736960785	654.6116723088236	1761.4676764705882	1430.1848591369865	1112.839842925	1814.8454848484848
Cluster_851	MaAsLin2_DA	up(CD_vs_nonIBD)	CD_vs_nonIBD	0.5784313725490197	0.7647058823529411	0.20588235294117646	6.72808441059811	1.36985906524513	3.96243685449574e-06	2.95192066145927e-05	3.7032921337370386	1.1561240497664915	4.325883574007152	448.4572443137256	664.6057397058825	16.16025352941177	775.2989647457629	869.0998134615386	78.49266000000001
Cluster_1357	MaAsLin2_DA	down(CD_vs_nonIBD)	CD_vs_nonIBD	0.5686274509803921	0.39705882352941174	0.9117647058823529	-6.63240896014994	1.3998488401409	7.92308869232625e-06	4.45867408129641e-05	-1.1277371583077196	-1.1938428944361867	-4.317084302332613	617.8856939019607	364.1507633529411	1125.355555	1086.626565137931	917.1204410370368	1234.2609312903226
Cluster_1252	MaAsLin2_DA	down(CD_vs_nonIBD)	CD_vs_nonIBD	0.6372549019607843	0.4852941176470588	0.9411764705882353	-5.64000231000308	1.44856135617507	0.000188164992197779	0.000419131463307499	-1.0050544474555525	-1.2973343785212454	-4.292311085975335	662.4116687058823	419.8753118823529	1147.484382352941	1039.4767724307692	865.1976123636363	1219.20215625
...
  • This file contains the results of running MaAsLin2 linear models to assess the differential abundance of the protein families across levels of the phenotype variable of interest (TSV format file).
  • The detail type of "down" indicates that the abundance of the protein family was lower in CD compared to the non-IBD control, while a detail type of "up" indicates that the abundance of the protein family was higher in CD compared to the non-IBD control.
demo_MaAsLin2_proteinfamilies.DA-sig.detail.tsv
familyID	type	detail	cmp_type	prevalence	prevalence_case	prevalence_control	coef	stderr	pvalue	qvalue	log(FC)	Cohen's_d	mean(log)	mean_abundance	mean_abundance_case	mean_abundance_control	mean_prevalent_abundance	mean_prevalent_abundance_case	mean_prevalent_abundance_control
Cluster_501	MaAsLin2_DA-sig	down(CD_vs_nonIBD)	CD_vs_nonIBD	0.7745098039215687	0.6764705882352942	0.9705882352941176	-5.58372315590506	1.5310181090399	0.000441653383784318	0.000831504617354514	-1.204565604476706	-1.9886542663994446	-4.5777634174796376	959.4644850392153	539.4822569705882	1799.428941176471	1238.8022465063286	797.4955103043478	1853.9570909090912
Cluster_1250	MaAsLin2_DA-sig	down(CD_vs_nonIBD)	CD_vs_nonIBD	0.6862745098039216	0.5441176470588235	0.9705882352941176	-7.1718818072073	1.40264174413871	1.74393840488004e-06	1.87992698862794e-05	-1.0859781754829547	-1.7675293091074744	-4.52091979609633	870.8175474705881	526.3853064999998	1559.6820294117651	1268.9055691714284	967.4108335675674	1606.9451212121216
Cluster_1320	MaAsLin2_DA-sig	down(CD_vs_nonIBD)	CD_vs_nonIBD	0.6274509803921569	0.4411764705882353	1.0	-6.20349834654513	1.15147197796565	5.56499760870556e-07	9.83186205942098e-06	-1.0694276667956981	-1.3164688998078924	-4.430593814114001	283.20678581372556	172.8054846029412	504.0093882352942	451.3608148906251	391.69243176666674	504.0093882352942
Cluster_1418	MaAsLin2_DA-sig	down(CD_vs_nonIBD)	CD_vs_nonIBD	0.6666666666666666	0.5147058823529411	0.9705882352941176	-6.28435889075268	1.34243910968233	9.90530740415313e-06	5.13580581611423e-05	-1.0969135855462964	-1.781512985703419	-4.3588123134077055	894.3034949607845	537.0061542058824	1608.8981764705882	1341.4552424411768	1043.3262424571428	1657.6526666666666
Cluster_1463	MaAsLin2_DA-sig	down(CD_vs_nonIBD)	CD_vs_nonIBD	0.6666666666666666	0.5147058823529411	0.9705882352941176	-6.58191770300358	1.32440076206318	3.13169816154589e-06	2.62153442877793e-05	-1.1512547063916818	-1.8899914209991773	-4.337871755508877	870.681370784314	505.8497326470589	1600.344647058824	1306.0220561764709	982.7937662857144	1648.8399393939399
Cluster_1109	MaAsLin2_DA-sig	down(CD_vs_nonIBD)	CD_vs_nonIBD	0.7156862745098039	0.5882352941176471	0.9705882352941176	-6.68871156016538	1.47548693289107	1.76343110261956e-05	7.66082133675763e-05	-0.9897351577021917	-1.610820456043556	-4.3292683292822165	1023.5636736960785	654.6116723088236	1761.4676764705882	1430.1848591369865	1112.839842925	1814.8454848484848
...
  • This file contains similar data to the demo_MaAsLin2_proteinfamilies.DA.detail.tsv file above (TSV format file); however, it only includes results for protein families that were significantly (q<0.05) differentially abundant by default.

  • Under what circumstances would you want to skip the association analysis?

demo_MSPminer_proteinfamilies.detail.tsv
familyID	type	detail	gene_class	module_name	msp_description	msp_component	multi_msp	total_member	core_gene	accessory_gene	module_compoment	multi_module	taxa_id	taxa_name	taxa_rank	taxa_lineage	note
Cluster_1	MSPminer_MSP	msp_1	msp_1__core	msp_1__core	num_genes(742),num_unclassified_genes(102),num_hit_genes(563)	msp_1;msp_unknown	2	24	0.08333333333333333	0	msp_1__core	1	33038	Ruminococcus gnavus	Species	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__Blautia|s__Ruminococcus_gnavus	good
Cluster_10	MSPminer_MSP	msp_2	msp_2__core	msp_2__core	num_genes(686),num_unclassified_genes(107),num_hit_genes(522)	msp_2;msp_unknown	2	5	0.2	0.2	msp_2__accessory_module_067;msp_2__core	2	39491	[Eubacterium] rectale	Species	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale	good
Cluster_100	MSPminer_MSP	msp_2	msp_2__core	msp_2__core	num_genes(686),num_unclassified_genes(107),num_hit_genes(522)	msp_2;msp_unknown	2	2	0.5	0	msp_2__core	1	39491	[Eubacterium] rectale	Species	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale	good
Cluster_1000	MSPminer_MSP	msp_2	msp_2__core	msp_2__core	num_genes(686),num_unclassified_genes(107),num_hit_genes(522)	msp_2	1	1	1.0	0	msp_2__core	1	39491	[Eubacterium] rectale	Species	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale	good
Cluster_1001	MSPminer_MSP	msp_2	msp_2__accessory	msp_2__accessory_module_101	num_genes(686),num_unclassified_genes(107),num_hit_genes(522)	msp_2	1	1	0	1.0	msp_2__accessory_module_101	1	39491	[Eubacterium] rectale	Species	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale	good
...
  • MetaWIBELE uses MSPminer to construct Metagenomic Species Pan-genomes (MSPs) by binning co-abundant genes across samples (TSV format file).
  • For each protein family, the detail column lists the MSP that protein family has been assigned to. For example, protein family Cluster_1 belongs to the first MSP (msp_1).
demo_proteinfamilies_annotation.taxonomy.tsv
familyID	study	map_type	query_type	mutual_type	identity	query_coverage	mutual_coverage	detail	Tax	TaxID	Rep_Tax	Rep_TaxID	UniProtKB	unirefID	note	msp_name	msp_taxa_name	msp_taxa_id	MSP_Tax	MSP_TaxID	MSP_Rep_Tax	MSP_Rep_TaxID	taxa_id	taxa_name	taxa_rank	taxa_lineage
Cluster_1	demo	UniRef90_characterized	high_confidence	high_confidence	100.0	1.0	1.0	Translation initiation factor IF-1	root	1	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	411470	A7B522	UniRef90_A7B522	good	msp_1	Ruminococcus gnavus	33038	Lachnospiraceae	186803	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	411470	411470	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	Terminal	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__Blautia|s__Ruminococcus_gnavus|t__Ruminococcus_gnavus_(strain_ATCC_29149_/_VPI_C7-9)
Cluster_10	demo	UniRef90_uncharacterized	high_confidence	high_confidence	97.2	0.8294573643410853	0.8294573643410853	Uncharacterized protein	Clostridiales	186802	[Eubacterium] rectale	39491	A0A0M6W8Q5	UniRef90_A0A0M6W8Q5	good	msp_2	[Eubacterium] rectale	39491	[Eubacterium] rectale	39491	[Eubacterium] rectale	39491	39491	[Eubacterium] rectale	Species	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale
Cluster_100	demo	UniRef90_characterized	high_confidence	high_confidence	99.7	1.0	1.0	S-adenosylmethionine:tRNA ribosyltransferase-isomerase	Clostridiales	186802	[Eubacterium] rectale	39491	A0A0M6WLI9	UniRef90_A0A0M6WLI9	good	msp_2	[Eubacterium] rectale	39491	[Eubacterium] rectale	39491	[Eubacterium] rectale	39491	39491	[Eubacterium] rectale	Species	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale
Cluster_1000	demo	UniRef90_uncharacterized	high_confidence	high_confidence	99.5	1.0	1.0	Domain of uncharacterized function (DUF1836)	Clostridiales	186802	[Eubacterium] rectale	39491	A0A174GTW6	UniRef90_A0A174GTW6	good	msp_2	[Eubacterium] rectale	39491	[Eubacterium] rectale	39491	[Eubacterium] rectale	39491	39491	[Eubacterium] rectale	Species	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale
Cluster_1001	demo	UniRef90_characterized	high_confidence	high_confidence	97.7	1.0	1.0	DUF624 domain-containing protein	Clostridiales	186802	[Eubacterium] rectale	39491	A0A173U990	UniRef90_A0A173U990	good	msp_2	[Eubacterium] rectale	39491	[Eubacterium] rectale	39491	[Eubacterium] rectale	39491	39491	[Eubacterium] rectale	Species	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale
Cluster_1002	demo	UniRef90_characterized	high_confidence	high_confidence	93.5	1.0	0.9819004524886877	Proton-coupled thiamine transporter YuaJ	Clostridiales	186802	Ruminococcus gnavus	33038	A0A2N5PZR6	UniRef90_A0A2N5PZR6	good	msp_unknown	NA	NA	Ruminococcus gnavus	33038	Ruminococcus gnavus	33038	33038	Ruminococcus gnavus	Species	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__Blautia|s__Ruminococcus_gnavus
Cluster_1003	demo	UniRef90_characterized	high_confidence	high_confidence	98.6	1.0	1.0	HAD hydrolase, family IA, variant 1	Clostridiales	186802	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	411470	A7B2S9	UniRef90_A7B2S9	good	msp_1	Ruminococcus gnavus	33038	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	411470	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	411470	411470	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	Terminal	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__Blautia|s__Ruminococcus_gnavus|t__Ruminococcus_gnavus_(strain_ATCC_29149_/_VPI_C7-9)
Cluster_1004	demo	UniRef90_uncharacterized	high_confidence	high_confidence	99.5	1.0	0.96	YheO-like protein	Clostridiales	186802	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	411470	A7B3N1	UniRef90_A7B3N1	good	msp_1	Ruminococcus gnavus	33038	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	411470	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	411470	411470	Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9)	Terminal	k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__Blautia|s__Ruminococcus_gnavus|t__Ruminococcus_gnavus_(strain_ATCC_29149_/_VPI_C7-9)
...
  • In this step, MetaWIBELE uses Metagenomic Species Pan-genomes (MSPs) to assign taxonomy to protein families that were not able to be identified through step 3.3.2 global-homology based annotations (TSV format file).
  • For example, in row 8 protein family Cluster_1068 had only weak homology with UniRef90s at the sequence level. However, using co-abundance data it was assigned to msp_1, which we know from other protein family information corresponds to Ruminococcus gnavus.

  • MetaWIBELE uses a non-homology based approach to perform taxonomic classification. What would be some pros/cons of this approach in real data with different levels of characterization? (Hint: consider well-characterized microbiome communities vs. poorly-characterized microbiome communities)

4. Prioritizing metagenomic protein families

4.1 MetaWIBELE-prioritize input files

MetaWIBELE can perform prioritization to identify important proteins using the following outputs of the MetaWIBELE-characterize module:

4.2 Running MetaWIBELE: prioritization module

A canonical MetaWIBELE run will prioritize protein families using unsupervised and supervised approaches. By default, protein families are 1) quantitatively ranked based on ecological properties (abundance and prevalence) and/or associations with environmental parameters or phenotypes (q-value and effect size), 2) prioritized based on specified functions of interest via binary filtering.

From the working directory where you executed the MetaWIBELE-prioritize module (metawibele/ directory), execute the following:

metawibele prioritize --input-annotation output/characterization/finalized/demo_proteinfamilies_annotation.tsv \
--input-attribute output/characterization/finalized/demo_proteinfamilies_annotation.attribute.tsv \
--output output/prioritization/ --local-jobs 4

Note:

  • Make sure the global configuration file metawibele.cfg is in your working directory.
  • In the command replaces output/prioritization/ with the name of the directory to write the output files to.
  • See the section on parallelization options to optimize the workflow run based on your computing resources.
  • The workflow runs with the default settings used to run all modules. These settings will work for most data sets. If you need to modify the default settings, you can use command line parameters to customize the characterization workflow settings. For example, you can run with one or more bypass options (for information on bypass options, see the section Workflow bypass mode).

This command yields:

(May 09 14:24:50) [0/8 -   0.00%] **Ready    ** Task 5: quantify_prioritization__supervised
(May 09 14:24:50) [0/8 -   0.00%] **Started  ** Task 5: quantify_prioritization__supervised
(May 09 14:24:56) [1/8 -  12.50%] **Completed** Task 5: quantify_prioritization__supervised
(May 09 14:24:56) [1/8 -  12.50%] **Ready    ** Task 12: finalize_prioritization__supervised_rank
(May 09 14:24:56) [1/8 -  12.50%] **Started  ** Task 12: finalize_prioritization__supervised_rank
(May 09 14:24:57) [2/8 -  25.00%] **Completed** Task 12: finalize_prioritization__supervised_rank
(May 09 14:24:57) [2/8 -  25.00%] **Ready    ** Task 8: filter_prioritization
(May 09 14:24:57) [2/8 -  25.00%] **Started  ** Task 8: filter_prioritization
(May 09 14:25:00) [3/8 -  37.50%] **Completed** Task 8: filter_prioritization
(May 09 14:25:00) [3/8 -  37.50%] **Ready    ** Task 13: finalize_prioritization__selected_supervised_priority
(May 09 14:25:00) [3/8 -  37.50%] **Started  ** Task 13: finalize_prioritization__selected_supervised_priority
(May 09 14:25:01) [4/8 -  50.00%] **Completed** Task 13: finalize_prioritization__selected_supervised_priority
(May 09 14:25:01) [4/8 -  50.00%] **Ready    ** Task 0: quantify_prioritization__unsupervised
(May 09 14:25:01) [4/8 -  50.00%] **Started  ** Task 0: quantify_prioritization__unsupervised
(May 09 14:25:07) [5/8 -  62.50%] **Completed** Task 0: quantify_prioritization__unsupervised
(May 09 14:25:07) [5/8 -  62.50%] **Ready    ** Task 6: filter_prioritization
(May 09 14:25:07) [5/8 -  62.50%] **Started  ** Task 6: filter_prioritization
(May 09 14:25:11) [6/8 -  75.00%] **Completed** Task 6: filter_prioritization
(May 09 14:25:11) [6/8 -  75.00%] **Ready    ** Task 11: finalize_prioritization__selected_unsupervised_priority
(May 09 14:25:11) [6/8 -  75.00%] **Started  ** Task 11: finalize_prioritization__selected_unsupervised_priority
(May 09 14:25:11) [7/8 -  87.50%] **Completed** Task 11: finalize_prioritization__selected_unsupervised_priority
(May 09 14:25:11) [7/8 -  87.50%] **Ready    ** Task 9: finalize_prioritization__unsupervised_rank
(May 09 14:25:11) [7/8 -  87.50%] **Started  ** Task 9: finalize_prioritization__unsupervised_rank
(May 09 14:25:12) [8/8 - 100.00%] **Completed** Task 9: finalize_prioritization__unsupervised_rank
Run Finished

4.3 MetaWIBELE prioritization outputs

Three main output files will be created:

output/prioritization/demo_unsupervised_prioritization.rank.table.tsv
output/prioritization/demo_supervised_prioritization.rank.table.tsv
output/prioritization/demo_supervised_prioritization.rank.selected.table.tsv

The pre-generated demo files can be directly downloaded here:

demo_unsupervised_prioritization.rank.table.tsv

TID familyID    evidence    value   rank    description note 
1   Cluster_2   DNA_abundance   3976.7752878431365  0.9974747474747475  ranking based on single evidence    
2   Cluster_2   DNA_prevalence  0.9803921568627451  0.9987373737373737  ranking based on single evidence
3   Cluster_2   priority_score  0.998105661293517   0.998105661293517   meta ranking based on multiple evidences    
4   Cluster_1385    DNA_abundance   7579.266326862749   0.9993686868686869  ranking based on single evidence    5   Cluster_1385    DNA_prevalence  0.9607843137254902  0.9943181818181818  ranking based on single evidence
6   Cluster_1385    priority_score  0.9968370372502734  0.9968370372502734  meta ranking based on multiple evidences  
7   Cluster_4   DNA_abundance   3528.144675490197   0.9936868686868687  ranking based on single evidence    
8   Cluster_4   DNA_prevalence  0.9803921568627451  0.9987373737373737  ranking based on single evidence
9   Cluster_4   priority_score  0.9962057200650358  0.9962057200650358  meta ranking based on multiple evidences    
10  Cluster_1175    DNA_abundance   3682.0759901960787  0.9943181818181818  ranking based on single evidence
...
  • These are the results of unsupervised prioritization based on ecological properties (TSV format file).
  • Each protein family has a numeric priority score based on its meta ranking.
  • demo_unsupervised_prioritization.rank.tsv is the overall ranking for all protein families.

demo_supervised_prioritization.rank.table.tsv

TID familyID    evidence    value   rank    description note 
1   Cluster_459 DNA_within_phenotype_abundance  2006.0490676470586  0.9428387925497752  ranking based on single evidence    CD_vs_nonIBD
2   Cluster_459 DNA_within_phenotype_prevalence 1.0 1.0 ranking based on single evidence    CD_vs_nonIBD
3   Cluster_459 MaAsLin2_DA__mean_log   -3.558855750108834  0.892742453436095   ranking based on single evidence    CD_vs_nonIBD
4   Cluster_459 MaAsLin2_DA__qvalue 1.00258968598421e-05    0.9421221864951769  ranking based on single evidence    CD_vs_nonIBD
5   Cluster_459 priority_score  0.9429060853766776  0.9429060853766776  meta ranking based on multiple evidences    CD_vs_nonIBD
6   Cluster_1254    DNA_within_phenotype_abundance  1716.5679205882354  0.8464996788696211  ranking based on single evidence    CD_vs_nonIBD
7   Cluster_1254    DNA_within_phenotype_prevalence 1.0 1.0 ranking based on single evidence    CD_vs_nonIBD
8   Cluster_1254    MaAsLin2_DA__mean_log   -3.944110406138527  0.9743095696852923  ranking based on single evidence    CD_vs_nonIBD
9   Cluster_1254    MaAsLin2_DA__qvalue 7.56033011277545e-06    0.9546623794212219  ranking based on single evidence    CD_vs_nonIBD
10  Cluster_1254    priority_score  0.9400276626779672  0.9400276626779672  meta ranking based on multiple evidences    CD_vs_nonIBD
...
  • These are the results of supervised prioritization combining ecological properties and environmental/phenotypic properties (TSV format file).
  • Each protein family has a numeric priority score based on its meta ranking.
  • demo_supervised_prioritization.rank.tsv is the overall ranking for all protein families.

demo_supervised_prioritization.rank.selected.table.tsv

TID familyID    evidence    value   rank    description note 
1   Cluster_459 DNA_within_phenotype_abundance  2006.0490676470586  0.9428387925497752  ranking based on single evidence    CD_vs_nonIBD
2   Cluster_459 DNA_within_phenotype_prevalence 1.0 1.0 ranking based on single evidence    CD_vs_nonIBD
3   Cluster_459 MaAsLin2_DA__mean_log   -3.558855750108834  0.892742453436095   ranking based on single evidence    CD_vs_nonIBD
4   Cluster_459 MaAsLin2_DA__qvalue 1.00258968598421e-05    0.9421221864951769  ranking based on single evidence    CD_vs_nonIBD
5   Cluster_459 priority_score  0.9429060853766776  0.9429060853766776  meta ranking based on multiple evidences    CD_vs_nonIBD
6   Cluster_1254    DNA_within_phenotype_abundance  1716.5679205882354  0.8464996788696211  ranking based on single evidence    CD_vs_nonIBD
7   Cluster_1254    DNA_within_phenotype_prevalence 1.0 1.0 ranking based on single evidence    CD_vs_nonIBD
8   Cluster_1254    MaAsLin2_DA__mean_log   -3.944110406138527  0.9743095696852923  ranking based on single evidence    CD_vs_nonIBD
9   Cluster_1254    MaAsLin2_DA__qvalue 7.56033011277545e-06    0.9546623794212219  ranking based on single evidence    CD_vs_nonIBD
10  Cluster_1254    priority_score  0.9400276626779672  0.9400276626779672  meta ranking based on multiple evidences    CD_vs_nonIBD
  • This file is the result of binary filtering of the protein families based on biochemical annotations (TSV format file).
  • The settings require that each of the prioritized protein families should 1) be predicted as signaling peptides, and 2) have at least one of the following features: extracellular, cellWall, outerMembrane, transmembrane, Pfam domains, domain-domain interaction annotations.

  • Under what circumstances would skipping supervised prioritization be reasonable? Under what circumstances would skipping supervised prioritization be inadvisable?

4.4 Visualization of highly prioritized protein families

Unsupervised hierarchical clustering of the 30 protein families most prioritized in IBD which are potentially secreted proteins with signaling functions:

MetaWIBELE_demo_heatmap.png


5. Advanced MetaWIBELE topics

5.1 MetaWIBELE supervised prioritization with binary filtering

By default, MetaWIBELE will perform binary filtering using the prioritization configuration file installed in the package. Optionally, you can also make your own configuration files and provide them as optional arguments to MetaWIBELE. For example, the local prioritization configuration file can be provided with --prioritization-config $PRIORITIZE_CONF where $PRIORITIZE_CONF is the file with prioritization configurations.

5.1.1 Select subset of interest annotated with multiple specific biochemical annotations simultaneously

Setting $PRIORITIZE_CONF as following:

```
##Binary filtering for selection subset
[filtering]
# Filter for interested functional vignettes type [Choices: pilin | secreted_system | other user defined | none]
vignettes = none

# significant associations for filtering: [required] required item, [optional] optional item, [none] ignoring
MaAsLin2_DA-sig = required

# biochemical annotation for filtering: [required] required item, [optional] optional item, [none] ignoring
ExpAtlas_interaction = required
DOMINE_interaction = none
SIFTS_interaction = none
Denovo_signaling = required
Denovo_transmembrane = optional
PSORTb_extracellular = optional
PSORTb_cellWall = optional
PSORTb_outerMembrane = optional
UniRef90_extracellular = optional
UniRef90_signaling = none
UniRef90_transmembrane = optional
UniRef90_cellWall = optional
UniRef90_outerMembrane = optional
UniRef90_PfamDomain = none
InterProScan_PfamDomain = none
InterProScan_SUPERFAMILY = none
InterProScan_ProSiteProfiles = none 
InterProScan_ProSitePatterns = none
InterProScan_Gene3D = none
InterProScan_PANTHER = none
InterProScan_TIGRFAM = none
InterProScan_SFLD = none
InterProScan_ProDom = none
InterProScan_Hamap = none
InterProScan_SMART = none
InterProScan_CDD = none
InterProScan_PRINTS = none
InterProScan_PIRSF = none
InterProScan_MobiDBLite = none
```
  • Re-run prioritization workflow for filtering:
  • $ metawibele prioritize --prioritization-config $PRIORITIZE_CONF --bypass-mandatory --input-annotation output/characterization/demo_proteinfamilies_annotation.tsv --input-attribute output/characterization/demo_proteinfamilies_annotation.attribute.tsv --output output/prioritization/ --selected-output demo_prioritized.selected.tsv
  • Output file name: $OUTPUT_DIR/prioritization demo_prioritized.selected.tsv
  • This file is the results of binary filtering of the protein families based on biochemical annotations.
  • These settings require that each of prioritized protein family should 1) significantly associated with the main phenotype, 2) be annotated to domain-domain interaction with the host, 3) predicted as signal peptides, and 4) have at least one of the following features: extracellular, cellWall, outerMembrane, transmembrane

5.1.2 Select subset of interest based on specific functions

Setting $PRIORITIZE_CONF as following:

```
[filtering]
# Filter for interested functional vignettes type [Choices: pilin | secreted_system | other user defined | none]
vignettes = pilin

# Filter for significant associations: [required] required item, [optional] optional item, [none] ignoring [ Default: required ]
MaAsLin2_DA-sig = none

# biochemical annotation for filtering: [required] required item, [optional] optional item, [none] ignoring
ExpAtlas_interaction = optional
DOMINE_interaction = optional
SIFTS_interaction = optional
Denovo_signaling = optional
Denovo_transmembrane = optional
PSORTb_extracellular = optional
PSORTb_cellWall = optional
PSORTb_outerMembrane = optional
UniRef90_extracellular = optional
UniRef90_signaling = optional
UniRef90_transmembrane = optional
UniRef90_cellWall = optional
UniRef90_outerMembrane = optional
UniRef90_PfamDomain = optional
InterProScan_PfamDomain = optional
InterProScan_SUPERFAMILY = optional
InterProScan_ProSiteProfiles = optional 
InterProScan_ProSitePatterns = optional
InterProScan_Gene3D = optional
InterProScan_PANTHER = optional
InterProScan_TIGRFAM = optional
InterProScan_SFLD = optional
InterProScan_ProDom = optional
InterProScan_Hamap = optional
InterProScan_SMART = optional
InterProScan_CDD = optional
InterProScan_PRINTS = optional
InterProScan_PIRSF = optional
InterProScan_MobiDBLite = optional
```
  • Re-run prioritization workflow for filtering:
  • $ metawibele prioritize --prioritization-config $PRIORITIZE_CONF --bypass-mandatory --vignette-config my_vignette_function_file --input-annotation output/characterization/demo_proteinfamilies_annotation.tsv --input-attribute output/characterization/demo_proteinfamilies_annotation.attribute.tsv --output output/prioritization/ --selected-output demo_prioritized_pilin.tsv
  • Provide your own vignette function file for filtering specific functions.
  • Output file name: $OUTPUT_DIR/prioritization/demo_prioritized_pilin.tsv
  • This file is the result of binary filtering of the protein families based on pilin related functions (TSV format file).

  • How would you customize your local configuration file for prioritization to select the subset of protein families annotated with domain-domain interactions?
  • How would you customize your local configuration for prioritization to select the subset of protein families which are 1) significantly associated with the phenotype or environmental parameter, 2) annotated to domain-domain interactions, and 3) predicted as signal peptides?

5.2 Building dependent databases

5.2.1: create local UniRef databases

You can also create these databases locally using MetaWIBELE utility scripts based on the latest release version of UniProt/UniRef, and provide $UNIREF_LOCATION as the location to install the database.

  • Download and obtain UniProt annotations:

    $ metawibele_prepare_uniprot_annotation --output $UNIREF_LOCATION

  • Download UniRef sequences and obtain annotations:

    $ metawibele_prepare_uniref_annotation -t uniref90 --output $UNIREF_LOCATION

  • Use diamond to index sequences

    diamond makedb --in uniref90.fasta -d uniref90.fasta

5.2.2: create local domain databases

Protein domain databases are required if you will use MetaWIBELE to perform domain-based annotation. By default, these databases have already been installed when you install the MetaWIBELE package. Alternatively, you can also create these domain databases locally and provide $DOMAIN_LOCATION as the location to install the database.

  • Download and obtain dependent protein domain information:

    $ metawibele_prepare_domain_databases -t Pfam33.0 --output $DOMAIN_LOCATION

    • Pfam domains from the specified version of Pfam database will be downloaded.
    • PDB information from the latest version of SIFT database will be downloaded.
    • Domain-domain interactions from version 2.0 of DOMINE database will be downloaded.