-
Notifications
You must be signed in to change notification settings - Fork 75
metawibele
MetaWIBELE (Workflow to Identify novel Bioactive Elements in the microbiome) is a workflow to efficiently and systematically identify and prioritize potentially bioactive (and often uncharacterized) gene products in microbial communities. It prioritizes candidate gene products from assembled metagenomes using a combination of sequence homology, secondary-structure-based functional annotations, phylogenetic binning, ecological distribution, and association with environmental parameters or phenotypes to target candidate bioactives.
For more information about the method and its associated publication, visit our summary page and read the MetaWIBELE User Manual.
Support for MetaWIBELE is available via the MetaWIBELE channel of the bioBakery Support Forum.
- 1. Setup
- 2. MetaWIBELE utility: preprocessing metagenomic sequencing reads
- 3. Characterizing assembled metagenomic genes
- 4. Prioritizing metagenomic protein families
- 5. Advanced MetaWIBELE topics
- Python (version >= 3.6, requiring numpy, pandas, scipy packages; tested 3.6, 3.7)
- AnADAMA2 (version >= 0.7.4; tested 0.7.4, 0.8.0)
- CD-hit (version >= 4.7; tested 4.7)
- Diamond (version >= 0.9.24; tested 0.9.24)
- MSPminer (version >= 1.0.0; licensed software; tested version 1.0.0)
- MaAsLin2 (version >= 1.5.1; tested 1.5.1)
- Interproscan (version >= 5.31-70) (installing with activating Phobius/SignalP/TMHMM analyses; InterProScan 5.51-85.0 or later are recommended for potential simpler installation; used for domain/motif annotation; tested 5.31-70, 5.51-85.0)
- Signalp (version >= 4.1; licensed software; installing with integrating in interproscan; used for domain/motif annotation; tested 4.1)
- TMHMM (version >= 2.0; licensed software; installing with integrating in interproscan; used for domain/motif annotation; tested 2.0)
- Phobius (version >= 1.01; licensed software; installing with integrating in interproscan; used for domain/motif annotation; tested 1.01)
- PSORTb (version >= 3.0) (licensed software; used for domain/motif annotation; tested 3.0)
-
Optional: only required if using MetaWIBELE utilities to prepare inputs for MetaWIBELE using metagenomic sequencing reads
- MEGAHIT (version >= 1.1.3; tested 1.1.3)
- Prokka (version >= 1.14-dev; recommend to not set '-c' parameter when running prodigal with metagenome mode; tested 1.14-dev)
- Prodigal (version >= 2.6; tested 2.6)
- USEARCH (version >= 9.0.2132; licensed software; tested 9.0.2132)
- Bowtie2 (version >= 2.3.2; tested 2.3.2)
- SAMtools (version >= 1.9; tested 1.9)
- featureCounts (version >= 1.6.2; tested 1.6.2)
Note: Please install the required software in a location in your $PATH
. If you always run with gene families (non-redundant gene catalogs), the optional softwares are not required. Also if you always run with one or more bypass options (for information on bypass options, see optional arguments to the section Workflow by bypass mode).The software required for the steps you bypass does not need to be installed.
Note: If you are using a bioBakery machine image (e.g. in Google Cloud) you do not need to install MetaWIBELE because the tool and its dependencies are already installed.
You can choose one of the following options to install the MetaWIBELE package.
Option 1: Installing with docker
$ docker pull biobakery/metawibele
- This docker image includes most of the dependent software packages.
- Large software packages and those with licenses are NOT included in this image and needed to be installed additionally:
- Users should review the license terms and install these packages manually.
- Softwares with the license : MSPminer, Signalp, TMHMM, Phobius, PSORTb
- Softwares with large size: Interproscan (Note: We recommend installing InterProScan 5.51-85.0 (requiring at least Java 11) or later for potential simpler installation, and active Phobius/SignalP/TMHMM analyses by customizing your interproscan.properties configuration, see more details from InterProScan document).
Option 2: Installing with pip
$ pip install metawibele
- If you do not have write permissions to
/usr/lib/
, then add the option --user to the install command. This will install the python package into subdirectories of~/.local/
. Please note when using the --user install option on some platforms, you might need to add~/.local/bin/
to your $PATH as it might not be included by default. You will know if it needs to be added if you see the following messagemetawibele: command not found
when trying to run MetaWIBELE after installing with the --user option.
Option 3: Installing with conda
$ conda install -c biobakery metawibele
Note: If you are using a bioBakery machine image (e.g. in Google Cloud) you do not need to install dependent databases that are already installed
If not installed in the bioBakery machine image, for the purposes of this tutorial alone, please download demo databases to reduce run time:
- Download the annotation files and uncompress into
demo_uniref_database
folder: demo_uniref_database.tar.gz - Alternatively, run the following command to download the annotation files into
demo_uniref_database
:$ metawibele_download_database --database DEMO --build demo_uniref --install-location $UNIREF_LOCATION
UniRef database is required if you will use MetaWIBELE to do global-homology based annotation and taxonomic annotation. We have built the database based on UniProt/UniRef90 2019_01 sequences and annotations. You can use any one of the following options to install the database with providing $UNIREF_LOCATION
as the location for installation.
NOTE: Please point to this location for the default uniref database in the global config file (metawibele.cfg
, see details in the section 1.2.3 Prepare configuration files). Alternatively, you can either set the location with the environment variable $UNIREF_LOCATION
, or move the downloaded database to the folder named uniref_database
in the current working directory.
Option 1: Download and uncompress them into location $UNIREF_LOCATION
- UniRef90 sequences indexed by Diamond v0.9.24 (20.0GB): uniref90.fasta.dmnd.tar.gz
- UniRef90 annotation files (2.8 GB): uniref90_annotations.tar.gz
Option 2: Run the following commands to install the databases into the location $UNIREF_LOCATION
- download the UniRef90 sequences indexed by Diamond v0.9.24 into
$UNIREF_LOCATION
:$ metawibele_download_database --database uniref --build uniref90_diamond --install-location $UNIREF_LOCATION
- download the UniRef90 annotation files into
$UNIREF_LOCATION
:$ metawibele_download_database --database uniref --build uniref90_annotation --install-location $UNIREF_LOCATION
To check out the install of MetaWIBELE packages and all dependencies (tools and databases), run the command:
-
$ metawibele_check_install
Which yields (highly abbreviated):
#----------------- Checking the install of metawibele packages -----------------# ###### Checking anadama2 package install ###### Checking anadama2 install...........OK ###### Checking metawibele package install ###### Checking metawibele install...........OK #----------------- Checking the install of required dependent tools -----------------# ###### Checking CD-hit install ###### ... #----------------- Checking the install of optional dependent tools -----------------# ###### Checking MEGAHIT install ###### MEGAHIT v1.1.3 Checking MEGAHIT install...........OK ...
-
By default, it will check both MetaWIBEKE packages and all dependencies.
-
Alternatively, add the option "--types {metawibele,required,optional,all}" to check the install of specific package or dependency.
To run MetaWIBELE, you are required to customize the global configuration file metawibele.cfg
and make sure that it's in your current working directory.
NOTE: De default, MetaWIBELE will use the global configurations from metawibele.cfg
in the current working directory. Alternatively you can always provide the location of the global configuration file you would like to use with the "--global-config " option to metawibele (see more in the section How to run).
-
Download global configuration template
You can use one of the following options to get the configuration template.
- Option 1: obtain copies by right-clicking the link and selecting "save link as": metawibele.cfg
- Option 2: run this command to download global configuration file:
$ metawibele_download_config --config-type global
-
Customize the global configuration
The path of the dependent databases is required for customization. Most of other sections can be left as defaults.
-
Customize the path of dependent databases (required):
[database] # The absolute path of UniRef databases folder. uniref_db = # The domain databases used by MetaWIBELE. [data_path] provide the absolute path of the domain databases folder; [none] use the dedault domain databases installed in the metawibele package. [ Default: none ] domain_db = none
-
Customize basic information for outputs (e.g. prefix name of output files, etc.) (optional):
[basic] # Study name. [ Default: MGX ] study = MGX # The prefix name for output results. [ Default: metawibele ] basename = metawibele
-
Customize applied computational resources (e.g. CPU cores, memory, etc.) (optional):
[computation] # The number of cores that you are requesting. [ Default: 1 ] threads = 1 # The amount of memory (in MB) that you will be using for your job. [ Default: 20000 ] memory = 20000 # The amount of time (in minute) that you will be using for your job. [ Default: 60 ] time = 60
-
Customize parameter settings for abundance-based and domain/motif-based annotations (optional):
[abundance] # The method for normalization [Choices: cpm, relab]. [cpm] copies per million units (sum to 1 million); [relab] relative abundance (sum to 1). [ Default: cpm ] normalize = cpm # The minimum abundance for each feature [ Default: 0 ] abundance_detection_level = 0 [msp] # The absolute path of the config file used by MSPminer. [config_file] provide the mspminer config file; [none] use the dedault config files installed in the metawibele package. [ Default: none ] mspminer = none # The minimum fraction of taxonomy classified genes in each MSP [0-1]. [Default: 0.10] tshld_classified = 0.10 # The minimum percent differences between the most and second dominant taxon for each MSP [0-1]. [Default: 0.50] tshld_diff = 0.50 [interproscan] # Interproscan executable file, e.g. /my/path/interproscan/interproscan.sh [ Default: interproscan.sh ] interproscan_cmmd = interproscan.sh # The appls used by interroiscan: [appls] comma separated list of analyses, [ Choices: CDD,COILS,Gene3D,HAMAP,MobiDBLite,PANTHER,Pfam,PIRSF,PRINTS,ProDom,PROSITEPATTERNS,PROSITEPROFILES,SFLD,SMART,SUPERFAMILY,TIGRFAM,Phobius,SignalP,TMHMM ]; [all] use all all analyses for running. [ Default: Pfam ] interproscan_appl = Pfam # The number of spliting files which can be annotated in parallel. [ Default: 1 ] split_number = 1
-
Customize parameter settings for association with environmental/host phenotypes (optional; but if you plan to run MetaWIBELE for supervised prioritization you are required to specify your settings for MaAsLin2 and specify the main phenotype metadata used for prioritization):
[maaslin2] # The absolute path of Maaslin2 executable file, e.g. /my/path/Maaslin2/R/Maaslin2.R [ Default: Maaslin2.R ] maaslin2_cmmd = Maaslin2.R # The minimum abundance for each feature. [ Default: 0 ] min_abundance = 0 # The minimum percent of samples for which a feature is detected at minimum abundance. [ Default: 0.1 ] min_prevalence = 0.1 # Keep features with variance greater than. [Default: 0.0] min_variance = 0 # The q-value threshold for significance. [ Default: 0.25 ] max_significance = 0.25 # The normalization method to apply [ Choices: TSS, CLR, CSS, NONE, TMM ]. [ Default: TSS ] normalization = NONE # The transform to apply [ Choices: LOG, LOGIT, AST, NONE ]. [ Default: LOG ] transform = LOG # The analysis method to apply [ Choices: LM, CPLM, ZICP, NEGBIN, ZINB ]. [ Default: LM ] analysis_method = LM # The fixed effects for the model, comma-delimited for multiple effects. [ Default: all ] fixed_effects = all # The random effects for the model, comma-delimited for multiple effects. [ Default: none ] random_effects = none # The correction method for computing the q-value. [ Default: BH ] correction = BH # Apply z-score so continuous metadata are on the same scale. [ Default: TRUE ] standardize = TRUE # Generate a heatmap for the significant associations. [ Default: FALSE ] plot_heatmap = FALSE # In heatmap, plot top N features with significant associations. [ Default: FALSE ] heatmap_first_n = FALSE # Generate scatter plots for the significant associations. [ Default: FALSE ] plot_scatter = FALSE # The number of R processes to run in parallel. [ Default: 1 ] maaslin2_cores = 1 # The factor to use as a reference for a variable with more than two levels provided as a string of 'variable,reference' semi-colon delimited for multiple variables. NOTE: A space between the variable and reference will not error but will cause an inaccurate result. [ Default: NA ] reference = NA # The minimum percent of case-control samples used for comparision in which a feature is detected. [ Default: 0.1 ] tshld_prevalence = 0.10 # The q-value threshold for significance used as DA annotations. [ Default: 0.05 ] tshld_qvalue = 0.05 # The statistic used as effect size [ Choices: coef, mean(log) ]. [coef] represents the coefficient from the model; [mean(log)] represents the difference of mean values between case and control conditions. [ Default: mean(log) ] effect_size = mean(log) # The main phenotype metadata used for prioritization, e.g. metadata1. [ Default: none ]: skip the association with environmental/phenotypic parameters phenotype = none
-
MetaWIBELE provides utilities to build gene families starting from quality-controlled shotgun sequencing reads.
This tutorial will use demo input files distributed with the MetaWIBELE installation under the examples/raw_reads/
directory. If you installed with pip
or conda
and don't have these handy, you can obtain copies by right-clicking these links and selecting "save link as":
These two demo samples were taken from the PRISM (Prospective Registry in IBD Study at MGH) cohort. One sample is from a subject with Crohn’s disease (CD) and the other is from a non-IBD control.
The preprocessing module is a utility in the MetaWIBELE package for preprocessing metagenomic reads. It performs the following steps: (i) metagenomic assembly, (ii) open reading frame prediction, (iii) gene family (i.e. non-redundant gene catalogs) construction, and (iv) gene abundance estimation.
If you installed from conda / pip, download them here and save them to metawibele/raw_reads
directory. If you installed from source, copy them from the metawibele source package to metawibele/raw_reads
.
In the VM, navigate to the correct directory with:
cd ~/Tutorials/metawibele
Use the fastq.gz
files from metawibele/raw_reads
to run the command:
metawibele preprocess --input raw_reads/ --output output/preprocess/ \
--output-basename demo --extension-paired "_R1.fastq.gz,_R2.fastq.gz" \
--extension ".fastq.gz" --local-jobs 4
On the terminal, you will notice tasks being initiated and completed.
(Jun 10 15:33:56) [ 2/28 - 7.14%] **Started ** Task 7: ln__sample2.contigs.fa
(Jun 10 15:33:57) [ 4/28 - 14.29%] **Started ** Task 9: format_contig_table
(Jun 10 15:33:58) [16/28 - 57.14%] **Started ** Task 26: extract_protein_coding_genes
(Jun 10 15:33:59) [20/28 - 71.43%] **Started ** Task 33: extract_cluster_CD-hit
(Jun 10 15:34:04) [24/28 - 85.71%] **Started ** Task 43: ln__sample1.sort.bed
(Jun 10 15:33:36) [ 0/28 - 0.00%] **Started ** Task 0: sample2__megahit
(Jun 10 15:33:57) [ 3/28 - 10.71%] **Started ** Task 8: ln__sample1.contigs.fa
...
Note:
- The input directory should contain quality-controlled shotgun sequencing metagenome files (fastq, fastq.gz, fasta, or fasta.gz format)
- See the section on parallelization options to optimize the workflow run based on your computing resources.
- The preprocessing workflow will run with the default settings for all subtasks. You can change the command line parameters to customize your settings for the preprocessing workflow.
- For example, if your samples have a different file type extension or different paired-end labels you can modify the default settings when running the assembly task using the command line parameters
--extension
and--extension-paired
. Using the text that follows after$SAMPLE
in the input file name specify:--extension-paired "$R1_suffix,$R2_suffix"
,--extension "$fastq_suffix"
- For example, if your samples have a different file type extension or different paired-end labels you can modify the default settings when running the assembly task using the command line parameters
- To write the log and err files in the working directory, add
>preprocess.log 2>preprocess.err
after the command above.
Main output files
The following are the two main output files of the preprocessing utility that are used for MetaWIBELE:
output/preprocess/finalized/demo_genecatalogs.centroid.faa
output/preprocess/finalized/demo_genecatalogs_counts.all.tsv
The pre-generated demo files can be directly downloaded here:
To view this file, enter:
less -S output/preprocess/finalized/demo_genecatalogs.centroid.faa
which yields:
>sample1_10-1
MFQDKYVFAQLASFLNRSKFNRIVAKYDGDKYVKHFTCWNQLLALMFGQLSNRESLRDLIVALEAHHSKCYHLGMGKNVS
KSSLARANQDRDYHIFEEYAYYLVSEAREKRVTHIFKLGGNVYAFDSTTIDLCLSVFWWAKFRKKKGGIKVHTLYDIETQ
IPAFFHITEASVHDSKVMNEIPYEPGSYYIFDRGYNNFKMLYKIHQIEAYFVVRAKKNLQYKSIKWKRRLPKNVFSDASI
LLTGFYPKQYYPEPLRLVKYWDEEQEREFTFITNAMHISALQVAELYKNRWQVELFFKWLKQHLKIKRFWGTTENAVRIQ
IYAAICTYCLVAIIQHDMQLDRSTYEVLQILSISLTDKTHLRDLFDKTKFQNDKERFGPNGPSLFN
- This file provides the amino acid sequences for each gene identified as the representative of its gene family (Fasta format file).
- Each gene is given a MetaWIBELE-specific ID (i.e. sample1_10-1) based on the sample and order in which it was identified.
To view this file, enter:
less -S output/preprocess/finalized/demo_genecatalogs_counts.all.tsv
which yields:
ID sample1 sample2
sample1_10-1 67 5
- This file provides the counts of the number of reads mapped to each gene family (rows) in each sample (columns) (TSV format file).
Intermediate files
-
Assembly results:
-
output/preprocess/finalized/demo_contig_sequence.fasta
: contig sequences (Fasta format file). - The assembly outputs for each sample are in the
output/preprocess/assembly/
folder.
-
-
Gene-calling results:
-
output/preprocess/finalized/demo_gene_info.tsv
: all gene calling information (TSV format file). -
output/preprocess/finalized/demo_combined_gene_protein_coding.sorted.fna
: nucleotide sequences for all ORFs sorted by gene length (Fasta format file). -
output/preprocess/finalized/demo_combined_protein.sorted.faa
: protein sequences for all ORFs sorted by gene length (Fasta format file). -
output/preprocess/finalized/demo_combined_gene_protein_coding.complete.sorted.fna
: nucleotide sequences for all complete ORFs sorted by gene length (Fasta format file). -
output/preprocess/finalized/demo_combined_protein.complete.sorted.faa
: protein sequences for all complete ORFs sorted by gene length (Fasta format file). - The gene-calling outputs from prodigal are in the
output/preprocess/gene_calls/
folder. - The gene-annotation outputs from prokka are in the
output/preprocess/gene_annotation/
folder.
-
-
Mapping results:
- All mapping outputs for each sample are in the
output/preprocess/mapping/
folder.
- All mapping outputs for each sample are in the
-
Gene families:
-
output/preprocess/finalized/demo_genecatalogs.clstr
: clustering information for non-redudant gene families (Fasta format file). -
output/preprocess/finalized/demo_genecatalogs.centroid.fna
: nucleotide sequences of gene family representatives (Fasta format file).
-
- Single-sample assembly allows us to estimate the genomes per metagenomic sample using countable computational resources. What might be some pros/cons of this approach in real data? How about co-assembly by assembling metagenomes across samples?
The MetaWIBELE-characterize module is run using gene families with protein sequence and read count information:
- protein sequences for non-redundant gene families (Fasta format file), e.g. demo_genecatalogs.centroid.faa
- reads counts table for non-redundant gene families (TSV format file), e.g. demo_genecatalogs_counts.all.tsv
- metadata file (TSV format file), e.g. demo_mgx_metadata.tsv
- the global configuration file in the current working directory, e.g. metawibele.cfg
This tutorial will use the demo input files distributed with the MetaWIBELE installation under the examples/input/
directory. If you installed with pip
or conda
and don't have these handy, you can obtain copies by right-clicking these links and selecting "save link as":
Let's see the contents of these files.
column -t -s $'\t' input/demo_mgx_metadata.tsv | less -S
which yields:
SID study sex consent_age antibiotic mesalamine steroids immunosuppressant diagnosis
PRISM_7122 PRISM Female 38 No No No Yes CD
PRISM_7147 PRISM Male 50 No Yes No No CD
PRISM_7150 PRISM Male 41 No No No Yes CD
PRISM_7153 PRISM Male 51 No Yes No No CD
The first column is the sample ID of a stool donor in the PRISM cohort which contains Crohn's disease (CD) patients and healthy controls (nonIBD). Other relevant metadata is also listed for these individuals. There are 102 samples (68 CD). In this tutorial, we are interested in prioritizing gene families associated with CD. Some of the relevant lines in the metawibele.cfg
file (under [maaslin2] options) are:
# The factor to use as a reference for a variable with more than two levels provided as a string of 'variable,reference' semi-colon delimited for multiple variables. NOTE: A space between the variable and reference will not error but will cause an inaccurate result. [ Default: NA ]
reference = diagnosis,nonIBD
# The statistic used as effect size [ Choices: coef, mean(log) ]. [coef] represents the coefficient from the model; [mean(log)] represents the difference of mean values between case and control conditions. [ Default: mean(log) ]
effect_size = mean(log)
# The main phenotype metadata used for prioritization, e.g. metadata1. [ Default: none ]: skip the association with environmental/phenoty
phenotype = diagnosis
To view this file, type:
less -S input/demo_genecatalogs.centroid.faa
Let's count the number of genes (centroids):
grep ">" input/demo_genecatalogs.centroid.faa | wc -l
which yields 604
gene families.
To view this file, type:
less -S input/demo_genecatalogs_counts.all.tsv
which yields:
ID PRISM_7122 PRISM_7147 PRISM_7150 PRISM_7153 PRISM_7184 PRISM_7238 PRISM_7406 PRISM_7408
PRISM_7122_03875 197 16 2 15 87 0 0 80 92 0 33 0 1 90
PRISM_7122_38863 16 3 2 8 2 34 905 62 12 38 5 6 0 0
PRISM_7147_05905 0 1934 2 11 0 0 0 16 129 1 18 0 84 83
PRISM_7147_105078 0 2304 132 506 0 8 3 123 718 1 952 1 646 240
PRISM_7147_276043 0 2611 126 944 4 32 2 183 1108 3 1740 5 1037 352
A canonical MetaWIBELE run uses gene families (non-redundant gene catalogs) to build protein families and annotate them functionally and taxonomically. By default, protein families are annotated using global-homology-based, domain/motif-based, and abundance-based approaches. From the examples
folder (or the location where you saved the gene families and sample metadata files above) execute the following:
If you're on the VM, use the files from metawibele/input/
. If you installed with pip
or conda
, download them here and save them to the metawibele/raw_reads
directory. If you installed from source, copy them from the metawibele source package to the metawibele/input/
directory. To run the following command, change to the metawibele/
directory and run:
metawibele characterize --input-sequence input/demo_genecatalogs.centroid.faa \
--input-count input/demo_genecatalogs_counts.all.tsv \
--input-metadata input/demo_mgx_metadata.tsv \
--output output/characterization/ --local-jobs 4 --bypass-psortb \
>characterize.log 2>characterize.err &
Note:
- Make sure the global configuration file
metawibele.cfg
is in your working directory (i.e.metawibele/
directory). - In the command replaces
output/characterization/
with the path to the folder where you want to write the output files. - See the section on parallelization options to optimize the workflow run based on your computing resources.
- The workflow runs with the default settings used to run all modules. These settings will work for most data sets. If you need to modify the default settings, you can use command line parameters to customize the characterization workflow settings. For example, you can run with one or more bypass options (for information on bypass options, see the section Workflow bypass mode).
Note the characterization phases of MetaWIBELE-characterize in the above log: 1) building protein families, 2) mapping proteins to UniRef (global-homology based annotation), 3) characterizing protein domains and motifs for protein families (domain-motif based annotation), 4) profiling taxonomy by binning co-abundant gene families and propagating the taxa to unclassified genes, 5) building an abundance table for protein families and associating it with environmental parameters or phenotypes.
- How would you customize the workflow settings to skip the domain-motif based annotation step in order to reduce computational consumption?
MetaWIBELE-characterize creates four main output subfolders that contain the main outputs of each step:
output/characterization/abundance_annotation/
output/characterization/clustering/
output/characterization/domain_motif_annotation/
output/characterization/finalized/
output/characterization/global_homology_annotation/
These are the main finalized characterization output files:
output/characterization/finalized/demo_proteinfamilies_annotation.tsv
output/characterization/finalized/demo_proteinfamilies_annotation.attribute.tsv
output/characterization/finalized/demo_proteinfamilies_annotation.taxonomy.tsv
output/characterization/finalized/demo_proteinfamilies_nrm.tsv
The pre-generated demo files can be directly downloaded here:
- demo_proteinfamilies_annotation.tsv
- demo_proteinfamilies_annotation.attribute.tsv
- demo_proteinfamilies_annotation.taxonomy.tsv
- demo_proteinfamilies_nrm.tsv
Let's inspect the contents of the finalized annotation files (in the sub-folder finalized
) and the intermediate subfolders for each annotation step:
To view this file, type:
less -S output/characterization/finalized/demo_proteinfamilies_annotation.tsv
which yields:
familyID annotation feature category method AID
Cluster_1 good quality note Quality_control NA
Cluster_1 demo study project Shotgun NA
Cluster_1 Cluster_1 protein_family Denovo_clustering CD-hit Cluster_1__Denovo_clustering
Cluster_1 UniRef90_A7B522 strong_homology UniRef90_homology UniRef90 Cluster_1__UniRef90_homology
Cluster_1 Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9) Terminal Taxonomy_characterization Taxonomy_annotation Cluster_1__Taxonomy_characterization
Cluster_1 23863.237926470596 DNA-CD_abundance Denovo_characterization DNA Cluster_1__DNA-CD_abundance
Cluster_1 0.9264705882352942 DNA-CD_prevalence Denovo_characterization DNA Cluster_1__DNA-CD_prevalence
Cluster_1 2429.2891764705887 DNA-nonIBD_abundance Denovo_characterization DNA Cluster_1__DNA-nonIBD_abundance
Cluster_1 1.0 DNA-nonIBD_prevalence Denovo_characterization DNA Cluster_1__DNA-nonIBD_prevalence
Cluster_1 23863.237926470596 DNA-within-phenotype_abundance Denovo_characterization DNA Cluster_1__DNA-within-phenotype_abundance
Cluster_1 1.0 DNA-within-phenotype_prevalence Denovo_characterization DNA Cluster_1__DNA-within-phenotype_prevalence
Cluster_1 16718.588343137264 DNA_abundance Denovo_characterization DNA Cluster_1__DNA_abundance
Cluster_1 0.9509803921568627 DNA_prevalence Denovo_characterization DNA Cluster_1__DNA_prevalence
Cluster_1 PF01176:PF00164;PF01176:PF11987 DOMINE_interaction Denovo_characterization DOMINE Cluster_1__DOMINE_interaction
Cluster_1 PF01176:PF00164;PF01176:PF11987 DOMINE_interaction-human Denovo_characterization DOMINE Cluster_1__DOMINE_interaction-human
Cluster_1 PF01176:PF00164;PF01176:PF11987 ExpAtlas_interaction Denovo_characterization ExpAtlas Cluster_1__ExpAtlas_interaction
Cluster_1 PF01176 InterProScan_PfamDomain Denovo_characterization InterProScan Cluster_1__InterProScan_PfamDomain
Cluster_1 msp_unknown MSPminer_MSP Denovo_characterization MSPminer Cluster_1__MSPminer_MSP
Cluster_1 up(CD_vs_nonIBD) MaAsLin2_DA Denovo_characterization MaAsLin2 Cluster_1__MaAsLin2_DA
...
- This file details the annotation of each protein family in the community (TSV format file).
- Protein families are groups of evolutionarily-related protein-coding sequences that often perform similar functions.
- MetaWIBELE annotates protein families by combining global-homology similarity, local-homology similarity, and non-homology based methods.
- The annotations for each protein family come from multiple information sources, e.g. biochemical annotation, taxonomic annotation, ecological properties and association with environmental parameters or phenotypes, etc.
To view this file, type:
less -S output/characterization/finalized/demo_proteinfamilies_annotation.attribute.tsv
which yields:
TID AID key value
1 Cluster_1__Denovo_clustering repID PRISM_7861_211956
2 Cluster_1__Denovo_clustering rep_length 85
3 Cluster_1__Denovo_clustering cluster_size 24
4 Cluster_1__UniRef90_homology UniProtKB A7B522
5 Cluster_1__UniRef90_homology Protein_names Translation initiation factor IF-1
6 Cluster_1__UniRef90_homology query_cov_type high_confidence
...
- This is the supplementary results for characterization (TSV format file).
- Each line in the file provides supplemental information about the corresponding annotation results.
AID
is the key to connectdemo_proteinfamilies_annotation.tsv
withdemo_proteinfamilies_annotation.attribute.tsv
.
To view this file, type:
less -S output/characterization/finalized/demo_proteinfamilies_annotation.taxonomy.tsv
which yields:
familyID study map_type query_type mutual_type identity query_coverage mutual_coverage detail TaxTaxID Rep_Tax Rep_TaxID UniProtKB unirefID note msp_name msp_taxa_name msp_taxa_id MSP_Tax MSP_TaxID MSP_Rep_Tax MSP_Rep_TaxID taxa_id taxa_name taxa_rank taxa_lineage
Cluster_1 demo UniRef90_characterized high_confidence high_confidence 100 1.0 1.0 Translation initiation factor IF-1 root 1 Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9) 411470 A7B522 UniRef90_A7B522 good msp_unknownNA NA Lachnospiraceae 186803 Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9) 411470 411470 Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9) Terminal k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__Blautia|s__Ruminococcus_gnavus|t__Ruminococcus_gnavus_(strain_ATCC_29149_/_VPI_C7-9)
Cluster_10 demo UniRef90_uncharacterized high_confidence high_confidence 99.1 1.0 1.0 Phenylalanine--tRNA ligase beta subunit Clostridiales 186802 [Eubacterium] rectale 39491 A0A395ZH25 UniRef90_A0A395ZH25 good msp_1 [Eubacterium] rectale 39491 [Eubacterium] rectale 39491 [Eubacterium] rectale 39491 39491 [Eubacterium] rectale Species k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale
Cluster_100 demo UniRef90_uncharacterized high_confidence high_confidence 98.9 1.0 1.0 PTS fructose transporter subunit IIC Clostridiales 186802 [Eubacterium] rectale 39491 A0A395UVP7 UniRef90_A0A395UVP7 good msp_1 [Eubacterium] rectale 39491 [Eubacterium] rectale 39491 [Eubacterium] rectale 39491 39491 [Eubacterium] rectale Species k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale
Cluster_101 demo UniRef90_uncharacterized high_confidence high_confidence 93.6 1.0 1.0 RecQ family ATP-dependent DNA helicase Bacteria 2 [Eubacterium] rectale 39491 A0A396FK20 UniRef90_A0A396FK20 good msp_1 [Eubacterium] rectale 39491 [Eubacterium] rectale 39491 [Eubacterium] rectale 39491 39491 [Eubacterium] rectale Species k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale
...
- This file provides detailed information about taxonomic annotation (TSV format file).
-
demo_proteinfamilies_annotation.taxonomy.all.tsv
shows the taxonomic information at each taxonomic level. -
Main columns:
taxa_id
: the finalized the taxon ID of this protein family (either taxon for LCA or representative based on parameter setting, default representative taxon);taxa_name
: the finalized the taxon name of this protein family;taxa_rank
: the finalized the taxonomic level of this protein family;taxa_lineage
: the finalized the taxonomic lineage of this protein family.
To view this file, type:
less -S output/characterization/finalized/demo_proteinfamilies_nrm.tsv
which yields:
ID PRISM_7122 PRISM_7147 PRISM_7150 PRISM_7153 PRISM_7184 PRISM_7238 PRISM_7406 PRISM_7408 PRISM_7421
Cluster_1 9165.52 562.073 3127.42 592.641 91252.4 17300.3 25623.1 8615.57 180.455
Cluster_10 80.549 2873.28 2402.93 3313.99 801.949 1655.54 5.23934 1639.27 2993.25
Cluster_100 202.251 1817.23 1632.6 2618.05 503.404 933.184 4.93331 925.316 1612.71
Cluster_101 206.759 1736.68 1729.47 1543.63 0 1084.08 8.40548 1068.26 1638.2
...
- This file contains the relative abundance per protein family across samples (TSV format file).
- Protein family abundance is reported in copies per million (CPM) units by default, which is "total sum scaling (TSS)"-style normalization (i.e. each sample is constrained to sum to 1 million). First, each protein family is normalized to RPK (reads per kilobase) units for gene length normalization; RPK units reflect relative gene (or transcript) copy numbers in the community. Then, RPK values are further sum-normalized (CPM) to adjust for differences in sequencing depth across samples. For further information, refer to the normalization approach in HUMAnN.
The clustering
folder contains the outputs of clustering proteins into protein families using the same criteria for building UniRef90 clusters.
The pre-generated demo files can be directly downloaded here:
To view this file, enter:
less -S output/characterization/clustering/demo_proteinfamilies.clstr
which yields:
>PRISM_7861_211956;Cluster_1;length=85;size=24;cluster=1
PRISM_7861_211956
PRISM_7791_142577
PRISM_8794_132719
PRISM_7910_97769
...
>PRISM_7855_54982;Cluster_2;length=559;size=12;cluster=2
PRISM_7855_54982
PRISM_8776_12844
PRISM_8467_139927
PRISM_9148_54414
...
- This file shows the clustered proteins (Fasta format file).
- The top ID (e.g. PRISM_7861_211956) is the representative of the cluster and it is followed by a list of cluster members.
- Each protein is given a MetaWIBELE-specific ID (i.e. PRISM_7861_211956 and PRISM_7861_54982) based on the sample and order in which it was identified.
To view this file, enter:
less -S output/characterization/clustering/demo_proteinfamilies.centroid.faa
which yields:
>PRISM_7861_211956
MRSGSVKSAGGFSMSKADVIEIEGVVVEKLPNAMFKVELENGHIVLAHISGKLRMNFIKILPGDKVTLEMSPYDLSKGRI
...
>PRISM_7855_54982
MISDNARAGMQQAPARSLFNALGFTPEEMKKPMVGIVSSFNEIVPGHMNIDKIVEAVKLGVAEAGGVPVVFPAIAVCDGI
AMGHVGMKYSLVTRDLIADSTECMAIAHQFDALVMVPNCDKNVPGLLMAAARLNLPTVFVSGGPMLAGHVKGRKRSLSSM
...
- This file provides the amino acid sequences for each protein identified as the representative of its protein family (Fasta format file).
The global_homology_annotation
folder contains the outputs from querying each sequence in the protein families against the UniRef90 database via a protein-level search.
The pre-generated demo files can be directly downloaded here:
To view this file, enter:
less -S output/characterization/global_homology_annotation/demo_proteinfamilies_annotation.uniref90_annotation.tsv
which yields:
familyID study map_type query_type mutual_type identity query_coverage mutual_coverage detail TaxTaxID Rep_Tax Rep_TaxID UniProtKB unirefID taxa_id taxa_name taxa_rank taxa_lineage note
Cluster_1 demo UniRef90_characterized high_confidence high_confidence 100 1.0 1.0 Translation initiation factor IF-1 root 1 Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9) 411470 A7B522 UniRef90_A7B522 411470 Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9) Terminal k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__Blautia|s__Ruminococcus_gnavus|t__Ruminococcus_gnavus_(strain_ATCC_29149_/_VPI_C7-9) good
Cluster_100 demo UniRef90_uncharacterized high_confidence high_confidence 98.9 1.0 1.0 PTS fructose transporter subunit IIC Clostridiales 186802 [Eubacterium] rectale 39491 A0A395UVP7 UniRef90_A0A395UVP7 39491 [Eubacterium] rectale Species k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale good
Cluster_106 demo UniRef90_weak_homology low_confidence low_confidence 97.4 0.5571895424836601 0.5571895424836601 NA NA NA NA NA NA UniRef90_R6U3V8 NA NA Unclassified NA good
Cluster_285 demo UniRef90_worse_homology low_confidence low_confidence 94.8 0.7584269662921348 0.14594594594594595 UniRef90_unknown NA NA NA NA NA NA 39491 NA Unclassified NA good
...
- This file shows the mapping results of protein families to UniRef90 (TSV format file).
- Each row in this file corresponds to a protein family (i.e. Cluster_1, Cluster_10). Based on the sequence identity to and coverage of the UniRef90 sequence, each protein family is labeled as:
- UniRef90_characterized: proteins with strong homology to known characterized proteins
- UniRef90_uncharacterized: proteins with strong homology to proteins of uncharacterized function (i.e. lacking a GO term assignment)
- UniRef90_weak_homology: potentially novel proteins that only had weak homology to any UniRef90 family
- UniRef90_worse_homology: completely novel proteins without any significant homology
To view this file, enter:
less -S output/characterization/global_homology_annotation/demo_UniRef90_proteinfamilies.detail.tsv
which yields:
familyID type detail Protein_names Gene_names Tax TaxID Rep_Tax Rep_TaxID UniProtKB unirefID
Cluster_1 UniRef90_GO GO:0003743;GO:0005737;GO:0019843;GO:0043022 Translation initiation factor IF-1 RUMGNA_02659{ECO:0000313|EMBL:EDN77045.1} root 1 Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9) 411470 A7B522 UniRef90_A7B522
Cluster_1 UniRef90_PfamDomain PF01176 Translation initiation factor IF-1 RUMGNA_02659{ECO:0000313|EMBL:EDN77045.1} root 1 Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9) 411470 A7B522 UniRef90_A7B522
Cluster_1 UniRef90_eggNOG COG0361;ENOG4105K9U Translation initiation factor IF-1 RUMGNA_02659{ECO:0000313|EMBL:EDN77045.1} root 1 Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9) 411470 A7B522 UniRef90_A7B522
Cluster_10 UniRef90_uncharacterized UniRef90_uncharacterized Phenylalanine--tRNA ligase beta subunit DW951_09675{ECO:0000313|EMBL:RHA03586.1} Clostridiales 186802 [Eubacterium] rectale 39491 A0A395ZH25 UniRef90_A0A395ZH25
Cluster_106 UniRef90_unknown UniRef90_unknown UniRef90_unknown NA NA NA NA NA NA NA
...
- This file details the annotations of the UniRef90s to other groupings, including GO terms, KEGG Orthogroups, EggNOG (including COGs), and Pfams (TSV format file).
- Protein families may also be labeled as:
- UniRef90_XYZ: proteins with strong homology to known characterized proteins and annotated with specific type of functions
- UniRef90_uncharacterized: proteins with strong homology to proteins of uncharacterized function (i.e. lacking GO annotation)
- UniRef90_unknown: potentially novel proteins that lacked functional annotations and only had weak homology or without any significant homology to any UniRef90 family
The domain_motif_annotation
folder contains the outputs created by using the local-homology approach (domain/motif search) to characterize the secondary structures of protein families.
The pre-generated demo files can be directly downloaded here:
- demo_Denovo_proteinfamilies.signaling.detail.tsv
- demo_Denovo_proteinfamilies.transmembrane.detail.tsv
- demo_InterProScan_proteinfamilies.detail.tsv
- demo_DOMINE_proteinfamilies.detail.tsv
To view this file, enter:
less -S output/characterization/domain_motif_annotation/demo_Denovo_proteinfamilies.signaling.detail.tsv
which yields:
familyID type detail consistency
Cluster_108 Denovo_signaling signaling 1.0
Cluster_11 Denovo_signaling signaling 1.0
Cluster_111 Denovo_signaling signaling 1.0
Cluster_114 Denovo_signaling signaling 1.0
...
- This file lists potential signaling proteins based on de novo prediction of function (TSV format file).
- Proteins that are not predicted to be signaling proteins are excluded from the file.
- If signal peptides were predicted by both SignalP and Phobius, the consistency score (column 4) contains the consistency scores from SignalP and Phobius predictions. When both SignalP and Phobius predict the same type of signal and have high consistency scores the prediction is believed to be true and the protein family is included.
To view this file, enter:
less -S output/characterization/domain_motif_annotation/demo_Denovo_proteinfamilies.transmembrane.detail.tsv
which yields:
familyID type detail consistency
Cluster_100 Denovo_transmembrane transmembrane 1.0
Cluster_102 Denovo_transmembrane transmembrane 1.0
Cluster_105 Denovo_transmembrane transmembrane 1.0
Cluster_108 Denovo_transmembrane transmembrane 1.0
...
- Similar to the above, this file lists potential transmembrane proteins based on de novo prediction of function (TSV format file).
To view this file, enter:
less -S output/characterization/domain_motif_annotation/demo_InterProScan_proteinfamilies.detail.tsv
which yields:
familyID type detail description InterPro_accession consistency
Cluster_257 InterProScan_PfamDomain PF00005;PF02518 ABC transporter-like;Histidine kinase/HSP90-like ATPase IPR003439;IPR0035941.0;1.0
Cluster_191 InterProScan_PfamDomain PF07005;PF17042 Four-carbon acid sugar kinase, N-terminal domain;Four-carbon acid sugar kinase, C-terminal domain IPR010737;IPR031475 1.0;1.0
Cluster_130 InterProScan_PfamDomain PF13437;PF13533 HlyD family secretion protein;Membrane fusion protein, biotin-lipoyl like domain NA;IPR039562 1.0;1.0
Cluster_71 InterProScan_PfamDomain PF00343 Glycosyl transferase, family 35 IPR000811 1.0
...
- This file lists the results of using InterProScan to predict Pfam IDs or other domain signatures for each protein family (TSV format file).
- The consistency score is calculated as the proportion of members within the cluster that receive the same predicted Pfam ID as the protein family representative.
To view this file, enter:
less -S output/characterization/domain_motif_annotation/demo_DOMINE_proteinfamilies.detail.tsv
which yields:
familyID type detail description consistency
Cluster_257 DOMINE_interaction PF00005:PF00009;PF00005:PF00037;PF00005:PF00571;PF00005:PF00664;PF00005:PF03459;PF02518:PF00012;PF02518:PF00072;PF02518:PF00165;PF02518:PF00183;PF02518:PF00204;PF02518:PF00240;PF02518:PF00515;PF02518:PF00582;PF02518:PF00672;PF02518:PF00785;PF02518:PF00989;PF02518:PF01119;PF02518:PF01584;PF02518:PF01590;PF02518:PF01624;PF02518:PF01627;PF02518:PF01740;PF02518:PF01751;PF02518:PF02895;PF00005:PF00528;PF00005:PF01458;PF00005:PF08402;PF02518:PF00512;PF02518:PF04539;PF02518:PF04542;PF02518:PF04968;PF02518:PF04969;PF02518:PF05833;PF02518:PF07730;PF02518:PF08565;PF02518:PF09239;PF02518:PF10436 ABC transporter:Elongation factor Tu GTP binding domain;ABC transporter:4Fe-4S binding domain;ABC transporter:CBS domain;ABC transporter:ABC transporter transmembrane region;ABC transporter:TOBE domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Hsp70 protein;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Response regulator receiver domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Bacterial regulatory helix-turn-helix proteins, AraC family;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Hsp90 protein;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:DNA gyrase B;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Ubiquitin family;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Tetratricopeptide repeat;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Universal stress protein family;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:HAMP domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:NA;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:PAS fold;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:DNA mismatch repair protein, C-terminal domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:CheW-like domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:GAF domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:MutS domain I;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Hpt domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:STAS domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Toprim domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Signal transducing histidine kinase, homodimeric domain;ABC transporter:Binding-protein-dependent transport system inner membrane component;ABC transporter:Uncharacterized protein family (UPF0051);ABC transporter:TOBE domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:His Kinase A (phospho-acceptor) domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Sigma-70 region 3;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Sigma-70 region 2;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:CHORD;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:CS domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Fibronectin-binding protein A N-terminus (FbpA);Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Histidine kinase;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Cdc37 Hsp90 binding domain;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Topoisomerase VI B subunit, transducer;Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase:Mitochondrial branched-chain alpha-ketoacid dehydrogenase kinase 1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0
Cluster_256 DOMINE_interaction PF00293:PF00300;PF00293:PF05026;PF00293:PF09296 NUDIX domain:Histidine phosphatase superfamily (branch 1);NUDIX domain:Dcp2, box A domain;NUDIX domain:NADH pyrophosphatase-like rudimentary NUDIX domain 1.0;1.0;1.0
Cluster_16 DOMINE_interaction PF00009:PF00005;PF00009:PF00679;PF00009:PF01008;PF00009:PF01423;PF00009:PF01583;PF00009:PF03143;PF00009:PF03144;PF00009:PF03764;PF03144:PF00679;PF03144:PF01583;PF03144:PF03143;PF03144:PF03764;PF00009:PF00129;PF00009:PF00736;PF00009:PF01507;PF00009:PF01926;PF00009:PF03431;PF00009:PF06421;PF00009:PF07686;PF00009:PF09173;PF00009:PF11987;PF03144:PF00736;PF03144:PF01507;PF03144:PF01873;PF03144:PF03431;PF03144:PF07541;PF03144:PF09173;PF03144:PF11987 Elongation factor Tu GTP binding domain:ABC transporter;Elongation factor Tu GTP binding domain:Elongation factor G C-terminus;Elongation factor Tu GTP binding domain:Initiation factor 2 subunit family;Elongation factor Tu GTP binding domain:LSM domain;Elongation factor Tu GTP binding domain:Adenylylsulphate kinase;Elongation factor Tu GTP binding domain:Elongation factor Tu C-terminal domain;Elongation factor Tu GTP binding domain:Elongation factor Tu domain 2;Elongation factor Tu GTP binding domain:Elongation factor G, domain IV;Elongation factor Tu domain 2:Elongation factor G C-terminus;Elongation factor Tu domain 2:Adenylylsulphate kinase;Elongation factor Tu domain 2:Elongation factor Tu C-terminal domain;Elongation factor Tu domain 2:Elongation factor G, domain IV;Elongation factor Tu GTP binding domain:Class I Histocompatibility antigen, domains alpha 1 and 2;Elongation factor Tu GTP binding domain:EF-1 guanine nucleotide exchange domain;Elongation factor Tu GTP binding domain:Phosphoadenosine phosphosulfate reductase family;Elongation factor Tu GTP binding domain:50S ribosome-binding GTPase;Elongation factor Tu GTP binding domain:RNA replicase, beta-chain;Elongation factor Tu GTP binding domain:GTP-binding protein LepA C-terminus;Elongation factor Tu GTP binding domain:Immunoglobulin V-set domain;Elongation factor Tu GTP binding domain:Initiation factor eIF2 gamma, C terminal;Elongation factor Tu GTP binding domain:Translation-initiation factor 2;Elongation factor Tu domain 2:EF-1 guanine nucleotide exchange domain;Elongation factor Tu domain 2:Phosphoadenosine phosphosulfate reductase family;Elongation factor Tu domain 2:Domain found in IF2B/IF5;Elongation factor Tu domain 2:RNA replicase, beta-chain;Elongation factor Tu domain 2:Eukaryotic translation initiation factor 2 alpha subunit;Elongation factor Tu domain 2:Initiation factor eIF2 gamma, C terminal;Elongation factor Tu domain 2:Translation-initiation factor 2 1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0;1.0
Cluster_298 DOMINE_interaction PF07963:PF00114 Prokaryotic N-terminal methylation motif:Pilin (bacterial filament) 1.0
...
- This file lists the results of using DOMINE to predict domain-domain interactions for each protein (TSV format file).
- In this step, MetaWIBELE searches the DOMINE database of known and predicted protein domain (domain-domain) interactions in order to identify microbial proteins that interact with other microbial proteins or host proteins.
- Looking at the first row for protein family Cluster_298, the details column contains "PF07963:PF00114". The first value is the Pfam ID for this protein family, and the second value is the Pfam ID of the protein it is known or predicted to interact with.
- Protein families can interact with multiple other proteins and multiple interactions are separated by ";" in the details column. For example, the third row for protein family Cluster_257 has Pfam ID PF00005 and interacts with multiple other proteins including PF00009, PF00037, PF00571, and many others.
The abundance_annotation
folder contains the outputs created from utilizing a non-homology based strategy compromising (i) taxonomic annotation with phylogenetic binning, (ii) abundance profiling for protein families, and (iii) association with environmental parameters or phenotypes based on differential abundance.
The pre-generated demo files can be directly downloaded here:
- demo_DNA_proteinfamilies.abundance.detail.tsv
- demo_MaAsLin2_proteinfamilies.DA.detail.tsv
- demo_MaAsLin2_proteinfamilies.DA-sig.detail.tsv
- demo_MSPminer_proteinfamilies.detail.tsv
- demo_proteinfamilies_annotation.taxonomy.tsv
To view this file, enter:
less -S output/characterization/abundance_annotation/demo_DNA_proteinfamilies.abundance.detail.tsv
which yields:
familyID type detail mean_abundance mean_prevalent_abundance prevalence total_abundance total_samples
Cluster_1 DNA-CD_abundance 23863.237926470596 23863.237926470596 25757.145698412707 0.9264705882352942 1622700.1790000005 68
Cluster_1 DNA-CD_prevalence 0.9264705882352942 23863.237926470596 25757.145698412707 0.9264705882352942 1622700.1790000005 68
Cluster_1 DNA_abundance 16718.588343137264 16718.588343137264 17580.37124742269 0.9509803921568627 1705296.0110000009 102
Cluster_1 DNA_prevalence 0.9509803921568627 16718.588343137264 17580.37124742269 0.9509803921568627 1705296.0110000009 102
Cluster_1 DNA-nonIBD_abundance 2429.2891764705887 2429.2891764705887 2429.2891764705887 1.0 82595.83200000001 34
Cluster_1 DNA-nonIBD_prevalence 1.0 2429.2891764705887 2429.2891764705887 1.0 82595.83200000001 34
Cluster_1 DNA-within-phenotype_abundance 23863.237926470596 23863.237926470596 25757.145698412707 0.9264705882352942 1622700.1790000005 68
Cluster_1 DNA-within-phenotype_prevalence 1.0 2429.2891764705887 2429.2891764705887 1.0 82595.83200000001 34
...
- This file lists the results of abundance annotations for each protein (TSV format file).
- In this file, the number of rows per protein family is equal to 4+(2*the number of levels of the phenotype variable of interest). In this case, our phenotype variable of interest contains two levels (CD and non-IBD) and so there are 8 rows per protein family.
- We calculate the mean abundance and mean prevalence of the protein family within levels of the phenotype variable. The "within phenotype" abundance and prevalence are used later by MetaWIBELE and are equal to the largest of these values, respectively. For example, for Cluster_1 the "within phenotype" abundance is equal to the CD abundance, as this is larger than the non-IBD abundance.
- Additionally, we calculate the mean abundance and prevalence of the protein family across all the samples ("DNA_abundance", "DNA_prevalence").
To view this file, enter:
less -S output/characterization/abundance_annotation/demo_MaAsLin2_proteinfamilies.DA.detail.tsv
which yields:
familyID type detail cmp_type prevalence prevalence_case prevalence_control coef stderr pvalue qvalue log(FC) Cohen's_d mean(log) mean_abundance mean_abundance_case mean_abundance_control mean_prevalent_abundance mean_prevalent_abundance_case mean_prevalent_abundance_control
Cluster_253 MaAsLin2_DA up(CD_vs_nonIBD) CD_vs_nonIBD 0.6862745098039216 0.8382352941176471 0.38235294117647056 6.59932556080405 1.48799879543361 2.57037708535903e-05 0.000176053225024591 2.48608582454593 1.0766611399179797 4.492591935109035 4587.649978039216 6606.8036835294115 549.3425670588235 6684.861396571429 7881.800885614034 1436.7420984615385
Cluster_247 MaAsLin2_DA up(CD_vs_nonIBD) CD_vs_nonIBD 0.7450980392156863 0.8676470588235294 0.5 8.09168252755146 1.46613110483452 3.18585281861428e-07 3.5953503081526e-05 2.4377401279384125 1.1489923257331105 4.4243947777191845 5935.187654901962 8530.409459705887 744.7440452941177 7965.646589473685 9831.658360338986 1489.4880905882353
Cluster_252 MaAsLin2_DA down(CD_vs_nonIBD) CD_vs_nonIBD 0.5 0.35294117647058826 0.7941176470588235 -5.94413824963104 1.30440935252719 1.60881176921761e-05 0.00013494033537789 -1.7040901947843266 -1.230317145260927 -3.8803544049272 703.886934117647 281.32940911764706 1549.001984117647 1407.773868235294 797.0999925 1950.595091111111
Cluster_180 MaAsLin2_DA down(CD_vs_nonIBD) CD_vs_nonIBD 0.7745098039215687 0.6764705882352942 0.9705882352941176 -3.90013337098607 1.23666141445333 0.00218410898383837 0.00392204763002695 -0.991862676094047 -1.7721309728447108 -3.367940610001413 1749.8750674509804 1117.6126011764707 3014.4 2259.3323655696204 1652.122975652174 3105.745454545455
...
- This file contains the results of running MaAsLin2 linear models to assess the differential abundance of the protein families across levels of the phenotype variable of interest (TSV format file).
- The detail type of "down" indicates that the abundance of the protein family was lower in CD compared to the non-IBD control, while a detail type of "up" indicates that the abundance of the protein family was higher in CD compared to the non-IBD control.
To view this file, enter:
less -S output/characterization/abundance_annotation/demo_MaAsLin2_proteinfamilies.DA-sig.detail.tsv
which yields:
familyID type detail cmp_type prevalence prevalence_case prevalence_control coef stderr pvalue qvalue log(FC) Cohen's_d mean(log) mean_abundance mean_abundance_case mean_abundance_control mean_prevalent_abundance mean_prevalent_abundance_case mean_prevalent_abundance_control
Cluster_253 MaAsLin2_DA-sig up(CD_vs_nonIBD) CD_vs_nonIBD 0.6862745098039216 0.8382352941176471 0.38235294117647056 6.59932556080405 1.48799879543361 2.57037708535903e-05 0.000176053225024591 2.48608582454593 1.0766611399179797 4.492591935109035 4587.649978039216 6606.8036835294115 549.3425670588235 6684.861396571429 7881.800885614034 1436.7420984615385
Cluster_247 MaAsLin2_DA-sig up(CD_vs_nonIBD) CD_vs_nonIBD 0.7450980392156863 0.8676470588235294 0.5 8.09168252755146 1.46613110483452 3.18585281861428e-07 3.5953503081526e-05 2.4377401279384125 1.1489923257331105 4.4243947777191845 5935.187654901962 8530.409459705887 744.7440452941177 7965.646589473685 9831.658360338986 1489.4880905882353
Cluster_252 MaAsLin2_DA-sig down(CD_vs_nonIBD) CD_vs_nonIBD 0.5 0.35294117647058826 0.7941176470588235 -5.94413824963104 1.30440935252719 1.60881176921761e-05 0.00013494033537789 -1.7040901947843266 -1.230317145260927 -3.8803544049272 703.886934117647 281.32940911764706 1549.001984117647 1407.773868235294 797.0999925 1950.595091111111
Cluster_180 MaAsLin2_DA-sig down(CD_vs_nonIBD) CD_vs_nonIBD 0.7745098039215687 0.6764705882352942 0.9705882352941176 -3.90013337098607 1.23666141445333 0.00218410898383837 0.00392204763002695 -0.991862676094047 -1.7721309728447108 -3.367940610001413 1749.8750674509804 1117.6126011764707 3014.4 2259.3323655696204 1652.122975652174 3105.745454545455
...
- This file contains similar data to the demo_MaAsLin2_proteinfamilies.DA.detail.tsv file above (TSV format file); however, it only includes results for protein families that were significantly (q<0.05) differentially abundant by default.
- Under what circumstances would you want to skip the association analysis?
To view this file, enter:
less -S output/characterization/abundance_annotation/demo_MSPminer_proteinfamilies.detail.tsv
which yields:
familyID type detail gene_class module_name msp_description msp_component multi_msp total_member core_gene accessory_gene module_compoment multi_module taxa_id taxa_name taxa_rank taxa_lineage note
Cluster_1 MSPminer_MSP msp_unknown NA NA NA msp_unknown 1 24 0 0 NA 0 NA NA Unclassified NA good
Cluster_10 MSPminer_MSP msp_1 msp_1__core msp_1__core num_genes(460),num_unclassified_genes(91),num_hit_genes(334msp_1;msp_unknown 2 2 0.5 0 msp_1__core 1 39491 [Eubacterium] rectale Species k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale good
Cluster_100 MSPminer_MSP msp_1 msp_1__core msp_1__core num_genes(460),num_unclassified_genes(91),num_hit_genes(334msp_1 1 1 1.0 0 msp_1__core 1 39491 [Eubacterium] rectale Species k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale good
Cluster_101 MSPminer_MSP msp_1 msp_1__core msp_1__core num_genes(460),num_unclassified_genes(91),num_hit_genes(334msp_1 1 1 1.0 0 msp_1__core 1 39491 [Eubacterium] rectale Species k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale good
...
- MetaWIBELE uses MSPminer to construct Metagenomic Species Pan-genomes (MSPs) by binning co-abundant genes across samples (TSV format file).
- For each protein family, the detail column lists the MSP that protein family has been assigned to. For example, protein family Cluster_1 belongs to the first MSP (msp_1).
To view this file, enter:
less -S output/characterization/abundance_annotation/demo_proteinfamilies_annotation.taxonomy.tsv
which yields:
familyID study map_type query_type mutual_type identity query_coverage mutual_coverage detail TaxTaxID Rep_Tax Rep_TaxID UniProtKB unirefID note msp_name msp_taxa_name msp_taxa_id MSP_Tax MSP_TaxID MSP_Rep_Tax MSP_Rep_TaxID taxa_id taxa_name taxa_rank taxa_lineage
Cluster_1 demo UniRef90_characterized high_confidence high_confidence 100 1.0 1.0 Translation initiation factor IF-1 root 1 Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9) 411470 A7B522 UniRef90_A7B522 good msp_unknownNA NA Lachnospiraceae 186803 Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9) 411470 411470 Ruminococcus gnavus (strain ATCC 29149 / VPI C7-9) Terminal k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__Blautia|s__Ruminococcus_gnavus|t__Ruminococcus_gnavus_(strain_ATCC_29149_/_VPI_C7-9)
Cluster_10 demo UniRef90_uncharacterized high_confidence high_confidence 99.1 1.0 1.0 Phenylalanine--tRNA ligase beta subunit Clostridiales 186802 [Eubacterium] rectale 39491 A0A395ZH25 UniRef90_A0A395ZH25 good msp_1 [Eubacterium] rectale 39491 [Eubacterium] rectale 39491 [Eubacterium] rectale 39491 39491 [Eubacterium] rectale Species k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale
Cluster_106 demo UniRef90_weak_homology low_confidence low_confidence 97.4 0.5571895424836601 0.5571895424836601 NA NA NA NA NA NA UniRef90_R6U3V8 good msp_1 [Eubacterium] rectale 39491 [Eubacterium] rectale 39491 [Eubacterium] rectale 39491 39491 [Eubacterium] rectale Species k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale
Cluster_285 demo UniRef90_worse_homology low_confidence low_confidence 94.8 0.7584269662921348 0.14594594594594595UniRef90_unknown NA NA NA NA NA NA good msp_1 [Eubacterium] rectale 39491 [Eubacterium] rectale 39491 [Eubacterium] rectale 39491 39491 [Eubacterium] rectale Species k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|unclassified_Lachnospiraceae|s__[Eubacterium]_rectale
...
- In this step, MetaWIBELE uses Metagenomic Species Pan-genomes (MSPs) to assign taxonomy to protein families that were not able to be identified through step 3.3.3 global-homology based annotations (TSV format file).
- For example, protein family Cluster_106 and Cluster_285 had only weak or lacked homology with UniRef90s at the sequence level. However, using co-abundance data it was assigned to msp_1, which we know from other protein family information corresponds to Eubacterium rectale.
-
Main columns:
taxa_id
: the finalized the taxon ID of this protein family (either taxon for LCA or representative based on parameter setting, default representative taxon);taxa_name
: the finalized the taxon name of this protein family;taxa_rank
: the finalized the taxonomic level of this protein family;taxa_lineage
: the finalized the taxonomic lineage of this protein family.
- MetaWIBELE uses a non-homology based approach to perform taxonomic classification. What would be some pros/cons of this approach in real data with different levels of characterization? (Hint: consider well-characterized microbiome communities vs. poorly-characterized microbiome communities)
MetaWIBELE can perform prioritization to identify important proteins using the following outputs of the MetaWIBELE-characterize module:
- annotation file produced by MetaWIBELE-characterize workflow, e.g. demo_proteinfamilies_annotation.tsv
- attribute file produced by MetaWIBELE-characterize workflow, e.g. demo_proteinfamilies_annotation.attribute.tsv
A canonical MetaWIBELE run will prioritize protein families using unsupervised and supervised approaches. By default, protein families are 1) quantitatively ranked based on ecological properties (abundance and prevalence) and/or associations with environmental parameters or phenotypes (q-value and effect size), 2) prioritized based on specified functions of interest via binary filtering.
From the working directory where you executed the MetaWIBELE-prioritize module (metawibele/
directory), execute the following:
metawibele prioritize --input-annotation output/characterization/finalized/demo_proteinfamilies_annotation.tsv \
--input-attribute output/characterization/finalized/demo_proteinfamilies_annotation.attribute.tsv \
--output output/prioritization/ --local-jobs 4
Note:
- Make sure the global configuration file
metawibele.cfg
is in your working directory. - In the command replaces
output/prioritization/
with the name of the directory to write the output files to. - See the section on parallelization options to optimize the workflow run based on your computing resources.
- The workflow runs with the default settings used to run all modules. These settings will work for most data sets. If you need to modify the default settings, you can use command line parameters to customize the characterization workflow settings. For example, you can run with one or more bypass options (for information on bypass options, see the section Workflow bypass mode).
This command yields:
(Jun 10 15:43:28) [0/8 - 0.00%] **Started ** Task 0: quantify_prioritization__unsupervised
(Jun 10 15:43:32) [4/8 - 50.00%] **Started ** Task 13: finalize_prioritization__selected_supervised_priority
(Jun 10 15:43:28) [0/8 - 0.00%] **Started ** Task 5: quantify_prioritization__supervised
(Jun 10 15:43:31) [3/8 - 37.50%] **Started ** Task 6: filter_prioritization
(Jun 10 15:43:30) [1/8 - 12.50%] **Started ** Task 8: filter_prioritization
...
(Jun 10 15:43:33) [8/8 - 100.00%] **Completed** Task 11: finalize_prioritization__selected_unsupervised_priority
Run Finished
Three main output files will be created:
output/prioritization/demo_unsupervised_prioritization.rank.table.tsv
output/prioritization/demo_supervised_prioritization.rank.table.tsv
output/prioritization/demo_supervised_prioritization.rank.selected.table.tsv
The pre-generated demo files can be directly downloaded here:
- demo_unsupervised_prioritization.rank.table.tsv
- demo_supervised_prioritization.rank.table.tsv
- demo_supervised_prioritization.rank.selected.table.tsv
demo_unsupervised_prioritization.rank.table.tsv To view this file, enter:
less -S output/prioritization/demo_unsupervised_prioritization.rank.table.tsv
which yields:
TID familyID evidence value rank description note
1 Cluster_2 DNA_abundance 13143.781063725488 0.992 ranking based on single evidence
2 Cluster_2 DNA_prevalence 0.9803921568627451 0.997 ranking based on single evidence
3 Cluster_2 priority_score 0.994493715434892 0.994493715434892 meta ranking based on multiple evidences
4 Cluster_412 DNA_abundance 20994.94988235294 0.998 ranking based on single evidence
5 Cluster_412 DNA_prevalence 0.9607843137254902 0.988 ranking based on single evidence
6 Cluster_412 priority_score 0.9929748237663647 0.9929748237663647 meta ranking based on multiple evidences
7 Cluster_1 DNA_abundance 16718.588343137264 0.996 ranking based on single evidence
8 Cluster_1 DNA_prevalence 0.9509803921568627 0.982 ranking based on single evidence
9 Cluster_1 priority_score 0.9889504550050557 0.9889504550050557 meta ranking based on multiple evidences
...
- These are the results of unsupervised prioritization based on ecological properties (TSV format file).
- Each protein family has a numeric priority score based on its meta ranking.
-
demo_unsupervised_prioritization.rank.tsv
is the overall ranking for all protein families.
demo_supervised_prioritization.rank.table.tsv To view this file, enter:
less -S output/prioritization/demo_supervised_prioritization.rank.table.tsv
which yields:
TID familyID evidence value rank description note
1 Cluster_162 DNA_within_phenotype_abundance 3360.590970588234 0.82 ranking based on single evidence CD_vs_nonIBD
2 Cluster_162 DNA_within_phenotype_prevalence 1.0 1.0 ranking based on single evidence CD_vs_nonIBD
3 Cluster_162 MaAsLin2_DA__mean_log -2.4957861926317024 0.744 ranking based on single evidence CD_vs_nonIBD
4 Cluster_162 MaAsLin2_DA__qvalue 5.77786226411476e-05 0.957 ranking based on single evidence CD_vs_nonIBD
5 Cluster_162 priority_score 0.8679556698228121 0.8679556698228121 meta ranking based on multiple evidences CD_vs_nonIBD
6 Cluster_349 DNA_within_phenotype_abundance 2890.855411764706 0.646 ranking based on single evidence CD_vs_nonIBD
7 Cluster_349 DNA_within_phenotype_prevalence 1.0 1.0 ranking based on single evidence CD_vs_nonIBD
8 Cluster_349 MaAsLin2_DA__mean_log -2.7772748787138024 0.848 ranking based on single evidence CD_vs_nonIBD
9 Cluster_349 MaAsLin2_DA__qvalue 4.91142235527697e-05 0.969 ranking based on single evidence CD_vs_nonIBD
10 Cluster_349 priority_score 0.8404730391805891 0.8404730391805891 meta ranking based on multiple evidences CD_vs_nonIBD
...
- These are the results of supervised prioritization combining ecological properties and environmental/phenotypic properties (TSV format file).
- Each protein family has a numeric priority score based on its meta ranking.
-
demo_supervised_prioritization.rank.tsv
is the overall ranking for all protein families.
demo_supervised_prioritization.rank.selected.table.tsv To view this file, enter:
less -S output/prioritization/demo_supervised_prioritization.rank.selected.table.tsv
which yields:
TID familyID evidence value rank description note
1 Cluster_197 DNA_within_phenotype_abundance 3540.796235294117 0.866 ranking based on single evidence CD_vs_nonIBD
2 Cluster_197 DNA_within_phenotype_prevalence 1.0 1.0 ranking based on single evidence CD_vs_nonIBD
3 Cluster_197 MaAsLin2_DA__mean_log -2.317348408740017 0.672 ranking based on single evidence CD_vs_nonIBD
4 Cluster_197 MaAsLin2_DA__qvalue 0.000783512050628084 0.659 ranking based on single evidence CD_vs_nonIBD
5 Cluster_197 priority_score 0.7751516860758866 0.7751516860758866 meta ranking based on multiple evidences CD_vs_nonIBD
6 Cluster_42 DNA_within_phenotype_abundance 3127.4102941176475 0.722 ranking based on single evidence CD_vs_nonIBD
7 Cluster_42 DNA_within_phenotype_prevalence 0.9705882352941176 0.7047120418848167 ranking based on single evidence CD_vs_nonIBD
8 Cluster_42 MaAsLin2_DA__mean_log -2.5611076578004637 0.77 ranking based on single evidence CD_vs_nonIBD
9 Cluster_42 MaAsLin2_DA__qvalue 0.000366143186526455 0.755 ranking based on single evidence CD_vs_nonIBD
10 Cluster_42 priority_score 0.737019409947045 0.737019409947045 meta ranking based on multiple evidences CD_vs_nonIBD
- This file is the result of binary filtering of the protein families based on biochemical annotations (TSV format file).
- The settings require that each of the prioritized protein families should 1) be predicted as signaling peptides, and 2) have at least one of the following features: extracellular, cellWall, outerMembrane, transmembrane, Pfam domains, domain-domain interaction annotations.
- Under what circumstances would skipping supervised prioritization be reasonable? Under what circumstances would skipping supervised prioritization be inadvisable?
Unsupervised hierarchical clustering of the 30 protein families most prioritized in IBD which are potentially secreted proteins with signaling functions:
By default, MetaWIBELE will perform binary filtering using the prioritization configuration file installed in the package. Optionally, you can also make your own configuration files and provide them as optional arguments to MetaWIBELE. For example, the local prioritization configuration file can be provided with --prioritization-config $PRIORITIZE_CONF
where $PRIORITIZE_CONF
is the file with prioritization configurations.
5.1.1 Select subset of interest annotated with multiple specific biochemical annotations simultaneously
Setting $PRIORITIZE_CONF
as following:
```
##Binary filtering for selection subset
[filtering]
# Filter for interested functional vignettes type [Choices: pilin | secreted_system | other user defined | none]
vignettes = none
# significant associations for filtering: [required] required item, [optional] optional item, [none] ignoring
MaAsLin2_DA-sig = required
# biochemical annotation for filtering: [required] required item, [optional] optional item, [none] ignoring
ExpAtlas_interaction = required
DOMINE_interaction = none
SIFTS_interaction = none
Denovo_signaling = required
Denovo_transmembrane = optional
PSORTb_extracellular = optional
PSORTb_cellWall = optional
PSORTb_outerMembrane = optional
UniRef90_extracellular = optional
UniRef90_signaling = none
UniRef90_transmembrane = optional
UniRef90_cellWall = optional
UniRef90_outerMembrane = optional
UniRef90_PfamDomain = none
InterProScan_PfamDomain = none
InterProScan_SUPERFAMILY = none
InterProScan_ProSiteProfiles = none
InterProScan_ProSitePatterns = none
InterProScan_Gene3D = none
InterProScan_PANTHER = none
InterProScan_TIGRFAM = none
InterProScan_SFLD = none
InterProScan_ProDom = none
InterProScan_Hamap = none
InterProScan_SMART = none
InterProScan_CDD = none
InterProScan_PRINTS = none
InterProScan_PIRSF = none
InterProScan_MobiDBLite = none
```
- Re-run prioritization workflow for filtering:
$ metawibele prioritize --prioritization-config $PRIORITIZE_CONF --bypass-mandatory --input-annotation output/characterization/finalized/demo_proteinfamilies_annotation.tsv --input-attribute output/characterization/finalized/demo_proteinfamilies_annotation.attribute.tsv --output output/prioritization/ --selected-output demo_prioritized.selected.tsv
- Output file name:
$OUTPUT_DIR/prioritization demo_prioritized.selected.tsv
- This file is the results of binary filtering of the protein families based on biochemical annotations.
- These settings require that each of prioritized protein family should 1) significantly associated with the main phenotype, 2) be annotated to domain-domain interaction with the host, 3) predicted as signal peptides, and 4) have at least one of the following features: extracellular, cellWall, outerMembrane, transmembrane
Setting $PRIORITIZE_CONF
as following:
```
[filtering]
# Filter for interested functional vignettes type [Choices: pilin | secreted_system | other user defined | none]
vignettes = pilin
# Filter for significant associations: [required] required item, [optional] optional item, [none] ignoring [ Default: required ]
MaAsLin2_DA-sig = none
# biochemical annotation for filtering: [required] required item, [optional] optional item, [none] ignoring
ExpAtlas_interaction = optional
DOMINE_interaction = optional
SIFTS_interaction = optional
Denovo_signaling = optional
Denovo_transmembrane = optional
PSORTb_extracellular = optional
PSORTb_cellWall = optional
PSORTb_outerMembrane = optional
UniRef90_extracellular = optional
UniRef90_signaling = optional
UniRef90_transmembrane = optional
UniRef90_cellWall = optional
UniRef90_outerMembrane = optional
UniRef90_PfamDomain = optional
InterProScan_PfamDomain = optional
InterProScan_SUPERFAMILY = optional
InterProScan_ProSiteProfiles = optional
InterProScan_ProSitePatterns = optional
InterProScan_Gene3D = optional
InterProScan_PANTHER = optional
InterProScan_TIGRFAM = optional
InterProScan_SFLD = optional
InterProScan_ProDom = optional
InterProScan_Hamap = optional
InterProScan_SMART = optional
InterProScan_CDD = optional
InterProScan_PRINTS = optional
InterProScan_PIRSF = optional
InterProScan_MobiDBLite = optional
```
- Re-run prioritization workflow for filtering:
$ metawibele prioritize --prioritization-config $PRIORITIZE_CONF --bypass-mandatory --vignette-config $my_vignette_function_file --input-annotation output/characterization/finalized/demo_proteinfamilies_annotation.tsv --input-attribute output/characterization/finalized/demo_proteinfamilies_annotation.attribute.tsv --output output/prioritization/ --selected-output demo_prioritized_pilin.tsv
- Provide your own vignette function file for filtering specific functions.
- Output file name:
$OUTPUT_DIR/prioritization/demo_prioritized_pilin.tsv
- This file is the result of binary filtering of the protein families based on pilin related functions (TSV format file).
- How would you customize your local configuration file for prioritization to select the subset of protein families annotated with domain-domain interactions?
- How would you customize your local configuration for prioritization to select the subset of protein families which are 1) significantly associated with the phenotype or environmental parameter, 2) annotated to domain-domain interactions, and 3) predicted as signal peptides?
You can also create these databases locally using MetaWIBELE utility scripts based on the latest release version of UniProt/UniRef, and provide $UNIREF_LOCATION
as the location to install the database.
-
Download and obtain UniProt annotations:
$ metawibele_prepare_uniprot_annotation --output $UNIREF_LOCATION
-
Download UniRef sequences and obtain annotations:
$ metawibele_prepare_uniref_annotation -t uniref90 --output $UNIREF_LOCATION
-
Use
diamond
to index sequencesdiamond makedb --in uniref90.fasta -d uniref90.fasta
Protein domain databases are required if you will use MetaWIBELE to perform domain-based annotation. By default, these databases have already been installed when you install the MetaWIBELE package. Alternatively, you can also create these domain databases locally and provide $DOMAIN_LOCATION
as the location to install the database.
-
Download and obtain dependent protein domain information:
$ metawibele_prepare_domain_databases -t Pfam33.0 --output $DOMAIN_LOCATION
- HUMAnN 2.0
- HUMAnN 3.0
- MetaPhlAn 2.0
- MetaPhlAn 3.0
- MetaPhlAn 4.0
- MetaPhlAn 4.1
- PhyloPhlAn 3
- PICRUSt 2.0
- ShortBRED
- PPANINI
- StrainPhlAn 3.0
- StrainPhlAn 4.0
- MelonnPan
- WAAFLE
- MetaWIBELE
- MACARRoN
- FUGAsseM
- HAllA
- HAllA Legacy
- ARepA
- CCREPE
- LEfSe
- MaAsLin 2.0
- MaAsLin 3.0
- MMUPHin
- microPITA
- SparseDOSSA
- SparseDOSSA2
- BAnOCC
- anpan
- MTXmodel
- MTX model 3
- PARATHAA