PPANINI: Prioritization and Prediction of functional Annotations for Novel and Important genes via automated data Network Integration
PPANINI provides a computational pipeline to prioritize microbial genes based on their metagenomic properties (e.g., prevalence and abundance). The resulting prioritized list of gene candidates can then be analyzed further using our visualization tools.
-
For additional information and a quick demo, please see the PPANINI tutorial.
-
Please direct questions to the PPANINI support forum.
-
If you use the PPANINI software, please cite our manuscript: Gholamali Rahnavard, Afrah Shafquat, Himel Mallick, Jason Lloyd-Price, Kevin Bonham, Bahar Sayoldin, Eric A. Franzosa, Curtis Huttenhower, Identifying important uncharacterized genes using metagenomes and metatranscriptomes. huttenhower.sph.harvard.edu/ppanini
$ pip install ppanini
To test if PPANINI is running correctly, you may run the following command in the terminal:
#!cmd
ppanini_test
Which yields:
test_create_folders (basic_tests_utilities.TestUtilitiesBasicFunctions) ... ok
test_is_present (basic_tests_utilities.TestUtilitiesBasicFunctions) ... ok
test_is_protein (basic_tests_utilities.TestUtilitiesBasicFunctions) ... /n/huttenhower_lab/tools/ppanini/depends/lib/Bio/Seq.py:2309: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
BiopythonWarning)
ok
test_pullgenes_fromcontigs (basic_tests_utilities.TestUtilitiesBasicFunctions) ... ok
test_read_fasta (basic_tests_utilities.TestUtilitiesBasicFunctions) ... ok
test_read_gff3 (basic_tests_utilities.TestUtilitiesBasicFunctions) ... ok
test_read_ppanini_imp_genes_table (basic_tests_utilities.TestUtilitiesBasicFunctions) ... ok
test_write_dict (basic_tests_utilities.TestUtilitiesBasicFunctions) ... ok
test_write_fasta (basic_tests_utilities.TestUtilitiesBasicFunctions) ... ok
test_annotate_genes (basic_tests_annotate_genes.TestAnnotateGenesBasicFunctions) ... ok
test_read_gene_table (basic_tests_ppanini.TestPPANINIBasicFunctions)
Tests the function read_gene_table ... Gene Table contains 2 metadata lines .
Gene Table contains 998 gene or centroid lines.
ok
test_preppanini (basic_tests_preppanini.TestPrePPANINIBasicFunctions) ... ok
test_quantify_genes (basic_tests_quantify_genes.TestQuanitfyGenesBasicFunctions) ... ok
----------------------------------------------------------------------
Ran 13 tests in 4.101s
OK
#!python
usage: ppanini [-h] -i INPUT_TABLE [-o OUTPUT_FOLDER] [--basename BASENAME]
[--uniref2go UNIREF2GO] [--log-level LOG_LEVEL]
[--tshld-abund TSHLD_ABUND] [--tshld-prev TSHLD_PREV]
[--beta BETA] [--version]
optional arguments:
-h, --help show this help message and exit
-i INPUT_TABLE, --input_table INPUT_TABLE
REQUIRED: Gene abundance table with metadata
-o OUTPUT_FOLDER, --output-folder OUTPUT_FOLDER
Folder containing results
--basename BASENAME BASENAME for all the output files
--uniref2go UNIREF2GO
uniref to GO term mapping file
--log-level LOG_LEVEL
Choices: [DEBUG, INFO, WARNING, ERROR, CRITICAL]
--tshld-abund TSHLD_ABUND
[X] Percentile Cutoff for Abundance; Default=75th
--tshld-prev TSHLD_PREV
Percentile cutoff for Prevalence
--beta BETA Beta parameter for weights on percentiles
--version prints the version
-i or --input-table:
Gene families abundance table containing annotated gene abundance values in RPK or relative abundance. Such tables can be obtained using ppanini_gene_caller. See the mock gene table as an example. ppanini/input/mock_gene_table.tsv- HUMAnN2 genefamilies output also can be used as input for PPANINI
#!text
#SAMPLES SAMPLE_1 SAMPLE_2 SAMPLE_3
UniRef90_XYZ 0.09 0.00 0.03
UniRef90_MNP 0.00 0.09 0.0
--output-folder
: folder containing all the output files--basename
: name prefix for all intermediate output files produced--log-level
: level of debugging information to be provided; Choices: [DEBUG, INFO, WARNING, ERROR, CRITICAL]--beta
: The parameter used in the methodology to prioritize genes via their metagenomic properties.--tshld_prev
: Percentile cut-off for prevalence--tshld_abund
: Percentile cut-off for abundance
Returns a list of "important" genes.
#!text
#GENEID PREVALENCE MEAN_ABUNDANCE
geneID_XYZ 0.25 0.05
For NICHE-SPECIFIC ANALYSIS:
#!text
#GENEID MEAN_ABUNDANCE ALPHA_PREVALENCE_NICHEX ALPHA_PREVALENCE_NICHEY BETA_PREVALENCE
geneID_XYZ 0.05 0.35 0.50 0.42
- Basic usage:
$ ppanini_barplot -i1 $PPANINI_GENE_INPUT.txt-i2 $PPANINI_OUTPUT_TABLE.txt
$PPANINI_GENE_INPUT.txt
= a gene family abundance table used by PPANINI (it also in the temp directory in the output$PPANINI_OUTPUT_TABLE.txt
= the PPANINI output table- Run with
-h
to see additional command line options
Produces a stacked bar plot for gene families characterization.
ppanini_barplot -i1 temp/ppanini_abundance_table.txt -i2 ppanini_table.txt
usage: ppanini_barplot [-h] [-i1 <input table>] [-i2 <input table>]
[--summary-table SUMMARY_TABLE] [-o <feature id>]
PPANINI plotting tool
optional arguments:
-h, --help show this help message and exit
-i1 <input table>, --ppanini-input <input table>
Gene abundance table
-i2 <input table>, --ppanini-output <input table>
PPANINI output table
--summary-table SUMMARY_TABLE
Summary table
-o <feature id>, --output <feature id>
output plot)
Gene families abundance can be converted to relative abundance or “copies per million” with humann2_renorm_table script, while the unit of abundance doesn't effect the PPANINI results as PPANINI uses the rank of abundances internally.
There are two main steps to get gene families abundant: 1) Generating gene abundance for each sample, 2) clustering genes, including known genes that they map to UniRef90 and an uncharacterized gene with 90% homology-based similarity.
To get the abundance of genes for a sample, short read file (FASTQ or FASTA) and contig file (FNA) is needed. Please use the same output name for all samples in this step to be used as input for Step 3. For example for all sample use -o ppanini_genecaller_output
and use sample name as basename for the files using --output-basename $SAMPLE_ID
where $SAMPLE_ID holds the name of sample or sample file.
#!cmd
usage: ppanini_gene_caller [-h] -i CONTIG -f FASTQ -o OUTPUT
[--output-basename <sample_name>] -u UNIREF [-r]
[--threads <1>] [--one-contig]
PPANINI gene caller
optional arguments:
-h, --help show this help message and exit
-i CONTIG, --contig CONTIG
contigs file (fna)
-f FASTQ, --fastq FASTQ
reads file (fastq)
-o OUTPUT, --output OUTPUT
Path for outputs
--output-basename <sample_name>
the basename for the output files
[DEFAULT: input file basename]
-u UNIREF, --uniref UNIREF
UniRe90 database
-r, --resume bypass commands if the output files exist
--threads <1> number of threads/processes
[DEFAULT: 1]
--one-contig If there is only one contig file for all samples, then this option should be provided
#!cmd
usage: ppanini_join_tables [-h] [-v] -i INPUT -o OUTPUT
[--file_name FILE_NAME] [-s]
[--mapping-uniref MAPPING_UNIREF]
[--mapping-cluster MAPPING_CLUSTER] [-r]
[--scale {rpk,count}]
Join gene, pathway, or taxonomy tables
optional arguments:
-h, --help show this help message and exit
-v, --verbose additional output is printed
-i INPUT, --input INPUT
the directory of tables
-o OUTPUT, --output OUTPUT
the table to write
--file_name FILE_NAME
only join tables with this string included in the file name
-s, --search-subdirectories
search sub-directories of input folder for files
--mapping-uniref MAPPING_UNIREF
Mapping file: gene to uniref90 file
--mapping-cluster MAPPING_CLUSTER
Mapping file: cluster to genes file
-r, --resume bypass commands if the output files exist
--scale {rpk,count} scale the abundance table
After running panini_gene_caller for all samples, the results can be merged into a table with samples as columns and genes as rows:
- make a directory (e.g.
tables
) and copy all gene abundance there. for example, you can copy all TXT file under the previouse step to a folder.mkdir tables
cp ppanini_gene_caller_output/*.txt tables
#!cmd
ppanini_join_tables -i tables/ -o gene_abundance.tsv
To generate gene families, panini_press puts genes that map to the same UniRef90 ID into the same gene family and for genes that don't map sufficiently to a UniRef90 ID, it clusters them with the same criteria that have been used for clustering and mapping into UniRef90 database. For the abundance of gene families, ppanini_press sums up the abundance of all genes belonging to that gene family.
Input: 1) a path to the Step 1's output ppanini_gene_caller
which includes TXT, GFF, and FAA files for all samples. 2) the path to UniRef90 database.
#!cmd
usage: ppanini_press [-h] -i GENE_PATH [-o OUTPUT] [-r] [--threads THREADS]
[--scale {rpk,count}] [--memory CD_HIT_MEMORY]
PPANINI Press: clusters genes to gene families including annotated genes to UniRef90 and homology-based clustered genes.
optional arguments:
-h, --help show this help message and exit
-i GENE_PATH, --gene-path GENE_PATH
a directory path to ppanini_gene_caller outputs which includes txt, gff, and faa files for each sample.
-o OUTPUT, --output OUTPUT
Path for outputs
-r, --resume bypass commands if the output files exist
--threads THREADS number of threads/processes
[DEFAULT: 1]
--scale {rpk,count} scale the abundance table
--memory CD_HIT_MEMORY
memory for -M option in CD-Hit
Input files description:
The input for ppanini_press
is the output directory from ppanini_gene_caller:
-
GENE_COUNTS: featureCounts output files for all samples from the ppanini_gene_caller step. For each sample, we have $SAMPLE_FILENAME.txt under ppanini_gene_caller output.
-
mapping file for sufficiently mapped genes to UniRef90: These files are under
hits
directory in the ppanini_gene_caller output. -
GENE_SEQUNCES: prodigal output files for genes that didn't map sufficiently to UniRef90 in ppanini_gene_caller. These files are under
no_hits
directory. For each sample, we have$SAMPLE_FILENAME_no_hits.faa
underno_hits
directory under ppanini_gene_caller output.
** Basic usage:**
$ ppanini_press -i ppanini_gene_caller_output -o ppanini_press_output --threads 4
** Gene families table as output:**
The final result of ppanini_press is a table for gene families, where rows are gene families (UniRefs or homology-based clusters), and columns are sample IDs. To see the table you can use:
$ less -S ppanini_press_output/gene_families_table.txt
Which yields (abbreviated):
# header SRS015051 SRS015269 SRS015290 SRS015430 SRS015450 SRS015584 SRS015640 SRS015752 SRS015793 SRS015937 SRS015996
Cluster 0 0 100.103199174 0 0 0 0 0 0 0 0 0
Cluster 10003 0 0 0 0 0 0 0 78.5440613027 0 0 0
Cluster 1001 0 0 0 0 0 0 0 0 0 429.018136335 0
Cluster 10013 0 0 0 0 0 0 0 112.847222222 0 0 0
Cluster 1002 0 0 0 0 0 0 0 0 0 160.725453408 0
.
.
.
UniRef90_Y9BN98 0 0 0 0 0 0 0 0 0 76.1904761905 0
UniRef90_Z3N8E7 0 0 0 0 0 0 0 0 0 50.7246376812 0
UniRef90_Z6R8L8 0 0 0 0 0 0 0 0 0 62.7177700348 0
UniRef90_Z7U7N3 0 0 0 0 0 0 0 34.1207349081 0 0 0
UniRef90_Z8BCZ8 0 0 0 0 0 0 0 0 0 0 80.8080808081
This file can be used as input for PPANINI. There are other important files that could be used for further down stream analyses:
** 1) map_cluster_gene.txt:** this file include a mapping file from artificial cluster names to its belonging genes that they have no sufficient hit to UniRef database.
** 2) map_uniref_gene.txt:**: this file includes a mapping file from UniRef ID to its belonging genes.
** 3) no_hits.faa:** This file includes gene sequences for genes with no sufficient hits in UniRef database.
** 4) hits.faa:** This file includes gene sequences for genes with sufficient hits in UniRef database.
** 5) genes.faa:** This file includes gene sequences for all genes.
** 6) no_hits_reads.clust90 no_hits_reads.clust90.clstr:** CD-Hit output for genes with no sufficient hits in UniRef database.
** 7) genes.uniref90hits:** Diamond output of mapping genes to UniRef90.
** 8) hits.txt:** mapping UniRef IDs to genes with sufficient hits.
** 9) no_hits:** genes with no sufficient hits to UniRef90 database.
A gene family abundance table can be obtained from HUMAnN2 output. HUMAnN2 is a reference based approach which means we get the abundance only for genes that have a hit or reference in the gene database, and there is no information for uncharacterized genes except a row that shows the overall abundance of all reads that they haven't had a hit in the gene reference database. To use HUMAnN2 outputs for all your samples, the following steps need to be performed:
-
Run HUMAnN2 each sample to get gene families output for all your samples. here is help for how to run HUMAnN2.
-
Merge gene families abundance table for all sample using humann2_join_tables.
-
Create unstratified gene family abundance table from gene family abundance table using humann2_split_stratified_table. Use the
genefamilies_unstratified.tsv
File form the output folder. Gene families from HUMAnN2 outputs are stratified by species, but for PPANINI we need use unstratified gene families.
Thanks go to these wonderful people: