Jeff Bowman edited this page Mar 21, 2022 · 7 revisions

paprica - PAthway PRediction by phylogenetIC plAcement (formerly Genome Finder)

Please contact jsbowman@ucsd.edu with any questions.

Recent changes

  • epa-ng and gappa are now used in place of pplacer and guppy.
  • paprica now uses a two-step approach for phylogenetic placement for bacteria and eukaryotes (but not for archaea). Reads are first placed on a master tree composed of representatives of each phylum or division. Reads are then placed on the corresponding subtree, and the metabolic inference is determined from the placement of reads on that subtree.
  • The bacterial and archaeal phylogenetic trees are now based on concatenated alignments of the 16S rRNA and 23S rRNA genes.
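The routing between the two placement steps described in the list above can be pictured with a minimal sketch. This is not paprica's code: the function name, read IDs, and clade names are hypothetical, and the placement itself (done by epa-ng in paprica) is assumed to have already produced a read-to-clade assignment on the master tree.

```python
from collections import defaultdict


def route_reads(master_placements):
    """Group read IDs by the master-tree clade they were placed in.

    master_placements is a hypothetical dict of read ID -> clade name,
    standing in for the result of the first placement step.
    """
    by_clade = defaultdict(list)
    for read_id, clade in master_placements.items():
        by_clade[clade].append(read_id)
    return dict(by_clade)


# Illustrative first-step result (made-up reads and clades).
master_placements = {
    "read1": "Proteobacteria",
    "read2": "Firmicutes",
    "read3": "Proteobacteria",
}

# Each group would then be placed on the subtree for its clade.
routed = route_reads(master_placements)
```

Each group in `routed` corresponds to one subtree placement in the second step; the metabolic inference would be drawn from those subtree placements, not from the master tree.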

Caveat and Request

Paprica started as the analysis workflow for a paper I was writing. As it grew in complexity, it became clear that it should be spun off as an independent pipeline. Version 0.11 reflects this origin. Version 0.20 was my first attempt at making paprica user friendly. Versions 0.3 and 0.4 are continued evolutions that add further features. I’m really excited to keep improving the method and scripts. If you try to use paprica and run into difficulty, please don’t give up! Create an issue on GitHub and I’ll do my best to fix it. Similarly, if you have a suggestion to improve paprica, either create an issue or shoot me an email.

To cite paprica, please follow the citation guidelines here.

What should you do if something goes wrong?

Bioinformatics involves a lot of troubleshooting. This can be frustrating, but in the end it is a good thing because it teaches us how to do stuff. To get paprica to work you’ll have to install a few core bioinformatics programs. If you’re a Linux whiz this is trivial. If you’re not, it can be a little intimidating. If something goes wrong with the installation of one of the core programs (e.g. epa-ng, gappa, Infernal, Seqmagick, pathway-tools, RAxML), please refer to the documentation for that program and direct any questions to its development team. If something seems wrong with paprica itself (e.g. you’ve verified that Infernal is installed and running, but paprica throws an error trying to use it), open an issue on GitHub. I’ll get on it as quickly as possible.
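One quick way to verify that the core programs are installed before opening an issue is to check whether their executables are on your PATH. The sketch below is only an illustration: the executable names in `CORE_PROGRAMS` are assumptions (for example, `cmalign` is one of the Infernal commands) and may differ on your system or for your paprica version.

```python
import shutil

# Assumed executable names; adjust to match your own installation.
CORE_PROGRAMS = ["epa-ng", "gappa", "cmalign", "seqmagick", "pathway-tools"]


def find_missing(programs):
    """Return the names in programs that cannot be found on the PATH."""
    return [name for name in programs if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing(CORE_PROGRAMS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed programs were found on PATH.")
```

A program that is found on the PATH can still be misconfigured, so this only rules out the most basic installation problem.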

If you’re having trouble with basic installation tasks that result from idiosyncrasies of your system, or lack of familiarity with Linux/Unix, I’m happy to help. I do ask, however, that you take an active role in solving these problems. Very often the information you need is already out there on various forums.

The basics

Paprica is a pipeline to determine community structure and conduct a metabolic inference on a collection of 16S rRNA gene sequences. Given a set of 16S rRNA gene sequences, the pipeline returns a collection of metabolic pathways and enzymes with EC numbers that are expected for the observed taxa. It also returns some useful expected genome parameters, including genomic plasticity, size, number of coding sequences, 16S rRNA gene copy number, and number of genetic elements (see the Output Files section for a complete list). For bacteria and archaea, the abundance of each genome, EC number, and metabolic pathway is corrected for 16S rRNA gene copy number. In this way paprica gives direct access to both community structure and metabolic structure data for your collection of 16S rRNA gene libraries. Paprica was designed with the analysis of large 16S rRNA gene sequence libraries in mind, such as those generated by 454 or Illumina sequencing, but is also appropriate for small datasets.
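To illustrate why the copy number correction matters, here is a minimal sketch that divides raw read counts by the 16S rRNA gene copy number of each taxon. It assumes the correction is a simple division, which may not match paprica's exact implementation; the function name, taxon names, and numbers are made up for the example.

```python
def correct_for_copy_number(read_counts, copy_numbers):
    """Divide each taxon's read count by its 16S rRNA gene copy number.

    Both arguments are hypothetical dicts keyed by taxon name.
    """
    return {
        taxon: count / copy_numbers[taxon]
        for taxon, count in read_counts.items()
    }


# Illustrative values only: taxon_A carries 7 copies of the gene, taxon_B 1.
read_counts = {"taxon_A": 700, "taxon_B": 100}
copy_numbers = {"taxon_A": 7, "taxon_B": 1}

corrected = correct_for_copy_number(read_counts, copy_numbers)
print(corrected)
```

In this toy case taxon_A has seven times as many reads as taxon_B, but after correction the two have the same estimated abundance, because each taxon_A genome contributes seven copies of the gene.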

There are two options for running paprica. If you would like to generate the database from scratch, use the guidelines in the paprica-build.sh script and follow this tutorial. Building the database from scratch isn’t recommended unless you have a specific reason for wanting to do so. If you would like to skip the (time- and space-intensive) database build, you can ignore this step and simply use paprica-run.sh. Reasons you might want to build the database from scratch include:

  • You can update it from GenBank as often as you like.
  • You have access to your own collection of PGDBs which you can explore with the pathway-tools software.
  • You can add custom draft genomes to it that are not included in the pool of completed GenBank genomes.
  • You can assign EC numbers to genomes if they are missing.