Pathogenicity prediction of bacterial genomes
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.


Pathogenicity prediction for bacterial genomes is a random forest based method for the assessment of the pathogenic potential of a set of reads belonging to a single genome. Its strength lies in the prediction of novel, unknown bacterial pathogens.

The related paper is available at

*** News


# install.packages("devtools")

(Installation might require some time due to the package's dependencies)

Due to changes of the dependencies, some problems occurred when installing paprbag and using the original data. In particular, different ranger versions are not compatible. Since the original data were trained with ranger 0.3 they can only be used when the same ranger version is installed on your system.

The original version of paprbag is still available under the legacy branch.

The data made available in this branch, as well as the release data, should work with recent ranger versions.

For full reproducibility, the packages used for building the data sets are provided as a packrat bundle. The bundle can be downloaded from the release

They can be installed via

# Download the bundle
path2bundle <- "/path/to/bundle/" 
packrat::unbundle(bundle=path2bundle, where="/path/to/my/project")

Alternatively you can unpack the bundle, change into the new directory and start an R session. This will trigger the installation of the bundled packages. See here for more details about packrat.

The packages versions are also listed here. They were installed under R version 3.4.4.

Label data

The resource of the inferred pathogenicity labels can be found in labels

The label data together with the strain's Bioproject accession can be accessed by data("Labels")

Original data

The large classifiers used in the publication are provided in the R package data4paprbag. They were trained with ranger version 0.3. Current data is available in the (release tab)[link]

Basic usage

Make prediction from read set

library(ranger) # load ranger package explicitly

# Or read in sequence data via Biostrings::readDNAStringSet()

Prediction <- Predict.ReadSet (ForestObject = forest, ReadObject = ReadData, Feature.Configuration = NULL, verbose = T)

Quick analysis

# Histogram of pathogenic potential

# Mean pathogenic potential

Operate prediction via file paths

Predict.ReadSet.fromFiles (Path2Forest = Path2Forest, Path2ReadFiles = Path2ReadFiles, saveLocation = saveLocation, OutputFilename = OutputFilename, Return.Predictions = F, Save.AsTSV = F, verbose = T, num.threads = 1)

Predicting real data

In this section, we describe how paprbag can be applied to real data based on the pre-trained forests located in the release.

Available data

The following forests are available in the release tab:

Furthermore, the forests related to the 5-fold cross validation in (paprbag)[link] can be found here:


# download forest from release
# e.g. using wget in linux:
# wget
# or using your browser

# define your path to the previously saved forest
Path2Forest <- path/to/forest

# define independent prediction data
Path2ReadFiles <- path/to/independent_test_data_in_fasta_format

# define output
OutputFilename <- "prediction_paprbag"
saveLocation <- "Predictions/example"

Predictions_paprbag <- Predict.ReadSet.fromFiles (Path2Forest = Path2Forest, Path2ReadFiles = Path2ReadFiles, saveLocation = saveLocation, OutputFilename = OutputFilename, Return.Predictions = T, Save.AsTSV = F, num.threads = 20, verbose = T)

alternative: loading forest and read data into R

  • Use case: Avoids re-loading of large forests when predicting many fasta files

# download forest from release

# load forest
forest_large <- readRDS(file.path(forest_nuc))

# read file
Readfile <- Readfiles[1]
ReadData_real <- Biostrings::readDNAStringSet(Readfile)

# predict
Prediction_forest_nuc_realdata <- Predict.ReadSet (ForestObject = forest_large <- , ReadObject = ReadData_real, Feature.Configuration = NULL, verbose = T, num.threads = 20)
  • note: the prediction function can be called with a number of threads via the num.threads option, however it still predicts read per read in a linear fashion. If a number of cores is available, the prediction process can be sped up by diving the reads in read chunks and joining the prediction results.

Quick evaluation


ifelse(mean(Prediction_forest_nuc_realdata$predictions[,2])> 0.5, "Pathogenic", "Non-Pathogenic")

Feature configuration

You can pass a value of Feature.Configuration in function Predict.ReadSet. Choose between:

# load standard configuration; used in toy forest and classifier_all.rds, classifier_fold1.rds, etc.
# load standard configuration including peptide features; used in forest classifier_all_includingPeptideFeatures.rds
# infer feature configuration directly from forest
my.configuration <- paprbag:::IdentifyFeatures.FromForest(ForestObject = forest_new)

And set e.g. Feature.Configuration = Standard.configuration

Advanced usage

Feature extraction

Extract features from a set of reads

# Or read in sequence data via Biostrings::readDNAStringSet()

Features <- CreateFeaturesFromReads(Reads=ReadData, NT_kmax = 4, Do.NTMotifs = T)

Create features for all fasta files in a given directory

  data("Standard.configuration") # load standard configuration
  Standard.configuration$Path2ReadFiles <- Path2ReadFiles # add path to Read files
  Standard.configuration$savePath <- "Features" # add save path,Standard.configuration)

  # or
  UpdateFeatures (Path2ReadFiles = Path2ReadFiles,savePath = file.path(Path2ReadFiles,"Features") ,Cores = 5, verbose = T)

Create training set

Join all feature files in a directory and create read labels for every data row by matching to bioproject accessions.

 #Labels: A vector containing either 'TRUE' or 'FALSE' together with the name attributes pointing to the Organism identifier (Bioproject ID)
 Create.TrainingDataSet (Path2Files = "Features", pattern="Features",OSlabels = Labels, savePath = "TrainingData")

  # patterns: Regex-Pattern that identifies the feature files


Train a pathogenicity classifier from a set of feature data together with read label information

# Please specify paths:
# Path2FeatureFile:  path to combined label and feature file (in rds format)
# savePath: path where trained classifier is saved

# example path
Tempdir <- tempdir()
Path2FeatureFile <- file.path(Tempdir,"trainingData.rds")
savePath <- file.path(Tempdir,"classifier")

# Run with path information
Run.Training (Path2FeatureFile = Path2FeatureFile, savePath = savePath)

More options

# Specify
# Path2FeatureFile: Path to feature data only
# Path2LabelFile: Path to label file with labels for every data row in 'Path2FeatureFile'

forest <- Run.Training (Path2FeatureFile = Path2FeatureFile, Path2LabelFile = Path2LabelFile, SelectedFeatures = NULL, savePath = NULL, ReturnForest = T, verbose = T, min.node.size = 1, num.trees = 100)

Training a classifier with other types of label data

As mentioned in the original publication, PaPrBaG is not restricted for classification of pathogens only.

It is rather a general workflow for the classification of labeled genomes, potential further applications range from bacterial host and habitat prediction, taxonomic classification to human and microbial read separation

Basically the same workflows as described here can be applied and also the features need not be re-calculated. It suffices to

  • either provide a new path to a label file Path2NewLabelFile with labels for every data row in Path2FeatureFile
Run.Training (Path2FeatureFile = Path2FeatureFile, Path2LabelFile = Path2NewLabelFile, savePath = NewsavePath)
  • or manipulate the column "Labels" in the data.frame "Path2FeatureFile". E.g.
trainingData$Labels <- as.factor(c(rep("NewLabel_A",floor(nrow(trainingData)/2)),rep("NewLabel_B",ceiling(floor(nrow(trainingData)/2)) )) )
saveRDS(trainingData, NewPath2FeatureFile) # save 
Run.Training (Path2FeatureFile = NewPath2FeatureFile, savePath = NewsavePath)

In case that read labels are not yet available, a new training set based on the new labels can be obtained via the command

Create.TrainingDataSet (Path2Files = "Features", pattern="Features",OSlabels = NewLabels, savePath = "TrainingData")

A proper vector NewLabels containing two (or more) labels (e.g. 'TRUE' or 'FALSE') together with the name attributes pointing to the Organism identifier (Bioproject ID) is required. It outputs a file "ReadLabel_OS.rds" in the savePath which can be used for running the function Run.Training subsequently.

Predictions obtained from the newly trained classifier provide classification according to the new label.