pipeComp is a simple framework to facilitate the comparison of pipelines involving various steps and parameters. It was initially developed to benchmark single-cell RNA sequencing pipelines:
pipeComp, a general framework for the evaluation of computational pipelines, reveals performant single-cell RNA-seq preprocessing tools
Pierre-Luc Germain, Anthony Sonrel & Mark D Robinson, Genome Biology 2020, doi: 10.1186/s13059-020-02136-7
However, the framework can be applied to any other context (see the pipeComp_dea vignette for an example). This readme provides an overview of the framework and package. For more detail, please refer to the vignettes.
pipeComp is especially suited to the benchmarking of pipelines that include many steps/parameters, enabling the exploration of combinations of parameters and of the robustness of methods to various changes in other parts of a pipeline. It is also particularly suited to benchmarks across multiple datasets. It is entirely based on R/Bioconductor, meaning that methods outside of R need to be called via R wrappers. pipeComp handles multithreading in a way that minimizes re-computation and duplicated memory usage, and computes evaluation metrics on the fly to avoid saving many potentially large intermediate files, making it well-suited for benchmarks involving large datasets.
This readme gives a very brief overview of the package. For more detailed information on the framework, refer to the pipeComp vignette. For information specifically about the scRNAseq pipeline and evaluation metrics (as well as more complex example usages of the plotting functions), see the pipeComp_scRNA vignette. For a completely different example, with a walkthrough of creating a new PipelineDefinition, see the pipeComp_dea vignette.
- In pipeComp 0.99.43, it is now possible to continue runs despite errors (see the skipErrors argument of runPipeline, and the 'Handling errors' section of the pipeComp vignette).
- From pipeComp 0.99.26 on, the plotting functions for the scRNAseq clustering pipeline (scrna_evalPlot_DR and scrna_evalPlot_clust) have been replaced by more flexible, pipeline-generic functions (evalHeatmap) and a silhouette-specific plotting function (scrna_evalPlot_silh). The general heatmap coloring scheme has also been changed to make meaningful changes clearer.
- In pipeComp 0.99.24, multithreading capacities have been extended (now virtually no limit).
- pipeComp >=0.99.3 made important changes to the format of the output, and greatly simplified the evaluation outputs for the scRNA pipeline. As a result, results produced with older versions of the package are no longer compatible with the current version's aggregation and plotting functions.
Install using:
BiocManager::install("plger/pipeComp", build_vignettes=TRUE)
Due to Bioconductor standards, pipeComp requires R>=4, but it is actually compatible with R>=3.6.1 (users who have not yet moved to R4 can use the R3.6 branch).
Because pipeComp was meant as a general pipeline benchmarking framework, we have tried to restrict the package's dependencies to a minimum. Using the scRNA-seq pipeline and wrappers, however, requires further packages to be installed. To check whether these dependencies are met for a given PipelineDefinition and set of alternatives, see ?checkPipelinePackages.
As represented in the figure above, the PipelineDefinition S4 class represents pipelines as, minimally, a set of functions (accepting any number of parameters) consecutively executed on the output of the previous one, and optionally accompanied by evaluation and aggregation functions. A simple pipeline can be constructed as follows:
my_pip <- PipelineDefinition( list(
  step1=function(x, param1){
    # do something with x and param1
    x
  },
  step2=function(x, method1, param2){
    get(method1)(x, param2)
  },
  step3=function(x, param3){
    x <- some_fancy_function(x, param3)
    # the functions can also output evaluation
    # through the `intermediate_return` slot:
    e <- my_evaluation_function(x)
    list( x=x, intermediate_return=e)
  }
))
The PipelineDefinition can also include descriptions of each step or evaluation and aggregation functions. For example:
my_pip <- PipelineDefinition(
  list( step1=function(x, meth1){ get(meth1)(x) },
        step2=function(x, meth2){ get(meth2)(x) } ),
  evaluation=list( step2=function(x){ sum(x) }) )
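To make the idea of consecutive steps with an attached evaluation more concrete, here is a minimal base-R sketch that runs two such steps by hand for a single parameter combination. It does not call any pipeComp function and is not how runPipeline is implemented; the toy input, the `params` list and the choice of `sqrt` and `cumsum` as methods are assumptions made purely for illustration.

```r
# Plain base R: mimics the step/evaluation structure without using pipeComp.
steps <- list(
  step1 = function(x, meth1) get(meth1)(x),
  step2 = function(x, meth2) get(meth2)(x)
)
evaluation <- list(step2 = function(x) sum(x))

# One hypothetical combination of parameter values (functions passed by name)
params <- list(step1 = list(meth1 = "sqrt"),
               step2 = list(meth2 = "cumsum"))

x <- c(1, 4, 9)   # toy input, stands in for a dataset
evals <- list()
for (s in names(steps)) {
  # each step receives the previous step's output plus its own parameters
  x <- do.call(steps[[s]], c(list(x), params[[s]]))
  # evaluate the step's output if an evaluation function exists for that step
  if (!is.null(evaluation[[s]])) evals[[s]] <- evaluation[[s]](x)
}
```

After the loop, `x` holds the output of the last step and `evals` holds one evaluation result per evaluated step, which is roughly the kind of information that gets aggregated across parameter combinations.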
See ?PipelineDefinition for more information, or scrna_pipeline for a more complex example:
pipDef <- scrna_pipeline()
pipDef
A PipelineDefinition object with the following steps:
- doublet(x, doubletmethod) *
Takes a SCE object with the `phenoid` colData column, passes it through the
function `doubletmethod`, and outputs a filtered SCE.
- filtering(x, filt) *
Takes a SCE object, passes it through the function `filt`, and outputs a
filtered Seurat object.
- normalization(x, norm)
Passes the object through function `norm` to return the object with the
normalized and scale data slots filled.
- selection(x, sel, selnb)
Returns a seurat object with the VariableFeatures filled with `selnb` features
using the function `sel`.
- dimreduction(x, dr, maxdim) *
Returns a seurat object with the PCA reduction with up to `maxdim` components
using the `dr` function.
- clustering(x, clustmethod, dims, k, steps, resolution, min.size) *
Uses function `clustmethod` to return a named vector of cell clusters.
A number of generic methods are implemented on the object, including show, names, length, [, and as.list. This means that, for instance, a step can be removed from a pipeline in the following way:
pd2 <- pipDef[-1]
Steps can also be added (using the addPipelineStep function) and edited; see the pipeComp vignette for more detail:
vignette("pipeComp", package="pipeComp")
runPipeline requires three main arguments: i) the PipelineDefinition, ii) the list of alternative parameter values to try, and iii) the list of benchmark datasets.
The scRNAseq datasets used in the paper can be downloaded from figshare and prepared in the following way:
download.file("https://ndownloader.figshare.com/articles/11787210/versions/1", "datasets.zip")
unzip("datasets.zip", exdir="datasets")
datasets <- list.files("datasets", pattern="SCE\\.rds", full.names=TRUE)
names(datasets) <- sapply(strsplit(basename(datasets),"\\."),FUN=function(x) x[1])
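The last line above names each dataset after the part of its file name that precedes the first dot. As a small illustration, the same expression is applied below to two made-up file paths (the names `datasetA` and `datasetB` are hypothetical and do not correspond to the actual files in the archive):

```r
# Hypothetical file paths, for illustration only
datasets <- c("datasets/datasetA.SCE.rds", "datasets/datasetB.SCE.rds")

# Keep the part of each file name before the first dot
names(datasets) <- sapply(strsplit(basename(datasets), "\\."),
                          FUN=function(x) x[1])
datasets
```

The result is a named character vector of file paths, with the dataset names as the vector's names.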
Next we prepare the alternative methods and parameters. Functions can be passed as arguments through their name (if they are loaded in the environment):
# load alternative functions
source(system.file("extdata", "scrna_alternatives.R", package="pipeComp"))
# we build the list of alternatives
alternatives <- list(
  doubletmethod=c("none"),
  filt=c("filt.lenient", "filt.stringent"),
  norm=c("norm.seurat", "norm.sctransform", "norm.scran"),
  sel=c("sel.vst"),
  selnb=2000,
  dr=c("seurat.pca"),
  clustmethod=c("clust.seurat"),
  dims=c(10, 15, 20, 30),
  resolution=c(0.01, 0.1, 0.2, 0.3, 0.5, 0.8, 1, 1.2, 2)
)

res <- runPipeline( datasets, alternatives, pipDef, nthreads=3,
                    output.prefix="myfolder/" )
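To get a sense of the size of such a benchmark before launching it, the number of parameter combinations defined by a list of alternatives can be counted in base R. The sketch below repeats the alternatives from above and assumes that every combination is evaluated on every dataset; it uses only `expand.grid` and does not call pipeComp.

```r
# Same alternatives as above, repeated so that the snippet is self-contained
alternatives <- list(
  doubletmethod=c("none"),
  filt=c("filt.lenient", "filt.stringent"),
  norm=c("norm.seurat", "norm.sctransform", "norm.scran"),
  sel=c("sel.vst"),
  selnb=2000,
  dr=c("seurat.pca"),
  clustmethod=c("clust.seurat"),
  dims=c(10, 15, 20, 30),
  resolution=c(0.01, 0.1, 0.2, 0.3, 0.5, 0.8, 1, 1.2, 2)
)

# One row per combination of parameter values
combs <- expand.grid(alternatives, stringsAsFactors=FALSE)
n_combs <- nrow(combs)   # equals prod(lengths(alternatives))
head(combs)
```

Under that assumption, the total number of pipeline variants to evaluate is `n_combs` multiplied by the number of datasets, which is a useful figure when choosing `nthreads`.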
Data can be explored manually or plotted using generic or pipeline-specific functions. For example:
scrna_evalPlot_silh( res )
evalHeatmap( res, step="dimreduction", what2="meanAbsCorr.covariate2",
             what=c("log10_total_features","log10_total_counts") )
The plotting functions let you choose the parameters by whose values the results are aggregated (agg.by), and also accept custom filtering (filter):
evalHeatmap(res, step = "clustering", what=c("MI","ARI"), agg.by=c("filt","norm")) +
  evalHeatmap(res, step = "clustering", what="ARI", agg.by=c("filt", "norm"),
              filter=n_clus==true.nbClusts, title="ARI at\ntrue k")
See the vignettes and the functions' help pages for more details.