Skip to content
Branch: master
Find file History
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information.

Datasets containing trajectories

Real datasets

Figure 1: Main characteristics of the real datasets

Figure 2: Dimensions of the real datasets

Synthetic datasets

To generate synthetic datasets, we used four different synthetic data simulators:

  • dyngen: Simulations of gene regulatory networks, available at
  • dyntoy: Random gradients of expression in the reduced space, available at
  • PROSSTT: Expression is sampled from a linear model which depends on pseudotime1
  • Splatter: Simulations of non-linear paths between different expression states2

For every simulator, we took great care to make the datasets as realistic as possible. To do this, we extracted several parameters from all real datasets. We calculated the number of differentially expressed features within a trajectory using a two-way Mann–Whitney U test between every pair of cell groups. These values were corrected for multiple testing using the Benjamini-Hochberg procedure (FDR < 0.05) and we required that a gene was expressed in at least 5% of cells, and had at least a fold-change of 2. We also calculated several other parameters, such as drop-out rates and library sizes using the Splatter package2. These parameters were then given to the simulators when applicable, as described for each simulator below. Not every real dataset was selected to serve as a reference for a synthetic dataset. Instead, we chose a set of ten distinct reference real datasets by clustering all the parameters of each real dataset, and used the reference real datasets at the cluster centers from a pam clustering (with , implemented in the R cluster package) to generate synthetic data.


The dyngen ( workflow to generate synthetic data is based on the well established workflow used in the evaluation of network inference methods3,4 and consists of four main steps: network generation, simulation, gold standard extraction and simulation of the scRNA-seq experiment. At every step, we tried to mirror real regulatory networks, while keeping the model simple and easily extendable. We simulated a total of 110 datasets, with 11 different topologies.

Network generation

One of the main processes involved in cellular dynamic processes is gene regulation, where regulatory cascades and feedback loops lead to progressive changes in expression and decision making. The exact way a cell chooses a certain path during its differentiation is still an active research field, although certain models have already emerged and been tested in vivo. One driver of bifurcation seems to be mutual antagonism, where genes5 strongly repress each other, forcing one of the two to become inactive6. Such mutual antagonism can be modelled and simulated7,8. Although such a two-gene model is simple and elegant, the reality is frequently more complex, with multiple genes (grouped into modules) repressing each other9.

To simulate certain trajectory topologies, we therefore designed module networks in which the cells follow a particular trajectory topology given certain parameters. Two module networks generated linear trajectories (linear and linear long), one generated a bifurcation, one generated a convergence, one generated a multifurcation (trifurcating), two generated a tree (consecutive bifurcating and binary tree), one generated an acyclic graph (bifurcating and converging), one generated a complex fork (trifurcating), one generated a rooted tree (consecutive bifurcating) and two generated simple graph structures (bifurcating loop and bifurcating cycle). The structure of these module networks is available at

From these module networks we generated gene regulatory networks in two steps: the main regulatory network was first generated, and extra target genes from real regulatory networks were added. For each dataset, we used the same number of genes as were differentially expressed in the real datasets. 5% of the genes were assigned to be part of the main regulatory network, and were randomly distributed among all modules (with at least one gene per module). We sampled edges between these individual genes (according to the module network) using a uniform distribution between 1 and the number of possible targets in each module. To add additional target genes to the network, we assigned every regulator from the network to a real regulator in a real network (from regulatory circuits10), and extracted for every regulator a local network around it using personalized pagerank (with damping factor set to 0.1), as implemented in the page_rank function of the igraph package.

Simulation of gene regulatory systems using thermodynamic models

To simulate the gene regulatory network, we used a system of differential equations similar to those used in evaluations of gene regulatory network inference methods4. In this model, the changes in gene expression () and protein expression () are modeled using ordinary differential equations3 (ODEs):

where , , and represent production and degradation rates, the ratio of which determines the maximal gene and protein expression. The two types of equations are coupled because the production of protein depends on the amount of gene expression , which in turn depends on the amount of other proteins through the activation function .

The activation function is inspired by a thermodynamic model of gene regulation, in which the promoter of a gene can be bound or unbound by a set of transcription factors, each representing a certain state of the promoter. Each state is linked with a relative activation , a number between 0 and 1 representing the activity of the promoter at this particular state. The production rate of the gene is calculated by combining the probabilities of the promoter being in each state with the relative activation:

The probability of being in a state is based on the thermodynamics of transcription factor binding. When only one transcription factor is bound in a state:

where the hill coefficient represents the cooperativity of binding and the transcription factor concentration at half-maximal binding. When multiple regulators are bound:

where represents the cooperativity of binding between the different transcription factors.

is only proportional to because is normalized such that .

To each differential equation, we added an additional stochastic term:

with .

Similar to GeneNetWeaver3, we sample the different parameters from random distributions, defined as follows. defines whether a transcription factor activates (1) or represses (-1), as defined within the regulatory network network.

We converted each ODE to an SDE by adding a chemical Langevin equation, as described in3. These SDEs were simulated using the Euler–Maruyama approximation, with time-step and noise strength . The total simulation time varied between 5 for linear and bifurcating datasets, 10 for consecutive bifurcating, trifurcating and converging datasets, 15 for bifurcating converging datasets and 30 for linear long, cycle and bifurcating loop datasets. The burn-in period was for each simulation 2. Each network was simulated 32 times.

Simulation of the single-cell RNA-seq experiment

For each dataset we sampled the same number of cells as were present in the reference real dataset, limited to the simulation steps after burn-in. These cells were sampled uniformly across the different steps of the 32 simulations. Next, we used the Splatter package2 to estimate the different characteristics of a real dataset, such as the distributions of average gene expression, library sizes and dropout probabilities. We used Splatter to simulate the expression levels of housekeeping genes (to match the number of genes in the reference dataset) in every cell . These were combined with the expression levels of the genes simulated within a trajectory. Next, true counts were simulated using . Finally, we simulated dropouts by setting true counts to zero by sampling from a Bernoulli distribution using a dropout probability . Both (the midpoint for the dropout logistic function) and (the shape of the dropout logistic function) were estimated by Splatter.

This count matrix was then filtered and normalised using the pipeline described below.

Gold standard extraction

Because each cellular simulation follows the trajectory at its own speed, knowing the exact position of a cell within the trajectory topology is not straightforward. Furthermore, the speed at which simulated cells make a decision between two or more alternative paths is highly variable. We therefore first constructed a backbone expression profile for each branch within the trajectory. To do this, we first defined in which order the expression of the modules is expected to change, and then generated a backbone expression profile in which the expression of these modules increases and decreases smoothly between 0 and 1. We also smoothed the expression in each simulation using a rolling mean with a window of 50 time steps, and then calculated the average module expression along the simulation. We used dynamic time warping, implemented in the dtw R package11,12, with an open end to align a simulation to all possible module progressions, and then picked the alignment which minimised the normalised distance between the simulation and the backbone. In case of cyclical trajectory topologies, the number of possible milestones a backbone could progress through was limited to 20.


For more simplistic data generation (“toy” datasets), we created the dyntoy workflow ( . We created 12 topology generators (described below), and with 10 datasets per generator, this lead to a total of 120 datasets.

We created a set of topology generators, were denotes a binomial distribution, and denotes a uniform distribution:

  • Linear and cyclic, with number of milestones
  • Bifurcating and converging, with four milestones
  • Binary tree, with number of branching points
  • Tree, with number of branching points and maximal degree

For more complex topologies we first calculated a random number of “modifications” and a . For each type of topology, we defined what kind of modifications are possible: divergences, loops, convergences and divergence-convergence. We then iteratively constructed the topology by uniformly sampling from the set of possible modifications, and adding this modification to the existing topology. For a divergence, we connected an existing milestone to a number of a new milestones. Conversely, for a convergence we connected a number of new nodes to an existing node. For a loop, we connected two existing milestones with a number of milestones in between. Finally for a divergence-convergence we connected an existing milestone to several new milestones which again converged on a new milestone. The number of nodes was sampled from

  • Looping, allowed loop modifications
  • Diverging-converging, allowed divergence and converging modifications
  • Diverging with loops, allowed divergence and loop modifications
  • Multiple looping, allowed looping modifications
  • Connected, allowed looping, divergence and convergence modifications
  • Disconnected, number of components sampled from , for each component we randomly chose a topology from the ones listed above

After generating the topology, we sampled the length of each edge . We added regions of delayed commitment to a divergence in a random half of the cases. We then placed the number of cells (same number as from the reference real dataset), on this topology uniformly, based on the length of the edges in the milestone network.

For each gene (same number as from the reference real dataset), we calculated the Kamada-Kawai layout in 2 dimensions, with edge weight equal to the length of the edge. For this gene, we then extracted for each cell a density value using a bivariate normal distribution with and . We used this density as input for a zero-inflated negative binomial distribution with , and from the parameters of the reference real dataset, to get the final count values.

This count matrix was then filtered and normalised using the pipeline described below.


PROSSTT is a recent data simulator1, which simulates expression using linear mixtures of expression programs and random walks through the trajectory. We used 5 topology generators from dyntoy (linear, bifurcating, multifurcating, binary tree and tree), and simulated for each topology generator 10 datasets using different reference real datasets. However, due to frequent crashes of the tool, only 19 datasets created output and were thus used in the evaluation.

Using the simulate_lineage function, we simulated the lineage expression, with parameters , and . These parameter distributions were chosen very broad so as to make sure both easy and difficult datasets are simulated. After simulating base gene expression with simulate_base_gene_exp, we used the sample_density function to finally simulate expression values of a number of cells (the same as from the reference real dataset), with ( and ) and ( and ). Each of these parameters were centered around the default values of PROSSTT, but with enough variability to ensure a varied set of datasets.

This count matrix was then filtered and normalised using the pipeline described below.


Splatter2 simulates expression values by constructing non-linear paths between different states, each having a distinct expression profile. We used 5 topology generators from dyntoy (linear, bifurcating, multifurcating, binary tree and tree), and simulated for each topology generator 10 datasets using different reference real datasets, leading to a total of 50 datasets.

We used the splatSimulatePaths function from Splatter to simulate datasets, with number of cells and genes equal to those in the reference real dataset, and with parameters , and all sampled from .


1. Papadopoulos, N., Parra, R. G. & Soeding, J. PROSSTT: Probabilistic simulation of single-cell RNA-seq data for complex differentiation processes. bioRxiv 256941 (2018). doi:[10.1101/256941](

2. Zappia, L., Phipson, B. & Oshlack, A. Splatter: Simulation of single-cell RNA sequencing data. Genome Biology 18, 174 (2017).

3. Schaffter, T., Marbach, D. & Floreano, D. GeneNetWeaver: In silico benchmark generation and performance profiling of network inference methods. Bioinformatics (Oxford, England) 27, 2263–2270 (2011).

4. Marbach, D. et al. Wisdom of crowds for robust gene network inference. Nature methods 9, 796–804 (2012).

5. Xu, H. et al. Regulation of bifurcating B cell trajectories by mutual antagonism between transcription factors IRF4 and IRF8. Nature Immunology 16, 1274–1281 (2015).

6. Graf, T. & Enver, T. Forcing cells to change lineages. Nature 462, 587 (2009).

7. Wang, J., Zhang, K., Xu, L. & Wang, E. Quantifying the Waddington landscape and biological paths for development and differentiation. Proceedings of the National Academy of Sciences 108, 8257–8262 (2011).

8. Ferrell, J. E. Bistability, Bifurcations, and Waddington’s Epigenetic Landscape. Current Biology 22, R458–R466 (2012).

9. Yosef, N. et al. Dynamic regulatory network controlling Th17 cell differentiation. Nature 496, 461–468 (2013).

10. Marbach, D. et al. Tissue-specific regulatory circuits reveal variable modular perturbations across complex diseases. Nature Methods 13, 366 (2016).

11. Giorgino, T. Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package. Journal of Statistical Software 31, (2009).

12. Tormene, P., Giorgino, T., Quaglini, S. & Stefanelli, M. Matching incomplete time series with dynamic time warping: An algorithm and an application to post-stroke rehabilitation. Artificial Intelligence in Medicine 45, 11–34 (2009).

Dataset characterisation


Figure 1: The 1 distinct Cycle topology, with a total of 24 datasets.

Figure 2: The 1 distinct Linear topology, with a total of 75 datasets.

Figure 3: The 2 distinct Convergence topologies, with a total of 17 datasets.

Figure 4: The 3 distinct Bifurcation topologies, with a total of 40 datasets.

Figure 5: The 6 distinct Multifurcation topologies, with a total of 15 datasets.

Figure 6: The 39 distinct Tree topologies, with a total of 86 datasets.

Figure 7: The 13 distinct Acyclic graph topologies, with a total of 17 datasets.

Figure 8: The 19 distinct Connected graph topologies, with a total of 35 datasets.

Figure 9: The 21 distinct Disconnected graph topologies, with a total of 24 datasets.

You can’t perform that action at this time.