Dataset processing and characterisation
The datasets are split into real and synthetic datasets. The real datasets are downloaded and preprocessed first; their characteristics (such as the number of cells and genes, library sizes, dropout probabilities, …) are then used to generate the synthetic datasets. Finally, all datasets are characterised and uploaded to Zenodo.
- Downloading the processed datasets from Zenodo (10.5281/zenodo.1443566)
- Downloading the datasets from the cluster
- Uploading the datasets to Zenodo (10.5281/zenodo.1211532)
The results of this experiment are available here.
The generation of the real datasets is divided into two parts. We first download all the (annotated) expression files from sources such as GEO. Next, we filter and normalise all datasets and wrap them into the common trajectory format of dynwrap.
- Downloading the real datasets from their sources (e.g. GEO) and constructing the gold standard model, using the helpers in helpers-download_from_sources
- Filtering and normalising the real datasets using
- Gathering some metadata about all the real datasets
- Creating a table of the datasets in, excuse me, excel (for supplementary material)
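As a rough sketch, filtering a downloaded expression matrix and wrapping it, together with its gold standard trajectory, into the common dynwrap format could look as follows. The function names follow the dynwrap API (`wrap_expression`, `add_trajectory`), but the filtering threshold, normalisation and the toy `milestone_network` below are illustrative assumptions, not the values used in the actual scripts:

```r
library(dynwrap)

# counts: a cells x genes count matrix downloaded from e.g. GEO
# (the >= 10 cells threshold below is an illustrative assumption)
expressed <- colSums(counts > 0) >= 10
counts <- counts[, expressed]

# simple library-size normalisation followed by a log transformation
expression <- log2(counts / rowSums(counts) * 1e4 + 1)

dataset <- wrap_expression(
  counts = counts,
  expression = expression
)

# attach the gold standard trajectory; here a toy single linear edge,
# with each cell placed at a (random, for illustration) position along it
dataset <- add_trajectory(
  dataset,
  milestone_network = data.frame(
    from = "begin", to = "end", length = 1, directed = TRUE
  ),
  progressions = data.frame(
    cell_id = rownames(counts),
    from = "begin", to = "end",
    percentage = runif(nrow(counts))
  )
)
```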
Each synthetic dataset is based on the characteristics of a real dataset. These characteristics include:
- The number of cells and features
- The number of features which are differentially expressed in the trajectory
- The distributions of the library sizes, average expression, dropout probabilities, …, as estimated by Splatter
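A minimal sketch of estimating such a platform with Splatter, assuming `counts` is a gene-by-cell count matrix of one of the real datasets (which parameters the actual scripts extract may differ):

```r
library(splatter)

# estimate platform parameters (library sizes, dropout, mean expression, ...)
# from the real counts
params <- splatEstimate(counts)

# inspect some of the estimated parameters
getParam(params, "lib.loc")      # location of the log-normal library-size fit
getParam(params, "dropout.mid")  # midpoint of the logistic dropout curve

# reuse the platform to simulate a synthetic dataset with a similar "shape"
sim <- splatSimulate(params, verbose = FALSE)
```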
Here we estimate the parameters of these “platforms” and use them to simulate datasets with different simulators. Each simulation script first creates a design dataframe, which links particular platforms, topologies, seeds and other parameters specific to a simulator.
The data is then simulated using wrappers around the simulators (see /package/R/simulators.R), so that they all return datasets in a format consistent with dynwrap.
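Such a design dataframe can be sketched with plain `expand.grid`; the platform and topology names below are placeholders, not the ones used in the actual scripts:

```r
# one row per simulation: a platform x topology x seed combination
design <- expand.grid(
  platform_id = c("platform_1", "platform_2"),
  topology    = c("linear", "bifurcation", "cycle"),
  seed        = 1:3,
  stringsAsFactors = FALSE
)

nrow(design)  # 2 platforms x 3 topologies x 3 seeds = 18 simulations
head(design)
```

Each row of this dataframe is then handed to the corresponding simulator wrapper, which seeds the random number generator and returns a dynwrap dataset.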
- Estimation of the platforms from real data done by
- dyngen: simulations of regulatory networks which will produce a particular trajectory
- PROSSTT: expression is sampled from a linear model which depends on pseudotime
- Splatter: simulations of non-linear paths between different states
- dyntoy: simulations of toy data using random expression gradients in a reduced space
- Gathering some metadata about all the synthetic datasets
Finally, the datasets are characterised according to the different topologies they contain.
- An overview of all the topologies present in the datasets
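One way to characterise a trajectory's topology is directly from its milestone network. The sketch below, using igraph, classifies a network by its degree distribution; these classification rules are a simplification assumed for illustration, not the rules used in the actual characterisation:

```r
library(igraph)

classify_topology <- function(milestone_network) {
  g <- graph_from_data_frame(milestone_network[, c("from", "to")], directed = FALSE)
  degs <- degree(g)
  if (ecount(g) == vcount(g) && all(degs == 2)) {
    "cycle"                       # one closed loop
  } else if (all(degs <= 2)) {
    "linear"                      # a simple path
  } else if (sum(degs > 2) == 1) {
    "bifurcation/multifurcation"  # exactly one branching milestone
  } else {
    "tree or graph"               # multiple branching milestones
  }
}

# a single branching milestone B with three leaves
classify_topology(data.frame(from = c("A", "B", "B"), to = c("B", "C", "D")))
# -> "bifurcation/multifurcation"
```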