
Dataset processing and characterisation

The datasets are split into real and synthetic datasets. The real datasets are downloaded and preprocessed first, and characteristics of these datasets (such as the number of cells and genes, library sizes, dropout probabilities, …) are then used to generate the synthetic datasets. All datasets are then characterised and uploaded to Zenodo.

# script/folder description
0 📄 download_from_zenodo.R Downloading the processed datasets from Zenodo (10.5281/zenodo.1443566)
1 📁 real Real datasets
2 📁 synthetic Synthetic datasets
3 📄 download_from_prism.R Download the datasets from the cluster
4 📁 dataset_characterisation Dataset characterisation
5 📄 upload_to_zenodo.R Upload the datasets to Zenodo (10.5281/zenodo.1211532)

The results of this experiment are available here.

Real datasets

The generation of the real datasets is divided into two parts. We first download all the (annotated) expression files from sources such as GEO. Next, we filter and normalise all datasets, and wrap them into the common trajectory format of dynwrap.

# script/folder description
1 📄 download_from_sources.R Downloading the real datasets from their sources (e.g. GEO), and constructing the gold standard model, using the helpers in helpers-download_from_sources
2 📄 filter_and_normalise.R Filtering and normalising the real datasets using dynbenchmark::process_raw_dataset. All datasets are then saved into the dynwrap format.
3 📄 gather_metadata.R Gathers some metadata about all the real datasets
4 📄 datasets_table.R Creates a table of the datasets in, excuse me, Excel (for supplementary material)
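
To give a feel for what a filtering and normalisation step involves, here is a minimal base R sketch on a toy count matrix. It is not the implementation of dynbenchmark::process_raw_dataset: the thresholds, the cells-by-features orientation, and the library-size normalisation followed by a log2 transform are all assumptions made for illustration.

```r
# Toy count matrix: cells in rows, features in columns (assumed orientation).
set.seed(42)
counts <- matrix(
  rpois(100 * 30, lambda = 1),
  nrow = 100, ncol = 30,
  dimnames = list(paste0("cell_", 1:100), paste0("gene_", 1:30))
)
counts["cell_1", ] <- 0  # an empty cell that should be filtered out
counts[, "gene_1"] <- 0  # an unexpressed gene that should be filtered out

# Hypothetical thresholds, chosen only for this example.
min_cells_expressing <- 3
min_library_size <- 1

# Filter features first, then cells, so no retained cell has an empty library.
keep_features <- colSums(counts > 0) >= min_cells_expressing
filtered <- counts[, keep_features, drop = FALSE]
keep_cells <- rowSums(filtered) >= min_library_size
filtered <- filtered[keep_cells, , drop = FALSE]

# Library-size normalisation followed by a log2 transform.
library_sizes <- rowSums(filtered)
size_factors <- library_sizes / mean(library_sizes)
expression <- log2(filtered / size_factors + 1)

dim(expression)
```

Filtering features before cells means the cell filter sees the final set of features, which avoids dividing by a zero library size during normalisation.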

Synthetic datasets

Each synthetic dataset is based on characteristics estimated from real datasets. These characteristics include:

  • The number of cells and features
  • The number of features which are differentially expressed in the trajectory
  • The distributions of library sizes, average expression, dropout probabilities, …, as estimated by Splatter.
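
As a rough illustration of the simpler characteristics in this list, the sketch below summarises a toy count matrix with base R. It is only a crude stand-in: Splatter fits proper distributions rather than taking summary statistics, the fraction of zeros is used here as a naive proxy for dropout, the number of differentially expressed features is left out, and the cells-by-features orientation is an assumption.

```r
# Toy count matrix: cells in rows, features in columns (assumed orientation).
set.seed(1)
counts <- matrix(rpois(200 * 50, lambda = 0.5), nrow = 200, ncol = 50)

library_sizes <- rowSums(counts)

# Simple summary statistics, not Splatter's parameter estimates.
characteristics <- list(
  n_cells = nrow(counts),
  n_features = ncol(counts),
  library_size_mean = mean(library_sizes),
  library_size_sd = sd(library_sizes),
  mean_expression = mean(counts),
  zero_fraction = mean(counts == 0)  # naive proxy for dropout
)

str(characteristics)
```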

Here we estimate the parameters of these “platforms” (the sets of characteristics listed above) and use them to simulate datasets with different simulators. Each simulation script first creates a design dataframe, which links particular platforms, different topologies, seeds and other parameters specific to a simulator.
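
A design dataframe of this kind could be built by crossing the factors, as in the minimal base R sketch below. The platform names, topology names and the dataset_id column are hypothetical placeholders, and the actual simulation scripts may construct their designs differently.

```r
# Hypothetical platform and topology names, for illustration only.
platforms <- c("platform_a", "platform_b")
topologies <- c("linear", "bifurcation", "cycle")
seeds <- 1:2

# One row per combination of platform, topology and seed.
design <- expand.grid(
  platform = platforms,
  topology = topologies,
  seed = seeds,
  stringsAsFactors = FALSE
)

# A unique identifier per simulated dataset (hypothetical naming scheme).
design$dataset_id <- paste(design$platform, design$topology, design$seed, sep = "_")

head(design)
```

Each row then describes one dataset to simulate, so a simulator wrapper can be applied row by row and the seed column keeps every run reproducible.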

The data is then simulated using wrappers around the simulators (see /package/R/simulators.R), so that they all return datasets in a format consistent with dynwrap.

# script/folder description
1 📄 estimate_platform.R Estimation of the platforms from real data done by dynbenchmark::estimate_platform
2a 📄 simulate_dyngen_datasets.R dyngen, simulations of regulatory networks which will produce a particular trajectory
2b 📄 simulate_prosstt_datasets.R PROSSTT, expression is sampled from a linear model which depends on pseudotime
2c 📄 simulate_splatter_datasets.R Splatter, simulations of non-linear paths between different states
2d 📄 simulate_dyntoy_datasets.R dyntoy, simulations of toy data using random expression gradients in a reduced space
3 📄 gather_metadata.R Gathers some metadata about all the synthetic datasets
4 📄 dyngen_samplers_table.R


Dataset characterisation

Characterisation of the datasets regarding the different topologies present.

# script/folder description
1 📄 topology.R An overview of all the topologies present in the datasets