---
title: "Advanced: On scalability and runtime"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteEncoding{UTF-8}
  %\VignetteIndexEntry{Advanced: On scalability and runtime}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
---
```{r init, message=FALSE, fig.width = 10, fig.height = 6}
library(tidyverse)
library(dyngen)
```
<!-- github markdown built using
rmarkdown::render("vignettes/scalability_and_runtime.Rmd", output_format = rmarkdown::github_document())
-->
In this vignette, we will take a look at the runtime of dyngen as the number of genes and the number of cells sampled
are increased. We'll be using the bifurcating cycle backbone, which is well known for its beautiful 3D butterfly shape!
```{r settings}
library(dyngen)
library(tidyverse)
set.seed(1)
save_dir <- "scalability_and_runtime_runs"
if (!dir.exists(save_dir)) dir.create(save_dir, recursive = TRUE)
backbone <- backbone_bifurcating_cycle()
```
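The number of transcription factors in a dyngen model is determined by the modules of the backbone, which is why `nrow(backbone$module_info)` is used below to set `num_tfs`. If you want to peek at the backbone object yourself, here is a minimal sketch, assuming the usual `module_info` and `module_network` slots of a dyngen backbone:
```{r inspect_backbone, eval=FALSE}
# each row of module_info describes one transcription factor module;
# nrow(backbone$module_info) is used below as the number of TFs
backbone$module_info

# the module-level regulatory network underlying the backbone
backbone$module_network
```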
## Initial run
We'll be running this simulation a few times, with different values for `num_cells` and `num_features` to assess the scalability of dyngen. An example of a resulting dyngen model is shown here.
```{r example, fig.width=20, fig.height=20}
num_cells <- 100
num_features <- 100

num_tfs <- nrow(backbone$module_info)
num_targets <- round((num_features - num_tfs) / 2)
num_hks <- num_features - num_targets - num_tfs

out <-
  initialise_model(
    backbone = backbone,
    num_tfs = num_tfs,
    num_targets = num_targets,
    num_hks = num_hks,
    num_cells = num_cells,
    gold_standard_params = gold_standard_default(
      census_interval = 1,
      tau = 100/3600
    ),
    simulation_params = simulation_default(
      census_interval = 10,
      ssa_algorithm = ssa_etl(tau = 300/3600),
      experiment_params = simulation_type_wild_type(
        num_simulations = num_cells / 10
      )
    ),
    verbose = FALSE
  ) %>%
  generate_dataset(make_plots = TRUE)

out$plot
```
We tweaked some of the parameters by running this particular backbone once with `num_cells = 100` and `num_features = 100` and verifying that the new parameters still yield the desired outcome. The parameters we tweaked are:
* On average, 10 cells are sampled per simulation (e.g. `num_simulations = 100` and `num_cells = 1000`); a small sketch of this ratio follows this list. You could increase this ratio to get a better cell yield from a given set of simulations, but cells from the same simulation that are temporally close will have highly correlated expression profiles.
* The time step `tau` was increased. This makes the Gillespie algorithm slightly faster, but might result in unexpected artifacts in the simulated data.
* `census_interval` was increased from 4 to 10. This causes dyngen to store an expression profile only once every 10 time units. Since the total simulation time is xxx, each simulation will result in yyy data points. Note that, on average, only 10 data points are sampled per simulation.
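To make the first point a bit more concrete, here is a small sketch of how the number of simulations relates to the desired number of cells. `cells_per_sim` is just a name introduced here for illustration; it is not a dyngen parameter.
```{r cells_per_sim_sketch, eval=FALSE}
# illustrative only: pick the number of simulations for a target
# cells-per-simulation ratio (10 is the ratio used throughout this vignette)
cells_per_sim <- 10
num_cells <- 1000
num_simulations <- num_cells / cells_per_sim  # 100 simulations

# a higher ratio yields more cells per simulation, but temporally close
# cells from the same simulation will have correlated expression profiles
```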
For more information on parameter tuning, see the vignette 'Advanced: tuning the simulation parameters'.
## Timing experiments
The simulations are run once with a large `num_features` and `num_cells`, several times with varying `num_cells`, and several more times with varying `num_features`. Except for the largest run, every setting is repeated three times in order to get slightly more stable time measurements.
Since some of the simulations can take over 10 minutes, the timing results of the simulations are cached in the '`r save_dir`' folder.
```{r experiments}
settings <- bind_rows(
  tibble(num_cells = 10000, num_features = 10000, rep = 1), #, rep = seq_len(3)),
  crossing(
    num_cells = seq(1000, 10000, by = 1000),
    num_features = 100,
    rep = seq_len(3)
  ),
  crossing(
    num_cells = 100,
    num_features = seq(1000, 10000, by = 1000),
    rep = seq_len(3)
  )
) %>%
  mutate(filename = paste0(save_dir, "/cells", num_cells, "_feats", num_features, "_rep", rep, ".rds"))

timings <- pmap_dfr(settings, function(num_cells, num_features, rep, filename) {
  if (!file.exists(filename)) {
    set.seed(rep)
    cat("Running num_cells: ", num_cells, ", num_features: ", num_features, ", rep: ", rep, "\n", sep = "")

    num_tfs <- nrow(backbone$module_info)
    num_targets <- round((num_features - num_tfs) / 2)
    num_hks <- num_features - num_targets - num_tfs

    out <-
      initialise_model(
        backbone = backbone,
        num_tfs = num_tfs,
        num_targets = num_targets,
        num_hks = num_hks,
        num_cells = num_cells,
        gold_standard_params = gold_standard_default(
          census_interval = 1,
          tau = 100/3600
        ),
        simulation_params = simulation_default(
          census_interval = 10,
          ssa_algorithm = ssa_etl(tau = 300/3600),
          experiment_params = simulation_type_wild_type(
            num_simulations = num_cells / 10
          )
        ),
        verbose = FALSE
      ) %>%
      generate_dataset()

    tim <-
      get_timings(out$model) %>%
      mutate(rep, num_cells, num_features)

    write_rds(tim, filename, compress = "gz")
  }

  read_rds(filename)
})

timings_gr <-
  timings %>%
  group_by(group, task, num_cells, num_features) %>%
  summarise(time_elapsed = mean(time_elapsed), .groups = "drop")

timings_sum <-
  timings %>%
  group_by(num_cells, num_features, rep) %>%
  summarise(time_elapsed = sum(time_elapsed), .groups = "drop")
```
## Simulate a large dataset (10k × 10k)
Shown below are the timings of each of the steps in simulating a dyngen dataset containing 10'000 cells and 10'000 features.
The total simulation time required is `r timings_gr %>% filter(num_cells == 10000, num_features == 10000) %>% pull(time_elapsed) %>% sum() %>% round()` seconds,
most of which is spent performing the simulations themselves.
```{r bblego, fig.width = 8, fig.height = 8}
timings0 <-
  timings_gr %>%
  filter(num_cells == 10000, num_features == 10000) %>%
  mutate(name = forcats::fct_rev(forcats::fct_inorder(paste0(group, ": ", task))))

ggplot(timings0) +
  geom_bar(aes(x = name, y = time_elapsed, fill = group), stat = "identity") +
  scale_fill_brewer(palette = "Dark2") +
  theme_classic() +
  theme(legend.position = "none") +
  coord_flip() +
  labs(x = NULL, y = "Time (s)", fill = "dyngen stage")
```
## Increasing the number of cells
By increasing the number of cells from 1000 to 10'000 whilst keeping the number of features fixed, we can get an idea of
how the simulation time scales w.r.t. the number of cells.
```{r figure1, fig.width=10, fig.height=5}
timings1 <-
  timings_gr %>%
  filter(num_features == 100) %>%
  group_by(num_cells, num_features, group) %>%
  summarise(time_elapsed = sum(time_elapsed), .groups = "drop")

ggplot(timings1) +
  geom_bar(aes(x = forcats::fct_inorder(as.character(num_cells)), y = time_elapsed, fill = forcats::fct_inorder(group)), stat = "identity") +
  theme_classic() +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Number of cells", y = "Average time (s)", fill = "dyngen step")
```
It seems the execution time scales linearly w.r.t. the number of cells. This makes sense: as the number of cells is increased, so is the number of simulations performed (though this coupling is not strictly required). Since the simulations are independent of each other and take up most of the time, the total execution time scales linearly.
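The straight line in the plot below is drawn with hard-coded coefficients. They were presumably obtained from a simple linear fit of the summed timings against the number of cells; a sketch of such a fit (not part of the original timing code) would look like this:
```{r fit_cells_sketch, eval=FALSE}
# fit execution time as a linear function of the number of cells
# (num_features fixed at 100); the coefficients should roughly match
# the intercept and slope passed to geom_abline() below
fit_cells <- lm(time_elapsed ~ num_cells, data = timings_sum %>% filter(num_features == 100))
coef(fit_cells)
```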
```{r plot_timings_cell, fig.width=8, fig.height=6}
ggplot(timings_sum %>% filter(num_features == 100)) +
  theme_bw() +
  geom_point(aes(num_cells, time_elapsed)) +
  scale_x_continuous(limits = c(0, 10000)) +
  scale_y_continuous(limits = c(0, 300)) +
  geom_abline(intercept = 22.097, slope = 0.0252) +
  labs(x = "Number of cells", y = "Execution time (s)")
```
## Increasing the number of features
By increasing the number of features from 1000 to 10'000 whilst keeping the number of cells fixed, we can get an idea of
how the simulation time scales w.r.t. the number of features.
```{r figure2, fig.width=10, fig.height=5}
timings2 <-
  timings_gr %>%
  filter(num_cells == 100) %>%
  group_by(num_cells, num_features, group) %>%
  summarise(time_elapsed = sum(time_elapsed), .groups = "drop")

ggplot(timings2) +
  geom_bar(aes(x = forcats::fct_inorder(as.character(num_features)), y = time_elapsed, fill = forcats::fct_inorder(group)), stat = "identity") +
  theme_classic() +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Number of features", y = "Average time (s)", fill = "dyngen step")
```
It seems the execution time also scales linearly w.r.t. the number of features.
As more genes are added to the underlying gene regulatory network, the density of the graph
does not change, so it is no surprise that the execution time scales linearly with the number
of features as well.
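As before, the straight line in the next plot uses hard-coded coefficients, presumably from a linear fit of the summed timings against the number of features; a sketch of such a fit:
```{r fit_feats_sketch, eval=FALSE}
# fit execution time as a linear function of the number of features
# (num_cells fixed at 100)
fit_feats <- lm(time_elapsed ~ num_features, data = timings_sum %>% filter(num_cells == 100))
coef(fit_feats)
```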
```{r plot_timings_feats, fig.width=8, fig.height=6}
ggplot(timings_sum %>% filter(num_cells == 100)) +
  theme_bw() +
  geom_point(aes(num_features, time_elapsed)) +
  scale_x_continuous(limits = c(0, 10000)) +
  scale_y_continuous(limits = c(0, 850)) +
  geom_abline(intercept = 0.5481, slope = 0.07988) +
  labs(x = "Number of features", y = "Execution time (s)")
```
## Execution platform
These timings were measured using 30 (out of 32) threads on an AMD Ryzen 9 5950X clocked at 3.4 GHz.
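dyngen can spread parts of the computation over multiple cores. If you want to reproduce this setup, the number of cores is, as far as we know, controlled by the `num_cores` argument of `initialise_model()`; a sketch (not used in the timing runs above):
```{r num_cores_sketch, eval=FALSE}
# sketch: assumes the num_cores argument of initialise_model();
# adjust the value to the number of threads available on your machine
model <- initialise_model(
  backbone = backbone,
  num_cores = 30,
  verbose = FALSE
)
```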
Session info:
```{r}
sessionInfo()
```