Merge pull request #121 from bodkan/spatial-nonspatial-mixed-models

Allow for the possibility to have non-spatial populations in an otherwise spatial model
bodkan · Feb 2, 2023 · 57aca52 · 57aca52
2 parents cd505a4 + 3724ca7
commit 57aca52
Show file tree

Hide file tree

Showing 46 changed files with 641 additions and 306 deletions.
diff --git a/.github/workflows/R-CMD-check.yaml b/.github/workflows/R-CMD-check.yaml
@@ -117,8 +117,8 @@ jobs:
         run: |
           mkdir ~/R_LIBS
           echo "R_LIBS_USER=~/R_LIBS" >> ~/.Renviron
-          R -e 'install.packages("reticulate", repos = "http://cran.rstudio.com")'
-          R -e 'deps <- c("msprime==1.2.0", "tskit==0.5.2", "pyslim==1.0"); PYTHON_ENV <- paste(gsub("==", "-", deps), collapse = "_"); reticulate::conda_create(envname = PYTHON_ENV); reticulate::use_condaenv(PYTHON_ENV, required = TRUE); reticulate::conda_install(envname = PYTHON_ENV, packages = c(deps, "pandas"), pip = TRUE)'
+          R -e 'install.packages("reticulate", repos = "http://cran.rstudio.com"); reticulate::install_miniconda()'
+          R -e 'deps <- c("msprime==1.2.0", "tskit==0.5.4", "pyslim==1.0.1"); PYTHON_ENV <- paste0("Python-3.11_", paste(gsub("==", "-", deps), collapse = "_")); reticulate::conda_create(envname = PYTHON_ENV, python_version = '3.11'); reticulate::use_condaenv(PYTHON_ENV, required = TRUE); reticulate::conda_install(envname = PYTHON_ENV, packages = c(deps, "pandas"), pip = TRUE)'
 
       - uses: r-lib/actions/setup-pandoc@v2
 

diff --git a/.github/workflows/test-coverage.yaml b/.github/workflows/test-coverage.yaml
@@ -92,8 +92,8 @@ jobs:
         run: |
           mkdir ~/R_LIBS
           echo "R_LIBS_USER=~/R_LIBS" >> ~/.Renviron
-          R -e 'install.packages("reticulate", repos = "http://cran.rstudio.com")'
-          R -e 'deps <- c("msprime==1.2.0", "tskit==0.5.2", "pyslim==1.0"); PYTHON_ENV <- paste(gsub("==", "-", deps), collapse = "_"); reticulate::conda_create(envname = PYTHON_ENV); reticulate::use_condaenv(PYTHON_ENV, required = TRUE); reticulate::conda_install(envname = PYTHON_ENV, packages = c(deps, "pandas"), pip = TRUE)'
+          R -e 'install.packages("reticulate", repos = "http://cran.rstudio.com"); reticulate::install_miniconda()'
+          R -e 'deps <- c("msprime==1.2.0", "tskit==0.5.4", "pyslim==1.0.1"); PYTHON_ENV <- paste0("Python-3.11_", paste(gsub("==", "-", deps), collapse = "_")); reticulate::conda_create(envname = PYTHON_ENV, python_version = '3.11'); reticulate::use_condaenv(PYTHON_ENV, required = TRUE); reticulate::conda_install(envname = PYTHON_ENV, packages = c(deps, "pandas"), pip = TRUE)'
 
       - uses: r-lib/actions/setup-r@v2
         with:

diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: slendr
 Title: A Simulation Framework for Spatiotemporal Population Genetics
-Version: 0.4.0.9000
+Version: 0.5.0
 Authors@R:
     person(given = "Martin",
            family = "Petr",

diff --git a/NEWS.md b/NEWS.md
@@ -1,4 +1,4 @@
-# slendr (development version)
+# slendr 0.5.0
 
 - **<u>Minor breaking change!</u> Python environments of _slendr_ are no longer automatically activated upon calling `library(slendr)`! Using the coalescent _msprime_ back end and _slendr_'s tree-sequence functions now requires making an explicit call to a new function `init_env()` after `library(slendr)` is executed.** (PR [#102](https://github.com/bodkan/slendr/pull/118))
 
@@ -12,8 +12,12 @@ Splitting the Python virtual environment activation step into its own `init_env(
 
 So, to recap: `library(slendr)` no longer activates _slendr_'s isolated Python virtual environment. In order to simulate tree sequences and analyse them using its interface to _tskit_, it is necessary to call `init_env()`. This function performs the same Python-activation steps that `library(slendr)` used to call automagically in earlier _slendr_ versions. No other change to your scripts is necessary.
 
+- Related to the previous point: _slendr_ now requires Python 3.11, msprime 1.2.0, tskit 0.5.4, and pyslim 1.0.1, to keep up with recent releases of its Python dependencies. Again, this presents no hassle to the user, and the only thing required is re-running `setup_env()`. (PR [#112](https://github.com/bodkan/slendr/pull/121)).
+
 - When a named list is provided as a `sample_sets =` argument to a oneway statistic function, the names are used in a `set` column of the resulting data frame even if only single samples were used. ([#2a6781](https://github.com/bodkan/slendr/commit/2a6781))
 
+- It is now possible to have non-spatial populations in an otherwise spatial model. Of course, when plotting such models on a map, only spatial components of the model will be plotted and _slendr_ will give a warning. To be absolutely sure that users intends to do this, _slendr_ will also give a warning when running `compile_model()` on models like this. Please consider this option experimental for the time-being as it is hard to predict which edge cases might break because of this (all unit tests and documentation tests are passing though). Feedback is more than welcome. (PR [#112](https://github.com/bodkan/slendr/pull/121)).
+
 - It is now possible to label groups of samples in _slendr_'s _tskit_ interface functions which should make data frames with statistics results more readable. As an example, running `ts_f3(ts, A = c("p1_1", "p1_2", "p1_3"), B = c("p2_1", "p2_3"), C = c("p3_1", "p3_2", "p3_"))` resulted in a following data-frame output:
 
 ```

diff --git a/R/compilation.R b/R/compilation.R
@@ -65,13 +65,29 @@ compile_model <- function(populations, generation_time, path = NULL, resolution
   if (is.null(slim_script))
     slim_script <- system.file("scripts", "script.slim", package = "slendr")
 
-  map <- get_map(populations[[1]])
-  if (!is.null(map) && is.null(resolution))
+  # get values of all map attributes across populations
+  maps <- lapply(populations, get_map) %>% Filter(Negate(is.null), .)
+
+  if (length(unique(maps)) > 1)
+    stop("Multiple spatial maps detected across populations but only a single\n",
+         "world map is allowed for spatial models.", call. = FALSE)
+  else if (length(unique(maps)) == 1)
+    map <- maps[[1]]
+  else
+    map <- NULL
+
+  if (!is.null(map) && length(maps) != length(populations))
+    warning("Model containing a mix of spatial and non-spatial populations will be compiled.\n",
+            "Although this is definitely supported, make sure this is really what you want.",
+            call. = FALSE)
+
+  if (!is.null(map) && is.null(resolution) && serialize)
     stop("A map resolution must be specified for spatial models", call. = FALSE)
 
   if (!serialize && is.null(path) && inherits(map, "slendr_map")) {
-    stop("Spatial models must be serialized to disk for SLiM to simulate data from",
-         call. = FALSE)
+    warning("Spatial models must be serialized to disk for SLiM to simulate data from.\n",
+            "Compiled like this, your model can only be simulated with msprime.",
+            call. = FALSE)
   }
   if (serialize && is.null(path))
     path <- tempfile()
@@ -120,9 +136,6 @@ compile_model <- function(populations, generation_time, path = NULL, resolution
 
   if (is.data.frame(gene_flow)) gene_flow <- list(gene_flow)
 
-  if (length(unique(sapply(populations, has_map))) != 1)
-    stop("Populations must be either all spatial or non-spatial, but not both", call. = FALSE)
-
   # make sure all populations share the same direction of time
   time_dir <- setdiff(unique(sapply(populations, time_direction)), "unknown")
 
@@ -157,7 +170,7 @@ setting `direction = 'backward'.`", call. = FALSE)
   admix_table <- compile_geneflows(gene_flow, split_table, generation_time, time_dir, end_time)
   resize_table <- compile_resizes(populations, generation_time, time_dir, end_time, split_table)
 
-  if (inherits(map, "slendr_map")) {
+  if (serialize && inherits(map, "slendr_map")) {
     if (!is.null(competition)) check_resolution(map, competition)
     if (!is.null(mating)) check_resolution(map, mating)
     if (!is.null(dispersal)) check_resolution(map, dispersal)
@@ -915,12 +928,13 @@ compile_maps <- function(populations, split_table, resolution, generation_time,
   map_table <- map_table[!duplicated(map_table[, c("pop", "time_gen")]), ]
   map_table$path <- seq_len(nrow(map_table)) %>% paste0(., ".png")
 
-  # maps of ancestral populations have to be set in the first generation,
-  # regardless of the specified split time
+  # maps of ancestral populations (that is, those that are spatial) have to
+  # be set in the first generation, regardless of the specified split time
   ancestral_pops <- split_table[split_table$parent == "__pop_is_ancestor", ]$pop
   ancestral_maps <- purrr::map(ancestral_pops, ~ which(map_table$pop == .x)) %>%
     purrr::map_int(~ .x[1])
-  map_table[ancestral_maps, ]$time_gen <- 1
+  if (any(!is.na(ancestral_maps)))
+    map_table[ancestral_maps[!is.na(ancestral_maps)], ]$time_gen <- 1
 
   map_table
 }
@@ -1044,7 +1058,10 @@ compile_dispersals <- function(populations, generation_time, direction,
 
 # Render population boundaries to black-and-white spatial maps
 render <- function(pops, resolution) {
-  raster_list <- lapply(pops, function(pop) {
+  # process only populations which have a map
+  spatial_pops <- Filter(has_map, pops)
+
+  raster_list <- lapply(spatial_pops, function(pop) {
     # iterate over temporal maps for the current population
     snapshots <- lapply(unique(pop$time), function(t) {
       snapshot <- pop[pop$time == t, ]

diff --git a/R/interface.R b/R/interface.R
@@ -55,16 +55,22 @@ population <- function(name, time, N, parent = NULL, map = FALSE,
   if (!is.character(name) || length(name) != 1)
     stop("A population name must be a character scalar value", call. = FALSE)
 
-  # is this the first population defined in the model?
-  if (is.null(parent)) {
-    if (!is.logical(map) && !inherits(map, "slendr_map"))
-      stop("Ancestral population must specify its 'map'", call. = FALSE)
+  # if this population splits from a parental population, check that the parent
+  # really exists and that the split time make sense given the time of appearance
+  # of the parental population
+  if (!is.null(parent)) {
+    if (!inherits(parent, "slendr_pop"))
+      stop("Only slendr_pop objects can represent parental populations", call. = FALSE)
     else
-      map <- map
-  } else {
-    check_split_time(time, parent)
-    map <- attr(parent, "map")
+      check_split_time(time, parent)
   }
+
+  if (!is.logical(map) && !inherits(map, "slendr_map"))
+    stop("A simulation landscape must be an object of the class slendr_map", call. = FALSE)
+
+  if (!is.null(parent) && is.logical(map) && map == FALSE)
+    map <- attr(parent, "map")
+
   time <- as.integer(time)
   if (time < 1) stop("Split time must be a non-negative integer number", call. = FALSE)
   N <- as.integer(N)
@@ -570,7 +576,15 @@ gene_flow <- function(from, to, rate, start, end, overlap = TRUE) {
   if ((has_map(from) && !has_map(to)) || (!has_map(from) && has_map(to)))
     stop("Both or neither populations must be spatial", call. = FALSE)
 
+  # make sure both participating populations are present at the start of the
+  # gene flow event (`check_present_time()` is reused from the sampling functionality)
+  # check_present_time(start, from, offset = 0)
+  # check_present_time(end, from, offset = 0)
+  # check_present_time(start, to, offset = 0)
+  # check_present_time(end, to, offset = 0)
+
   # make sure the population is not removed during the the admixture period
+  # (note that this will only be relevant for SLiM simulations at this moment)
   check_removal_time(start, from)
   check_removal_time(end, from)
   check_removal_time(start, to)
@@ -1327,23 +1341,23 @@ setup_env <- function(quiet = FALSE, agree = FALSE, pip = NULL) {
       if (!dir.exists(reticulate::miniconda_path()))
         reticulate::install_miniconda()
 
-      reticulate::conda_create(envname = PYTHON_ENV)
-      reticulate::use_condaenv(PYTHON_ENV, required = TRUE)
-
       # parse the Python env name back to the list of dependencies
       # (the environment is defined in .onAttach(), and this makes sure the
       # dependencies are defined all in one place)
-      version_deps <- PYTHON_ENV %>% gsub("-", "==", .) %>% strsplit("_") %>% .[[1]]
-      other_deps <- "pandas"
-      deps <- c(version_deps, other_deps)
+      versions <- PYTHON_ENV %>% gsub("-", "==", .) %>% strsplit("_") %>% .[[1]]
+      python_version <- gsub("Python==", "", versions[1])
+      package_versions <- c(versions[-1], "pandas")
+
+      reticulate::conda_create(envname = PYTHON_ENV, python_version = python_version)
+      reticulate::use_condaenv(PYTHON_ENV, required = TRUE)
 
       # msprime/tskit conda dependency is broken on M1 Mac architecture, fallback
       # to pip in cases like this (otherwise use conda to avoid any potential
       # compilation issues such as missing libgsl)
       if (is.null(pip))
         pip <- all(Sys.info()[c("sysname", "machine")] == c("Darwin", "arm64"))
 
-      reticulate::conda_install(envname = PYTHON_ENV, packages = deps, pip = pip)
+      reticulate::conda_install(envname = PYTHON_ENV, packages = package_versions, pip = pip)
 
       if (!quiet) {
         message("======================================================================")

diff --git a/R/tree-sequences.R b/R/tree-sequences.R
@@ -2258,7 +2258,7 @@ get_pyslim_table_data <- function(ts, simplify_to = NULL) {
   # add numeric node IDs to each individual
   combined <-
     dplyr::bind_rows(sampled, not_sampled) %>%
-    dplyr::right_join(nodes, by = "ind_id") %>%
+    dplyr::right_join(nodes, by = "ind_id", multiple = "all") %>% # required after dplyr v1.1.0
     dplyr::mutate(time = ifelse(is.na(ind_id), time.y, time.x),
                   time_tskit = time_tskit.y,
                   sampled = ifelse(is.na(ind_id), FALSE, sampled)) %>%
@@ -2361,7 +2361,7 @@ get_tskit_table_data <- function(ts, simplify_to = NULL) {
 
   # add numeric node IDs to each individual
   combined <- dplyr::select(individuals, -time, -pop_id) %>%
-    dplyr::right_join(nodes, by = "ind_id") %>%
+    dplyr::right_join(nodes, by = "ind_id", multiple = "all") %>%  # required after dplyr v1.1.0
     dplyr::rename(pop = pop.y, time_tskit = time_tskit.y)
 
   # for tree sequences which have some individuals/nodes marked as 'sampled', mark the
@@ -2422,7 +2422,10 @@ get_annotated_edges <- function(x) {
       dplyr::select(parent_pop = pop,
                     parent_node_id = node_id,
                     parent_time = time, parent_location = location)
-  parent_nodes <- dplyr::left_join(parent_nodes, edges, by = stats::setNames("parent", join1)) %>%
+  parent_nodes <- dplyr::left_join(
+    parent_nodes, edges,
+    by = stats::setNames("parent", join1), multiple = "all"
+  ) %>% # multiple = "all" required after dplyr v1.1.0
     dplyr::arrange(!!join1)
 
   # take the `parent_nodes` able above and do another join operation, this time
@@ -2438,17 +2441,12 @@ get_annotated_edges <- function(x) {
       dplyr::select(child_pop = pop,
                     child_node_id = node_id,
                     child_time = time, child_location = location)
-  edge_nodes <- dplyr::inner_join(edge_nodes, parent_nodes, by = stats::setNames("child", join2)) %>%
+  edge_nodes <- dplyr::inner_join(
+    edge_nodes, parent_nodes,
+    by = stats::setNames("child", join2)
+  ) %>%
     dplyr::arrange(!!join2)
 
-  # data %>%
-  #   dplyr::filter(phylo_id %in% edges$child) %>%
-  #   dplyr::select(child_pop  = pop,
-  #                 child_phylo_id = phylo_id, child_node_id = node_id,
-  #                 child_time = time, child_location = location) %>%
-  #   dplyr::inner_join(parent_nodes, by = c("child_phylo_id" = "child")) %>%
-  #   dplyr::arrange(child_phylo_id)
-
   # transforming individual child/parent location columns (type POINT) into a
   # line (type LINESTRING)
   if (spatial) {

diff --git a/R/utils.R b/R/utils.R
@@ -213,12 +213,22 @@ get_geneflows <- function(model, time) {
     after_op <- `<=`
   }
 
-  geneflows <- subset(model$geneflow, before_op(tstart_orig, time) & after_op(tend_orig, time))
+  if (!is.null(time))
+    geneflows <- subset(model$geneflow, before_op(tstart_orig, time) & after_op(tend_orig, time))
+  else
+    geneflows <- model$geneflow
+
   geneflows$from <- factor(geneflows$from, levels = pop_names)
   geneflows$to <- factor(geneflows$to, levels = pop_names)
 
   migr_coords <- lapply(seq_len(nrow(geneflows)), function(row_i) {
 
+    # if time point was not provided, simply take the midpoint of the current
+    # gene-flow event
+    if (is.null(time)) {
+      time <- geneflows[row_i, c("tstart_orig", "tend_orig")] %>% as.numeric() %>% mean()
+    }
+
     from <- model$populations[pop_names == geneflows[row_i, ]$from][[1]] %>%
       .[before_op(.$time, time), ] %>%
       .[nrow(.), ]

diff --git a/R/visualization.R b/R/visualization.R
@@ -45,8 +45,8 @@ plot_map <- function(..., time = NULL, gene_flow = FALSE,
     stop("'slendr_region' and 'slendr_pops' object cannot be plotted at once", call. = FALSE)
 
   if (gene_flow & (is.null(time) | !inherits(args[[1]], "slendr_model")))
-    stop("Migrations can be visualized only when a time point and a 'slendr_model'
-objects are specified", call. = FALSE)
+    warning("All gene-flow event will be visualized at once. If you wish to visualize\n",
+            "gene flows at a particular point in time, use the `time` argument.", call. = FALSE)
 
   # a single model object was provided
   if (length(args) == 1 & inherits(args[[1]], "slendr_model")) {
@@ -146,7 +146,12 @@ objects are specified", call. = FALSE)
       pop_maps <- pops
     }
 
-    if (intersect) pop_maps <- lapply(pop_maps, intersect_features)
+    if (intersect) {
+      intersected_maps <- pop_maps %>% Filter(has_map, .) %>% lapply(intersect_features)
+      if (length(intersected_maps) != length(pop_maps))
+        warning("Non-spatial populations in your model won't be visualized", call. = FALSE)
+      pop_maps <- intersected_maps
+    }
 
     pop_maps <- do.call(rbind, pop_maps)
     pop_maps$pop <- factor(pop_maps$pop, levels = pop_names)

diff --git a/R/zzz.R b/R/zzz.R
@@ -9,9 +9,10 @@ msp <- NULL
 # define slendr's required Python dependencies and compose an environment name
 # that will be used specifically for them
 PYTHON_ENV <-
-  c("msprime==1.2.0", "tskit==0.5.2", "pyslim==1.0") %>%
+  c("msprime==1.2.0", "tskit==0.5.4", "pyslim==1.0.1") %>%
   gsub("==", "-", .) %>%
-  paste(collapse = "_")
+  paste(collapse = "_") %>%
+  paste0("Python-3.11_", .)
 
 .onAttach <- function(libname, pkgname) {
   # check for presence of the slim binary in user's PATH and display
@@ -43,25 +44,24 @@ PYTHON_ENV <-
     if (!getOption("slendr.custom_env")) {
       version <- strsplit(PYTHON_ENV, "_")[[1]] %>% gsub(".*-", "", .)
       packageStartupMessage(
-        sprintf(paste0("A slendr Python environment with the necessary versions of msprime (%s),\n",
-        "tskit (%s), and pyslim (%s) has not been found.\n"), version[1], version[2], version[3]),
-        "\nYou can setup a pre-configured environment with all of slendr's Python\n",
-        "dependencies automatically by running the function setup_env()."
+        sprintf(paste0("A slendr Python (%s) environment with the necessary versions of\n",
+          "msprime (%s), tskit (%s), and pyslim (%s) has not been found.\n"),
+          version[1], version[2], version[3], version[4]),
+          "\nYou can setup a pre-configured environment with all of slendr's Python\n",
+          "dependencies automatically by running the function setup_env()."
       )
     }
   }
 
   packageStartupMessage(
     "=======================================================================\n",
-    "NOTE: Due to frequent issues with some user's Python setups, slendr no\n",
-    "longer activates its Python environment automatically upon calling\n",
-    "library(slendr).\n\n",
+    "NOTE: Due to Python setup issues on some systems which have been\n",
+    "causing trouble particularly for novice users, calling library(slendr)\n",
+    "no longer activates slendr's Python environment automatically.\n\n",
     "In order to use slendr's msprime back end or its tree-sequence\n",
-    "functionality, users now must activate slendr's Python environments\n",
-    "manually by calling init_env().\n\n",
-    "This inconvenience is a compromise in order to help novice users avoid\n",
-    "having to debug very technical, low-level Python-specific issues. This\n",
-    "note will be removed in a future version of slendr.",
+    "functionality, users must now activate slendr's Python environment\n",
+    "manually by executing init_env() after calling library(slendr).\n\n",
+    "(This note will be removed in the next major version of slendr.)",
     "\n======================================================================="
   )
 }