Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for SLiM simulations on Windows #149

Merged
merged 98 commits into from Feb 6, 2024
Merged

Add support for SLiM simulations on Windows #149

merged 98 commits into from Feb 6, 2024

Conversation

bodkan
Copy link
Owner

@bodkan bodkan commented Jan 18, 2024

This is inspired by the changes suggested by @GKresearch here.

I manageds to setup Windows 11 in a virtual machine on my Mac. It's not a great development experience but it's something!

The goal is to check if we can support the default means of installation described in Section 2.3.1 of the SLiM manual. Conda-installed SLiM should work as well, except that doesn't support SLiMgui which makes it not the best default target. That said, Conda users will be able specify the path to their slim binary directly via the slim_path= argument of the slim() function.

Speaking about paths -- I still haven't figured out how path management is supposed to work in Windows shell. In fact, it appears that the SLiM installation on Windows described in the SLiM manual is using a non-native non-Windows shell? That's why I'm considering making specification of path to a SLiM binary to be used (be it CLI SLiM or SLiMgui) mandatory on Windows.

Hey @GKresearch, if you're still interested in helping, it would be great if you could test the new version of slendr with Windows support on your computer, after I manage to put everything together.

Thanks for pushing me to do this and sorry again for the huge delay!

GKresearch and others added 3 commits July 26, 2023 14:51
Hi Martin,

Here is my edit to try and get the slim command working in Windows. The primary changes are with the slim command line request with specific changes to the pathing. I hope this pull request works appropriately, let me know if there are any issues.

Couple of notes: I'm on a Windows 11 machine. My slim is held in a conda environment and was downloaded through conda-forge. I call the slim path directly with the "slim_path" option in your slim() R command. I used you setup_env() to get access to the necessary python modules. Other than that, these changes were sufficient to allow me to work through the tutorial that you have on the slendr website. I mention this briefly in notes in the code itself, but I did not test anything regarding the slim gui, this was command line only.

Sincerely,
GK
@bodkan bodkan mentioned this pull request Jan 18, 2024
@bhaller
Copy link
Collaborator

bhaller commented Jan 18, 2024

In fact, it appears that the SLiM installation on Windows described in the SLiM manual is using a non-native non-Windows shell?

The main way for people to install on Windows nowadays, I think, is the pacman installer, I think? I don't know anything about Windows, but I think that installation route is pretty standard there...?

@bodkan
Copy link
Owner Author

bodkan commented Jan 18, 2024

The main way for people to install on Windows nowadays, I think, is the pacman installer, I think? I don't know anything about Windows, but I think that installation route is pretty standard there...?

Yeah, that's what I understood as well (looks to me that pacman is something like Homebrew on macOS but ¯\(ツ)/¯ I haven't used Windows in 15 years).

What I meant is that I it looked to me that CLI programs installed via pacman appear to be using some sort of bash ported for Windows but not the actual Windows "Command Prompt" (I think that's how it used to be called) or the "newer" Powershell. This is what the colored terminal screenshots in SLiM manual are from.

So this wasn't actually about the installation itself but about running those programs. In any case, it's all good -- this wasn't meant to be a criticism of any kind. Or it wasn't meant to express surprise or anything. Again, I don't use Windows, and have no prior expectations about any of this. :)

@bodkan
Copy link
Owner Author

bodkan commented Jan 18, 2024

Ah, hold on. I can run slim.exe through the good old Windows System Prompt that I remember from my teenage years. :) What threw me off is that the shell spawned by msys2 is a bash shell (not the Windows prompt), and same was true for git, etc. So I assumed something more special is going on with the compilation of programs ported from unix to Windows.

Also, when I said

In fact, it appears that the SLiM installation on Windows described in the SLiM manual is using a non-native non-Windows shell?

I actually meant pacman / msys2 compiled programs in general. But the above reads as I'm talking specifically about SLiM, which wasn't my intention, sorry.

@bhaller
Copy link
Collaborator

bhaller commented Jan 18, 2024

No worries, just checking that you were using pacman not some other thing like the WSL installation procedure, which (I think?) is more fringe.

@GKresearch
Copy link
Contributor

Hi @bodkan,

Thank you for looking into this! Yeah, I'm happy to try running the new version when it is ready!

@benjeffery
Copy link

@bodkan Hi! I assume you're using the latest tskit as some time ago we increased the possible lengths of ragged columns such as metadata to a 64bit int. It may be that on windows the type may still be 32 bit. What length of metadata column causes this issue? You can tell by with len(ts.tables.metadata).

@bhaller
Copy link
Collaborator

bhaller commented Feb 1, 2024

@benjeffery Might it be a good idea to add a bit of code in tskit, at startup or some convenient spot like that, that just checks sizeof(X) for various types and prints an error and exits if they are not what you expect/need? Trivial to do, and would catch a problem like this on an unusual platform that doesn't use the C type sizes you expect...?

@bodkan
Copy link
Owner Author

bodkan commented Feb 1, 2024

Hello @benjeffery, thanks for jumping in.

I assume you're using the latest tskit [...]

I release a new slendr version whenever a new version of tskit or msprime or pyslim is released, so currently it includes tskit 0.5.6.

What length of metadata column causes this issue? You can tell by with len(ts.tables.metadata).

I'll try to get that as soon as I manage to isolate a broken simulation run in a fully reproducible way. I don't have Windows access and debugging this over GitHub actions or inside a super slow virtualized Windows machine has been what I imagine using punchcards in the 70s must've been like... The fact that this doesn't happen (for any given simulation) in every run has made investigating this even more challenging.

I'll be in touch as I get more info on this, thank you.

@benjeffery
Copy link

@benjeffery Might it be a good idea to add a bit of code in tskit, at startup or some convenient spot like that, that just checks sizeof(X) for various types and prints an error and exits if they are not what you expect/need? Trivial to do, and would catch a problem like this on an unusual platform that doesn't use the C type sizes you expect...?

I haven't had time to dig into this yet - but am surprised we don't have a test for it. Hopefully we can replicate in CI.

@bodkan
Copy link
Owner Author

bodkan commented Feb 3, 2024

So, I have to say I’m not yet 100% convinced it’s really a tskit-on-Windows issue rather than reticulate-on-Windows issue. In fact, so far I'm leaning towards the latter.

I did finally managed to create a tiny test case which produces a tree sequence which reproducibly causes the crash every time. This will hopefully make it possible to really dig deep to determine just what is it that’s causing the problem! From a casual glance I managed yesterday evening before I had to run home it’s something… utterly bizarre.

@bodkan
Copy link
Owner Author

bodkan commented Feb 5, 2024

OK, I don't remember dealing with such an obscure weird issue.

@rdinnager Would you mind testing one more thing for me? I managed to create a tree sequence which always breaks on my Windows virtual machine but I'd like to make sure it also happens on a native Windows system. Interestingly, this happens regardless of whether the tree sequence is simulated with SLiM or msprime.

I attach two tree sequences, output_slim.trees and output_msprime.trees (output_trees.zip). Accessing the metadata element of a ts tree-sequence object crashes for both (so it must be something that they have in common!). Could it be the seed maybe? Have to do more digging. Already getting to an always breaking simulation run was quite a struggle.

  • First, to eliminate tskit in pure Python from the list of suspects, can you check that this runs without any problems?
# restart R to get a clean environment (and no slendr)

# use Python installed by slendr (but this doesn't really have an effect)
library(reticulate)
use_condaenv("Python-3.12_msprime-1.3.0_tskit-0.5.6_pyslim-1.0.4_tspop-0.0.2") 

repl_python() # open a Python REPL inside R

# paste the following into the Python REPL
import tskit
ts = tskit.load("output_msprime.trees")

ts.metadata          # this works without any issues
ts.metadata["SLiM"]  # this works without any issues
  • Then, let's run the same code as above but this time in the reticulate interface to Python (this works by embedding a Python process in an R session, otherwise there's no difference vs running pure Python code):
# restart R to get a clean environment (and no slendr)

# use Python installed by slendr (but this doesn't really have an effect)
library(reticulate)
use_condaenv("Python-3.12_msprime-1.3.0_tskit-0.5.6_pyslim-1.0.4_tspop-0.0.2") 

# load a tree sequence directly using the Python/R reticulate interface
tskit <- import("tskit")
ts <- tskit$load("output_msprime.trees")

ts$metadata # this works
ts$metadata # but this breaks!
ts$metadata # this works
ts$metadata # but this breaks! WTF?!?!
ts$metadata # this works
ts$metadata # but this breaks! WTF?!?!
ts$metadata # this works
ts$metadata # but this breaks! WTF?!?!
ts$metadata # this works
ts$metadata # but this breaks! WTF?!?!
ts$metadata # this works
ts$metadata # but this breaks! WTF?!?!
# ... rinse and repeat

So accessing the metadata breaks but only on every other access?!?! What the hell? This must be some weird, super low-level memory problem related to the Python embedding / caching the metadata method call. But I'm totally out of my depth.

@rdinnager Could you please check that you observe the same behavior? (Note that in my VM it happens for both the SLiM and msprime tree sequence). At least then I'll know I'm not chasing ghosts but that the attached tree sequences break also for you in this way. Thank you very much.


Again, for completeness, here's the error in full:

> ts$metadata # but this breaks! WTF?!?!
Error in py_get_attr_impl(x, name, silent) :
  SystemError: <functools._lru_cache_wrapper object at 0x000001D8D7D5BE20> returned a result with an exception set
Run `reticulate::py_last_error()` for details.
> reticulate::py_last_error()

── Python Exception Message ────────────────────────────────────────────
OverflowError: Python int too large to convert to C long

The above exception was the direct cause of the following exception:

  Traceback (most recent call last):
  File "C:\Users\mp\AppData\Local\R-MINI~1\envs\PYTHON~1.2\Lib\site-packages\tskit\trees.py", line 4421, in metadata
return self.metadata_schema.decode_row(self._ll_tree_sequence.get_metadata())
^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\mp\AppData\Local\R-MINI~1\envs\PYTHON~1.2\Lib\site-packages\tskit\trees.py", line 4428, in metadata_schema
return metadata_module.parse_metadata_schema(
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    SystemError: <functools._lru_cache_wrapper object at 0x000001D8D7D5BE20> returned a result with an exception set

  ── R Traceback ─────────────────────────────────────────────────────────
  ▆
  1. ├─ts$metadata
  2. └─reticulate:::`$.python.builtin.object`(ts, metadata)
  3.   └─reticulate:::py_get_attr_or_item(x, name, TRUE)
  4.     └─reticulate::py_get_attr(x, name)
  5.       └─reticulate:::py_get_attr_impl(x, name, silent)

Here's a slendr script which generates a broken SLiM and msprime tree sequence. No need to run this, both tree sequences are attached to this message:

# simulate a "broken" tree sequence ---------------------------------------

devtools::load_all() # must be in a checked out slim-on-windows branch of slendr
init_env()

SEED <- 4106790818

p <- population(name = "pop1", N = 700, time = 1)
model <- compile_model(populations = p, generation_time = 30, simulation_length = 1000)

output_slim <- "output_slim.trees"
output_msprime <- "output_msprime.trees"

slim(model, sequence_length = 1e6, recombination_rate = 0, random_seed = SEED, output = output_slim, load = FALSE)
msprime(model, sequence_length = 1e6, recombination_rate = 0, random_seed = SEED, output = output_msprime, load = FALSE)

@bodkan
Copy link
Owner Author

bodkan commented Feb 5, 2024

@bodkan wrote:
Accessing the metadata element of a ts tree-sequence object crashes for both SLiM and msprime tree sequences (so it must be something that they have in common!). Could it be the seed maybe? Have to do more digging. Already getting to an always breaking simulation run was quite a struggle.

Holy crap. It is the seed?!?

devtools::load_all()
init_env(quiet = TRUE)
  • This always runs:
MAX_INT <-  2147483647 # https://en.wikipedia.org/wiki/2,147,483,647#In_computing
SEED <- MAX_INT

p <- population(name = "pop1", N = 100, time = 1)
model <- compile_model(populations = p, generation_time = 30, simulation_length = 1000)

while (TRUE) { # infinite loop, careful
  cat("*")
  ts <- msprime(model, sequence_length = 1e3, recombination_rate = 0, random_seed = SEED)
}
  • This always crashes:
MAX_INT <-  2147483647
SEED <- MAX_INT + 1       # <==== NOTE THE +1!

p <- population(name = "pop1", N = 100, time = 1)
model <- compile_model(populations = p, generation_time = 30, simulation_length = 1000)

while (TRUE) { # infinite loop except on Windows
  cat("*")
  ts <- msprime(model, sequence_length = 1e3, recombination_rate = 0, random_seed = SEED)
}

This is the weirdest, craziest bug I've ever encountered, ever. How hilarious that all this frustrating debugging and digging boils down to... a huge-ass number being a little bigger than allowed!

I still have no idea why ts.metadata crashes in reticulated Python on Windows only once every second access (seriously, WTF?!). If it's int overflow, I would expect it to happen every time? But I really have no idea. Clearly this has something to do with the caching of the metadata access method (the primary error I was getting was related to method caching).

@rdinnager, I think you can ignore my request for help in the message above. I'll adjust the range of integers that slendr picks a random seed from and also put a constraint on the range of user-provided random seeds -- for both slim() and msprime() functions.

Then I'll run the unit tests again and see what happens. 🤞

@bodkan
Copy link
Owner Author

bodkan commented Feb 5, 2024

Oh! It doesn't look like slendr is generating a random seed in case the user doesn't provide any. If no random seed is explicitly specified, then slendr lets SLiM or msprime pick their own.

This is interesting because if SLiM tends to provide a wider range of integers as random seeds than msprime, it would explain why it's been only slim()-based unit tests failing this whole time, never msprime unit tests -- which is what threw me off originally as I assumed it's something specific to SLiM tree sequences. Fun!

Given that the error happens someplace slendr has no control over (low-level interface between R/embedded-Python), maybe a compromise (at least intermediate one) is for slendr to actually generate a random seed of its own, restricted to be less than .Machine$integer.max.

@bhaller
Copy link
Collaborator

bhaller commented Feb 5, 2024

Wow, craziness. Nice debugging!

@bodkan
Copy link
Owner Author

bodkan commented Feb 6, 2024

OK, so all the slendr unit tests are passing across macOS / Linux / Windows with no exceptions. I also verified that all the vignettes (which contain much more ambitious simulations runs) are building on all three systems.

It really was the random seed being bigger than a 32-bit integer causing the intermittent crashes on Windows during parsing of the metadata (in which slendr saves dictionary with some custom metadata, including the seed used). Some very obscure interaction at the level of R-Python interface that's impossible for me to explain because it would require digging into that interface at a C level. I'm sure it would be fun to figure out but I can't put more time into this.

As far as slendr is concerned, this is fixed by making sure that random seeds are less than 2,147,483,647 which makes Windows happy in the current state of things.

Once this is merged, I will put out a new slendr version to release the Windows support to the wild. I'm really happy to have this in because porting the entire unit test suite to Windows required changes to make the code more robust overall.

I'm pretty sure there are still smaller issues here and there but discovering those would require using slendr and it's tskit interface on Windows for more serious work. Not something I can manage on a slow-oh-my-god-why-is-it-so-slow virtualized Windows system with a hilariously low screen resolution (I can't even get a proper fullscreen RStudio view! :)). Those bugs can be taken care of with individual GitHub issues.

What a journey this has been. Thank you everyone for helping out!

image

@bodkan bodkan merged commit e9bfaad into main Feb 6, 2024
6 of 8 checks passed
@bodkan bodkan deleted the slim-on-windows branch February 6, 2024 11:40
@bodkan bodkan restored the slim-on-windows branch March 21, 2024 13:47
@bodkan bodkan deleted the slim-on-windows branch March 21, 2024 13:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants