Generalize the R-tskit interface towards non-slendr tree sequences #91
Hello @bhaller, I have quite an exciting update to share. Also tagging @FerRacimo so he is kept in the loop, because we spoke about this a couple of times in the past.
I have now finished a draft implementation that generalizes the tskit-powered tree-sequence functionality in slendr towards non-slendr tree sequences. I.e., it is now possible to take any non-slendr tree sequence (either from SLiM or msprime), load/simplify/recapitate/process it with slendr in R, and run tree-sequence popgen statistics on it (at least those currently supported by slendr). Spatial tree sequences are also supported, so all the geospatial R goodies are now available. This gets us significantly closer towards having slendr as a toolkit for analyzing tree sequence data in general (non-spatial and spatial!). In fact, barring any bugs (still working on unit tests), there shouldn't be anything in the tskit-R interface of slendr that would work on slendr tree sequences at the moment but wouldn't work on non-slendr tree sequences.
A couple of quick examples:
Non-spatial SLiM (non-slendr) tree sequences
Consider the following simple SLiM script, which creates a couple of subpopulations (with different Ne) splitting from an ancestral p1 subpopulation:
We can use slendr to load the output tree sequence, simplify it, and overlay mutations on it using the standard functionality originally developed for slendr tree sequences:
We can extract information about individuals' names, nodes, population assignments, etc., just as with any slendr tree sequence:
Moving on to tskit statistics, we can use the data table above to extract a list of nodes belonging to each population (this is what various tskit tree-sequence statistics operate on, and slendr follows that design). Here we compute the nucleotide diversity in each of the four populations:
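To make the "list of nodes per population" idea concrete, here is a minimal sketch in plain Python. It is not slendr's or tskit's code: the rows, the column names (`name`, `pop`, `node_id`), and the toy haplotypes are all made up for illustration. It groups node IDs by population and computes nucleotide diversity as the mean number of pairwise differences per unit of sequence length.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows mimicking a table of sampled nodes (column names are assumptions).
nodes = [
    {"name": "p1_ind1", "pop": "p1", "node_id": 0},
    {"name": "p1_ind1", "pop": "p1", "node_id": 1},
    {"name": "p2_ind1", "pop": "p2", "node_id": 2},
    {"name": "p2_ind1", "pop": "p2", "node_id": 3},
]

# Toy haplotypes keyed by node ID (0 = ancestral allele, 1 = derived allele).
haplotypes = {
    0: [0, 1, 0, 0],
    1: [1, 1, 0, 0],
    2: [0, 0, 1, 0],
    3: [0, 0, 1, 0],
}


def sample_sets(rows):
    """Group node IDs by population: one list of nodes per population."""
    sets = defaultdict(list)
    for row in rows:
        sets[row["pop"]].append(row["node_id"])
    return dict(sets)


def diversity(node_ids, haplotypes, sequence_length):
    """Mean pairwise differences among the given nodes, per unit of sequence."""
    pairs = list(combinations(node_ids, 2))
    diffs = sum(
        sum(a != b for a, b in zip(haplotypes[i], haplotypes[j]))
        for i, j in pairs
    )
    return diffs / len(pairs) / sequence_length


sets = sample_sets(nodes)
pi = {pop: diversity(ids, haplotypes, sequence_length=4) for pop, ids in sets.items()}
print(sets)
print(pi)
```

The per-population node lists built by `sample_sets()` are the same kind of input that tskit statistics take as sample sets; the toy `diversity()` function only stands in for the real computation.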
Just as with slendr tree sequences (as demonstrated in our preprint), we can get individual trees too, extracted in the phylogenetic format provided by the ape R package. Here we first simplify the tree sequence even further to just 10 nodes to make things manageable:
Once we have that R tree object, we can use packages like ggtree to visualize the tree (any other phylogenetic package would work too). Note that because nodes of 'ape phylo' trees must conform to a strict format (they must be labelled 1...N), we will extract the information about the node IDs in the tskit tree sequence data to be able to plot them in the tree. Significantly less eye candy than the tree in our preprint, but it conveys the idea:
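As a side note on the 1...N labelling constraint, the bookkeeping can be sketched in plain Python. This is an illustration only, not slendr's code: the function name `phylo_labels` and the example tree are hypothetical. It follows the ape convention that tips are numbered first and the root comes right after the tips, and it keeps a reverse lookup so the original tskit node IDs can still be shown in a plot.

```python
def phylo_labels(parent):
    """Map arbitrary node IDs to labels 1..N: tips first, then the root,
    then the remaining internal nodes."""
    nodes = set(parent) | set(parent.values())
    internal = set(parent.values())
    tips = sorted(nodes - internal)
    roots = sorted(n for n in internal if n not in parent)
    others = sorted(internal - set(roots))
    order = tips + roots + others
    return {node: label for label, node in enumerate(order, start=1)}


# Hypothetical tree given as child -> parent, using tskit-style node IDs.
parent = {0: 12, 3: 12, 7: 15, 12: 15}

labels = phylo_labels(parent)
# Reverse lookup: phylo label -> original tskit node ID, for annotating plots.
tskit_id = {label: node for node, label in labels.items()}
print(labels)
print(tskit_id)
```

Keeping `tskit_id` next to the relabelled tree is what lets a plotting package display the tree-sequence node IDs even though the tree object itself only knows labels 1...N.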
msprime (non-slendr) tree sequences
The whole thing works also for msprime tree sequences (not super surprising, given that it's all tskit under the hood).
Spatial SLiM (non-slendr) tree sequences
Furthermore, the generalized interface also supports slendr's spatial tree-sequence features, with all bells and whistles. If we take the following spatial model (modified from the SLiM manual):
We can load and simplify the output tree sequence in the standard slendr way:
We can then access the spatio-temporal data embedded in the output tree sequence in the standard slendr way (note the "spatial" sf column):
Because we get the tree sequence converted to the spatial sf data format, we can use standard geospatial packages to use any spatial data analysis methods that those packages provide. Just to demonstrate, we can trivially plot the location of each recorded node (ggplot is not mandatory, it's just something I'm familiar with!):
We can also collect spatio-temporal ancestry information of a particular node (i.e. the times and locations of all of its ancestors all the way to the root, with each "link" in the plot signifying a parent-child edge somewhere along the tree sequence) and plot it on a 2D surface (x and y dimensions [0, 1]). The plot is obviously chaotic, but should convey the idea (the "focal node" 0 is highlighted in red). It's the same plot we have in the last figure of our paper.
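The idea of collecting every parent-child "link" above a focal node can be sketched in plain Python. This is a simplified illustration under assumptions, not how slendr implements it: edges are given as hypothetical `(left, right, parent, child)` tuples, and the traversal only follows an edge if its genomic interval overlaps the interval through which the focal node's ancestry reached the child.

```python
from collections import defaultdict, deque


def collect_ancestry(edges, focal, sequence_length):
    """Return the set of (parent, child) links ancestral to `focal`."""
    by_child = defaultdict(list)
    for left, right, parent, child in edges:
        by_child[child].append((left, right, parent))

    links = set()
    start = (focal, 0.0, sequence_length)
    seen = {start}
    queue = deque([start])
    while queue:
        node, lo, hi = queue.popleft()
        for left, right, parent in by_child.get(node, []):
            # Part of the genome where this edge carries the focal node's ancestry.
            a, b = max(lo, left), min(hi, right)
            if a < b:
                links.add((parent, node))
                state = (parent, a, b)
                if state not in seen:
                    seen.add(state)
                    queue.append(state)
    return links


# Hypothetical edges: (left, right, parent, child).
edges = [
    (0.0, 0.5, 4, 0),
    (0.5, 1.0, 5, 0),
    (0.0, 1.0, 6, 4),
    (0.0, 0.5, 6, 5),  # does not overlap the interval on which 5 is an ancestor of 0
    (0.5, 1.0, 7, 5),
    (0.0, 1.0, 5, 1),  # node 1 is not an ancestor of node 0
]

links = collect_ancestry(edges, focal=0, sequence_length=1.0)
print(sorted(links))
```

Joining each link to the recorded time and location of its parent and child nodes would give the kind of table that can be drawn as line segments on the 2D surface.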
I will probably collect the simple examples from the previous message into a separate vignette. It won't add much that isn't covered by the other vignettes, but it might help people who use msprime or SLiM without slendr (basically everyone at this point) see that they can simulate data by whatever means and still analyze them in R via slendr's tskit interface.
Hi folks, especially @bhaller and @petrelharp. First of all, sorry for the delay with this PR. Things at I’m finally back and thanks to your comments and suggestions I have now made the code much cleaner and more robust. I now have a reasonable version of the slendr R-tskit interface that can operate on standard non-slendr SLiM and msprime tree sequences. This PR now accomplishes what I was aiming for and the new code is ready for testing by people who use SLiM/msprime (but not slendr) for simulation and who would like to check out slendr’s features for tree-sequence analysis. I’ve already had a number of people express interest in this, so the sooner I can get things into their hands the better. Once the GitHub Actions CI unit tests pass (currently in progress), I will merge the PR. In case you remember what the various issues were, then very briefly (if you don’t, just ignore this, I'm logging this also for my own benefit):
I have talked to a bunch of people recently who use SLiM for awesome spatial selection work (range expansion surfing, etc.). They are not that interested in slendr as a simulation tool (they need much more detailed control over spatial dynamics than what slendr provides) but they are very interested in the spatial R-tskit interface of slendr. I have also had two students working on slendr-adjacent things who rely on this PR being merged. Because of that, as I mentioned above, I will merge this PR as soon as I know that the GitHub CI checks and unit tests pass. If there are corner cases that I have not taken into account, the brave users who reached out to me offered to help me fix those (the only thing needed will be a SLiM script that produces a tree sequence that fails to load/process/analyze). That said, given that slendr tree sequences are normal tskit tree sequences, I don’t expect dramatic problems. The only thing the current implementation of the R-tskit layer does is ignore slendr-specific metadata when that metadata is missing; otherwise it uses the same code for non-slendr tree sequences. There are other updates too, but I will leave those to an email update that I’ll send soon. Thanks again for your input!
Hi @bodkan. All sounds great! Happy to talk about adding a slendr-based recipe to the SLiM manual. It could go in as section 17.11, if having it added to the end makes sense to you, or somewhere internal to chapter 17. The description of slendr in section 1.10 could also probably use revision, given these changes.
I'm a bit swamped at the moment – I'm at a workshop thing for the next 3.5 weeks that pretty much takes up all my time. If you'd like to propose changes/additions, I'll be happy to add them; if you want to have a Zoom to discuss first, that will wait at least 1.5 weeks, very possibly longer, for a more quiet time in my life. :->
One question: is slendr now SLiM 4 compatible? I hope so. The changes needed ought to be fairly minimal, and with a little elbow grease it should be possible to have your scripts run on both 3.7 and 4.0 (let me know if you need any advice on that). It would be great if the SLiM 4 release, which I plan to do in early to mid August, did not break slendr and create a crisis. :->
Exciting! I'm curious who you've been talking to about range expansion surfing, etc. :->
My goodness. 4-hour GitHub Actions CRAN checks. (Not running the tests on Windows because I don't have access to a Windows machine with an R development environment set up. But given that SLiM now runs on Windows without issues, it's about time I worked on this at some point soon.) Merging the PR now and I will also tag a new version. Currently working on the first CRAN submission. Wish me luck; if past experiences are any indication, I'm going to need it...
Hello @bhaller
I will take a closer look and let you know. 👍
Absolutely, I will take a look and suggest adjustments where necessary.
Same, the only reason I had time for all this is that I dropped everything else unconditionally to push the first release to CRAN and submit the paper, now that lots of people tried slendr and no major disaster has happened. I don't think I'll have a lot of time for those changes/additions over the next 2-3 weeks either. But I will be in touch sometime in July. (Even if I manage earlier, there's no pressure.)
I had absolutely no time to test things with SLiM 4 just yet :( But it is an important thing to check, so I opened an issue (#98) and will make sure to test this soon. Would you say that running the current slendr unit tests against the SLiM 4 binary would be a good start? I.e., that only a few syntactic/semantic changes would be needed, and that I could fix the few errors that might arise? In any case, I will post potential issues there and ping you when necessary. Thanks for bringing this up.
In this particular instance mostly Excoffier people, so not that surprising. :)
Sounds good.
I expect it to only be syntactic tweaks, yes. By and large, SLiMgui will offer to "autofix" the issues for you if you open a generated slendr script in SLiMgui and run it; each error encountered should produce a new "autofix" suggestion. Accepting those fixes will get you a working SLiM 4 script (although there might be a case or two where autofix is not sufficient). The SLiM 4 beta has links to more information about exactly what changed. The thing that might bite you the most is that sim.generation no longer exists; it is now community.tick or sim.cycle, to get the same sort of information (see the docs regarding what the difference between those two things is). If you use it a lot, you might find it simpler to just make a user-defined function named
:->
Generalize the R-tskit interface towards non-slendr tree sequences
Finally some solid progress towards #85.
We can now run the following on non-slendr tree sequences:
ts_load()
ts_simplify()
ts_phylo()
Not completely happy with the code. Basically, I'm currently adding things like
Which is not pretty because these are often scattered throughout those functions.
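One way to keep such checks from being scattered is to route them through a single helper. Below is a minimal sketch of that pattern in plain Python (the actual code is R and may look quite different). Everything here is hypothetical: the metadata layout, the `"slendr"` and `"sample_names"` keys, and both function names are assumptions made for illustration.

```python
def slendr_metadata(ts_metadata):
    """Return the slendr-specific metadata block, or None when the tree
    sequence did not come from slendr (hypothetical metadata layout)."""
    return (ts_metadata or {}).get("slendr")


def sample_names(ts_metadata, node_pops):
    """Name sampled nodes, using slendr's symbolic names when available."""
    meta = slendr_metadata(ts_metadata)
    if meta is not None:
        # slendr-specific processing: use the names stored in the metadata
        return list(meta["sample_names"])
    # generic fallback for SLiM/msprime tree sequences without slendr metadata
    return [f"{pop}_node{node}" for node, pop in sorted(node_pops.items())]


node_pops = {0: "p1", 1: "p2"}  # hypothetical node ID -> population name
with_slendr = sample_names({"slendr": {"sample_names": ["EUR_1", "AFR_1"]}}, node_pops)
without_slendr = sample_names({}, node_pops)
print(with_slendr)
print(without_slendr)
```

With the decision isolated in `slendr_metadata()`, each caller needs only one branch, and non-slendr tree sequences go through the same code path otherwise.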
The "specific processing" above generally involves:
Next steps:
Make sure that ts_recapitate does what it's supposed to do even on non-slendr tree sequences.
Use SLiM scripts and msprime.sim_ancestry(N) calls to generate SLiM and msprime data and make sure that ts_data(), ts_samples(), and friends work as they should even on those.
I already tested the tskit popgen statistics methods (ts_divergence(), ts_f4(), etc.) and they seem to work without me having to make any changes. This was a pleasant surprise: I had forgotten that those functions already worked on non-symbolic integer tskit node indices, so we should be good on that front. Good for me. :)