Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generalize the R-tskit interface towards non-slendr tree sequences #91

Merged
merged 104 commits into from
Jun 22, 2022

Conversation

bodkan
Copy link
Owner

@bodkan bodkan commented Apr 28, 2022

Finally some solid progress towards #85.

We can now:

  • load a standard (non-slendr) SLiM and msprime tree sequence (ts_load())
  • simplify a non-slendr tree sequence (ts_simplify())
  • convert one tree from the tree sequence to a standard R phylogenetic ape format phylo (ts_phylo())

Not completely happy with the code. Basically, I'm currently adding things like

if (!is.null(attr(ts, "model"))) { // if the tree sequence does not have slendr-specific attribute...
    // ... do some slendr-specific processing
}

Which is not pretty because these are often scattered throughout those functions.

The "specific processing" above generally involves:

  • converting tskit node ages to "slendr-model-specific" time units
  • sprinkling slendr's nice readable symbolic names for individuals on top of the raw tree-sequence table data.

Next steps:

  • Make sure ts_recapitate does what it's supposed to do even on non-slendr tree sequences.
  • Implement conversion of SLiM tree-sequence individual coordinates (raster-based) to the sf format. This will enable SLiM users who don't use slendr for simulations access all the awesome geospatial libraries for R via the slendr R-tskit interface. They will simply get an sf-compatible data frame with times and locations of everything recorded in their tree sequence.
  • Implement a variation of tree-sequence unit tests already in slendr. These will use simple SLiM scripts and msprime.sim_ancestry(N) calls to generate SLiM and msprime data and make sure that ts_data(), ts_samples(), and friends work as they should even on those.

I already tested the tskit popgen statistics methods (ts_divergence(), ts_f4(), etc.) and they seem to work without me having to make any changes. This was a pleasant surprise: I had forgotten that those functions already worked on non-symbolic integer tskit node indices, so we should be good on that front. Good for me. :)

@bodkan
Copy link
Owner Author

bodkan commented May 1, 2022

Hello @bhaller, I have quite an exciting update to share. Also tagging @FerRacimo so he is kept in the loop, because we spoke about this a couple of times in the past.

I have now finished a draft implementation of the tskit-powered tree-sequence functionality in slendr towards non-slendr tree sequences. I.e, it is now possible to take any non-slendr tree sequence (either from SLiM or msprime), load/simplify/recapitate/process it with slendr in R, and run tree-sequence popgen statistics on it (at least those currently supported by slendr). Spatial tree sequences are also supported, so all the geospatial R goodies are now available.

This gets us significantly closer towards having slendr as a toolkit for analyzing tree sequence data in general (non-spatial and spatial!). In fact, barring any bugs (still working on unit tests), there should be anything in the tskit-R interface of slendr that would work on slendr tree sequences at the moment but wouldn't work on non-slendr tree sequence.

A couple of quick examples:

Non-spatial SLiM (non-slendr) tree sequences

Consider the following simple SLiM script, which creates a couple of subpopulations (with different Ne) splitting from an ancestral p1 subpopulation:

initialize() {
	initializeTreeSeq();
	initializeMutationRate(0);
	initializeMutationType("m1", 0.5, "f", 0.0);
	initializeGenomicElementType("g1", m1, 1.0);
	initializeGenomicElement(g1, 0, 1e6);
	initializeRecombinationRate(1e-8);
}

1 {
	sim.addSubpop("p1", 10);
}

1000 {
	sim.addSubpopSplit("p2", 500, p1);
}

3000 {
	sim.addSubpopSplit("p3", 2500, p1);
}

5000 {
	sim.addSubpopSplit("p4", 10000, p1);
}

15000 late() {
	sim.treeSeqOutput("~/Desktop/nonspatial.trees");
}

We can use slendr to load the output tree sequence, simplify it, and overlay mutations on it using the standard functionality originally developed for to slendr tree sequences:

ts <- ts_load("~/Desktop/nonspatial.trees") %>%
  ts_simplify() %>%
  ts_mutate(mutation_rate = 1e-7)

We can extract information about individual's names, nodes, population assignments, etc. just as with any slendr tree sequence (ts_data() basically loads raw nodes and individuals tree sequences tables, performs a couple of join operations, and presents the whole thing in a nice unified form for interactive data analysis):

data <- ts_data(ts) %>% dplyr::filter(sampled)
data

> slendr 'table' object 
> --------------------- 
> data was extracted from a SLiM forward time model
> 
> summary of the table data contents:
> 
> total:
>   - 13010 'sampled' individuals
>   - 0 'retained' individuals
>   - 0 no nodes from 'recapitated' individuals
> --------------------- 
> oldest sampled individual: 0 time units 'before present' 
> youngest sampled individual: 0 time units 'before present' 
> 
> oldest node: 0 time units 'before present' 
> youngest node: 0 time units 'before present' 
> --------------------- 
> overview of the underlying table object:
> 
> # A tibble: 26,020 × 9
>    pop   ind_id node_id  time sampled remembered retained alive pedigree_id
>    <chr>  <dbl>   <int> <dbl> <lgl>   <lgl>      <lgl>    <lgl>       <dbl>
>  1 p1         0   20000     0 TRUE    FALSE      FALSE    TRUE    137163000
>  2 p1         0   20001     0 TRUE    FALSE      FALSE    TRUE    137163000
>  3 p1         1   20002     0 TRUE    FALSE      FALSE    TRUE    137163001
>  4 p1         1   20003     0 TRUE    FALSE      FALSE    TRUE    137163001
>  5 p1         2   20004     0 TRUE    FALSE      FALSE    TRUE    137163002
>  6 p1         2   20005     0 TRUE    FALSE      FALSE    TRUE    137163002
>  7 p1         3   20006     0 TRUE    FALSE      FALSE    TRUE    137163003
>  8 p1         3   20007     0 TRUE    FALSE      FALSE    TRUE    137163003
>  9 p1         4   20008     0 TRUE    FALSE      FALSE    TRUE    137163004
> 10 p1         4   20009     0 TRUE    FALSE      FALSE    TRUE    137163004
> # … with 26,010 more rows

Moving on to tskit statistics, we can use the data table above to extract a list of nodes belonging to each population (this is what various tskit tree-sequence statistics operate on, and slendr follows that design). Here we are computing the nucleotide diversity in each of the four populations using the ts_diversity() function:

sample_sets <- split(data$node_id, data$pop)

# compute nucleotide diversity in each population
# (any other ts_*() tskit R interface function should work)
ts_diversity(ts, sample_sets)

> # A tibble: 4 × 2
>   set    diversity
>   <chr>      <dbl>
> 1 p1    0.00000371
> 2 p2    0.000192  
> 3 p3    0.000900  
> 4 p4    0.00161   

Just as with slendr tree sequences (as demonstrated in our preprint) we can get a individual trees too, extracted in the in the phylogenetic format provided by the ape R package. Here we first simplify the tree sequence even further to just 10 nodes to make things manageable:

samples <- sample(data$node_id, 10)
ts_small <- ts_simplify(ts, simplify_to = samples)

# extract the 42nd tree in the genealogy to an R 'phylo' format
tree <- ts_phylo(ts_small, 42)
tree

> Phylogenetic tree with 10 tips and 9 internal nodes.
> 
> Tip labels:
>   2, 6, 3, 9, 5, 4, ...
> Node labels:
>   209, 13, 17, 28, 129, 200, ...
> 
> Rooted; includes branch lengths.

Once we have that R tree object, we can use packages like ggtree to visualize the tree (any other phylogenetic package would work too). Note that because nodes of 'ape phylo' trees must conform to a strict format (they must be labelled 1...N), we will extract the information about the node IDs in the tskit tree sequence data to be able to plot them in the tree.

Significantly less eye candy than the tree in our preprint, but it conveys the idea:

labels <- ts_data(tree) %>% select(node = phylo_id, tskit_id = node_id)

ggtree(tree) %<+% labels +
  geom_label(aes(label = tskit_id))

image

msprime (non-slendr) tree sequences

The whole thing works also for msprime tree sequences (not super surprising, given that it's all tskit under the hood).


Spatial SLiM (non-slendr) tree sequences

Furthermore, the generalized interface also supports slendr's spatial tree-sequence features, with all bells and whistles.

If we take the following spatial model (modified from the SLiM manual):

initialize() {
	initializeSLiMOptions(keepPedigrees=T, dimensionality="xy");
	initializeTreeSeq();
	initializeMutationRate(1e-7);
	initializeMutationType("m1", 0.5, "f", 0.0);
	initializeGenomicElementType("g1", m1, 1.0);
	initializeGenomicElement(g1, 0, 1e6);
	initializeRecombinationRate(1e-8);
}
1 early() {
	sim.addSubpop("p1", 500);
	
	// initial positions are random in ([0,1], [0,1])
	p1.individuals.x = runif(p1.individualCount);
	p1.individuals.y = runif(p1.individualCount);
}
modifyChild() {
	// draw a child position near the first parent, within bounds
	do child.x = parent1.x + rnorm(1, 0, 0.02);
	while ((child.x < 0.0) | (child.x > 1.0));
	
	do child.y = parent1.y + rnorm(1, 0, 0.02);
	while ((child.y < 0.0) | (child.y > 1.0));
	
	return T;
}
1: late() {
    sim.treeSeqRememberIndividuals(sim.subpopulations.individuals, permanent = F);
}

10000 late() {
	sim.treeSeqOutput("~/Desktop/spatial.trees");
}

We can load and simplify the output tree sequence in the standard slendr way:

ts <- ts_load("~/Desktop/spatial.trees") %>% ts_simplify()

We can then access the spatio-temporal data embedded in the output tree sequence in the standard slendr way (note the "spatial" sf column location with the POINT data type):

data <- ts_data(ts)
data

> slendr 'table' object 
> --------------------- 
> data was extracted from a SLiM forward time model
> 
> summary of the table data contents:
> 
> total:
>   - 500 'sampled' individuals
>   - 777 'retained' individuals
>   - 0 no nodes from 'recapitated' individuals
> --------------------- 
> oldest sampled individual: 0 time units 'before present' 
> youngest sampled individual: 0 time units 'before present' 
> 
> oldest node: 1355 time units 'before present' 
> youngest node: 0 time units 'before present' 
> --------------------- 
> overview of the underlying sf object:
> 
> # A tibble: 1,817 × 10
>    pop   ind_id node_id  time               location sampled remembered retained alive pedigree_id
>    <chr>  <dbl>   <int> <dbl>                <POINT> <lgl>   <lgl>      <lgl>    <lgl>       <dbl>
>  1 p1         0       0     0  (0.3744914 0.2729115) TRUE    FALSE      TRUE     TRUE      5000000
>  2 p1         0       1     0  (0.3744914 0.2729115) TRUE    FALSE      TRUE     TRUE      5000000
>  3 p1         1       2     0 (0.2956575 0.03230727) TRUE    FALSE      TRUE     TRUE      5000001
>  4 p1         1       3     0 (0.2956575 0.03230727) TRUE    FALSE      TRUE     TRUE      5000001
>  5 p1         2       4     0 (0.3947814 0.02905613) TRUE    FALSE      TRUE     TRUE      5000002
>  6 p1         2       5     0 (0.3947814 0.02905613) TRUE    FALSE      TRUE     TRUE      5000002
>  7 p1         3       6     0  (0.1304009 0.1380062) TRUE    FALSE      TRUE     TRUE      5000003
>  8 p1         3       7     0  (0.1304009 0.1380062) TRUE    FALSE      TRUE     TRUE      5000003
>  9 p1         4       8     0  (0.4263232 0.3269192) TRUE    FALSE      TRUE     TRUE      5000004
> 10 p1         4       9     0  (0.4263232 0.3269192) TRUE    FALSE      TRUE     TRUE      5000004
> # … with 1,807 more rows

Because we get the tree sequence converted to the spatial sf data format, we can use standard geospatial packages to use any spatial data analysis methods that those packages provide.

Just to demonstrate, we can trivially plot the location of each recorded node (ggplot is not mandatory, it's just something I'm familiar with!):

ggplot() + geom_sf(data = data, aes(color = time), alpha = 0.5)

image

We can also collect spatio-temporal ancestry information of a particular node (i.e. the times and locations of all of its ancestors all the way to the root, with each "link" in the plot signifying parent-child edge somewhere along the tree sequence) and plot it on a 2D surface (x and y dimensions [0, 1]). The plot is obviously chaotic, but should convey the idea (the "focal node" 0 is highlighted in red). It's the same plot we have in the last figure of our paper.

ancestral_links <- ts_ancestors(ts, 0)

ggplot() +
    geom_sf(data = ancestral_links, size = 0.5, aes(alpha = parent_time)) +
    geom_sf(data = sf::st_set_geometry(ancestral_links, "parent_location"), aes(color = parent_time)) +
    geom_sf(data = data[data$node_id == 0, ], size = 3, color = "red")

image

@bodkan
Copy link
Owner Author

bodkan commented May 1, 2022

I will probably collect those simple examples in the previous message into a separate vignette. It won't add much that wouldn't be covered by the other vignettes, but it might be helpful to the people who use msprime or SLiM without slendr (basically everyone at this point) that there is a possibility to simulate data with whatever means and still analyze them in R via slendr's tskit interface.

@bodkan
Copy link
Owner Author

bodkan commented Jun 22, 2022

Hi folks, especially @bhaller and @petrelharp.

First of all, sorry for the delay with this PR. Things at $DAY_JOB went kind of sideways and I was forced to drop my work on slendr for more than a month.

I’m finally back and thanks to your comments and suggestions I have now made the code much cleaner and more robust. I now have a reasonable version of the slendr R-tskit interface that can operate on standard non-slendr SLiM and msprime tree sequences.

This PR now accomplishes what I was aiming for and the new code is ready for testing by people who use SLiM/msprime (but not slendr) for simulation and who would like to check out slendr’s features for tree-sequence analysis. I’ve already had a number of people express interest in this, so the sooner I can get things into their hands the better. Once the GitHub Actions CI unit tests pass (currently in progress), I will merge the PR.


In case you remember what the various issues were, then very briefly (if you don’t, just ignore this, I'm logging this also for my own benefit):

  • I am no longer reinventing the wheel with the sampled/focal column I’ve been adding to the slendr node/individual table. I wasn’t aware of the existence of TreeSequence.samples() which does the same thing so I'm using that where applicable.
  • ts_load() and friends no longer treat tree sequences as coming from either SLiM or msprime and ignoring everything else with a warning (which was the case in the original version of this PR). Now, tree sequence is either coming from SLiM, or its regarded as "generic" tskit tree sequence (msprime tree sequence or other source). This refactoring made the internal code much cleaner with less awkward special case conditions.
  • There is a new vignette which expands on the overview from the first comment of this PR. As discussed, perhaps @bhaller and I could talk about which (if any) of this stuff would be useful to show in the SLiM manual. I kept this vignette extremely short, because everything in it is described in all other vignettes in much greater detail. The point of the new vignette is to show "you can use SLiM and msprime, then use standard slendr functions to load/process/analyze tree sequences". It will be available here once the PR is merged.
  • I added a couple of rudimentary unit tests (such as this one and this one) which run a simple pure SLiM or msprime script and also a slendr version of the same (very simple) model and compare the resulting tree-sequence tables and trees. Additional unit tests will be added if/when issues arise by people who want to use slendr for data analysis (but not necessarily simulation).

I have talked to a bunch of people recently who use SLiM for awesome spatial selection work (range expansion surfing, etc.) — they are not that interested in slendr as a simulation tool (they need much more detailed control over spatial dynamics than what slendr provides) but they are very interested in the spatial R-tskit interface of slendr. I have also had two students working on slendr-adjacent things who rely on this PR to be merged.

Because of that, as I mentioned above, I will merge this PR as soon as I know that the GitHub CI checks and unit tests pass. If there are corner cases that I have not taken into account, the brave users who reached out to me offered to help me fix those (the only thing needed will be a SLiM script that produces a tree sequence that fails to load/process/analyze). That said, given that slendr tree sequences are normal tskit tree sequences, I don’t expect dramatic problems. The only thing the current implementation of R-tskit layer is doing is that it’s ignoring slendr-specific metadata when that metadata is missing, and uses the same code for non-slendr tree sequences otherwise.

There are other updates too, but I will leave those to an email update that I’ll send soon.

Thanks again for your input!

@bhaller
Copy link
Collaborator

bhaller commented Jun 22, 2022

Hi @bodkan. All sounds great! Happy to talk about adding a slendr-based recipe to the SLiM manual. It could go in as section 17.11, if having it added to the end makes sense to you, or somewhere internal to chapter 17. The description of slendr in section 1.10 could also probably use revision, given these changes. I'm a bit swamped at the moment – I'm at a workshop thing for the next 3.5 weeks that pretty much takes up all my time. If you'd like to propose changes/additions, I'll be happy to add them; if you want to have a zoom to discuss first, that will wait at least 1.5 weeks, very possibly longer, for a more quiet time in my life. :->

One question: is slendr now SLiM 4 compatible? I hope so. The changes needed ought to be fairly minimal, and with a little elbow grease it should be possible to have your scripts run on both 3.7 and 4.0 (let me know if you need any advice on that). It would be great if the SLiM 4 release, which I plan to do in early to mid August, did not break slendr and create a crisis. :->

Exciting! I'm curious who you've been talking to about range expansion surfing, etc. :->

@bodkan
Copy link
Owner Author

bodkan commented Jun 22, 2022

My goodness. 4 hour GitHub Actions CRAN checks. (Not running the tests on Windows because I don't have an access to a Windows machine with an R development environment set up. But given that SLiM now runs on Windows without issues, it's about time I worked on this at some point soon.)

Merging the PR now and I will also tag a new version.

Currently working on the first CRAN submission. Wish me luck, if past experiences are any indication, I'm going to need it...

@bodkan
Copy link
Owner Author

bodkan commented Jun 22, 2022

Hello @bhaller

Happy to talk about adding a slendr-based recipe to the SLiM manual. It could go in as section 17.11, if having at added to the end makes sense to you, or somewhere internal to chapter 17.

I will take a closer look and let you know. 👍

The description of slendr in section 1.10 could also probably use revision, given these changes.

Absolutely, I will take a look and suggest adjustments where necessary.

I'm a bit swamped at the moment – I'm at a workshop thing for the next 3.5 weeks that pretty much takes up all my time. If you'd like to propose changes/additions, I'll be happy to add them; if you want to have a zoom to discuss first, that will wait at least 1.5 weeks, very possibly longer, for a more quiet time in my life. :->

Same, the only reason I had time for all this is that I dropped everything else unconditionally to push the first release to CRAN and submit the paper, now that lots of people tried slendr and no major disaster has happened. I don't think I'll have a lot of time for those changes/additions over the next 2-3 weeks either. But I will be in touch sometime in July. (Even if I manage earlier, there's no pressure.)

One question: is slendr now SLiM 4 compatible? I hope so. The changes needed ought to be fairly minimal, and with a little elbow grease it should be possible to have your scripts run on both 3.7 and 4.0 (let me know if you need any advice on that). It would be great if the SLiM 4 release, which I plan to do in early to mid August, did not break slendr and create a crisis. :->

I had absolutely no time to test things with SLiM 4 just yet :( But it is important thing to check, so I opened an issue (#98) and will make sure to test this soon. Would you say that running the current slendr unit tests against the SLiM 4 binary be a good start? I.e. that only a few syntactic/semantic would be needed that I could fix a few errors that might arise? In any case, I will post potential issues there and ping you when necessary. Thanks for bringing this up.

Exciting! I'm curious who you've been talking to about range expansion surfing, etc. :->

In this particular instance mostly Excoffier people, so not that surprising. :)

@bodkan bodkan merged commit f7237fb into main Jun 22, 2022
@bhaller
Copy link
Collaborator

bhaller commented Jun 22, 2022

...Same, the only reason I had time for all this is that I dropped everything else unconditionally to push the first release to CRAN and submit the paper, now that lots of people tried slendr and no major disaster has happened. I don't think I'll have a lot of time for those changes/additions over the next 2-3 weeks either. But I will be in touch sometime in July. (Even if I manage earlier, there's no pressure.)

Sounds good.

One question: is slendr now SLiM 4 compatible? I hope so. The changes needed ought to be fairly minimal, and with a little elbow grease it should be possible to have your scripts run on both 3.7 and 4.0 (let me know if you need any advice on that). It would be great if the SLiM 4 release, which I plan to do in early to mid August, did not break slendr and create a crisis. :->

I had absolutely no time to test things with SLiM 4 just yet :( But it is important thing to check, so I opened an issue (#98) and will make sure to test this soon. Would you say that running the current slendr unit tests against the SLiM 4 binary be a good start? I.e. that only a few syntactic/semantic would be needed that I could fix a few errors that might arise? In any case, I will post potential issues there and ping you when necessary. Thanks for bringing this up.

I expect it to only be syntactic tweaks, yes. By and large, SLiMgui will offer to "autofix" the issues for you if you open a generated slendr script in SLiMgui and run it; each error encountered should produce a new "autofix" suggestion. Accepting those fixes will get you a working SLiM 4 script (although there might be a case or two where autofix is not sufficient). The SLiM 4 beta has links to more information about exactly what changed. The thing that might bite you the most is that sim.generation no longer exists; it is now community.tick or sim.cycle, to get the same sort of information (see the docs regarding what the difference between those two things is). If you use it a lot, you might find it simpler to just make a user-defined function named time() that checks the version number and returns the correct property, or something. Sorry about that; I had my reasons, but the changeover has created a lot fo work for me, too. :->

Exciting! I'm curious who you've been talking to about range expansion surfing, etc. :->

In this particular instance mostly Excoffier people, so not that surprising. :)

:->

bodkan added a commit that referenced this pull request Jun 24, 2022
Generalize the R-tskit interface towards non-slendr tree sequences
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants