Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Puzzling differences of statistics computed from recapitated SLiM tree sequences compared to theory #141

Closed
bodkan opened this issue Jun 13, 2023 · 16 comments

Comments

@bodkan
Copy link
Owner

bodkan commented Jun 13, 2023

Hello @bhaller and @petrelharp, I have recently encountered a strange tree-sequence-related and recapitation-related issue with my MSc student @IsabelMarleen. It all started with slendr, but I have now managed to cut down the hundreds of lines of slendr’s Eidos code to a minimum example, described below.

Given that my issue persists even when I take slendr out of the picture this probably belongs to the tskit or SLiM GitHub, but I decided to post it here first because I'm not entirely sure what's going on (and where).

Problem setup

Let's say that I simulated tree sequences from three different simulations of a constant Ne single-population model (Ne = 1000, sequence length L = 10e6 bp, recombination rate rho = 1e-8):

i. A standard msprime simulation via msprime.sim_ancestry(). In the attached full R script, I run this with a helper R function as simulate_msprime_ts(Ne = 1000, L = 10e6, rho = 1e-8).

ii. The following SLiM script (I will call this "v1"):

initialize() {
  initializeTreeSeq();
  initializeMutationRate(0);
  initializeMutationType("m0", 0.5, "f", 0.0);
  initializeGenomicElementType("g1", m0, 1.0);
  initializeGenomicElement(g1, 0, 1e+07 - 1);
  initializeRecombinationRate(1e-08);
}
1 late() { // <<<------ NOTE THIS LINE --------
  sim.addSubpop("p0", 1000);
}
1000 late() {
  sim.treeSeqOutput("/tmp/v1.trees");
}

In the analysis below, I get a simulated tree sequence from this script using a helper R function like this simulate_slim_ts(Ne = 1000, time = 1000, L = 10e6, rho = 1e-8, "v1") (the function is defined in the attached R script).

iii. And finally the following SLiM script (I will call this "v2"):

initialize() {
  initializeTreeSeq();
  initializeMutationRate(0);
  initializeMutationType("m0", 0.5, "f", 0.0);
  initializeGenomicElementType("g1", m0, 1.0);
  initializeGenomicElement(g1, 0, 1e+07 - 1);
  initializeRecombinationRate(1e-08);
}
1 early() { // <<<------ NOTE THIS LINE --------
  sim.addSubpop("p0", 1000);
}
1000 late() {
  sim.treeSeqOutput("/tmp/v2.trees");
}

In the analysis below, I get a simulated tree sequence from this script using a helper R function like this simulate_slim_ts(Ne = 1000, time = 1000, L = 10e6, rho = 1e-8, "v2") (the function defined in the attached R script).

Note that the only difference between "v1" and "v2" is the timing of the creation of the population "p0" (early() or late()) .

How I ran the simulations

Using i-iii above, I ran 50 replicates to generate the following tree-sequence outputs from:

i. the msprime run (item i above)

ii. the "v1" SLiM run (item ii above) -- two versions:
a. letting it run for 30000 ticks to ensure a complete coalescence (basically including a "burn-in");
b. letting it run for 1000 ticks, then recapitating it to ensure a complete coalescence (using pyslim.recapitate with Ne and recombination rate rho as used in the forward simulation).

iii. the "v2" SLiM run (item iii above) -- two versions:
a. letting it run for 30000 ticks to ensure a complete coalescence (basically including a "burn-in");
b. letting it run for 1000 ticks, then recapitating it to ensure a complete coalescence (using pyslim.recapitate with Ne and recombination rate rho as used in the forward simulation).

Then, for each of those five tree sequences (i, ii.a, ii.b, iii.a, iii.b as above), I computed <tree-sequence object>.diversity(sample_sets = [<100 sample nodes>], mode = "branch").

A couple of notes:

  • There is no simplification anywhere. The tree sequences are loaded and recapitated immediately after the simulation.
  • I get the same pattern as in the figure below even with allele_frequency_spectrum() or divergence() methods of tskit.

Problem itself

My expectation was, that all five tree sequences should give me the same diversity. In particular, that it should not matter whether I run a SLiM simulation "long enough" (to achieve complete coalescence of all genealogies via a burn-in) or if I run the simulation for a short(er) time and then follow it up with recapitation.

However, this is not the case. What I see is shown in the following figure. Recapitation gives me the diversity consistent with the coalescent msprime simulation and with population genetic theory only when I run the "v2" version of the SLiM script as shown above. If I run the "v1" version, the diversity is consistently lower.

image

My expectation

I understand that whether the p0 population is created at the end of the first tick ("v1" above) or at the beginning of the first tick ("v2" above) will influence how much time in total does p0 "spend" in the simulation -- the difference of just one generation.

However, it was my assumption that the only parameter that the recapitation coalescent simulation cares about is the Ne of the population at the time of the oldest recorded nodes in the tree sequence (because it involves putting a coalescent simulation "on top" of the oldest nodes)? But if this is what's going on, I'd think that it shouldn't matter if the most ancestral population is created in a "v1" 1 late() { ... } or "v2" 1 early() { ... }, because the oldest nodes (which then form the basis for the recapitation simulation) would be just one generation younger in v1 compared to v2?

Am I wrong? And if I am, could you please help me understand where I am wrong?

The reason I discovered this issue is that the SLiM simulation code in slendr basically uses the "v1" option. We were running some statistical tests doing comparisons of SLiM+burn-in vs SLiM+recapitation vs msprime vs popgen theory on the same model, and discovered this surprising pattern.

Full reproducible script

This probably won't help clarify the issue, but given that the figure for the test above needs multiple simulation runs to show the pattern, I'm attaching the full R script to replicate it. On top of the script is a bit of a setup (creating a minimum Python virtual environment with msprime + tskit + pyslim to be ran in shell) and a few R dependencies for parallelization and ggplot2-ing the results. It doesn't use slendr for anything, but it does rely on reticulate to call the Python code from R. That said, all the calls of Python methods of tskit/pyslim/msprime are quite trivial so I hope that even the R code will be readable (i.e., instead of calling tskit.load("<path>") as in Python you do tskit$load("<path>") in R).

Thank you for any help or pointers!

I'm also tagging @FerRacimo, because we discussed this issue not long ago.

Here's the complete R script: ts_reprex.R.zip.
If the description of the problem above is not sufficient and the R script is giving you problems, please let me know.

@petrelharp
Copy link
Collaborator

Wow, this is rather surprising. I'll have to dig in to this.

@bhaller
Copy link
Collaborator

bhaller commented Jun 13, 2023

Wow. Great bug report @bodkan. I will be curious to hear what @petrelharp finds!

One note: you write "There is no simplification anywhere", but that is probably not true. SLiM probably auto-simplifies at least once during the run of the model, and certainly at the call to treeSeqOutput(). But those simplifies will keep the first generation individuals, so they should not be important to what's happening here, I imagine (?).

@bodkan
Copy link
Owner Author

bodkan commented Jun 13, 2023

Thanks @petrelharp and @bhaller! I was a bit worried that I'm missing something totally obvious (even after all this time, tree sequences are always ready to surprise me), which is why I conservatively submitted an issue here first.

One note: you write "There is no simplification anywhere", but that is probably not true. SLiM probably auto-simplifies at least once during the run of the model, and certainly at the call to treeSeqOutput(). But those simplifies will keep the first generation individuals, so they should not be important to what's happening here, I imagine (?).

Ah, yes, of course! Thank you for the clarification. What I meant to say is that I do not explicitly call simplify() before calling recapitate(), because I know from the pyslim docs on the topic that this can cause problems unless taken care of via keep_input_roots=True argument of simplify().

Speaking of which, I did wonder if the issue described above is related to this note from the same section of the docs but I didn't want to speculate too much because I'm not familiar with the low-level details of how this is all done under the hood:

[...] suppose our SLiM simulation has a single diploid who has reproduced by clonal reproduction for 1,000 generations, so that the final tree sequence is just two vertical lines of descent going back to the two chromosomes in the initial individual alive 1,000 generations ago. Recapitation would produce a shared history for these two chromosomes, that would coalesce some time longer ago than 1,000 generations. However, if we simplified first, then those two branches going back 1,000 generations would be removed, since they don’t convey any information about the shape of the tree; and so recapitation might produce a common ancestor more recently than 1,000 generations, which would be inconsistent with the SLiM simulation.

After all, the lower apparent diversity in the middle ii.b boxplot in the figure above would be consistent with the recapitated MRCA to be younger (and the genealogies to be "shallower") than expected by theory? I think? But that's just a hunch.

@bhaller
Copy link
Collaborator

bhaller commented Jun 13, 2023

I don't see how that would happen, since you don't simplify in python; but yes, it seems possibly related.

@petrelharp
Copy link
Collaborator

petrelharp commented Jun 14, 2023

Here's a preliminary report:

Well, the surprising thing is that early versus late gets different diversities after recapitation. Here's a try at independently reproducing that:

With this as late.slim:

initialize() {
  initializeTreeSeq();
  initializeMutationRate(0);
  initializeMutationType("m0", 0.5, "f", 0.0);
  initializeGenomicElementType("g1", m0, 1.0);
  initializeGenomicElement(g1, 0, 1e+07 - 1);
  initializeRecombinationRate(1e-08);
}
1 late() { // <<<------ NOTE THIS LINE --------
  sim.addSubpop("p0", 1000);
}
1000 late() {
  sim.treeSeqOutput("late.trees");
}

and this as early.slim:

initialize() {
  initializeTreeSeq();
  initializeMutationRate(0);
  initializeMutationType("m0", 0.5, "f", 0.0);
  initializeGenomicElementType("g1", m0, 1.0);
  initializeGenomicElement(g1, 0, 1e+07 - 1);
  initializeRecombinationRate(1e-08);
}
1 early() { // <<<------ NOTE THIS LINE --------
  sim.addSubpop("p0", 1000);
}
1000 late() {
  sim.treeSeqOutput("early.trees");
}

after running

for x in early late; do slim -s 23 ${x}.slim; done

then the following script

import msprime, tskit, pyslim

Ne = 1000    # constant Ne of the population
rho  = 1e-8  # recombination rate

ts_early = tskit.load("early.trees")
ts_late = tskit.load("late.trees")

def recap(ts):
    return pyslim.recapitate(ts, ancestral_Ne=Ne, recombination_rate=rho)

div_early = ts_early.diversity(mode='branch')
div_late = ts_late.diversity(mode='branch')

print(f"early: {div_early}, late: {div_late}")

gets

early: 1609.5673186794847, late: 1606.1689324204615

... so, I'm not seeing the bug here.

AFAICT this ought to be doing just what the R code is doing, but apparently it is not. Further investigations to follow.

@bodkan
Copy link
Owner Author

bodkan commented Jun 14, 2023

Whoah, this is even weirder!

Let me try rerunning what you have on my setup first (versions of SLiM and the Python deps, just to be sure).

I will also try to eliminate the reticulate-d Python calls from my code. The one potential issue I could foresee there is some forced / broken implicit conversion of numeric types passed from R to Python… but who knows? I have been taking the reticulate magic for granted, that’s the one obvious uncontrolled variable here.

I will dig in as soon as I get to work.

Thank you so much for your help! I’m quite relieved to know that everything works on your end as it should!

@petrelharp
Copy link
Collaborator

No need to re-run: I can confirm that I get the same plot on my computer:
Screenshot from 2023-06-13 22-51-13

@petrelharp
Copy link
Collaborator

petrelharp commented Jun 14, 2023

Uh-oh: using the same .trees files as above and this minimal script:

reticulate::use_virtualenv("/tmp/tsreprex")
msprime <- reticulate::import("msprime")
tskit <- reticulate::import("tskit")
pyslim <- reticulate::import("pyslim")

Ne <- 1000    # constant Ne of the population
rho  <- 1e-8  # recombination rate

ts_slim_v1_short <- tskit$load("late.trees") |>
  pyslim$recapitate(ancestral_Ne = Ne, recombination_rate = rho)

ts_slim_v2_short <- tskit$load("early.trees") |>
  pyslim$recapitate(ancestral_Ne = Ne, recombination_rate = rho)

pi_slim_v1_short <- ts_slim_v1_short$diversity(mode = "branch")
pi_slim_v2_short <- ts_slim_v2_short$diversity(mode = "branch")

sprintf("early: %f, late: %f", pi_slim_v2_short, pi_slim_v1_short)

I get consistently something like

early: 4308.810302, late: 3347.565959

... so, this seems to be reproducing the bug.

And, it seems to enter through recapitate: diversity before recapitation is (nearly) the same.

@petrelharp
Copy link
Collaborator

I currently have no idea what's going on - hopefully, you know where to look?

@bodkan
Copy link
Owner Author

bodkan commented Jun 14, 2023

OK, I'm finally at the office and have to run to another meeting soon, but I immediately noticed one thing (I was reading this on my phone before): are you sure you are recapitating in your Python script? You defined the recap() helper function in your Python script but it doesn't look like you're calling it, at least in the version of the script you pasted here.

If I modify your script in this way (just adding the recap() calls with a set random seed):

import msprime, tskit, pyslim

Ne = 1000   # constant Ne of the population
rho = 1e-8  # recombination rate

ts_early = tskit.load("early.trees")
ts_late = tskit.load("late.trees")

def recap(ts):
    return pyslim.recapitate(ts, ancestral_Ne=Ne, recombination_rate=rho, random_seed=42)

div_early = ts_early.diversity(mode='branch')
div_late = ts_late.diversity(mode='branch')

div_early_recap = recap(ts_early).diversity(mode='branch')
div_late_recap = recap(ts_late).diversity(mode='branch')

print(f"no recapitation -- early: {div_early}, late: {div_late}")
print(f"with recapitation -- early: {div_early_recap}, late: {div_late_recap}")

I get the following on the early.trees and late.trees files produced by the two minimal SLiM scripts:

no recapitation -- early: 1609.5673186794847, late: 1606.1689324204615
with recapitation -- early: 4518.1617484892295, late: 3044.321404057624

Which is exactly what is produced by the R code using reticulate after slightly modifying the R script to match the outputs of the new Python script just above (and using the same random seed):

reticulate::use_virtualenv("/tmp/tsreprex")
msprime <- reticulate::import("msprime")
tskit <- reticulate::import("tskit")
pyslim <- reticulate::import("pyslim")

Ne <- 1000    # constant Ne of the population
rho  <- 1e-8  # recombination rate

v1_short <- tskit$load("late.trees")
v2_short <- tskit$load("early.trees")
v1_short_recap <- v1_short |> pyslim$recapitate(ancestral_Ne = Ne, recombination_rate = rho, random_seed = 42)
v2_short_recap <- v2_short |> pyslim$recapitate(ancestral_Ne = Ne, recombination_rate = rho, random_seed = 42)

pi_v1_short <- v1_short$diversity(mode = "branch")
pi_v2_short <- v2_short$diversity(mode = "branch")
pi_v1_short_recap <- v1_short_recap$diversity(mode = "branch")
pi_v2_short_recap <- v2_short_recap$diversity(mode = "branch")

cat(sprintf("no recapitation -- early: %f, late: %f\n", pi_v2_short, pi_v1_short))
cat(sprintf("with recapitation -- early: %f, late: %f\n", pi_v2_short_recap, pi_v1_short_recap))

which gives me this:

no recapitation -- early: 1609.567319, late: 1606.168932
with recapitation -- early: 4518.161748, late: 3044.321404

Note that the values of the recapitated "v2" ("early") are also closer to the expected theta = 4 * Ne = 4 * 1000 = 4000.

So the issue is present even with pure Python recapitation.

@petrelharp
Copy link
Collaborator

Ah ha! Well, that is now less mysterious, at least. I'll figure out what's going on.

@petrelharp
Copy link
Collaborator

ARGGGHHH it is a bug in recapitate. I'll open a bug on pyslim and refer to this one. Thanks for tracking this down.

@petrelharp
Copy link
Collaborator

Moving this to tskit-dev/pyslim#307 - thanks again for the excellent report, @bodkan; tracking this down looks like a lot of work.

@bodkan
Copy link
Owner Author

bodkan commented Jun 14, 2023

Thanks @petrelharp!

I was a bit worried that I'm going crazy... or (worse, and more likely) that there's some really nasty bug in slendr itself. I'm glad this report turned out to be useful. Thank you very much for jumping on this so quickly!

Also, a significant credit goes to @IsabelMarleen who wrote an exciting new Demes -> slendr import functionality. She first noticed a similar pattern as reported in the figure above when she computed various popgen statistics from different published Demes models in order to compare the results between slendr/msprime and slendr/SLiM. This was primarily intended for a new set of statistical unit tests of the slendr's msprime and SLiM simulation code but it's great that this turned out to be helpful even for pyslim!

@bhaller
Copy link
Collaborator

bhaller commented Jun 14, 2023

Also, a significant credit goes to @IsabelMarleen who wrote an exciting new Demes -> slendr import functionality. She first noticed a similar pattern as reported in the figure above when she computed various popgen statistics from different published Demes models in order to compare the results between slendr/msprime and slendr/SLiM. This was primarily intended for a new set of statistical unit tests of the slendr's msprime and SLiM simulation code but it's great that this turned out to be helpful even for pyslim!

A set of statistical unit tests sounds WONDERFUL. I've been wanting to put something like that together for years, but, well, inertia, plus my knowledge of popgen is so weak that I don't really know what good tests would be, what the expected results would be, etc. So... I'm very glad you folks are doing this!

@bodkan
Copy link
Owner Author

bodkan commented Jun 22, 2023

For the record, the bug has now been fixed and the fix has been confirmed even with a pure-slendr version of the analysis above as shown here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants