-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Puzzling differences of statistics computed from recapitated SLiM tree sequences compared to theory #141
Comments
Wow, this is rather surprising. I'll have to dig in to this. |
Wow. Great bug report @bodkan. I will be curious to hear what @petrelharp finds! One note: you write "There is no simplification anywhere", but that is probably not true. SLiM probably auto-simplifies at least once during the run of the model, and certainly at the call to |
Thanks @petrelharp and @bhaller! I was a bit worried that I'm missing something totally obvious (even after all this time, tree sequences are always ready to surprise me), which is why I conservatively submitted an issue here first.
Ah, yes, of course! Thank you for the clarification. What I meant to say is that I do not explicitly call Speaking of which, I did wonder if the issue described above is related to this note from the same section of the docs but I didn't want to speculate too much because I'm not familiar with the low-level details of how this is all done under the hood:
After all, the lower apparent diversity in the middle ii.b boxplot in the figure above would be consistent with the recapitated MRCA to be younger (and the genealogies to be "shallower") than expected by theory? I think? But that's just a hunch. |
I don't see how that would happen, since you don't simplify in python; but yes, it seems possibly related. |
Here's a preliminary report: Well, the surprising thing is that With this as
and this as
after running
then the following script
gets
... so, I'm not seeing the bug here. AFAICT this ought to be doing just what the R code is doing, but apparently it is not. Further investigations to follow. |
Whoah, this is even weirder! Let me try rerunning what you have on my setup first (versions of SLiM and the Python deps, just to be sure). I will also try to eliminate the reticulate-d Python calls from my code. The one potential issue I could foresee there is some forced / broken implicit conversion of numeric types passed from R to Python… but who knows? I have been taking the reticulate magic for granted, that’s the one obvious uncontrolled variable here. I will dig in as soon as I get to work. Thank you so much for your help! I’m quite relieved to know that everything works on your end as it should! |
Uh-oh: using the same
I get consistently something like
... so, this seems to be reproducing the bug. And, it seems to enter through |
I currently have no idea what's going on - hopefully, you know where to look? |
OK, I'm finally at the office and have to run to another meeting soon, but I immediately noticed one thing (I was reading this on my phone before): are you sure you are recapitating in your Python script? You defined the If I modify your script in this way (just adding the
I get the following on the
Which is exactly what is produced by the R code using reticulate after slightly modifying the R script to match the outputs of the new Python script just above (and using the same random seed):
which gives me this:
Note that the values of the recapitated "v2" ("early") are also closer to the expected So the issue is present even with pure Python recapitation. |
Ah ha! Well, that is now less mysterious, at least. I'll figure out what's going on. |
ARGGGHHH it is a bug in recapitate. I'll open a bug on pyslim and refer to this one. Thanks for tracking this down. |
Moving this to tskit-dev/pyslim#307 - thanks again for the excellent report, @bodkan; tracking this down looks like a lot of work. |
Thanks @petrelharp! I was a bit worried that I'm going crazy... or (worse, and more likely) that there's some really nasty bug in slendr itself. I'm glad this report turned out to be useful. Thank you very much for jumping on this so quickly! Also, a significant credit goes to @IsabelMarleen who wrote an exciting new Demes -> slendr import functionality. She first noticed a similar pattern as reported in the figure above when she computed various popgen statistics from different published Demes models in order to compare the results between slendr/msprime and slendr/SLiM. This was primarily intended for a new set of statistical unit tests of the slendr's msprime and SLiM simulation code but it's great that this turned out to be helpful even for pyslim! |
A set of statistical unit tests sounds WONDERFUL. I've been wanting to put something like that together for years, but, well, inertia, plus my knowledge of popgen is so weak that I don't really know what good tests would be, what the expected results would be, etc. So... I'm very glad you folks are doing this! |
For the record, the bug has now been fixed and the fix has been confirmed even with a pure-slendr version of the analysis above as shown here. |
Hello @bhaller and @petrelharp, I have recently encountered a strange tree-sequence-related and recapitation-related issue with my MSc student @IsabelMarleen. It all started with slendr, but I have now managed to cut down the hundreds of lines of slendr’s Eidos code to a minimum example, described below.
Given that my issue persists even when I take slendr out of the picture this probably belongs to the tskit or SLiM GitHub, but I decided to post it here first because I'm not entirely sure what's going on (and where).
Problem setup
Let's say that I simulated tree sequences from three different simulations of a constant Ne single-population model (
Ne = 1000
, sequence lengthL = 10e6
bp, recombination raterho = 1e-8
):i. A standard msprime simulation via
msprime.sim_ancestry()
. In the attached full R script, I run this with a helper R function assimulate_msprime_ts(Ne = 1000, L = 10e6, rho = 1e-8)
.ii. The following SLiM script (I will call this "v1"):
In the analysis below, I get a simulated tree sequence from this script using a helper R function like this
simulate_slim_ts(Ne = 1000, time = 1000, L = 10e6, rho = 1e-8, "v1")
(the function is defined in the attached R script).iii. And finally the following SLiM script (I will call this "v2"):
In the analysis below, I get a simulated tree sequence from this script using a helper R function like this
simulate_slim_ts(Ne = 1000, time = 1000, L = 10e6, rho = 1e-8, "v2")
(the function defined in the attached R script).Note that the only difference between "v1" and "v2" is the timing of the creation of the population "p0" (
early()
orlate()
) .How I ran the simulations
Using i-iii above, I ran 50 replicates to generate the following tree-sequence outputs from:
i. the msprime run (item i above)
ii. the "v1" SLiM run (item ii above) -- two versions:
a. letting it run for 30000 ticks to ensure a complete coalescence (basically including a "burn-in");
b. letting it run for 1000 ticks, then recapitating it to ensure a complete coalescence (using
pyslim.recapitate
withNe
and recombination raterho
as used in the forward simulation).iii. the "v2" SLiM run (item iii above) -- two versions:
a. letting it run for 30000 ticks to ensure a complete coalescence (basically including a "burn-in");
b. letting it run for 1000 ticks, then recapitating it to ensure a complete coalescence (using
pyslim.recapitate
withNe
and recombination raterho
as used in the forward simulation).Then, for each of those five tree sequences (i, ii.a, ii.b, iii.a, iii.b as above), I computed
<tree-sequence object>.diversity(sample_sets = [<100 sample nodes>], mode = "branch")
.A couple of notes:
allele_frequency_spectrum()
ordivergence()
methods of tskit.Problem itself
My expectation was, that all five tree sequences should give me the same diversity. In particular, that it should not matter whether I run a SLiM simulation "long enough" (to achieve complete coalescence of all genealogies via a burn-in) or if I run the simulation for a short(er) time and then follow it up with recapitation.
However, this is not the case. What I see is shown in the following figure. Recapitation gives me the diversity consistent with the coalescent msprime simulation and with population genetic theory only when I run the "v2" version of the SLiM script as shown above. If I run the "v1" version, the diversity is consistently lower.
My expectation
I understand that whether the
p0
population is created at the end of the first tick ("v1" above) or at the beginning of the first tick ("v2" above) will influence how much time in total doesp0
"spend" in the simulation -- the difference of just one generation.However, it was my assumption that the only parameter that the recapitation coalescent simulation cares about is the Ne of the population at the time of the oldest recorded nodes in the tree sequence (because it involves putting a coalescent simulation "on top" of the oldest nodes)? But if this is what's going on, I'd think that it shouldn't matter if the most ancestral population is created in a "v1"
1 late() { ... }
or "v2"1 early() { ... }
, because the oldest nodes (which then form the basis for the recapitation simulation) would be just one generation younger in v1 compared to v2?Am I wrong? And if I am, could you please help me understand where I am wrong?
The reason I discovered this issue is that the SLiM simulation code in slendr basically uses the "v1" option. We were running some statistical tests doing comparisons of SLiM+burn-in vs SLiM+recapitation vs msprime vs popgen theory on the same model, and discovered this surprising pattern.
Full reproducible script
This probably won't help clarify the issue, but given that the figure for the test above needs multiple simulation runs to show the pattern, I'm attaching the full R script to replicate it. On top of the script is a bit of a setup (creating a minimum Python virtual environment with msprime + tskit + pyslim to be ran in shell) and a few R dependencies for parallelization and ggplot2-ing the results. It doesn't use slendr for anything, but it does rely on reticulate to call the Python code from R. That said, all the calls of Python methods of tskit/pyslim/msprime are quite trivial so I hope that even the R code will be readable (i.e., instead of calling
tskit.load("<path>")
as in Python you dotskit$load("<path>")
in R).Thank you for any help or pointers!
I'm also tagging @FerRacimo, because we discussed this issue not long ago.
Here's the complete R script: ts_reprex.R.zip.
If the description of the problem above is not sufficient and the R script is giving you problems, please let me know.
The text was updated successfully, but these errors were encountered: