
High Rhat and ESS issues for K>1 with spatial analysis #31

Closed · crcardenas opened this issue Sep 10, 2020 · 5 comments
@crcardenas commented Sep 10, 2020

Hello,

I am currently running conStruct with a fairly small dataset of 74 individuals and 525 SNPs. I cannot use my full dataset of 1113 neutral SNPs because there seems to be too much missing data; I could only get the analysis to run after filtering it down this far.

I have run conStruct at K=1 with n.chains = 3 and n.iter = 1e4 without any Tail or Bulk ESS issues or Rhat problems. Whenever I run K>1, however, I get warnings. I tried finding a solution in other issues with the same problem (for example here and here), but so far nothing has worked.

My latest run took all weekend with K = 1:2, and K=2 still gives me these warnings:
The largest R-hat....
Bulk Effective Sampling Sizes (ESS)...
Tail Effective Sampling Sizes (ESS)...

Here is the code I used and associated trace save files.

k01sp <- conStruct(spatial = TRUE,
                   K = 1,
                   freqs = constructdata,
                   geoDist = geo_dist_con,
                   coords = xy_construct,
                   prefix = "k01sp",
                   n.chains = 6,
                   n.iter = 3e4,
                   save.files = TRUE,
                   control = setNames(list(0.95), "adapt_delta")) # use if R-hat is misbehaving; values of 0.9-0.99

k02sp <- conStruct(spatial = TRUE,
                   K = 2,
                   freqs = constructdata,
                   geoDist = geo_dist_con,
                   coords = xy_construct,
                   prefix = "k2sp",
                   n.chains = 6,
                   n.iter = 3e4,
                   save.files = TRUE,
                   control = setNames(list(0.95), "adapt_delta")) # use if R-hat is misbehaving; values of 0.9-0.99

k01sp_trace.plots.chain_1.pdf
k01sp_trace.plots.chain_2.pdf
k01sp_trace.plots.chain_3.pdf
k01sp_trace.plots.chain_4.pdf
k01sp_trace.plots.chain_5.pdf
k01sp_trace.plots.chain_6.pdf

k2sp_trace.plots.chain_1.pdf
k2sp_trace.plots.chain_2.pdf
k2sp_trace.plots.chain_3.pdf
k2sp_trace.plots.chain_4.pdf
k2sp_trace.plots.chain_5.pdf
k2sp_trace.plots.chain_6.pdf

The trace plots for K=2 don't seem to agree for any iteration, but they look pretty good for K=1.

I would like to test higher values of K (up to 8), but at n.chains = 6 and n.iter = 3e4 that would take a long time. I have run a non-spatial analysis with this data in the past, but it gave me similar warnings. I am mainly interested in the spatial analysis, because so far this data seems to show an effect of isolation by distance. A hierarchical AMOVA indicates a panmictic population, with a small percentage of the variation coming from one of my stratifications, and my STRUCTURE analysis fails to find a clear solution (multiple deltaK "peaks", all with relatively low magnitude).

I do wonder: could the poor mixing and ESS warnings at K>1 be because there is no ancestral K value greater than 1 in my data? Is that an unreasonable assumption, or can I move forward, ignoring those warnings, and see what the cross-validation and layer contributions say?

Let me know what other information or data you may need. I appreciate any assistance you can provide.

@gbradburd (Owner) commented

Hello,

So it seems like you have a few different questions/problems: (1) you're getting mixing warnings; (2) the analyses are taking a long time; (3) the trace plots for K=2 "don't agree". I'll respond to each below:

  1. warnings - as long as the trace plots look ok, and you're getting consistent results across independent runs, I wouldn't worry that much about the warnings. Good mixing makes things more efficient, but your results aren't necessarily suspect if the mixing is poor. It's hard for me to eyeball the trace plots you sent, but it looks like things are broadly consistent (similar log posterior probabilities and parameter estimates), so I wouldn't worry too much about inefficient mixing in any given run. (See the sketch after this list for one way to pull the per-parameter diagnostics yourself.)

  2. analysis time - 3e4 iterations is probably overkill, as is 6 independent replicates. I would try a smaller number of iterations (maybe 1e4 as a max?) and see how mixing looks in each run, and whether independent runs agree with each other.

  3. "don't agree for any iteration" - I'm not sure what you mean here. It looks like the results for K=2 are broadly comparable across independent runs. Is there a parameter or feature of the data in particular that jumps out as troubling to you that I'm missing here?

hope that helps,
-Gideon

@crcardenas (Author) commented

Gideon,

When I look at the trace plots, they don't all seem to agree. It sounds like you mean they should all be in the same ballpark on the y-axis, right? For example, the supplied K=2 plots all tend to float between 45950 and 45980. I guess I saw chain 6 in that run trending slightly upwards and got worried.

I will rerun this with fewer iterations (i.e., n.chains = 5, n.iter = 1e4), monitor the chains again, and let you know how things go.

I appreciate your guidance.

@gbradburd (Owner) commented

I guess I'm still not sure I understand what you mean by "agree". The MCMC algorithm stochastically samples from the posterior probability surface, so no two independent runs should be identical. Instead, if they're performing well, you should see that they're converging on the same stationary distribution - i.e., if you were to compare the marginal distributions for any given parameter across multiple independent runs, they should be on top of each other. But the sampled value of the parameter in a given MCMC iteration isn't expected to be the same between any two runs. I agree that chain 6 for K=2 is maybe trending somewhere, but if 5/6 chains largely agree, that's pretty good.
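To make that check concrete, here is a minimal sketch of overlaying the marginal posterior log probability across chains. It assumes save.files = TRUE wrote "k2sp_conStruct.results.Robj" and that each chain's trace is stored as conStruct.results[[i]]$posterior$lpd - the file name and list structure are assumptions:

load("k2sp_conStruct.results.Robj")  # assumed to load a list named conStruct.results

# one log posterior trace per chain
lpds <- lapply(conStruct.results, function(chain) chain$posterior$lpd)

# overlay the marginal densities; well-mixed independent runs should give
# curves that sit on top of each other, even though the iteration-by-iteration
# sampled values differ
plot(density(lpds[[1]]), main = "marginal log posterior by chain",
     xlab = "log posterior")
for (i in 2:length(lpds)) {
    lines(density(lpds[[i]]), col = i)
}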

Let me know if you have further troubles on this issue.
-g

@crcardenas (Author) commented

Hi Gideon,

I have rerun the analysis, and I still get the ESS warnings, but the distributions seem to be OK. However, one K=4 run performed particularly poorly.

I would like to confirm before I close this: if one run of the K=4 spatial analysis performed poorly, is it OK to include it when comparing layer contributions? It seems that additional layers above K=2 add little information (see the plot and sketch below).
[attached image: layer contributions compared across values of K]
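For reference, here is a minimal sketch of how a layer-contribution comparison like the one above can be computed with conStruct's calculate.layer.contribution(); the file and object names (e.g., "k04sp_conStruct.results.Robj") are assumptions:

library(conStruct)

load("k04sp_conStruct.results.Robj")  # assumed to load conStruct.results
load("k04sp_data.block.Robj")         # assumed to load data.block

# contribution of each of the K layers to total covariance, for one chain
layer.contributions <- calculate.layer.contribution(
    conStruct.results = conStruct.results[[1]],
    data.block = data.block)

barplot(layer.contributions,
        xlab = "layer", ylab = "layer contribution")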

However, I have a new problem. I understand I should also run the x.validation() function, but I run into an error message. It seems similar to what others have experienced with smaller datasets: "there must be more loci in each partition than there are samples". Adjusting the training proportion downward (say, to 60%) instead gives me "... you have specified an invalid data partition ... "data" element that is not positive definite ...". I hesitate to filter this dataset down further, but I have started by removing the samples with the most missing data.

This might warrant a separate issue, though, because it's not relevant to the high Rhat problem; let me know if you would like me to open a new one. To expedite things, I will run with much smaller iterations (1,000 instead of 10,000) and 5 replicates, and see if I can get the cross-validation method to work. The sketch below shows the shape of the call I'm attempting.
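Here is a minimal sketch of that cross-validation call, reusing the data objects from the conStruct() calls above; the train.prop, n.reps, and n.iter values are just the illustrative settings mentioned in this thread:

library(conStruct)

xval <- x.validation(train.prop = 0.6,   # fraction of loci in the training partition
                     n.reps = 5,         # number of cross-validation replicates
                     K = 1:4,            # values of K to compare
                     freqs = constructdata,
                     geoDist = geo_dist_con,
                     coords = xy_construct,
                     prefix = "xval",
                     n.iter = 1e3,       # the smaller run length mentioned above
                     make.figs = TRUE,
                     save.files = FALSE)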

Thanks again for all your help,
Cody Cardenas

@gbradburd (Owner) commented

Hi Cody,

Yeah, that one run looks wonky, but if the others don't, then I wouldn't worry too much about it.

W/r/t the x-validation - it's a hard rule of the likelihood function that there have to be more loci than samples in each partition. And as you've found, missing data can contribute to positive-definite woes. If you'd like to open a new issue you're welcome to, but I'm not sure how much help I can provide, as these problems seem to be features of your data. You could try dropping samples with lots of missing data (that'll probably help with the positive-definite stuff, and will definitely help decrease the samples:loci ratio; see the sketch below). Alternatively, you could try lumping samples from the same location.
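For the first suggestion, here is a minimal sketch of dropping the highest-missingness samples, assuming constructdata is your samples-by-loci frequency matrix with NAs for missing data (the object names come from the calls earlier in this thread, and the cutoff is illustrative):

# fraction of missing loci per sample
missingness <- rowMeans(is.na(constructdata))

# keep samples below an illustrative missingness cutoff
keep <- missingness <= 0.25
constructdata.filtered <- constructdata[keep, ]

# subset the distance matrix and coordinates to match the retained samples
geo_dist_con.filtered <- geo_dist_con[keep, keep]
xy_construct.filtered <- xy_construct[keep, ]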

hope that helps,
-Gideon
