Multitrait GWAS (-lmm) has NaN values for Se(Ve) #45
@lindrothlab Have you first tried to run the basic linear model (e.g., |
Also, mvLMM is sensitive to phenotype normalization. We usually quantile transform each phenotype to a standard normal distribution before analysis. |
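For readers who want to try the normalization suggested above, here is a minimal sketch of a rank-based inverse normal (quantile) transform in plain Python. The function name and the `(rank - 0.5) / n` offset are illustrative choices, not anything taken from GEMMA, and the code assumes missing phenotypes are coded as -9, as mentioned later in this thread.

```python
from statistics import NormalDist

MISSING = -9.0  # assumption: missing phenotypes are coded as -9, as in this thread


def inverse_normal_transform(values, missing=MISSING):
    """Rank-based inverse normal transform; missing entries are passed through."""
    observed = sorted((v, i) for i, v in enumerate(values) if v != missing)
    n = len(observed)
    out = [missing] * len(values)
    std_normal = NormalDist()
    pos = 0
    while pos < n:
        # Find the run of tied values and give them their average 1-based rank.
        end = pos
        while end + 1 < n and observed[end + 1][0] == observed[pos][0]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        # (rank - 0.5) / n keeps the quantile strictly inside (0, 1).
        z = std_normal.inv_cdf((avg_rank - 0.5) / n)
        for _, idx in observed[pos:end + 1]:
            out[idx] = z
        pos = end + 1
    return out


if __name__ == "__main__":
    print(inverse_normal_transform([3.0, 1.0, -9.0, 2.0]))
```

Each trait would be transformed separately before being written back to the phenotype file.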
Individually, each trait seems to run okay with GEMMA using -lmm 2 and -lm 2 (producing complete log and .assoc.txt files). Yet the l_mle values are all 100,000 (see attached figure) for both traits. Otherwise, the output looks similar to the output from the sample mouse data. Both traits are rank-based transformed and normally distributed. One of the traits has one missing value (coded as -9), while the other trait has no missing data.
@lindrothlab The maximum likelihood estimates of lambda are hitting the maximum allowed value (the default value of
@lindrothlab Do you get similar values for |
Yes, for both of the traits the l_mle in the univariate LMM analyses are all 1e5. |
@lindrothlab How are you computing the relatedness matrix? Can you tell us anything about the genetic relatedness of these samples? Could any of them share recent ancestors? @xiangzhou Have you ever seen very large ML estimates of lambda like these before? Do you think it could be an issue with the relatedness matrix not being s.p.d. (symmetric positive definite)? Would you suggest inspecting the eigenvalues of the relatedness matrix? |
To recap, we get odd behaviour with both of the univariate & multivariate LMMs.
This could be an issue, but hard to say for sure.
This is quite possibly the root of the problem. The relatedness matrix is expected to be symmetric positive definite (s.p.d.), and an important property of an s.p.d. matrix is that all of its eigenvalues are positive. (@xiangzhou please correct me if I've made any incorrect assumptions here. Does GEMMA check that all the eigenvalues are positive?) @lindrothlab Maybe you can try removing the related individuals and see if this removes the zero eigenvalues? If that does not work, perhaps you can share the data (e.g., via Dropbox) and I can try to see if I can get GEMMA to work.
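As a rough illustration of what an s.p.d. check could look like, the sketch below tests symmetry and then attempts a Cholesky factorization, which only succeeds for positive definite matrices. This is not GEMMA's own check; the function names and the `min_pivot` and `tol` thresholds are assumptions chosen for the example.

```python
import math


def is_symmetric(K, tol=1e-8):
    """True if K is square and K[i][j] matches K[j][i] within tol."""
    n = len(K)
    return all(len(row) == n for row in K) and all(
        abs(K[i][j] - K[j][i]) <= tol for i in range(n) for j in range(i)
    )


def is_positive_definite(K, min_pivot=1e-10):
    """Attempt a Cholesky factorization K = L L^T; fail on a tiny or negative pivot."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        pivot = K[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if pivot <= min_pivot:  # hypothetical threshold, not a GEMMA setting
            return False
        L[j][j] = math.sqrt(pivot)
        for i in range(j + 1, n):
            L[i][j] = (K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return True


if __name__ == "__main__":
    K = [[2.0, 1.0], [1.0, 2.0]]
    print(is_symmetric(K), is_positive_definite(K))
```

A matrix with a zero or near-zero eigenvalue, such as one containing duplicated samples, fails the second check.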
Ok, I'll give this a go and get back to you. Thanks! |
After getting rid of one sample from all of the related pairs (relatedness ~ 0.12), the kinship matrix still does not seem to be quite right. Most of the eigenvalues vary from 0.5 to 0.1, but the very last eigenvalue is 5.5e-13. And this is fairly similar to the eigenvalues from the kinship matrix that included the related samples (most values varied from 0.5 to 0.07, but the last eigenvalue was 2.5e-13). I then tried the univariate GWAS for each trait (without the related samples) and got similar results as before: a few l_remle values of 1e5 for one of the traits and then NaNs in the se(Ve) when I tried to run a multivariate GWAS. Could I share the data via GoogleDrive or Dropbox? Thank you! |
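One way to drop one sample from each related pair, sketched under assumptions: the kinship matrix is held as a list of lists, pairs count as "related" when their off-diagonal value exceeds a hypothetical threshold of 0.1 (the thread only says relatedness was about 0.12), and the most-connected sample is removed first. This is not necessarily how the pruning was done here.

```python
def prune_related(K, threshold=0.1):
    """Return sorted indices to keep so that no kept pair has K[i][j] > threshold."""
    keep = set(range(len(K)))
    while keep:
        # For each remaining sample, count the remaining samples it is related to.
        counts = {
            i: sum(1 for j in keep if j != i and K[i][j] > threshold) for i in keep
        }
        # Remove the sample with the most relatives; ties go to the highest index.
        worst = max(counts, key=lambda i: (counts[i], i))
        if counts[worst] == 0:
            break
        keep.remove(worst)
    return sorted(keep)


if __name__ == "__main__":
    K = [
        [0.50, 0.12, 0.00, 0.00],
        [0.12, 0.50, 0.00, 0.00],
        [0.00, 0.00, 0.50, 0.01],
        [0.00, 0.00, 0.01, 0.50],
    ]
    print(prune_related(K))
```

The returned indices can then be used to subset both the genotype and phenotype files before recomputing the relatedness matrix.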
@lindrothlab Please go ahead and send us a Dropbox or Google Drive link (Dropbox is preferred); if you are concerned about making the link publicly available, you can send it to our email addresses. |
Here's the data in plink format. Since the data aren't human or any other model form that plink supports, the SNPs have all been assigned to chromosome one. Please let me know if you have any questions. Thank you! https://www.dropbox.com/sh/xs3zyn35rc7c89p/AACthEmSWB8za3v2fb5mMt4pa?dl=0
I've standardized the kinship matrix according to https://aeolister.wordpress.com/2016/06/27/fixing-a-non-positive-definite-kinship-matrix/ and the eigenvalues are slightly improved (I've added this kinship matrix, "normalizedKinshipForGEMMA", to the Dropbox folder). Also, the trait that was previously giving 1e5 values for l_remle no longer has this issue, and the multitrait GWAS has run; I think the output file looks okay!
So, I think this has worked. I'm not sure why the centered kinship matrix needed this standardization, i.e., whether something is odd with the input SNP data or whether this is "normal" given the nature of SNP datasets.
@lindrothlab That's great news! Could you please send us the original Do you have a lot of SNPs with very low allele frequencies? I'm curious why this "standardization" made such a big difference. |
Yes, certainly! To generate the kinship matrix, I used: I had tried both a centered and a standardized relatedness matrix in GEMMA (-gk 1 and -gk 2) and both of them seemed to have similar eigenvalue properties leading to issues with multivariate GWAS and 1e5 l_remle estimates for some traits. To fix the matrix (make it a positive semi definite matrix), I used a "second level" of standardization with an R function on this blog (https://aeolister.wordpress.com/2016/06/27/fixing-a-non-positive-definite-kinship-matrix/). To do this in R:

# upload kinship matrix from GEMMA
rrk <- as.matrix(read.csv("~/Desktop/removedrelated_K.cXX.txt", sep = "", row.names = NULL, header = FALSE))

# R function to normalize the kinship matrix, making it positive semi definite
normalize_kinmat <- function(kinmat){
  # normalize kinship so that Kij \in [0,1]
  tmp = kinmat - min(kinmat)
  tmp = tmp/max(tmp)
  tmp[1:9,1:9]
  # fix eigenvalues to positive
  diag(tmp) = diag(tmp) - min(eigen(tmp)$values)
  tmp[1:9,1:9]
  return(tmp)
}

normk <- normalize_kinmat(rrk)
str(normk)
eigen_normk <- eigen(normk) # Check the eigenvalues to see whether the properties have improved
eigen_normk$values

# export matrix to use in GEMMA
write.table(normk, "~/Desktop/normalizedKinshipForGEMMA.cxx.txt", sep = "\t", col.names = FALSE, row.names = FALSE)

And yes, our SNP dataset does have a lot of low frequency alleles. Perhaps this explains why the kinship matrix needed to be manipulated.
@lindrothlab Thank you for sharing. I have not seen this particular approach used before. It seems reasonable, though not immediately clear to me why it would work. I've added a "documentation" label to this Issue in case others find this approach useful. |
To make sure this was not related to my fixes over the last few days, I have replicated this issue and added a test for positive definiteness after loading K. It is interesting to note that mvlmm runs 'forever' with the first se_ve nan example and works with the adjusted K. Both K's are positive definite. I should also check for symmetry. I think we should add an option in GEMMA to adjust K as described and drop related pairs and perhaps low AF. wdyt?
@pjotrp There is already a Dropping related pairs is an interesting idea. My experience is that if a kinship matrix is not s.p.d. then there is something wrong with your relatedness calculation; while useful as a last resort, I don't think we should add the above solution as a feature to GEMMA---I don't think it is good practice. |
I have added more checks for the state of K, see https://github.com/genenetwork/GEMMA/blob/issue45/src/gemma.cpp#L1894 (don't merge this branch just yet, I need to clean it up). When checking the corrected normalizedKinshipForGEMMA.cxx.txt above, I find it still has small eigenvalues (1.9e-15):
@xiangzhou At what level should we warn about small eigenvalues? I have set it to 1e-6. A warning is not harmful and may be suggestive.
Thanks @pjotrp. Giving a warning is a great idea. However, what is important is the ratio of the largest and smallest eigenvalues; this is the condition number of a matrix. This tells us how numerically stable the linear system is. |
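To make the condition-number idea concrete, here is a small self-contained sketch that computes the eigenvalues of a symmetric matrix with cyclic Jacobi rotations and reports the ratio of the largest to the smallest. It is an illustration only, not the check used in GEMMA. The function names, sweep limit, and tolerance are assumptions, and a production check would normally call a LAPACK-backed routine instead.

```python
import math


def symmetric_eigenvalues(K, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations (ascending)."""
    n = len(K)
    A = [[float(x) for x in row] for row in K]
    for _ in range(sweeps):
        off = math.sqrt(sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                # Rotation angle (|theta| <= pi/4) that zeroes A[p][q].
                if A[p][p] == A[q][q]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2.0 * A[p][q] / (A[q][q] - A[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # A <- A J (columns p and q)
                    akp, akq = A[k][p], A[k][q]
                    A[k][p] = c * akp - s * akq
                    A[k][q] = s * akp + c * akq
                for k in range(n):  # A <- J^T A (rows p and q)
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k] = c * apk - s * aqk
                    A[q][k] = s * apk + c * aqk
    return sorted(A[i][i] for i in range(n))


def condition_number(K):
    """Ratio of largest to smallest eigenvalue; infinite if K is not positive definite."""
    eig = symmetric_eigenvalues(K)
    if eig[0] <= 0.0:
        return math.inf
    return eig[-1] / eig[0]


if __name__ == "__main__":
    print(condition_number([[2.0, 1.0], [1.0, 2.0]]))
```

For the eigenvalues reported earlier in this thread (largest about 0.5, smallest 5.5e-13), the ratio is roughly 9e11, which indicates a badly conditioned matrix even though every eigenvalue is technically positive.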
This piece nicely describes the ins and outs of non-positive definite matrices and of adding values to the diagonal, as done in this issue: http://www2.gsu.edu/~mkteer/npdmatri.html
The code base should be a lot better now. Please reopen if you encounter problems. |
Hello,
I have attempted multitrait GWAS with two traits and a kinship matrix (no other covariates) and I think something must be wrong with my input data. Here's the command line output:
The se(Ve) has NaNs and the "Reading SNPs" does not advance from 0.00%. The command does not execute fully. Do you have any suggestions for what could be wrong with the input data? I can attach the data as a zip if needed.
Thank you very much!
Hilary