Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multitrait GWAS (-lmm) has NaN values for Se(Ve) #45

Closed
lindrothlab opened this issue Jun 20, 2017 · 23 comments
Closed

Multitrait GWAS (-lmm) has NaN values for Se(Ve) #45

lindrothlab opened this issue Jun 20, 2017 · 23 comments

Comments

@lindrothlab
Copy link

lindrothlab commented Jun 20, 2017

Hello,

I have attempted multitrait GWAS with two traits and a kinship matrix (no other covariates) and I think something must be wrong with my input data. Here's the command line output:

lindroth@lindroth-5810:/usr/share/gemma_program$ ./gemma -bfile WisAsp_BCFfiltered_VCFfiltered_vcf-merge_VCFfiltered-take2_maf05_plink_LinkImpute_LDprune -k output/kinship.cXX.txt -lmm 4 -n 13 15  -o wisasp_multitrait

Reading Files ... 
## number of total individuals = 446
## number of analyzed individuals = 445
## number of covariates = 1
## number of phenotypes = 2
## number of total SNPs = 139570
## number of analyzed SNPs = 139570
Start Eigen-Decomposition...
REMLE estimate for Vg in the null model: 
2.0034	
-0.2032	2.0683	
se(Vg): 
1.5507	
1.0244	0.9791	
REMLE estimate for Ve in the null model: 
0.4170	
-0.4272	0.4377	
se(Ve): 
0.4100	
0.2699	0.2571	
REMLE likelihood = -1177.0001
MLE estimate for Vg in the null model: 
2.2770	
-0.4818	2.3521	
se(Vg): 
0.2408	
0.1944	0.2496	
MLE estimate for Ve in the null model: 
0.3447	
-0.3532	0.3618	
se(Ve): 
-nan	
-nan	0.0000	
MLE likelihood = -684011.4970
Reading SNPs                                                    0.00%

The se(Ve) has NaNs and the "Reading SNPs" does not advance from 0.00%. The command does not execute fully. Do you have any suggestions for what could be wrong with the input data? I can attach the data as a zip if needed.

Thank you very much!
Hilary

@pcarbo
Copy link
Collaborator

pcarbo commented Jun 21, 2017

@lindrothlab Have you first tried to run the basic linear model (e.g., -lm 2) on one or both of your phenotypes? Or have you tried the univariate LMM (e.g., -lmm 2) on one or both of your phenotypes? This would be a good starting point if you haven't tried this already.

@xiangzhou
Copy link
Collaborator

Also, mvLMM is sensitive to phenotype normalization. We usually quantile transform each phenotype to a standard normal distribution before analysis.

@lindrothlab
Copy link
Author

Individually, each trait seems to run okay with GEMMA using -lmm 2 and -lm 2 (producing complete log and .assoc.txt files). Yet, the l_mle values are all 100,000 (see attached figure) for both traits. Otherwise, the output looks seemingly similar to output from the sample mouse data.

assoc_example

Both traits are rank-based transformed and normally distributed. One of the traits has one missing value (coded as -9) while the other trait has no missing data.

@pcarbo
Copy link
Collaborator

pcarbo commented Jun 22, 2017

@lindrothlab The maximum likelihood estimates of lambda are hitting the maximum allowed value (the default value of lmax is 1e5). The REMLE estimates are also quite high, which suggests that the correlation defined by the relatedness matrix is a good fit for the phenotypic correlations.

Individually, each trait seems to run okay with GEMMA using -lmm 2 and -lm 2 (producing complete log and .assoc.txt files).

@lindrothlab Do you get similar values for l_remle and/or l_mle in the univariate LMM analyses?

@lindrothlab
Copy link
Author

Yes, for both of the traits the l_mle in the univariate LMM analyses are all 1e5.

@lindrothlab
Copy link
Author

Here's some more information. When using -lmm 4 for each of the traits in a univariate analysis, the l_mle values are all 1e5, and the l_remle values both 10.7 and 11.7 on average for each trait but there is quite a bit of variance across traits. The standard deviation for the l_remle values are 0.97 for one trait but 378 for the other trait because two of the l_remle values are set to 1e5. The attached screenshot shows these assoc files filtered to show the top ten l_remle values for each trait.
comparing_l_remle_values

@pcarbo
Copy link
Collaborator

pcarbo commented Jun 23, 2017

@lindrothlab How are you computing the relatedness matrix? Can you tell us anything about the genetic relatedness of these samples? Could any of them share recent ancestors?

@xiangzhou Have you ever seen very large ML estimates of lambda like these before? Do you think it could be an issue with the relatedness matrix not being s.p.d. (symmetric positive definite)? Would you suggest inspecting the eigenvalues of the relatedness matrix?

@lindrothlab
Copy link
Author

Most of the samples are unrelated. Approximately 6 sample pairs have a relatedness of 0.128 and 20 sample pairs have a relatedness of 0.065. The diagonal values are 0.277.

Here are some summary statistics for the entire matrix:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.0075820 -0.0017300 -0.0006334 0.0000000 0.0004818 0.2770000

And here is a summary of the eigenvalues:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.2354 0.2611 0.2640 0.2902 0.5525

And here is a heatmap of the relatedness matrix.
kmatrix

Thank you!

@pcarbo
Copy link
Collaborator

pcarbo commented Jun 26, 2017

To recap, we get odd behaviour with both of the univariate & multivariate LMMs.

Approximately 6 sample pairs have a relatedness of 0.128 and 20 sample pairs have a relatedness of 0.065.

This could be an issue, but hard to say for sure.

And here is a summary of the eigenvalues:

Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.2354 0.2611 0.2640 0.2902 0.5525

This is quite possibly the root of the problem. The relatedness matrix is expected to be symmetric positive definite (s.p.d.), and an important s.p.d. matrix (@xiangzhou please correct me if I've made any incrorrect assumptions here. Does GEMMA check that all the eigenvalues are positive?)

@lindrothlab Maybe you can try removing the related individuals and see if this removes the zero eigenvalues? If that does not work, perhaps you can share the data (e.g., via Dropbox) and I can try to see if I can get GEMMA to work.

@lindrothlab
Copy link
Author

Ok, I'll give this a go and get back to you. Thanks!

@lindrothlab
Copy link
Author

After getting rid of one sample from all of the related pairs (relatedness ~ 0.12), the kinship matrix still does not seem to be quite right. Most of the eigenvalues vary from 0.5 to 0.1, but the very last eigenvalue is 5.5e-13. And this is fairly similar to the eigenvalues from the kinship matrix that included the related samples (most values varied from 0.5 to 0.07, but the last eigenvalue was 2.5e-13).

I then tried the univariate GWAS for each trait (without the related samples) and got similar results as before: a few l_remle values of 1e5 for one of the traits and then NaNs in the se(Ve) when I tried to run a multivariate GWAS.

Could I share the data via GoogleDrive or Dropbox? Thank you!

@pcarbo
Copy link
Collaborator

pcarbo commented Jun 27, 2017

@lindrothlab Please go ahead and send us a Dropbox or Google Drive link (Dropbox is preferred); if you are concerned about making the link publicly available, you can send it to our email addresses.

@lindrothlab
Copy link
Author

Here's the data in plink format. Since the data aren't human or any other model form that plink supports, the SNPs have all been assigned to chromosome one. Please let me know is you have any questions. Thank you! https://www.dropbox.com/sh/xs3zyn35rc7c89p/AACthEmSWB8za3v2fb5mMt4pa?dl=0

@lindrothlab
Copy link
Author

lindrothlab commented Jun 27, 2017

I've standardized the kinship matrix according to https://aeolister.wordpress.com/2016/06/27/fixing-a-non-positive-definite-kinship-matrix/ and the eigenvalues are slightly improved (I've added this kinship matrix "normalizedKinshipForGEMMA" to the dropbox folder). Also, the trait that was previously giving 1e5 values for l_remle is no longer having this issue and the multitrait GWAS has run and I think the output file looks okay!

lindroth@lindroth-5810:/usr/share/gemma_program$ sudo ./gemma -bfile WisAsp_BCFfiltered_VCFfiltered_vcf-merge_VCFfiltered-take2_maf05_plink_LinkImpute_LDprune_removedrelatedsamples -k output/normalizedKinshipForGEMMA.cxx.txt -lmm 4 -n 13 15 -o wisasp_CT_tremulacin_lmm_4_removedrelated_normK_normK
Reading Files ... 
## number of total individuals = 439
## number of analyzed individuals = 438
## number of covariates = 1
## number of phenotypes = 2
## number of total SNPs = 139570
## number of analyzed SNPs = 139570
Start Eigen-Decomposition...
REMLE estimate for Vg in the null model: 
0.7480	
0.1266	0.7491	
se(Vg): 
0.4445	
0.2981	0.3311	
REMLE estimate for Ve in the null model: 
0.6412	
-0.5383	0.6685	
se(Ve): 
0.1862	
0.1297	0.1436	
REMLE likelihood = -1158.8505
MLE estimate for Vg in the null model: 
0.8261	
0.1098	0.7810	
se(Vg): 
0.4615	
0.3071	0.3382	
MLE estimate for Ve in the null model: 
0.6072	
-0.5296	0.6535	
se(Ve): 
0.1909	
0.1321	0.1453	
MLE likelihood = -1159.6585
Reading SNPs  ==================================================100.00%
Segmentation fault (core dumped)

So, I think this has worked. I'm not sure why the centered kinship matrix needed this standardization. i.e., Whether something is odd with the input SNP data? or if this is "normal" given that nature of SNP datasets.

@pcarbo
Copy link
Collaborator

pcarbo commented Jun 27, 2017

@lindrothlab That's great news! Could you please send us the original gemma command you used to generate the relatedness/kinship matrix, and the steps you took to fix the matrix? This could be useful for others. Also, did you try the -gk 2 option in GEMMA which computes a "standardized" relatedness matrix?

Do you have a lot of SNPs with very low allele frequencies? I'm curious why this "standardization" made such a big difference.

@lindrothlab
Copy link
Author

lindrothlab commented Jun 29, 2017

Yes, certainly!

To generate the kinship matrix, I used:
lindroth@lindroth-5810:/usr/share/gemma_program$ ./gemma -bfile WisAsp_BCFfiltered_VCFfiltered_vcf-merge_VCFfiltered-take2_maf05_plink_LinkImpute_LDprune_removedrelatedsamples -gk 1 -o removedrelated_K

I had tried both a centered and a standardized relatedness matrix in GEMMA (-gk 1 and -gk 2) and both of them seemed to have similar eigenvalue properties leading to issues with multivariate GWAS and 1e5 l_remle estimates for some traits.

To fix the matrix (make it a positive semi definite matrix), I used a "second level" of standardization with an R function on this blog (https://aeolister.wordpress.com/2016/06/27/fixing-a-non-positive-definite-kinship-matrix/). To do this in R:

# upload kinship matrix from GEMMA
rrk <- as.matrix(read.csv("~/Desktop/removedrelated_K.cXX.txt", sep = "", row.names = NULL, header = FALSE))

# R function to normalize the kinship matrix, making it positive semi definite
normalize_kinmat <- function(kinmat){
  #normalize kinship so that Kij \in [0,1]
  tmp=kinmat - min(kinmat)
  tmp=tmp/max(tmp)
  tmp[1:9,1:9]
  #fix eigenvalues to positive
  diag(tmp)=diag(tmp)-min(eigen(tmp)$values)
  tmp[1:9,1:9]  
  return(tmp)
}

normk <- normalize_kinmat(rrk) 
str(normk)
eigen_normk <- eigen(normk) # Check the eigenvalues to see whether the properties have improved
eigen_normk$values

# export matrix to use in GEMMA
write.table(normk, "~/Desktop/normalizedKinshipForGEMMA.cxx.txt", sep = "\t", col.names = FALSE, row.names = FALSE)

And yes, our SNP dataset does have a lot of low frequency alleles. Perhaps this explains why the kinship matrix needed to be manipulated.
allelefreq

@pcarbo
Copy link
Collaborator

pcarbo commented Jun 29, 2017

@lindrothlab Thank you for sharing. I have not seen this particular approach used before. It seems reasonable, though not immediately clear to me why it would work. I've added a "documentation" label to this Issue in case others find this approach useful.

@pjotrp
Copy link
Member

pjotrp commented Aug 15, 2017

To make sure this was not related to my fixes over the last days I have replicated this issue and added a test for positive definite after loading K. It is interesting to note that mvlmm runs 'forever' with the first se_ve nan example and works with the adjusted K. Both K's are positive definite. I should check also for symmetry.

I think we should add an option in GEMMA to adjust K as described and drop related pairs and perhaps low AF.

wdyt?

@pcarbo
Copy link
Collaborator

pcarbo commented Aug 15, 2017

@pjotrp There is already a -maf flag for filtering out SNPs based on minor allele frequency.

Dropping related pairs is an interesting idea.

My experience is that if a kinship matrix is not s.p.d. then there is something wrong with your relatedness calculation; while useful as a last resort, I don't think we should add the above solution as a feature to GEMMA---I don't think it is good practice.

@pjotrp
Copy link
Member

pjotrp commented Aug 19, 2017

I have added more checks for the state of K, see https://github.com/genenetwork/GEMMA/blob/issue45/src/gemma.cpp#L1894 (don't merge this branch just yet, I need to clean it up).

When checking above corrected normalizedKinshipForGEMMA.cxx.txt I find it still has small eigen values (1.9e-15):

**** DEBUG: 1.98646e-15 in src/mathfunc.cpp at line 268 in bool checkMatrixEigen(const gsl_matrix*, double)
**** WARNING: K has small or negative eigenvalues! in src/gemma.cpp at line 2734 in void GEMMA::BatchRun(PARAM&)

@xiangzhou at what level should we warn about small eigen values? I have set it to 1e-6. A warning is not harmful and may be suggestive.

@pcarbo
Copy link
Collaborator

pcarbo commented Aug 19, 2017

Thanks @pjotrp. Giving a warning is a great idea. However, what is important is the ratio of the largest and smallest eigenvalues; this is the condition number of a matrix. This tells us how numerically stable the linear system is.

@pjotrp
Copy link
Member

pjotrp commented Aug 21, 2017

this piece nicely describes ins and outs of non-positive definite matrices and adding values to the diagonal such as done in this issue: http://www2.gsu.edu/~mkteer/npdmatri.html

@pjotrp
Copy link
Member

pjotrp commented Nov 23, 2017

The code base should be a lot better now. Please reopen if you encounter problems.

@pjotrp pjotrp closed this as completed Nov 23, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants