Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NaN results #5

Closed
llandsmeer opened this issue Jul 18, 2021 · 12 comments
Closed

NaN results #5

llandsmeer opened this issue Jul 18, 2021 · 12 comments

Comments

@llandsmeer
Copy link

llandsmeer commented Jul 18, 2021

Dear APEX authors,

I just APEX on our genotypes+RNA expression dataset. It runs fine in QTLtools, but when I run it in APEX I get this as output:

1       693731  1010717 0       -nan    -nan    -nan    -nan    -nan    -nan    -nan    -nan    nan     -nan    -nan    nan
1       693731  1035805 0       -nan    -nan    -nan    -nan    -nan    -nan    -nan    -nan    nan     -nan    -nan    nan
1       693731  1131236 0       -nan    -nan    -nan    -nan    -nan    -nan    -nan    -nan    nan     -nan    -nan    nan
1       693731  1658945 0       -nan    -nan    -nan    -nan    -nan    -nan    -nan    -nan    nan     -nan    -nan    nan
1       693731  1793786 0       -nan    -nan    -nan    -nan    -nan    -nan    -nan    -nan    nan     -nan    -nan    nan
[... continues for 1832 rows, each n_cis_variants + 4 columns]

One possible hint might be that the cis_gene_table has all egene_pval at 0.5 and all resid_sd at -nan (and n_cis_variants in the expected range of a few thousand SNPs).

I performed all the necessary steps to convert the vcf/bed/covariates to the required input formats and tried with both NA values replaced by 0 or kept as in in the covariates.

What could possible cause this?

Kind regards,
Lennart Landsmeer

@corbinq
Copy link
Owner

corbinq commented Jul 18, 2021 via email

@llandsmeer
Copy link
Author

Hi Corbin,

Thanks for answering. This is the command:

apex cis \
    -b APEX_TRAIT.bed.gz \
    -c APEX_COV_1A_NaNto0 \
    -v "${gen_imputed}.chr1.vcf.gz" \
    -o apex_1a/chr1

where the two APEX_* files are files converted from QTLtools input format by removing the strand column from the trait file (data had no missing values and was already filtered and inverse normal transformed) & the factors in the covariates file were split into distinct dummy 0/1 variables (tried both removing NA's or adding a separate dummy for NA values, NA values for the continuous covariatas where also either kept as is or replaced with 0) & QTLtools covariate exclusion list already applied to the covariate file

This is the resulting CLI output (containing sample size, no of traits & no of covariates)

Using 2 threads.
1 present in both bcf and bed file.
581258 total variants on selected chromosomes.

Found 2124 samples in bcf file ...
Found 2322 samples in covariate file ...
Found 624 samples in expression bed file ...
Found 624 samples in common across all three files.

Processed data for 108 covariates across 624 samples.
Processed expression for 1838 genes across 624 samples.
Processed genotype data for 581258 variants.

Using OLS (no GRM specified).
Started cis-QTL analysis ...
Scaling expression traits ...
Calculating genotype-covariate covariance...
Calculating genotype residual variances ...Done.
Calculating expression residuals...
Scaling expression residuals ...
Processed 206 cis-QTL blocks out of 206 total

@llandsmeer
Copy link
Author

llandsmeer commented Jul 18, 2021

To check whether it was because of the covariates, I redid the analysis using only PC1 & PC2 of the genotype:

20 present in both bcf and bed file.
167044 total variants on selected chromosomes.

Found 2124 samples in bcf file ...
Found 2322 samples in covariate file ...
Found 624 samples in expression bed file ...
Found 624 samples in common across all three files.

Processed data for 2 covariates across 624 samples.
Processed expression for 491 genes across 624 samples.
Processed genotype data for 167044 variants.

Using OLS (no GRM specified).
Started cis-QTL analysis ...
Scaling expression traits ...
Calculating genotype-covariate covariance...
Calculating genotype residual variances ...Done.
Calculating expression residuals...
Scaling expression residuals ...
Processed 54 cis-QTL blocks out of 54 total

Now it sees to work (no more NaN)!

Edit:
Redid analysis with all dummy variables/factors removed -> same result (no NaN)
What could possible happen that creates NaN values if I include those dummy variables?

@corbinq
Copy link
Owner

corbinq commented Jul 18, 2021 via email

@llandsmeer
Copy link
Author

Dear Corbin,

Thanks for the fast response.

I do not have missing data in my covariates (or anywhere else)

Kind regards,
Lennart Landsmeer

@llandsmeer
Copy link
Author

llandsmeer commented Jul 18, 2021

Ok I narrowed 1 specific case down even further.

I have 3 dummy variables for sample sex. M, F & Missing. All are 0/1 vectors. If I include only 1 of those or any combination of 2 (M&F, F&Missing or Missing&M) I do not get NaN values. If I include all three (M&F&Missing) I do get NaN's

Edit: This simple pattern does sadly not hold in general, and the way how different combinations of dummy variables might or might not produce NaN's looks pretty random to me. For example, combining the Hospital dummy's (fine by itself) with the Male&Female dummy's (also fine) results in NaN's

@corbinq
Copy link
Owner

corbinq commented Jul 18, 2021 via email

@llandsmeer
Copy link
Author

llandsmeer commented Jul 18, 2021

Yes it is indeed perfect multicollinearity in the M/F/Missing case. However, I just did a VIF test for the Hospital+M+F case and they are not perfectly multicollinear, even after sample overlapping (altough VIF is pretty high), but this still generates NaN values.

@corbinq
Copy link
Owner

corbinq commented Jul 18, 2021 via email

@llandsmeer
Copy link
Author

Yes I was just reading through the code and implemented the make_half_hat_matrix function in python. numpy.linalg.eig(X.T@X) works, the design matrix cross product is invertible (and I only looked at the covbedvcf sample overlap). No idea what goed wrong :(

@corbinq
Copy link
Owner

corbinq commented Jul 18, 2021 via email

@llandsmeer
Copy link
Author

So after more trials - it seemed like there was an even more complicated perfect colinearity problem. After automatically one-by-one removal of all VIF=Inf covariates APEX runs smoothly again. Sorry for taking up your time and thanks for the support

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants