-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NaN results #5
Comments
Hi Lennart,
Thanks for the bug report. Please share the following information:
* The specific command line that you used to run APEX here.
* The sample size, number of traits, and number of covariates in this
analysis.
Thanks,
Corbin
…On Sun, Jul 18, 2021, 8:01 AM Lennart Landsmeer ***@***.***> wrote:
Dear APEX authors,
I just APEX on our genotypes+RNA expression dataset. It runs fine in
QTLtools, but when I run it in APEX I get this as output:
1 693731 1010717 0 -nan -nan -nan -nan -nan -nan -nan -nan nan -nan -nan nan
1 693731 1035805 0 -nan -nan -nan -nan -nan -nan -nan -nan nan -nan -nan nan
1 693731 1131236 0 -nan -nan -nan -nan -nan -nan -nan -nan nan -nan -nan nan
1 693731 1658945 0 -nan -nan -nan -nan -nan -nan -nan -nan nan -nan -nan nan
1 693731 1793786 0 -nan -nan -nan -nan -nan -nan -nan -nan nan -nan -nan nan
[... continues for 1832 rows, each n_cis_variants + 4 columns]
One possible hint might be that the cis_gene_table has all egene_pval at
0.5 and all resid_sd at -nan (and n_cis_variants in the expected range of
a few thouasand SNPs).
What could possible cause this?
Kind regards,
Lennart Landsmeer
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#5>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAWQAM45HBJ4ZEBEWWNM5XDTYK7ANANCNFSM5ASCJCGQ>
.
|
Hi Corbin, Thanks for answering. This is the command:
where the two This is the resulting CLI output (containing sample size, no of traits & no of covariates)
|
To check whether it was because of the covariates, I redid the analysis using only PC1 & PC2 of the genotype:
Now it sees to work (no more NaN)! Edit: |
Hi,
Thanks for sharing the command and narrowing down the issue.
APEX does not currently support missing data in the covariate or trait
file. (It does, however, support missing genotypes, and defaults to
impute-to-mean.)
For covariate and trait data, you could first apply listwise deletion or
impute-to-mean to generate new input files, and then pass those files to
APEX.
I will make a note to make sure we return an explicit error message when
missing traits or covariates are encountered.
Best,
Corbin
…On Sun, Jul 18, 2021 at 11:18 AM Lennart Landsmeer ***@***.***> wrote:
To check whether it was because of the covariates, I redid the analysis
using only PC1 & PC2 of the genotype:
20 present in both bcf and bed file.
167044 total variants on selected chromosomes.
Found 2124 samples in bcf file ...
Found 2322 samples in covariate file ...
Found 624 samples in expression bed file ...
Found 624 samples in common across all three files.
Processed data for 2 covariates across 624 samples.
Processed expression for 491 genes across 624 samples.
Processed genotype data for 167044 variants.
Using OLS (no GRM specified).
Started cis-QTL analysis ...
Scaling expression traits ...
Calculating genotype-covariate covariance...
Calculating genotype residual variances ...Done.
Calculating expression residuals...
Scaling expression residuals ...
Processed 54 cis-QTL blocks out of 54 total
Now it sees to work (no more NaN)!
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#5 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAWQAM3QNAMK32DNJ37FLNDTYLWC3ANCNFSM5ASCJCGQ>
.
|
Dear Corbin, Thanks for the fast response. I do not have missing data in my covariates (or anywhere else) Kind regards, |
Ok I narrowed 1 specific case down even further. I have 3 dummy variables for sample sex. M, F & Missing. All are 0/1 vectors. If I include only 1 of those or any combination of 2 (M&F, F&Missing or Missing&M) I do not get NaN values. If I include all three (M&F&Missing) I do get NaN's Edit: This simple pattern does sadly not hold in general, and the way how different combinations of dummy variables might or might not produce NaN's looks pretty random to me. For example, combining the Hospital dummy's (fine by itself) with the Male&Female dummy's (also fine) results in NaN's |
Aha. This is an issue of perfect multicollinearity. APEX appends an
intercept column to the covariate matrix. If you include dummy variables
for all 3 values, you have a perfect linear dependency in the covariate
design matrix. In that case, X^T X is not invertible, and hence the NaNs.
If your factor (sex) has 3 levels, then you need to include dummy variables
for only 2 of the values. The particular choice of which two levels (missing,
male, female) you choose to include as dummy variables and which is reference has no effect on the eQTL association
results.
In the future, we could check for perfect multicollinearity in the design
matrix to avoid this issue.
…On Sun, Jul 18, 2021 at 11:49 AM Lennart Landsmeer ***@***.***> wrote:
Ok I narrowed 1 specific case down even further.
I have 3 dummy variables for sample sex. M, F & Missing. All are 0/1
vectors. If I include only 1 of those or any combination of 2 (M&F,
F&Missing or Missing&M) I do not get NaN values. If I include all three
(M&F&Missing) I do get NaN's
*Edit*: This simple pattern does sadly not hold in general, and the way
how different combinations of dummy variables might or might not produce
NaN's looks pretty random to me
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#5 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAWQAMZB66K7LPF26PLPKBDTYLZYTANCNFSM5ASCJCGQ>
.
|
Yes it is indeed perfect multicollinearity in the M/F/Missing case. However, I just did a VIF test for the Hospital+M+F case and they are not perfectly multicollinear, even after sample overlapping (altough VIF is pretty high), but this still generates NaN values. |
Hmm, strange. To be sure, in R, you might try constructing a design matrix
X with four columns,
X = [Intercept, Hospital-dummy, Male-dummy, Female-dummy]
Then check whether solve(crossprod(X)) fails.
To be sure, you will want to subset X to the individuals that are present
across all three files. It could be invertible in the overall sample, but
not in the subset.
…On Sun, Jul 18, 2021, 12:45 PM Lennart Landsmeer ***@***.***> wrote:
Yes it is indeed perfect multicollinearity in the M/F/Missing case.
However, I just did a VIF test for the Hospital+M+F case and they are
*not* perfectly multicollinear (altough VIF is pretty high), but this
generate NaN
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#5 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAWQAM2Z77WSONXEKHLLLX3TYMAKZANCNFSM5ASCJCGQ>
.
|
Yes I was just reading through the code and implemented the |
Thanks. If you can send the covariate data, or just the [Sex, Hospital]
columns to ***@***.***, I could try to take a look later.
…On Sun, Jul 18, 2021 at 1:55 PM Lennart Landsmeer ***@***.***> wrote:
Yes I was just reading through the code and implemented the
make_half_hat_matrix function in python. ***@***.***) works,
the design matrix cross product is invertible (and I only looked at the cov
*bed*vcf sample overlap). No idea what goed wrong :(
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#5 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAWQAMZ6QCH5TPN5DJKR7S3TYMIQ5ANCNFSM5ASCJCGQ>
.
|
So after more trials - it seemed like there was an even more complicated perfect colinearity problem. After automatically one-by-one removal of all VIF=Inf covariates APEX runs smoothly again. Sorry for taking up your time and thanks for the support |
Dear APEX authors,
I just APEX on our genotypes+RNA expression dataset. It runs fine in QTLtools, but when I run it in APEX I get this as output:
One possible hint might be that the
cis_gene_table
has allegene_pval
at 0.5 and allresid_sd
at-nan
(andn_cis_variants
in the expected range of a few thousand SNPs).I performed all the necessary steps to convert the vcf/bed/covariates to the required input formats and tried with both NA values replaced by 0 or kept as in in the covariates.
What could possible cause this?
Kind regards,
Lennart Landsmeer
The text was updated successfully, but these errors were encountered: