Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Singular values & proportion of explained variance #46

Closed
YannDussert opened this issue Feb 17, 2020 · 14 comments
Closed

Singular values & proportion of explained variance #46

YannDussert opened this issue Feb 17, 2020 · 14 comments

Comments

@YannDussert
Copy link

Hi,

In the pcadapt documentation, the singular.values vector in a pcadapt object is described as "the vector containing the K ordered squared root of the proportion of variance explained by each PC". The values for the y-axis on the scree plot are also computed by "squaring" the singular values.

However, should the proportion of explained variance for the i-th PC not rather be computed as:

singular.values[i]^2 / sum(singular.values^2) 

?

I am sorry if I missed something and if I am mistaken.

Best regards,
Yann

@privefl
Copy link
Member

privefl commented Feb 17, 2020

In the code, we have

singular.values = obj.pca$d / sqrt((nrow(obj.pca$u) - 1) * length(obj.pca$pass)),

where (nrow(obj.pca$u) - 1) * length(obj.pca$pass) should approximate the total variance of the scaled matrix.

Recall that only the first PCs are computed, so you can't sum over only the first values to get the total variance.

Does the result of cumsum(singular.values[i]^2) look weird to you?

@YannDussert
Copy link
Author

Thank you for you reply.

I was surprised by the fact that the total proportion of explained variance was not 1, but after your reply, I realized that the function uses a truncated SVD, so, as you said, only the first K PCs are taken into account. Am I correct?

Best,
Yann

@privefl
Copy link
Member

privefl commented Feb 17, 2020

Yes, only the first K PCs are used.
For applications such as in human genetics, the first 10 PCs can explain only 10% of the total variance sometimes. Are you seeing something like this?

@ldutoit
Copy link

ldutoit commented Jan 26, 2022

Hi,

I am having issues because our total explained variance is well above 1 (1.8 using singular values over the first 20 axes). Have you seen this before?

thumbnail_scree plot - K = 20

Do you know why it happens? I can't get my head around it.

thank you very much for your help!

Ludo

@privefl
Copy link
Member

privefl commented Jan 26, 2022

Did you forget to square the singular values?

@ldutoit
Copy link

ldutoit commented Jan 26, 2022

Thanks!

No I did not. From the scree plot: The first few axes along sum up to over 1, and this is confirmed by the singular values:

data <- read.pcadapt("myfile.vcf", type = "vcf")
x <- pcadapt(input = data, K = 20)

##calculate EV
EV <- (x$singular.values^2)

EV2 <-EV*100

EV2
##EV2labels## "PC1, 39.2%", "PC2, 25.7%", "PC3, 21.5%", "PC4, 20.2%"

Very much appreciate the help (and the package).

Ludo

@privefl
Copy link
Member

privefl commented Jan 26, 2022

Weird indeed. What is the size of your data?
Are you allowed to share it?

@ldutoit
Copy link

ldutoit commented Jan 26, 2022

Hi Florian,

it is only 52k SNPs. I sent it by email to ******.21@gmail.com. Hopefully it is the right address!

Ludo

@privefl
Copy link
Member

privefl commented Jan 26, 2022

library(pcadapt)
data <- read.pcadapt("tmp-data/out.recode.vcf", type = "vcf")
x <- pcadapt(input = data, K = 20)

X <- bed2matrix(data)
table(X[, x$pass], exclude = NULL)
#       0       1       2    <NA> 
# 1789596   35334  327475  140215 

X.scaled <- scale(X, center = 2 * x$af, scale = sqrt(2 * x$af * (1 - x$af)))
colMeans(X.scaled[, x$pass]^2, na.rm = TRUE)  # should be all ~1, but are ~2

It seems that your genotypes do not follow HWE at all (too few 1s).
So that the total variance is approx twice of what the package approximates.

@ldutoit
Copy link

ldutoit commented Jan 26, 2022

That makes sense.

Is the HWE assumption for the imputation or within the pca?

Would it be correct to take the total estimated variance over the first say 100 axes as the "total explained variance" and then look at each of those axes at a proportion of that variance, basically scaling it?

@privefl
Copy link
Member

privefl commented Jan 27, 2022

For the scaling used in the PCA (sqrt(2p(1-p))), it is supposed to give you variables that have variance 1 under HWE.

No, I don't think it would be correct. But usually, the singular values (or square of them) decrease almost linearly so that you can extrapolate the rest of them and therefore the total variance.
Also, the total variance is just the sum of squared elements of the (scaled) matrix.
So, basically, I think you could just use colMeans(X.scaled[, x$pass]^2, na.rm = TRUE) * nrow(X) as the total variance.

@privefl
Copy link
Member

privefl commented Jan 27, 2022

Ok, I think that should work. You can extrapolate the variance explained like this (using K = 100 as input):

y <- cumsum(x$singular.values^2)
plot(y, log = "")
y2 <- splinefun(seq_along(y), y, method = "monoH.FC")(101:(nrow(x$scores) - 1))
plot(c(y, y2))
EV <- x$singular.values^2 / max(y2)  

@ldutoit
Copy link

ldutoit commented Jan 27, 2022

Thank you so much, very grateful for your help.

That is awesome!

Interestingly, simply scaling using the total variance give relatively similar results.

See below for the first five axes:

 (x$singular.values**2/sum(x$singular.values**2))[1:5]
#0.19307577 0.12665384 0.10608894 0.09965733 0.07435843

EV[1:5]
#0.18669681 0.12246936 0.10258390 0.09636479 0.07190173

Thank you.

Ludo

@privefl
Copy link
Member

privefl commented Apr 15, 2024

There is now a better estimate of the total variance used in v4.4.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants