Singular values & proportion of explained variance #46

YannDussert · 2020-02-17T11:37:10Z

Hi,

In the pcadapt documentation, the singular.values vector in a pcadapt object is described as "the vector containing the K ordered squared root of the proportion of variance explained by each PC". The values for the y-axis on the scree plot are also computed by "squaring" the singular values.

However, should the proportion of explained variance for the i-th PC not rather be computed as:

singular.values[i]^2 / sum(singular.values^2)

?

I am sorry if I missed something and if I am mistaken.

Best regards,
Yann

The text was updated successfully, but these errors were encountered:

privefl · 2020-02-17T12:32:18Z

In the code, we have

pcadapt/R/pcadapt.R

Line 246 in 307c7e2

    
           singular.values = obj.pca$d / sqrt((nrow(obj.pca$u) - 1) * length(obj.pca$pass)),

where (nrow(obj.pca$u) - 1) * length(obj.pca$pass) should approximate the total variance of the scaled matrix.

Recall that only the first PCs are computed, so you can't sum over only the first values to get the total variance.

Does the result of cumsum(singular.values[i]^2) look weird to you?

YannDussert · 2020-02-17T13:48:25Z

Thank you for you reply.

I was surprised by the fact that the total proportion of explained variance was not 1, but after your reply, I realized that the function uses a truncated SVD, so, as you said, only the first K PCs are taken into account. Am I correct?

Best,
Yann

privefl · 2020-02-17T18:03:57Z

Yes, only the first K PCs are used.
For applications such as in human genetics, the first 10 PCs can explain only 10% of the total variance sometimes. Are you seeing something like this?

ldutoit · 2022-01-26T20:15:04Z

Hi,

I am having issues because our total explained variance is well above 1 (1.8 using singular values over the first 20 axes). Have you seen this before?

Do you know why it happens? I can't get my head around it.

thank you very much for your help!

Ludo

privefl · 2022-01-26T20:35:27Z

Did you forget to square the singular values?

ldutoit · 2022-01-26T21:02:47Z

Thanks!

No I did not. From the scree plot: The first few axes along sum up to over 1, and this is confirmed by the singular values:

data <- read.pcadapt("myfile.vcf", type = "vcf")
x <- pcadapt(input = data, K = 20)

##calculate EV
EV <- (x$singular.values^2)

EV2 <-EV*100

EV2
##EV2labels## "PC1, 39.2%", "PC2, 25.7%", "PC3, 21.5%", "PC4, 20.2%"

Very much appreciate the help (and the package).

Ludo

privefl · 2022-01-26T21:06:10Z

Weird indeed. What is the size of your data?
Are you allowed to share it?

ldutoit · 2022-01-26T21:11:45Z

Hi Florian,

it is only 52k SNPs. I sent it by email to ******.21@gmail.com. Hopefully it is the right address!

Ludo

privefl · 2022-01-26T22:06:47Z

library(pcadapt)
data <- read.pcadapt("tmp-data/out.recode.vcf", type = "vcf")
x <- pcadapt(input = data, K = 20)

X <- bed2matrix(data)
table(X[, x$pass], exclude = NULL)
#       0       1       2    <NA> 
# 1789596   35334  327475  140215 

X.scaled <- scale(X, center = 2 * x$af, scale = sqrt(2 * x$af * (1 - x$af)))
colMeans(X.scaled[, x$pass]^2, na.rm = TRUE)  # should be all ~1, but are ~2

It seems that your genotypes do not follow HWE at all (too few 1s).
So that the total variance is approx twice of what the package approximates.

ldutoit · 2022-01-26T23:16:53Z

That makes sense.

Is the HWE assumption for the imputation or within the pca?

Would it be correct to take the total estimated variance over the first say 100 axes as the "total explained variance" and then look at each of those axes at a proportion of that variance, basically scaling it?

privefl · 2022-01-27T07:51:53Z

For the scaling used in the PCA (sqrt(2p(1-p))), it is supposed to give you variables that have variance 1 under HWE.

No, I don't think it would be correct. But usually, the singular values (or square of them) decrease almost linearly so that you can extrapolate the rest of them and therefore the total variance.
Also, the total variance is just the sum of squared elements of the (scaled) matrix.
So, basically, I think you could just use colMeans(X.scaled[, x$pass]^2, na.rm = TRUE) * nrow(X) as the total variance.

privefl · 2022-01-27T08:09:48Z

Ok, I think that should work. You can extrapolate the variance explained like this (using K = 100 as input):

y <- cumsum(x$singular.values^2)
plot(y, log = "")
y2 <- splinefun(seq_along(y), y, method = "monoH.FC")(101:(nrow(x$scores) - 1))
plot(c(y, y2))
EV <- x$singular.values^2 / max(y2)

ldutoit · 2022-01-27T20:22:55Z

Thank you so much, very grateful for your help.

That is awesome!

Interestingly, simply scaling using the total variance give relatively similar results.

See below for the first five axes:

 (x$singular.values**2/sum(x$singular.values**2))[1:5]
#0.19307577 0.12665384 0.10608894 0.09965733 0.07435843

EV[1:5]
#0.18669681 0.12246936 0.10258390 0.09636479 0.07190173

Thank you.

Ludo

privefl · 2024-04-15T13:12:34Z

There is now a better estimate of the total variance used in v4.4.

privefl mentioned this issue Jun 10, 2020

How to obtain details of PCA drown by pcadapt #52

Closed

privefl closed this as completed Jun 10, 2020

privefl mentioned this issue Jul 18, 2021

How to get the genotype file #66

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Singular values & proportion of explained variance #46

Singular values & proportion of explained variance #46

YannDussert commented Feb 17, 2020

privefl commented Feb 17, 2020 •

edited

Loading

YannDussert commented Feb 17, 2020

privefl commented Feb 17, 2020

ldutoit commented Jan 26, 2022 •

edited

Loading

privefl commented Jan 26, 2022

ldutoit commented Jan 26, 2022

privefl commented Jan 26, 2022

ldutoit commented Jan 26, 2022

privefl commented Jan 26, 2022

ldutoit commented Jan 26, 2022 •

edited

Loading

privefl commented Jan 27, 2022

privefl commented Jan 27, 2022 •

edited

Loading

ldutoit commented Jan 27, 2022 •

edited

Loading

privefl commented Apr 15, 2024

Singular values & proportion of explained variance #46

Singular values & proportion of explained variance #46

Comments

YannDussert commented Feb 17, 2020

privefl commented Feb 17, 2020 • edited Loading

YannDussert commented Feb 17, 2020

privefl commented Feb 17, 2020

ldutoit commented Jan 26, 2022 • edited Loading

privefl commented Jan 26, 2022

ldutoit commented Jan 26, 2022

privefl commented Jan 26, 2022

ldutoit commented Jan 26, 2022

privefl commented Jan 26, 2022

ldutoit commented Jan 26, 2022 • edited Loading

privefl commented Jan 27, 2022

privefl commented Jan 27, 2022 • edited Loading

ldutoit commented Jan 27, 2022 • edited Loading

privefl commented Apr 15, 2024

privefl commented Feb 17, 2020 •

edited

Loading

ldutoit commented Jan 26, 2022 •

edited

Loading

ldutoit commented Jan 26, 2022 •

edited

Loading

privefl commented Jan 27, 2022 •

edited

Loading

ldutoit commented Jan 27, 2022 •

edited

Loading