Inconsistency between Fst and deltaDAF (derived allele frequency) with pcadapt #39

Mary-00 · 2019-07-03T05:22:27Z

Dear Developers,
With reference to #31, I’m working with a list of SNP variants related to a complex disease. I would like to find which SNPs are outlier between my population and each population of 1000 genome project. To this end, I calculated Fst, deltaDAF, and used pcadapt (version 4.0.3) to find outlier SNPs between my population and each population of 1000 genome. The below code was used:

vcf2pcadapt("file1.vcf", output = "file2.pcadapt", allele.sep = c("/", "|"))
geno_pcadapt <- read.pcadapt ("file2.pcadapt", type="pcadapt")
x <- pcadapt (input = geno_pcadapt, K = 2)
plot (x, option = "screeplot")
res <- pcadapt (geno_pcadapt, K = 2, LD.clumping = list(size=200, thr = 0.1))
p.adj <- p.adjust (res$pvalue, method=”fdr”)
alpha <- 0.05
outliers <- which (p.adj < alpha)

While the results of Fst and deltaDAF are highly correlated (based on Pearson’s coefficient), the pcadapt returned so different result that doesn’t match with Fst and delatDAF. I’m not sure how I shall interpret these results or there may be anything wrong with pcadapt.
I also test Outflank, but it gave me strange output even for Fst; as its author was not responsive, I leave this analysis. Anyway, could you please kindly suggest me how I can interpret these results?

Almost all plots is similar to #31; however, please let me know if you need more information.

Best

The text was updated successfully, but these errors were encountered:

privefl · 2019-07-03T05:55:36Z

Do you still have only a few thousands SNPs?
Then, did you try to add null SNPs as I recommended in the other issue?

Mary-00 · 2019-07-03T06:16:23Z

Yes, I have about 2400 SNPs. Sorry, what's your mean with null SNP and please kindly tell me how I shall add it?

privefl · 2019-07-03T06:33:27Z

If you go back to #31, I point to #24 that is a similar issue as yours I think.
At the end, I show how to add null SNPs because it is important for pcadapt.

mblumuga · 2019-07-03T08:56:53Z

If I get it right, you repeat the following operation
find SNPs differentiated between "your" population and a population of choice of the 1000 genome

If this is correct, you should use K=1 because PC1 will be the axis corresponding to differentiation between "your" population and a population of choice of the 1000 genome

Mary-00 · 2019-07-03T09:40:18Z

Right, but, now I also calculated deltaDAF and find high correlation between Fst and deltaDAF. Yes, I used K=1 for pcadapt. I'm following privefl's comment and adding null SNP.

Mary-00 · 2019-07-03T10:59:56Z

Hi,
Privefl, I followed your instruction and used the below command:

vcf2pcadapt("file1.vcf", output = "file2.pcadapt", allele.sep = c("/", "|"))
geno <- read.pcadapt ("file2.pcadapt", type="pcadapt")

adding Null SNP
mat <- pcadapt::bed2matrix(geno)
N <- nrow(mat)
M <- 50e3
mat_null <- sapply(runif(M, min = 0.05, max = 0.5), function(af) {rbinom(N, size = 2, prob = af)})
mat2 <- cbind(mat, mat_null)
geno <- read.pcadapt(mat2, type = "lfmm")

x <- pcadapt (input = geno, K = 1)
res <- pcadapt (geno, K = 1, LD.clumping = list(size=200, thr = 0.1))
p.adj <- p.adjust (res$pvalue, method=”fdr”)
alpha <- 0.05
outliers <- which (p.adj < alpha)

with this code, pcadapt returned all my SNP (2400) as outlier!!. Could you please kindly check the code and let me know your suggestion?

Many thanks

privefl · 2019-07-03T11:01:54Z

Does the histogram of p-values looks good?
Can you share your vcf file?

Mary-00 · 2019-07-03T11:32:08Z

Thanks a lot for your prompt feedback. Here is p-value histogram

Hope the problem solved here; honestly, I should consult with our head for sharing.

privefl · 2019-07-03T12:56:28Z

Seems good.

Mary-00 · 2019-07-06T05:59:07Z

Hi privefl,
Thanks, but I get all my SNPs as outliers! that is not good. Please kindly share me if you have another suggestion/advice to solve the issue or I share the vcf file with you?

privefl · 2019-07-06T07:02:57Z

I'm just saying that the p-values looks calibrated, which is good.

For your SNPs, maybe they are all outliers given how you chosed this subset in the first place.
Or maybe, pcadapt is not fit for your particular problem. You should run it on the entire chip if possible.

mblumuga · 2019-07-06T12:18:30Z

Nope, your procedure does not seem good to me @Mary-00. You have simulated null SNPS using
mat_null <- sapply(runif(M, min = 0.05, max = 0.5), function(af) {rbinom(N, size = 2, prob = af)})
I do not expect this procedure to be correct.

What you should do instead is to include both SNP variants related to a complex disease and random SNPs in the genome (or all other SNPs). Doing that, you will see that the SNPs related to the disease will not be systematically considered as outliers.

Mary-00 · 2019-07-06T13:58:51Z

Hi
Thank you for all your feedback.
I used the code as privefl suggested.

Actually, the selected variants were resulted from the whole genome sequencing, not chip; so the all variants are located at the various VCF files for each chromosome. This is also true for 1000 genome population, which the corresponded variants are in the various VCF files for each chromosome; for using all variants for pcadapt, I should merged the vcf files of all chromosomes in my population and 1000 genome population, then merged them and feed a very huge file into pcadapt, yes? If pcadapt can handle such a big file? So, using random SNP may be more useful than all variants, could you please kindly share me your idea about it?

In the case of using random SNP, I should extract the random SNP with matched (similar) allele frequency and LD value with my SNP list from my population and 1000 genome population, am I correct? As far as I searched, I could not find any tool/script for extracting such a random SNPs (my mean random SNP with similar allele frequency and LD value with my SNP list). Could you please kindly share me if you have any suggestion/solution to get such a random SNP and how I should add them to my real variants?

I would like to stay with pcadapt in my work, hop the problem solved.

Many thanks in advance

privefl · 2019-07-06T18:47:18Z

I would use PLINK to subset variants.
You can provide a list of variant IDs as well as randomly keeping a subset proportion as well as performing some pruning.

mblumuga · 2019-07-06T19:40:52Z

I should merged the vcf files of all chromosomes in my population and 1000 genome population, then merged them and feed a very huge file into pcadapt, yes?
Yes, pcadapt can handle very large genotype files. If you can use all polymorphic variants, it is even better than a random subsample.
Will be away from github for some time.

Mary-00 · 2019-07-07T06:57:46Z

Thanks for your feedback. First, I prefer trying random SNP as privefl suggested;however, I'm a bit confused what exactly shall I do. Could you please advice me how I can get random SNP considering similar allele frequency and LD value to my snp list from various populations of 1000 genome?

Thanks

privefl · 2019-07-07T07:03:47Z

I don't know how to do this precisely.
I would just subset to SNPs with MAF > 5% and then get a random proportion.

Mary-00 · 2019-07-07T08:57:51Z

OK, so I considered just MAF as you suggested; however, I found that my population has many variants with allele frequency (AF) of lower than 0.05, unlike the 1000 genome populations; actually, these variants are really different, so how we don't consider them?

Regarding the random SNP selection, the number of random SNP should be similar to my SNP number, around 2400, is it right?

privefl · 2020-08-17T11:40:29Z

I see that I haven't followed up on this.
Did you manage to solve the issue?

privefl mentioned this issue Mar 27, 2020

Great difference of outliers detection using same cutoffs pcadapt/OutFlanks #47

Closed

privefl closed this as completed Jan 21, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistency between Fst and deltaDAF (derived allele frequency) with pcadapt #39

Inconsistency between Fst and deltaDAF (derived allele frequency) with pcadapt #39

Mary-00 commented Jul 3, 2019

privefl commented Jul 3, 2019

Mary-00 commented Jul 3, 2019

privefl commented Jul 3, 2019

mblumuga commented Jul 3, 2019

Mary-00 commented Jul 3, 2019 •

edited

Loading

Mary-00 commented Jul 3, 2019

privefl commented Jul 3, 2019

Mary-00 commented Jul 3, 2019 •

edited

Loading

privefl commented Jul 3, 2019

Mary-00 commented Jul 6, 2019 •

edited

Loading

privefl commented Jul 6, 2019 •

edited

Loading

mblumuga commented Jul 6, 2019

Mary-00 commented Jul 6, 2019

privefl commented Jul 6, 2019

mblumuga commented Jul 6, 2019

Mary-00 commented Jul 7, 2019

privefl commented Jul 7, 2019

Mary-00 commented Jul 7, 2019 •

edited

Loading

privefl commented Aug 17, 2020

Inconsistency between Fst and deltaDAF (derived allele frequency) with pcadapt #39

Inconsistency between Fst and deltaDAF (derived allele frequency) with pcadapt #39

Comments

Mary-00 commented Jul 3, 2019

privefl commented Jul 3, 2019

Mary-00 commented Jul 3, 2019

privefl commented Jul 3, 2019

mblumuga commented Jul 3, 2019

Mary-00 commented Jul 3, 2019 • edited Loading

Mary-00 commented Jul 3, 2019

privefl commented Jul 3, 2019

Mary-00 commented Jul 3, 2019 • edited Loading

privefl commented Jul 3, 2019

Mary-00 commented Jul 6, 2019 • edited Loading

privefl commented Jul 6, 2019 • edited Loading

mblumuga commented Jul 6, 2019

Mary-00 commented Jul 6, 2019

privefl commented Jul 6, 2019

mblumuga commented Jul 6, 2019

Mary-00 commented Jul 7, 2019

privefl commented Jul 7, 2019

Mary-00 commented Jul 7, 2019 • edited Loading

privefl commented Aug 17, 2020

Mary-00 commented Jul 3, 2019 •

edited

Loading

Mary-00 commented Jul 3, 2019 •

edited

Loading

Mary-00 commented Jul 6, 2019 •

edited

Loading

privefl commented Jul 6, 2019 •

edited

Loading

Mary-00 commented Jul 7, 2019 •

edited

Loading