Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistency between Fst and deltaDAF (derived allele frequency) with pcadapt #39

Closed
Mary-00 opened this issue Jul 3, 2019 · 19 comments

Comments

@Mary-00
Copy link

Mary-00 commented Jul 3, 2019

Dear Developers,
With reference to #31, I’m working with a list of SNP variants related to a complex disease. I would like to find which SNPs are outlier between my population and each population of 1000 genome project. To this end, I calculated Fst, deltaDAF, and used pcadapt (version 4.0.3) to find outlier SNPs between my population and each population of 1000 genome. The below code was used:

vcf2pcadapt("file1.vcf", output = "file2.pcadapt", allele.sep = c("/", "|"))
geno_pcadapt <- read.pcadapt ("file2.pcadapt", type="pcadapt")
x <- pcadapt (input = geno_pcadapt, K = 2)
plot (x, option = "screeplot")
res <- pcadapt (geno_pcadapt, K = 2, LD.clumping = list(size=200, thr = 0.1))
p.adj <- p.adjust (res$pvalue, method=”fdr”)
alpha <- 0.05
outliers <- which (p.adj < alpha)

While the results of Fst and deltaDAF are highly correlated (based on Pearson’s coefficient), the pcadapt returned so different result that doesn’t match with Fst and delatDAF. I’m not sure how I shall interpret these results or there may be anything wrong with pcadapt.
I also test Outflank, but it gave me strange output even for Fst; as its author was not responsive, I leave this analysis. Anyway, could you please kindly suggest me how I can interpret these results?

Almost all plots is similar to #31; however, please let me know if you need more information.

Best

@privefl
Copy link
Member

privefl commented Jul 3, 2019

Do you still have only a few thousands SNPs?
Then, did you try to add null SNPs as I recommended in the other issue?

@Mary-00
Copy link
Author

Mary-00 commented Jul 3, 2019

Yes, I have about 2400 SNPs. Sorry, what's your mean with null SNP and please kindly tell me how I shall add it?

@privefl
Copy link
Member

privefl commented Jul 3, 2019

If you go back to #31, I point to #24 that is a similar issue as yours I think.
At the end, I show how to add null SNPs because it is important for pcadapt.

@mblumuga
Copy link
Member

mblumuga commented Jul 3, 2019

If I get it right, you repeat the following operation
find SNPs differentiated between "your" population and a population of choice of the 1000 genome

If this is correct, you should use K=1 because PC1 will be the axis corresponding to differentiation between "your" population and a population of choice of the 1000 genome

@Mary-00
Copy link
Author

Mary-00 commented Jul 3, 2019

Right, but, now I also calculated deltaDAF and find high correlation between Fst and deltaDAF. Yes, I used K=1 for pcadapt. I'm following privefl's comment and adding null SNP.

@Mary-00
Copy link
Author

Mary-00 commented Jul 3, 2019

Hi,
Privefl, I followed your instruction and used the below command:

vcf2pcadapt("file1.vcf", output = "file2.pcadapt", allele.sep = c("/", "|"))
geno <- read.pcadapt ("file2.pcadapt", type="pcadapt")

adding Null SNP
mat <- pcadapt::bed2matrix(geno)
N <- nrow(mat)
M <- 50e3
mat_null <- sapply(runif(M, min = 0.05, max = 0.5), function(af) {rbinom(N, size = 2, prob = af)})
mat2 <- cbind(mat, mat_null)
geno <- read.pcadapt(mat2, type = "lfmm")

x <- pcadapt (input = geno, K = 1)
res <- pcadapt (geno, K = 1, LD.clumping = list(size=200, thr = 0.1))
p.adj <- p.adjust (res$pvalue, method=”fdr”)
alpha <- 0.05
outliers <- which (p.adj < alpha)

with this code, pcadapt returned all my SNP (2400) as outlier!!. Could you please kindly check the code and let me know your suggestion?

Many thanks

@privefl
Copy link
Member

privefl commented Jul 3, 2019

Does the histogram of p-values looks good?
Can you share your vcf file?

@Mary-00
Copy link
Author

Mary-00 commented Jul 3, 2019

Thanks a lot for your prompt feedback. Here is p-value histogram

pval4

Hope the problem solved here; honestly, I should consult with our head for sharing.

@privefl
Copy link
Member

privefl commented Jul 3, 2019

Seems good.

@Mary-00
Copy link
Author

Mary-00 commented Jul 6, 2019

Hi privefl,
Thanks, but I get all my SNPs as outliers! that is not good. Please kindly share me if you have another suggestion/advice to solve the issue or I share the vcf file with you?

@privefl
Copy link
Member

privefl commented Jul 6, 2019

I'm just saying that the p-values looks calibrated, which is good.

For your SNPs, maybe they are all outliers given how you chosed this subset in the first place.
Or maybe, pcadapt is not fit for your particular problem. You should run it on the entire chip if possible.

@mblumuga
Copy link
Member

mblumuga commented Jul 6, 2019

Nope, your procedure does not seem good to me @Mary-00. You have simulated null SNPS using
mat_null <- sapply(runif(M, min = 0.05, max = 0.5), function(af) {rbinom(N, size = 2, prob = af)})
I do not expect this procedure to be correct.

What you should do instead is to include both SNP variants related to a complex disease and random SNPs in the genome (or all other SNPs). Doing that, you will see that the SNPs related to the disease will not be systematically considered as outliers.

@Mary-00
Copy link
Author

Mary-00 commented Jul 6, 2019

Hi
Thank you for all your feedback.
I used the code as privefl suggested.

Actually, the selected variants were resulted from the whole genome sequencing, not chip; so the all variants are located at the various VCF files for each chromosome. This is also true for 1000 genome population, which the corresponded variants are in the various VCF files for each chromosome; for using all variants for pcadapt, I should merged the vcf files of all chromosomes in my population and 1000 genome population, then merged them and feed a very huge file into pcadapt, yes? If pcadapt can handle such a big file? So, using random SNP may be more useful than all variants, could you please kindly share me your idea about it?

In the case of using random SNP, I should extract the random SNP with matched (similar) allele frequency and LD value with my SNP list from my population and 1000 genome population, am I correct? As far as I searched, I could not find any tool/script for extracting such a random SNPs (my mean random SNP with similar allele frequency and LD value with my SNP list). Could you please kindly share me if you have any suggestion/solution to get such a random SNP and how I should add them to my real variants?

I would like to stay with pcadapt in my work, hop the problem solved.

Many thanks in advance

@privefl
Copy link
Member

privefl commented Jul 6, 2019

I would use PLINK to subset variants.
You can provide a list of variant IDs as well as randomly keeping a subset proportion as well as performing some pruning.

@mblumuga
Copy link
Member

mblumuga commented Jul 6, 2019

I should merged the vcf files of all chromosomes in my population and 1000 genome population, then merged them and feed a very huge file into pcadapt, yes?
Yes, pcadapt can handle very large genotype files. If you can use all polymorphic variants, it is even better than a random subsample.
Will be away from github for some time.

@Mary-00
Copy link
Author

Mary-00 commented Jul 7, 2019

Thanks for your feedback. First, I prefer trying random SNP as privefl suggested;however, I'm a bit confused what exactly shall I do. Could you please advice me how I can get random SNP considering similar allele frequency and LD value to my snp list from various populations of 1000 genome?

Thanks

@privefl
Copy link
Member

privefl commented Jul 7, 2019

I don't know how to do this precisely.
I would just subset to SNPs with MAF > 5% and then get a random proportion.

@Mary-00
Copy link
Author

Mary-00 commented Jul 7, 2019

OK, so I considered just MAF as you suggested; however, I found that my population has many variants with allele frequency (AF) of lower than 0.05, unlike the 1000 genome populations; actually, these variants are really different, so how we don't consider them?

Regarding the random SNP selection, the number of random SNP should be similar to my SNP number, around 2400, is it right?

@privefl
Copy link
Member

privefl commented Aug 17, 2020

I see that I haven't followed up on this.
Did you manage to solve the issue?

@privefl privefl closed this as completed Jan 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants