Inconsistency between Fst and deltaDAF (derived allele frequency) with pcadapt #39
Comments
Do you still have only a few thousand SNPs?
Yes, I have about 2400 SNPs. Sorry, what do you mean by null SNPs, and could you please tell me how I should add them?
If I understand correctly, you repeat the following operation: [...] If this is correct, you should use K=1, because PC1 will be the axis corresponding to the differentiation between "your" population and a population of your choice from the 1000 Genomes Project.
Right, but I have now also calculated deltaDAF and found a high correlation between Fst and deltaDAF. Yes, I used K=1 for pcadapt. I'm following privefl's comment and adding null SNPs.
Hi,
With this code, pcadapt returned all of my SNPs (2400) as outliers! Could you please check the code and let me know your suggestions? Many thanks
Does the histogram of p-values look good?
Seems good.
Hi privefl,
I'm just saying that the p-values look calibrated, which is good. As for your SNPs, maybe they are all outliers given how you chose this subset in the first place.
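For reference, the calibration check discussed here can also be done numerically. The snippet below is a minimal Python sketch, not pcadapt code: `pvalue_histogram` is a hypothetical helper and the p-values are made up. If most SNPs behave as null, the bin counts should be roughly flat, and genuine outliers show up as an excess in the lowest bin.

```python
def pvalue_histogram(pvalues, n_bins=10):
    """Count p-values in n_bins equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for p in pvalues:
        # A p-value of exactly 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


# Made-up p-values, for illustration only.
example_pvalues = [0.01, 0.02, 0.55, 0.99, 1.0]
counts = pvalue_histogram(example_pvalues)
expected_per_bin = len(example_pvalues) / len(counts)  # flat expectation under the null
print(counts, expected_per_bin)
```

A flat histogram says the test is calibrated on the SNPs it was given; it does not say whether that set of SNPs is a fair background for the ones of interest.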
Nope, your procedure does not seem good to me, @Mary-00. You have simulated null SNPs using [...] What you should do instead is include both the SNP variants related to the complex disease and random SNPs from the genome (or all other SNPs). If you do that, you will see that the SNPs related to the disease are not systematically considered outliers.
Hi. Actually, the selected variants come from whole-genome sequencing, not a chip, so the variants are spread across separate VCF files, one per chromosome. The same is true for the 1000 Genomes populations, where the corresponding variants are in separate VCF files per chromosome. To use all variants with pcadapt, I would have to merge the per-chromosome VCF files within my population and within the 1000 Genomes population, then merge the two and feed one very large file into pcadapt, is that right? Can pcadapt handle such a big file? So, using random SNPs may be more useful than using all variants; could you please share your thoughts on this? If I use random SNPs, I should extract random SNPs whose allele frequency and LD are matched (similar) to those of my SNP list, from both my population and the 1000 Genomes population, am I correct? As far as I have searched, I could not find any tool or script for extracting such random SNPs (by random SNPs I mean SNPs with allele frequency and LD similar to my SNP list). Could you please share any suggestion for obtaining such random SNPs, and how I should add them to my real variants? I would like to stay with pcadapt in my work and hope the problem can be solved. Many thanks in advance
I would use PLINK to subset variants.
Thanks for your feedback. First, I would prefer to try random SNPs as privefl suggested; however, I'm a bit confused about what exactly I should do. Could you please advise me on how to get random SNPs with allele frequency and LD similar to my SNP list from the various 1000 Genomes populations? Thanks
I don't know how to do this precisely.
OK, so I considered just MAF as you suggested; however, I found that my population has many variants with an allele frequency (AF) lower than 0.05, unlike the 1000 Genomes populations. These variants are genuinely different, so how can we leave them out? Regarding the random SNP selection, should the number of random SNPs be similar to the number of my SNPs, around 2400?
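The MAF-only matching discussed above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not a PLINK or pcadapt feature: `sample_maf_matched` is a hypothetical helper that bins SNPs by minor allele frequency and draws, for each SNP of interest, one background SNP from the same bin. LD matching is not attempted, and drawing one background SNP per target is an illustrative choice, not a recommendation from this thread. The SNP IDs and frequencies are made up.

```python
import random
from collections import defaultdict


def maf_bin(freq, n_bins=10):
    """Map an allele frequency to one of n_bins equal-width MAF bins over [0, 0.5]."""
    maf = min(freq, 1.0 - freq)
    # A MAF of exactly 0.5 is placed in the last bin.
    return min(int(maf * 2 * n_bins), n_bins - 1)


def sample_maf_matched(target_freqs, background, n_bins=10, seed=0):
    """Draw one background SNP per target SNP from the same MAF bin.

    target_freqs: allele frequencies of the SNPs of interest.
    background:   dict mapping SNP id -> allele frequency (candidate random SNPs).
    Targets whose bin has no remaining background SNP are skipped.
    """
    rng = random.Random(seed)
    pools = defaultdict(list)
    for snp_id, freq in background.items():
        pools[maf_bin(freq, n_bins)].append(snp_id)
    for pool in pools.values():
        rng.shuffle(pool)

    picked = []
    for freq in target_freqs:
        pool = pools[maf_bin(freq, n_bins)]
        if pool:
            picked.append(pool.pop())  # sampling without replacement
    return picked


# Hypothetical background SNPs and target frequencies.
background = {"rs1": 0.02, "rs2": 0.97, "rs3": 0.47, "rs4": 0.52, "rs5": 0.31}
matched = sample_maf_matched([0.04, 0.48], background)
print(matched)
```

The selected IDs could then be written to a file and used to subset the genotype data, for example with PLINK as suggested earlier in the thread. Which population's frequencies define the bins is left open here.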
I see that I haven't followed up on this.
Dear Developers,
With reference to #31, I'm working with a list of SNP variants related to a complex disease. I would like to find which SNPs are outliers between my population and each population of the 1000 Genomes Project. To this end, I calculated Fst and deltaDAF, and used pcadapt (version 4.0.3) to find outlier SNPs between my population and each 1000 Genomes population. The code below was used:
While the Fst and deltaDAF results are highly correlated (based on Pearson's coefficient), pcadapt returned very different results that do not match Fst and deltaDAF. I'm not sure how I should interpret these results, or whether there may be something wrong with pcadapt.
I also tested OutFLANK, but it gave me strange output even for Fst; as its author was not responsive, I dropped this analysis. Anyway, could you please suggest how I can interpret these results?
Almost all plots are similar to those in #31; however, please let me know if you need more information.
Best
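On the agreement between Fst and deltaDAF described in the post: for two populations, both statistics depend on the difference in allele frequency, so a high Pearson correlation between them is expected. The sketch below illustrates this in Python. It is not the code referred to in the post. It assumes a simple two-population Wright-style Fst with equal population weights and no sample-size correction, which may differ from the estimator actually used; the function names and frequencies are hypothetical.

```python
from math import sqrt


def delta_daf(p1, p2):
    """Absolute difference in derived allele frequency between two populations."""
    return abs(p1 - p2)


def wright_fst(p1, p2):
    """Two-population Fst = (HT - HS) / HT with equal weights.

    Algebraically this equals (p1 - p2)^2 / (4 * pbar * (1 - pbar)).
    """
    pbar = (p1 + p2) / 2.0
    denom = 4.0 * pbar * (1.0 - pbar)
    return (p1 - p2) ** 2 / denom if denom > 0 else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


# Made-up derived allele frequencies (population A, population B) per SNP.
freqs = [(0.10, 0.12), (0.30, 0.55), (0.50, 0.20), (0.80, 0.75), (0.05, 0.40)]
fst_values = [wright_fst(a, b) for a, b in freqs]
ddaf_values = [delta_daf(a, b) for a, b in freqs]
r = pearson(fst_values, ddaf_values)
print(r)
```

Because the numerator of this Fst is the squared deltaDAF, agreement between the two says little about whether a third method that works differently, such as pcadapt, should rank the same SNPs first.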