fatal error while running munge_sumstats.py #43
Hi Tushar, The fix for this error will depend on your data, but possible things to check include:
As an aside, there’s an ldsc users group (https://groups.google.com/forum/#!forum/ldsc_users) that reaches a larger number of people and is great for troubleshooting these kinds of questions. It also helps us separate initial troubleshooting, usage questions, and discussion from confirmed bugs on the GitHub issue tracker for future ldsc versions. Not a big deal, and I’m still happy to help here, but something to know for future reference. Cheers, On Feb 26, 2016, at 2:17 PM, Tushar Harishbhai Dave notifications@github.com wrote:
Thanks Raymond for your prompt response, and I apologize if this isn’t the right place to ask my questions; in the future, I will ask them on the Google group you mentioned. Thanks for the possible fixes you provided. Below are my answers:
But now I am getting another error with my other GWAS data file. Below are the script and the error: SCRIPT: ERROR: Traceback (most recent call last): Is there any possible solution for this error? Thanks for your help.
Hi Tushar, If that’s ok, the next thing I’d check is whether you have any SNPs where the entry for A1 is empty/missing. Cheers, On Feb 26, 2016, at 4:15 PM, Tushar Harishbhai Dave notifications@github.com wrote:
Hello Raymond, Below is the log file content:
Interpreting column names as follows: Reading list of SNPs for allele merge from /local/projects-t2/SIGN/analysis/depression/LDSC/w_hm3.snplist ERROR converting summary statistics: Traceback (most recent call last):
This looks like it could be an issue with pandas? Could you try running the following tiny python script:
If you're running an old version of pandas (below 0.16), try updating pandas (e.g., using pip or conda).
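The "tiny python script" referred to above isn't preserved in the thread; a minimal stand-in for that kind of pandas version check (an assumption about what it contained, not the original script) might look like:

```python
import pandas as pd

# Print the installed pandas version (the thread mentions versions below
# 0.16 as too old for munge_sumstats.py).
print(pd.__version__)

# Compare version components numerically rather than lexicographically
# (as strings, "0.9.0" would sort after "0.16.0").
major, minor = (int(part) for part in pd.__version__.split(".")[:2])
if (major, minor) < (0, 16):
    print("pandas is too old; consider upgrading via pip or conda")
```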
Hello Brendan, I don't think it's a package version issue, as the installed package versions are greater than the required ones.
Hello Raymond and Brendan, I found the problem, and it is a column name issue. I hadn't noticed that my file already has columns labeled A1 and A2, yet I also explicitly specified CODED_ALLELE and NONCODED_ALLELE because my file has those columns too. The script identified all four columns and probably got confused while processing the file. So I removed the line from my script that defines the column names explicitly, and the script ran fine. I have another question for you both. With the --merge-alleles option, I am using the HapMap SNP list file as suggested in the wiki. I am wondering whether it is OK to use the HapMap SNP list file for data that were imputed using 1000 Genomes reference panels. I am asking because I am losing a lot of SNPs that are not present in the HapMap file. The log line below will help you understand my point: Read 10727831 SNPs from --sumstats file. What are your thoughts on this? Thanks for your help.
Is there any way to print out the removed SNPs in another file? |
Hi Tushar, To your earlier question about --merge-alleles, the large number of removed SNPs is expected here. If you look at the w_hm3.snplist file it only contains ~1.2 million SNPs, so it looks like you have nearly all of them and then the remaining 9 million non-HapMap SNPs are excluded. Even without the SNP list, many of these would probably be excluded by the MAF and info score filters regardless. The HapMap3 SNP list is applicable to 1000 Genomes imputed data. The list provides a robust set of common SNPs that is reliably imputed across studies which makes it a good baseline standard, especially for analyzing genetic correlation. Cheers, On Feb 27, 2016, at 4:27 PM, Tushar Harishbhai Dave notifications@github.com wrote:
Thanks Raymond for your explanation. My curiosity about ldsc is far from satisfied, so I have another question for you. Below is the log file for my second summary statistics file:
Interpreting column names as follows: Reading list of SNPs for allele merge from /local/projects-t2/SIGN/analysis/depression/LDSC/w_hm3.snplist Metadata: As you can see, the log line just above the "Metadata" line says "Writing summary statistics for 1217311 SNPs (1199126 with nonmissing beta) to /local/projects-t2/SIGN/analysis/depression/LDSC/sign.sumstats.gz". How can the two numbers in that line differ from each other? Also, when I count the lines of the result file I get 1217311, and in fact I get the same number of lines (1217311) in the result files for both of my datasets. Why is that? Why are there blank column values in the result files? Is it because those SNPs were either removed or not found in the HapMap 3 file? As the next step in my analysis, I am running LD Score regression, using the same script explained in the wiki (but with my input files). My question: will the algorithm run the LD regression only on the common SNPs, i.e., the SNPs present in both data files? I hope my questions are clear enough. Thanks again for your help.
Hi Tushar, For ease of comparison across datasets, --merge-alleles prints output to sumstats.gz for all SNPs in the --merge-alleles file. SNPs that aren’t present in your input data will be output with missing values. So using w_hm3.snplist you will always have 1217311 SNPs in the output, and in this case your input data has results for 1199126 of those SNPs (after filtering). The log for ldsc should indicate the number of SNPs remaining in the regression (i.e. after merging SNPs present in the file of reference panel LD scores). Cheers, On Feb 29, 2016, at 2:39 PM, Tushar Harishbhai Dave notifications@github.com wrote:
Hello, Raymond. If I understand you correctly, no matter how many SNPs are left after the filtering process, I will always get the same number of SNPs (i.e., 1217311) in the output file. Can I use 1000 Genomes SNPs instead of HapMap3? If so, can you point me to the website from which I can download them? How reliable are the LD regression results if the number of SNPs is < 200K? Unfortunately, I have ~114K SNPs and the script prints a WARNING message. Thanks.
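The fixed output size described above behaves like a left join of the input data onto the --merge-alleles SNP list; a toy pandas illustration (rs IDs and column names here are invented, not from the real files):

```python
import pandas as pd

# Toy merge-alleles list and toy summary stats: the output keeps every
# SNP in the merge list, with missing values where the input has no result.
merge_list = pd.DataFrame({"SNP": ["rs1", "rs2", "rs3", "rs4"]})
sumstats = pd.DataFrame({"SNP": ["rs1", "rs3"], "Z": [1.2, -0.4]})

out = merge_list.merge(sumstats, on="SNP", how="left")

# Total rows always equal the merge-list size; nonmissing rows equal the
# number of SNPs that survived filtering in the input.
print(len(out), out["Z"].notna().sum())  # 4 2
```

This is why both result files have 1217311 lines (the size of w_hm3.snplist) while the count of SNPs with nonmissing beta varies per dataset, and why the rows for absent SNPs have blank values.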
Hello Raymond, When I tried to convert the observed-scale heritability to the liability scale, I got the following error:
Beginning analysis at Mon Feb 29 16:51:27 2016 Heritability of phenotype 1 ERROR computing rg for phenotype 2/2, from file /local/projects-t2/SIGN/analysis/depression/LDSC/sign.sumstats.gz. Can you please help me solve this? Thanks for your help.
Hi Tushar, Having only ~114K SNPs is unexpected and potentially problematic. Have you checked that everything is working properly with munge_sumstats.py for that dataset? Using 1000 Genomes SNPs is fine; 1000 Genomes data is available at http://www.1000genomes.org/data. I’d recommend consulting the Methods sections of the first 3 LD score papers listed under citation in the README for a number of additional comments about choosing SNPs to include in the regression. Cheers, On Feb 29, 2016, at 4:54 PM, Tushar Harishbhai Dave notifications@github.com wrote:
Hello Raymond, I don't see any issue with the munge_sumstats.py runs, but I have pasted the contents of the log files below. Can you please have a quick look and let me know if you see any potential issue? Thanks for your help. ====> Log file for first dataset ==========
Interpreting column names as follows: Reading list of SNPs for allele merge from /local/projects-t2/SIGN/analysis/depression/LDSC/w_hm3.snplist Metadata: Conversion finished at Fri Feb 26 15:53:56 2016 Total time elapsed: 8.95s
====> Log file for second dataset ====
Interpreting column names as follows: Reading list of SNPs for allele merge from /local/projects-t2/SIGN/analysis/depression/LDSC/w_hm3.snplist Metadata: Conversion finished at Sat Feb 27 16:11:20 2016 Total time elapsed: 1.0m:27.45s
I have one question about the second dataset's log file. As you can see, the script did not find 9528559 SNPs in the HapMap3 SNP list file, so it removed them from the analysis. The script also reports removing 9528694 SNPs with MAF <= 0.01. However, the script did not actually remove 9528559 + 9528694 SNPs, so why did it log those two numbers? Can you please elaborate? Thanks for all your help.
Hi Tushar, The log is incorrectly reporting SNPs removed from merge-alleles as also removed for MAF. So the actual number for MAF should be 9528694-9528559=135. Cheers, On Feb 29, 2016, at 7:31 PM, Tushar Harishbhai Dave notifications@github.com wrote:
Thanks Raymond for your efforts. I also thought that 135 SNPs were actually removed as a result of MAF <= 0.01. I would like to ask a question about the 1000 Genomes data. What should I do to get a list of 1000 Genomes SNPs? Should I just download the VCF file for each chromosome, extract only three columns (i.e., ID, REF, ALT), concatenate all the chromosome files into one file, and then use that combined file as input to the --merge-alleles option? Or is there a file available online that already has data for all chromosomes? Thanks.
I’m not aware of a full list of variants provided for 1KG, so extracting from the VCFs is probably the necessary solution. You’ll also want to filter that list of variants for e.g. MAF, indels, and strand-ambiguous SNPs, and you’ll want to make sure your GWAS data are being filtered for imputation quality. Again, the 3 papers have more detailed discussion of this. Also, I should point out that for genetic correlation you’re ultimately going to be restricted to the overlapping SNPs between your two datasets regardless, so I doubt this will have a large impact, since the vast majority of SNPs in your smaller dataset are already passing the HapMap3 list. Cheers, On Mar 1, 2016, at 11:32 AM, Tushar Harishbhai Dave notifications@github.com wrote:
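A rough sketch of the per-chromosome extraction discussed above, including the strand-ambiguity filter Raymond mentions (file paths, output format, and the simple line parsing are assumptions; real 1000 Genomes VCFs are far larger, and multi-allelic records or indels are simply dropped by the base check here):

```python
import gzip

# A/T and C/G pairs are strand-ambiguous and are typically dropped.
AMBIGUOUS = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}
BASES = {"A", "C", "G", "T"}

def extract_snps(vcf_path, out_handle):
    """Write 'ID REF ALT' for biallelic, non-ambiguous SNPs in one gzipped VCF."""
    with gzip.open(vcf_path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            fields = line.rstrip("\n").split("\t")
            snp_id, ref, alt = fields[2], fields[3], fields[4]
            # Single-base REF and ALT excludes indels and multi-allelic
            # records (whose ALT contains commas or multiple bases).
            if ref in BASES and alt in BASES and (ref, alt) not in AMBIGUOUS:
                out_handle.write(f"{snp_id}\t{ref}\t{alt}\n")

# Usage sketch (hypothetical file names): loop over per-chromosome
# files and concatenate into one --merge-alleles input.
# with open("1kg_snplist.txt", "w") as out:
#     out.write("SNP\tA1\tA2\n")
#     for chrom in range(1, 23):
#         extract_snps(f"chr{chrom}.vcf.gz", out)
```

An MAF filter would additionally need allele frequencies (e.g. from the VCF INFO field), which this sketch does not parse.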
I agree with you about the overlapping SNPs point. No matter what reference dataset I use to compare the alleles, LD score regression will be performed only on the overlapping SNPs. Thanks for all your help. Best,
Hello Raymond, Just curious, can I use meta-analysis results as summary statistics? Or is ldsc restricted to GWAS results? |
Yep, genome-wide meta-analysis results are perfectly suitable for ldsc. On Mar 1, 2016, at 1:26 PM, Tushar Harishbhai Dave notifications@github.com wrote:
Thanks for your prompt response.
Hi Tushar, The 1000G variants we used to compute LD scores are in the 1000G plink bim Best, Hilary On Tue, Mar 1, 2016 at 1:30 PM, Tushar Harishbhai Dave <
Hello,
After successfully installing ldsc on my system, I tried to run the munge_sumstats.py script to format my input files into ldsc-format files. I thought this would be a smooth process, but it's not. While running munge_sumstats.py, I am hitting a fatal error about the median value of beta. For reference, I have pasted the script and the fatal error below:
SCRIPT:
python munge_sumstats.py \
  --sumstats inFile.txt \
  --N 18759 \
  --out out \
  --merge-alleles hapmap.txt
ERROR:
ERROR converting summary statistics:
Traceback (most recent call last):
File "/local/projects-t2/SIGN/packages/ldsc/munge_sumstats.py", line 654, in munge_sumstats
check_median(dat.SIGNED_SUMSTAT, signed_sumstat_null, 0.1, sign_cname))
File "/local/projects-t2/SIGN/packages/ldsc/munge_sumstats.py", line 366, in check_median
raise ValueError(msg.format(F=name, M=expected_median, V=round(m, 2)))
ValueError: WARNING: median value of beta is -3.6 (should be close to 0). This column may be mislabeled.
After this, I suspected that the negative beta values in my input file might be causing the error, so I converted all negative values to positive and ran the same script. Unfortunately, the same error occurred (with the sign of the median flipped). For reference, I have pasted the error below:
ERROR converting summary statistics:
Traceback (most recent call last):
File "/local/projects-t2/SIGN/packages/ldsc/munge_sumstats.py", line 654, in munge_sumstats
check_median(dat.SIGNED_SUMSTAT, signed_sumstat_null, 0.1, sign_cname))
File "/local/projects-t2/SIGN/packages/ldsc/munge_sumstats.py", line 366, in check_median
raise ValueError(msg.format(F=name, M=expected_median, V=round(m, 2)))
ValueError: WARNING: median value of beta is 3.6 (should be close to 0). This column may be mislabeled.
I hope that I have explained my issue clearly enough.
Can someone explain why I am getting this error?
Thanks.
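For readers hitting the same error: the check in the traceback compares the median of the signed-statistic column against its null value (0 for beta or log-odds, 1 for odds ratios). A toy re-implementation of that idea (not the actual munge_sumstats.py code; the data and names below are invented):

```python
import numpy as np
import pandas as pd

# Simulated signed-effect columns to illustrate the sanity check.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "BETA": rng.normal(0.0, 0.05, 1000),            # true betas: median ~ 0
    "OR": np.exp(rng.normal(0.0, 0.05, 1000)),      # odds ratios: median ~ 1
})

def check_median(series, expected, tol=0.1):
    """The median of a signed statistic should sit near its null value
    (0 for beta/log-OR, 1 for OR); a large deviation suggests the column
    is mislabeled or on the wrong scale."""
    m = series.median()
    if abs(m - expected) > tol:
        raise ValueError(f"median of {series.name} is {m:.2f}, "
                         f"expected ~{expected}")
    return m

check_median(df["BETA"], 0)   # passes: betas centered at 0
check_median(df["OR"], 1)     # passes: ORs centered at 1
# check_median(df["OR"], 0)   # would raise: OR mislabeled as beta
```

Note that taking absolute values, as tried above, cannot fix this check: it only flips the median from -3.6 to +3.6, still far from 0. A median of that magnitude suggests the column is not a per-SNP effect size at all (mislabeled, or on some other scale), which is worth verifying against the file's documentation.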