
[BUG] parse nucleotides error in GWAS summary file #57

Closed
wavefancy opened this issue Mar 1, 2019 · 7 comments
@wavefancy

Hi Bjarni,

There seems to be a nucleotide parsing error in the latest version (1.0) of LDpred. With the exact same inputs, the previous (Python 2) version reported no nucleotide non-matching errors, but the Python 3 version reports non-matching errors for about 1/4 of the nucleotides. I then checked the h5 coord file and found the records are coded wrongly in this file.

For example:
I have a record like the one below in my GWAS summary:

cat data/refined.summary.chr20.txt | body grep -i 9997251
hg19chrc        pos     MarkerName      Allele1 Allele2 Freq.Allele1    Beta    SE      p N
chr20   9997251 20:9997251      A       C       0.8043  -0.0008113259   0.0062338585    0.8964   1131035

The A/C alleles were coded as T/C in the coord file.

h5dump -d /sum_stats/chrom_20/sids data/coord.file.chr20 | grep -i 20:9997251
   (27994): "20:9997251\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",

h5dump -d /sum_stats/chrom_20/nts data/coord.file.chr20 | grep -i  '(27994'
   (27994,0): "T", "C",

And there are many errors of this type.

Please have a check!

Best regards
Wallace

@bvilhjal
Owner

bvilhjal commented Mar 4, 2019

Hi,

I'm not sure this is an error, as currently the nucleotide encoding followed is the one provided in the validation genotypes. Hence, if a nucleotide in the summary stats appears to be the complement of the one in the validation genotypes, the summary stats nucleotide is changed to its complement to match the validation genotypes.

Could that explain what you're seeing?

Best,
Bjarni
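[For readers following along: a minimal sketch of the complement-matching behaviour Bjarni describes above. This is illustrative only, not LDpred's actual code; the function name and tuple representation are assumptions.]

```python
# Sketch of complement-matching during coordination (hypothetical, not
# LDpred's real implementation): summary-stats alleles are re-coded to
# their complements when that makes them agree with the validation
# genotypes, so the coord file follows the validation encoding.

COMPLEMENT = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

def match_alleles(ss_nts, val_nts):
    """Return summary-stats alleles re-coded to match the validation
    genotypes, or None if the two cannot be reconciled."""
    if ss_nts == val_nts:
        return ss_nts                      # already identical
    flipped = tuple(COMPLEMENT[n] for n in ss_nts)
    if flipped == val_nts:
        return flipped                     # strand flip: store complement
    return None                            # genuinely non-matching
```

Note that a full strand flip complements both alleles (A/C becomes T/G); a record where only one allele is complemented, like the A/C-to-T/C case reported above, would not be produced by this logic, which is what suggests a bug elsewhere.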

@bvilhjal bvilhjal closed this as completed Mar 4, 2019
@wavefancy
Author

wavefancy commented Mar 4, 2019 via email

@bvilhjal
Owner

bvilhjal commented Mar 4, 2019

Hi,

I went through the coordination code, refactored it, and made minor changes that I believe should improve performance.

I haven't found the bug yet, though.

Best,
Bjarni

@bvilhjal bvilhjal reopened this Mar 4, 2019
@wavefancy
Author

Hi Bjarni,

This is actually a real bug. It happens when the input GWAS summary has extreme p-values (0 or 1). When this condition occurs, the program skips adding the nucleotides and the p-value-derived beta for that record, which shifts the alleles and betas of every following record onto the wrong entry.

The code that causes this bug is in "sum_stats_parsers.py":

                pval_read = float(l[header_dict[pval]])
                chrom_dict[chrom]['ps'].append(pval_read)
                if isinf(stats.norm.ppf(pval_read)):
                     invalid_p += 1
                     continue

If I move this check to the very beginning (line 166), so that a record with an extreme p-value is skipped before any of its information is loaded, the problem is fixed. [Currently only half of the record's info is loaded, which corrupts the arrays.]

                pval_read = float(l[header_dict[pval]])
                if isinf(stats.norm.ppf(pval_read)):
                    invalid_p += 1
                    continue
                chrom_dict[chrom]['ps'].append(pval_read)
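[A self-contained illustration of why the order matters, with hypothetical records and a simple 0-or-1 check standing in for isinf(stats.norm.ppf(p)): appending to 'ps' before the continue leaves the parallel lists out of step.]

```python
# Hypothetical records: (snp_id, p_value, alleles)
records = [('rs1', 0.5, ('A', 'G')),
           ('rs2', 0.0, ('C', 'T')),   # extreme p-value: must be skipped
           ('rs3', 0.1, ('A', 'C'))]

def parse(records, check_first):
    ps, sids, nts = [], [], []
    for sid, p, alleles in records:
        if check_first and (p <= 0 or p >= 1):
            continue                    # fixed order: skip record entirely
        ps.append(p)
        if not check_first and (p <= 0 or p >= 1):
            continue                    # buggy order: 'ps' already grew
        sids.append(sid)
        nts.append(alleles)
    return ps, sids, nts

# With the buggy order, len(ps) != len(sids): downstream code pairs
# rs3's alleles with rs2's slot -- the shifted-record symptom above.
```

Running both orders on the same records shows the buggy order producing 3 p-values but only 2 SNP IDs, while the fixed order keeps all three lists the same length.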

Sorry, this code is based on the Feb 27, 2019 version, not the one you revised recently. I changed the code structure a little (adding one extra step that only computes the LD score) so that it can be run in parallel, chromosome by chromosome. Here!

I would be glad to work with you to merge this into your repo so that LDpred can run in parallel chromosome by chromosome, if you are also happy with that. I think this is a great feature people will like, as it saves a lot of memory and also waiting time.

Best regards
Wallace

@bvilhjal
Owner

bvilhjal commented Mar 6, 2019

Thanks a lot Wallace! That's really helpful. I really only went carefully through the coordination code, and not the sum stats parsing code. But in hindsight, it makes lots of sense that the bug is in the sum stats parsing code, because I recently changed it a lot. I'll commit a fix today (or tomorrow).

Regarding parallelising for multiple chromosomes, it's a good idea. I would like to add that feature, but as you know everything takes time. If you submit a merge request (that is not too difficult to merge), then that would be tremendous help. Otherwise, I'll probably get to that in the coming weeks.

Finally, Wallace, it's great to work together on this, and thanks a lot for these contributions, I really appreciate them!

Best,
Bjarni

@bvilhjal bvilhjal closed this as completed Mar 6, 2019
@bvilhjal bvilhjal reopened this Mar 6, 2019
@wavefancy
Author

Hi Bjarni,

Here is my version of the code running in parallel, based on your code from Feb 27, 2019, with a few bugs I found fixed. You may want to try merging it into your repo; here is a guide: Merging 2 Different Git Repositories Without Losing Your History.

Another bug I found yesterday concerns loading a GWAS summary with a user-specified SNP-level sample size (--ncol).
Changing header_dict[ncol] to l[header_dict[ncol]] in the sum_stats_parsers.py file fixes the problem [already fixed in my repo]. Currently the code treats the column index number as the sample size.
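[A hypothetical illustration of this --ncol bug, with made-up header and data: header_dict maps column names to positions, so header_dict[ncol] is an index, not the per-SNP sample size.]

```python
# Made-up summary-stats header and one data line (tab-separated).
header = ['MarkerName', 'Beta', 'P', 'N']
header_dict = {name: i for i, name in enumerate(header)}

line = '20:9997251\t-0.0008\t0.8964\t1131035'
l = line.split('\t')

ncol = 'N'
wrong_n = header_dict[ncol]          # 3: the column index, not a sample size
right_n = int(l[header_dict[ncol]])  # 1131035: the actual per-SNP N
```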

Best regards
Wallace

bvilhjal added a commit that referenced this issue Mar 6, 2019
Fixed bug, and included an summary report for gibbs.
bvilhjal added a commit that referenced this issue Mar 6, 2019
Fixed bug
@bvilhjal
Owner

bvilhjal commented Mar 6, 2019

I believe I just fixed this issue now.

@bvilhjal bvilhjal closed this as completed Mar 6, 2019