
[BUG] parse nucleotides error in GWAS summary file #57

Closed
wavefancy opened this issue Mar 1, 2019 · 7 comments
@wavefancy

Hi Bjarni,

There seems to be a nucleotide parsing error in the latest version (1.0) of LDpred. With the exact same inputs, the previous (Python 2) version reported no nucleotide non-matching errors, but the Python 3 version reports non-matching errors for about 1/4 of the nucleotides. I then checked the h5 coord file and found the records are coded wrongly in this file.

For example:
I have a record like the one below in my GWAS summary:

cat data/refined.summary.chr20.txt | body grep -i 9997251
hg19chrc        pos     MarkerName      Allele1 Allele2 Freq.Allele1    Beta    SE      p N
chr20   9997251 20:9997251      A       C       0.8043  -0.0008113259   0.0062338585    0.8964   1131035

The A/C alleles were coded as T/C in the coord file.

h5dump -d /sum_stats/chrom_20/sids data/coord.file.chr20 | grep -i 20:9997251
   (27994): "20:9997251\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",

h5dump -d /sum_stats/chrom_20/nts data/coord.file.chr20 | grep -i  '(27994'
   (27994,0): "T", "C",

And there are many errors of this type.

Please have a check!

Best regards
Wallace

@bvilhjal
Owner

bvilhjal commented Mar 4, 2019

Hi,

I'm not sure this is an error, as currently the nucleotide encoding followed is the one provided in the validation genotypes. Hence, if a nucleotide in the summary stats appears to be the complement of the one in the validation genotypes, the summary stats nucleotide is changed to its complement to match the validation genotypes.

Could that explain what you're seeing?

Best,
Bjarni
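[For readers following along: a minimal sketch of the complement-matching behaviour Bjarni describes above. This is illustrative only, not LDpred's actual code; the function name and tuple representation are assumptions.]

```python
# Sketch of complement-matching during coordination (hypothetical, not
# LDpred's real implementation): summary-stats alleles are re-coded to
# their complements when that makes them agree with the validation
# genotypes, so the coord file follows the validation encoding.

COMPLEMENT = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

def match_alleles(ss_nts, val_nts):
    """Return summary-stats alleles re-coded to match the validation
    genotypes, or None if the two cannot be reconciled."""
    if ss_nts == val_nts:
        return ss_nts                      # already identical
    flipped = tuple(COMPLEMENT[n] for n in ss_nts)
    if flipped == val_nts:
        return flipped                     # strand flip: store complement
    return None                            # genuinely non-matching
```

Note that a full strand flip complements both alleles (A/C becomes T/G); a record where only one allele is complemented, like the A/C-to-T/C case reported above, would not be produced by this logic, which is what suggests a bug elsewhere.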

@bvilhjal bvilhjal closed this as completed Mar 4, 2019
@wavefancy
Author

wavefancy commented Mar 4, 2019 via email

@bvilhjal
Owner

bvilhjal commented Mar 4, 2019

Hi,

I went through the coordination code, refactored it, and made minor changes that I believe should improve performance.

I haven't found the bug yet, though.

Best,
Bjarni

@bvilhjal bvilhjal reopened this Mar 4, 2019
@wavefancy
Author

Hi Bjarni,

This is actually a real bug. It happens when the input GWAS summary has extreme p-values (0 or 1). When this condition occurs, the program skips adding the nucleotides and the p-value-derived beta for that record, which shifts the alleles and betas of every following record onto the wrong entry.

The code that causes this bug is in "sum_stats_parsers.py":

                pval_read = float(l[header_dict[pval]])
                chrom_dict[chrom]['ps'].append(pval_read)
                if isinf(stats.norm.ppf(pval_read)):
                     invalid_p += 1
                     continue

If I move this check to the very beginning (line 166), so that a record with an extreme p-value is skipped before any of its information is loaded, the problem is fixed. [Currently only half of the record's info is loaded, which corrupts the arrays.]

                pval_read = float(l[header_dict[pval]])
                if isinf(stats.norm.ppf(pval_read)):
                    invalid_p += 1
                    continue
                chrom_dict[chrom]['ps'].append(pval_read)
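[A self-contained illustration of why the order matters, with hypothetical records and a simple 0-or-1 check standing in for isinf(stats.norm.ppf(p)): appending to 'ps' before the continue leaves the parallel lists out of step.]

```python
# Hypothetical records: (snp_id, p_value, alleles)
records = [('rs1', 0.5, ('A', 'G')),
           ('rs2', 0.0, ('C', 'T')),   # extreme p-value: must be skipped
           ('rs3', 0.1, ('A', 'C'))]

def parse(records, check_first):
    ps, sids, nts = [], [], []
    for sid, p, alleles in records:
        if check_first and (p <= 0 or p >= 1):
            continue                    # fixed order: skip record entirely
        ps.append(p)
        if not check_first and (p <= 0 or p >= 1):
            continue                    # buggy order: 'ps' already grew
        sids.append(sid)
        nts.append(alleles)
    return ps, sids, nts

# With the buggy order, len(ps) != len(sids): downstream code pairs
# rs3's alleles with rs2's slot -- the shifted-record symptom above.
```

Running both orders on the same records shows the buggy order producing 3 p-values but only 2 SNP IDs, while the fixed order keeps all three lists the same length.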

Sorry, this code is based on the Feb 27, 2019 version, not the one you revised recently. I changed the code structure a little (adding one extra step that only computes the LD score) so that it can be run in parallel, chromosome by chromosome. Here!

I would be glad to work with you to merge this into your repo so that LDpred can run in parallel chromosome by chromosome, if you are also happy with that. I think this is a great feature people will like, as it saves a lot of memory and also waiting time.

Best regards
Wallace

@bvilhjal
Owner

bvilhjal commented Mar 6, 2019

Thanks a lot Wallace! That's really helpful. I really only went carefully through the coordination code, and not the sum stats parsing code. But in hindsight, it makes lots of sense that the bug is in the sum stats parsing code, because I recently changed it a lot. I'll commit a fix today (or tomorrow).

Regarding parallelising for multiple chromosomes, it's a good idea. I would like to add that feature, but as you know everything takes time. If you submit a merge request (that is not too difficult to merge), then that would be tremendous help. Otherwise, I'll probably get to that in the coming weeks.

Finally, Wallace, it's great to work together on this, and thanks a lot for these contributions, I really appreciate them!

Best,
Bjarni

@bvilhjal bvilhjal closed this as completed Mar 6, 2019
@bvilhjal bvilhjal reopened this Mar 6, 2019
@wavefancy
Author

Hi Bjarni,

Here is my version of the code running in parallel, based on your code from Feb 27, 2019, with a few bugs I found fixed. You may want to try merging it into your repo; here is a guide: Merging 2 Different Git Repositories Without Losing Your History.

Another bug I found yesterday concerns loading a GWAS summary with a user-specified SNP-level sample size (--ncol).
Changing header_dict[ncol] to l[header_dict[ncol]] in the sum_stats_parsers.py file fixes the problem [already fixed in my repo]. Currently the code treats the column index number as the sample size.
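[A hypothetical illustration of this --ncol bug, with made-up header and data: header_dict maps column names to positions, so header_dict[ncol] is an index, not the per-SNP sample size.]

```python
# Made-up summary-stats header and one data line (tab-separated).
header = ['MarkerName', 'Beta', 'P', 'N']
header_dict = {name: i for i, name in enumerate(header)}

line = '20:9997251\t-0.0008\t0.8964\t1131035'
l = line.split('\t')

ncol = 'N'
wrong_n = header_dict[ncol]          # 3: the column index, not a sample size
right_n = int(l[header_dict[ncol]])  # 1131035: the actual per-SNP N
```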

Best regards
Wallace

bvilhjal added a commit that referenced this issue Mar 6, 2019
Fixed bug, and included an summary report for gibbs.
bvilhjal added a commit that referenced this issue Mar 6, 2019
Fixed bug
@bvilhjal
Owner

bvilhjal commented Mar 6, 2019

I believe I just fixed this issue now.

@bvilhjal bvilhjal closed this as completed Mar 6, 2019