region extraction pipeline failing in allele flip handling #132

Open
dianacornejo opened this issue Jan 12, 2022 · 7 comments

Comments

@dianacornejo
Contributor

Hi @changebio I'm going to bring the discussion here so that I don't forget about it.

I ran the region extraction pipeline on the merged exome and imputed data from the UKB and then proceeded to run the fine-mapping analysis. I noticed that something is wrong with how allele flips are being handled.
Here is one example on chr22:

22	50549676	A	G	chr22:50549676:A:G	-0.122039	0.0200183	1.3205048480570123e-09
22	50549676	G	A	chr22:50549676:G:A	0.123389	0.0200053	8.481838520726142e-10

This led to odd results in the fine-mapping analysis. As you can see, the same variant (with flipped alleles) is being used for fine mapping twice, giving two different PIP results:

chr22.50549067.A.AG 0.667913905147005
chr22.50549676.A.G 0.399442941057422
chr22.50549676.G.A 0.60547973816813

Any thoughts on how to solve this issue are appreciated!

Thank you
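One way to spot this kind of record before fine mapping is to group sumstats rows by an allele-order-independent key. The sketch below is only an illustration, not the pipeline's code: the column meanings (chromosome, position, two alleles, variant ID, beta, SE, p-value) are assumed from the two rows quoted above, the function names are hypothetical, and strand flips (complemented alleles) are not handled.

```python
from collections import defaultdict

# Rows copied from the chr22 example above. The column order
# (chrom, pos, allele, allele, id, beta, se, p) is an assumption.
rows = [
    ("22", 50549676, "A", "G", "chr22:50549676:A:G",
     -0.122039, 0.0200183, 1.3205048480570123e-09),
    ("22", 50549676, "G", "A", "chr22:50549676:G:A",
     0.123389, 0.0200053, 8.481838520726142e-10),
]


def allele_key(chrom, pos, first, second):
    """Build a key that is identical for A/G and G/A at one position."""
    return (str(chrom), int(pos), tuple(sorted((first, second))))


def find_flipped_duplicates(records):
    """Return the groups of records that share a key (more than one member)."""
    groups = defaultdict(list)
    for rec in records:
        groups[allele_key(*rec[:4])].append(rec)
    return {key: members for key, members in groups.items() if len(members) > 1}


dups = find_flipped_duplicates(rows)
for key, members in dups.items():
    # Print the position key and the IDs of the records that collide on it.
    print(key, [rec[4] for rec in members])
```

Any group reported here contains records that collapse to the same variant once allele order is ignored, so it is a candidate for being resolved before the data reach fine mapping.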

@changebio
Contributor

@dianacornejo I have a question about the example. Does it mean that there are duplicated SNPs (one of them flipped) in the regenie sumstats?

@dianacornejo
Contributor Author

@changebio Yes, in the sumstats. I think what's happening is that one record comes from the exome data and the other from the imputed data. In this particular case the exome and imputed data have exactly the same sample size, unlike before, when the imputed data had more samples.

@changebio
Contributor

changebio commented Jan 17, 2022

@dianacornejo I figured out why you got the error. The genotype data contain duplicated SNPs with a0 and a1 swapped, which my function does not account for. I am fixing it.
[Screenshot: Screen Shot 2022-01-17 at 12 08 17 PM]

@changebio
Copy link
Contributor

@gaow How should we deal with SNPs that become the "same" after swapping a0 and a1 but have different betas in the sumstats? That suggests they are not just a0/a1 swaps but also have different genotypes.
[Screenshot: Screen Shot 2022-01-17 at 12 28 14 PM]

@gaow
Contributor

gaow commented Jan 17, 2022

@changebio I'm confused about the context in which this is being discussed. The data you show above look perfectly normal to me if they come from two different association tests.

In @dianacornejo's original example, the two variants in question are said to have the same sample size (and possibly almost the same samples), which seems plausible to me because the summary stats differ only slightly. In @changebio's last comment, the variants have different summary stats because they are association tests on different data. That is not a duplicate; it looks like a merge artifact to me. I thought that when merging the imputed and sequence data, if two records are the same variant (after flipping as necessary), we keep whichever has the larger sample size and simply drop the other?
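The merge rule described in this comment could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline's implementation: the `Stat` record, its `n` and `source` fields, and the sample sizes in the usage lines are all made up for the example, and keeping the first record seen on a tie is an arbitrary choice that the thread does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stat:
    """One summary-statistics record (hypothetical layout)."""
    chrom: str
    pos: int
    a0: str
    a1: str
    beta: float
    n: int       # sample size behind this test (assumed to be available)
    source: str  # e.g. "exome" or "imputed"


def variant_key(stat):
    """Key that ignores allele order, so A/G and G/A are the same variant."""
    return (stat.chrom, stat.pos, tuple(sorted((stat.a0, stat.a1))))


def merge_keep_larger_n(stats):
    """Keep one record per variant: the one with the larger sample size.

    On a tie the record seen first is kept (an arbitrary assumption).
    """
    best = {}
    for stat in stats:
        key = variant_key(stat)
        if key not in best or stat.n > best[key].n:
            best[key] = stat
    return list(best.values())


# Betas are from the chr22 example; the sample sizes are invented.
exome = Stat("22", 50549676, "A", "G", -0.122039, n=1000, source="exome")
imputed = Stat("22", 50549676, "G", "A", 0.123389, n=1200, source="imputed")
merged = merge_keep_larger_n([exome, imputed])
print(merged)
```

When both sources have the same sample size, as in the case reported in this issue, the tie-break decides which record survives, so that choice would need to be made deliberately.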

@gaow
Copy link
Contributor

gaow commented Jan 17, 2022

@dianacornejo It turns out that what he found and what you reported are both legitimate issues, and they are separate problems. @changebio told me offline that he ran into the second issue while investigating your initial one. Moving forward, @changebio will fix the issue you reported, but we need your help with the issue he observed above. I'm going to open a ticket in the UKB repo since the problem is dataset-specific. @changebio will also formalize this into a QC step on sumstats to help catch the issue before summary stats are merged.
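A QC check along these lines might separate the two situations discussed in this thread: flipped records whose effects agree once the alleles are aligned, and flipped records whose effects genuinely differ. The sketch below is a rough heuristic, not the QC that was actually implemented. The function name and the `z_max` threshold are hypothetical, and the z-score treats the two estimates as independent, which does not hold when the samples overlap.

```python
import math


def compare_flipped_pair(rec_a, rec_b, z_max=3.0):
    """Compare two records for the same position after aligning alleles.

    Each record is (a0, a1, beta, se). If rec_b has the alleles swapped
    relative to rec_a, its beta is negated before the comparison.
    Returns a label and the z-score of the difference in betas.
    """
    a0, a1, beta_a, se_a = rec_a
    b0, b1, beta_b, se_b = rec_b
    if (b0, b1) == (a1, a0):
        beta_b = -beta_b  # swapped alleles: the effect changes sign
    elif (b0, b1) != (a0, a1):
        raise ValueError("records do not describe the same pair of alleles")
    # Rough z-score; assumes independent estimates (a simplification).
    z = (beta_a - beta_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    label = "consistent" if abs(z) <= z_max else "discordant"
    return label, z


# The two chr22 records from the original report.
label, z = compare_flipped_pair(
    ("A", "G", -0.122039, 0.0200183),
    ("G", "A", 0.123389, 0.0200053),
)
print(label, z)
```

Under this heuristic the chr22 pair from the original report comes out as consistent, which matches the reading that it is one variant reported twice. A pair flagged as discordant would instead point to the second problem, where the swapped records carry different genotypes.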

@dianacornejo
Contributor Author

@gaow Yes, I can see that the second issue is different from the first one I reported, because what he shows occurs only in the exome data, not in the merged data where I found the first problem. Sure, if there is anything I can do, just let me know.
