Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mixed Diploid / Haploid calls result in incorrect zygosity call #227

Closed
davmlaw opened this issue Nov 29, 2021 · 14 comments
Closed

Mixed Diploid / Haploid calls result in incorrect zygosity call #227

davmlaw opened this issue Nov 29, 2021 · 14 comments

Comments

@davmlaw
Copy link
Contributor

davmlaw commented Nov 29, 2021

DRAGEN 05.021.604.3.7.7 produced a VCF with a genotype call of "1" on chrX for a male. This seems valid - the VCF spec says "Haploid calls, e.g. on Y, male non-pseudoautosomal X, or mitochondrion, are indicated by having only one allele value"

CyVCF2 doesn't seem to handle this if the VCF row contains a mix of diploid / haploid calls (eg chrX record joint called with a Male/Female)

Attached a minimal version to reproduce below:

test.vcf.gz

print(f"{metadata.version('cyvcf2')=}")                                                                                                                  
for v in Reader("test.vcf.gz", strict_gt=True):  # Strict=False has similar issue but makes it HET rather than unknown
    print(v) 
    print(f"{v.num_hom_ref=} / {v.num_het=} / {v.num_unknown=} / {v.num_hom_alt=} ")
    print(f"{v.ploidy=}, PL sample records= {len(v.format('PL')[0])}")

Outputs:

metadata.version('cyvcf2')='0.30.12'

Both Haploid works - recognises 1 as HOM ALT

NC_000023.11	8466747	.	A	C	.	PASS	.	GT:PL	.:.	1:36,0

v.num_hom_ref=0 / v.num_het=0 / v.num_unknown=1 / v.num_hom_alt=1 
v.ploidy=1, PL sample records= 2

Both Diploid works - explicitly include the missing genotype call:

NC_000023.11	8466747	.	A	C	.	PASS	.	GT:PL	./.:.	./1:36,0,40

v.num_hom_ref=0 / v.num_het=0 / v.num_unknown=2 / v.num_hom_alt=0 
v.ploidy=2, PL sample records= 3

One Diploid / One Haploid is wrong - it treats the haploid as missing a zygosity call - but it also ploidy and PL width (usually ploidy+1) diverge:

NC_000023.11	8466747	.	A	T	.	PASS	.	GT:PL	./.:.	1:36,0

v.num_hom_ref=0 / v.num_het=0 / v.num_unknown=2 / v.num_hom_alt=0 
v.ploidy=2, PL sample records= 2

Expected result: GT=1 should be consistently read as HOM ALT, regardless of other genotypes on the line

@brentp
Copy link
Owner

brentp commented Nov 29, 2021

yes, a lot of the features like gt_alts, and num_{het,hom_ref,hom_alt} assume diploid. I'd be happy to accept a PR to fix this.

@grahamgower
Copy link
Collaborator

I just got bitten by this too. It looks like a lot of things are broken for mixed ploidy. htslib/vcf.h has this to say about bcf_get_genotypes():

Use the returned number of written values for accessing valid entries of dst, as ndst is only a
watermark that can be higher than the returned value, i.e. the end of dst can contain carry-over
values from previous calls to bcf_get_format_*() on lines with more values per sample.

However, cyvcf uses "ndst" instead of the returned value here:

cyvcf2/cyvcf2/cyvcf2.pyx

Lines 1354 to 1356 in 036b58f

if bcf_get_genotypes(self.vcf.hdr, self.b, &gts, &ndst) <= 0:
raise Exception("couldn't get genotypes for variant")
return newGenotypes(gts, int(ndst/self.vcf.n_samples), self.vcf.n_samples)

and here:
ngts = bcf_get_genotypes(self.vcf.hdr, self.b, &self._gt_types, &ndst)

Curiously, thiis means Variant.genotypes is doing the right thing, but not Variant.genotype.array().

I guess it would be easy to fix these cases. But at first glance it seems that fixing gt_types (and dependents) is a lot more complicated and might need backwards-incompatible changes.

@brentp
Copy link
Owner

brentp commented Dec 1, 2021

@grahamgower , thanks for pointing this out. I'll make these changes now and see if I can get Variant.genotype.array() fixed as well.

brentp added a commit that referenced this issue Dec 1, 2021
@brentp
Copy link
Owner

brentp commented Dec 1, 2021

with the changes recommended by @grahamgower along with a fix in the helper function and these genotypes shown in original comment:

. 1
./. ./1
./. 1

I now get:

 [0, 0, 1, 1],
 [0, 1, 1, 0], # (with `strict_gt=True`, this becomes [0, 0, 2, 0])
 [0, 0, 1, 1]

for [v.num_hom_ref, v.num_het, v.num_unknown, v.num_hom_alt]

I have pushed that fix and will make a release soon. Would be good if either/both of you could verify this matches what you expect.

I'd also reocmmend to use the Gentoypes object as it should be pretty fast and you can calculate this type of thing fairly well. I'm also open to adding methods (in cython) to that for speed and ease-of-use.

@brentp
Copy link
Owner

brentp commented Dec 1, 2021

btw, I left this open in case you uncover more issues. Given test-cases as above, I think they can be fixed.

@grahamgower
Copy link
Collaborator

Thanks a bunch @brentp! So to be clear, which features should I avoid if I want to support ploidy>2 in my application?

@brentp
Copy link
Owner

brentp commented Dec 1, 2021

for ploidy > 2, avoid:

  1. variant.gt_types, variant.gt_bases, and all the variant.gt_* properties (except gt_quals should be ok).
    instead, use e.g. variant.format('AD') and get ref and alt depths from that.

  2. num_het, num_hom_ref, num_*, aaf, num_called
    instead, use variant.genotype or variant.genotypes with preference for the former.

The gt_ and num_ attributes like those were added since cyvcf2 was (as the name implies) copying the API of cyvcf which was in cython, but did not use htslib.
I would replace most of those with simpler, fewer methods on the result of variant.genotype, and can do so if there is a need.

@grahamgower
Copy link
Collaborator

Perfect, thanks again @brentp. And to answer your earlier question...

Would be good if either/both of you could verify this matches what you expect.

Yes, this matches my expectations.

@davmlaw
Copy link
Contributor Author

davmlaw commented Dec 2, 2021

Hi, thanks for the very quick fix, still an issue I think:

3 GTs of "0/1 0 0/1" (strict_gt=True) gives me:

In [9]: v.num_hom_ref                                                                                                                                                                             
Out[9]: 0

In [10]: v.genotype                                                                                                                                                                               
Out[10]: 
[[ 0  1  0]
 [ 0 -2  1]
 [ 0  1  0]]

In [11]: v.genotypes                                                                                                                                                                              
Out[11]: [[0, 1, False], [0, True], [0, 1, False]]

Expected num_hom_ref = 1

@davmlaw
Copy link
Contributor Author

davmlaw commented Dec 2, 2021

PS hello @grahamgower long time no see!

@grahamgower
Copy link
Collaborator

Hey @davmlaw! Sorry, I didn't even notice it was your issue! I see you're still going strong with variantgrid. :)

brentp added a commit that referenced this issue Dec 2, 2021
@brentp
Copy link
Owner

brentp commented Dec 2, 2021

Thanks for following up. I just pushed 8c41b9c so that the variant with gentoypes:

0/1     0       0/1

and code:

    print(v.num_hom_ref, v.num_het, v.num_unknown, v.num_hom_alt)
    print(v.genotype)
    print(v.genotypes)

now gives:

1 2 0 0
[[ 0  1  0]
 [ 0 -2  1]
 [ 0  1  0]]
[[0, 1, False], [0, True], [0, 1, False]]

note that -2 is a fill value for v.genotype indicating the end of the variant since this is a numpy array and all rows must be same length.
Let me know if you see any other problems.

@davmlaw
Copy link
Contributor Author

davmlaw commented Dec 2, 2021

Hi, works for me, just ran it through the VCF I had problems with before and everything was as I expected.

Happy to close the issue if you are

@brentp brentp closed this as completed in 2addcaf Dec 2, 2021
@brentp
Copy link
Owner

brentp commented Dec 2, 2021

This is fixed in v0.30.13. Thanks for reporting and providing the test-cases!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants