Support looser validation stringency for loading some VCF Integer fields #1213

Closed
fnothaft opened this Issue Oct 17, 2016 · 7 comments

@fnothaft
Member

fnothaft commented Oct 17, 2016

As reported on Gitter, a user ran into a problem loading a VCF file into ADAM:

Hello everybody (again). I am running the vcf2adam option to convert a VCF file to ADAM, and I get a java.lang.NumberFormatException that stops the conversion.
I am trying to debug the error, and I have found the error happens while trying to convert a FORMAT KEY Value of the VCF header from String to Int, while the FORMAT KEY Value is a Float.
I think I have already located the error at line 199 of the "VariantAnnotationConverter.scala" file, because the PQ (phaseQuality) is a Float, but the function applied to convert the string is "attrAsInt" instead of "attrAsFloat".

As per the VCF spec, our implementation is correct:

PQ : phasing quality, the phred-scaled probability that alleles are ordered incorrectly in a heterozygote (against all other members in the phase set). We note that we have not yet included the specific measure for precisely defining “phasing quality”; our intention for now is simply to reserve the PQ tag for future use as a measure of phasing quality. (Integer)

However, it would be useful to have a relaxed validation stringency that would allow us to load float values into an integer field.

@heuermh

Member

heuermh commented Oct 17, 2016

-0

If a VCF file uses a reserved key incorrectly for the version of the VCF spec that the file declares, then I don't believe we should attempt to work around the problem.

The fix is to move the Type=Float-defined PQ annotation in the file to a non-reserved key and notify the offending upstream tool or data provider that they are not adhering to the specification.

That said, VCF is poorly specified and reserved keys change meanings, types, and cardinalities between versions. If we (or htsjdk) are not validating against specific VCF versions, then that should be addressed. There is also the problem of htsjdk not yet supporting VCF version 4.3.
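
As a rough illustration of the fix described above (moving a mistyped reserved key to a non-reserved one before loading), here is a minimal Python sketch. It is not ADAM or htsjdk code; the function name `rename_format_key` and the replacement key `PQF` are hypothetical, and it assumes a plain-text, tab-delimited VCF where FORMAT is the ninth column.

```python
import re


def rename_format_key(lines, old_key="PQ", new_key="PQF"):
    """Yield VCF lines with a FORMAT key renamed in the header and in records.

    `new_key` ("PQF") is a made-up non-reserved key used only for illustration.
    """
    # Matches the ID of a ##FORMAT header line, e.g. ##FORMAT=<ID=PQ,...>
    header_id = re.compile(r"^(##FORMAT=<ID=)" + re.escape(old_key) + r"(?=[,>])")
    for line in lines:
        if line.startswith("##"):
            # Meta-information line: only the matching FORMAT declaration changes.
            yield header_id.sub(lambda m: m.group(1) + new_key, line)
        elif line.startswith("#"):
            # Column header line (#CHROM ...): left untouched.
            yield line
        else:
            newline = "\n" if line.endswith("\n") else ""
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 8:
                # Column 9 (index 8) holds the colon-separated FORMAT keys.
                keys = fields[8].split(":")
                fields[8] = ":".join(new_key if k == old_key else k for k in keys)
            yield "\t".join(fields) + newline
```

The sample values themselves are not touched, so the Float data survives under the new key and the reserved PQ key is simply no longer declared in the file.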

@fnothaft fnothaft referenced this issue in bigdatagenomics/bdg-formats Nov 7, 2016

Open

Refactor Genotype and GenotypeAnnotation #108

@heuermh

Member

heuermh commented Nov 7, 2016

Thinking about this further, I would support using relaxed validation stringency to continue processing on NumberFormatExceptions and similar, logging warnings on LENIENT. I don't think converting incorrectly typed user data on load is a good idea.
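
A minimal sketch of that behaviour, in Python rather than ADAM's Scala: the `Stringency` enum and `parse_int_field` below are hypothetical names that only mirror the STRICT/LENIENT/SILENT levels of htsjdk's ValidationStringency. A malformed value is never coerced; it is rejected, skipped with a warning, or skipped quietly.

```python
import enum
import logging

log = logging.getLogger("vcf_stringency_sketch")


class Stringency(enum.Enum):
    """Hypothetical stand-in for a validation stringency setting."""
    STRICT = "strict"
    LENIENT = "lenient"
    SILENT = "silent"


def parse_int_field(key, raw, stringency):
    """Parse an Integer-typed field, or return None if it is malformed.

    STRICT re-raises the parse error, LENIENT logs a warning and continues,
    SILENT continues without logging. The raw value is never converted.
    """
    try:
        return int(raw)
    except ValueError:
        if stringency is Stringency.STRICT:
            raise
        if stringency is Stringency.LENIENT:
            log.warning("Skipping %s=%r: not a valid Integer", key, raw)
        return None
```

Under this sketch, `parse_int_field("PQ", "12.5", Stringency.LENIENT)` leaves the field unset and logs a warning, while the same call with `Stringency.STRICT` stops the load.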

@fnothaft

Member

fnothaft commented Nov 7, 2016

That sounds reasonable. Would there be any value to storing "bad" keys in the attribute map, e.g., if you give us "PQ"->"Fish", we store it in the attribute map as "ERROR_PQ"->"Fish", or something like that? I'm thinking that would allow the user to go and correct the error programmatically after loading their data in, but I'm not sure if that would actually be useful or if it would just make everything worse.
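
To make the "ERROR_PQ" idea concrete, here is a small Python sketch under stated assumptions: `typed` and `attributes` are plain dicts standing in for the typed fields and the attribute map, and the function names and the `ERROR_` prefix handling are hypothetical, not an existing ADAM API.

```python
def load_int_attribute(key, raw, typed, attributes, error_prefix="ERROR_"):
    """Store a valid Integer in `typed`; park a malformed value in `attributes`."""
    try:
        typed[key] = int(raw)
    except ValueError:
        # Keep the raw string under a prefixed key so it is not silently lost.
        attributes[error_prefix + key] = raw


def repair_attribute(key, fix, typed, attributes, error_prefix="ERROR_"):
    """Apply a user-supplied `fix` to a parked value and promote it to `typed`."""
    raw = attributes.pop(error_prefix + key, None)
    if raw is not None:
        typed[key] = fix(raw)


typed, attributes = {}, {}
load_int_attribute("PQ", "12.7", typed, attributes)
# attributes now holds {"ERROR_PQ": "12.7"} and typed is still empty.
repair_attribute("PQ", lambda s: round(float(s)), typed, attributes)
```

The second step is where the user decides how to correct the data, which keeps any lossy conversion out of the loader itself.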

As an aside,

There is also the problem of htsjdk not yet supporting VCF version 4.3.

I didn't realize that. What is it missing/is there an issue to track that?

@heuermh

Member

heuermh commented Nov 7, 2016

There is also the problem of htsjdk not yet supporting VCF version 4.3.

I didn't realize that. What is it missing/is there an issue to track that?

samtools/htsjdk#694, which according to their review party issue 548 is "on hold until resources are found to finish PR"

@fnothaft

Member

fnothaft commented Nov 7, 2016

OK, looks like they're most of the way through. I imagine that VCF4.3 will make it in reasonably soon.

@heuermh

Member

heuermh commented Nov 9, 2016

An implementation-level question: if, say, an individual VCF INFO value fails to validate, should the stringency check that decides what to do happen at the field level, or at the higher Variant or VariantAnnotation level? In other words, does the try/catch that checks stringency wrap each field's parse and set, or the entire block of fields?
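
The two options can be contrasted with a short Python sketch. Everything here is illustrative: the function names are hypothetical, and the `parsers` table (INFO key to parse function) is an assumption rather than how ADAM's converter is structured.

```python
def convert_per_field(raw_fields, parsers):
    """Wrap each field's parse separately: one bad field drops only itself."""
    converted, failed = {}, []
    for key, raw in raw_fields.items():
        try:
            converted[key] = parsers[key](raw)
        except ValueError:
            failed.append(key)  # the stringency check would happen here, per field
    return converted, failed


def convert_per_block(raw_fields, parsers):
    """Wrap the whole block: one bad field discards every field in the block."""
    try:
        return {key: parsers[key](raw) for key, raw in raw_fields.items()}
    except ValueError:
        return None  # the stringency check would happen here, once per record


raw_info = {"DP": "14", "AN": "2.0", "MQ0": "0"}  # AN is mistyped as a float
int_parsers = {"DP": int, "AN": int, "MQ0": int}

per_field = convert_per_field(raw_info, int_parsers)
per_block = convert_per_block(raw_info, int_parsers)
```

With the per-field wrapper the valid DP and MQ0 values survive and only AN is reported; with the per-block wrapper the whole annotation is lost, but the handling code exists in one place.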

@fnothaft fnothaft added the wontfix label Mar 3, 2017

@fnothaft

Member

fnothaft commented Mar 3, 2017

In the VCF refactor, we decided to not support this. Closing.

@fnothaft fnothaft closed this Mar 3, 2017
