Push validation checks down to INFO/FORMAT fields #1676

Closed
fnothaft opened this Issue Aug 22, 2017 · 12 comments

@fnothaft
Member

fnothaft commented Aug 22, 2017

Currently, we toss any genotype/variant that fails any part of validation. We should just drop the failing INFO/FORMAT tag, if possible.
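
To make the intent concrete, here is a minimal sketch of the field-level approach, assuming a hypothetical helper (tryConvertField is illustrative, not ADAM's actual converter API): a failure while converting a single INFO/FORMAT field drops only that field and keeps the enclosing record.

// Hypothetical sketch; tryConvertField is illustrative, not ADAM's converter API.
def tryConvertField[T](fieldName: String)(convert: => T): Option[T] =
  try {
    Some(convert) // good value: keep the converted field
  } catch {
    case e: IllegalArgumentException =>
      // bad value: warn and drop just this INFO/FORMAT field,
      // keeping the enclosing variant/genotype record
      System.err.println(s"Dropping field $fieldName: ${e.getMessage}")
      None
  }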

@fnothaft fnothaft added the bug label Aug 22, 2017

@fnothaft fnothaft self-assigned this Aug 22, 2017

fnothaft added a commit to fnothaft/adam that referenced this issue Aug 22, 2017

[ADAM-1676] Add more finely grained validation for INFO/FORMAT fields.
Resolves bigdatagenomics#1676. Pushes validation checking down to the single field conversion
level, which keeps a variant/genotype record around even if a bad INFO/FORMAT
field was attached to that record.
@heuermh

Member

heuermh commented Aug 22, 2017

I might need some convincing this is the right thing to do.

For example, in #1512, if the +Inf value is for a mapped field (say Variant.alleleFrequency), we'd be left with a variant with null for that field, which would later be written out as the missing value (.).

@fnothaft

Member

fnothaft commented Aug 22, 2017

The way I think about it: if you have a +Inf value that you can't parse and you then drop the whole record, you've lost all of the data at the variant site for all samples (because we drop the whole offending variant context), just because a single tag (which may or may not have been useful) was bad. E.g., in the context of #1566, this would allow us to keep the variants with the bad phase set IDs; we would just drop the phase set tag on those records.

As an aside, null shouldn't be written out as a . unless there are multiple genotypes at the variant site that have that tag.

@heuermh

Member

heuermh commented Aug 22, 2017

#1566 is different, I think: all the values are bad, and validation chokes on the header line.

@fnothaft

Member

fnothaft commented Aug 22, 2017

With this patch, you can load all of the variants in the GIAB VCFs by setting validation stringency to lenient. You lose the PS fields, but you get all the variants.
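
For reference, a sketch of what the lenient load looks like in adam-shell, assuming loadGenotypes accepts an htsjdk ValidationStringency argument (the exact signature may differ across ADAM versions):

scala> import htsjdk.samtools.ValidationStringency
import htsjdk.samtools.ValidationStringency

scala> val genotypes = sc.loadGenotypes("HG002.small.vcf", stringency = ValidationStringency.LENIENT)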

@heuermh

Member

heuermh commented Aug 22, 2017

You lose the PS fields, but you get all the variants.

What does that look like? Genotype.phased set to false or unset (null)? Genotype.phaseSetId set to null? Is the incorrect header ##FORMAT=<ID=PS,Number=1,Type=String,Description="Phase set for GT"> still present? Can you write back out to VCF or to ADAM format?

@fnothaft

Member

fnothaft commented Aug 22, 2017

Shows as phased=true:

zIRISz:adam fnothaft$ ./bin/adam-submit print HG002.gt/ | grep 817185
2017-08-22 10:11:36 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
{"variant": {"contigName": "chr1", "start": 817185, "end": 817186, "names": [], "referenceAllele": "G", "alternateAllele": "A", "filtersApplied": true, "filtersPassed": true, "filtersFailed": [], "annotation": {"ancestralAllele": null, "alleleCount": null, "readDepth": null, "forwardReadDepth": null, "reverseReadDepth": null, "referenceReadDepth": null, "referenceForwardReadDepth": null, "referenceReverseReadDepth": null, "alleleFrequency": null, "cigar": null, "dbSnp": null, "hapMap2": null, "hapMap3": null, "validated": null, "thousandGenomes": null, "somatic": false, "transcriptEffects": [], "attributes": {"callable": "CS_HiSeqPE300xSentieon_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable,CS_HiSeq250x250Sentieon_callable", "filt": "CS_HiSeqMatePairSentieon_filt", "datasetsmissingcall": "10XChromium,IonExome,SolidSE75bp", "platformnames": "Illumina,CG", "callsetnames": "HiSeqPE300xSentieon,CGnormal,HiSeqPE300xfreebayes,HiSeq250x250Sentieon,HiSeq250x250freebayes,HiSeqMatePairSentieon,HiSeqMatePairfreebayes", "datasets": "4", "callsets": "7", "datasetnames": "HiSeqPE300x,CGnormal,HiSeq250x250,HiSeqMatePair", "platforms": "2"}}}, "contigName": "chr1", "start": 817185, "end": 817186, "variantCallingAnnotations": {"filtersApplied": true, "filtersPassed": true, "filtersFailed": [], "downsampled": null, "baseQRankSum": null, "fisherStrandBiasPValue": null, "rmsMapQ": null, "mapq0Reads": null, "mqRankSum": null, "readPositionRankSum": null, "genotypePriors": [], "genotypePosteriors": [], "vqslod": null, "culprit": null, "attributes": {"IGT": "1/1", "ADALL": "0,362"}}, "sampleId": "HG002", "sampleDescription": null, "processingDescription": null, "alleles": ["ALT", "ALT"], "expectedAlleleDosage": null, "referenceReadDepth": 75, "alternateReadDepth": 603, "readDepth": 1131, "minReadDepth": null, "genotypeQuality": 312, "genotypeLikelihoods": [], "nonReferenceLikelihoods": [], "strandBiasComponents": [], "splitFromMultiAllelic": false, "phased": true, "phaseSetId": null, "phaseQuality": null}
zIRISz:adam fnothaft$ cat ~/IdeaProjects/adam/HG002.small.vcf | grep 817186
chr1	817186	.	G	A	50	PASS	platforms=2;platformnames=Illumina,CG;datasets=4;datasetnames=HiSeqPE300x,CGnormal,HiSeq250x250,HiSeqMatePair;callsets=7;callsetnames=HiSeqPE300xSentieon,CGnormal,HiSeqPE300xfreebayes,HiSeq250x250Sentieon,HiSeq250x250freebayes,HiSeqMatePairSentieon,HiSeqMatePairfreebayes;datasetsmissingcall=10XChromium,IonExome,SolidSE75bp;callable=CS_HiSeqPE300xSentieon_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable,CS_HiSeq250x250Sentieon_callable;filt=CS_HiSeqMatePairSentieon_filt	GT:DP:ADALL:AD:GQ:IGT:IPS:PS	1|1:1131:0,362:75,603:312:1/1:.:PATMAT
@heuermh

Member

heuermh commented Aug 22, 2017

..."phased": true, "phaseSetId": null

If we've kept phased as true and removed the phaseSetId, then according to the VCF spec all genotypes in that file are considered phased and in the same phase set.

"All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set."
https://github.com/samtools/hts-specs/blob/master/VCFv4.3.tex#L473

zIRISz:adam fnothaft$ cat ~/IdeaProjects/adam/HG002.small.vcf | grep 817186
chr1 817186 . G A 50 PASS platforms=2;platfo...

Is this from the input VCF? I would imagine attempting to write as VCF or ADAM format will throw the same htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: Count < 0 for fixed size VCF header field BAD_PS as in #1566.

@heuermh

Member

heuermh commented Aug 22, 2017

For what it's worth, on the PS Type=String issue specifically, this works OK and allows ADAM to read and write after remapping:

$ dsh-bio remap-phase-set \
  -i truth_small_variants.vcf.gz \
  -o truth_small_variants.remapped_phase_sets.vcf.gz

$ adam-shell

scala> val genotypes = sc.loadGenotypes("truth_small_variants.vcf.gz")
java.lang.IllegalArgumentException: Field type for provided header line (FORMAT=<ID=PS,Number=1,Type=String,Description="Phase set for GT">) does not match supported line (FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set ID">)
  at org.bdgenomics.adam.converters.VariantContextConverter$.org$bdgenomics$adam$converters$VariantContextConverter$$auditLine$1(VariantContextConverter.scala:175)
...

scala> val genotypes = sc.loadGenotypes("truth_small_variants.remapped_phase_sets.vcf.gz")
genotypes: org.bdgenomics.adam.rdd.variant.GenotypeRDD =
RDDBoundGenotypeRDD(MapPartitionsRDD[7] at flatMap at VariantContextRDD.scala:183,SequenceDictionary{...

scala> val g = genotypes.rdd.first()
g: org.bdgenomics.formats.avro.Genotype = {"variant": {"contigName": "chr1", "start": 817185, "end": 817186, "names": [], "referenceAllele": "G", "alternateAllele": "A", "filtersApplied": true, "filtersPassed": true, "filtersFailed": [], "annotation": {"ancestralAllele": null, "alleleCount": null, "readDepth": null, "forwardReadDepth": null, "reverseReadDepth": null, "referenceReadDepth": null, "referenceForwardReadDepth": null, "referenceReverseReadDepth": null, "alleleFrequency": null, "cigar": null, "dbSnp": null, "hapMap2": null, "hapMap3": null, "validated": null, "thousandGenomes": null, "somatic": false, "transcriptEffects": [], "attributes": {"callable": "CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable", "filt": "CS_SolidSE75GATKHC_filt", "datas...

scala> g.getPhaseSetId
res1: Integer = 0

scala> g.getPhased
res2: Boolean = true
@fnothaft

Member

fnothaft commented Aug 23, 2017

Yeah, that approach is fine for the GIAB files, which have a single, well-defined validation error and are reasonably small. That said, it's pretty easy to imagine someone putting together a VCF where many lines fail validation due to bad tags, or where multiple tags fail validation. As long as our default is strict validation, I don't think there's any harm in adding this, since the user will get an exception the first time they load the VCF and will have to choose to override the stringency.

@fnothaft

Member

fnothaft commented Aug 23, 2017

I would think of it this way: if we validate at the tag level, users can load data that they know is bad and then write code in ADAM (similar to what you did in dsh-bio) to fix the bad fields. If we don't validate at the tag level, they just lose the records.

As a compromise, I offer this: how about we go from one stringency field to two? We could have a first validation flag for tags (throw/warn/ignore on a bad tag) and a second for records. This way, people can be really permissive (loading records that they know have bad tags, because either they don't care about the tags or they have code that can fix them) while the system stays very conservative by default.
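
A minimal sketch of what the two-level stringency could look like, assuming htsjdk's ValidationStringency for both levels; ConversionStringency and its field names are hypothetical, not an existing ADAM type:

import htsjdk.samtools.ValidationStringency

// Hypothetical container for the two proposed flags.
case class ConversionStringency(
  tagStringency: ValidationStringency,    // bad INFO/FORMAT tags: throw/warn/ignore
  recordStringency: ValidationStringency) // whole variant/genotype records

// Conservative default: any bad tag or record throws.
val strict = ConversionStringency(ValidationStringency.STRICT, ValidationStringency.STRICT)

// Permissive: silently drop bad tags, warn on (but keep) bad records.
val permissive = ConversionStringency(ValidationStringency.SILENT, ValidationStringency.LENIENT)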

@heuermh

Member

heuermh commented Aug 23, 2017

Throwing out either the offending record or the values for an offending tag across all records seems ok to me. Is that what you are suggesting?

@heuermh

Member

heuermh commented Aug 29, 2017

#1695 is an instance of this.

fnothaft added a commit to fnothaft/adam that referenced this issue Sep 12, 2017

[ADAM-1676] Add more finely grained validation for INFO/FORMAT fields.
Resolves bigdatagenomics#1676. Pushes validation checking down to the single field conversion
level, which keeps a variant/genotype record around even if a bad INFO/FORMAT
field was attached to that record.

fnothaft added a commit to fnothaft/adam that referenced this issue Sep 13, 2017

[ADAM-1676] Add more finely grained validation for INFO/FORMAT fields.
Resolves bigdatagenomics#1676. Pushes validation checking down to the single field conversion
level, which keeps a variant/genotype record around even if a bad INFO/FORMAT
field was attached to that record.

heuermh added a commit that referenced this issue Sep 14, 2017

[ADAM-1676] Add more finely grained validation for INFO/FORMAT fields.
Resolves #1676. Pushes validation checking down to the single field conversion
level, which keeps a variant/genotype record around even if a bad INFO/FORMAT
field was attached to that record.

@heuermh heuermh added this to the 0.23.0 milestone Dec 7, 2017

@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018
