New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

giab-NA12878 truth_small_variants.vcf.gz header issues #1566

Closed
heuermh opened this Issue Jun 15, 2017 · 4 comments

Comments

Projects
2 participants
@heuermh
Member

heuermh commented Jun 15, 2017

Two issues found in this validation VCF file: they use Type=String for the PS phase set VCF INFO field, and specify a BAD_PS phase set for GT VCF FORMAT field with Number=-1 cardinality restriction.

$ wget https://s3.amazonaws.com/biodata/collections/hg38/validation/giab-NA12878/truth_small_variants.vcf.gz

$ adam-submit transformVariants \
  truth_small_variants.vcf.gz \
  truth_small_variants.variants.adam

Command body threw exception:
java.lang.IllegalArgumentException: Field type for provided header line (FORMAT=<ID=PS,Number=1,Type=String,Description="Phase set for GT">) does not match supported line (FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set ID">)
Exception in thread "main" java.lang.IllegalArgumentException: Field type for provided header line (FORMAT=<ID=PS,Number=1,Type=String,Description="Phase set for GT">) does not match supported line (FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set ID">)
	at org.bdgenomics.adam.converters.VariantContextConverter$.org$bdgenomics$adam$converters$VariantContextConverter$$auditLine$1(VariantContextConverter.scala:172)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$cleanAndMixInSupportedLines$1$$anonfun$apply$3.apply(VariantContextConverter.scala:191)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$cleanAndMixInSupportedLines$1$$anonfun$apply$3.apply(VariantContextConverter.scala:190)
	at scala.Option.fold(Option.scala:158)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$cleanAndMixInSupportedLines$1.apply(VariantContextConverter.scala:190)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$cleanAndMixInSupportedLines$1.apply(VariantContextConverter.scala:185)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
	at org.bdgenomics.adam.converters.VariantContextConverter$.cleanAndMixInSupportedLines(VariantContextConverter.scala:185)
	at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVcf$1.apply(ADAMContext.scala:1075)
	at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVcf$1.apply(ADAMContext.scala:1060)
	at scala.Option.fold(Option.scala:158)
	at org.apache.spark.rdd.Timer.time(Timer.scala:48)
	at org.bdgenomics.adam.rdd.ADAMContext.loadVcf(ADAMContext.scala:1060)
	at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVariants$1.apply(ADAMContext.scala:1730)
	at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVariants$1.apply(ADAMContext.scala:1728)
	at scala.Option.fold(Option.scala:158)
	at org.apache.spark.rdd.Timer.time(Timer.scala:48)
	at org.bdgenomics.adam.rdd.ADAMContext.loadVariants(ADAMContext.scala:1726)

$ adam-submit transformVariants \
  truth_small_variants.vcf.gz \
  truth_small_variants.variants.adam \
  -stringency LENIENT

2017-06-14 17:29:38 WARN  ADAMContext:175 - Field type for provided header line (FORMAT=<ID=PS,Number=1,Type=String,Description="Phase set for GT">) does not match supported line (FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set ID">)
2017-06-14 17:29:39 WARN  VariantContextConverter:336 - Caught exception java.lang.NumberFormatException: For input string: "PATMAT" when converting [VC Unknown @ chr18:41959868 Q50.00 of type=SNP alleles=[G*, A] attr={callable=[CS_HiSeqPE300xGATK_callable, CS_CGnormal_callable, CS_HiSeqPE300xfreebayes_callable], callsetnames=[HiSeqPE300xGATK, CGnormal, HiSeqPE300xfreebayes, 10XGATKhaplo, SolidSE75GATKHC], callsets=5, datasetnames=[HiSeqPE300x, CGnormal, 10XChromium, SolidSE75bp], datasets=4, datasetsmissingcall=[IonExome, SolidPE50x50bp], platformnames=[Illumina, CG, 10X, Solid], platforms=4} GT=[[INTEGRATION A|G* GQ 1213 DP 885 AD 232,257 {ADALL=193,203, IGT=0/1, PS=PATMAT}]].
2017-06-14 17:29:39 WARN  VariantContextConverter:336 - Caught exception java.lang.NumberFormatException: For input string: "PATMAT" when converting [VC Unknown @ chr10:28062598 Q50.00 of type=SNP alleles=[T*, C] attr={callable=CS_HiSeqPE300xfreebayes_callable, callsetnames=[HiSeqPE300xfreebayes, HiSeqPE300xGATK, 10XGATKhaplo, SolidSE75GATKHC], callsets=4, datasetnames=[HiSeqPE300x, 10XChromium, SolidSE75bp], datasets=3, datasetsmissingcall=[CGnormal, IonExome, SolidPE50x50bp], platformnames=[Illumina, 10X, Solid], platforms=3} GT=[[INTEGRATION C|C GQ 337 DP 353 AD 0,0 {ADALL=0,181, IGT=1/1, PS=PATMAT}]].
...

$ adam-submit transformVariants \
  truth_small_variants.vcf.gz \
  truth_small_variants.variants.adam \
  -stringency SILENT

$ adam-shell

scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> val variants = sc.loadVariants("truth_small_variants.variants.adam")
warning: while trying to read VCF header from file received exception: htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: Count < 0 for fixed size VCF header field BAD_PS
htsjdk.tribble.TribbleException: Input stream does not contain a BCF encoded file; BCF magic header info not found, at record 0 with position 0:
  at htsjdk.variant.bcf2.BCF2Codec.error(BCF2Codec.java:478)
  at htsjdk.variant.bcf2.BCF2Codec.readHeader(BCF2Codec.java:149)
  at org.seqdoop.hadoop_bam.util.VCFHeaderReader.readHeaderFrom(VCFHeaderReader.java:67)
  at org.bdgenomics.adam.rdd.ADAMContext.org$bdgenomics$adam$rdd$ADAMContext$$readVcfHeader(ADAMContext.scala:228)
  at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadHeaderLines$1.apply(ADAMContext.scala:234)
  at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadHeaderLines$1.apply(ADAMContext.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
  at org.bdgenomics.adam.rdd.ADAMContext.loadHeaderLines(ADAMContext.scala:234)
  at org.bdgenomics.adam.rdd.ADAMContext.loadParquetVariants(ADAMContext.scala:1175)
  at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVariants$1.apply(ADAMContext.scala:1733)
  at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVariants$1.apply(ADAMContext.scala:1728)
  at scala.Option.fold(Option.scala:158)
  at org.apache.spark.rdd.Timer.time(Timer.scala:48)
  at org.bdgenomics.adam.rdd.ADAMContext.loadVariants(ADAMContext.scala:1726)
  ... 50 elided

$ cat truth_small_variants.variants.adam/_header 
##fileformat=VCFv4.2
##CL=vcffilter -i filtered-phase-transfer.vcf.gz -o - --javascript "ensureFormatHeader(\"##FORMAT=<ID=PS,Number=1,Type=String,Description=\\\"Phase set for GT\\\">\"); function record() {if(INTEGRATION.GT==\"1/1\") { INTEGRATION.IPS=\".\"; INTEGRATION.PS=\"HOMVAR\"; INTEGRATION.GT=\"1|1\";} else {if((INTEGRATION.GT==\"0/1\" || INTEGRATION.GT==\"1/2\" || INTEGRATION.GT==\"2/1\" || INTEGRATION.GT==\"1/0\") ) {if(INTEGRATION.IPS.length>1) {INTEGRATION.PS=INTEGRATION.IPS; INTEGRATION.GT=INTEGRATION.IGT;} else {INTEGRATION.PS=\".\";};} else { if((INTEGRATION.IPS.length<2)) { INTEGRATION.IPS=\".\";} INTEGRATION.PS=\"PATMAT\";};};}"
##FILTER=<ID=GQlessthan70,Description="Sum of GQ for datasets with this genotype less than 70">
##FILTER=<ID=alleleimbalance,Description="Filtered calls where the net allele balance for unfiltered datasets is <0.2 or >0.8">
##FILTER=<ID=allfilteredanddisagree,Description="All callsets have this call filtered or outside the callable regions and they have discordant genotypes or variant calls">
##FILTER=<ID=allfilteredbutagree,Description="All callsets have this call filtered or outside the callable regions but they have the same genotype">
##FILTER=<ID=cgonly,Description="Filtered calls where only Complete Genomics had this call and it was completely missing from any other callset">
##FILTER=<ID=discordanthet,Description="Filtered calls where a passing call is het and a high GQ but filtered call is hom var, since often the het is wrong">
##FILTER=<ID=discordantunfiltered,Description="Callsets with unfiltered calls have discordant genotypes or variant calls">
##FILTER=<ID=overlappingcall,Description="Filtered sites that are within 50bp of another passing call but none of the callsets that support the 2 calls match">
##FILTER=<ID=questionableindel,Description="Filtered calls where some callsets have a filtered indel larger than 10bp and another dataset has an implied homozygous reference call">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=ADALL,Number=R,Type=Integer,Description="Net allele depths across all datasets">
##FORMAT=<ID=BAD_PS,Number=-1,Type=String,Description="Phase set for GT">
...

@fnothaft fnothaft added the discussion label Jun 22, 2017

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jun 22, 2017

Member

@heuermh is there anything actionable for us with this issue?

Member

fnothaft commented Jun 22, 2017

@heuermh is there anything actionable for us with this issue?

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Jun 22, 2017

Member

For the first error message, failing on Type=String for phase set PS is going to be common.

I was thinking of creating an issue to fix the second error message (tribble complaining about BCF), but that is actually an issue with Hadoop-BAM HadoopGenomics/Hadoop-BAM#132

Member

heuermh commented Jun 22, 2017

For the first error message, failing on Type=String for phase set PS is going to be common.

I was thinking of creating an issue to fix the second error message (tribble complaining about BCF), but that is actually an issue with Hadoop-BAM HadoopGenomics/Hadoop-BAM#132

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jun 22, 2017

Member

Oh, I see. Sorry, I misread this issue as their file had PS=Int in the header but PS=String in the VCF lines. Ignore me.

Member

fnothaft commented Jun 22, 2017

Oh, I see. Sorry, I misread this issue as their file had PS=Int in the header but PS=String in the VCF lines. Ignore me.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh
Member

heuermh commented Aug 22, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment