Exception thrown in VariantContextConverter.formatAllelicDepth despite SILENT validation stringency #1695

Closed
heuermh opened this Issue Aug 29, 2017 · 1 comment

heuermh commented Aug 29, 2017

$ wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.2/GRCh38/supplementaryFiles/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_all.vcf.gz

$ dsh-bio remap-phase-set \
  -i HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_all.vcf.gz \
  -o HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_all.fixed-phase-set.vcf.gz

$ adam-shell

import org.bdgenomics.adam.rdd.ADAMContext._

val variants = sc.loadVariants("HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_all.fixed-phase-set.vcf.gz")

val filtered = variants.transform(rdd => { rdd.filter(_.getFiltersPassed()) })

filtered.toVariantContextRDD().saveAsVcf("HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_all.fixed-phase.set.adam-filtered.vcf", asSingleFile = true, deferMerging = false, disableFastConcat = false, stringency = htsjdk.samtools.ValidationStringency.SILENT)

...
java.lang.ArrayIndexOutOfBoundsException: 3
	at org.bdgenomics.adam.converters.VariantContextConverter.formatAllelicDepth(VariantContextConverter.scala:779)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$40.apply(VariantContextConverter.scala:926)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$40.apply(VariantContextConverter.scala:926)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$77$$anonfun$apply$79.apply(VariantContextConverter.scala:1666)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$77$$anonfun$apply$79.apply(VariantContextConverter.scala:1666)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$78.apply(VariantContextConverter.scala:1668)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$78.apply(VariantContextConverter.scala:1668)
	at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
	at scala.collection.immutable.List.foldLeft(List.scala:84)
	at org.bdgenomics.adam.converters.VariantContextConverter.org$bdgenomics$adam$converters$VariantContextConverter$$convert$2(VariantContextConverter.scala:1668)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$makeGenotypeFormatFn$1.apply(VariantContextConverter.scala:1708)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$makeGenotypeFormatFn$1.apply(VariantContextConverter.scala:1708)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$convert$3$$anonfun$6.apply(VariantContextConverter.scala:335)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$convert$3$$anonfun$6.apply(VariantContextConverter.scala:334)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$convert$3.apply(VariantContextConverter.scala:334)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$convert$3.apply(VariantContextConverter.scala:324)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at org.bdgenomics.adam.converters.VariantContextConverter.convert(VariantContextConverter.scala:324)
	at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVcf$1$$anonfun$apply$14.apply(ADAMContext.scala:2017)
	at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVcf$1$$anonfun$apply$14.apply(ADAMContext.scala:2017)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$4.apply(SparkHadoopMapReduceWriter.scala:146)
	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$4.apply(SparkHadoopMapReduceWriter.scala:144)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.org$apache$spark$internal$io$SparkHadoopMapReduceWriter$$executeTask(SparkHadoopMapReduceWriter.scala:159)
	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$3.apply(SparkHadoopMapReduceWriter.scala:89)
	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$3.apply(SparkHadoopMapReduceWriter.scala:88)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

heuermh commented Aug 30, 2017

Unfortunately, until the genotype refactor in bdg-formats is complete, the handling of VCF INFO fields and VCF FORMAT fields differs slightly.

Read depth (AD) is a Number=R, Type=Integer VCF INFO field:

  private[converters] def formatReadDepth(
    vc: HtsjdkVariantContext,
    vab: VariantAnnotation.Builder,
    v: Variant,
    index: Int): VariantAnnotation.Builder = {

    val ad = vc.getAttributeAsList("AD")
    if (ad.size > (index + 1)) {
      vab.setReferenceReadDepth(toInt(ad.get(0)))
      vab.setReadDepth(toInt(ad.get(index + 1)))
    }
    vab
  }

Note the (index + 1) check against the size of the array in the HtsjdkVariantContext: if the size is not correct, the field value(s) are simply not set. Perhaps an exception should be thrown here instead, and/or the supplied validation stringency should be respected.
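As a sketch only, a stringency-aware version might look something like the following. The extra stringency parameter is an assumption, not the current signature, and log.warn stands in for whatever logging facility is in scope; toInt is the same helper used above.

  // Hypothetical stringency-aware variant of formatReadDepth.
  // Assumes: import htsjdk.samtools.ValidationStringency
  private[converters] def formatReadDepth(
    vc: HtsjdkVariantContext,
    vab: VariantAnnotation.Builder,
    v: Variant,
    index: Int,
    stringency: ValidationStringency): VariantAnnotation.Builder = {

    val ad = vc.getAttributeAsList("AD")
    if (ad.size > (index + 1)) {
      vab.setReferenceReadDepth(toInt(ad.get(0)))
      vab.setReadDepth(toInt(ad.get(index + 1)))
    } else if (stringency == ValidationStringency.STRICT) {
      // fail loudly when the INFO AD array is too short for this allele
      throw new IllegalArgumentException(
        s"INFO AD field of size ${ad.size} is too short for allele index $index at ${vc.getContig}:${vc.getStart}")
    } else if (stringency == ValidationStringency.LENIENT) {
      // warn and drop the field; SILENT falls through and drops it quietly
      log.warn(s"Dropping INFO AD field of size ${ad.size} at ${vc.getContig}:${vc.getStart}")
    }
    vab
  }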

Allelic depth (AD) is a Number=R, Type=Integer VCF FORMAT field:

  private[converters] def formatAllelicDepth(g: HtsjdkGenotype,
                                             gb: Genotype.Builder,
                                             gIdx: Int,
                                             gIndices: Array[Int]): Genotype.Builder = {

    // AD is an array type field
    if (g.hasAD) {
      val ad = g.getAD
      gb.setReferenceReadDepth(ad(0))
        .setAlternateReadDepth(ad(gIdx))
    } else {
      gb
    }
  }

There is no check here that the array from the HtsjdkGenotype is the correct length, so we get the ArrayIndexOutOfBoundsException. And since we split multi-allelic variant contexts, Number=R VCF FORMAT fields may simply not make sense in those cases. Should we add an index check similar to the one above?
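For comparison, a minimal sketch of such a guard, mirroring the INFO-field check above; this is just one possible shape for the fix, not necessarily what should be merged:

  // Sketch: formatAllelicDepth with a bounds check analogous to the one in
  // formatReadDepth; skips the field instead of throwing.
  private[converters] def formatAllelicDepth(g: HtsjdkGenotype,
                                             gb: Genotype.Builder,
                                             gIdx: Int,
                                             gIndices: Array[Int]): Genotype.Builder = {

    // AD is an array type field
    if (g.hasAD) {
      val ad = g.getAD
      if (ad.length > gIdx) {
        // array is long enough for both the reference and this alternate allele
        gb.setReferenceReadDepth(ad(0))
          .setAlternateReadDepth(ad(gIdx))
      } else {
        // too short, e.g. after splitting a multi-allelic site; leave the field unset
        gb
      }
    } else {
      gb
    }
  }

Either way, this path should probably also respect the supplied validation stringency, as suggested for the INFO case above.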
