IndexOutOfBounds thrown when saving gVCF with no likelihoods #1673

Closed
fnothaft opened this Issue Aug 18, 2017 · 1 comment

fnothaft commented Aug 18, 2017

When saving a blocked gVCF where the non-reference blocks do not have likelihoods attached, we get the following error:

Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
	at htsjdk.variant.vcf.VCFEncoder.addGenotypeData(VCFEncoder.java:286)
	at htsjdk.variant.vcf.VCFEncoder.encode(VCFEncoder.java:136)
	at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:222)
	at org.seqdoop.hadoop_bam.VCFRecordWriter.writeRecord(VCFRecordWriter.java:140)
	at org.seqdoop.hadoop_bam.KeyIgnoringVCFRecordWriter.write(KeyIgnoringVCFRecordWriter.java:61)
	at org.seqdoop.hadoop_bam.KeyIgnoringVCFRecordWriter.write(KeyIgnoringVCFRecordWriter.java:38)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1125)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1131)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

This traces back to code in htsjdk that doesn't check whether an array-type FORMAT field is empty before writing it out; it indexes directly into the 0th element, which isn't great. That said, the root cause on our side is that the conditional that checks whether we are at a gVCF record when the genotypeLikelihoods field is unset is wrong, and sets the PL on the genotype builder to an empty array.
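For reference, a minimal sketch of the guard that would avoid this (the helper name and structure are illustrative, not ADAM's actual converter code): when the record carries no likelihoods, call noPL() on htsjdk's GenotypeBuilder instead of passing an empty array, so VCFEncoder never indexes into a zero-length PL field.

```scala
import htsjdk.variant.variantcontext.GenotypeBuilder

// Hypothetical helper: only set PL when likelihoods are actually present.
// An empty array must never reach the builder, because htsjdk's
// VCFEncoder.addGenotypeData indexes element 0 unconditionally.
def setGenotypeLikelihoods(builder: GenotypeBuilder,
                           genotypeLikelihoods: Seq[Double]): GenotypeBuilder = {
  if (genotypeLikelihoods.isEmpty) {
    builder.noPL() // omit the PL field for likelihood-free gVCF reference blocks
  } else {
    builder.PL(genotypeLikelihoods.toArray) // log-scaled likelihoods converted to PLs
  }
}
```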

@fnothaft fnothaft added the bug label Aug 18, 2017

@fnothaft fnothaft added this to the 0.23.0 milestone Aug 18, 2017

@fnothaft fnothaft self-assigned this Aug 18, 2017

heuermh commented Aug 18, 2017

Fixed by #1674

@heuermh heuermh closed this Aug 18, 2017

@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018
