Error when writing a VCF file to Parquet #1810

Closed
ffinfo opened this Issue Nov 22, 2017 · 7 comments

@ffinfo
Contributor

ffinfo commented Nov 22, 2017

I already mentioned this issue on Gitter, but here it is again, now including the code, the VCF file, and the stack trace.

For this I have used:

  • Scala 2.11.12
  • Spark 2.2.0
  • Adam-core 0.22.0 and 0.23.0-SNAPSHOT

Code

import org.apache.spark.{SparkConf, SparkContext}
import org.bdgenomics.adam.rdd.ADAMContext._

object Main {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("test").setMaster("local[1]")
    val sc: SparkContext = SparkContext.getOrCreate(conf)

    // Load the VCF and save the genotypes as Parquet; the lazy VCF-to-ADAM
    // conversion runs during the write, which is where the cast fails.
    val vcf = sc.loadVcf("/Users/pjvanthof/src/biopet-root/tools/vcfstats/src/test/resources/multi.vcf")
    vcf.toGenotypes().saveAsParquet("/Users/pjvanthof/test/vcf.adam")

    sc.stop()
  }
}

VCF file

##fileformat=VCFv4.2
##INFO=<ID=DP,Number=.,Type=String,Description="Depth">
##INFO=<ID=EMPTY,Number=.,Type=String,Description="Depth">
##FORMAT=<ID=GT,Number=.,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=.,Type=Integer,Description="Genotype">
##FORMAT=<ID=EMPTY,Number=.,Type=Integer,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Sample_1	Sample_2	Sample_3
chrQ	1	.	A	T	.	.	DP=1	GT:DP	0/1:5	0/1:1	0/1:5
chrQ	2	.	A	T	.	.	DP=1,2,1	GT:DP	0/1:5	0/0:5	0/0:5
chrQ	3	.	A	T	.	.	DP=2	GT:DP	0/1:1	0/1:5	0/0:5
chrQ	4	.	A	T	.	.	DP=3	GT	0/0	0/0	0/0

Stack trace

org.apache.spark.SparkException: Task failed while writing rows
	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.org$apache$spark$internal$io$SparkHadoopMapReduceWriter$$executeTask(SparkHadoopMapReduceWriter.scala:178)
	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$3.apply(SparkHadoopMapReduceWriter.scala:89)
	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$3.apply(SparkHadoopMapReduceWriter.scala:88)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.Integer
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$org$bdgenomics$adam$converters$VariantContextConverter$$toInt$1.apply(VariantContextConverter.scala:1236)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$org$bdgenomics$adam$converters$VariantContextConverter$$toInt$1.apply(VariantContextConverter.scala:1235)
	at org.bdgenomics.adam.converters.VariantContextConverter.org$bdgenomics$adam$converters$VariantContextConverter$$tryAndCatchStringCast(VariantContextConverter.scala:897)
	at org.bdgenomics.adam.converters.VariantContextConverter.org$bdgenomics$adam$converters$VariantContextConverter$$toInt(VariantContextConverter.scala:1235)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$org$bdgenomics$adam$converters$VariantContextConverter$$lineToVariantContextExtractor$5$$anonfun$apply$53.apply(VariantContextConverter.scala:1418)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$org$bdgenomics$adam$converters$VariantContextConverter$$lineToVariantContextExtractor$5$$anonfun$apply$53.apply(VariantContextConverter.scala:1418)
	at scala.Option.map(Option.scala:146)
	at org.bdgenomics.adam.converters.VariantContextConverter.org$bdgenomics$adam$converters$VariantContextConverter$$variantContextFieldExtractor(VariantContextConverter.scala:1381)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$org$bdgenomics$adam$converters$VariantContextConverter$$lineToVariantContextExtractor$5.apply(VariantContextConverter.scala:1418)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$org$bdgenomics$adam$converters$VariantContextConverter$$lineToVariantContextExtractor$5.apply(VariantContextConverter.scala:1416)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$76.apply(VariantContextConverter.scala:1669)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$76.apply(VariantContextConverter.scala:1669)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
	at org.bdgenomics.adam.converters.VariantContextConverter.org$bdgenomics$adam$converters$VariantContextConverter$$convert$1(VariantContextConverter.scala:1669)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$makeVariantFormatFn$1.apply(VariantContextConverter.scala:1683)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$makeVariantFormatFn$1.apply(VariantContextConverter.scala:1683)
	at org.bdgenomics.adam.converters.VariantContextConverter.convert(VariantContextConverter.scala:303)
	at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVcf$1$$anonfun$apply$14.apply(ADAMContext.scala:2030)
	at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVcf$1$$anonfun$apply$14.apply(ADAMContext.scala:2030)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$4.apply(SparkHadoopMapReduceWriter.scala:146)
	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$4.apply(SparkHadoopMapReduceWriter.scala:144)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.org$apache$spark$internal$io$SparkHadoopMapReduceWriter$$executeTask(SparkHadoopMapReduceWriter.scala:159)
	... 8 more
@ffinfo


Contributor

ffinfo commented Nov 22, 2017

Something good to know: when using this VCF file instead, it does work:

##fileformat=VCFv4.2
##INFO=<ID=BLA,Number=.,Type=String,Description="Depth">
##INFO=<ID=EMPTY,Number=.,Type=String,Description="Depth">
##FORMAT=<ID=GT,Number=.,Type=String,Description="Genotype">
##FORMAT=<ID=BLA,Number=.,Type=Integer,Description="Genotype">
##FORMAT=<ID=EMPTY,Number=.,Type=Integer,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Sample_1	Sample_2	Sample_3
chrQ	1	.	A	T	.	.	BLA=1	GT:BLA	0/1:5	0/1:1	0/1:5
chrQ	2	.	A	T	.	.	BLA=1,2,1	GT:BLA	0/1:5	0/0:5	0/0:5
chrQ	3	.	A	T	.	.	BLA=2	GT:BLA	0/1:1	0/1:5	0/0:5
chrQ	4	.	A	T	.	.	BLA=3	GT	0/0	0/0	0/0

This means that somewhere in ADAM, DP must always be a single number, while the VCF spec does not force this ;)

@heuermh


Member

heuermh commented Nov 22, 2017

This means that somewhere in ADAM, DP must always be a single number, while the VCF spec does not force this ;)

Well, actually it does. DP is a reserved INFO field, and if you provide a definition in the ##INFO header line for ID=DP other than Number=1,Type=Integer, it will get clobbered by htsjdk.

From VCF 4.2 spec

DP : combined depth across samples, e.g. DP=154

From VCF 4.3 draft spec

Table 1: Reserved INFO fields

Field   Number   Type      Description
DP      1        Integer   Combined depth across samples

https://github.com/samtools/hts-specs/blob/master/VCFv4.3.tex#L350

We follow the 4.3 specification as closely as possible. The VCF INFO reserved key AD (total read depth for each allele) is Number=R,Type=Integer; perhaps that better fits your use case.
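
For instance, header lines that conform to the reserved definitions above would look like this (a sketch; the descriptions are paraphrased from the spec):

##INFO=<ID=DP,Number=1,Type=Integer,Description="Combined depth across samples">
##INFO=<ID=AD,Number=R,Type=Integer,Description="Total read depth for each allele">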

That said, we should catch this on read and provide a useful error message.
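
One hypothetical way to do that, sketched here for illustration (VcfHeaderChecks and checkReservedDp are made up, not ADAM's actual code), is to compare the user-supplied ##INFO line against the reserved definition before htsjdk replaces it:

import htsjdk.variant.vcf.{VCFHeader, VCFHeaderLineCount, VCFHeaderLineType}

object VcfHeaderChecks {
  // Hypothetical validation pass: warn when a user-supplied ##INFO
  // definition for the reserved key DP disagrees with Number=1,Type=Integer,
  // since htsjdk silently swaps in the standard definition and multi-valued
  // DP records then fail to convert (the ClassCastException above).
  def checkReservedDp(header: VCFHeader): Unit = {
    Option(header.getInfoHeaderLine("DP")).foreach { line =>
      val conforming =
        line.getCountType == VCFHeaderLineCount.INTEGER &&
          line.getCount == 1 &&
          line.getType == VCFHeaderLineType.Integer
      if (!conforming) {
        System.err.println(
          "WARNING: ##INFO line for reserved key DP does not match " +
            "Number=1,Type=Integer; htsjdk will repair it on read, and " +
            "multi-valued DP values will fail to convert.")
      }
    }
  }
}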

@heuermh


Member

heuermh commented Nov 22, 2017

Sigh, if only the htsjdk devs had ever heard of a logging framework

if ( needsRepair ) {
  if ( GeneralUtils.DEBUG_MODE_ENABLED ) {
    System.err.println("Repairing standard header line for field ...
  }
  return standard;
}

https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/variant/vcf/VCFStandardHeaderLines.java#L188

System.err.println, that's hot.
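
For comparison, here is roughly what that message looks like routed through a logging framework (a Scala sketch using slf4j; the logger name and wiring are illustrative, not htsjdk's actual code):

import org.slf4j.LoggerFactory

object HeaderRepairLogging {
  // Illustrative only: surface the "repairing" message at warn level so it
  // is visible by default, instead of gating it behind a debug flag.
  private val logger =
    LoggerFactory.getLogger("htsjdk.variant.vcf.VCFStandardHeaderLines")

  def warnRepaired(fieldId: String): Unit =
    logger.warn("Repairing standard header line for field {}", fieldId)
}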

@ffinfo


Contributor

ffinfo commented Nov 23, 2017

I was not yet aware of the 4.3 spec. I think it's a good thing to have this in the spec, but indeed, a bit more explanation in the error message would be nice. I did get confused by the error, as you noticed ;)

@fnothaft


Member

fnothaft commented Jan 9, 2018

Just wanted to check in on this: is there any action needed, e.g. additional logging (either here or upstream) or more docs?

@heuermh


Member

heuermh commented Jan 9, 2018

Will need a bit more digging to see if we can catch the issue before htsjdk silently "repairs" it for us.

@fnothaft fnothaft added the wontfix label Mar 7, 2018

@fnothaft


Member

fnothaft commented Mar 7, 2018

I don't think there's any more action to take on this, so I'm closing it as wontfix until htsjdk moves to VCF 4.3.

@fnothaft fnothaft closed this Mar 7, 2018

@heuermh heuermh added this to the 0.24.0 milestone Mar 7, 2018
