Investigate failures to load ExAC.0.3.GRCh38.vcf variants #1351

Closed
heuermh opened this issue Jan 16, 2017 · 3 comments

@heuermh
Member

heuermh commented Jan 16, 2017

On git head after merging #1346 and #1348, the next error is:

$ time ./bin/adam-submit \
  vcf2adam -only_variants \
  ExAC.0.3.GRCh38.vcf.gz ExAC.0.3.GRCh38.variants.adam

Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/usr/local/bin/spark-submit
...
2017-01-16 17:18:06 ERROR Utils:91 - Aborting task
htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately
line number 15364: Duplicate allele added to VariantContext: C
	at htsjdk.variant.vcf.AbstractVCFCodec.generateException(AbstractVCFCodec.java:779)
	at htsjdk.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:356)
	at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:279)
	at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:257)
	at org.seqdoop.hadoop_bam.VCFRecordReader.nextKeyValue(VCFRecordReader.java:144)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1123)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1131)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
@fnothaft
Member

This looks like an error in the liftover. Did you lift over yourself or is this an official release?

@heuermh
Member Author

heuermh commented Jan 17, 2017

From Ensembl but not necessarily an official release, http://ftp.ensembl.org/pub/variation_genotype/homo_sapiens/ExAC.0.3.GRCh38.vcf.gz

I need to write something to go through the file and update the CSQ variant annotations to the ANN specification, so when I do that I'll filter out the invalid data as well.
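
A minimal sketch of what the filtering half of that pass might look like, assuming "invalid data" means records like the one in the stack trace, where an ALT allele repeats REF or another ALT. It does not attempt the CSQ to ANN conversion. The function names are hypothetical, and comparing alleles case-insensitively is an assumption about how the downstream parser treats them.

```python
import gzip


def has_duplicate_alleles(vcf_line):
    """Return True if REF or any ALT allele appears more than once in a VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 5:
        # Not a complete data line; leave it for the real parser to report.
        return False
    ref, alt = fields[3], fields[4]
    alleles = [ref.upper()]
    if alt != ".":
        # ALT is a comma-separated list; "." means no alternate allele.
        alleles.extend(a.upper() for a in alt.split(","))
    return len(set(alleles)) != len(alleles)


def filter_duplicate_alleles(lines):
    """Yield header lines unchanged and only those data lines without duplicate alleles."""
    for line in lines:
        if line.startswith("#") or not has_duplicate_alleles(line):
            yield line


def filter_vcf(src_path, dst_path):
    """Stream a gzipped VCF from src_path to dst_path, dropping duplicate-allele records."""
    with gzip.open(src_path, "rt") as fin, gzip.open(dst_path, "wt") as fout:
        fout.writelines(filter_duplicate_alleles(fin))
```

Something like `filter_vcf("ExAC.0.3.GRCh38.vcf.gz", "ExAC.0.3.GRCh38.filtered.vcf.gz")` would run it over the file. Note the output is plain gzip, not bgzip, so it would need recompressing before indexing.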

@heuermh
Member Author

heuermh commented Apr 19, 2017

Closing as WontFix; that particular file has gone 404 and a newer version of ExAC exists.

@heuermh heuermh closed this as completed Apr 19, 2017
@heuermh heuermh modified the milestone: 0.23.0 Jul 22, 2017