gVCF - can't load multi-allelic sites #1202

Closed
jpdna opened this Issue Oct 8, 2016 · 3 comments

Member

jpdna commented Oct 8, 2016

I'm trying to input the file:
http://bioinformaticstools.mayo.edu/research/wp-content/plugins/download.php?url=https://s3-us-west-2.amazonaws.com/mayo-bic-tools/variant_miner/gvcfs/NA12878.chr22.g.vcf.gz

I believe I have solved the SB tag problem from #1199.

However, on the VCF row below I now get the following error:

chr22   18027817        .       CTTTT   C,CT,CTT,CTTT,<NON_REF> 1364.73 .       DP=78;MLEAC=0,0,1,1,0;MLEAF=0.00,0.00,0.500,0.500,0.00;MQ=70.00;MQ0=0   GT:AD:DP:GQ:PL:SB       3/4:0,4,6,28,17,0:55:99:1402,1244,2007,972,1382,1297,371,355,301,301,792,617,486,0,653,1038,1016,883,315,597,935:0,0,14,3

Note that the row above is clearly not a reference block; it is a called multi-allelic site within the gVCF file.
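
To check that reading of the row, here is a minimal Python sketch (not ADAM code) that parses the line with plain string handling. It assumes a reference block is a row whose only ALT allele is `<NON_REF>`, and it uses the diploid PL ordering from the VCF specification. The helper names are made up for illustration.

```python
def parse_gvcf_row(line):
    """Split one single-sample VCF data line into the fields discussed above."""
    fields = line.split()  # the pasted row is whitespace-separated
    chrom, pos, _id, ref, alt = fields[:5]
    fmt_keys = fields[8].split(":")
    sample = dict(zip(fmt_keys, fields[9].split(":")))
    alleles = [ref] + alt.split(",")  # index 0 is REF, 1..n are ALT
    return chrom, int(pos), alleles, sample


def is_reference_block(alleles):
    # Assumption: a reference block lists <NON_REF> as its only ALT allele.
    return alleles[1:] == ["<NON_REF>"]


def pl_index(j, k):
    # Diploid genotype-likelihood ordering from the VCF spec, for j <= k.
    return k * (k + 1) // 2 + j


ROW = ("chr22 18027817 . CTTTT C,CT,CTT,CTTT,<NON_REF> 1364.73 . "
       "DP=78;MLEAC=0,0,1,1,0;MLEAF=0.00,0.00,0.500,0.500,0.00;MQ=70.00;MQ0=0 "
       "GT:AD:DP:GQ:PL:SB "
       "3/4:0,4,6,28,17,0:55:99:1402,1244,2007,972,1382,1297,371,355,301,301,"
       "792,617,486,0,653,1038,1016,883,315,597,935:0,0,14,3")

chrom, pos, alleles, sample = parse_gvcf_row(ROW)
gt = sorted(int(i) for i in sample["GT"].split("/"))
called = [alleles[i] for i in gt]
pls = [int(p) for p in sample["PL"].split(",")]

print(is_reference_block(alleles))  # False
print(called)                       # ['CTT', 'CTTT']
print(pls[pl_index(*gt)])           # 0
```

For this row the sketch reports that it is not a reference block, that GT 3/4 selects CTT and CTTT, and that the PL entry for that genotype is 0, which is consistent with a called multi-allelic site.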

 x2.rdd.count
2016-10-08 13:04:26 ERROR Executor:95 - Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException: Multi-allelic site with non-ref symbolic allele[VC Unknown @ chr22:18027817-18027821 Q1364.73 of type=MIXED alleles=[CTTTT*, <NON_REF>, C, CT, CTT, CTTT] attr={DP=78, MLEAC=[0, 0, 1, 1, 0], MLEAF=[0.00, 0.00, 0.500, 0.500, 0.00], MQ=70.00, MQ0=0} GT=GT:AD:DP:GQ:PL:SB 3/4:0,4,6,28,17,0:55:99:1402,1244,2007,972,1382,1297,371,355,301,301,792,617,486,0,653,1038,1016,883,315,597,935:0,0,14,3
    at org.bdgenomics.adam.converters.VariantContextConverter.convert(VariantContextConverter.scala:214)
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVcf$1.apply(ADAMContext.scala:823)
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVcf$1.apply(ADAMContext.scala:823)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1595)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1157)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1157)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2016-10-08 13:04:26 WARN  TaskSetManager:70 - Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Multi-allelic site with non-ref symbolic allele[VC Unknown @ chr22:18027817-18027821 Q1364.73 of type=MIXED alleles=[CTTTT*, <NON_REF>, C, CT, CTT, CTTT] attr={DP=78, MLEAC=[0, 0, 1, 1, 0], MLEAF=[0.00, 0.00, 0.500, 0.500, 0.00], MQ=70.00, MQ0=0} GT=GT:AD:DP:GQ:PL:SB    3/4:0,4,6,28,17,0:55:99:1402,1244,2007,972,1382,1297,371,355,301,301,792,617,486,0,653,1038,1016,883,315,597,935:0,0,14,3
    at org.bdgenomics.adam.converters.VariantContextConverter.convert(VariantContextConverter.scala:214)
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVcf$1.apply(ADAMContext.scala:823)
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVcf$1.apply(ADAMContext.scala:823)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.util.Utils$.getIterat

The exception is thrown from:
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/converters/VariantContextConverter.scala#L214

Member

fnothaft commented Oct 8, 2016

I'm working on a patch for all the gVCF stuff that should hopefully land mid next week. Can you make a subset of that VCF that contains the header plus the lines implicated here and in #1199, and open a PR with tests that reproduce the two failures? I can then pull them into my patch to test it.
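
One way to build such a subset, sketched in Python under the assumption that the input is the gzip-compressed gVCF linked above and that matching on CHROM and POS is enough to pick out the offending rows. The function names and file names are hypothetical, and the site from #1199 is not listed in this issue, so it would need to be added to `sites`.

```python
import gzip


def subset_vcf(lines, sites):
    """Yield header lines plus data lines whose (CHROM, POS) is in `sites`."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep every meta-information and column header line
            continue
        chrom, pos = line.split("\t", 2)[:2]
        if (chrom, int(pos)) in sites:
            yield line


def subset_gvcf_file(src, dst, sites):
    # src is a gzip-compressed gVCF; dst is written as plain text.
    with gzip.open(src, "rt") as fin, open(dst, "w") as fout:
        fout.writelines(subset_vcf(fin, sites))


# Hypothetical usage; only the site reported in this issue is known here.
# subset_gvcf_file("NA12878.chr22.g.vcf.gz", "gvcf_multiallelic.g.vcf",
#                  {("chr22", 18027817)})
```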

Member

jpdna commented Oct 11, 2016

Created PR #1205 to demonstrate the multi-allelic site issue for gVCF.

Member

jpdna commented Oct 16, 2016

Just pinging on this to make sure the problem is described sufficiently above and by #1205.
