genotypeType for genotypes with multiple OtherAlt alleles? #897

Closed
NeillGibson opened this Issue Dec 10, 2015 · 8 comments

Comments

Projects
4 participants
@NeillGibson
Contributor

NeillGibson commented Dec 10, 2015

Hi,

What should the GenotypeType be when there are multiple OtherAlt alleles in a Genotype?

Take for instance the following abbreviated VCF record which has a multi-allelic variant for 4
samples of a diploid species:

CHROM POS REF ALT Sample1 Sample2 Sample3 Sample4
1 100 A C,G,T A/C G/G G/T A,G

More than two alternative alleles won't happen that often for a snp but will happen often for mnp's and indels.

If I understand the Adam genotype schema correctly then this would lead to 12 genotype parquet records. One record for each combination of alternative allele and sample.

The first set of genotype / parquet record for each sample would have the following allele combinations for each sample.

Sample Alleles GenotypeAllele GenotypeType
1 A/C Ref / Alt HET
2 G/G OtherAlt / OtherAlt HOM_ALT
3 G/T OtherAlt / OtherAlt HET
4 A/G Ref / OtherAlt HET

But how can I determine the correct GenotypeType for sample 2 and 3 if I don't see the actual alleles but just two OtherAlt alleles? The actual allele information is as far as I know not available when processing the first set of genotype / parquet record (that are from the perspective of the first alternative allele) . This because the real allele information is in the other genotype parquet records...

The current getType method in RichGenotype would return NO_CALL as the GenotypeType for sample 2, 3 and 4 when processing the first set of genotype / parquet record.
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rich/RichGenotype.scala#L38

Is this correct? Or is there a way to determine if the OtherAlt alleles are the same or different?
Or should there be a special GenotypeType to handle cases of two OtherAlt alleles`?

I am trying to write a replacement for the getType method that handles the OtherAlt alleles and polyploid species.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft May 18, 2016

Member

I am thinking that we punt this to a later release than 0.20.0. Any comments?

Member

fnothaft commented May 18, 2016

I am thinking that we punt this to a later release than 0.20.0. Any comments?

@fnothaft fnothaft modified the milestones: 0.21.0, 0.20.0 Jun 3, 2016

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jun 3, 2016

Member

Punting to 0.21.0.

Member

fnothaft commented Jun 3, 2016

Punting to 0.21.0.

@heuermh heuermh modified the milestones: 0.21.0, 0.22.0 Oct 13, 2016

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Mar 3, 2017

Member

This will be resolved by #577. Closing in favor of that ticket.

Member

fnothaft commented Mar 3, 2017

This will be resolved by #577. Closing in favor of that ticket.

@fnothaft fnothaft closed this Mar 3, 2017

@heuermh heuermh added this to Completed in Release 0.23.0 Mar 8, 2017

@devin-petersohn

This comment has been minimized.

Show comment
Hide comment
@devin-petersohn

devin-petersohn Jun 13, 2017

Member

This was meant to be resolved by b1fce67, but was not. I am still having an issue when loading in multi-allelic variants. Here is an example line I am having issue with:
#CHROM POS ID REF ALT QUAL...
1 14522 . G A,C 195.16...

Here is the Stacktrace:
java.lang.ArrayIndexOutOfBoundsException: 1
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$org$bdgenomics$adam$converters$VariantContextConverter$$fromArrayExtractor$4.apply(VariantContextConverter.scala:1196)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$org$bdgenomics$adam$converters$VariantContextConverter$$fromArrayExtractor$4.apply(VariantContextConverter.scala:1196)
at scala.Option.map(Option.scala:145)
at org.bdgenomics.adam.converters.VariantContextConverter.org$bdgenomics$adam$converters$VariantContextConverter$$fromArrayExtractor(VariantContextConverter.scala:1196)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$org$bdgenomics$adam$converters$VariantContextConverter$$lineToVariantContextExtractor$7.apply(VariantContextConverter.scala:1286)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$org$bdgenomics$adam$converters$VariantContextConverter$$lineToVariantContextExtractor$7.apply(VariantContextConverter.scala:1284)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$73.apply(VariantContextConverter.scala:1455)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$73.apply(VariantContextConverter.scala:1455)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.bdgenomics.adam.converters.VariantContextConverter.org$bdgenomics$adam$converters$VariantContextConverter$$convert$1(VariantContextConverter.scala:1455)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$makeVariantFormatFn$1.apply(VariantContextConverter.scala:1469)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$makeVariantFormatFn$1.apply(VariantContextConverter.scala:1469)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$convert$3.apply(VariantContextConverter.scala:249)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$convert$3.apply(VariantContextConverter.scala:242)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.bdgenomics.adam.converters.VariantContextConverter.convert(VariantContextConverter.scala:242)
at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVcf$1$$anonfun$apply$22.apply(ADAMContext.scala:1139)
at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVcf$1$$anonfun$apply$22.apply(ADAMContext.scala:1139)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:148)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Member

devin-petersohn commented Jun 13, 2017

This was meant to be resolved by b1fce67, but was not. I am still having an issue when loading in multi-allelic variants. Here is an example line I am having issue with:
#CHROM POS ID REF ALT QUAL...
1 14522 . G A,C 195.16...

Here is the Stacktrace:
java.lang.ArrayIndexOutOfBoundsException: 1
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$org$bdgenomics$adam$converters$VariantContextConverter$$fromArrayExtractor$4.apply(VariantContextConverter.scala:1196)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$org$bdgenomics$adam$converters$VariantContextConverter$$fromArrayExtractor$4.apply(VariantContextConverter.scala:1196)
at scala.Option.map(Option.scala:145)
at org.bdgenomics.adam.converters.VariantContextConverter.org$bdgenomics$adam$converters$VariantContextConverter$$fromArrayExtractor(VariantContextConverter.scala:1196)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$org$bdgenomics$adam$converters$VariantContextConverter$$lineToVariantContextExtractor$7.apply(VariantContextConverter.scala:1286)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$org$bdgenomics$adam$converters$VariantContextConverter$$lineToVariantContextExtractor$7.apply(VariantContextConverter.scala:1284)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$73.apply(VariantContextConverter.scala:1455)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$73.apply(VariantContextConverter.scala:1455)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.bdgenomics.adam.converters.VariantContextConverter.org$bdgenomics$adam$converters$VariantContextConverter$$convert$1(VariantContextConverter.scala:1455)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$makeVariantFormatFn$1.apply(VariantContextConverter.scala:1469)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$makeVariantFormatFn$1.apply(VariantContextConverter.scala:1469)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$convert$3.apply(VariantContextConverter.scala:249)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$convert$3.apply(VariantContextConverter.scala:242)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.bdgenomics.adam.converters.VariantContextConverter.convert(VariantContextConverter.scala:242)
at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVcf$1$$anonfun$apply$22.apply(ADAMContext.scala:1139)
at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVcf$1$$anonfun$apply$22.apply(ADAMContext.scala:1139)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:148)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jun 13, 2017

Member

Hi @devin-petersohn

The commit you refer to appears to be the BQSR refactor. I'm guessing you meant another commit?

We're parsing an INFO field wrong in a multi-allelic record. Can you include the INFO field for the variant, as well as the VCF header?

Member

fnothaft commented Jun 13, 2017

Hi @devin-petersohn

The commit you refer to appears to be the BQSR refactor. I'm guessing you meant another commit?

We're parsing an INFO field wrong in a multi-allelic record. Can you include the INFO field for the variant, as well as the VCF header?

@devin-petersohn

This comment has been minimized.

Show comment
Hide comment
@devin-petersohn

devin-petersohn Jun 13, 2017

Member

This was closed to track #577, which was closed by the commit referenced above.

For the VCF parse error, I simply added an allele to an existing line, but I now know you cannot do that.

I am trying to figure out how ADAM handles multi-allelic variants.

I also tried to use the VCF spec example and that failed:
Input stream does not contain a BCF encoded file; BCF magic header info not found, at record 0 with position 0:
htsjdk.tribble.TribbleException: Input stream does not contain a BCF encoded file; BCF magic header info not found, at record 0 with position 0:

Member

devin-petersohn commented Jun 13, 2017

This was closed to track #577, which was closed by the commit referenced above.

For the VCF parse error, I simply added an allele to an existing line, but I now know you cannot do that.

I am trying to figure out how ADAM handles multi-allelic variants.

I also tried to use the VCF spec example and that failed:
Input stream does not contain a BCF encoded file; BCF magic header info not found, at record 0 with position 0:
htsjdk.tribble.TribbleException: Input stream does not contain a BCF encoded file; BCF magic header info not found, at record 0 with position 0:

@fnothaft

This comment has been minimized.

Show comment
Hide comment

@fnothaft fnothaft closed this Jun 13, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment