Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

genotypeType for genotypes with multiple OtherAlt alleles? #897

Closed
NeillGibson opened this issue Dec 10, 2015 · 8 comments
Closed

genotypeType for genotypes with multiple OtherAlt alleles? #897

NeillGibson opened this issue Dec 10, 2015 · 8 comments

Comments

@NeillGibson
Copy link
Contributor

@NeillGibson NeillGibson commented Dec 10, 2015

Hi,

What should the GenotypeType be when there are multiple OtherAlt alleles in a Genotype?

Take for instance the following abbreviated VCF record which has a multi-allelic variant for 4
samples of a diploid species:

CHROM POS REF ALT Sample1 Sample2 Sample3 Sample4
1 100 A C,G,T A/C G/G G/T A,G

More than two alternative alleles won't happen that often for a snp but will happen often for mnp's and indels.

If I understand the Adam genotype schema correctly then this would lead to 12 genotype parquet records. One record for each combination of alternative allele and sample.

The first set of genotype / parquet record for each sample would have the following allele combinations for each sample.

Sample Alleles GenotypeAllele GenotypeType
1 A/C Ref / Alt HET
2 G/G OtherAlt / OtherAlt HOM_ALT
3 G/T OtherAlt / OtherAlt HET
4 A/G Ref / OtherAlt HET

But how can I determine the correct GenotypeType for sample 2 and 3 if I don't see the actual alleles but just two OtherAlt alleles? The actual allele information is as far as I know not available when processing the first set of genotype / parquet record (that are from the perspective of the first alternative allele) . This because the real allele information is in the other genotype parquet records...

The current getType method in RichGenotype would return NO_CALL as the GenotypeType for sample 2, 3 and 4 when processing the first set of genotype / parquet record.
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rich/RichGenotype.scala#L38

Is this correct? Or is there a way to determine if the OtherAlt alleles are the same or different?
Or should there be a special GenotypeType to handle cases of two OtherAlt alleles`?

I am trying to write a replacement for the getType method that handles the OtherAlt alleles and polyploid species.

@fnothaft
Copy link
Member

@fnothaft fnothaft commented May 18, 2016

I am thinking that we punt this to a later release than 0.20.0. Any comments?

@fnothaft fnothaft modified the milestones: 0.21.0, 0.20.0 Jun 3, 2016
@fnothaft
Copy link
Member

@fnothaft fnothaft commented Jun 3, 2016

Punting to 0.21.0.

@heuermh heuermh modified the milestones: 0.21.0, 0.22.0 Oct 13, 2016
@fnothaft
Copy link
Member

@fnothaft fnothaft commented Mar 3, 2017

This will be resolved by #577. Closing in favor of that ticket.

@fnothaft fnothaft closed this Mar 3, 2017
@heuermh heuermh added this to Completed in Release 0.23.0 Mar 8, 2017
@devin-petersohn
Copy link
Member

@devin-petersohn devin-petersohn commented Jun 13, 2017

This was meant to be resolved by b1fce67, but was not. I am still having an issue when loading in multi-allelic variants. Here is an example line I am having issue with:
#CHROM POS ID REF ALT QUAL...
1 14522 . G A,C 195.16...

Here is the Stacktrace:
java.lang.ArrayIndexOutOfBoundsException: 1
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$org$bdgenomics$adam$converters$VariantContextConverter$$fromArrayExtractor$4.apply(VariantContextConverter.scala:1196)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$org$bdgenomics$adam$converters$VariantContextConverter$$fromArrayExtractor$4.apply(VariantContextConverter.scala:1196)
at scala.Option.map(Option.scala:145)
at org.bdgenomics.adam.converters.VariantContextConverter.org$bdgenomics$adam$converters$VariantContextConverter$$fromArrayExtractor(VariantContextConverter.scala:1196)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$org$bdgenomics$adam$converters$VariantContextConverter$$lineToVariantContextExtractor$7.apply(VariantContextConverter.scala:1286)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$org$bdgenomics$adam$converters$VariantContextConverter$$lineToVariantContextExtractor$7.apply(VariantContextConverter.scala:1284)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$73.apply(VariantContextConverter.scala:1455)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$73.apply(VariantContextConverter.scala:1455)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.bdgenomics.adam.converters.VariantContextConverter.org$bdgenomics$adam$converters$VariantContextConverter$$convert$1(VariantContextConverter.scala:1455)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$makeVariantFormatFn$1.apply(VariantContextConverter.scala:1469)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$makeVariantFormatFn$1.apply(VariantContextConverter.scala:1469)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$convert$3.apply(VariantContextConverter.scala:249)
at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$convert$3.apply(VariantContextConverter.scala:242)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.bdgenomics.adam.converters.VariantContextConverter.convert(VariantContextConverter.scala:242)
at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVcf$1$$anonfun$apply$22.apply(ADAMContext.scala:1139)
at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVcf$1$$anonfun$apply$22.apply(ADAMContext.scala:1139)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:148)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

@fnothaft
Copy link
Member

@fnothaft fnothaft commented Jun 13, 2017

Hi @devin-petersohn

The commit you refer to appears to be the BQSR refactor. I'm guessing you meant another commit?

We're parsing an INFO field wrong in a multi-allelic record. Can you include the INFO field for the variant, as well as the VCF header?

@devin-petersohn
Copy link
Member

@devin-petersohn devin-petersohn commented Jun 13, 2017

This was closed to track #577, which was closed by the commit referenced above.

For the VCF parse error, I simply added an allele to an existing line, but I now know you cannot do that.

I am trying to figure out how ADAM handles multi-allelic variants.

I also tried to use the VCF spec example and that failed:
Input stream does not contain a BCF encoded file; BCF magic header info not found, at record 0 with position 0:
htsjdk.tribble.TribbleException: Input stream does not contain a BCF encoded file; BCF magic header info not found, at record 0 with position 0:

@fnothaft
Copy link
Member

@fnothaft fnothaft commented Jun 13, 2017

@fnothaft fnothaft closed this Jun 13, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants
You can’t perform that action at this time.