Transcript Effects ignored if more than 1 #1347

majkiw · 2017-01-10T11:59:52Z

Hi,

I was trying to run current master against our VCF files and I have noticed that whenever there is more than one value in transcript effects, the list is returned empty.

BTW we really appreciate that you added support for transcript effects - this was the last missing thing we required to be able to use ADAM and you added it just 2 weeks after we started testing. Thank you :)

I am attaching example for completeness, but I don't think it's needed, because it happens on every row with multiple transcript effects.

This would return 1 transcript effect:

16	57392733	rs4359426	A	C	981.77	PASS	AC=2;AF=1.00;AN=2;DB;DP=36;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;POSITIVE_TRAIN_SITE;QD=28.88;SOR=1.819;VQSLOD=13.20;culprit=MQ;set=variant;ANN=C|missense_variant|MODERATE|CCL22|CCL22|transcript|NM_002990|protein_coding|1/3|c.5A>C|p.Asp2Ala|39/2918|5/282|2/93||	GT:AD:DP:GQ:PL	1/1:0,34:34:99:1010,101,0

This would return no transcript effects:

16	56913513	rs5801	C	T	855.77	PASS	AC=1;AF=0.500;AN=2;BaseQRankSum=-1.843e+00;ClippingRankSum=-7.700e-02;DB;DP=80;ExcessHet=3.0103;FS=5.918;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=-3.220e-01;POSITIVE_TRAIN_SITE;QD=11.11;ReadPosRankSum=0.852;SOR=0.754;VQSLOD=8.65;culprit=MQ;set=variant;ANN=T|synonymous_variant|LOW|SLC12A3|SLC12A3|transcript|NM_000339|protein_coding|11/26|c.1395C>T|p.Thr465Thr|1424/5567|1395/3093|465/1030||,T|synonymous_variant|LOW|SLC12A3|SLC12A3|transcript|NM_001126107|protein_coding|11/26|c.1392C>T|p.Thr464Thr|1421/5564|1392/3090|464/1029||,T|synonymous_variant|LOW|SLC12A3|SLC12A3|transcript|NM_001126108|protein_coding|11/26|c.1395C>T|p.Thr465Thr|1424/5540|1395/3066|465/1021||	GT:AD:DP:GQ:PL	0/1:36,41:77:99:884,0,825

Note: If no one is working on this I will probably look into this myself if I have time today or tomorrow.

The text was updated successfully, but these errors were encountered:

heuermh · 2017-01-10T13:42:10Z

Thanks for reporting this, @majkiw! I'll take a look this morning.

heuermh · 2017-01-10T14:08:11Z

These quick tests pass

diff --git a/adam-core/src/test/scala/org/bdgenomics/adam/converters/TranscriptEffectConverterSuite.scala b/adam-core/src/test/scala/org/bdgenomics/adam/converters/TranscriptEffectConverterSuite.scala
index c63029a0..5c7927cf 100644
--- a/adam-core/src/test/scala/org/bdgenomics/adam/converters/TranscriptEffectConverterSuite.scala
+++ b/adam-core/src/test/scala/org/bdgenomics/adam/converters/TranscriptEffectConverterSuite.scala
@@ -35,6 +35,7 @@ class TranscriptEffectConverterSuite extends ADAMFunSuite {
   final val INVALID_NUMBER = "T|upstream_gene_variant||TAS1R3|ENSG00000169962|transcript|ENST00000339381.5|protein_coding|1/2|c.-485C>T|||4|1/42|not a number|"
   final val INVALID_FRACTION = "T|upstream_gene_variant||TAS1R3|ENSG00000169962|transcript|ENST00000339381.5|protein_coding|not a number/2|c.-485C>T|||4|1/42|453|"
   final val VALID = "T|upstream_gene_variant||TAS1R3|ENSG00000169962|transcript|ENST00000339381.5|protein_coding|1/2|c.-485C>T|||4|1/42|453|"
+  final val MULTIPLE = VALID + "," + VALID + "," + VALID
 
   var variant: Variant = null
   var variantContext: VariantContext = null
@@ -134,6 +135,33 @@ class TranscriptEffectConverterSuite extends ADAMFunSuite {
     assert(te.getMessages.isEmpty)
   }
 
+  test("parse VCF ANN attribute with multiple transcript effects") {
+    val ann = TranscriptEffectConverter.parseAnn(MULTIPLE, ValidationStringency.STRICT)
+    assert(ann.length == 3)
+
+    ann.foreach(te => {
+      assert(te.getAlternateAllele == "T")
+      assert(te.getEffects.contains("upstream_gene_variant"))
+      assert(te.getGeneName == "TAS1R3")
+      assert(te.getGeneId == "ENSG00000169962")
+      assert(te.getFeatureType == "transcript")
+      assert(te.getFeatureId == "ENST00000339381.5")
+      assert(te.getBiotype == "protein_coding")
+      assert(te.getRank == 1)
+      assert(te.getTotal == 2)
+      assert(te.getTranscriptHgvs == "c.-485C>T")
+      assert(te.getProteinHgvs == null)
+      assert(te.getCdnaPosition == null)
+      assert(te.getCdnaLength == null)
+      assert(te.getCdsPosition == 4)
+      assert(te.getCdsLength == null)
+      assert(te.getProteinPosition == 1)
+      assert(te.getProteinLength == 42)
+      assert(te.getDistance == 453)
+      assert(te.getMessages.isEmpty)
+    })
+  }
+
   test("convert to transcript effect from null VCF ANN attribute in variant context") {
     when(variantContext.getAttributeAsString("ANN", null)).thenReturn(null)
 
@@ -253,6 +281,42 @@ class TranscriptEffectConverterSuite extends ADAMFunSuite {
     })
   }
 
+  test("convert to transcript effect from VCF ANN attribute in variant context multiple effects same alt allele") {
+    variant = Variant.newBuilder()
+      .setAlternateAllele("T")
+      .build()
+
+    when(variantContext.getAttributeAsString("ANN", null)).thenReturn(MULTIPLE)
+
+    val transcriptEffectsOpt = TranscriptEffectConverter.convertToTranscriptEffects(variant, variantContext)
+    assert(transcriptEffectsOpt.isDefined)
+
+    transcriptEffectsOpt.foreach(transcriptEffects => {
+      assert(transcriptEffects.size === 3)
+      transcriptEffects.foreach(te => {
+        assert(te.getAlternateAllele == "T")
+        assert(te.getEffects.contains("upstream_gene_variant"))
+        assert(te.getGeneName == "TAS1R3")
+        assert(te.getGeneId == "ENSG00000169962")
+        assert(te.getFeatureType == "transcript")
+        assert(te.getFeatureId == "ENST00000339381.5")
+        assert(te.getBiotype == "protein_coding")
+        assert(te.getRank == 1)
+        assert(te.getTotal == 2)
+        assert(te.getTranscriptHgvs == "c.-485C>T")
+        assert(te.getProteinHgvs == null)
+        assert(te.getCdnaPosition == null)
+        assert(te.getCdnaLength == null)
+        assert(te.getCdsPosition == 4)
+        assert(te.getCdsLength == null)
+        assert(te.getProteinPosition == 1)
+        assert(te.getProteinLength == 42)
+        assert(te.getDistance == 453)
+        assert(te.getMessages.isEmpty)
+      })
+    })
+  }
+
   test("convert transcript effect to VCF ANN attribute value") {
     val te = TranscriptEffect.newBuilder()
       .setAlternateAllele("T")

The problem may be that in the tests I'm mocking htsjdk with reasonable behavior, which of course may not be a reasonable thing to do.

majkiw · 2017-01-10T14:44:34Z

Hey,
Thanks for following.
From my investigation the error happens earlier and comes from this line:
https://github.com/Recombine/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/converters/TranscriptEffectConverter.scala#L215

vc.getAttributeAsString("ANN", null)
would return a value like

[C|synonymous_variant|LOW|ADGRG1|ADGRG1|transcript|NM_001145771|protein_coding|8/15|c.996T>C|p.Thr332Thr|1726/4269|996/2082|332/693||, C|synonymous_variant|LOW|ADGRG1|ADGRG1|transcript|NM_001290142|protein_coding|7/14|c.486T>C|p.Thr162Thr|828/3371|486/1572|162/523||]

Notice extra brackets in the start/end of this string that break further parsing.

So for lists we need vc.getAttributeAsList("ANN"). I am not sure how to detect which one should be used.

heuermh · 2017-01-10T15:23:27Z

If the ANN VCF header line is Number=.,Type=String (the VCF annotation specification doesn't say, and I don't have any SnpEff-annotated VCF files in front of me), then I suppose accessing the attribute value as a list would be correct.

majkiw · 2017-01-12T09:46:20Z

Thank you for picking this up @heuermh.
I was however wondering.. Why was it silently ignored?
Shouldn't this have thrown exception or something?
I wasn't setting stringency and the default one seems to be STRICT.

heuermh mentioned this issue Jan 10, 2017

Use htsjdk getAttributeAsList for VCF INFO ANN key #1348

Merged

heuermh modified the milestones: 0.21.1, 0.22.0 Jan 10, 2017

fnothaft closed this as completed in #1348 Jan 16, 2017

heuermh modified the milestones: 0.22.0, 0.21.1 Jan 18, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Transcript Effects ignored if more than 1 #1347

Transcript Effects ignored if more than 1 #1347

majkiw commented Jan 10, 2017

heuermh commented Jan 10, 2017

heuermh commented Jan 10, 2017

majkiw commented Jan 10, 2017 •

edited

heuermh commented Jan 10, 2017

majkiw commented Jan 12, 2017

Transcript Effects ignored if more than 1 #1347

Transcript Effects ignored if more than 1 #1347

Comments

majkiw commented Jan 10, 2017

heuermh commented Jan 10, 2017

heuermh commented Jan 10, 2017

majkiw commented Jan 10, 2017 • edited

heuermh commented Jan 10, 2017

majkiw commented Jan 12, 2017

majkiw commented Jan 10, 2017 •

edited