Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transcript Effects ignored if more than 1 #1347

Closed
majkiw opened this issue Jan 10, 2017 · 5 comments
Closed

Transcript Effects ignored if more than 1 #1347

majkiw opened this issue Jan 10, 2017 · 5 comments
Milestone

Comments

@majkiw
Copy link

majkiw commented Jan 10, 2017

Hi,

I was trying to run current master against our VCF files and I have noticed that whenever there is more than one value in transcript effects, the list is returned empty.

BTW we really appreciate that you added support for transcript effects - this was the last missing thing we required to be able to use ADAM and you added it just 2 weeks after we started testing. Thank you :)

I am attaching example for completeness, but I don't think it's needed, because it happens on every row with multiple transcript effects.

This would return 1 transcript effect:

16	57392733	rs4359426	A	C	981.77	PASS	AC=2;AF=1.00;AN=2;DB;DP=36;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;POSITIVE_TRAIN_SITE;QD=28.88;SOR=1.819;VQSLOD=13.20;culprit=MQ;set=variant;ANN=C|missense_variant|MODERATE|CCL22|CCL22|transcript|NM_002990|protein_coding|1/3|c.5A>C|p.Asp2Ala|39/2918|5/282|2/93||	GT:AD:DP:GQ:PL	1/1:0,34:34:99:1010,101,0

This would return no transcript effects:

16	56913513	rs5801	C	T	855.77	PASS	AC=1;AF=0.500;AN=2;BaseQRankSum=-1.843e+00;ClippingRankSum=-7.700e-02;DB;DP=80;ExcessHet=3.0103;FS=5.918;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=-3.220e-01;POSITIVE_TRAIN_SITE;QD=11.11;ReadPosRankSum=0.852;SOR=0.754;VQSLOD=8.65;culprit=MQ;set=variant;ANN=T|synonymous_variant|LOW|SLC12A3|SLC12A3|transcript|NM_000339|protein_coding|11/26|c.1395C>T|p.Thr465Thr|1424/5567|1395/3093|465/1030||,T|synonymous_variant|LOW|SLC12A3|SLC12A3|transcript|NM_001126107|protein_coding|11/26|c.1392C>T|p.Thr464Thr|1421/5564|1392/3090|464/1029||,T|synonymous_variant|LOW|SLC12A3|SLC12A3|transcript|NM_001126108|protein_coding|11/26|c.1395C>T|p.Thr465Thr|1424/5540|1395/3066|465/1021||	GT:AD:DP:GQ:PL	0/1:36,41:77:99:884,0,825

Note: If no one is working on this I will probably look into this myself if I have time today or tomorrow.

@heuermh
Copy link
Member

heuermh commented Jan 10, 2017

Thanks for reporting this, @majkiw! I'll take a look this morning.

@heuermh
Copy link
Member

heuermh commented Jan 10, 2017

These quick tests pass

diff --git a/adam-core/src/test/scala/org/bdgenomics/adam/converters/TranscriptEffectConverterSuite.scala b/adam-core/src/test/scala/org/bdgenomics/adam/converters/TranscriptEffectConverterSuite.scala
index c63029a0..5c7927cf 100644
--- a/adam-core/src/test/scala/org/bdgenomics/adam/converters/TranscriptEffectConverterSuite.scala
+++ b/adam-core/src/test/scala/org/bdgenomics/adam/converters/TranscriptEffectConverterSuite.scala
@@ -35,6 +35,7 @@ class TranscriptEffectConverterSuite extends ADAMFunSuite {
   final val INVALID_NUMBER = "T|upstream_gene_variant||TAS1R3|ENSG00000169962|transcript|ENST00000339381.5|protein_coding|1/2|c.-485C>T|||4|1/42|not a number|"
   final val INVALID_FRACTION = "T|upstream_gene_variant||TAS1R3|ENSG00000169962|transcript|ENST00000339381.5|protein_coding|not a number/2|c.-485C>T|||4|1/42|453|"
   final val VALID = "T|upstream_gene_variant||TAS1R3|ENSG00000169962|transcript|ENST00000339381.5|protein_coding|1/2|c.-485C>T|||4|1/42|453|"
+  final val MULTIPLE = VALID + "," + VALID + "," + VALID
 
   var variant: Variant = null
   var variantContext: VariantContext = null
@@ -134,6 +135,33 @@ class TranscriptEffectConverterSuite extends ADAMFunSuite {
     assert(te.getMessages.isEmpty)
   }
 
+  test("parse VCF ANN attribute with multiple transcript effects") {
+    val ann = TranscriptEffectConverter.parseAnn(MULTIPLE, ValidationStringency.STRICT)
+    assert(ann.length == 3)
+
+    ann.foreach(te => {
+      assert(te.getAlternateAllele == "T")
+      assert(te.getEffects.contains("upstream_gene_variant"))
+      assert(te.getGeneName == "TAS1R3")
+      assert(te.getGeneId == "ENSG00000169962")
+      assert(te.getFeatureType == "transcript")
+      assert(te.getFeatureId == "ENST00000339381.5")
+      assert(te.getBiotype == "protein_coding")
+      assert(te.getRank == 1)
+      assert(te.getTotal == 2)
+      assert(te.getTranscriptHgvs == "c.-485C>T")
+      assert(te.getProteinHgvs == null)
+      assert(te.getCdnaPosition == null)
+      assert(te.getCdnaLength == null)
+      assert(te.getCdsPosition == 4)
+      assert(te.getCdsLength == null)
+      assert(te.getProteinPosition == 1)
+      assert(te.getProteinLength == 42)
+      assert(te.getDistance == 453)
+      assert(te.getMessages.isEmpty)
+    })
+  }
+
   test("convert to transcript effect from null VCF ANN attribute in variant context") {
     when(variantContext.getAttributeAsString("ANN", null)).thenReturn(null)
 
@@ -253,6 +281,42 @@ class TranscriptEffectConverterSuite extends ADAMFunSuite {
     })
   }
 
+  test("convert to transcript effect from VCF ANN attribute in variant context multiple effects same alt allele") {
+    variant = Variant.newBuilder()
+      .setAlternateAllele("T")
+      .build()
+
+    when(variantContext.getAttributeAsString("ANN", null)).thenReturn(MULTIPLE)
+
+    val transcriptEffectsOpt = TranscriptEffectConverter.convertToTranscriptEffects(variant, variantContext)
+    assert(transcriptEffectsOpt.isDefined)
+
+    transcriptEffectsOpt.foreach(transcriptEffects => {
+      assert(transcriptEffects.size === 3)
+      transcriptEffects.foreach(te => {
+        assert(te.getAlternateAllele == "T")
+        assert(te.getEffects.contains("upstream_gene_variant"))
+        assert(te.getGeneName == "TAS1R3")
+        assert(te.getGeneId == "ENSG00000169962")
+        assert(te.getFeatureType == "transcript")
+        assert(te.getFeatureId == "ENST00000339381.5")
+        assert(te.getBiotype == "protein_coding")
+        assert(te.getRank == 1)
+        assert(te.getTotal == 2)
+        assert(te.getTranscriptHgvs == "c.-485C>T")
+        assert(te.getProteinHgvs == null)
+        assert(te.getCdnaPosition == null)
+        assert(te.getCdnaLength == null)
+        assert(te.getCdsPosition == 4)
+        assert(te.getCdsLength == null)
+        assert(te.getProteinPosition == 1)
+        assert(te.getProteinLength == 42)
+        assert(te.getDistance == 453)
+        assert(te.getMessages.isEmpty)
+      })
+    })
+  }
+
   test("convert transcript effect to VCF ANN attribute value") {
     val te = TranscriptEffect.newBuilder()
       .setAlternateAllele("T")

The problem may be that in the tests I'm mocking htsjdk with reasonable behavior, which of course may not be a reasonable thing to do.

@majkiw
Copy link
Author

majkiw commented Jan 10, 2017

Hey,
Thanks for following.
From my investigation the error happens earlier and comes from this line:
https://github.com/Recombine/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/converters/TranscriptEffectConverter.scala#L215

vc.getAttributeAsString("ANN", null)
would return a value like

[C|synonymous_variant|LOW|ADGRG1|ADGRG1|transcript|NM_001145771|protein_coding|8/15|c.996T>C|p.Thr332Thr|1726/4269|996/2082|332/693||, C|synonymous_variant|LOW|ADGRG1|ADGRG1|transcript|NM_001290142|protein_coding|7/14|c.486T>C|p.Thr162Thr|828/3371|486/1572|162/523||]

Notice extra brackets in the start/end of this string that break further parsing.

So for lists we need vc.getAttributeAsList("ANN"). I am not sure how to detect which one should be used.

@heuermh
Copy link
Member

heuermh commented Jan 10, 2017

If the ANN VCF header line is Number=.,Type=String (the VCF annotation specification doesn't say, and I don't have any SnpEff-annotated VCF files in front of me), then I suppose accessing the attribute value as a list would be correct.

@majkiw
Copy link
Author

majkiw commented Jan 12, 2017

Thank you for picking this up @heuermh.
I was however wondering.. Why was it silently ignored?
Shouldn't this have thrown exception or something?
I wasn't setting stringency and the default one seems to be STRICT.

@heuermh heuermh modified the milestones: 0.22.0, 0.21.1 Jan 18, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants