Ambiguous genome-transcript alignments cause ambiguous variant projections #514

reece · 2018-08-23T20:41:34Z

When a genome-transcript alignment is ambiguous, variant projections in that region are also ambiguous.

Consider this alignment provided by NCBI for ALMS1 in https://ftp.ncbi.nih.gov/refseq/H_sapiens/alignments/GCF_000001405.25_knownrefseq_alignments.gff3 (as of this date), line 18647:

NC_000002.11      RefSeq  cDNA_match      73612886        73613320        431.411 +       .       ID=df5c6169-22bc-4913-b001-6eabfe95b866;Target=NM_015120.4 1 438 +;gap_count=2;identity=0.999536;idty=0.993151;num_ident=12922;num_mismatch=0;pct_coverage=99.9536;pct_identity_gap=99.9536;pct_identity_ungap=100;Gap=M185 I3 M250

Note the CIGAR string M185 I3 M250. That alignment is shown below, with the gap at n.186_188.

   NM_015120.4  c         26 >                                                        >       79 
   NM_015120.4  n        137 > CGGGCGAGCTGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAAG >      190
                               |||||||||||||||||||||||||||||||||||||||||||||||||   || 49=3I2= 
   NC_000002.11 g   73613022 > CGGGCGAGCTGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGA---AG > 73613072
                             < GCCCGCTCGACCTCCTCCTCCTCCTCCTCCTCCTCCTCCTCCTCCTCCT---TC <

This region contains a GGA repeat, with 13x in the genome and 14x in the transcript. An equivalent alignment is with the gap shifted 5' (relative to genome and transcript) and the gap at n.147_149.

   NM_015120.4  c         26 >                                                        >       79
   NM_015120.4  n        137 > CGGGCGAGCTGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAAG >      190
                               ||||||||||   ||||||||||||||||||||||||||||||||||||||||| 10=3I41= 
   NC_000002.11 g   73613022 > CGGGCGAGCT---GGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAAG > 73613072
                             < GCCCGCTCGA---CCTCCTCCTCCTCCTCCTCCTCCTCCTCCTCCTCCTCCTTC <

Many other equivalent alignments are available, including circularly permuted forms. The alignment is ambiguous. (NOTE: Although HGVS specifies that variants should be right-shifted, it says nothing about how alignments should be represented.)

Here are both alignments in one view:

                               4         5         6         7         8         9
                            789012345678901234567890123456789012345678901234567890
NM_015120.4  n        137 > CGGGCGAGCTGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAAG
                                     ^   ^
                            2       3|   |       4         5         6         7
                            2345678901   23456789012345678901234567890123456789012
NC_000002.11 g   73613022 > CGGGCGAGCT---GGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAAG
                                     ^   ^
                            2       3|   |    4         5         6         7
                            2345678901234567890123456789012345678901234567890   12
NC_000002.11 g   73613022 > CGGGCGAGCTGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGA---AG
                                     ^   ^

Now, consider two projecting two variants, NM_015120.4:n.146T>C and NM_015120.4n.150G>C to the genome. When the gap is represented at n.147_149, these two variants will project to adjacent positions 73613031 and 73613032 on the genome. When the gap is represented at n.186_188, the genome variants are at 73613031 and 73613035. Both views are correct.

Ideally, hgvs will warn (#166) when an ambiguous alignment affects results.

The text was updated successfully, but these errors were encountered:

AngieHinrichs · 2018-08-23T21:38:37Z

I think of the ambiguously placed gap as an independent variant (insGGA), which in HGVS would use the 3'-shifted location: g.73613070_ 73613071insGGA. So in my opinion, the rightmost placement of the alignment gap is preferable, and NM_015120.4n.150G>C --> g.73613035. The transcript has ...TGGAG > TGGAC, and the genome can too. Otherwise the genome changing at 73613032 assumes that insGGA has already been applied in its leftmost position. I see your point that either is mathematically valid, and using the same alignment, the transform would be reversible -- but at least for the sake of consistency, I like using the rightmost interpretation of the alignment gap. (Should we take it up with HGVS?)

Substitutions that may or may not overlap ambiguously placed alignment gaps are about the trickiest case I've encountered. I have a trick for making sure that they are detected -- going from genome to transcript, but it might carry over in a straightforward way to transcript to genome. YMMV.

Before translating coordinates, I look for ambiguously placed gaps in the transcript alignment -- and I modify the alignment to have a double-sided gap over the entire ambiguous region. This is an easy modification in UCSC's internal representation of alignments as a series of alignment block coordinates on target and query. In this case, instead of skipping 3 transcript bases and 0 genome bases, the coordinates skip over 42 transcript bases and 39 genome bases.

Then it's trivial to detect whether a variant's projection is affected by an alignment gap -- no need to go searching the neighborhood, or taking chances with detecting the interaction or not.

Having determined that the variant is somehow affected by a possibly ambiguously placed alignment gap, I apply the variant change to the ambiguous region sequence, which has the same effect as using the rightmost placement (my preference) of the alignment gap. In case the variant is an insertion or deletion that will need right-shifting itself, I trim matching bases from the modified ambiguous regions of genome and transcript, first from the left to get the rightmost position, then from the right, and that yields the right-shifted position and minimal base change.

Complicated, maybe -- but consistent in the face of differing alignment gap placements. Written in a bit of a hurry, so if I have stated something unclearly, let me know and I'll try to make it more coherent. :)

reece · 2018-08-23T22:24:02Z

Angie- I totally agree that the convention should be to right shift alignment gaps, with the addition that shifting should be with respect to the transcript. Than main reason for this hgvs issue is that some transcripts contain left shifted gaps in UTA. Changing them will change results for users, which is bad. BTW, your approach is essentially equivalent to what the NCBI does with SPDI. It's a great solution as long as the set of alternate sequences is known in advance.

…

-Reece On Aug 23, 2018 2:38 PM, "Angie Hinrichs" <notifications@github.com> wrote: I think of the ambiguously placed gap as an independent variant (insGGA), which in HGVS would use the 3'-shifted location: g.73613070_ 73613071insGGA. So in my opinion, the rightmost placement of the alignment gap is preferable, and NM_015120.4n.150G>C --> g.73613035. The transcript has ...TGGAG > TGGAC, and the genome can too. Otherwise the genome changing at 73613032 assumes that insGGA has already been applied in its leftmost position. I see your point that either is mathematically valid, and using the same alignment, the transform would be reversible -- but at least for the sake of consistency, I like using the rightmost interpretation of the alignment gap. (Should we take it up with HGVS?) Substitutions that may or may not overlap ambiguously placed alignment gaps are about the trickiest case I've encountered. I have a trick for making sure that they are detected -- going from genome to transcript, but it might carry over in a straightforward way to transcript to genome. YMMV. Before translating coordinates, I look for ambiguously placed gaps in the transcript alignment -- and I modify the alignment to have a double-sided gap over the entire ambiguous region. This is an easy modification in UCSC's internal representation of alignments as a series of alignment block coordinates on target and query. In this case, instead of skipping 3 transcript bases and 0 genome bases, the coordinates skip over 42 transcript bases and 39 genome bases. Then it's trivial to detect whether a variant's projection is affected by an alignment gap -- no need to go searching the neighborhood, or taking chances with detecting the interaction or not. Having determined that the variant is somehow affected by a possibly ambiguously placed alignment gap, I apply the variant change to the ambiguous region sequence, which has the same effect as using the rightmost placement (my preference) of the alignment gap. In case the variant is an insertion or deletion that will need right-shifting itself, I trim matching bases from the modified ambiguous regions of genome and transcript, first from the left to get the rightmost position, then from the right, and that yields the right-shifted position and minimal base change. Complicated, maybe -- but consistent in the face of differing alignment gap placements. Written in a bit of a hurry, so if I have stated something unclearly, let me know and I'll try to make it more coherent. :) — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#514 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAGrjdHk49anAq7EpLsbTpPUINLEjJqKks5uTyDdgaJpZM4WKUHq> .

AngieHinrichs · 2018-08-24T17:17:15Z

Thanks Reece!

with the addition that shifting should be with respect to the transcript.

Even when going from c./n. to g.? (In general I would rather keep things transcript-centric too, but HGVS wants to go 3' on any reference.)

Changing them will change results for users, which is bad.

I hope that all tools can converge on the same behavior eventually, but yes, inconsistency with past results is a problem for users. Hopefully only bad enough to still change the answer but give a warning about it, which seems to be what you intend to do. (?)

BTW, your approach is essentially equivalent to what the NCBI does with SPDI.

Yes, SPDI was my inspiration for tweaking the transcript alignments. I hope SPDI becomes a common interchange format -- unambiguous, easy to parse, what's not to love! (except the string length when there's a large insertion or duplication, but what can ya do)

It's a great solution as long as the set of alternate sequences is known in advance.

Don't we already need all the sequences in advance in order to correctly right-shift? I guess fetching the genome sequence would be the difficult part here... while in genome browser-land, we start with a genome sequence. :)

reece · 2018-09-19T19:21:42Z

See #392

github-actions · 2023-12-28T01:53:06Z

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions · 2024-01-04T01:54:54Z

This issue was closed because it has been stalled for 7 days with no activity.

reece · 2024-02-19T17:44:10Z

This issue was closed by stalebot. It has been reopened to give more time for community review. See biocommons coding guidelines for stale issue and pull request policies. This resurrection is expected to be a one-time event.

github-actions · 2024-05-23T01:56:17Z

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

AngieHinrichs mentioned this issue Sep 20, 2018

Optionally raise exception for variants within ambiguous indel alignment regions #392

Closed

github-actions bot added the stale Issue is stale and subject to automatic closing label Dec 28, 2023

github-actions bot added the closed-by-stale label Jan 4, 2024

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jan 4, 2024

reece reopened this Feb 19, 2024

reece added resurrected and removed stale Issue is stale and subject to automatic closing labels Feb 19, 2024

github-actions bot added the stale Issue is stale and subject to automatic closing label May 23, 2024

jsstevenson added keep alive exempt issue from staleness checks and removed stale Issue is stale and subject to automatic closing resurrected labels May 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ambiguous genome-transcript alignments cause ambiguous variant projections #514

Ambiguous genome-transcript alignments cause ambiguous variant projections #514

reece commented Aug 23, 2018 •

edited

Loading

AngieHinrichs commented Aug 23, 2018

reece commented Aug 23, 2018 via email

AngieHinrichs commented Aug 24, 2018

reece commented Sep 19, 2018

github-actions bot commented Dec 28, 2023

github-actions bot commented Jan 4, 2024

reece commented Feb 19, 2024

github-actions bot commented May 23, 2024

Ambiguous genome-transcript alignments cause ambiguous variant projections #514

Ambiguous genome-transcript alignments cause ambiguous variant projections #514

Comments

reece commented Aug 23, 2018 • edited Loading

AngieHinrichs commented Aug 23, 2018

reece commented Aug 23, 2018 via email

AngieHinrichs commented Aug 24, 2018

reece commented Sep 19, 2018

github-actions bot commented Dec 28, 2023

github-actions bot commented Jan 4, 2024

reece commented Feb 19, 2024

github-actions bot commented May 23, 2024

reece commented Aug 23, 2018 •

edited

Loading