Ambiguous genome-transcript alignments cause ambiguous variant projections #514
Comments
I think of the ambiguously placed gap as an independent variant (insGGA), which in HGVS would use the 3'-shifted location: g.73613070_73613071insGGA. So in my opinion, the rightmost placement of the alignment gap is preferable, and NM_015120.4:n.150G>C --> g.73613035. The transcript has ...TGGAG > TGGAC, and the genome can too. Otherwise the genome changing at 73613032 assumes that insGGA has already been applied in its leftmost position. I see your point that either is mathematically valid, and using the same alignment, the transform would be reversible -- but at least for the sake of consistency, I like using the rightmost interpretation of the alignment gap. (Should we take it up with HGVS?)
Substitutions that may or may not overlap ambiguously placed alignment gaps are about the trickiest case I've encountered. I have a trick for making sure that they are detected -- I use it going from genome to transcript, but it might carry over in a straightforward way to transcript to genome. YMMV.
Before translating coordinates, I look for ambiguously placed gaps in the transcript alignment -- and I modify the alignment to have a double-sided gap over the entire ambiguous region. This is an easy modification in UCSC's internal representation of alignments as a series of alignment block coordinates on target and query. In this case, instead of skipping 3 transcript bases and 0 genome bases, the coordinates skip over 42 transcript bases and 39 genome bases.
Then it's trivial to detect whether a variant's projection is affected by an alignment gap -- no need to go searching the neighborhood, or taking chances on whether the interaction is detected.
Having determined that the variant is somehow affected by a possibly ambiguously placed alignment gap, I apply the variant change to the ambiguous region sequence, which has the same effect as using the rightmost placement (my preference) of the alignment gap.
In case the variant is an insertion or deletion that will need right-shifting itself, I trim matching bases from the modified ambiguous regions of genome and transcript, first from the left to get the rightmost position, then from the right, and that yields the right-shifted position and minimal base change. Complicated, maybe -- but consistent in the face of differing alignment gap placements. Written in a bit of a hurry, so if I have stated something unclearly, let me know and I'll try to make it more coherent. :) |
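A minimal Python sketch of the two steps described in the comment above, widening a gap to its full ambiguous region and then trimming matching bases, left first. It is an illustration under stated assumptions, not UCSC's code: the sequences are a toy GGA repeat (4x in the transcript, 3x in the genome) rather than the real ALMS1 sequence, coordinates are 0-based, and the names `ambiguous_span` and `trim_common` are hypothetical.

```python
# Toy stand-ins (assumption): 4x GGA in the transcript, 3x in the genome.
TRANSCRIPT = "ACT" + "GGA" * 4 + "CTT"
GENOME = "ACT" + "GGA" * 3 + "CTT"


def ambiguous_span(seq, gap_start, gap_len):
    """Return the 0-based half-open span of seq over which a gap of
    gap_len bases (present in seq, absent from the other sequence)
    can slide without changing the alignment's meaning."""
    left = gap_start
    # Sliding left by one is neutral when the base entering the gap
    # equals the base leaving it.
    while left > 0 and seq[left - 1] == seq[left + gap_len - 1]:
        left -= 1
    right = gap_start
    while right + gap_len < len(seq) and seq[right] == seq[right + gap_len]:
        right += 1
    return left, right + gap_len


def trim_common(ref, alt):
    """Trim matching bases from the left first, then from the right.
    Returns (offset of the remaining change, ref remainder, alt remainder)."""
    start = 0
    while start < min(len(ref), len(alt)) and ref[start] == alt[start]:
        start += 1
    ref, alt = ref[start:], alt[start:]
    end = 0
    while end < min(len(ref), len(alt)) and ref[-1 - end] == alt[-1 - end]:
        end += 1
    return start, ref[: len(ref) - end], alt[: len(alt) - end]


# Any placement of the 3-base gap inside the repeat widens to the same span.
t_start, t_end = ambiguous_span(TRANSCRIPT, 6, 3)
g_start, g_end = t_start, t_end - 3  # the genome side is 3 bases shorter
t_region = TRANSCRIPT[t_start:t_end]
g_region = GENOME[g_start:g_end]

# Trimming from the left first pushes the residual change to the rightmost spot.
offset, ref, alt = trim_common(g_region, t_region)
print((t_start, t_end), (g_start, g_end), offset, repr(ref), repr(alt))
```

In the toy case the widened gap covers 12 transcript bases and 9 genome bases, the same shape as the 42 versus 39 bases mentioned in the comment, and the trimmed difference is a GGA insertion at the right end of the region.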
Angie-
I totally agree that the convention should be to right shift alignment
gaps, with the addition that shifting should be with respect to the
transcript.
The main reason for this hgvs issue is that some transcripts contain left
shifted gaps in UTA. Changing them will change results for users, which is
bad.
BTW, your approach is essentially equivalent to what the NCBI does with
SPDI. It's a great solution as long as the set of alternate sequences is
known in advance.
…-Reece
(In reply to Angie Hinrichs's comment of Aug 23, 2018, 2:38 PM.)
|
Thanks Reece!
Even when going from c./n. to g.? (In general I would rather keep things transcript-centric too, but HGVS wants to go 3' on any reference.)
I hope that all tools can converge on the same behavior eventually, but yes, inconsistency with past results is a problem for users. Hopefully only bad enough to still change the answer but give a warning about it, which seems to be what you intend to do. (?)
Yes, SPDI was my inspiration for tweaking the transcript alignments. I hope SPDI becomes a common interchange format -- unambiguous, easy to parse, what's not to love! (except the string length when there's a large insertion or duplication, but what can ya do)
Don't we already need all the sequences in advance in order to correctly right-shift? I guess fetching the genome sequence would be the difficult part here... while in genome browser-land, we start with a genome sequence. :) |
See #392 |
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days. |
This issue was closed because it has been stalled for 7 days with no activity. |
This issue was closed by stalebot. It has been reopened to give more time for community review. See biocommons coding guidelines for stale issue and pull request policies. This resurrection is expected to be a one-time event. |
When a genome-transcript alignment is ambiguous, variant projections in that region are also ambiguous.
Consider this alignment provided by NCBI for ALMS1 in https://ftp.ncbi.nih.gov/refseq/H_sapiens/alignments/GCF_000001405.25_knownrefseq_alignments.gff3 (as of this date), line 18647:
Note the CIGAR string M185 I3 M250. That alignment is shown below, with the gap at n.186_188. This region contains a GGA repeat, with 13x in the genome and 14x in the transcript. An equivalent alignment has the gap shifted 5' (relative to genome and transcript), at n.147_149.
Many other equivalent alignments are available, including circularly permuted forms. The alignment is ambiguous. (NOTE: Although HGVS specifies that variants should be right-shifted, it says nothing about how alignments should be represented.)
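To make the ambiguity concrete, here is a small Python sketch that enumerates every gap placement consistent with a pair of sequences. The sequences are toy stand-ins (assumption: 4x GGA in the transcript, 3x in the genome, not the real ALMS1 sequence), positions are 0-based, and `equivalent_gap_starts` is a hypothetical helper, not part of hgvs.

```python
# Toy stand-ins (assumption): 4x GGA in the transcript, 3x in the genome.
TRANSCRIPT = "ACT" + "GGA" * 4 + "CTT"
GENOME = "ACT" + "GGA" * 3 + "CTT"


def equivalent_gap_starts(transcript, genome, gap_len):
    """Return every 0-based start at which removing gap_len transcript
    bases leaves a sequence identical to the genome."""
    return [
        start
        for start in range(len(transcript) - gap_len + 1)
        if transcript[:start] + transcript[start + gap_len:] == genome
    ]


starts = equivalent_gap_starts(TRANSCRIPT, GENOME, 3)
for start in starts:
    # The skipped bases differ by placement (GGA, GAG, AGG): these are
    # the circularly permuted forms of the same gap.
    print(start, TRANSCRIPT[start:start + 3])
```

In this toy case there are ten equally valid placements, from the leftmost (start 3) to the rightmost (start 12).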
Here are both alignments in one view:
Now, consider projecting two variants, NM_015120.4:n.146T>C and NM_015120.4:n.150G>C, to the genome. When the gap is represented at n.147_149, these two variants will project to adjacent positions 73613031 and 73613032 on the genome. When the gap is represented at n.186_188, the genome variants are at 73613031 and 73613035. Both views are correct.
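The arithmetic can be checked with a short Python sketch. It is illustrative only and does not use the hgvs API. Assumptions: the alignment is on the plus strand; I3 means three transcript bases with no genome counterpart; `G_START` (the genome coordinate of n.1) is back-calculated from n.146 mapping to 73613031 rather than read from the GFF3 file; and the left-shifted string M146 I3 M289 is derived from the gap sitting at n.147_149. The function names are hypothetical.

```python
import re

# Assumption: genome coordinate aligned to n.1, back-calculated from
# n.146 -> 73613031 as stated in the issue text.
G_START = 73613031 - 145

GAP_RIGHT = "M185 I3 M250"  # gap at n.186_188, as in the NCBI alignment
GAP_LEFT = "M146 I3 M289"   # same gap shifted 5', at n.147_149 (derived)


def parse_gap(gap):
    """Parse a string such as 'M185 I3 M250' into (op, length) pairs."""
    return [(op, int(length)) for op, length in re.findall(r"([MID])(\d+)", gap)]


def n_to_g(n_pos, gap, g_start=G_START):
    """Map a 1-based transcript position to a genome coordinate.
    Returns None when the position falls in transcript-only bases."""
    t_used = 0    # transcript bases consumed so far
    g = g_start   # genome coordinate of the next aligned base
    for op, length in parse_gap(gap):
        if op == "M":
            if n_pos <= t_used + length:
                return g + (n_pos - t_used - 1)
            t_used += length
            g += length
        elif op == "I":  # bases present in the transcript only
            if n_pos <= t_used + length:
                return None
            t_used += length
        else:  # "D": bases present in the genome only
            g += length
    raise ValueError("position lies beyond the aligned transcript")


for name, gap in (("left", GAP_LEFT), ("right", GAP_RIGHT)):
    print(name, n_to_g(146, gap), n_to_g(150, gap))
# left 73613031 73613032
# right 73613031 73613035
```

With the left-shifted gap, positions n.147 through n.149 have no genome counterpart; with the right-shifted gap, the unmapped positions are n.186 through n.188.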
Ideally, hgvs will warn (#166) when an ambiguous alignment affects results.