Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ambiguous genome-transcript alignments cause ambiguous variant projections #514

Open
reece opened this issue Aug 23, 2018 · 8 comments
Open
Labels
keep alive exempt issue from staleness checks

Comments

@reece
Copy link
Member

reece commented Aug 23, 2018

When a genome-transcript alignment is ambiguous, variant projections in that region are also ambiguous.

Consider this alignment provided by NCBI for ALMS1 in https://ftp.ncbi.nih.gov/refseq/H_sapiens/alignments/GCF_000001405.25_knownrefseq_alignments.gff3 (as of this date), line 18647:

NC_000002.11      RefSeq  cDNA_match      73612886        73613320        431.411 +       .       ID=df5c6169-22bc-4913-b001-6eabfe95b866;Target=NM_015120.4 1 438 +;gap_count=2;identity=0.999536;idty=0.993151;num_ident=12922;num_mismatch=0;pct_coverage=99.9536;pct_identity_gap=99.9536;pct_identity_ungap=100;Gap=M185 I3 M250

Note the CIGAR string M185 I3 M250. That alignment is shown below, with the gap at n.186_188.

   NM_015120.4  c         26 >                                                        >       79 
   NM_015120.4  n        137 > CGGGCGAGCTGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAAG >      190
                               |||||||||||||||||||||||||||||||||||||||||||||||||   || 49=3I2= 
   NC_000002.11 g   73613022 > CGGGCGAGCTGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGA---AG > 73613072
                             < GCCCGCTCGACCTCCTCCTCCTCCTCCTCCTCCTCCTCCTCCTCCTCCT---TC <

This region contains a GGA repeat, with 13x in the genome and 14x in the transcript. An equivalent alignment is with the gap shifted 5' (relative to genome and transcript) and the gap at n.147_149.

   NM_015120.4  c         26 >                                                        >       79
   NM_015120.4  n        137 > CGGGCGAGCTGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAAG >      190
                               ||||||||||   ||||||||||||||||||||||||||||||||||||||||| 10=3I41= 
   NC_000002.11 g   73613022 > CGGGCGAGCT---GGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAAG > 73613072
                             < GCCCGCTCGA---CCTCCTCCTCCTCCTCCTCCTCCTCCTCCTCCTCCTCCTTC <

Many other equivalent alignments are available, including circularly permuted forms. The alignment is ambiguous. (NOTE: Although HGVS specifies that variants should be right-shifted, it says nothing about how alignments should be represented.)

Here are both alignments in one view:

                               4         5         6         7         8         9
                            789012345678901234567890123456789012345678901234567890
NM_015120.4  n        137 > CGGGCGAGCTGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAAG
                                     ^   ^
                            2       3|   |       4         5         6         7
                            2345678901   23456789012345678901234567890123456789012
NC_000002.11 g   73613022 > CGGGCGAGCT---GGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAAG
                                     ^   ^
                            2       3|   |    4         5         6         7
                            2345678901234567890123456789012345678901234567890   12
NC_000002.11 g   73613022 > CGGGCGAGCTGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGA---AG
                                     ^   ^

Now, consider two projecting two variants, NM_015120.4:n.146T>C and NM_015120.4n.150G>C to the genome. When the gap is represented at n.147_149, these two variants will project to adjacent positions 73613031 and 73613032 on the genome. When the gap is represented at n.186_188, the genome variants are at 73613031 and 73613035. Both views are correct.

Ideally, hgvs will warn (#166) when an ambiguous alignment affects results.

@AngieHinrichs
Copy link

I think of the ambiguously placed gap as an independent variant (insGGA), which in HGVS would use the 3'-shifted location: g.73613070_ 73613071insGGA. So in my opinion, the rightmost placement of the alignment gap is preferable, and NM_015120.4n.150G>C --> g.73613035. The transcript has ...TGGAG > TGGAC, and the genome can too. Otherwise the genome changing at 73613032 assumes that insGGA has already been applied in its leftmost position. I see your point that either is mathematically valid, and using the same alignment, the transform would be reversible -- but at least for the sake of consistency, I like using the rightmost interpretation of the alignment gap. (Should we take it up with HGVS?)

Substitutions that may or may not overlap ambiguously placed alignment gaps are about the trickiest case I've encountered. I have a trick for making sure that they are detected -- going from genome to transcript, but it might carry over in a straightforward way to transcript to genome. YMMV.

Before translating coordinates, I look for ambiguously placed gaps in the transcript alignment -- and I modify the alignment to have a double-sided gap over the entire ambiguous region. This is an easy modification in UCSC's internal representation of alignments as a series of alignment block coordinates on target and query. In this case, instead of skipping 3 transcript bases and 0 genome bases, the coordinates skip over 42 transcript bases and 39 genome bases.

Then it's trivial to detect whether a variant's projection is affected by an alignment gap -- no need to go searching the neighborhood, or taking chances with detecting the interaction or not.

Having determined that the variant is somehow affected by a possibly ambiguously placed alignment gap, I apply the variant change to the ambiguous region sequence, which has the same effect as using the rightmost placement (my preference) of the alignment gap. In case the variant is an insertion or deletion that will need right-shifting itself, I trim matching bases from the modified ambiguous regions of genome and transcript, first from the left to get the rightmost position, then from the right, and that yields the right-shifted position and minimal base change.

Complicated, maybe -- but consistent in the face of differing alignment gap placements. Written in a bit of a hurry, so if I have stated something unclearly, let me know and I'll try to make it more coherent. :)

@reece
Copy link
Member Author

reece commented Aug 23, 2018 via email

@AngieHinrichs
Copy link

Thanks Reece!

with the addition that shifting should be with respect to the transcript.

Even when going from c./n. to g.? (In general I would rather keep things transcript-centric too, but HGVS wants to go 3' on any reference.)

Changing them will change results for users, which is bad.

I hope that all tools can converge on the same behavior eventually, but yes, inconsistency with past results is a problem for users. Hopefully only bad enough to still change the answer but give a warning about it, which seems to be what you intend to do. (?)

BTW, your approach is essentially equivalent to what the NCBI does with SPDI.

Yes, SPDI was my inspiration for tweaking the transcript alignments. I hope SPDI becomes a common interchange format -- unambiguous, easy to parse, what's not to love! (except the string length when there's a large insertion or duplication, but what can ya do)

It's a great solution as long as the set of alternate sequences is known in advance.

Don't we already need all the sequences in advance in order to correctly right-shift? I guess fetching the genome sequence would be the difficult part here... while in genome browser-land, we start with a genome sequence. :)

@reece
Copy link
Member Author

reece commented Sep 19, 2018

See #392

Copy link

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the stale Issue is stale and subject to automatic closing label Dec 28, 2023
Copy link

github-actions bot commented Jan 4, 2024

This issue was closed because it has been stalled for 7 days with no activity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jan 4, 2024
@reece reece reopened this Feb 19, 2024
@reece reece added resurrected and removed stale Issue is stale and subject to automatic closing labels Feb 19, 2024
@reece
Copy link
Member Author

reece commented Feb 19, 2024

This issue was closed by stalebot. It has been reopened to give more time for community review. See biocommons coding guidelines for stale issue and pull request policies. This resurrection is expected to be a one-time event.

Copy link

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the stale Issue is stale and subject to automatic closing label May 23, 2024
@jsstevenson jsstevenson added keep alive exempt issue from staleness checks and removed stale Issue is stale and subject to automatic closing resurrected labels May 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
keep alive exempt issue from staleness checks
Projects
None yet
Development

No branches or pull requests

3 participants