Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hgvs package doesn't handle transcripts with discontinuous CDS numbering #385

Closed
reece opened this issue Nov 5, 2016 · 4 comments
Closed
Labels
duplicate This issue or pull request already exists enhancement New feature or request

Comments

@reece
Copy link
Member

reece commented Nov 5, 2016

Originally reported by: Reece Hart (Bitbucket: reece, GitHub: reece)


Observation (Reported by @vrajaraman): In transcripts with discontinuous CDS numbering, variant projection is wrong 3' from the discontinuity.

Example (from Veena):

NM_033517.1
CDS:0,5196
strand:+
Positions 1305-1341 on exon 11 have no corresponding alignment on Grch37. I ran some variants beyond exon 11 thru’ hgvs c_to_g

c_to_g,NM_033517.1:c.1403A>C, Actual c_to_g:NC_000022.10:g.51136085A>C, Expected:g.51136048A>C
g_to_c,NC_000022.10:g.51136048A>C,NM_033517.1:c.1366A>C
g_to_c,NC_000022.10:g.51136085A>C,NM_033517.1:c.1403A>C
c_to_g,NM_033517.1:c.2200G>C, Actual c_to_g:NC_000022.10:g.51153447G>C, Expected:g.51153410G>C
g_to_c,NC_000022.10:g.51153410G>C,NM_033517.1:c.2163G>C
g_to_c,NC_000022.10:g.51153447G>C,NM_033517.1:c.2200G>C
c_to_g,NM_033517.1:c.1564G>C, Actual c_to_g:NC_000022.10:g.51137220G>C, Expected:g.51137183G>C
g_to_c,NC_000022.10:g.51137183G>C,NM_033517.1:c.1527G>C
g_to_c,NC_000022.10:g.51137220G>C,NM_033517.1:c.1564G>C

Proposal:

  • issue raise exception when CDS numbering is discontinuous #386 immediate fix: in 0.4.x and 0.5.0dev branches, raise exception with discontinuous CDS
  • This issue: permit discontinuous CDS by explicitly tracking unaligned regions, only in dev branch. This requires figuring what the right response should be when projecting variants in CDS regions that have no genomic alignment.

See also: https://invitae.jira.com/browse/CORE-208 (Invitae internal)


@reece
Copy link
Member Author

reece commented Nov 5, 2016

Original comment by Reece Hart (Bitbucket: reece, GitHub: reece):


Cause: When preparing for an alignment between a transcript and genomic segment, hgvs concatenates cigar strings from the alignments, and adds appropriate intronic gaps, resulting in a transcript-genome cigar like 63= 343N 204= 1370N 72= 1891N 109=... The problem is that there's an assumption that the transcript coordinates are continuous, which is wrong here. For NM_033517.1, exon 11 is c.1304_1498, but the NCBI gff alignment covers only c.1342_1498. There is no aligned sequence for c.1305_1341.

This issue affects 22 distinct transcripts in 15 distinct genes in uta_20160908 (27 gaps total):

reece@[local]/uta_dev=> select T.hgnc,ES.exon_set_id,ES.tx_ac,format('[%s,%s]',E1.end_i,E2.start_i) as gap,E1.exon_id as e1_exon_id,E1.ord as e1_ord,E1.start_i as e1_start_i,E1.end_i as e1_end_i,E2.exon_id as e2_exon_id,E2.ord as e2_ord,E2.start_i as e2_start_i,E2.end_i as e2_end_i from exon_set ES left join transcript T on ES.tx_ac=T.ac join exon E1 on ES.exon_set_id=E1.exon_set_id join exon E2 on ES.exon_set_id=E2.exon_set_id and E2.ord=E1.ord+1 and E1.end_i!=E2.start_i where ES.alt_aln_method='transcript' order by hgnc, tx_ac;
┌─────────────┬─────────────┬────────────────┬───────────────┬────────────┬────────┬────────────┬──────────┬────────────┬────────┬────────────┬──────────┐
│    hgnc     │ exon_set_id │     tx_ac      │      gap      │ e1_exon_id │ e1_ord │ e1_start_i │ e1_end_i │ e2_exon_id │ e2_ord │ e2_start_i │ e2_end_i │
├─────────────┼─────────────┼────────────────┼───────────────┼────────────┼────────┼────────────┼──────────┼────────────┼────────┼────────────┼──────────┤
│ AKR1C8P     │      809083 │ NR_027916.2    │ [596,651]     │    6897624 │      4 │        517 │      596 │    6897625 │      5 │        651 │      718 │
│ BAHCC1      │      807958 │ NM_001291324.1 │ [300,375]     │    6884337 │      1 │        289 │      300 │    6884338 │      2 │        375 │      500 │
│ BAHCC1      │      807958 │ NM_001291324.1 │ [197,289]     │    6884336 │      0 │          0 │      197 │    6884337 │      1 │        289 │      300 │
│ CASP8AP2    │      343792 │ NM_001137667.1 │ [6089,6136]   │    3470817 │      8 │       5988 │     6089 │    3470818 │      9 │       6136 │     6772 │
│ CASP8AP2    │      343793 │ NM_001137668.1 │ [5956,6003]   │    3470827 │      8 │       5855 │     5956 │    3470828 │      9 │       6003 │     6639 │
│ CASP8AP2    │      349599 │ NM_012115.3    │ [6128,6175]   │    3533041 │      8 │       6027 │     6128 │    3533042 │      9 │       6175 │     6811 │
│ FADS6       │      809057 │ NM_178128.5    │ [233,287]     │    6897381 │      0 │          0 │      233 │    6897382 │      1 │        287 │      460 │
│ HP09025     │      809105 │ NR_109783.1    │ [2119,2163]   │    6897791 │      1 │        391 │     2119 │    6897792 │      2 │       2163 │     4237 │
│ KIR3DX1     │      809117 │ NR_136268.1    │ [3112,3188]   │    6897948 │      8 │       1281 │     3112 │    6897949 │      9 │       3188 │     3586 │
│ MROH8       │      350957 │ NM_152503.5    │ [201,230]     │    3548608 │      0 │          0 │      201 │    3548609 │      1 │        230 │      395 │
│ MROH8       │      351464 │ NM_213631.2    │ [201,230]     │    3553847 │      0 │          0 │      201 │    3553848 │      1 │        230 │      395 │
│ MROH8       │      351465 │ NM_213632.2    │ [201,230]     │    3553861 │      0 │          0 │      201 │    3553862 │      1 │        230 │      395 │
│ MUC19       │      351098 │ NM_173600.2    │ [19114,19704] │    3550193 │     68 │      19060 │    19114 │    3550194 │     69 │      19704 │    19762 │
│ MUC22       │      734393 │ NM_001322469.1 │ [4564,4594]   │    6185639 │      2 │        254 │     4564 │    6185640 │      3 │       4594 │     4883 │
│ PKD1P6      │      809110 │ NR_123721.1    │ [1924,2030]   │    6897867 │      9 │       1502 │     1924 │    6897868 │     10 │       2030 │     2654 │
│ SHANK3      │      350666 │ NM_033517.1    │ [1304,1341]   │    3545594 │      9 │       1030 │     1304 │    3545595 │     10 │       1341 │     1498 │
│ SLC14A2-AS1 │      360864 │ NR_110899.1    │ [68,89]       │    3599596 │      0 │          0 │       68 │    3599597 │      1 │         89 │      152 │
│ TMEM185A    │      807907 │ NM_001174092.2 │ [150,177]     │    6883698 │      0 │          0 │      150 │    6883699 │      1 │        177 │      370 │
│ TMEM185A    │      807946 │ NM_001282302.1 │ [150,177]     │    6884139 │      0 │          0 │      150 │    6884140 │      1 │        177 │      370 │
│ TMEM185A    │      809100 │ NR_104121.1    │ [150,177]     │    6897751 │      0 │          0 │      150 │    6897752 │      1 │        177 │      370 │
│ USO1        │      807955 │ NM_001290049.1 │ [364,387]     │    6884253 │      0 │          0 │      364 │    6884254 │      1 │        387 │      398 │
│ USO1        │      807955 │ NM_001290049.1 │ [398,494]     │    6884254 │      1 │        387 │      398 │    6884255 │      2 │        494 │      515 │
│ USO1        │      808889 │ NM_003715.3    │ [398,494]     │    6895573 │      1 │        387 │      398 │    6895574 │      2 │        494 │      515 │
│ USO1        │      808889 │ NM_003715.3    │ [364,387]     │    6895572 │      0 │          0 │      364 │    6895573 │      1 │        387 │      398 │
│ ZNF718      │      807953 │ NM_001289930.1 │ [821,918]     │    6884241 │      0 │          0 │      821 │    6884242 │      1 │        918 │      931 │
│ ZNF718      │      807953 │ NM_001289930.1 │ [975,1041]    │    6884243 │      2 │        953 │      975 │    6884244 │      3 │       1041 │     4292 │
│ ZNF718      │      807953 │ NM_001289930.1 │ [931,953]     │    6884242 │      1 │        918 │      931 │    6884243 │      2 │        953 │      975 │
└─────────────┴─────────────┴────────────────┴───────────────┴────────────┴────────┴────────────┴──────────┴────────────┴────────┴────────────┴──────────┘
(27 rows)

Time: 4117.351 ms

@reece
Copy link
Member Author

reece commented Dec 7, 2016

Original comment by Reece Hart (Bitbucket: reece, GitHub: reece):


Now that #386 is closed, I am reframing this issue as an enhancement to support transcripts alignments that have gaps in transcript sequence.

@reece
Copy link
Member Author

reece commented Jan 11, 2017

Original comment by Reece Hart (Bitbucket: reece, GitHub: reece):


Current code rejects using transcripts that lead to errors. Affects a small number of transcripts. Removing from 0.5.0 and making low priority.

@reece
Copy link
Member Author

reece commented Sep 11, 2019

Closing as duplicate of biocommons/uta#198.

@reece reece closed this as completed Sep 11, 2019
@reece reece added the duplicate This issue or pull request already exists label Sep 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
duplicate This issue or pull request already exists enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant