Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle transcripts with unaligned regions within exons #198

Closed
reece opened this issue Sep 9, 2016 · 15 comments
Closed

Handle transcripts with unaligned regions within exons #198

reece opened this issue Sep 9, 2016 · 15 comments
Assignees
Labels
closed-by-stale enhancement New feature or request stale Issue is stale and subject to automatic closing

Comments

@reece
Copy link
Member

reece commented Sep 9, 2016

Originally reported by Reece Hart (Bitbucket: reece, GitHub: reece) in biocommons/uta #198
Migrated by bitbucket-issue-migration on 2016-09-09 15:15:07


In UTA's data model, an exon_set represents a set of exons for a transcript on a single sequence. A single transcript accession will typically have an exon set defined on the transcript sequence itself and additional exon sets defined on various genomic sequences. A transcript is associated with multiple exon sets, each of which is an instantiation of that transcript on a distinct sequence.

UTA stores exon-level alignments between the pairs of exon sets; typically, these are alignments between a transcript exon set and a genomic exon sets. A core assumption in UTA is that a alignments are made between exons.

In the past, I've referred to this as transcripts having different exon structures on GRCh37 and 38. While this is true in terms of UTA data model definitions, a more precise colloquial explanation is that UTA cannot handle cases where a single exon is split into two aligned regions.


Example:

For example, HRH1 has different transcript definitions on 37 and 38. See below. This means that there's no consistent "transcript" definition, which in turn means that there's no way to have a single current definition, and therefore that it's unclear what we're aligning to.

This presumably happens because the annotation abuts an assembly gap, which means that the 37 alignment is truncated.

zgrep NM_001098212.1 data/ncbi/2016-09-08/GCF_000001405.*.txinfo.gz
data/ncbi/2016-09-08/GCF_000001405.25.txinfo.gz:NCBI    NM_001098212.1  HRH1    62,1526 27,4278
data/ncbi/2016-09-08/GCF_000001405.33.txinfo.gz:NCBI    NM_001098212.1  HRH1    62,1526 0,27;27,4278

Note that on 37, this gene has one exon from transcript interval [27,4278], whereas on 38 it also has the [0,27] interval.


NOTE: A simple way to find these in UTA is to look for a '/' in the alt_aln_method field. When a conflict occurs, the loader deprecates one of them by appending a hash of the exon coordinates (which will be unique). This was implemented to handles cases where exon coordinates changed between GRCh37 patch level releases (which was bad enough, but at least the notion of a single transcript made sense). Now, with two assemblies, there are more of these conflicts and we need a better way to resurrect the notion of a single transcript.

@reece
Copy link
Member Author

reece commented Sep 9, 2016

Original comment on Bitbucket by Reece Hart (Bitbucket: reece, GitHub: reece):


This issue affects 146 distinct transcripts in 81 genes:

reece@[local]/uta_dev=> select distinct T.hgnc, count(distinct tx_ac) as n_transcripts, string_agg(distinct tx_ac, ', ')
from uta_20160908.exon_set ES
join uta_20160908.transcript T on ES.tx_ac=T.ac
where alt_aln_method ~ '^transcript/'
group by 1
order by 1;
┌──────────────┬───────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│     hgnc     │ n_transcripts │                                                                           string_agg                                                                           │
├──────────────┼───────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ AKR1C8P      │             1 │ NR_027916.2                                                                                                                                                    │
│ BAHCC1       │             1 │ NM_001291324.1                                                                                                                                                 │
│ BAI1         │             1 │ NM_001702.2                                                                                                                                                    │
│ BEAN1        │             2 │ NM_001197224.2, NM_001197225.2                                                                                                                                 │
│ C1R          │             1 │ NM_001733.4                                                                                                                                                    │
│ CAPN8        │             1 │ NM_001143962.1                                                                                                                                                 │
│ CASP8AP2     │             3 │ NM_001137667.1, NM_001137668.1, NM_012115.3                                                                                                                    │
│ CCDC120      │             6 │ NM_001163321.2, NM_001163322.2, NM_001163323.2, NM_001271835.1, NM_001271836.1, NM_033626.3                                                                    │
│ CCDC151      │             1 │ NM_145045.4                                                                                                                                                    │
│ CD207        │             1 │ NM_015717.4                                                                                                                                                    │
│ CDRT4        │             1 │ NM_001204477.1                                                                                                                                                 │
│ CELF2        │             4 │ NM_001025076.2, NM_001025077.2, NM_001083591.1, NM_006561.3                                                                                                    │
│ CHODL        │             4 │ NM_001204175.1, NM_001204176.1, NM_001204177.1, NM_001204178.1                                                                                                 │
│ CYFIP2       │             3 │ NM_001037333.2, NM_001291722.1, NM_014376.3                                                                                                                    │
│ DKFZP434A062 │             1 │ NR_026964.3                                                                                                                                                    │
│ DNAH11       │             1 │ NM_001277115.1                                                                                                                                                 │
│ DOCK1        │             2 │ NM_001290223.1, NM_001380.4                                                                                                                                    │
│ DRD5P2       │             1 │ NR_111001.1                                                                                                                                                    │
│ EIF4G3       │             1 │ NM_003760.4                                                                                                                                                    │
│ EPHB6        │             4 │ NM_001280794.2, NM_001280795.2, NM_004445.5, NR_104001.2                                                                                                       │
│ EYS          │             1 │ NM_198283.1                                                                                                                                                    │
│ FADS6        │             1 │ NM_178128.5                                                                                                                                                    │
│ FER1L4       │             1 │ NR_119376.1                                                                                                                                                    │
│ GATSL2       │             1 │ NM_001145064.2                                                                                                                                                 │
│ GOLGA6L6     │             1 │ NM_001145004.2                                                                                                                                                 │
│ GRHL3        │             1 │ NM_198174.2                                                                                                                                                    │
│ GRK1         │             1 │ NM_002929.2                                                                                                                                                    │
│ GTF2IRD1     │             3 │ NM_001199207.1, NM_005685.3, NM_016328.2                                                                                                                       │
│ HIPK2        │             2 │ NM_001113239.2, NM_022740.4                                                                                                                                    │
│ HP09025      │             1 │ NR_109783.1                                                                                                                                                    │
│ HRH1         │             1 │ NM_001098212.1                                                                                                                                                 │
│ IKZF1        │            10 │ NM_001220767.2, NM_001220768.2, NM_001220770.2, NM_001220771.2, NM_001291839.1, NM_001291840.1, NM_001291841.1, NM_001291842.1, NM_001291843.1, NM_001291844.1 │
│ IL7R         │             1 │ NM_002185.3                                                                                                                                                    │
│ KIAA1324L    │             1 │ NM_001291991.1                                                                                                                                                 │
│ KIR2DL4      │             1 │ NM_002255.5                                                                                                                                                    │
│ KIR3DX1      │             5 │ NR_026716.2, NR_104095.1, NR_104096.1, NR_104097.1, NR_136268.1                                                                                                │
│ KRTAP5-4     │             1 │ NM_001012709.1                                                                                                                                                 │
│ KSR2         │             1 │ NM_173598.4                                                                                                                                                    │
│ LAIR1        │             8 │ NM_001289023.2, NM_001289025.2, NM_001289026.2, NM_001289027.2, NM_002287.5, NM_021706.4, NR_110279.2, NR_110280.2                                             │
│ LILRP2       │             1 │ NR_003061.2                                                                                                                                                    │
│ LIN37        │             1 │ NM_019104.2                                                                                                                                                    │
│ LINC01389    │             1 │ NR_126355.1                                                                                                                                                    │
│ LOC100287072 │             1 │ NR_073509.1                                                                                                                                                    │
│ LOC101928622 │             1 │ NR_125902.1                                                                                                                                                    │
│ LOC105616981 │             1 │ NR_131227.1                                                                                                                                                    │
│ MRGPRF       │             1 │ NM_001098515.1                                                                                                                                                 │
│ MROH8        │             1 │ NM_152503.5                                                                                                                                                    │
│ NBPF20       │             1 │ NM_001278267.1                                                                                                                                                 │
│ NEK3         │             2 │ NM_002498.2, NM_152720.2                                                                                                                                       │
│ OR8G1        │             1 │ NM_001002905.1                                                                                                                                                 │
│ OVCH2        │             1 │ NM_198185.4                                                                                                                                                    │
│ PARG         │             2 │ NM_001303489.1, NR_130169.1                                                                                                                                    │
│ PCDH15       │             4 │ NM_001142769.1, NM_001142770.1, NM_001142771.1, NM_001142772.1                                                                                                 │
│ PCDHA3       │             1 │ NM_018906.2                                                                                                                                                    │
│ PCDHA5       │             1 │ NM_018908.2                                                                                                                                                    │
│ PCDHGB2      │             1 │ NM_018923.2                                                                                                                                                    │
│ PCDHGB4      │             1 │ NM_003736.2                                                                                                                                                    │
│ PCDHGB6      │             1 │ NM_018926.2                                                                                                                                                    │
│ PIGB         │             1 │ NM_004855.4                                                                                                                                                    │
│ PKD1P6       │             1 │ NR_123721.1                                                                                                                                                    │
│ POLR3C       │             2 │ NM_001303456.1, NM_006468.7                                                                                                                                    │
│ PRKAR1A      │             1 │ NM_001278433.1                                                                                                                                                 │
│ PSRC1        │             1 │ NM_001032291.2                                                                                                                                                 │
│ PTPRQ        │             1 │ NM_001145026.1                                                                                                                                                 │
│ RDX          │             4 │ NM_001260492.1, NM_001260493.1, NM_001260495.1, NM_001260496.1                                                                                                 │
│ RSPO1        │             4 │ NM_001038633.3, NM_001242908.1, NM_001242909.1, NM_001242910.1                                                                                                 │
│ S100A1       │             1 │ NM_006271.1                                                                                                                                                    │
│ SHANK2       │             1 │ NM_012309.4                                                                                                                                                    │
│ SHANK3       │             1 │ NM_033517.1                                                                                                                                                    │
│ SLC22A5      │             1 │ NM_003060.3                                                                                                                                                    │
│ TMEM185A     │             4 │ NM_001174092.2, NM_001282302.1, NM_032508.3, NR_104121.1                                                                                                       │
│ TMEM216      │             2 │ NM_001173991.2, NM_016499.5                                                                                                                                    │
│ TMEM56       │             1 │ NM_001199679.1                                                                                                                                                 │
│ TVP23C-CDRT4 │             2 │ NM_001204478.1, NR_037924.1                                                                                                                                    │
│ UGT2B10      │             3 │ NM_001075.5, NM_001144767.2, NM_001290091.1                                                                                                                    │
│ USO1         │             2 │ NM_001290049.1, NM_003715.3                                                                                                                                    │
│ VPRBP        │             2 │ NM_001171904.1, NM_014703.2                                                                                                                                    │
│ VSX1         │             2 │ NM_001256271.1, NM_001256272.1                                                                                                                                 │
│ WDR65        │             1 │ NM_001195831.2                                                                                                                                                 │
│ WDR72        │             1 │ NM_182758.3                                                                                                                                                    │
│ ZNF718       │             1 │ NM_001289930.1                                                                                                                                                 │
└──────────────┴───────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(81 rows)

@reece reece added major enhancement New feature or request labels Sep 9, 2016
@reece
Copy link
Member Author

reece commented Jan 17, 2017

Another example: RSPO1.

From NCBI's alignment files:

snafu$ zgrep NM_001038633.3  src/refseq/H_sapiens/alignments/GCF_000001405.25_knownrefseq_alignments.gff3 | expand -t8 | cut -c1-80,137- | cut -d';' -f1
NC_000001.10    RefSeq  cDNA_match      38100238        38100595        358     Target=NM_001038633.3 1 358 +
NC_000001.10    RefSeq  cDNA_match      38097959        38098025        67      Target=NM_001038633.3 359 425 +
NC_000001.10    RefSeq  cDNA_match      38097516        38097694        179     Target=NM_001038633.3 426 604 +
NC_000001.10    RefSeq  cDNA_match      38095240        38095621        382     Target=NM_001038633.3 605 986 +
NC_000001.10    RefSeq  cDNA_match      38082156        38082347        192     Target=NM_001038633.3 987 1178 +
NC_000001.10    RefSeq  cDNA_match      38079856        38080005        150     Target=NM_001038633.3 1179 1328 +
NC_000001.10    RefSeq  cDNA_match      38079376        38079564        189     Target=NM_001038633.3 1329 1517 +
NC_000001.10    RefSeq  cDNA_match      38077421        38078593        1173    Target=NM_001038633.3 1518 2690 +
NC_000001.10    RefSeq  cDNA_match      38076951        38077349        399     Target=NM_001038633.3 2691 3089 +
snafu$ zgrep NM_001038633.3  src/refseq/H_sapiens/alignments/GCF_000001405.28_knownrefseq_alignments.gff3 | expand -t8 | cut -c1-80,137- | cut -d';' -f1
NC_000001.11    RefSeq  cDNA_match      37634566        37634923        358     Target=NM_001038633.3 1 358 +
NC_000001.11    RefSeq  cDNA_match      37632287        37632353        67      Target=NM_001038633.3 359 425 +
NC_000001.11    RefSeq  cDNA_match      37631844        37632022        179     Target=NM_001038633.3 426 604 +
NC_000001.11    RefSeq  cDNA_match      37629568        37629949        382     Target=NM_001038633.3 605 986 +
NC_000001.11    RefSeq  cDNA_match      37616484        37616675        192     Target=NM_001038633.3 987 1178 +
NC_000001.11    RefSeq  cDNA_match      37614184        37614333        150     Target=NM_001038633.3 1179 1328 +
NC_000001.11    RefSeq  cDNA_match      37613704        37613892        189     Target=NM_001038633.3 1329 1517 +
NC_000001.11    RefSeq  cDNA_match      37611350        37612921        1572    Target=NM_001038633.3 1518 3089 +

There are 9 aligned regions in GRCh37 and 8 in GRCh38.
The NCBI RefSeq transcript is defined with 8 exons.
Note that exons 1-7 the same length in both. The last alignment, which corresponds to a signle exon in GRCh38, is essentially the concatenation of the aligned regions 8 and 9 from GRCh37.

@reece
Copy link
Member Author

reece commented May 9, 2017

https://www.ncbi.nlm.nih.gov/nuccore/380748962

     exon            1031..1304
                     /gene="SHANK3"
                     /gene_synonym="DEL22q13.3; PROSAP2; PSAP2; SCZD15;
                     SPANK-2"
                     /inference="alignment:Splign:1.39.8"
     exon            1305..1498
                     /gene="SHANK3"
                     /gene_synonym="DEL22q13.3; PROSAP2; PSAP2; SCZD15;
                     SPANK-2"
                     /inference="alignment:Splign:1.39.8"
     exon            1499..1612
                     /gene="SHANK3"
                     /gene_synonym="DEL22q13.3; PROSAP2; PSAP2; SCZD15;
                     SPANK-2"
                     /inference="alignment:Splign:1.39.8"

The relevant regions of GCF_000001405.25_knownrefseq_alignments.gff3 (GRCh37):

snafu$ grep NM_033517 GCF_000001405.25_knownrefseq_alignments.gff3 | expand -t8 | cut -c1-13,40-48,55-65,144- | cut -d';' -f1
...
NC_000022.10  51133203  51133476 NM_033517.1 1031 1304 +
NC_000022.10  51135985  51136143 NM_033517.1 1342 1498 +

and GCF_000001405.28_knownrefseq_alignments.gff3 (GRCh38):

snafu$ grep NM_033517 GCF_000001405.28_knownrefseq_alignments.gff3 | expand -t8 | cut -c1-13,40-48,55-65,144- | cut -d';' -f1
...
NC_000022.11  50694775  50695048 NM_033517.1 1031 1304 +
NC_000022.11  50697557  50697715 NM_033517.1 1342 1498 +

So, the gff files for 37 and 38 are missing an exon corresponding to c.1305_1341.

It should be possible to write this as a 37nt deletion (whole exon).

TODO:

  • simulate change in UTA (dev) and test hgvs against it.
  • modify ncbi parser to reject discontiguous transcripts from gff files (or perhaps synthesize an exon if this is common).

@reece reece changed the title Need mechanism to handle cases where 37 and 38 have differing transcript definitions Handle transcripts with unaligned regions within exons Jul 13, 2017
@jfreidin
Copy link

jfreidin commented Dec 10, 2018

Hi @reece,
Is the following behavior an instance of this or another issue?
NM_006060 is a transcript of IKZF1, although not listed in your table above.
Also, On May 10, 2018 this sequence version replaced NM_006060.5.

>>> am = hgvs.assemblymapper.AssemblyMapper(hdp, assembly_name="GRCh37")
>>> am.c_to_g(hp('NM_006060.5:c.1A>T'))
hgvs.exceptions.HGVSDataNotAvailableError: No alignments for NM_006060.5 in GRCh37 using splign
>>> am.c_to_g(hp('NM_006060.6:c.1A>T'))
hgvs.exceptions.HGVSDataNotAvailableError: No alignments for NM_006060.6 in GRCh37 using splign
>>> am = hgvs.assemblymapper.AssemblyMapper(hdp, assembly_name="GRCh38")
>>> am.c_to_g(hp('NM_006060.5:c.1A>T'))
SequenceVariant(ac=NC_000007.14, type=g, posedit=50319062A>T)
>>> am.c_to_g(hp('NM_006060.6:c.1A>T'))
hgvs.exceptions.HGVSDataNotAvailableError: No alignments for NM_006060.6 in GRCh38 using splign

@reece
Copy link
Member Author

reece commented Sep 10, 2019

@jfreidin : Apologies, I missed this question way back then.
There are two things going on here.

  1. Yes, NM_006060.6 suffers from this issue. Specifically, the transcript alignment doesn't start at 1).
  2. hgvs only knows about what's in UTA. And UTA is based on snapshots from NCBI. For some reason, .5 is in GRCh37.p13 and GRCh38.p12, but not in GRCh38.p2. So, the absence is just because it was missing from the 38-based alignments that I loaded.

@reece reece self-assigned this Sep 10, 2019
@reece
Copy link
Member Author

reece commented Sep 10, 2019

Description of problem

UTA assumes that transcripts align to genomic references exon-wise and in toto. Although this assumption is good for the vast majority of transcripts (works on >99.5% of transcripts aligned to GRCh37.p13 based on data below, and better for GRCh38). Non-alignment may occur anywhere in the alignment and a transcript may have multiple such regions. Note that an unaligned region is different from an alignment gap: an alignment gap is a region of known difference whereas an unaligned region represents an unknown difference with reference.

Prevalence

To detect this issue, we look for the following features in NCBI gff3 alignment files:

  • If the first alignment segment of a transcript does not start at 1, the 5' end is unaligned.
  • If an alignment segment start is not consecutively numbered with the end of the previous alignment segment of the same transcript, there is an internal unaligned region.
  • If the last alignment segment does not include the full length of the transcript, the 3' end is unaligned.

Testing for the first two cases is easy from the alignment files. Results for distinct NM and NR transcript accessions are below:

25 (GRCh37.p13): 108645 tx seen; 108315 clean; 202 5' unaligned; 142 internal unaligned
28 (GRCh38.p2): 58188 tx seen; 58122 clean; 46 5' unaligned; 23 internal unaligned
33 (GRCh38.p7): 68119 tx seen; 68063 clean; 31 5' unaligned; 27 internal unaligned
38 (GRCh38.p12): 68408 tx seen; 68380 clean; 9 5' unaligned; 19 internal unaligned

Note: From GCF_000001405.??_knownrefseq_alignments.gff3, NC references only. {clean} := {seen} - {5' unaligned} - {internal unaligned} (i.e., not accounting for 3' unaligned)

(It's great to see that the fraction of cleanly aligned transcripts has been steadily rising .)

Testing for the 3rd case requires having the transcript lengths around. Although I have a (very large) subset of these lengths, I don't have all. And it doesn't matter: the 3' story is almost certainly similar to that of the 5' story, so let's go with that approximation. That means that GRCh37 users should expect that apx max 202 + 142 + 202 = 546 transcript accessions have unaligned regions.

Discussion

  1. Is this problem worth solving (in UTA or otherwise)? This will depend on client use. However, some clinically relevant genes are involved, so the answer really depends on whether those genes/conditions are relevant to a particular use.

  2. Is this really a UTA problem? It's very likely that these alignments are discontinuous because the underlying assembly is problematic. If that's the case, the root cause is the choice of assembly, not the alignments. That is, if this problem affects you, the much better course is to adopt GRCh38.

  3. Could we support partial alignments? Yes, with significant effort and new schema complexity, we could support discontinuous alignments. This new complexity would have ripple effect to data loading, hgvs, and all clients.

My strong inclination now is that we should not support partial transcript alignments† because it adds significant complexity in order to support transcripts that are aligned to a dead/dying assembly, and it likely won't really addressing the underlying biological complexity anyway.

Comments appreciated.

† Edited: Added "alignments". See @leicray's comment below.

@leicray
Copy link

leicray commented Sep 11, 2019

I was following the discussion (and agreeing entirely with your thoughts on the matter) until I got to the phrase "...should not support partial transcripts...". Should this say "...partial alignments..."? Partial transcripts are an entirely separate issue and have been discussed elsewhere in other contexts.

@reece
Copy link
Member Author

reece commented Sep 11, 2019

Yes, I mean "partial alignments". That is, the partial property is of the alignment, not of the transcript itself.

By "partial transcripts", I think you're referring to accessioned transcripts that we know do not contain the full length message. If so, that's definitely not what I'm talking about!

@jfreidin
Copy link

I don't have any problem with the conclusion not to support partial alignments.
But I'm left wondering if any changes can be made in hgvs or uta to enable mapping GRCh37 NM_006060/IKZF1?

@leicray
Copy link

leicray commented Sep 11, 2019

I cannot see any way forward here because NCBI does not provide alignment information for mapping NM_006060.x onto GRCh37.

@reece
Copy link
Member Author

reece commented Sep 11, 2019

This is a really screwy case that illustrates why we could make UTA precisely wrong by using 37.

First, here's what's available from NCBI:

NM_006060.5 NM_006060.6
released 2018-04-05 2019-08-13
length 6317 6255
GRCh37.p13 (1,6302) (1,6255)
GRCh38.p2 (1,6302) n/a
GRCh38.p7 (1,6302) n/a
GRCh38.p12 n/a (1,6255)

The parens indicate the region of coverage. None of the .5 alignments are full-length (stop 17 bases short), regardless of assembly. Conversely, all of the .6 alignments are full length.

So, the reason that the .5 doesn't appear in UTA is because there's a check to ensure that the transcript exon structure in the gbff (essentially the exon features that you see here) match the exon structure from the alignment. In this case, it doesn't match and is discarded.

Arguably, we could be less strict in this case.

@Peter-J-Freeman
Copy link

I would also agree that it is not necessarily a good idea to support partial alignments. I think the mapping from g. to c. in these instances are better derived from scaffolds and contigs to which the transcript has been accurately aligned.

Have you looked at how many of these genes do not align fully to at least one scaffold or contig?

@jfreidin
Copy link

To @reece's point "we could be less strict in this case", I'm all for that.
I need to map from c. to g. and have no control over whether or when we might switch from 37 to 38.
Ergo, any solution that allows mapping of these CDS changes would benefit me.

Copy link

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the stale Issue is stale and subject to automatic closing label Feb 26, 2024
Copy link

github-actions bot commented Mar 4, 2024

This issue was closed because it has been stalled for 7 days with no activity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Mar 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
closed-by-stale enhancement New feature or request stale Issue is stale and subject to automatic closing
Projects
None yet
Development

No branches or pull requests

4 participants