Handle transcripts with unaligned regions within exons #198
Original comment on Bitbucket by Reece Hart (Bitbucket: reece, GitHub: reece): This issue affects 146 distinct transcripts in 81 genes:
|
Another example: RSPO1. From NCBI's alignment files:
There are 9 aligned regions in GRCh37 and 8 in GRCh38. |
https://www.ncbi.nlm.nih.gov/nuccore/380748962
The relevant regions of GCF_000001405.25_knownrefseq_alignments.gff3 (GRCh37):
and GCF_000001405.28_knownrefseq_alignments.gff3 (GRCh38):
So, the gff files for 37 and 38 are missing an exon corresponding to c.1305_1341. It should be possible to write this as a 37nt deletion (whole exon). TODO:
|
Hi @reece,
|
@jfreidin : Apologies, I missed this question way back then.
|
Description of problem
UTA assumes that transcripts align to genomic references exon-wise and in toto. This assumption holds for the vast majority of transcripts (more than 99.5% of transcripts aligned to GRCh37.p13 based on the data below, and more for GRCh38), but not for all of them. Non-alignment may occur anywhere in the alignment, and a transcript may have multiple such regions. Note that an unaligned region is different from an alignment gap: an alignment gap is a region of known difference, whereas an unaligned region represents an unknown difference from the reference.
Prevalence
To detect this issue, we look for the following features in NCBI gff3 alignment files:
Testing for the first two cases is easy from the alignment files. Results for distinct NM and NR transcript accessions are below:
Note: From (It's great to see that the fraction of cleanly aligned transcripts has been steadily rising.)
Testing for the 3rd case requires having the transcript lengths around. Although I have a (very large) subset of these lengths, I don't have all of them. And it doesn't matter: the 3' story is almost certainly similar to the 5' story, so let's go with that approximation. That means that GRCh37 users should expect that at most approximately 202 + 142 + 202 = 546 transcript accessions have unaligned regions.
Discussion
My strong inclination now is that we should not support partial transcript alignments† because doing so adds significant complexity in order to support transcripts that are aligned to a dead/dying assembly, and it likely won't really address the underlying biological complexity anyway. Comments appreciated. † Edited: Added "alignments". See @leicray's comment below. |
I was following the discussion (and agreeing entirely with your thoughts on the matter) until I got to the phrase "...should not support partial transcripts...". Should this say "...partial alignments..."? Partial transcripts are an entirely separate issue and have been discussed elsewhere in other contexts. |
Yes, I mean "partial alignments". That is, the partial property is of the alignment, not of the transcript itself. By "partial transcripts", I think you're referring to accessioned transcripts that we know do not contain the full length message. If so, that's definitely not what I'm talking about! |
I don't have any problem with the conclusion not to support partial alignments. |
I cannot see any way forward here because NCBI does not provide alignment information for mapping NM_006060.x onto GRCh37. |
This is a really screwy case that illustrates why we could make UTA precisely wrong by using 37. First, here's what's available from NCBI:
The parens indicate the region of coverage. None of the .5 alignments are full-length (they stop 17 bases short), regardless of assembly. Conversely, all of the .6 alignments are full length. So, the reason that the .5 doesn't appear in UTA is that there's a check to ensure that the transcript exon structure in the gbff (essentially the exon features that you see here) matches the exon structure from the alignment. In this case, it doesn't match and the transcript is discarded. Arguably, we could be less strict in this case. |
I would also agree that it is not necessarily a good idea to support partial alignments. I think the mapping from g. to c. in these instances is better derived from scaffolds and contigs to which the transcript has been accurately aligned. Have you looked at how many of these genes do not align fully to at least one scaffold or contig? |
To @reece's point "we could be less strict in this case", I'm all for that. |
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days. |
This issue was closed because it has been stalled for 7 days with no activity. |
Originally reported by Reece Hart (Bitbucket: reece, GitHub: reece) in biocommons/uta #198
Migrated by bitbucket-issue-migration on 2016-09-09 15:15:07
In UTA's data model, an exon_set represents a set of exons for a transcript on a single sequence. A single transcript accession will typically have an exon set defined on the transcript sequence itself and additional exon sets defined on various genomic sequences. A transcript is associated with multiple exon sets, each of which is an instantiation of that transcript on a distinct sequence.
UTA stores exon-level alignments between pairs of exon sets; typically, these are alignments between a transcript exon set and a genomic exon set. A core assumption in UTA is that alignments are made between exons.
In the past, I've referred to this as transcripts having different exon structures on GRCh37 and 38. While this is true in terms of UTA data model definitions, a more precise colloquial explanation is that UTA cannot handle cases where a single exon is split into two aligned regions.
Example:
For example, HRH1 has different transcript definitions on 37 and 38. See below. This means that there's no consistent "transcript" definition, which in turn means that there's no way to have a single current definition, and therefore that it's unclear what we're aligning to.
This presumably happens because the annotation abuts an assembly gap, which means that the 37 alignment is truncated.
Note that on 37, this gene has one exon from transcript interval [27,4278], whereas on 38 it also has the [0,27] interval.
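As a rough illustration of what "unaligned region" means here, the sketch below computes the transcript intervals that no aligned region covers. It assumes the intervals quoted above are 0-based, half-open transcript coordinates and that the transcript is 4278 nt long; both are assumptions for the example, and `unaligned_regions` is a hypothetical helper, not part of UTA.

```python
from typing import List, Tuple

# Assumed convention: 0-based, half-open [start, end) transcript coordinates.
Interval = Tuple[int, int]


def unaligned_regions(aligned: List[Interval], tx_length: int) -> List[Interval]:
    """Return the transcript intervals not covered by any aligned region."""
    gaps: List[Interval] = []
    cursor = 0  # first transcript position not yet known to be aligned
    for start, end in sorted(aligned):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < tx_length:
        gaps.append((cursor, tx_length))  # unaligned 3' end
    return gaps


# HRH1 intervals from the issue text; tx_length=4278 is an assumption.
grch37 = [(27, 4278)]
grch38 = [(0, 27), (27, 4278)]

print(unaligned_regions(grch37, 4278))  # [(0, 27)]
print(unaligned_regions(grch38, 4278))  # []
```

The same function reports 5', internal, and 3' unaligned regions, but the 3' case only works when the transcript length is known.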
NOTE: A simple way to find these in UTA is to look for a '/' in the alt_aln_method field. When a conflict occurs, the loader deprecates one of the conflicting exon sets by appending a hash of the exon coordinates (which will be unique) to its alt_aln_method. This was implemented to handle cases where exon coordinates changed between GRCh37 patch level releases (which was bad enough, but at least the notion of a single transcript made sense). Now, with two assemblies, there are more of these conflicts and we need a better way to resurrect the notion of a single transcript.
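The '/' check described in the note can be sketched as a simple filter. The rows below are made-up placeholders rather than real UTA records, and the exact form of the appended hash is an assumption; only the alt_aln_method field name and the '/' convention come from the note.

```python
def is_deprecated_method(alt_aln_method: str) -> bool:
    """True if the method name carries a '/'-separated suffix (per the note above)."""
    return "/" in alt_aln_method


# Hypothetical exon-set rows; accessions and hash suffix are placeholders.
rows = [
    {"tx_ac": "NM_HYPOTHETICAL_A.1", "alt_ac": "NC_PLACEHOLDER.1", "alt_aln_method": "splign"},
    {"tx_ac": "NM_HYPOTHETICAL_B.1", "alt_ac": "NC_PLACEHOLDER.1", "alt_aln_method": "splign"},
    {"tx_ac": "NM_HYPOTHETICAL_B.1", "alt_ac": "NC_PLACEHOLDER.1", "alt_aln_method": "splign/0a1b2c3d"},
]

# Transcripts with at least one deprecated (conflicting) exon set.
conflicted = sorted({r["tx_ac"] for r in rows if is_deprecated_method(r["alt_aln_method"])})
print(conflicted)  # ['NM_HYPOTHETICAL_B.1']
```

Against a real UTA database, the equivalent would be a query that selects exon sets whose alt_aln_method contains '/'; the table and column layout should be checked against the schema version in use.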