Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HGVS / genome coordinate conversion does not account for cDNA alignment gaps #60

Open
davmlaw opened this issue Sep 22, 2021 · 1 comment

Comments

@davmlaw
Copy link

davmlaw commented Sep 22, 2021

RefSeq transcript sequences can be different from the reference sequence (even if they agree with 1 build they can be different across builds). These sequences are aligned against the genome to produce exon coordinates in GFF releases.

This alignment can sometimes produce insertions / deletions (5-10% of transcripts), eg in the GFF file there is a “cDNA match” string that records the alignment, and has a “Gap” entry:

NC_000002.12    RefSeq  cDNA_match      73385758        73386192        431.411 +       .       ID=daa36283c6058f57b6347eb074291b21;Target=NM_015120.4 1 438 +;assembly_bases_aln=5003;assembly_bases_seq=5003;consensus_splices=44;exon_identity=0.999768;for_remapping=2;gap_count=1;identity=0.999768;idty=0.993151;matches=12925;num_ident=12925;num_mismatch=0;pct_coverage=99.9768;pct_coverage_hiqual=99.9768;pct_identity_gap=99.9768;pct_identity_ungap=100;product_coverage=1;rank=1;splices=44;weighted_identity=0.999771;Gap=M185 I3 M250

NM_015120.4 has cDNA_match Gap=M185 I3 M250 - meaning there was 185 bases matched, 3 bases inserted then back to matching. You can see how this affects PyHGVS conversion downstream from the gaps:

2:73385942 A>T: NM_015120.4(ALMS1):c.74A>T (correct)
2:73385943 A>T: NM_015120.4(ALMS1):c.75A>T (off by 3, VEP gives NM_015120.4:c.78A>T)
2:73385944 G>C: NM_015120.4(ALMS1):c.76G>C (off by 3, VEP gives NM_015120.4:c.79G>C)

@davmlaw
Copy link
Author

davmlaw commented Nov 5, 2021

Hi, I have implemented this myself in a fork at https://github.com/SACGF/hgvs

To resolve the gaps, pyHGVS transcripts now need to contain the cDNA_match information from the GFF files. I've made this data available for download (or you can produce your own) see #26 (comment)

Instead of adding a separate code path to handle alignment gaps, I treat exons without gaps as cDNA matches with 100% alignment. This keeps all code running through the same path so should minimise future work.

The work was done on top of a lot of my existing changes (fixing other bugs etc) but if the project starts accepting pull requests again, please let me know and I can make a patch against master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant