Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for mito m. in translate_from #362

Merged
merged 13 commits into from
Mar 20, 2024
Merged

Add support for mito m. in translate_from #362

merged 13 commits into from
Mar 20, 2024

Conversation

larrybabb
Copy link
Contributor

Added missing support for m. mitochondrial translation from hgvs, beacon, gnomad and spdi formats. Also added mapping of NR_ refseq accessions to n. nomenclature.

@larrybabb larrybabb requested review from a team as code owners March 12, 2024 12:15
@larrybabb larrybabb closed this Mar 12, 2024
@larrybabb
Copy link
Contributor Author

meant to make this a draft PR. I'll try again.

@korikuzma korikuzma reopened this Mar 12, 2024
@korikuzma korikuzma marked this pull request as draft March 12, 2024 12:18
@korikuzma
Copy link
Contributor

@larrybabb I got you

@larrybabb larrybabb marked this pull request as ready for review March 13, 2024 16:00
@korikuzma
Copy link
Contributor

@ahwagner IIRC you had said you had comments that were not submitted. Did you want to review?

Copy link
Member

@ahwagner ahwagner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments included, but my greatest concern is how circular sequences are handled; SequenceLocation requires start <= end, but this is based on sequence linearity assumptions. Should we update the VRS model to allow end < start for circular sequences? If not, how do we plan to handle variants that span the 0 coordinate?

src/ga4gh/vrs/extras/translator.py Outdated Show resolved Hide resolved
@@ -572,7 +578,7 @@ def _to_hgvs(self, vo, namespace="refseq"):
if ns.startswith("GRC") and namespace is None:
continue

if not (any(a.startswith(pfx) for pfx in ("NM", "NP", "NC", "NG"))):
if not (any(a.startswith(pfx) for pfx in ("NM", "NP", "NC", "NG", "NR", "NW"))):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we not want to support X{MRP}_ accession prefixes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should. I just hit these use cases in clinvar. there's not many. and the vrs-python seqrepo does have the XM_... accessions available.

@@ -580,7 +586,7 @@ def _to_hgvs(self, vo, namespace="refseq"):
if ns.startswith("GRC") and namespace is None:
continue

if not (any(a.startswith(pfx) for pfx in ("NM", "NP", "NC", "NG"))):
if not (any(a.startswith(pfx) for pfx in ("NM", "NP", "NC", "NG", "NR", "NW"))):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are NW accessions? I found one associated with a linear Zebrafish reference sequence, but have a hard time finding documentation on what this prefix means and how it should be interpreted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are 182 hgvs expressions in clinvar that have NW_ associated expressions. Here are a couple if it helps to discern what they are...
https://www.ncbi.nlm.nih.gov/clinvar/variation/146167
(patch scaffolding?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@larrybabb should we also add "NT"?

@ehclark
Copy link
Contributor

ehclark commented Mar 15, 2024

My own $0.02 on the circular sequence question. And I'm really just considering human mitochondrial variants here.

Supporting circular reference sequences and allowing locations that span the zero point seems like a solution in search of a problem. My reasons for thinking this:

  • ClinVar does not list any variants close to the zero point
  • The gnomAD 3.1 variants close to the zero point are all substitutions or not problematic (meaning normalization does not expand them over the zero point)
  • The other formats and tools that will be upstream of VRS computation (HGVS, VCF) all treat the mitochondrial chromosome as a linear sequence or do not support variants that span the zero point

For example, I ran this test in Variant Validator:

Unable to validate the submitted variant NC_012920.1:m.16568_3del against the GRCh38 assembly. The following warnings were returned:

The variant positions are valid but we cannot normalize variants spanning the origin of circular reference sequences

And from a human biology standpoint I would think that in/dels that disrupt the zero point and replication would be highly unlikely to be compatible with life?

@larrybabb
Copy link
Contributor Author

@ehclark I agree completely with your points. I also think we should state this clearly as a policy decision and address it in the future if/when the need is presented by the community and the priority is high enough.

@ahwagner
Copy link
Member

Respectfully, I disagree about this not being a problem. Spanning the mitochondrial origin is something that resources like gnomAD do struggle with. For example, from the gnomAD mitchondrial origin calling documentation:

Circular genome: The mtDNA non-coding “control region” spans the artificial break in the circular genome (coordinates chrM:16024-16569 and chrM:1-576), which can make it challenging to call variants in this region. Therefore to call control region variants, we extracted all chrM reads and realigned them to a mitochondrial reference that was shifted by 8,000 bases, called variants on this shifted alignment, and then converted coordinates back to their original positions.

I also want to disentangle the concern I raised (how do we represent these variants) from the concerns raised by @ehclark (how do you normalize these variants). Whether or not we want to add support for normalization over the origin, the fact is that HGVS expressions (such as the submitted NC_012920.1:m.16568_3del in the above example) can represent a variant spanning the origin, and VRS cannot. So, to be clear, I think we minimally need to allow people to represent sequence regions that span the origin when we consider adding support for mitochondrial variants. @larrybabb and @ehclark, I just want to check; are you disagreeing that is needed and that we should move forward without supporting such representations?

@ehclark
Copy link
Contributor

ehclark commented Mar 18, 2024

@ahwagner The main point I am trying to make is that variants spanning the origin point don't actually seem to exist in the real world (at least based on current data) and therefore expending engineering effort and increasing complexity to support them is not worthwhile.

That said, I think your point about separating the various concerns here makes sense. To build on your points, it seems there are three basic elements:

  • The VRS model representation of circular sequence locations that span the origin point
  • The normalization of alleles across the origin point
  • The conversion between VRS and other representations (HGVS, gnomAD, etc) where the origin point is crossed

Most of the engineering complexity is wrapped up in the normalization element. I interpret your comment to mean you think that VRS model representation is a must-have because it is necessary to allow the model to accurately represent circular sequence locations. And it seems that supporting the representation would not be complicated.

I would also note that as of right now, vrs-python will happily create VRS objects/ids for mitochondrial variants converting from gnomAD format, which is what the VCF annotator uses, and assuming a linear sequence. So if the VRS model changes to support sequence locations with start > end, it would be a breaking change. So getting this resolved would be high on my priority list.

@larrybabb
Copy link
Contributor Author

@ahwagner would you mind reviewing this asap. I reverted the secondary concern I found regarding the lack of support of c. hgvs notation so that we could get this m. support PR completed.

@larrybabb larrybabb requested a review from ahwagner March 20, 2024 14:01
@larrybabb larrybabb merged commit dc115d1 into main Mar 20, 2024
8 checks passed
@larrybabb larrybabb deleted the issue-360 branch March 20, 2024 18:52
@larrybabb
Copy link
Contributor Author

fixes bug #360

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants