Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support gene names in variants #257

Closed
reece opened this issue Jul 29, 2015 · 6 comments
Closed

Support gene names in variants #257

reece opened this issue Jul 29, 2015 · 6 comments
Assignees

Comments

@reece
Copy link
Member

reece commented Jul 29, 2015

Originally reported by: Reece Hart (Bitbucket: reece, GitHub: reece)


This is a proposal. Users should comment and/or vote.

Current state: The hgvs package does not currently support parsing or formatting variants with gene names (e.g., NM_004006.1(DMD):c.3G>T and GJB2:c.76A>C [source]). This was intentional because 1) gene names alone are insufficient and error-prone to describe variants, and 2) when with an accession, gene names create duplicative information and therefore opportunities to be wrong. However, some sources include gene names with variants and those variants will not parse.

Minimal proposal: Parse gene names in the context of a valid accession and discard it. This is to prevent syntax errors for valid HGVS variants.

Not under consideration: This package is unlikely to ever support gene names alone or chromosome names as accessions because these are fundamentally flawed as precise descriptions of sequence variants.

Questions:

  • Should model be updated to include gene name?
  • Should formatting (and other modules?) add the gene name if it's missing?
  • Should the validator confirm that gene name and accession are associated?
  • Formatting and validation support imply that the data provider interface also provides accession-gene associations.

@reece reece self-assigned this Mar 23, 2017
@aaronslaff
Copy link

I think that this would a pretty interesting functionality, especially for labs that prepare clinical reports. I guess the steps would look something like:

  1. Use gene name as accession ID in hgvs
    1.5) hgvs accepts gene name and replaces it with most recent/desired accession type
  2. Continue with analysis
  3. Print report with gene name and accession ID, gene summary, and observed/proposed variants
    I guess questions I have:
  • How do you relay the gene name to accession ID? Can seqrepo handle this as is?
  • Assuming NCBI needs to be used, why stop at the gene name? Why not include the Summary? The link brings you to a sample gene
  • Would it make sense to create a workflow/function that can generate a "report" per gene?
    Just spit balling. Interested to hear what you guys are thinking.

@Peter-J-Freeman
Copy link

Hi Aaron,

In my opinion, there are two questions to answer here. 1) Whether this can be done; and 2) whether it ought to be done?
In answer to question 1, yes absolutely it can be done, and I believe the hgvs (py) core development team are already developing the necessary functions. Indeed, VariantValidator (https://variantvalidator.org/ ) has offered this functionality for several years. However, VariantValidator is VERY strict in its implementation, which brings us to question 2.

Respectfully, your email provides a perfect example of why implementing this functionality is not necessarily a good idea. PLEASE PLEASE PLEASE take these comments constructively, but this is an issue that really bugs me, and I cannot remain quiet on the matter any longer within this forum.

Sequence variation MUST be reported in the context of a fixed and immutable reference sequence which is permanently identified by a fixed and immutable reference sequence identifier. A gene name is neither immutable nor permanently linked to a single reference sequence. Also, I have never seen a gene name used in the place of a sequence identifier, so I can only assume you mean the gene symbol? For example, collagen type V alpha 1 chain = gene name, COL5A1 = gene symbol https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:2209). Both of these data are not immutable, for example the P3H1 (prolyl 3-hydroxylase 1) gene up until a few years ago was called LEPRE1 (leucine proline-enriched proteoglycan (leprecan) 1). Therefore they should never be used in place of a valid reference sequence identifier.

Please note that transcript models for each gene can be very variable and not all transcript variants will necessarily utilise all the exons within a gene. COL5A1 is a perfect example. Consider the variant description COL5A1:c.5071A>T. This description would be valid in the context of two transcripts NM_000093.4:c.5071A>T and NM_001278074.1:c.5071A>T giving relatively similar protein level variant descriptions NP_000084.3:p.(Arg1691Ter) and NP_001265003.1:p.(Lys1691Ter), both of which would likely lead to mild Ehlers Danlos syndrome. However, these are entirely unique variants i.e. they are derived from different genomic variants NC_000009.12:g.134829979A>T and NC_000009.12:g.134830111A>T.

I cannot comment on how the hgvs (py) core development team will implement this functionality, but in response to your suggested workflow this is how we implement this functionality at VariantValidator.

  1. Use gene name as accession ID in hgvs 1.5) hgvs accepts gene name and replaces it with most recent/desired accession type

VariantValidator generates descriptions in the context of all available transcripts if a gene SYMBOL is provided in place of a reference sequence identifier. It also provides a stern warning that this is not acceptable practice. The USER must then select the desired transcript level description and submit it to validation. We take the view that it is the responsibility of the user to provide the reference sequence to which the variation is being described, therefore we do not discriminate between variants that would be valid and those that would not. Reece, please may urge that the hgvs (py) core development team are equally militant on this matter.

  1. Continue with analysis

As stated above, the USER must provide a suitable description for validation

  1. Print report with gene name and accession ID, gene summary, and observed/proposed variants
    The gene accession ID for COL5A1 is HGNC:2209. I assume this is not what you meant.

I assume you meant the transcript reference sequence identifier for the gene? TP53 has 8 coding transcripts represented by 12 unique reference sequence records which encode 12 proteins. The reason there are 12 records is that although the reference sequences for several of the transcripts are identical, translation is initiated at internal ribosome entry sites, hence 8 mRNAs and 12 proteins. May I ask which transcript reference sequence you believe would be “the gene accession”?

q. “Assuming NCBI needs to be used, why stop at the gene name? Why not include the Summary? The link brings you to a sample gene”

I’m not sure whether this would fall under the remit if the hgvs (py) project, but we could potentially implement something like this in VariantValidator web services. If you are interested please get in touch through our contact-us form.

My greatest concern is that you believe that this functionality would help labs that prepare clinical reports. If labs are using gene symbols in place of valid reference sequences to report variation, then there is a serious problem. The identification of pathogenic variants is significantly hampered by noise in both case and control databases. I realise that the majority of this noise come from the messy practical and analytical workflows involved in genome sequencing, but there is also a significant problem with the submission of inaccurately reported data. My colleague, prof. Raymond Dalgleish, spends hours every few weeks CORRECTING variant descriptions associated with Osteogenesis Imperfecta and Ehlers Danlos syndrome which have been submitted to journals and databases by clinical and non-clinical research laboratories alike. There is now an effort to raise the standards of reporting in journals, and I will make sure I contact the working group involved and raise this issue.

Again, please do not take this as anything more than constructive criticism. If you want to discuss this further in an open forum (which may be a very useful exercise since I have only provided one side of the discussion) please do so on https://groups.google.com/forum/#!forum/hgvs-discuss. I’m happy for you to copy my comments across should you wish to do so.

@davmlaw
Copy link
Contributor

davmlaw commented Dec 9, 2018

I think follow the robustness principle of "Be conservative in what you send, be liberal in what you accept"

Accessions are better for computers no question but humans like NM_004006.1(DMD):c.3G>T over NM_004006.1:c.3G>T because their brains are better at remember gene symbols than accession IDs.

So, using the robustness principle and thinking of gene symbols as a human memory hint, my answers would be:

Should model be updated to include gene name?

Yes

Should formatting (and other modules?) add the gene name if it's missing?

This should be a flag, off by default.

If you load a variant with the gene symbol, turn the flag on, so that the gene symbol remains as it is manipulated/printed etc.

Should the validator confirm that gene name and accession are associated?

Yes - throw an exception by default.

There are a lot of corner cases here, though! Such as do you accept HGNC aliases or gene renamings? (eg someone uses the latest name for a gene, even though at the accession version it was called something else?)

Some people may want to discard the gene symbol if the variant works but the match is wrong, you could add a remove_gene_symbol method so that client can write code like:

my_hgvs = 'NM_004006.1(THIS_GENE_SYMBOL_IS_NOT_DMD):c.3G>T'
try:
    var_c1 = hgvsparser.parse_hgvs_variant(my_hgvs)
except HGVSAccessionGeneSymbolMismatch as e:
    my_stripped_hgvs = hgvsparser.remove_gene_symbol(my_hgvs)
    var_c1 = hgvsparser.parse_hgvs_variant(my_stripped_hgvs)

Or add a parameter to the parse_hgvs_variant that does this internally

If you stripped the gene name, then set a flag to display the gene name, you could then use hgvs to fix any dodgy variant/gene names you have.

@Peter-J-Freeman
Copy link

Peter-J-Freeman commented Dec 14, 2018

Hi Dave.

Nice suggestions. I think this is a valid compromise and is sufficiently HGVS compliant and I'd certainly have no objections in principle. I like this idea because the transcript ID is explicitly stated so there is no doubt as to the identity of the reference sequence.

I do have a couple of concerns about implementation though.

I'm not sure whether it is necessarily a good idea to check whether the Gene Symbol and the transcript ID match. The Transcript ID MUST be stated and MUST be considered as the only viable sequence identifier. The gene symbol is a nice human readable touch, but is is icing on the cake. I would suggest that trying and trying to include validation that the symbol matches the reference sequence ID can muddy the water and cause problems.

Consider this example.
NM_022356.3(LEPRE1):c.2055+18G>A. This would be a perfectly valid description because LEPRE1 is a valid gene symbol. However, LEPRE1 has been replaced with P3H1. When I first started working with hgvs (py) UTA contained the symbol LEPRE1 for over a year after the gene record had been updated to P3H1. Consequently, unless UTA were made aware of all previous symbols for a gene, folks might end up getting parse errors relating to an "invalid" gene symbol when they were in fact entering a valid gene symbol. Similarly, people with legacy data may keep getting errors when entering a valid gene symbol that has been superseded.

Because gene symbols are not immutable, they are not sufficiently robust for asserting a HGVSAccessionGeneSymbolMismatch. Also, playing Devil's advocate, why would asserting an incorrect Gene symbol matter since HGVS only requires the use of a valid sequence ID and does not require a gene symbol to be provided. I actually think that your idea of providing a flag to provide the Gene Symbol could be a work around, i.e. provide the correct gene symbol and UTA can certainly provide the necessary data. However, I can still for-see issues where users may have generated data ~3 years ago where the output would have been NM_022356.3(LEPRE1):c.2055+18G>A, whereas today the output would be NM_022356.3(P3H1):c.2055+18G>A. So again I'd be inclined to say that this is not necessarily a sufficiently robust idea either, but I admit, less worrying than the HGVSAccessionGeneSymbolMismatch error :)

I agree with "Be conservative in what you send, be liberal in what you accept", but to prevent unnecessary issues and confusion, I suggest that when parsing the variant description, the (LEPRE1) or (DMD) as in the previous example just be stripped out and ignored.

This method is implemented in VariantValidator.
NM_022356.3(LEPRE1):c.2055+18G>A
https://variantvalidator.org/variantvalidation/?variant=NM_022356.3%28LEPRE1%29%3Ac.2055%2B18G%3EA&primary_assembly=GRCh38&alignment=splign

@reece
Copy link
Member Author

reece commented Dec 15, 2018

Everything's possible with enough hacks. I didn't support gene names originally because HGVS recommendations uses the parenthesis syntax in two very different ways:

  • gene name: NM_004006.1(DMD):c.3G>T
  • transcript within NG: NG_012232.1(NM_004006.1):c.93+1G>T

(I think there used to be a third use for genome build with chromosome names.)

This syntax stinks for a couple of reasons. First, it's easy enough to grab the text in parens, but how do you know what it is? You could look it up to see if a valid symbol or add logic based on the structure (e.g., "starts with NG_" or "starts with NM_"), which essentially makes the parser dependent on external data or on rules based on internal choices of external systems (like identifier structure.) Changes and design choices in external systems must not affect parsing.

Another wrinkle is that it's unclear whether it's legal to include the gene name in the second form? What if you want gene name and transcript?

So, if we want to support parenthesis after the accession, we must account for all of the ways in which HGVS uses them, not just the gene names.

So, although it's possible to support gene names and transcripts, doing so will impact more code and code assumptions than is easily apparent.

@Peter-J-Freeman
Copy link

Peter-J-Freeman commented Dec 17, 2018

Agreed,

Effectively, within VV we have produced code that deals with both these scenarios, but ultimately all we are aiming for is a way of stripping away un-necessary data to create a variant description that contains only the necessary data for hgvs (py) parsing. So I guess a decision needs to be taken regarding the scope of the hgvs project.

For VV, tackling issues like this one are within scope. I have made the code very omnivorous, able to accept many forms of variant descriptions and convert them in to a HGVS complaint format so that hgvs can work its magic. Don't get me wrong, it's great fun tacking these issues, but the project has been running for >3 years now and still we see new types of weird and wonderful descriptions that users submit, or request that we handle.

We are in the process of building a FOS VV python package and web API, hopefully to be released early 2019. For the second release later in the year, I'm going to modularise some of the code. One module will be a variant formatting module which takes MANY styles of description and formats them into something hgvs (py) will parse.

Once available, would this be a suitable workaround that might save you a headache? If so, I will try prioritise creating the module for the V1 release and will plug the function into the web API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants