New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support gene names in variants #257
Comments
I think that this would a pretty interesting functionality, especially for labs that prepare clinical reports. I guess the steps would look something like:
|
Hi Aaron, In my opinion, there are two questions to answer here. 1) Whether this can be done; and 2) whether it ought to be done? Respectfully, your email provides a perfect example of why implementing this functionality is not necessarily a good idea. PLEASE PLEASE PLEASE take these comments constructively, but this is an issue that really bugs me, and I cannot remain quiet on the matter any longer within this forum. Sequence variation MUST be reported in the context of a fixed and immutable reference sequence which is permanently identified by a fixed and immutable reference sequence identifier. A gene name is neither immutable nor permanently linked to a single reference sequence. Also, I have never seen a gene name used in the place of a sequence identifier, so I can only assume you mean the gene symbol? For example, collagen type V alpha 1 chain = gene name, COL5A1 = gene symbol https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:2209). Both of these data are not immutable, for example the P3H1 (prolyl 3-hydroxylase 1) gene up until a few years ago was called LEPRE1 (leucine proline-enriched proteoglycan (leprecan) 1). Therefore they should never be used in place of a valid reference sequence identifier. Please note that transcript models for each gene can be very variable and not all transcript variants will necessarily utilise all the exons within a gene. COL5A1 is a perfect example. Consider the variant description COL5A1:c.5071A>T. This description would be valid in the context of two transcripts NM_000093.4:c.5071A>T and NM_001278074.1:c.5071A>T giving relatively similar protein level variant descriptions NP_000084.3:p.(Arg1691Ter) and NP_001265003.1:p.(Lys1691Ter), both of which would likely lead to mild Ehlers Danlos syndrome. However, these are entirely unique variants i.e. they are derived from different genomic variants NC_000009.12:g.134829979A>T and NC_000009.12:g.134830111A>T. I cannot comment on how the hgvs (py) core development team will implement this functionality, but in response to your suggested workflow this is how we implement this functionality at VariantValidator.
q. “Assuming NCBI needs to be used, why stop at the gene name? Why not include the Summary? The link brings you to a sample gene”
My greatest concern is that you believe that this functionality would help labs that prepare clinical reports. If labs are using gene symbols in place of valid reference sequences to report variation, then there is a serious problem. The identification of pathogenic variants is significantly hampered by noise in both case and control databases. I realise that the majority of this noise come from the messy practical and analytical workflows involved in genome sequencing, but there is also a significant problem with the submission of inaccurately reported data. My colleague, prof. Raymond Dalgleish, spends hours every few weeks CORRECTING variant descriptions associated with Osteogenesis Imperfecta and Ehlers Danlos syndrome which have been submitted to journals and databases by clinical and non-clinical research laboratories alike. There is now an effort to raise the standards of reporting in journals, and I will make sure I contact the working group involved and raise this issue. Again, please do not take this as anything more than constructive criticism. If you want to discuss this further in an open forum (which may be a very useful exercise since I have only provided one side of the discussion) please do so on https://groups.google.com/forum/#!forum/hgvs-discuss. I’m happy for you to copy my comments across should you wish to do so. |
I think follow the robustness principle of "Be conservative in what you send, be liberal in what you accept" Accessions are better for computers no question but humans like So, using the robustness principle and thinking of gene symbols as a human memory hint, my answers would be:
Yes
This should be a flag, off by default. If you load a variant with the gene symbol, turn the flag on, so that the gene symbol remains as it is manipulated/printed etc.
Yes - throw an exception by default. There are a lot of corner cases here, though! Such as do you accept HGNC aliases or gene renamings? (eg someone uses the latest name for a gene, even though at the accession version it was called something else?) Some people may want to discard the gene symbol if the variant works but the match is wrong, you could add a
Or add a parameter to the If you stripped the gene name, then set a flag to display the gene name, you could then use hgvs to fix any dodgy variant/gene names you have. |
Hi Dave. Nice suggestions. I think this is a valid compromise and is sufficiently HGVS compliant and I'd certainly have no objections in principle. I like this idea because the transcript ID is explicitly stated so there is no doubt as to the identity of the reference sequence. I do have a couple of concerns about implementation though. I'm not sure whether it is necessarily a good idea to check whether the Gene Symbol and the transcript ID match. The Transcript ID MUST be stated and MUST be considered as the only viable sequence identifier. The gene symbol is a nice human readable touch, but is is icing on the cake. I would suggest that trying and trying to include validation that the symbol matches the reference sequence ID can muddy the water and cause problems. Consider this example. Because gene symbols are not immutable, they are not sufficiently robust for asserting a HGVSAccessionGeneSymbolMismatch. Also, playing Devil's advocate, why would asserting an incorrect Gene symbol matter since HGVS only requires the use of a valid sequence ID and does not require a gene symbol to be provided. I actually think that your idea of providing a flag to provide the Gene Symbol could be a work around, i.e. provide the correct gene symbol and UTA can certainly provide the necessary data. However, I can still for-see issues where users may have generated data ~3 years ago where the output would have been NM_022356.3(LEPRE1):c.2055+18G>A, whereas today the output would be NM_022356.3(P3H1):c.2055+18G>A. So again I'd be inclined to say that this is not necessarily a sufficiently robust idea either, but I admit, less worrying than the HGVSAccessionGeneSymbolMismatch error :) I agree with "Be conservative in what you send, be liberal in what you accept", but to prevent unnecessary issues and confusion, I suggest that when parsing the variant description, the (LEPRE1) or (DMD) as in the previous example just be stripped out and ignored. This method is implemented in VariantValidator. |
Everything's possible with enough hacks. I didn't support gene names originally because HGVS recommendations uses the parenthesis syntax in two very different ways:
(I think there used to be a third use for genome build with chromosome names.) This syntax stinks for a couple of reasons. First, it's easy enough to grab the text in parens, but how do you know what it is? You could look it up to see if a valid symbol or add logic based on the structure (e.g., "starts with NG_" or "starts with NM_"), which essentially makes the parser dependent on external data or on rules based on internal choices of external systems (like identifier structure.) Changes and design choices in external systems must not affect parsing. Another wrinkle is that it's unclear whether it's legal to include the gene name in the second form? What if you want gene name and transcript? So, if we want to support parenthesis after the accession, we must account for all of the ways in which HGVS uses them, not just the gene names. So, although it's possible to support gene names and transcripts, doing so will impact more code and code assumptions than is easily apparent. |
Agreed, Effectively, within VV we have produced code that deals with both these scenarios, but ultimately all we are aiming for is a way of stripping away un-necessary data to create a variant description that contains only the necessary data for hgvs (py) parsing. So I guess a decision needs to be taken regarding the scope of the hgvs project. For VV, tackling issues like this one are within scope. I have made the code very omnivorous, able to accept many forms of variant descriptions and convert them in to a HGVS complaint format so that hgvs can work its magic. Don't get me wrong, it's great fun tacking these issues, but the project has been running for >3 years now and still we see new types of weird and wonderful descriptions that users submit, or request that we handle. We are in the process of building a FOS VV python package and web API, hopefully to be released early 2019. For the second release later in the year, I'm going to modularise some of the code. One module will be a variant formatting module which takes MANY styles of description and formats them into something hgvs (py) will parse. Once available, would this be a suitable workaround that might save you a headache? If so, I will try prioritise creating the module for the V1 release and will plug the function into the web API. |
Originally reported by: Reece Hart (Bitbucket: reece, GitHub: reece)
This is a proposal. Users should comment and/or vote.
Current state: The hgvs package does not currently support parsing or formatting variants with gene names (e.g.,
NM_004006.1(DMD):c.3G>T
andGJB2:c.76A>C
[source]). This was intentional because 1) gene names alone are insufficient and error-prone to describe variants, and 2) when with an accession, gene names create duplicative information and therefore opportunities to be wrong. However, some sources include gene names with variants and those variants will not parse.Minimal proposal: Parse gene names in the context of a valid accession and discard it. This is to prevent syntax errors for valid HGVS variants.
Not under consideration: This package is unlikely to ever support gene names alone or chromosome names as accessions because these are fundamentally flawed as precise descriptions of sequence variants.
Questions:
The text was updated successfully, but these errors were encountered: