Validate length of HGVSp_Short values #8064

oplantalech · 2020-11-16T16:43:42Z

Recently, we have encountered very long HGVSp_Short values, like: p.Q3631_Q3632insHQSADSSRHSGIGHGQASSAVRDSGHRGYSGSQASDSEGHSEDSDTQSVSAQGKAGPHQQSHKESARGQSGESSGRSGSFLYQVSTHEQSESTHGQSAPSTGGRQGSHYDQAQDSSRHSASQEGQDTIRGHPGPSRGGRQGSHQEQSVDRSGHSGSHHSHTTSQGRSDASRGQSGSRSASRKTYDKEQSGDGSRHSGSHHHEASSWADSSRHSLVGQGQSSGPRTSRPRGSSVSQDSDSEGHSEDSERRSGSASRNHHGSAQEQSRDGSRHPRSHHEDRAGHGHSAESSRQSGTHHAENSSGGQAASSHEQARSSAGERHGSHHQQSADSSRHSGIGHGQASSAVRDSGHRGSSGSQASDSEGHSEDSDTQSVSAHGQAGPHQQSHQESTRGRSAGRSGRSGSFLYQVSTHEQSESAHGRTGTSTGGRQGSHHKQARDSSRHSTSQEGQDTIHGHPGSSSGGRQGSHYEQLVDRSGHSGSHHSHTTSQGRSDASHGHSGSRSASRQTRNDEQSGDGSRHSGSRHHEASSRADSSGHSQVGQGQSEGPRTSRNWGSSFSQDSDSQGHSEDSERWSGSASRNHHGSAQEQLRDGSRHPRSHQEDRAGHGHSADSSRQSGTRHTQTSSGGQAASSHEQARSSAGERHGSHHQQSADSSRHSGIGHGQASSAVRDSGHRGYSGSQASDNEGHSEDSDTQSVSAHGQAGSHQQSHQESARGRSGETSGHSGSFLYQVSTHEQSESSHGWTGPSTRGRQGSRHEQAQDSSRHSASQDGQDTIRGHPGSSRGGRQGYHHEHSVDSSGHSGSHHSHTTSQGRSDASRGQSGSRSASRTTRNEEQSGDGSRHSGSRHHEASTHADISRHSQAVQGQSEGSRRSRRQGSSVSQDSDSEGHSEDSERWSGSASRNHHGSAQEQLRDGSRHPRSHQEDRAGHGHSADSSRQSGTRHTQTSSGGQAASSHEQARSSAGERHGSHHQQSADSSRHSGIGHGQASSAVRDSGHRGYSGSQASDNEGHSEDSDTQSVSAHGQAGSHQQSHQESARGRSGETSGHSGSFLYQVSTHEQSESSHGWTGPSTRGRQGSRHEQAQDSSRHSASQYGQDTIRGHPGSSRGGRQGYHHEHSVDSSGHSGSHHSHTTSQGRSDASRGQSGSRSASRTTRNEEQSGDSSRHSVSRHHEASTHADISRHSQAVQGQSEGSRRSRRQGSSVSQDSDSEGHSEDSERWSGSASRNHRGSVQEQSRHGSRHPRSHHEDRAGHGHSADRSRQSGTRHAETSSGGQAASSHEQARSSPGERHGSRHQQSADSSRHSGIPRGQASSAVRDSRHWGSSGSQASDSEGHSEESDTQSVSGHGQAGPHQQSHQESARDRSGGRSGRSGSFLYQVSTHEQSESAHGRTRTSTGRRQGSHHEQARDSSRHSASQEGQDTIRGHPGSSRRGRQGSHYEQSVDRSGHSGSHHSHTTSQGRSDASRGQSGSRSASRQTRNDEQSGDGSRHSWSHHHEASTQADSSRHSQSGQGQSAGPRTSRNQGSSVSQDSDSQGHSEDSERWSGSASRNHRGSAQEQSRDGSRHPTSHHEDRAGHGHSAESSRQSGTHHAENSSGGQAASSHEQARSSAGERHGSHHQ

HGVSp_Short values are saved in the column PROTEIN_CHANGE of the mutation_event table. However, the length of this column is 255, so these long HGVSp_Short values are truncated and prone to raise the error "Duplicated mutation event" when loading a study (since PROTEIN_CHANGE is one of the keys of the table).

This PR proposes to add validation for the length of HGVSP_Short values and throw an error if the length is higher than 255.

ao508 · 2020-11-16T16:44:23Z

@oplantalech looking at this today!

oplantalech · 2020-11-16T16:46:48Z

@ao508 Thanks!

ao508

lgtm!

sheridancbio · 2020-11-16T19:53:56Z

A quick point:
This field (PROTEIN_CHANGE) is one element of a composite index of the mutation_event table:

KEY KEY_MUTATION_EVENT_DETAILS (CHR, START_POSITION, END_POSITION, TUMOR_SEQ_ALLELE(240), ENTREZ_GENE_ID, PROTEIN_CHANGE, MUTATION_TYPE),

We truncated the TUMOR_SEQ_ALLELE component to a 240 character prefix based on limitations to the bit size allowed for indexes for databases which use utf-8 encoding and the innodb database engine. An element with 2552 characters under a 3 or 4 byte encoding could be up to 10208 bytes long, which would exceed the maximum allowable length (767 or 3072 bytes) for mysql 5.7

Before adopting this migration step, I think we may need to specify that cbioportal requires a more recent version of mysql, or we must begin to specify the character encodings that cbioportal deployers must use. In either case we would need time to adjust the production environment at MSK.

sheridancbio · 2020-11-16T20:30:15Z

There were some thoughts about long insertions here: #4345
One idea I had back then was : should we segregate "long" insertions/deletions into a separate table?

As sequencing continues to be enhanced over time, we will see longer and longer insertion events. Even if we adopt the 2552 length suggested in this PR eventually that will be exceeded and the problem will recur. We have seen such long events in other datasets (not currently in production).

Another option we might choose is to say that this insertion (1621 amino acids / 4863 nucleotides) is simply too large. Some genes are many times smaller than this insertion (TP53 has 393 amino acids). Perhaps this event is more in the nature of a genomic rearrangement / structural variant / fusion? How many nucleotides can be "inserted" before it is no longer is considered a "mutation" inside cbioportal? We should consider whether the correct answer is 2552. If so, that is fine ... but then when we receive an event showing a 3000 AA insertion in the future we will be ready to say "this is too large to be considered a mutation in cbioportal" -- please use a structural variant.

oplantalech · 2020-11-17T08:20:26Z

@sheridancbio I agree with your comments. I would say the main issue is that currently we do not validate for the PROTEIN_CHANGE length, and that needs to be addressed as soon as possible. I proposed to extend the length because it solves the issue I currently encountered 😉 but indeed the global problem on how to handle long insertions/deletions in cBioPortal needs to be addressed more in depth and probably will need some substantial changes at the database level.

So, do you think it makes more sense to keep the length as is (255) and just introduce the validation check (stating that we do not support HGVSp_Short values longer than 255 characters)?

sheridancbio · 2020-11-17T15:52:15Z

Since it is true that we currently have a "de-facto" limit of 255 characters allowed in the protein change database field then I agree it makes sense to add validation during import (in the scripts module) and also to update the validator script to check this limit when validating studies. If this is urgent, then I think putting in checks immediately would be good. If an event is present in the data file which exceeds these limits, it should fail validation. It should also probably be filtered out / skipped during import with a warning being written in the log output.

I also recommend that we examine and clarify our models using scientific knowledge. We should try to answer the question : what kind of cellular/molecular changes can properly be categorized as "mutations" in the cbioportal system? How are these distinguished from "structural variants" and from "fusions" or other kinds of genetic alterations? With a clear picture of these categories we should then determine what is maximum length of a Protein Change value which we will support in the cbioportal mutation data model.

For example, I can imagine small (even single nucleotide) changes which disrupt an exon splice site and cause an intron to no longer be spliced out. Introns can be long, and the resulting extension of the protein could be substantial. Do we need to capture that in the protein_change field? That might be a reason to extend the protein change field so that it is arbitrarily long, and maybe that means we should drop the protein change field from the index. I'm actually not clear on why the protein_change field was in the index in the first place - presumably to either speed up retrieval, or to guarantee uniqueness of records. But the basic question we should answer is : what is the universe of genetic events which will properly be categorized as mutations ... and that what is the longest protein change string which can reasonably result from any of these events.

We have other concerns about the current mutation model as well. The "sharing" of mutation_event records among multiple sample-specific mutation records has led to difficulties when annotation (stored in the mutation_event record) was performed at different times. So when old studies (annotated years ago) have a shared event with a new study, we are forced to choose between maintaining the original annotation and applying it to the new study ... or having the old / historical study gradually change over time as the events get new annotations as our annotation systems evolve. I bring this up because potentially there is a larger refactoring / restructuring coming. The mutation_event table may cease to exist. So I recommend we think deeply about these issues and plan our path forward with a clear picture based on answering these questions. If we don't formalize answers, we will keep revisiting these issues.

cc: @schultzn @n1zea144 @jjgao

oplantalech · 2020-11-23T10:47:46Z

@sheridancbio Following your comments I have modified the PR to just fix what is wrong: the lack of validating the length of HGVSp_Short values.

I left out the part of increasing the length because I think it is clear that this needs thought and discussion with more people, and this PR is not the place for that.

sheridancbio

Great 👍

We should also proceed to have the discussion as a community about what an appropriate protein change length limit should be for an event of type "Mutation". Perhaps extensions of the peptide by hundreds or thousands of amino acids is something we should handle in the mutation model ... in which case the database representation / indexing would need to be altered to accommodate the longest conceivable mutation consequence we would consider valid.

sheridancbio · 2020-11-25T02:27:42Z

core/src/test/scripts/unit_tests_validate_data.py

+                extra_meta_fields={'swissprot_identifier': 'accession'})
+        self.assertEqual(len(record_list), 2)
+        record_iterator = iter(record_list)
+        # expect an error for the second entry


Should we assert that there is no error logged for the first record? If some logic flaw in the scripts package caused flagging every record as too long I'm sure we would notice, so maybe we don't have to worry about it - but since there are two records in the test set, it would feel more complete to test the code behavior for both cases. I'm not sure how you assert an absence of error .. maybe self.assertNotEqual(record.levelno, logging.ERROR)?

However, I think this PR can be merged without this added test.

Good catch! I've fixed the test

sonarcloud · 2020-11-26T08:53:39Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
No Duplication information

oplantalech · 2020-11-26T11:27:49Z

@inodb Can you merge this PR? Thanks!

sheridancbio · 2020-11-30T15:54:38Z

Merged. Thanks!

oplantalech added bug validator labels Nov 16, 2020

oplantalech requested review from n1zea144, sheridancbio and ao508 November 16, 2020 16:43

ao508 approved these changes Nov 16, 2020

View reviewed changes

sheridancbio mentioned this pull request Nov 17, 2020

Feedback needed for schema change PR knowledgesystems/pipelines-scrum#386

Closed

n1zea144 removed their request for review November 18, 2020 14:19

oplantalech force-pushed the fix_long_prot_changes branch from be3eb2a to 4183c29 Compare November 23, 2020 10:42

oplantalech changed the title ~~Add support for longer HGVSp_Short values~~ Validate length of HGVSp_Short values Nov 23, 2020

sheridancbio approved these changes Nov 25, 2020

View reviewed changes

Validate length of HGVSp_Short values

58da307

oplantalech force-pushed the fix_long_prot_changes branch from bbcd726 to 58da307 Compare November 26, 2020 08:46

sheridancbio approved these changes Nov 30, 2020

View reviewed changes

sheridancbio merged commit 8c9f4ec into cBioPortal:master Nov 30, 2020

oplantalech deleted the fix_long_prot_changes branch November 30, 2020 16:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Validate length of HGVSp_Short values #8064

Validate length of HGVSp_Short values #8064

oplantalech commented Nov 16, 2020 •

edited

ao508 commented Nov 16, 2020

oplantalech commented Nov 16, 2020

ao508 left a comment

sheridancbio commented Nov 16, 2020

sheridancbio commented Nov 16, 2020

oplantalech commented Nov 17, 2020

sheridancbio commented Nov 17, 2020

oplantalech commented Nov 23, 2020 •

edited

sheridancbio left a comment

sheridancbio Nov 25, 2020

oplantalech Nov 26, 2020

sonarcloud bot commented Nov 26, 2020

oplantalech commented Nov 26, 2020

sheridancbio commented Nov 30, 2020

Validate length of HGVSp_Short values #8064

Validate length of HGVSp_Short values #8064

Conversation

oplantalech commented Nov 16, 2020 • edited

ao508 commented Nov 16, 2020

oplantalech commented Nov 16, 2020

ao508 left a comment

Choose a reason for hiding this comment

sheridancbio commented Nov 16, 2020

sheridancbio commented Nov 16, 2020

oplantalech commented Nov 17, 2020

sheridancbio commented Nov 17, 2020

oplantalech commented Nov 23, 2020 • edited

sheridancbio left a comment

Choose a reason for hiding this comment

sheridancbio Nov 25, 2020

Choose a reason for hiding this comment

oplantalech Nov 26, 2020

Choose a reason for hiding this comment

sonarcloud bot commented Nov 26, 2020

oplantalech commented Nov 26, 2020

sheridancbio commented Nov 30, 2020

oplantalech commented Nov 16, 2020 •

edited

oplantalech commented Nov 23, 2020 •

edited