Validate length of HGVSp_Short values #8064
@oplantalech looking at this today!
@ao508 Thanks!
lgtm!
A quick point about the index KEY: we truncated the TUMOR_SEQ_ALLELE component to a 240-character prefix because of limits on the byte size allowed for indexes in databases which use utf-8 encoding and the InnoDB database engine. An element with 2552 characters under a 3 or 4 byte encoding could be up to 10208 bytes long, which would exceed the maximum allowable length (767 or 3072 bytes) for MySQL 5.7. Before adopting this migration step, I think we may need to specify that cBioPortal requires a more recent version of MySQL, or we must begin to specify the character encodings that cBioPortal deployers must use. In either case we would need time to adjust the production environment at MSK.
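The arithmetic behind that concern can be sketched as follows. This is only an illustration: the function name and the dictionary are made up for this example, and the bytes-per-character figures assume MySQL's `utf8` (3 bytes maximum) and `utf8mb4` (4 bytes maximum) character sets.

```python
# Worst-case byte length of an indexed column prefix, assuming a fixed
# maximum number of bytes per character for the column's character set.
MAX_BYTES_PER_CHAR = {"utf8": 3, "utf8mb4": 4}

# Index size limits quoted in the comment above, in bytes.
INDEX_LIMITS = (767, 3072)


def worst_case_bytes(num_chars, charset):
    """Return the largest number of bytes num_chars characters can occupy."""
    return num_chars * MAX_BYTES_PER_CHAR[charset]


# Compare the 240-character prefix, the current 255 length, and the
# proposed 2552 length against both limits.
for num_chars in (240, 255, 2552):
    for charset in ("utf8", "utf8mb4"):
        size = worst_case_bytes(num_chars, charset)
        fits = {limit: size <= limit for limit in INDEX_LIMITS}
        print(num_chars, charset, size, fits)
```

Under these assumptions a 2552-character value needs up to 7656 or 10208 bytes, which is above both limits, while a 240-character prefix stays under 767 bytes only with the 3-byte encoding.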
There were some thoughts about long insertions here: #4345

As sequencing continues to be enhanced over time, we will see longer and longer insertion events. Even if we adopt the 2552 length suggested in this PR, eventually that will be exceeded and the problem will recur. We have seen such long events in other datasets (not currently in production).

Another option we might choose is to say that this insertion (1621 amino acids / 4863 nucleotides) is simply too large. Some genes are many times smaller than this insertion (TP53 has 393 amino acids). Perhaps this event is more in the nature of a genomic rearrangement / structural variant / fusion? How many nucleotides can be "inserted" before the event is no longer considered a "mutation" inside cBioPortal? We should consider whether the correct answer is 2552. If so, that is fine ... but then when we receive an event showing a 3000 AA insertion in the future we will be ready to say "this is too large to be considered a mutation in cBioPortal; please use a structural variant".
@sheridancbio I agree with your comments. I would say the main issue is that currently we do not validate the PROTEIN_CHANGE length, and that needs to be addressed as soon as possible. I proposed to extend the length because it solves the issue I currently encountered 😉 but indeed the global problem of how to handle long insertions/deletions in cBioPortal needs to be addressed in more depth and will probably need some substantial changes at the database level. So, do you think it makes more sense to keep the length as is (255) and just introduce the validation check (stating that we do not support HGVSp_Short values longer than 255 characters)?
Since it is true that we currently have a "de-facto" limit of 255 characters allowed in the protein change database field, I agree it makes sense to add validation during import (in the scripts module) and also to update the validator script to check this limit when validating studies. If this is urgent, then I think putting in checks immediately would be good. If an event is present in the data file which exceeds these limits, it should fail validation. It should also probably be filtered out / skipped during import, with a warning being written in the log output.

I also recommend that we examine and clarify our models using scientific knowledge. We should try to answer the question: what kind of cellular/molecular changes can properly be categorized as "mutations" in the cBioPortal system? How are these distinguished from "structural variants" and from "fusions" or other kinds of genetic alterations? With a clear picture of these categories we should then determine the maximum length of a Protein Change value which we will support in the cBioPortal mutation data model.

For example, I can imagine small (even single nucleotide) changes which disrupt an exon splice site and cause an intron to no longer be spliced out. Introns can be long, and the resulting extension of the protein could be substantial. Do we need to capture that in the protein_change field? That might be a reason to extend the protein change field so that it is arbitrarily long, and maybe that means we should drop the protein change field from the index. I'm actually not clear on why the protein_change field was in the index in the first place; presumably either to speed up retrieval or to guarantee uniqueness of records. But the basic question we should answer is: what is the universe of genetic events which will properly be categorized as mutations, and what is the longest protein change string which can reasonably result from any of these events?
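The "skip during import with a warning" behaviour suggested above could look roughly like the sketch below. It is not the cBioPortal importer: the function name, the logger name, and the representation of records as dictionaries keyed by column name are all assumptions made for illustration.

```python
import logging

logger = logging.getLogger("import_sketch")  # hypothetical logger name

MAX_PROTEIN_CHANGE_LENGTH = 255  # the de-facto limit discussed above


def filter_long_protein_changes(records, max_length=MAX_PROTEIN_CHANGE_LENGTH):
    """Return the records whose HGVSp_Short fits; warn about and skip the rest."""
    kept = []
    for index, record in enumerate(records, start=1):
        value = record.get("HGVSp_Short", "")
        if len(value) > max_length:
            # Skip the record instead of letting the database truncate it.
            logger.warning(
                "Skipping record %d: HGVSp_Short has %d characters (limit %d)",
                index, len(value), max_length)
            continue
        kept.append(record)
    return kept
```

For example, passing one record with `p.V600E` and one with a 300-character insertion returns only the first record and logs a single warning for the second.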
We have other concerns about the current mutation model as well. The "sharing" of |
be3eb2a to 4183c29
@sheridancbio Following your comments I have modified the PR to just fix what is wrong: the lack of validation of the length of HGVSp_Short values. I left out the part about increasing the length because I think it is clear that this needs thought and discussion with more people, and this PR is not the place for that.
Great 👍
We should also proceed to have the discussion as a community about what an appropriate protein change length limit should be for an event of type "Mutation". Perhaps extensions of the peptide by hundreds or thousands of amino acids are something we should handle in the mutation model ... in which case the database representation / indexing would need to be altered to accommodate the longest conceivable mutation consequence we would consider valid.
extra_meta_fields={'swissprot_identifier': 'accession'})
self.assertEqual(len(record_list), 2)
record_iterator = iter(record_list)
# expect an error for the second entry
Should we assert that there is no error logged for the first record? If some logic flaw in the scripts package caused every record to be flagged as too long, I'm sure we would notice, so maybe we don't have to worry about it. But since there are two records in the test set, it would feel more complete to test the code behavior for both cases. I'm not sure how you assert an absence of error; maybe self.assertNotEqual(record.levelno, logging.ERROR)?
However, I think this PR can be merged without this added test.
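One way to assert both cases is sketched below with a self-contained test. Everything here is a stand-in: `check_lengths`, `CollectingHandler`, and the logger name are hypothetical and only mimic the idea of collecting log records and inspecting `levelno`; the real test in the PR uses the validator's own helpers.

```python
import logging
import unittest


class CollectingHandler(logging.Handler):
    """Logging handler that stores every record it receives."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


def check_lengths(values, logger, max_length=255):
    """Hypothetical check: log one record per value, ERROR if too long."""
    for value in values:
        if len(value) > max_length:
            logger.error("HGVSp_Short value is too long")
        else:
            logger.info("HGVSp_Short value accepted")


class LengthCheckTest(unittest.TestCase):
    def test_only_second_record_is_flagged(self):
        logger = logging.getLogger("length_check_sketch")
        logger.setLevel(logging.INFO)
        logger.propagate = False
        handler = CollectingHandler()
        logger.addHandler(handler)
        try:
            check_lengths(["p.V600E", "p.Q1_Q2ins" + "A" * 300], logger)
        finally:
            logger.removeHandler(handler)
        self.assertEqual(len(handler.records), 2)
        first, second = handler.records
        # No error for the short value, an error for the long one.
        self.assertNotEqual(first.levelno, logging.ERROR)
        self.assertEqual(second.levelno, logging.ERROR)
```

Checking `levelno` on the first record guards against a regression in which every record is flagged as too long.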
Good catch! I've fixed the test
bbcd726 to 58da307
Kudos, SonarCloud Quality Gate passed!
@inodb Can you merge this PR? Thanks!
Merged. Thanks!
Recently, we have encountered very long HGVSp_Short values, like:
p.Q3631_Q3632insHQSADSSRHSGIGHGQASSAVRDSGHRGYSGSQASDSEGHSEDSDTQSVSAQGKAGPHQQSHKESARGQSGESSGRSGSFLYQVSTHEQSESTHGQSAPSTGGRQGSHYDQAQDSSRHSASQEGQDTIRGHPGPSRGGRQGSHQEQSVDRSGHSGSHHSHTTSQGRSDASRGQSGSRSASRKTYDKEQSGDGSRHSGSHHHEASSWADSSRHSLVGQGQSSGPRTSRPRGSSVSQDSDSEGHSEDSERRSGSASRNHHGSAQEQSRDGSRHPRSHHEDRAGHGHSAESSRQSGTHHAENSSGGQAASSHEQARSSAGERHGSHHQQSADSSRHSGIGHGQASSAVRDSGHRGSSGSQASDSEGHSEDSDTQSVSAHGQAGPHQQSHQESTRGRSAGRSGRSGSFLYQVSTHEQSESAHGRTGTSTGGRQGSHHKQARDSSRHSTSQEGQDTIHGHPGSSSGGRQGSHYEQLVDRSGHSGSHHSHTTSQGRSDASHGHSGSRSASRQTRNDEQSGDGSRHSGSRHHEASSRADSSGHSQVGQGQSEGPRTSRNWGSSFSQDSDSQGHSEDSERWSGSASRNHHGSAQEQLRDGSRHPRSHQEDRAGHGHSADSSRQSGTRHTQTSSGGQAASSHEQARSSAGERHGSHHQQSADSSRHSGIGHGQASSAVRDSGHRGYSGSQASDNEGHSEDSDTQSVSAHGQAGSHQQSHQESARGRSGETSGHSGSFLYQVSTHEQSESSHGWTGPSTRGRQGSRHEQAQDSSRHSASQDGQDTIRGHPGSSRGGRQGYHHEHSVDSSGHSGSHHSHTTSQGRSDASRGQSGSRSASRTTRNEEQSGDGSRHSGSRHHEASTHADISRHSQAVQGQSEGSRRSRRQGSSVSQDSDSEGHSEDSERWSGSASRNHHGSAQEQLRDGSRHPRSHQEDRAGHGHSADSSRQSGTRHTQTSSGGQAASSHEQARSSAGERHGSHHQQSADSSRHSGIGHGQASSAVRDSGHRGYSGSQASDNEGHSEDSDTQSVSAHGQAGSHQQSHQESARGRSGETSGHSGSFLYQVSTHEQSESSHGWTGPSTRGRQGSRHEQAQDSSRHSASQYGQDTIRGHPGSSRGGRQGYHHEHSVDSSGHSGSHHSHTTSQGRSDASRGQSGSRSASRTTRNEEQSGDSSRHSVSRHHEASTHADISRHSQAVQGQSEGSRRSRRQGSSVSQDSDSEGHSEDSERWSGSASRNHRGSVQEQSRHGSRHPRSHHEDRAGHGHSADRSRQSGTRHAETSSGGQAASSHEQARSSPGERHGSRHQQSADSSRHSGIPRGQASSAVRDSRHWGSSGSQASDSEGHSEESDTQSVSGHGQAGPHQQSHQESARDRSGGRSGRSGSFLYQVSTHEQSESAHGRTRTSTGRRQGSHHEQARDSSRHSASQEGQDTIRGHPGSSRRGRQGSHYEQSVDRSGHSGSHHSHTTSQGRSDASRGQSGSRSASRQTRNDEQSGDGSRHSWSHHHEASTQADSSRHSQSGQGQSAGPRTSRNQGSSVSQDSDSQGHSEDSERWSGSASRNHRGSAQEQSRDGSRHPTSHHEDRAGHGHSAESSRQSGTHHAENSSGGQAASSHEQARSSAGERHGSHHQ
HGVSp_Short values are saved in the column PROTEIN_CHANGE of the mutation_event table. However, the length of this column is 255, so these long HGVSp_Short values are truncated and are prone to raise the error "Duplicated mutation event" when loading a study (since PROTEIN_CHANGE is one of the keys of the table).

This PR proposes to add validation for the length of HGVSp_Short values and to throw an error if the length exceeds 255.
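To illustrate why truncation leads to duplicate keys, here is a small sketch. The two values are synthetic (built by repeating a short fragment, not taken from real data), and `truncate` only imitates a column that keeps the first 255 characters.

```python
COLUMN_LENGTH = 255  # length of the PROTEIN_CHANGE column described above


def truncate(value, length=COLUMN_LENGTH):
    """Imitate a column that silently keeps only the first `length` characters."""
    return value[:length]


# Two made-up insertions that share their first 266 characters and differ
# only in the final residue, i.e. beyond the 255-character cut-off.
shared_prefix = "p.Q3631_Q3632ins" + "HQSADSSRHSGIGHGQASSAVRDSG" * 10
value_a = shared_prefix + "A"
value_b = shared_prefix + "B"

# Distinct before truncation, but only one distinct key component after it.
stored_keys = {truncate(value_a), truncate(value_b)}
print(value_a != value_b, len(stored_keys))
```

Because both values collapse to the same stored string, two different events would present the same PROTEIN_CHANGE key component, which is consistent with the "Duplicated mutation event" error described above.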