Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate length of HGVSp_Short values #8064

Merged
merged 1 commit into from
Nov 30, 2020

Conversation

oplantalech
Copy link
Contributor

@oplantalech oplantalech commented Nov 16, 2020

Recently, we have encountered very long HGVSp_Short values, like: p.Q3631_Q3632insHQSADSSRHSGIGHGQASSAVRDSGHRGYSGSQASDSEGHSEDSDTQSVSAQGKAGPHQQSHKESARGQSGESSGRSGSFLYQVSTHEQSESTHGQSAPSTGGRQGSHYDQAQDSSRHSASQEGQDTIRGHPGPSRGGRQGSHQEQSVDRSGHSGSHHSHTTSQGRSDASRGQSGSRSASRKTYDKEQSGDGSRHSGSHHHEASSWADSSRHSLVGQGQSSGPRTSRPRGSSVSQDSDSEGHSEDSERRSGSASRNHHGSAQEQSRDGSRHPRSHHEDRAGHGHSAESSRQSGTHHAENSSGGQAASSHEQARSSAGERHGSHHQQSADSSRHSGIGHGQASSAVRDSGHRGSSGSQASDSEGHSEDSDTQSVSAHGQAGPHQQSHQESTRGRSAGRSGRSGSFLYQVSTHEQSESAHGRTGTSTGGRQGSHHKQARDSSRHSTSQEGQDTIHGHPGSSSGGRQGSHYEQLVDRSGHSGSHHSHTTSQGRSDASHGHSGSRSASRQTRNDEQSGDGSRHSGSRHHEASSRADSSGHSQVGQGQSEGPRTSRNWGSSFSQDSDSQGHSEDSERWSGSASRNHHGSAQEQLRDGSRHPRSHQEDRAGHGHSADSSRQSGTRHTQTSSGGQAASSHEQARSSAGERHGSHHQQSADSSRHSGIGHGQASSAVRDSGHRGYSGSQASDNEGHSEDSDTQSVSAHGQAGSHQQSHQESARGRSGETSGHSGSFLYQVSTHEQSESSHGWTGPSTRGRQGSRHEQAQDSSRHSASQDGQDTIRGHPGSSRGGRQGYHHEHSVDSSGHSGSHHSHTTSQGRSDASRGQSGSRSASRTTRNEEQSGDGSRHSGSRHHEASTHADISRHSQAVQGQSEGSRRSRRQGSSVSQDSDSEGHSEDSERWSGSASRNHHGSAQEQLRDGSRHPRSHQEDRAGHGHSADSSRQSGTRHTQTSSGGQAASSHEQARSSAGERHGSHHQQSADSSRHSGIGHGQASSAVRDSGHRGYSGSQASDNEGHSEDSDTQSVSAHGQAGSHQQSHQESARGRSGETSGHSGSFLYQVSTHEQSESSHGWTGPSTRGRQGSRHEQAQDSSRHSASQYGQDTIRGHPGSSRGGRQGYHHEHSVDSSGHSGSHHSHTTSQGRSDASRGQSGSRSASRTTRNEEQSGDSSRHSVSRHHEASTHADISRHSQAVQGQSEGSRRSRRQGSSVSQDSDSEGHSEDSERWSGSASRNHRGSVQEQSRHGSRHPRSHHEDRAGHGHSADRSRQSGTRHAETSSGGQAASSHEQARSSPGERHGSRHQQSADSSRHSGIPRGQASSAVRDSRHWGSSGSQASDSEGHSEESDTQSVSGHGQAGPHQQSHQESARDRSGGRSGRSGSFLYQVSTHEQSESAHGRTRTSTGRRQGSHHEQARDSSRHSASQEGQDTIRGHPGSSRRGRQGSHYEQSVDRSGHSGSHHSHTTSQGRSDASRGQSGSRSASRQTRNDEQSGDGSRHSWSHHHEASTQADSSRHSQSGQGQSAGPRTSRNQGSSVSQDSDSQGHSEDSERWSGSASRNHRGSAQEQSRDGSRHPTSHHEDRAGHGHSAESSRQSGTHHAENSSGGQAASSHEQARSSAGERHGSHHQ

HGVSp_Short values are saved in the column PROTEIN_CHANGE of the mutation_event table. However, the length of this column is 255, so these long HGVSp_Short values are truncated and prone to raise the error "Duplicated mutation event" when loading a study (since PROTEIN_CHANGE is one of the keys of the table).

This PR proposes to add validation for the length of HGVSP_Short values and throw an error if the length is higher than 255.

@ao508
Copy link
Contributor

ao508 commented Nov 16, 2020

@oplantalech looking at this today!

@oplantalech
Copy link
Contributor Author

@ao508 Thanks!

Copy link
Contributor

@ao508 ao508 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@sheridancbio
Copy link
Contributor

A quick point:
This field (PROTEIN_CHANGE) is one element of a composite index of the mutation_event table:

KEY KEY_MUTATION_EVENT_DETAILS (CHR, START_POSITION, END_POSITION, TUMOR_SEQ_ALLELE(240), ENTREZ_GENE_ID, PROTEIN_CHANGE, MUTATION_TYPE),

We truncated the TUMOR_SEQ_ALLELE component to a 240 character prefix based on limitations to the bit size allowed for indexes for databases which use utf-8 encoding and the innodb database engine. An element with 2552 characters under a 3 or 4 byte encoding could be up to 10208 bytes long, which would exceed the maximum allowable length (767 or 3072 bytes) for mysql 5.7

Before adopting this migration step, I think we may need to specify that cbioportal requires a more recent version of mysql, or we must begin to specify the character encodings that cbioportal deployers must use. In either case we would need time to adjust the production environment at MSK.

@sheridancbio
Copy link
Contributor

There were some thoughts about long insertions here: #4345
One idea I had back then was : should we segregate "long" insertions/deletions into a separate table?

As sequencing continues to be enhanced over time, we will see longer and longer insertion events. Even if we adopt the 2552 length suggested in this PR eventually that will be exceeded and the problem will recur. We have seen such long events in other datasets (not currently in production).

Another option we might choose is to say that this insertion (1621 amino acids / 4863 nucleotides) is simply too large. Some genes are many times smaller than this insertion (TP53 has 393 amino acids). Perhaps this event is more in the nature of a genomic rearrangement / structural variant / fusion? How many nucleotides can be "inserted" before it is no longer is considered a "mutation" inside cbioportal? We should consider whether the correct answer is 2552. If so, that is fine ... but then when we receive an event showing a 3000 AA insertion in the future we will be ready to say "this is too large to be considered a mutation in cbioportal" -- please use a structural variant.

@oplantalech
Copy link
Contributor Author

@sheridancbio I agree with your comments. I would say the main issue is that currently we do not validate for the PROTEIN_CHANGE length, and that needs to be addressed as soon as possible. I proposed to extend the length because it solves the issue I currently encountered 😉 but indeed the global problem on how to handle long insertions/deletions in cBioPortal needs to be addressed more in depth and probably will need some substantial changes at the database level.

So, do you think it makes more sense to keep the length as is (255) and just introduce the validation check (stating that we do not support HGVSp_Short values longer than 255 characters)?

@sheridancbio
Copy link
Contributor

Since it is true that we currently have a "de-facto" limit of 255 characters allowed in the protein change database field then I agree it makes sense to add validation during import (in the scripts module) and also to update the validator script to check this limit when validating studies. If this is urgent, then I think putting in checks immediately would be good. If an event is present in the data file which exceeds these limits, it should fail validation. It should also probably be filtered out / skipped during import with a warning being written in the log output.

I also recommend that we examine and clarify our models using scientific knowledge. We should try to answer the question : what kind of cellular/molecular changes can properly be categorized as "mutations" in the cbioportal system? How are these distinguished from "structural variants" and from "fusions" or other kinds of genetic alterations? With a clear picture of these categories we should then determine what is maximum length of a Protein Change value which we will support in the cbioportal mutation data model.

For example, I can imagine small (even single nucleotide) changes which disrupt an exon splice site and cause an intron to no longer be spliced out. Introns can be long, and the resulting extension of the protein could be substantial. Do we need to capture that in the protein_change field? That might be a reason to extend the protein change field so that it is arbitrarily long, and maybe that means we should drop the protein change field from the index. I'm actually not clear on why the protein_change field was in the index in the first place - presumably to either speed up retrieval, or to guarantee uniqueness of records. But the basic question we should answer is : what is the universe of genetic events which will properly be categorized as mutations ... and that what is the longest protein change string which can reasonably result from any of these events.

We have other concerns about the current mutation model as well. The "sharing" of mutation_event records among multiple sample-specific mutation records has led to difficulties when annotation (stored in the mutation_event record) was performed at different times. So when old studies (annotated years ago) have a shared event with a new study, we are forced to choose between maintaining the original annotation and applying it to the new study ... or having the old / historical study gradually change over time as the events get new annotations as our annotation systems evolve. I bring this up because potentially there is a larger refactoring / restructuring coming. The mutation_event table may cease to exist. So I recommend we think deeply about these issues and plan our path forward with a clear picture based on answering these questions. If we don't formalize answers, we will keep revisiting these issues.

cc: @schultzn @n1zea144 @jjgao

@n1zea144 n1zea144 removed their request for review November 18, 2020 14:19
@oplantalech oplantalech changed the title Add support for longer HGVSp_Short values Validate length of HGVSp_Short values Nov 23, 2020
@oplantalech
Copy link
Contributor Author

oplantalech commented Nov 23, 2020

@sheridancbio Following your comments I have modified the PR to just fix what is wrong: the lack of validating the length of HGVSp_Short values.

I left out the part of increasing the length because I think it is clear that this needs thought and discussion with more people, and this PR is not the place for that.

Copy link
Contributor

@sheridancbio sheridancbio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great 👍

We should also proceed to have the discussion as a community about what an appropriate protein change length limit should be for an event of type "Mutation". Perhaps extensions of the peptide by hundreds or thousands of amino acids is something we should handle in the mutation model ... in which case the database representation / indexing would need to be altered to accommodate the longest conceivable mutation consequence we would consider valid.

extra_meta_fields={'swissprot_identifier': 'accession'})
self.assertEqual(len(record_list), 2)
record_iterator = iter(record_list)
# expect an error for the second entry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we assert that there is no error logged for the first record? If some logic flaw in the scripts package caused flagging every record as too long I'm sure we would notice, so maybe we don't have to worry about it - but since there are two records in the test set, it would feel more complete to test the code behavior for both cases. I'm not sure how you assert an absence of error .. maybe self.assertNotEqual(record.levelno, logging.ERROR)?

However, I think this PR can be merged without this added test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I've fixed the test

@sonarcloud
Copy link

sonarcloud bot commented Nov 26, 2020

Kudos, SonarCloud Quality Gate passed!

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

@oplantalech
Copy link
Contributor Author

@inodb Can you merge this PR? Thanks!

@sheridancbio sheridancbio merged commit 8c9f4ec into cBioPortal:master Nov 30, 2020
@sheridancbio
Copy link
Contributor

Merged. Thanks!

@oplantalech oplantalech deleted the fix_long_prot_changes branch November 30, 2020 16:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants