add a new all_cases_with_sv_data to the validation process #6860

Merged
merged 2 commits into cBioPortal:master from fix-fusion-tab-issue-6741 on Nov 28, 2019

Conversation

Contributor

@khzhu khzhu commented Nov 25, 2019

Fix #6741

Describe changes proposed in this pull request:

  • add a new all_cases_with_structural_variant_data to the SampleListCategory class
  • add a new all_cases_with_structural_variant_data to processCaseListDirectory in the Validate script

Checks

  • Runs on heroku
  • Has tests or has a separate issue that describes the types of test that should be created. If no test is included it should explicitly be mentioned in the PR why there is no test.
  • The commit log is comprehensible. It follows 7 rules of great commit messages. For most PRs a single commit should suffice, in some cases multiple topical commits can be useful. During review it is ok to see tiny commits (e.g. Fix reviewer comments), but right before the code gets merged to master or rc branch, any such commits should be squashed since they are useless to the other developers. Definitely avoid merge commits, use rebase instead.
  • Is this PR adding logic based on one or more clinical attributes? If yes, please make sure validation for this attribute is also present in the data validation / data loading layers (in backend repo) and documented in File-Formats Clinical data section!

Any screenshots or GIFs?

If this is a new visual feature please add a before/after screenshot or gif
here with e.g. GifGrabber.

Notify reviewers

Read our Pull request merging policy. It can help to figure out who worked on the file before you. Please use git blame <filename> to determine that and notify them either through slack or by assigning them as a reviewer on the PR.

@sheridancbio
Contributor

I've approved, but I wonder if this fix alone will fix #6741. Have we imported a file with both fully specified structural variants (Site1 and Site2 exon number and transcript id provided), and also non-specified structural variants (Site1 and Site2 exon number and transcript id blank or NA)? Is it properly visualized?

@khzhu
Contributor Author

khzhu commented Nov 25, 2019

do you have any test data for the second use case @sheridancbio? thanks!

@n1zea144
Contributor

n1zea144 commented Nov 26, 2019

@khzhu The SV file in data hub for msk-2017-impact should contain both types of data. Does it not? cc: @ritikakundra

@n1zea144
Contributor

@ritikakundra @rmadupuri Can you make sure that you are using the same caselist name in the sv caselist for msk-2017-impact as @khzhu is using here?

Also, can you confirm that the data_SV.txt file on datahub contains our full set of SV data (both sv and fusion record types; see @sheridancbio's comment above)?

Thanks.

@khzhu
Contributor Author

khzhu commented Nov 26, 2019

Thanks, @n1zea144!
I checked the data_sv.txt file and found some events that lack either site1_exome or site2_exome. Those cases were all filtered out by the importer script and did not make their way into the database. Should we relax the import rules to let them through?
The data_sv.txt file does not include any cases in which site1_exome and site2_exome are both null.

@@ -52,6 +52,7 @@
ALL_CASES_WITH_MUTATION_AND_CNA_DATA("all_cases_with_mutation_and_cna_data"),
ALL_CASES_WITH_MUTATION_AND_CNA_AND_MRNA_DATA("all_cases_with_mutation_and_cna_and_mrna_data"),
ALL_CASES_WITH_GSVA_DATA("all_cases_with_gsva_data"),
ALL_CASES_WITH_SV_DATA("all_cases_with_structural_variant_data"),

Contributor

Hi @khzhu, we are using all_cases_with_sv_data as the category name instead of all_cases_with_structural_variant_data.

@@ -4253,6 +4253,7 @@ def processCaseListDirectory(caseListDir, cancerStudyId, logger,
'all_cases_with_mutation_and_cna_data',
'all_cases_with_mutation_and_cna_and_mrna_data',
'all_cases_with_gsva_data',
'all_cases_with_structural_variant_data',

Contributor

Same here, should be all_cases_with_sv_data

Contributor Author

okay, will make corrections. thanks @rmadupuri!

Contributor Author

done.
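
For readers following the rename, the effect on validation can be sketched roughly as below. This is only an illustration under stated assumptions: the list holds just the entries visible in the diff context plus the corrected name, and `is_known_category` is a hypothetical helper, not a function from the Validate script (the body of processCaseListDirectory is not shown in this thread).

```python
# Illustrative subset only: the three neighbouring entries visible in the
# diff context, plus the corrected category name agreed on in review.
ALLOWED_CATEGORIES = [
    'all_cases_with_mutation_and_cna_data',
    'all_cases_with_mutation_and_cna_and_mrna_data',
    'all_cases_with_gsva_data',
    'all_cases_with_sv_data',
]


def is_known_category(category):
    """Hypothetical helper: True if the case list category is recognised."""
    return category in ALLOWED_CATEGORIES
```

With this list, a case list declaring all_cases_with_sv_data would be accepted, while one declaring the longer all_cases_with_structural_variant_data would not.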

@ritikakundra
Contributor

@n1zea144 @khzhu Datahub has the full SV file (including both versions).

@khzhu
Contributor Author

khzhu commented Nov 26, 2019

thanks, @ritikakundra! I will modify the import script so that events without exome numbers are not rejected.

@sheridancbio
Contributor

> Thanks, @n1zea144!
> I checked the data_sv.txt file and found some events that lack either site1_exome or site2_exome. Those cases were all filtered out by the importer script and did not make their way into the database. Should we relax the import rules to let them through?
> The data_sv.txt file does not include any cases in which site1_exome and site2_exome are both null.

Hi Kelsey. This is just a partial answer to your question.

SV records are sorted into categories during import according to how many of the following 4 key fields are provided:

  • site 1 exon number
  • site 2 exon number
  • site 1 ensembl transcript id
  • site 2 ensembl transcript id

There are basically 3 categories of records:

  1. records with all 4 pieces of information (fully specified)
  2. records with 1-3 pieces of information (partially specified)
  3. records with 0 pieces of information (unspecified)
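
As a rough illustration of the three categories above, a record could be classified by counting how many of the 4 key fields are populated. This is only a sketch: the field names and the rule that blank or NA counts as missing are assumptions made for the example, not the importer's actual code.

```python
# Hypothetical field names; the real column headers in the data file may differ.
KEY_FIELDS = (
    "site1_exon",
    "site2_exon",
    "site1_ensembl_transcript_id",
    "site2_ensembl_transcript_id",
)

# Assumption: an empty string or "NA" means the value was not provided.
MISSING_VALUES = {"", "NA"}


def is_missing(value):
    """Treat None, blank strings, and NA as 'not provided'."""
    return value is None or str(value).strip() in MISSING_VALUES


def classify_sv_record(record):
    """Return the category name for a record given as a dict of fields."""
    provided = sum(1 for field in KEY_FIELDS if not is_missing(record.get(field)))
    if provided == len(KEY_FIELDS):
        return "fully specified"
    if provided == 0:
        return "unspecified"
    return "partially specified"
```

For example, a record with only site1_exon filled in would be classified as partially specified under these assumptions.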

In the (now hidden) documentation here:
https://github.com/cBioPortal/cbioportal/blob/cbdd9d2657fbe90f01d6d258798028b097ad33ae/docs/File-Formats.md#structural-variants-data

  • the fully specified records are considered valid "fusion" events
  • all records in the structural variant data file are "structural variant" records (this includes the "fusion" events and any other less specified events)

The frontend fails to render the studies (and maybe crashes - I cannot recall the exact details) when a genetic profile contains any partially specified records ... so that is why the filter is in place to remove these events during import. We cannot remove that filter until we write some kind of code to either handle these properly in the frontend, or to "downgrade" these records to remain as unspecified records. We have basically postponed this work / decision.

The other issue is that although the frontend can display genetic profiles made up entirely of unspecified records, or entirely of fully specified records, it cannot render a mixture, and the visual display crashes when opening the fusions tab.

If the data_SV.txt file from the mskimpact_2017 study contains no unspecified records, then for testing you can simply blank out (or set to NA) the fields for some of the partially specified records, which demotes them to unspecified records and allows them to be imported. Importing the profile will then give a mixture of unspecified and fully specified records, which the frontend currently cannot render. We are hoping the frontend can be massaged to render both of these categories at once.
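
The "blank out for testing" idea could look something like the sketch below. It is an assumption-laden illustration, not part of the importer: the field names, the use of NA as the blank value, and the `demote_if_partial` helper are all hypothetical.

```python
# Hypothetical field names; the real column headers in data_SV.txt may differ.
KEY_FIELDS = (
    "site1_exon",
    "site2_exon",
    "site1_ensembl_transcript_id",
    "site2_ensembl_transcript_id",
)

# Assumption: an empty string or "NA" means the value was not provided.
MISSING_VALUES = {"", "NA"}


def is_missing(value):
    """Treat None, blank strings, and NA as 'not provided'."""
    return value is None or str(value).strip() in MISSING_VALUES


def demote_if_partial(record):
    """Return a copy of the record; if 1-3 key fields are provided,
    set all 4 key fields to NA so the record becomes unspecified."""
    demoted = dict(record)
    provided = sum(1 for field in KEY_FIELDS if not is_missing(record.get(field)))
    if 0 < provided < len(KEY_FIELDS):
        for field in KEY_FIELDS:
            demoted[field] = "NA"
    return demoted
```

Fully specified and already unspecified records pass through unchanged, so applying this to a handful of partially specified rows should yield the mixed profile described above for reproducing the crash.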

I'll need to respond with more details later.

@khzhu
Contributor Author

khzhu commented Nov 26, 2019

thanks, @sheridancbio, for the detailed information!
I will adjust the import script to let category 2 go through. Hopefully that will help reproduce the error (fusion tab crash).
It does not seem like we have the third category, records with 0 pieces of information (unspecified), in the data_sv.txt downloaded from the data hub.

@khzhu
Contributor Author

khzhu commented Nov 26, 2019

Hi @sheridancbio, I've made changes to use all_cases_with_sv_data instead. The PR is ready to merge.

We will have a new PR for

  • importing the profile with a mixture of unspecified records and fully specified records
  • making the frontend render both of these categories:
    - records with all 4 pieces of information (fully specified)
    - records with 0 pieces of information (unspecified)

@sheridancbio
Contributor

I am canceling the request for review from @ritikakundra because she was part of the conversation about all_cases_with_structural_variant_data v. all_cases_with_sv_data with @rmadupuri yesterday and also agreed.

@sheridancbio sheridancbio removed the request for review from ritikakundra November 28, 2019 17:03
@sheridancbio sheridancbio merged commit 8e7d58d into cBioPortal:master Nov 28, 2019
@sheridancbio sheridancbio changed the title add a new all_cases_with_structural_variant_data to the validation process add a new all_cases_with_sv_data to the validation process Nov 28, 2019
@sheridancbio
Contributor

I just noticed that I failed to squash the two changesets into one while merging. Apologies.

@khzhu khzhu deleted the fix-fusion-tab-issue-6741 branch November 28, 2019 19:40
@inodb inodb added the bug label Nov 29, 2019
Successfully merging this pull request may close these issues.

Fusion tab crashes on mixed structural variant types
6 participants