add a new all_cases_with_sv_data to the validation process #6860

khzhu · 2019-11-25T21:44:48Z

Fix #6741

Describe changes proposed in this pull request:

add a new all_cases_with_structural_variant_data to the SampleListCategory class
add a new all_cases_with_structural_variant_data to processCaseListDirectory in the Validate script

sheridancbio · 2019-11-25T22:40:40Z

I've approved, but I wonder if this fix alone will fix #6741. Have we imported a file with both fully specified structural variants (Site1 and Site2 exon number and transcript id provided), and also non-specified structural variants (Site1 and Site2 exon number and transcript id blank or NA)? Is it properly visualized?

khzhu · 2019-11-25T22:45:17Z

do you have any test data for the second use case @sheridancbio? thanks!

n1zea144 · 2019-11-26T14:13:00Z

@khzhu The file in data hub for sv in msk-2017-impact should contain both types of data. Does it not? cc: @ritikakundra

n1zea144 · 2019-11-26T14:17:10Z

@ritikakundra @rmadupuri Can you make sure that you are using the same caselist name in the sv caselist for msk-2017-impact as @khzhu is using here?

Also, can you confirm that the data_SV.txt file on datahub contains our full set of SV data (both sv and fusion record types - see @sheridancbio's comment above)?

Thanks.

khzhu · 2019-11-26T18:17:19Z

Thanks, @n1zea144!
I checked the data_sv.txt file and found some events that are missing either site1_exome or site2_exome. Those cases were all filtered out by the importer script and did not make their way into the database. Should we relax the importing rules to let them go through?
We do not have any cases in data_sv.txt in which site1_exome and site2_exome are both null.

rmadupuri · 2019-11-26T19:47:48Z

core/src/main/java/org/mskcc/cbio/portal/model/SampleListCategory.java

@@ -52,6 +52,7 @@
    ALL_CASES_WITH_MUTATION_AND_CNA_DATA("all_cases_with_mutation_and_cna_data"),
    ALL_CASES_WITH_MUTATION_AND_CNA_AND_MRNA_DATA("all_cases_with_mutation_and_cna_and_mrna_data"),
    ALL_CASES_WITH_GSVA_DATA("all_cases_with_gsva_data"),
+    ALL_CASES_WITH_SV_DATA("all_cases_with_structural_variant_data"),


Hi @khzhu, we are using all_cases_with_sv_data as the category name instead of all_cases_with_structural_variant_data.

rmadupuri · 2019-11-26T19:49:18Z

core/src/main/scripts/importer/validateData.py

@@ -4253,6 +4253,7 @@ def processCaseListDirectory(caseListDir, cancerStudyId, logger,
                                'all_cases_with_mutation_and_cna_data',
                                'all_cases_with_mutation_and_cna_and_mrna_data',
                                'all_cases_with_gsva_data',
+                                'all_cases_with_structural_variant_data',


Same here, should be all_cases_with_sv_data

okay, will make corrections. thanks @rmadupuri!
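
The naming mismatch raised in this review can be illustrated with a minimal sketch. This is not the actual logic in validateData.py: `ALLOWED_CATEGORIES` and `check_case_list_category` are hypothetical names, and the set only contains the category strings visible in the diff context plus the name agreed on above.

```python
# Hypothetical sketch, not the real validateData.py code.
# Category strings come from the diff context above, plus the
# name agreed on in this review (all_cases_with_sv_data).
ALLOWED_CATEGORIES = frozenset([
    'all_cases_with_mutation_and_cna_data',
    'all_cases_with_mutation_and_cna_and_mrna_data',
    'all_cases_with_gsva_data',
    'all_cases_with_sv_data',
])


def check_case_list_category(category):
    """Return True if the case list category is in the allowed set."""
    return category in ALLOWED_CATEGORIES
```

Under this assumption, a case list that declares `all_cases_with_structural_variant_data` would be flagged, while one that declares `all_cases_with_sv_data` would pass, which is why the name in the validator has to match the name used in the datahub case lists.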

ritikakundra · 2019-11-26T19:59:03Z

@n1zea144 @khzhu Datahub has the full SV file (including both versions).

khzhu · 2019-11-26T20:09:38Z

thanks, @ritikakundra! I will modify the import script so that those events without exome numbers are not rejected.

sheridancbio · 2019-11-26T20:30:19Z

> Thanks, @n1zea144!
> I checked the data_sv.txt file and found some events that are missing either site1_exome or site2_exome. Those cases were all filtered out by the importer script and did not make their way into the database. Should we relax the importing rules to let them go through?
> We do not have any cases in data_sv.txt in which site1_exome and site2_exome are both null.

Hi Kelsey. This is just a partial answer to your question.

SV records fall into three categories during import, depending on 4 key fields:

- site 1 exon number
- site 2 exon number
- site 1 ensembl transcript id
- site 2 ensembl transcript id

The 3 categories of records are:

- records with all 4 pieces of information (fully specified)
- records with 1-3 pieces of information (partially specified)
- records with 0 pieces of information (unspecified)
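
As a rough illustration of the three categories, here is a minimal sketch that counts how many of the 4 key fields a record carries. The column names in `KEY_FIELDS` are hypothetical stand-ins (the real headers in data_SV.txt may differ), and treating only blank and `NA` as missing is an assumption based on the discussion in this thread, not the importer's actual rule.

```python
# Hypothetical column names for the four key fields; the real
# headers in data_SV.txt may differ.
KEY_FIELDS = (
    'site1_exon',
    'site2_exon',
    'site1_transcript_id',
    'site2_transcript_id',
)


def is_present(value):
    """Treat None, blank strings, and 'NA' as missing (an assumption)."""
    return value is not None and value.strip() not in ('', 'NA')


def classify_sv_record(record):
    """Classify a record (dict of column name -> string) by its key fields."""
    present = sum(1 for field in KEY_FIELDS if is_present(record.get(field)))
    if present == len(KEY_FIELDS):
        return 'fully specified'
    if present == 0:
        return 'unspecified'
    return 'partially specified'
```

A record with both exon numbers and both transcript ids would come back as fully specified, one with only a site 1 exon number as partially specified, and one with all four fields blank or `NA` as unspecified.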

In the (now hidden) documentation here:
https://github.com/cBioPortal/cbioportal/blob/cbdd9d2657fbe90f01d6d258798028b097ad33ae/docs/File-Formats.md#structural-variants-data

- the fully specified records are considered valid "fusion" events
- all records in the structural variant data file are "structural variant" records (this includes the "fusion" events and any other less specified events)

The frontend fails to render the studies (and maybe crashes - I cannot recall the exact details) when a genetic profile contains any partially specified records ... so that is why the filter is in place to remove these events during import. We cannot remove that filter until we write some kind of code to either handle these properly in the frontend, or to "downgrade" these records to remain as unspecified records. We have basically postponed this work / decision.

The other issue is that although the frontend can display genetic profiles made up entirely of unspecified records, or entirely of fully specified records, it cannot render a mixture, and the visual display crashes when opening the fusions tab.

If the data_SV.txt file from mskimpact_2017 study contains no unspecified records then for testing you can simply blank out (or make NA) the fields for some of the partially specified records which would demote them to become unspecified records and allow them to be imported. Then importing the profile will give a mixture of unspecified records and fully specified records ... and the frontend currently cannot render such a mixture. We are hoping the frontend can be massaged to render both of these categories at once.
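
The test-data preparation suggested here could be sketched as follows. This is only an illustration under stated assumptions: the column names in `KEY_FIELDS` are hypothetical, records are plain dicts of strings, and `demote_partial_records` is a made-up helper, not part of the importer.

```python
# Hypothetical column names for the four key fields; the real
# headers in data_SV.txt may differ.
KEY_FIELDS = (
    'site1_exon',
    'site2_exon',
    'site1_transcript_id',
    'site2_transcript_id',
)


def count_present(record):
    """Count key fields that are neither missing, blank, nor 'NA'."""
    values = (record.get(field) for field in KEY_FIELDS)
    return sum(1 for v in values if v is not None and v.strip() not in ('', 'NA'))


def demote_partial_records(records, placeholder='NA'):
    """Return copies of the records, with every partially specified
    record (1-3 key fields present) demoted to unspecified."""
    demoted = []
    for record in records:
        copy = dict(record)  # leave the caller's data untouched
        if 0 < count_present(copy) < len(KEY_FIELDS):
            for field in KEY_FIELDS:
                copy[field] = placeholder
        demoted.append(copy)
    return demoted
```

Fully specified and already unspecified records pass through unchanged, so the output is the kind of mixture of fully specified and unspecified records described above.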

I'll need to respond with more details later.

khzhu · 2019-11-26T20:53:03Z

Thanks, @sheridancbio, for the detailed information!
I will adjust the importing script to let category 2 go through. I hope that helps reproduce the error (fusion tab crash).
It does not seem like we have the third category, records with 0 pieces of information (unspecified), in the data_sv.txt downloaded from the data hub.

khzhu · 2019-11-26T21:45:01Z

Hi @sheridancbio , I've made changes to use all_cases_with_sv_data instead. The PR is ready to merge.

We will have a new PR for:

- importing the profile with a mixture of unspecified records and fully specified records
- enabling the frontend to render both of these categories:
  - records with all 4 pieces of information (fully specified)
  - records with 0 pieces of information (unspecified)

sheridancbio · 2019-11-28T17:03:42Z

I am canceling the request for review from @ritikakundra because she was part of the conversation about all_cases_with_structural_variant_data v. all_cases_with_sv_data with @rmadupuri yesterday and also agreed.

sheridancbio · 2019-11-28T17:13:06Z

I just noticed that I failed to squash the two changesets into one while merging. Apologies.

khzhu requested review from n1zea144 and sheridancbio November 25, 2019 21:44

sheridancbio approved these changes Nov 25, 2019

n1zea144 requested review from ritikakundra and rmadupuri November 26, 2019 14:15

rmadupuri requested changes Nov 26, 2019

khzhu added 2 commits November 26, 2019 13:28

add a new all_cases_with_structural_variant_data to SampleList Category

527ee33

resolving code review comments

cd1b6af

khzhu force-pushed the fix-fusion-tab-issue-6741 branch from 86cbcee to cd1b6af Compare November 26, 2019 21:31

n1zea144 approved these changes Nov 27, 2019

rmadupuri approved these changes Nov 27, 2019

sheridancbio removed the request for review from ritikakundra November 28, 2019 17:03

sheridancbio merged commit 8e7d58d into cBioPortal:master Nov 28, 2019

sheridancbio changed the title ~~add a new all_cases_with_structural_variant_data to the validation process~~ add a new all_cases_with_sv_data to the validation process Nov 28, 2019

khzhu deleted the fix-fusion-tab-issue-6741 branch November 28, 2019 19:40

inodb added the bug label Nov 29, 2019

khzhu mentioned this pull request Dec 3, 2019

hide fusion gene table in the summary view if no fusion data present … #6731

Closed

9 tasks
