support multiple reference genomes 2nd try #5891

khzhu · 2019-03-24T23:49:27Z

What? Why?

Fix #5652
Changes proposed in this pull request:

remove cytoband & length from the gene table
replace entrez_gene_id with genetic_entity_id in reference_genome_gene table to support gene isoforms
add REFERENCE_GENOME_ID to the cancer_study table
- if no REFERENCE_GENOME_ID specified in the study meta file use default GRCh37/hg19
add a new API endpoint reference-genome-genes for querying gene related annotation such as cytoband/exonic length/chromosome from reference_genome_gene table
for mutations, use the CHR in the mutation_event table instead of calculating chromosome from the cytoband recorded in the gene table
add a database constraint to ensure one species per portal installation
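
The default-genome fallback described in the list above can be sketched roughly as follows. This is only an illustration, not the importer's actual code: the function name `resolve_reference_genome`, the meta key `reference_genome`, and the dict representation of the meta file are assumptions made for this example. Only the hg19/GRCh37 default comes from the description.

```python
# Hypothetical sketch: pick a study's reference genome, falling back to hg19.
DEFAULT_REFERENCE_GENOME = "hg19"  # UCSC name for GRCh37, the default named in this PR


def resolve_reference_genome(study_meta):
    """Return the genome named in the study meta fields, or the default.

    `study_meta` is a dict of key/value pairs parsed from a study meta file.
    The key name `reference_genome` is an assumption for this example.
    """
    value = study_meta.get("reference_genome", "").strip()
    return value if value else DEFAULT_REFERENCE_GENOME
```

Treating an empty or whitespace-only value the same as a missing key keeps existing studies, which have no such field, on GRCh37/hg19.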

Checks

Runs on Heroku.
Follows 7 rules of great commit messages. For most PRs a single commit should suffice, in some cases multiple topical commits can be useful. During review it is ok to see tiny commits (e.g. Fix reviewer comments), but right before the code gets merged to master or rc branch, any such commits should be squashed since they are useless to the other developers. Definitely avoid merge commits, use rebase instead.
Follows the Google Style Guide.
If this is a feature, the PR is to rc. If this is a bug fix, the PR is to master.

Any screenshots or GIFs?

If this is a new visual feature please add a before/after screenshot or gif
here with e.g. GifGrabber.

sheridancbio · 2019-03-28T14:05:15Z

Hi Kelsey. I'm going to start reviewing this. It looks good, but it is pretty big. Could you give a few suggestions on what would be the best classes / configuration to start looking at to understand the important changes? (Advice on how to navigate through the PR and understand the important stuff early in the process) Thanks.

sheridancbio · 2019-03-28T14:11:38Z

I see several files with name "pom.xml.new-version" checked in. Aren't these artifacts of the maven build script? I think these can be dropped from the PR.

sheridancbio · 2019-03-28T15:48:32Z

I see this image file added, but no references to it. I wonder... is there another PR for frontend changes connected to this work?
portal/src/main/webapp/images/uhn-logo.png

khzhu · 2019-03-28T16:17:08Z

@sheridancbio, oops, that should not have been included; I will remove it. Thanks!

khzhu · 2019-03-28T16:32:32Z

@sheridancbio, yes, they should be removed as well. I used "git add -A" because of the large number of changed files, and it picked up some unwanted files too. Sorry!

khzhu · 2019-03-28T17:24:21Z

@sheridancbio, the extra files have been removed and the change has been pushed.

sheridancbio · 2019-04-02T15:35:33Z

Currently there are three service module unit tests failing in Travis CI:
getExpressionEnrichments(org.cbioportal.service.impl.ExpressionEnrichmentServiceImplTest): expected: but was:<->
getGeneCorrelationForQueriedGene(org.cbioportal.service.impl.CoExpressionServiceImplTest): expected: but was:<->
fetchGeneCoExpressions(org.cbioportal.service.impl.CoExpressionServiceImplTest): expected: but was:<->

khzhu · 2019-04-02T16:27:45Z

@sheridancbio, I will take a look. The PR passed all unit tests when I was testing. Thanks!

khzhu · 2019-04-02T17:49:07Z

@sheridancbio, I pushed the change; it should fix the failing unit tests. Thanks!

khzhu · 2019-04-05T15:01:37Z

Hi @sheridancbio, please find the updated schema diagram. I will update the code once I have made the changes. Thanks!

khzhu · 2019-04-05T18:31:22Z

Hi @sheridancbio, I pushed the change and it is ready for review. Thanks!

sheridancbio

So far so good. This is only a partial review, covering the core dao persistence classes. I will continue a pass through the entire code before making change requests.

I make lots of comments while doing code review ... not every comment needs to be addressed - sometimes I am just making notes so that I understand the intended changes, and my own thought process as I come to understand the new features. But I have raised some issues that we probably should talk about soon too.

business/src/main/resources/org/mskcc/cbio/portal/persistence/GenePanelMapperLegacy.xml

business/src/main/resources/org/mskcc/cbio/portal/persistence/StudyMapperLegacy.xml

core/pom.xml

core/src/main/java/org/mskcc/cbio/portal/dao/DaoCancerStudy.java

core/src/main/java/org/mskcc/cbio/portal/dao/DaoReferenceGenomeGene.java

sheridancbio · 2019-04-05T22:32:00Z

core/src/main/java/org/mskcc/cbio/portal/dao/JdbcUtil.java

@@ -56,7 +56,7 @@
     */
    public static DataSource getDataSource() {
        if (dataSource == null) {
-            dataSource = new JdbcDataSource();


We need to be careful here. There is code which wires beans appropriately during spring startup for determination of the data source based on portal.properties and application-context files. We use a shared data source within tomcat when running multiple instances of cBioPortal within a single tomcat application. It seems like this code change may hardwire a connection pool through the apache BasicDataSource library onto each running tomcat web application.
We will do some testing of this. Maybe @n1zea144 can comment on this code change.

khzhu · 2019-04-10

Good points, this matters if you need to run multiple portal instances in a single Tomcat container. I will revert the change.

khzhu · 2019-04-10

@sheridancbio, there is now only one change, made to the JdbcDataSource constructor: it replaces the hard-coded MySQL driver class name with the value read from the portal.properties file, as follows:
from

this.setDriverClassName("com.mysql.jdbc.Driver");

to

String mysqlDriverClassName = dbProperties.getDbDriverClassName();
this.setDriverClassName(mysqlDriverClassName);
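
The same idea, reading the driver class name from a properties file rather than hard-coding it, is sketched below in Python purely as an illustration of the logic. The property key `db.driver`, the function name `read_driver_class_name`, and the fallback to the previously hard-coded value are all assumptions for this example; the thread does not say what happens when the property is missing.

```python
# Hypothetical sketch: look up the JDBC driver class name in properties text.
DEFAULT_DRIVER = "com.mysql.jdbc.Driver"  # the value that used to be hard-coded


def read_driver_class_name(properties_text, key="db.driver"):
    """Return the value of `key` from Java-style properties text.

    Falls back to DEFAULT_DRIVER when the key is absent or empty
    (an assumption for this example, not confirmed by the PR).
    """
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and lines that are not key=value pairs.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key and value.strip():
            return value.strip()
    return DEFAULT_DRIVER
```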

sheridancbio · 2019-05-13

Reminder to self: let's test this code in Tomcat.

core/src/main/java/org/mskcc/cbio/portal/dao/JdbcUtil.java

khzhu · 2019-04-05T23:30:02Z

@sheridancbio , thank you so much for the feedback. I will go through each of your comments/suggestions over the weekend.

inodb · 2019-06-28T22:58:14Z

service/src/test/java/org/cbioportal/service/util/AlterationEnrichmentUtilTest.java

@@ -120,7 +121,7 @@ public void createAlterationEnrichments() throws Exception {
        AlterationEnrichment alterationEnrichment2 = result.get(1);
        Assert.assertEquals((Integer) 3, alterationEnrichment2.getEntrezGeneId());
        Assert.assertEquals("HUGO3", alterationEnrichment2.getHugoGeneSymbol());
-        Assert.assertEquals("CYTOBAND3", alterationEnrichment2.getCytoband());
+        Assert.assertEquals(null, alterationEnrichment2.getCytoband());


this seems incorrect?

khzhu · 2019-06-28

Hi @inodb, that is right. For alterations we will have to fill in cytobands in the frontend, since alterations have no reference genome information associated with them. We will have to rely on the reference genome information of the study to backfill cytobands in the frontend.
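
The backfill described in this reply could look roughly like the sketch below. The real work would happen in the frontend, so this Python version only illustrates the lookup logic. The function name `backfill_cytobands`, the dict-based records, and the `(genome, Entrez gene id)` index are assumptions for this example.

```python
# Hypothetical sketch: fill missing cytobands using the study's reference genome.
def backfill_cytobands(enrichments, study_genome, cytoband_index):
    """Return copies of `enrichments` with `cytoband` filled where known.

    `cytoband_index` maps (genome name, Entrez gene id) -> cytoband string.
    In practice this data would come from reference-genome gene annotation.
    """
    filled = []
    for item in enrichments:
        updated = dict(item)  # copy so the caller's records are not mutated
        if updated.get("cytoband") is None:
            updated["cytoband"] = cytoband_index.get(
                (study_genome, updated["entrezGeneId"])
            )
        filled.append(updated)
    return filled
```

Genes missing from the index simply keep a `None` cytoband, which matches what the updated unit test expects from the backend.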

khzhu · 2019-06-29T16:33:47Z

Hi @ao508 @sheridancbio @pieterlukasse @inodb ,
rebased to the latest rc, addressed all review comments and passed CI 😄. Please resolve your comments and approve the PR or let me know anything I missed. Thanks again for all your time and suggestions!

ao508

made some additional (minor) requests. let me know if you have any questions.

ao508 · 2019-07-10T17:41:09Z

core/src/main/java/org/mskcc/cbio/portal/scripts/ImportPathwayCommonsExtSif.java

@@ -82,10 +82,10 @@ public void importData() throws IOException, DaoException {

            String geneAId = parts[0];

-            CanonicalGene geneA = daoGene.getNonAmbiguousGene(geneAId, null);
+            CanonicalGene geneA = daoGene.getNonAmbiguousGene(geneAId);


@khzhu Just wanted to clarify a bit more: changing the call daoGene.getNonAmbiguousGene(geneAId, null) to daoGene.getNonAmbiguousGene(geneAId) isn't consistent with other calls that were modified in this PR, such as the one below at L88. Why was the second parameter dropped above but set to true in the call below?

ao508 · 2019-07-10T17:44:46Z

core/src/main/scripts/importer/validateData.py

+
+        #Set defaults for genome version and species
+        self.__species = 'human'
+        self.__ncbi_build = '37'


Should 37 be changed to GRCh37 to be consistent with how we refer to NCBI builds in the Java code? See here.

ao508 · 2019-07-10T17:46:01Z

db-scripts/src/main/resources/migration.sql

@@ -396,12 +396,21 @@ CREATE TABLE `reference_genome` (
    UNIQUE INDEX `BUILD_NAME_UNIQUE` (`BUILD_NAME` ASC)
 );

+<<<<<<< HEAD


unmerged file conflict

ao508 · 2019-07-10T17:46:15Z

db-scripts/src/main/resources/migration.sql

 INSERT INTO `reference_genome`
 VALUES (1, 'human', 'hg19', 'GRCh37', NULL, 'http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips', '2009-02-01');
 INSERT INTO `reference_genome`
 VALUES (2, 'human', 'hg38', 'GRCh38', NULL, 'http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips', '2013-12-01');
 INSERT INTO `reference_genome`
 VALUES (3, 'mouse', 'mm10', 'GRCm38', NULL, 'http://hgdownload.cse.ucsc.edu//goldenPath/mm10/bigZips', '2012-01-01');
+>>>>>>> resolving code review comments/suggestions


another conflict
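
Leftover markers like the two flagged in these comments can be caught with a simple scan before a branch is pushed. A minimal Python sketch follows; the function name `find_conflict_markers` is hypothetical and not part of this repository.

```python
# Hypothetical sketch: report lines that look like Git conflict markers.
def find_conflict_markers(text):
    """Return (line_number, line) pairs for conflict-marker lines in `text`."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        is_edge = line.startswith(("<<<<<<< ", ">>>>>>> "))
        is_separator = line.rstrip() == "======="
        if is_edge or is_separator:
            hits.append((number, line))
    return hits
```

The separator test is deliberately strict (the whole line must be seven equals signs) so that ordinary lines containing `=` are not reported.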

ao508 · 2019-07-10T17:47:44Z

docs/File-Formats.md

@@ -620,7 +620,7 @@ The extended MAF format recognized by the portal has:
 1. **Hugo_Symbol (Required)**: A [HUGO](http://www.genenames.org/) gene symbol.
 2. **Entrez_Gene_Id (Optional, but recommended)**: A [Entrez Gene](http://www.ncbi.nlm.nih.gov/gene) identifier.
 3. **Center (Optional)**: The sequencing center.
-4. **NCBI_Build (Optional)<sup>1</sup>**: Must be "GRCh37" for human, and "GRCm38" for mouse.
+4. **NCBI_Build (Required)<sup>1</sup>**: The Genome Reference Consortium Build is used by a variant calling software. It must be "GRCh37" or "GRCh38" for a human, and "GRCm38" for a mouse.


@khzhu we seem to refer to the NCBI builds as GRCh*, but the Python scripts refer to (or expect) just the numbers. These should match. I think the Python scripts should be changed to follow GRCh* for consistency.
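
One way to reconcile the two conventions is to normalize bare build numbers to the GRC-style names. This is only a sketch: the function name `normalize_ncbi_build` and the prefix table are assumptions for this example, using the human and mouse build names that appear in this PR.

```python
# Hypothetical sketch: turn a bare build number into a GRC-style build name.
NUMERIC_BUILD_PREFIX = {"human": "GRCh", "mouse": "GRCm"}


def normalize_ncbi_build(build, species="human"):
    """Return e.g. 'GRCh37' for '37'; leave already-prefixed names unchanged.

    Raises KeyError for a species that is not in NUMERIC_BUILD_PREFIX.
    """
    value = str(build).strip()
    if value.isdigit():
        return NUMERIC_BUILD_PREFIX[species] + value
    return value
```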

ao508 · 2019-07-10T17:48:10Z

docs/Using-the-dataset-validator.md

+  -ucsc UCSC_BUILD_NAME, --ucsc_build_name UCSC_BUILD_NAME
+                        UCSC reference genome assembly name (default: assumed hg19)
+  -ncbi NCBI_BUILD_NUMBER, --ncbi_build_number NCBI_BUILD_NUMBER
+                        NCBI reference genome build number (default: assumed 37 for UCSC reference genome build hg19)


use GRCh*

ao508 · 2019-07-10T17:48:22Z

docs/Using-the-dataset-validator.md

+                        species information (default: assumed human)
+  -ucsc UCSC_BUILD_NAME, --ucsc_build_name UCSC_BUILD_NAME
+                        UCSC reference genome assembly name (default: assumed hg19)
+  -ncbi NCBI_BUILD_NUMBER, --ncbi_build_number NCBI_BUILD_NUMBER


same comment here

ao508 · 2019-07-10T17:50:50Z

service/src/main/java/org/cbioportal/service/util/AlterationEnrichmentUtil.java

@@ -62,7 +62,7 @@
            AlterationEnrichment alterationEnrichment = new AlterationEnrichment();
            alterationEnrichment.setEntrezGeneId(gene.getEntrezGeneId());
            alterationEnrichment.setHugoGeneSymbol(gene.getHugoGeneSymbol());
-            alterationEnrichment.setCytoband(gene.getCytoband());
+            //alterationEnrichment.setCytoband(gene.getCytoband());


can we remove this if it's not going to be used or is there a plan to replace it with the reference_genome_gene.CYTOBAND? If so then can you add a TODO here?

khzhu · 2019-07-11T16:28:41Z

@ao508, replaced those 37 values with GRCh37 in the validator documents. Also removed the commented-out line. For alterations, cytobands will be filled in by the frontend React module.

khzhu · 2019-07-11T17:10:32Z

@ao508 @inodb, I have addressed all your new comments; please do your final review. Thanks!

quote MySQL reserved keywords
adjust column sizes in the mutation_event table to be compatible with UTF8
add new options (species and reference genome build) to the Python importer/validation scripts
check that reference_genome_id in the seg file and NCBI_Build (if filled) in the MAF file agree with the reference genome in the cancer study meta file
add a new API endpoint to fetch reference genome genes by Entrez gene IDs and HUGO gene symbols
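
The MAF consistency check mentioned in this commit message might be sketched as below. This is not the validator's actual implementation: the function name `find_build_mismatches` is hypothetical, and the sketch assumes a tab-separated MAF with a header row and an `NCBI_Build` column.

```python
import csv


# Hypothetical sketch: flag MAF rows whose NCBI_Build disagrees with the study.
def find_build_mismatches(maf_text, expected_build):
    """Return (data_row_number, value) for rows with a conflicting NCBI_Build.

    Rows with an empty NCBI_Build are skipped, since the check applies only
    when the column is filled in. Comment lines starting with '#' are ignored.
    """
    lines = [line for line in maf_text.splitlines() if not line.startswith("#")]
    reader = csv.DictReader(lines, delimiter="\t")
    mismatches = []
    for row_number, row in enumerate(reader, start=1):
        value = (row.get("NCBI_Build") or "").strip()
        if value and value != expected_build:
            mismatches.append((row_number, value))
    return mismatches
```

Row numbers count data rows only, starting at 1 after the header, so they need an offset before being reported as file line numbers.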

- structural variant seed data - db version to 2.11.0


khzhu requested review from pieterlukasse, jjgao and adamabeshouse March 24, 2019 23:56

jjgao requested review from n1zea144 and sheridancbio and removed request for adamabeshouse March 25, 2019 17:31

khzhu force-pushed the multiple-genome-build-5652 branch 2 times, most recently from 90b1416 to 354c316 Compare March 27, 2019 19:33

sheridancbio added the enhancement, rc (applies to the `rc` development branch), includes db changes, and api labels Mar 28, 2019

khzhu force-pushed the multiple-genome-build-5652 branch from 354c316 to bc9fb5a Compare March 28, 2019 17:23

khzhu force-pushed the multiple-genome-build-5652 branch from bc9fb5a to 5465e10 Compare April 1, 2019 03:22

khzhu force-pushed the multiple-genome-build-5652 branch from 5465e10 to 7ac6ff4 Compare April 2, 2019 17:44

khzhu force-pushed the multiple-genome-build-5652 branch from 7ac6ff4 to 145e590 Compare April 5, 2019 18:27

khzhu force-pushed the multiple-genome-build-5652 branch from 145e590 to 36d5b5c Compare April 5, 2019 20:45

sheridancbio reviewed Apr 5, 2019


inodb reviewed Jun 28, 2019


inodb requested review from sheridancbio, ao508, onursumer, averyniceday, inodb and pieterlukasse June 29, 2019 18:20

khzhu force-pushed the multiple-genome-build-5652 branch from 3948ce9 to d2fec9d Compare June 30, 2019 16:21

inodb changed the base branch from rc to release-3.1.0 July 8, 2019 19:33

ao508 requested changes Jul 10, 2019


khzhu force-pushed the multiple-genome-build-5652 branch from d2fec9d to eade653 Compare July 11, 2019 16:46

lemccarthy and others added 9 commits July 12, 2019 11:29

Removed an unused class

64bc1c3

Update InvolvedCancerStudyExtractorInterceptor.java

2bf032b

resolving code review comments/suggestions

b1b65c7

Fixes after rebase on rc

ab0e08d

- structural variant seed data - db version to 2.11.0

resolving code review comments/suggestions

8a067ad

resolving merge conflicts

7aac075

fix unusedclass conflict

a46a54a

inodb force-pushed the multiple-genome-build-5652 branch from eade653 to a46a54a Compare July 12, 2019 16:01

inodb requested a review from ao508 July 12, 2019 16:50

ao508 approved these changes Jul 12, 2019


inodb merged commit 484a7b9 into cBioPortal:release-3.1.0 Jul 17, 2019

sheridancbio removed their assignment Jul 17, 2019

leexgh mentioned this pull request Aug 7, 2019

look at tests failing in frontend release-3.1.0 #6476

Closed

jagnathan deleted the multiple-genome-build-5652 branch June 2, 2021 16:57
