Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support multiple reference genomes 2nd try #5891

Merged
merged 9 commits into from Jul 17, 2019

Conversation

khzhu
Copy link
Contributor

@khzhu khzhu commented Mar 24, 2019

What? Why?

Fix
#5652
Changes proposed in this pull request:

  • remove cytoband & length from the gene table
  • replace entrez_gene_id with genetic_entity_id in reference_genome_gene table to support gene isoforms
  • add REFERENCE_GENOME_ID to the cancer_study table
    • if no REFERENCE_GENOME_ID specified in the study meta file use default GRCh37/hg19
  • add a new API endpoint reference-genome-genes for querying gene related annotation such as cytoband/exonic length/chromosome from reference_genome_gene table
  • for mutations, use the CHR in the mutation_event table instead of calculating chromosome from the cytoband recorded in the gene table
  • add a database constraint to ensure one species per portal installation

Checks

  • Runs on Heroku.
  • Follows 7 rules of great commit messages. For most PRs a single commit should suffice, in some cases multiple topical commits can be useful. During review it is ok to see tiny commits (e.g. Fix reviewer comments), but right before the code gets merged to master or rc branch, any such commits should be squashed since they are useless to the other developers. Definitely avoid merge commits, use rebase instead.
  • Follows the Google Style Guide.
  • If this is a feature, the PR is to rc. If this is a bug fix, the PR is to master.

Any screenshots or GIFs?

If this is a new visual feature please add a before/after screenshot or gif
here with e.g. GifGrabber.

@jjgao jjgao requested review from n1zea144 and sheridancbio and removed request for adamabeshouse March 25, 2019 17:31
@khzhu khzhu force-pushed the multiple-genome-build-5652 branch 2 times, most recently from 90b1416 to 354c316 Compare March 27, 2019 19:33
@sheridancbio
Copy link
Contributor

Hi Kelsey. I'm going to start reviewing this. It looks good, but it is pretty big. Could you give a few suggestions on what would be the best classes / configuration to start looking at to understand the important changes? (Advice on how to navigate through the PR and understand the important stuff early in the process) Thanks.

@sheridancbio sheridancbio added enhancement rc applies to the `rc` development branch includes db changes api labels Mar 28, 2019
@sheridancbio
Copy link
Contributor

I see several files with name "pom.xml.new-version" checked in. Aren't these artifacts of the maven build script? I think these can be dropped from the PR.

@sheridancbio
Copy link
Contributor

I see this image file added, but no references to it. I wonder... is there another PR for frontend changes connected to this work?
portal/src/main/webapp/images/uhn-logo.png

@khzhu
Copy link
Contributor Author

khzhu commented Mar 28, 2019

@sheridancbio , oops, should not be included, will remove it. Thanks!

@khzhu
Copy link
Contributor Author

khzhu commented Mar 28, 2019

@sheridancbio , yes, they should be removed as well. I used "git add -A" due to a large number of files changed, that picked some unwanted files as well. Sorry!

@khzhu khzhu force-pushed the multiple-genome-build-5652 branch from 354c316 to bc9fb5a Compare March 28, 2019 17:23
@khzhu
Copy link
Contributor Author

khzhu commented Mar 28, 2019

@sheridancbio , extra files removed and push the change back.

@khzhu khzhu force-pushed the multiple-genome-build-5652 branch from bc9fb5a to 5465e10 Compare April 1, 2019 03:22
@sheridancbio
Copy link
Contributor

sheridancbio commented Apr 2, 2019

currently there are three service module unit tests failing in travis-ci:
getExpressionEnrichments(org.cbioportal.service.impl.ExpressionEnrichmentServiceImplTest): expected: but was:<->
getGeneCorrelationForQueriedGene(org.cbioportal.service.impl.CoExpressionServiceImplTest): expected: but was:<->
fetchGeneCoExpressions(org.cbioportal.service.impl.CoExpressionServiceImplTest): expected: but was:<->

@khzhu
Copy link
Contributor Author

khzhu commented Apr 2, 2019

@sheridancbio , will take a look. The PR passed all unit testing when I was testing. thanks!

@khzhu khzhu force-pushed the multiple-genome-build-5652 branch from 5465e10 to 7ac6ff4 Compare April 2, 2019 17:44
@khzhu
Copy link
Contributor Author

khzhu commented Apr 2, 2019

@sheridancbio , pushed the change back, should fix those failed unit testing. Thanks!

@khzhu
Copy link
Contributor Author

khzhu commented Apr 5, 2019

hi @sheridancbio , please find updated schema diagram. I will update the code once I made changes. Thanks!
support-multiple-genomes-5652-revised

@khzhu khzhu force-pushed the multiple-genome-build-5652 branch from 7ac6ff4 to 145e590 Compare April 5, 2019 18:27
@khzhu
Copy link
Contributor Author

khzhu commented Apr 5, 2019

Hi @sheridancbio , pushed the change back and is ready for the review. Thanks!

@khzhu khzhu force-pushed the multiple-genome-build-5652 branch from 145e590 to 36d5b5c Compare April 5, 2019 20:45
Copy link
Contributor

@sheridancbio sheridancbio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far so good. This is only a partial review, covering the core dao persistence classes. I will continue a pass through the entire code before making change requests.

I make lots of comments while doing code review ... not every comment needs to be addressed - sometimes I am just making notes so that I understand the intended changes, and my own thought process as I come to understand the new features. But I have raised some issues that we probably should talk about soon too.

core/pom.xml Outdated Show resolved Hide resolved
core/pom.xml Outdated Show resolved Hide resolved
@@ -56,7 +56,7 @@
*/
public static DataSource getDataSource() {
if (dataSource == null) {
dataSource = new JdbcDataSource();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to be careful here. There is code which wires beans appropriately during spring startup for determination of the data source based on portal.properties and application-context files. We use a shared data source within tomcat when running multiple instances of cBioPortal within a single tomcat application. It seems like this code change may hardwire a connection pool through the apache BasicDataSource library onto each running tomcat web application.
We will do some testing of this. Maybe @n1zea144 can comment on this code change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good points, if you need to run multiple portal instances in a single tomcat container. Will revert the change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sheridancbio , only with one change made to JdbcDataSource constructor, to replace hard code mysql driver class name with value read from portal.properties file as the following:
from

this.setDriverClassName("com.mysql.jdbc.Driver");

to

String mysqlDriverClassName = dbProperties.getDbDriverClassName();
this.setDriverClassName(mysqlDriverClassName);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reminder to self : let's test this code in tomcat

core/src/main/java/org/mskcc/cbio/portal/dao/JdbcUtil.java Outdated Show resolved Hide resolved
@khzhu
Copy link
Contributor Author

khzhu commented Apr 5, 2019

@sheridancbio , thank you so much for the feedback. I will go through each of your comments/suggestions over the weekend.

@@ -120,7 +121,7 @@ public void createAlterationEnrichments() throws Exception {
AlterationEnrichment alterationEnrichment2 = result.get(1);
Assert.assertEquals((Integer) 3, alterationEnrichment2.getEntrezGeneId());
Assert.assertEquals("HUGO3", alterationEnrichment2.getHugoGeneSymbol());
Assert.assertEquals("CYTOBAND3", alterationEnrichment2.getCytoband());
Assert.assertEquals(null, alterationEnrichment2.getCytoband());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems incorrect?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @inodb , that is right. for alterations we will have to fill out cytobands in the frontend, since alterations have no reference genome information associated with them. We will have to rely on the reference genome information of the study to backfill cytobands in the frontend.

@khzhu
Copy link
Contributor Author

khzhu commented Jun 29, 2019

Hi @ao508 @sheridancbio @pieterlukasse @inodb ,
rebased to the latest rc, addressed all review comments and passed CI 😄. Please resolve your comments and approve the PR or let me know anything I missed. Thanks again for all your time and suggestions!

@khzhu khzhu force-pushed the multiple-genome-build-5652 branch from 3948ce9 to d2fec9d Compare June 30, 2019 16:21
@inodb inodb changed the base branch from rc to release-3.1.0 July 8, 2019 19:33
Copy link
Contributor

@ao508 ao508 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made some additional (minor) requests. let me know if you have any questions.

@@ -82,10 +82,10 @@ public void importData() throws IOException, DaoException {

String geneAId = parts[0];

CanonicalGene geneA = daoGene.getNonAmbiguousGene(geneAId, null);
CanonicalGene geneA = daoGene.getNonAmbiguousGene(geneAId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@khzhu Just wanted to clarify a bit more - the call daoGene.getNonAmbiguousGene(geneAId, null); --> daoGene.getNonAmbiguousGene(geneAId) isn't consistent with other calls that were modified in this pr, such as the one below at L88. Why was the second parameter ignored above but set to true in the call below?


#Set defaults for genome version and species
self.__species = 'human'
self.__ncbi_build = '37'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should 37 be changed to GRCh37 to be consistent with how we refer to ncbi builds in the java code? see here

@@ -396,12 +396,21 @@ CREATE TABLE `reference_genome` (
UNIQUE INDEX `BUILD_NAME_UNIQUE` (`BUILD_NAME` ASC)
);

<<<<<<< HEAD
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unmerged file conflict

INSERT INTO `reference_genome`
VALUES (1, 'human', 'hg19', 'GRCh37', NULL, 'http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips', '2009-02-01');
INSERT INTO `reference_genome`
VALUES (2, 'human', 'hg38', 'GRCh38', NULL, 'http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips', '2013-12-01');
INSERT INTO `reference_genome`
VALUES (3, 'mouse', 'mm10', 'GRCm38', NULL, 'http://hgdownload.cse.ucsc.edu//goldenPath/mm10/bigZips', '2012-01-01');
>>>>>>> resolving code review comments/suggestions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another conflict

@@ -620,7 +620,7 @@ The extended MAF format recognized by the portal has:
1. **Hugo_Symbol (Required)**: A [HUGO](http://www.genenames.org/) gene symbol.
2. **Entrez_Gene_Id (Optional, but recommended)**: A [Entrez Gene](http://www.ncbi.nlm.nih.gov/gene) identifier.
3. **Center (Optional)**: The sequencing center.
4. **NCBI_Build (Optional)<sup>1</sup>**: Must be "GRCh37" for human, and "GRCm38" for mouse.
4. **NCBI_Build (Required)<sup>1</sup>**: The Genome Reference Consortium Build is used by a variant calling software. It must be "GRCh37" or "GRCh38" for a human, and "GRCm38" for a mouse.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@khzhu we seem to refer to the NCBI builds as GRCh* but the python scripts just refer (or expect) the actual numbers. These should match. I think the python scripts should be changed to follow GRCh* for consistency

-ucsc UCSC_BUILD_NAME, --ucsc_build_name UCSC_BUILD_NAME
UCSC reference genome assembly name (default: assumed hg19)
-ncbi NCBI_BUILD_NUMBER, --ncbi_build_number NCBI_BUILD_NUMBER
NCBI reference genome build number (default: assumed 37 for UCSC reference genome build hg19)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use GRCh*

species information (default: assumed human)
-ucsc UCSC_BUILD_NAME, --ucsc_build_name UCSC_BUILD_NAME
UCSC reference genome assembly name (default: assumed hg19)
-ncbi NCBI_BUILD_NUMBER, --ncbi_build_number NCBI_BUILD_NUMBER
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment here

@@ -62,7 +62,7 @@
AlterationEnrichment alterationEnrichment = new AlterationEnrichment();
alterationEnrichment.setEntrezGeneId(gene.getEntrezGeneId());
alterationEnrichment.setHugoGeneSymbol(gene.getHugoGeneSymbol());
alterationEnrichment.setCytoband(gene.getCytoband());
//alterationEnrichment.setCytoband(gene.getCytoband());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we remove this if it's not going to be used or is there a plan to replace it with the reference_genome_gene.CYTOBAND? If so then can you add a TODO here?

@khzhu
Copy link
Contributor Author

khzhu commented Jul 11, 2019

@ao508 , replace those 37 with GRCh37 in validator documents. Also, removed the comment. For alterations, cytobands will be filled by the frontend react module.

@khzhu khzhu force-pushed the multiple-genome-build-5652 branch from d2fec9d to eade653 Compare July 11, 2019 16:46
@khzhu
Copy link
Contributor Author

khzhu commented Jul 11, 2019

@ao508 @inodb , addressed all your new comments, please have your final review. thanks!

lemccarthy and others added 9 commits July 12, 2019 11:29
use MySql reserved keywords with a quote

adjust column sizes in mutation_event table to be compatible with UTF8

add new options (species and reference genome build) to PYTHON importer/validate scripts

check reference_genome_id in seg file and NCBI_Build (if filled) in MAF file see if agree with reference genome in cancer study meta file

add a new API endpoint to fetch reference genome genes by entrez gene ids and hugo gene symbols
- structural variant seed data
- db version to 2.11.0
use MySql reserved keywords with a quote

adjust column sizes in mutation_event table to be compatible with UTF8

add new options (species and reference genome build) to PYTHON importer/validate scripts

check reference_genome_id in seg file and NCBI_Build (if filled) in MAF file see if agree with reference genome in cancer study meta file

add a new API endpoint to fetch reference genome genes by entrez gene ids and hugo gene symbols
@inodb inodb force-pushed the multiple-genome-build-5652 branch from eade653 to a46a54a Compare July 12, 2019 16:01
@inodb inodb requested a review from ao508 July 12, 2019 16:50
@inodb inodb merged commit 484a7b9 into cBioPortal:release-3.1.0 Jul 17, 2019
@sheridancbio sheridancbio removed their assignment Jul 17, 2019
@jagnathan jagnathan deleted the multiple-genome-build-5652 branch June 2, 2021 16:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api enhancement includes db changes rc applies to the `rc` development branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants