Gene info parser fix #6754

SRodenburg · 2019-10-29T08:45:52Z

I ran into some problems while building an hg38 seed database:

Sometimes the file Homo_sapiens.gene_info contains missing values in the map_location and/or chromosome columns. This file is downloaded from NCBI and required when updating genes and gene aliases.
Additionally, a symbol from the GENCODE annotation file is sometimes missing from the info file, causing a NullPointerException.

Changes:

When the chromosome can't be parsed from the map_location column, try the chromosome column. If that also fails, skip the gene.
Prevent a NullPointerException when the gene symbol can't be found: show a warning and skip the gene instead.
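
The two changes above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in ImportGeneData.java: the class name `GeneInfoSketch`, the helpers `resolveChromosome` and `lookupGene`, and the map standing in for the gene store are all hypothetical. The only details taken from this PR are that "-" marks a missing value and that the chromosome column is the fallback.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; names and types are assumptions, not cBioPortal APIs.
class GeneInfoSketch {
    // NCBI gene_info uses "-" for a missing value.
    static final String MISSING = "-";

    // Prefer the chromosome parsed from map_location; fall back to the
    // chromosome column; return empty when both are missing (skip the gene).
    static Optional<String> resolveChromosome(String chrFromMapLocation, String chromosomeColumn) {
        if (!MISSING.equals(chrFromMapLocation)) {
            return Optional.of(chrFromMapLocation);
        }
        if (!MISSING.equals(chromosomeColumn)) {
            return Optional.of(chromosomeColumn);
        }
        return Optional.empty();
    }

    // Look up a gene by symbol; warn and return empty instead of letting a
    // null propagate into a NullPointerException.
    static Optional<Long> lookupGene(Map<String, Long> genesBySymbol, String symbol) {
        Long id = genesBySymbol.get(symbol);
        if (id == null) {
            System.err.println("Warning: symbol '" + symbol + "' not found in gene info, skipping");
            return Optional.empty();
        }
        return Optional.of(id);
    }
}
```

A caller would skip the current line whenever either helper returns an empty Optional.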

khzhu · 2019-10-30T03:42:59Z

core/src/main/java/org/mskcc/cbio/portal/scripts/ImportGeneData.java

                    int referenceGenomeId = DaoReferenceGenome.getReferenceGenomeByBuildName(genomeBuild).getReferenceGenomeId();
                    String desc = parts[8];
                    String type = parts[9];
                    String mainSymbol = parts[10]; // use 10 instead of 2 since column 2 may have duplication
                    Set<String> aliases = new HashSet<String>();

+                    // try to get chr from other column if needed
+                    if (chr.equals("-")) {


Combine the two ifs into one, so the second condition will not be evaluated if the first one is false.

    if (chr.equals("-") && (!parts[6].equals("-"))) {
        chr = parts[6];
    } else {
        // skip line if still unable to parse chr
        continue;
    }

SRodenburg · 2019-10-30 (edited)

The unit tests revealed a problem with your proposed implementation: the code goes straight to continue whenever chr does not equal "-", which is not what we want.

An equivalent implementation would be:

    if (chr.equals("-") && parts[6].equals("-")) {
        continue;
    }
    if (chr.equals("-")) {
        chr = parts[6];
    }

However, this requires two logical tests per line, while my original implementation requires only one for most lines, so the original performs better.
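
To make the difference between the variants in this thread concrete, here is a small sketch that models each one as a method, with a null return standing in for `continue`. The class and method names are hypothetical, and the `nested` variant is only an assumed reconstruction of the original nested-if implementation, since the full diff is not shown here.

```java
// Hypothetical comparison of the three variants; null models `continue`.
class ChrFallbackVariants {
    // Assumed shape of the original: one test when chr is already known.
    static String nested(String chr, String chrColumn) {
        if (chr.equals("-")) {
            if (!chrColumn.equals("-")) {
                chr = chrColumn;
            } else {
                return null; // still unable to determine chr: skip line
            }
        }
        return chr;
    }

    // The combined condition with an else branch: any line whose chr is
    // already known falls into the else branch and is skipped.
    static String combinedWithElse(String chr, String chrColumn) {
        if (chr.equals("-") && !chrColumn.equals("-")) {
            chr = chrColumn;
        } else {
            return null;
        }
        return chr;
    }

    // Behaves like nested(), but evaluates chr.equals("-") twice per line.
    static String twoTests(String chr, String chrColumn) {
        if (chr.equals("-") && chrColumn.equals("-")) {
            return null;
        }
        if (chr.equals("-")) {
            chr = chrColumn;
        }
        return chr;
    }
}
```

With a known chromosome such as "17", `nested` and `twoTests` keep the value while `combinedWithElse` skips the line, which matches the failure described above.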

SRodenburg · 2019-11-20T08:37:57Z

@inodb can you merge?

SRodenburg changed the base branch from master to rc October 29, 2019 08:47

SRodenburg requested review from khzhu and oplantalech October 29, 2019 08:48

khzhu reviewed Oct 30, 2019

Sander Rodenburg added 2 commits October 30, 2019 10:28

prevent crash when gene is null

246fbd5

Resolve Kelsey's comments

26cb35b

SRodenburg changed the base branch from rc to master October 30, 2019 09:34

SRodenburg force-pushed the gene_info_parser_fix branch from 46f4ba9 to 26cb35b Compare October 30, 2019 09:36

Revert some adjustments

cb50b7e

SRodenburg requested a review from khzhu November 4, 2019 12:33

inodb approved these changes Nov 5, 2019

oplantalech approved these changes Nov 7, 2019

khzhu approved these changes Nov 7, 2019

inodb approved these changes Nov 27, 2019

inodb merged commit 8a953f7 into cBioPortal:master Nov 27, 2019

inodb added the bug label Nov 29, 2019
