New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Gene info parser fix #6754
Gene info parser fix #6754
Conversation
int referenceGenomeId = DaoReferenceGenome.getReferenceGenomeByBuildName(genomeBuild).getReferenceGenomeId(); | ||
String desc = parts[8]; | ||
String type = parts[9]; | ||
String mainSymbol = parts[10]; // use 10 instead of 2 since column 2 may have duplication | ||
Set<String> aliases = new HashSet<String>(); | ||
|
||
// try to get chr from other column if needed | ||
if (chr.equals("-")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
combine two ifs into one, so the second condition will not be evaluated if first one is false.
if (chr.equals("-") && (!parts[6].equals("-")) {
chr = parts[6];
} else {
// skip line if still unable to parse chr
continue;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unittests revealed a problem with your proposed implementation; the code will go straight to continue
when chr
does not equal "-"
, which is obviously not what we want.
A "synonymous" implementation would be:
if (chr.equals("-") && parts[6].equals("-")) {
continue
}
if (chr.equals("-")) {
chr = parts[6];
}
However, this requires 2 logical tests while my original implemetation for most of the lines requires 1, thus would be better performance-wise.
core/src/main/java/org/mskcc/cbio/portal/scripts/ImportGeneData.java
Outdated
Show resolved
Hide resolved
46f4ba9
to
26cb35b
Compare
@inodb can you merge? |
I ran into some problems while building a hg38 seed database:
Sometimes the file
Homo_sapiens.gene_info
contains missing values in themap_location
and/orchromosome
columns. This file is downloaded from NCBI and required when updating genes and gene aliases.Additionally, sometimes a particular symbol from the GenCode annotation file is not in the info file, causing a
NullPointerException
.Changes:
map_location
column, try thechromosome
column. If that also fails, skip gene.NullPointerException
when the gene symbol can't be found, and instead showing a warning, and skipping the gene.