Map mutation gene symbols to Entrez IDs #12
The mapping is conducted in two stages. First, gene symbols are mapped using the combination of chromosome and the gene symbol of record. This maps ~95% of observed mutations. Next, the still-unmapped gene symbols are mapped using the combination of chromosome and alternate gene symbols. After the second stage, ~98% of observations are mapped. The remaining ~2% are either ambiguous or unmappable, and are currently discarded before the data is written out.
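For readers following along, here is a minimal sketch of the two-stage idea in plain Python. It is not the code from this pull request: the gene records, the IDs, and the names `build_lookup` and `map_mutations` are all made up for illustration, and it assumes a mapping counts as ambiguous when a (chromosome, symbol) pair matches more than one Entrez ID.

```python
from collections import defaultdict

# Toy gene records (all values hypothetical):
# (entrez_id, chromosome, symbol of record, alternate symbols)
GENES = [
    (101, "17", "GENEA", ["ALT_A"]),
    (102, "1", "GENEB", ["DUP"]),
    (103, "1", "GENEC", ["DUP"]),
]


def build_lookup(genes, use_synonyms):
    """Return a dict mapping (chromosome, symbol) to a set of Entrez IDs."""
    lookup = defaultdict(set)
    for entrez_id, chrom, symbol, synonyms in genes:
        names = synonyms if use_synonyms else [symbol]
        for name in names:
            lookup[(chrom, name)].add(entrez_id)
    return lookup


def map_mutations(mutations, genes):
    """Map (chromosome, symbol) pairs to Entrez IDs in two stages."""
    stages = (
        build_lookup(genes, use_synonyms=False),  # stage 1: symbol of record
        build_lookup(genes, use_synonyms=True),   # stage 2: alternate symbols
    )
    mapped, discarded = {}, []
    for key in mutations:
        for lookup in stages:
            ids = lookup.get(key, set())
            if len(ids) == 1:  # accept only unambiguous matches
                mapped[key] = next(iter(ids))
                break
        else:
            # No stage gave a unique match: ambiguous or unmappable.
            discarded.append(key)
    return mapped, discarded


observed = [("17", "GENEA"), ("17", "ALT_A"), ("1", "DUP"), ("X", "NOPE")]
mapped, discarded = map_mutations(observed, GENES)
```

In this toy run, `("1", "DUP")` is dropped because two genes on chromosome 1 share that alternate symbol, and `("X", "NOPE")` is dropped because nothing matches it.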
Great work with this pull request!
I think you should separate the Entrez gene processing into its own notebook. For example,
I also think we may want to consider the following approach:
This approach gives primacy to official symbols (i.e. we don't blacklist official symbols because there's a colliding synonym on the same chromosome), but we still obliterate colliding synonyms. Does that make sense?
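The snippet for this approach is not shown in the thread, so the following is only one possible reading of it, sketched in plain Python. The function name `build_symbol_map` and the toy records are hypothetical. The assumption is that official symbols are always kept, synonyms shared by more than one gene on the same chromosome are dropped, and a synonym that collides with an official symbol loses to the official symbol.

```python
# Toy gene records (all values hypothetical):
# (entrez_id, chromosome, official symbol, synonyms)
GENES = [
    (101, "17", "GENEA", ["ALT_A", "GENEB"]),  # synonym GENEB collides with an official symbol
    (102, "17", "GENEB", ["SHARED"]),
    (103, "17", "GENEC", ["SHARED"]),          # SHARED is a synonym of two genes
]


def build_symbol_map(genes):
    """Return a dict mapping (chromosome, symbol) to a single Entrez ID."""
    # Collect every Entrez ID that claims each (chromosome, synonym) pair.
    synonym_ids = {}
    for entrez_id, chrom, _, synonyms in genes:
        for name in synonyms:
            synonym_ids.setdefault((chrom, name), set()).add(entrez_id)

    # Keep only synonyms that point at exactly one gene.
    symbol_map = {
        key: next(iter(ids))
        for key, ids in synonym_ids.items()
        if len(ids) == 1
    }

    # Official symbols are written last, so they override any colliding synonym.
    for entrez_id, chrom, symbol, _ in genes:
        symbol_map[(chrom, symbol)] = entrez_id
    return symbol_map


symbol_map = build_symbol_map(GENES)
```

Here `("17", "GENEB")` resolves to the gene that officially carries that symbol, while `("17", "SHARED")` is absent because it is a synonym of two genes.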
I suggest keeping the pull request open. Any commits you make to your master branch will get added to this pull request.