# Create a symbol to entrez_gene_id mapping for Xena's HiSeqV2

The mapping is downloaded from a [post on the Xena Browser Google Group](https://groups.google.com/d/msg/ucsc-cancer-genomics-browser/uMsXgDwkkxg/eofxMaFGBAAJ). It is a temporary solution: if Xena updates its vocabulary or processing pipeline, the mapping may become outdated.

For genes without symbols in `HiSeqV2`, the following convention was used by Xena:

+ entrez gene id `100130426` becomes symbol `?|100130426`
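
The convention can be sketched as a small helper. The function name `to_xena_symbol` is hypothetical and not part of this notebook; it assumes identifiers follow the `symbol|entrez_gene_id` layout of the `gene_id` column parsed later in the notebook.

```python
def to_xena_symbol(gene_id):
    """Split a 'symbol|entrez_gene_id' string and apply the '?' convention.

    Hypothetical helper, for illustration only.
    """
    symbol, entrez = gene_id.split('|')
    if symbol == '?':
        # Genes without a symbol keep the full '?|<entrez>' string as their symbol
        symbol = '?|' + entrez
    return symbol, entrez


print(to_xena_symbol('?|100130426'))
print(to_xena_symbol('ZZZ3|26009'))
```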

In [1]:
import pandas

In [2]:
url = "https://groups.google.com/group/ucsc-cancer-genomics-browser/attach/446a131f1877a/unc.edu.f72bfbe6-411d-412e-aaab-1a2414e544ec.2146068.rsem.genes.normalized_results?part=0.1&authuser=0"

In [3]:
gene_df = pandas.read_table(url)
pairs = gene_df.gene_id.str.split('|').tolist()
pairs = [('?|' + entrez, entrez) if symbol == '?' else (symbol, entrez) for symbol, entrez in pairs]
gene_df = pandas.DataFrame(pairs, columns=['symbol', 'entrez_gene_id'])
gene_df.sort_values('symbol', inplace=True)
gene_df.tail(4)

| | symbol | entrez_gene_id |
|---|---|---|
| 20527 | ZZEF1 | 23140 |
| 20528 | ZZZ3 | 26009 |
| 20529 | psiTPTE22 | 387590 |
| 20530 | tAKR | 389932 |


In [4]:
# Check that entrez_gene_ids are all unique 
assert not gene_df.entrez_gene_id.duplicated().any()

### Duplicate symbol _SLC35E2_

_SLC35E2_ is the symbol listed for both entrez gene [`9906`](http://www.ncbi.nlm.nih.gov/gene/9906) and [`728661`](http://www.ncbi.nlm.nih.gov/gene/728661). As of 2016-07-23, the symbol for `728661` is _SLC35E2B_. Therefore, we keep the `9906` mapping and remove the `728661` mapping.
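
The duplicate check and removal can be sketched on a toy frame that reuses three rows shown in this notebook. This is an alternative to the `value_counts` approach, not the method used in the cells here: `Series.duplicated(keep=False)` does not depend on the column names that `value_counts().reset_index()` produces, which may differ between pandas versions (worth verifying against the installed release). The names `toy_df` and `dup_df` are illustrative.

```python
import pandas

# Toy frame reusing three rows shown in this notebook
toy_df = pandas.DataFrame(
    [('SLC35E2', '728661'), ('SLC35E2', '9906'), ('ZZZ3', '26009')],
    columns=['symbol', 'entrez_gene_id'],
)

# keep=False marks every member of a duplicated group, not just the repeats
dup_df = toy_df[toy_df.symbol.duplicated(keep=False)]
print(dup_df)

# Drop the outdated mapping, then confirm that symbols are unique
toy_df = toy_df[toy_df.entrez_gene_id != '728661']
assert not toy_df.symbol.duplicated().any()
```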

In [5]:
# Find duplicated symbols
count_df = gene_df.symbol.value_counts().reset_index().query("symbol > 1")
count_df

| | index | symbol |
|---|---|---|
| 0 | SLC35E2 | 2 |


In [6]:
gene_df.query("symbol == 'SLC35E2'")

| | symbol | entrez_gene_id |
|---|---|---|
| 16300 | SLC35E2 | 728661 |
| 16301 | SLC35E2 | 9906 |


In [7]:
gene_df = gene_df.query("entrez_gene_id != '728661'")
len(gene_df)

20530

### Check mapping against HiSeqV2 gene symbols

In [8]:
path = "../../download/HiSeqV2.tsv.bz2"
expr_df = pandas.read_table(path, index_col=0)

In [9]:
# Symbols that are in `HiSeqV2` but not in our entrez gene map
set(expr_df.index) - set(gene_df.symbol)

set()
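
The same coverage check can be written as an assertion, so that a stale mapping fails loudly. The sketch below also reports symbols present in the map but absent from the expression matrix. It uses toy stand-ins: `toy_expr_df`, `toy_gene_df`, and the expression values are made up for illustration, while the symbol and entrez pairs are taken from this notebook.

```python
import pandas

# Toy stand-ins: a tiny expression matrix indexed by symbol, and a gene map.
# The expression values are invented for illustration.
toy_expr_df = pandas.DataFrame(
    {'sample_1': [1.0, 2.0], 'sample_2': [3.0, 4.0]},
    index=['ZZZ3', '?|100130426'],
)
toy_gene_df = pandas.DataFrame(
    [('ZZZ3', '26009'), ('?|100130426', '100130426'), ('ZZEF1', '23140')],
    columns=['symbol', 'entrez_gene_id'],
)

# Symbols in the expression matrix that the map cannot translate
missing = set(toy_expr_df.index) - set(toy_gene_df.symbol)
# Symbols in the map that never occur in the expression matrix
extra = set(toy_gene_df.symbol) - set(toy_expr_df.index)

assert not missing, f'unmapped symbols: {sorted(missing)}'
print(extra)
```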

### Export mapping

In [10]:
gene_df.to_csv('HiSeqV2-gene-map.tsv', index=False, sep='\t')
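
When the exported file is read back, pandas infers integer types for `entrez_gene_id` unless told otherwise. A minimal round-trip sketch, assuming the identifiers should stay strings as they are in this notebook, is shown below. It writes to an in-memory buffer instead of `HiSeqV2-gene-map.tsv`, and `toy_map_df` is an illustrative name.

```python
import io

import pandas

# Two rows taken from the mapping shown in this notebook
toy_map_df = pandas.DataFrame(
    [('ZZZ3', '26009'), ('ZZEF1', '23140')],
    columns=['symbol', 'entrez_gene_id'],
)

# Write to an in-memory buffer with the same options as the export cell
buffer = io.StringIO()
toy_map_df.to_csv(buffer, index=False, sep='\t')
buffer.seek(0)

# dtype=str keeps entrez_gene_id as strings instead of inferred integers
loaded_df = pandas.read_csv(buffer, sep='\t', dtype=str)
print(loaded_df.entrez_gene_id.tolist())
```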