## Map HGNC gene symbols to Entrez IDs



Before identifying genes with poorly represented HD in the cancer cell line panel, we first tidy up the datasets used in the downstream analysis. Specifically, **every HGNC gene symbol is replaced by its Entrez ID**.


**Input**
- **Sanger gene copy number variation data**: WES_pureCN_CNV_genes_cn_category_20221213.csv (https://cellmodelpassports.sanger.ac.uk/downloads)
- **Broad CRISPR gene KO screening data**: CRISPRGeneEffect.csv (Version: 23Q2; https://depmap.org/portal/download/all/)
- **Gene annotation (e.g., gene name, Entrez ID)**: gene_identifiers_20191101.csv

**Output**
- Sanger CNV data with HGNC symbols mapped to Entrez IDs: cnv_sanger_entrezID.csv
- CRISPR gene effect data with HGNC symbols mapped to Entrez IDs: crispr_broad_entrezID.csv

In [1]:
## Import modules
import numpy as np
import pandas as pd

Tidy up Sanger CNV data.

In [2]:
## Import the Sanger copy number variation dataset
cnv_sanger = pd.read_csv('/Users/amy/Desktop/SyntheticLethalityProject/sources/WES_pureCN_CNV_genes_cn_category_20221213.csv', encoding='ISO-8859-1', index_col=0)
cnv_sanger = cnv_sanger.T  # Transpose so that genes are in columns (easier to rename)

## Import the gene identifier dataset and keep the two columns
## needed to map HGNC symbols to Entrez IDs
gene_identifier = pd.read_csv('/Users/amy/Desktop/SyntheticLethalityProject/sources/gene_identifiers_20191101.csv')
gene_id_map = gene_identifier[['hgnc_symbol', 'entrez_id']]

## Only keep rows that have both a symbol and an Entrez ID
gene_id_map = gene_id_map.dropna()

## Build the symbol -> Entrez ID dictionary
## (gene symbols should be unique; if one repeats, the last entry wins)
column_map = dict(zip(gene_id_map['hgnc_symbol'].astype(str), gene_id_map['entrez_id'].astype(int)))

## Rename the gene columns; columns without a match keep their original name
cnv_sanger = cnv_sanger.rename(columns=column_map)
cnv_sanger = cnv_sanger.rename(columns={'model_id': 'SangerModelID'})
cnv_sanger[:2]

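| model_name | SangerModelID | source | symbol | 1 | 29974 | 2 | 144568 | 127550 | 53947 | 51146 | ... | 9183 | 55055 | 11130 | 7789 | 158586 | 79364 | 79699 | 7791 | 23140 | 26009 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 22RV1 | SIDM00499 | Broad | | Neutral | Neutral | Loss | Neutral | Neutral | Neutral | Neutral | ... | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Gain | Neutral | Neutral |
| 22RV1.1 | SIDM00499 | Sanger | | Neutral | Neutral | Gain | Gain | Neutral | Neutral | Neutral | ... | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral | Neutral |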
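The renaming step leaves any column without an Entrez ID under its original symbol, and repeated symbols in the identifier table collapse to a single dictionary entry. A minimal sketch of a sanity check for both cases follows. It runs on a toy table with made-up rows, and the names `toy_identifier`, `toy_cnv`, and `report_mapping` are hypothetical helpers for illustration, not part of the notebook.

```python
import pandas as pd

# Toy identifier table (hypothetical rows standing in for gene_identifiers_20191101.csv)
toy_identifier = pd.DataFrame({
    "hgnc_symbol": ["A1BG", "A2M", "GENE_X", "A1BG"],
    "entrez_id": [1.0, 2.0, None, 1.0],
})

# Toy CNV frame in the transposed layout: genes in columns
toy_cnv = pd.DataFrame({
    "model_id": ["SIDM00499"],
    "A1BG": ["Neutral"],
    "A2M": ["Loss"],
    "GENE_X": ["Gain"],
})

def report_mapping(df, identifier, meta_cols=("model_id",)):
    """Rename symbol columns to Entrez IDs and report what could not be mapped."""
    id_map = identifier[["hgnc_symbol", "entrez_id"]].dropna()
    # Count symbols that occur more than once in the identifier table
    n_dup = int(id_map["hgnc_symbol"].duplicated().sum())
    mapping = dict(zip(id_map["hgnc_symbol"].astype(str),
                       id_map["entrez_id"].astype(int)))
    # Gene columns that have no Entrez ID and would keep their symbol
    unmapped = [c for c in df.columns if c not in mapping and c not in meta_cols]
    return df.rename(columns=mapping), n_dup, unmapped

mapped, n_duplicate_symbols, unmapped_symbols = report_mapping(toy_cnv, toy_identifier)
print(list(mapped.columns))
print(n_duplicate_symbols, unmapped_symbols)
```

On the real data, a non-empty `unmapped_symbols` list would indicate gene columns that are still labelled by symbol after the mapping.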


Tidy up CRISPR gene effect data.

In [3]:
## Import the Broad CRISPR gene effect dataset
crispr_broad = pd.read_csv('/Users/amy/Desktop/SyntheticLethalityProject/sources/CRISPRGeneEffect.csv')

# Tidy up the column names: keep only the text before " (",
# i.e. the gene symbol without the parenthesised suffix
column_map2 = {col: col.split(" (")[0] for col in crispr_broad.columns.astype(str)}

# Rename the columns to plain gene symbols, then map the symbols to Entrez IDs
# (symbols missing from column_map keep their symbol as the column name)
crispr_broad = crispr_broad.rename(columns=column_map2)
crispr_broad = crispr_broad.rename(columns=column_map)

crispr_broad[:2]

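| Unnamed: 0 | ModelID | 1 | 29974 | 2 | 144568 | 127550 | 53947 | 51146 | 8086 | 65985 | ... | 55055 | 11130 | 7789 | 158586 | 79364 | 440590 | 79699 | 7791 | 23140 | 26009 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ACH-000001 | -0.102725 | 0.058595 | 0.058246 | -0.041881 | -0.088661 | 0.170335 | -0.015254 | -0.223691 | 0.218612 | ... | -0.084055 | -0.084184 | 0.131495 | 0.238702 | 0.201712 | -0.250381 | 0.045612 | 0.044154 | 0.146801 | -0.473583 |
| 1 | ACH-000004 | 0.008878 | -0.077633 | -0.099297 | 0.03012 | -0.080334 | -0.112404 | 0.298774 | -0.125139 | 0.218675 | ... | -0.066673 | -0.443145 | 0.183618 | 0.058936 | 0.108711 | 0.056322 | -0.355712 | 0.13531 | 0.200408 | -0.07615 |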
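If the gene columns follow a `SYMBOL (ENTREZ_ID)` pattern, which is what splitting on `" ("` suggests but is an assumption here, the Entrez ID could also be read directly from the column name. This is an alternative to the symbol lookup used in this notebook, not what the notebook does. It would not depend on the symbol being present in the identifier table. A minimal sketch on made-up column names, with `parse_gene_column` as a hypothetical helper:

```python
import re
import pandas as pd

# Matches names such as "A1BG (1)": a symbol, a space, and a numeric ID in parentheses
_GENE_COL = re.compile(r"^(?P<symbol>.+) \((?P<entrez>\d+)\)$")

def parse_gene_column(name):
    """Return (symbol, entrez_id) for 'SYMBOL (ID)' names, else (name, None)."""
    match = _GENE_COL.match(name)
    if match is None:
        return name, None
    return match.group("symbol"), int(match.group("entrez"))

# Toy frame with hypothetical values in the assumed column format
toy_crispr = pd.DataFrame({
    "ModelID": ["ACH-000001"],
    "A1BG (1)": [-0.10],
    "A2M (2)": [0.06],
})

# Only columns that match the pattern are renamed; others (e.g. ModelID) are kept
rename_map = {}
for col in toy_crispr.columns:
    symbol, entrez = parse_gene_column(col)
    if entrez is not None:
        rename_map[col] = entrez

toy_crispr_entrez = toy_crispr.rename(columns=rename_map)
print(list(toy_crispr_entrez.columns))
```

The two approaches can disagree when a symbol has been reassigned between annotation releases, so comparing their results on the real data would be a reasonable cross-check.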


Save the intermediate data for use in later processing steps.

In [4]:
## Sanger CNV 
cnv_sanger.to_csv('/Users/amy/Desktop/SyntheticLethalityProject/1_data_processing/01_entrez_ID_mapping/cnv_sanger_entrezID.csv', index = False)

## CRISPR gene effect 
crispr_broad.to_csv('/Users/amy/Desktop/SyntheticLethalityProject/1_data_processing/01_entrez_ID_mapping/crispr_broad_entrezID.csv', index = False)
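One point to keep in mind for the later steps: `to_csv` writes column labels as text, so the integer Entrez IDs come back as strings when the files are reloaded with `pd.read_csv`. A minimal sketch of restoring them, using an in-memory buffer and a toy frame (hypothetical values) in place of the real output files:

```python
import io
import pandas as pd

# Toy frame with integer Entrez IDs as column labels
toy = pd.DataFrame({"SangerModelID": ["SIDM00499"], 1: ["Neutral"], 2: ["Loss"]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# After reloading, every column label is a string
reloaded = pd.read_csv(buffer)

# Convert purely numeric labels back to integers; leave the others untouched
restored = reloaded.rename(columns=lambda c: int(c) if c.isdigit() else c)
print([type(c).__name__ for c in restored.columns])
```

Whether to convert the labels back or to work with string IDs throughout is a design choice. What matters is that both datasets use the same type before their columns are compared.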