### Gene Essentiality Ranking and Normalization

This notebook processes gene essentiality data from multiple sources to create ranked essentiality scores for downstream synthetic lethality prediction. It combines DepMap CRISPR gene effect scores with additional essentiality data from PC9 and HeLa cell lines.

**Inputs:**
- DepMap CRISPR Gene Effect data (CRISPRGeneEffect.csv)
- PC9 cell line CRISPR screening data (Excel format)
- HeLa cell line Bayes factor data (Excel format)
- Gene symbol mapping file for standardization

**Outputs:**
- Ranked essentiality scores across all cell lines
- Z-score normalized essentiality rankings
- Both saved as CSV files (`ranked_essentiality.csv` and `ranked_zessentiality.csv`) for feature calculation

In [1]:
# import modules
import os
import pandas as pd
import numpy as np

In [2]:
# set the base directory for the project
cwd = os.getcwd()
BASE_DIR = os.path.abspath(os.path.join(cwd, "..", ".."))

# build paths inside the repo
def get_data_path(folders, fname):
    return os.path.normpath(os.path.join(BASE_DIR, *folders, fname))

file_path_gene_essentiality = get_data_path(['data', 'input', 'DepMap22Q4'], 'CRISPRGeneEffect.csv')
file_path_pc9 = get_data_path(['data', 'input', 'other'], '41467_2021_24841_MOESM4_ESM.xlsx')
file_path_bayes_factor = get_data_path(['data', 'input', 'other'], 'bayes_factor_hela.xlsx')

file_path_genenames = get_data_path(['data', 'input', 'other'], 'approved_and_previous_symbols.csv')

### Load and Process DepMap Gene Effect Data

**Load DepMap CRISPR gene effect scores:**
- Gene effect scores represent the dependency of cell lines on specific genes
- More negative scores indicate higher essentiality/dependency

In [3]:
gene_essentiality = pd.read_csv(file_path_gene_essentiality, index_col=0)

In [4]:
# renaming from depmap gene name format 'SYMBOL (ENTREZ ID)' to ENTREZ ID
def get_entrez_id(gene):
    '''Convert a "A1CF (29974)" string to the integer 29974.'''
    return int(gene.split('(')[1][0:-1])
gene_essentiality.columns = [get_entrez_id(x) for x in gene_essentiality.columns]
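
The string slicing above assumes every column label has exactly the `SYMBOL (ENTREZ ID)` shape. As a minimal sketch of a stricter alternative, the hypothetical helper `parse_depmap_label` below (not part of the notebook) uses a regular expression and raises an error on labels that do not match, so malformed columns fail with a clear message.

```python
import re

# Hypothetical pattern for DepMap-style labels such as "A1CF (29974)".
_LABEL_RE = re.compile(r"^\s*(?P<symbol>.+?)\s*\((?P<entrez>\d+)\)\s*$")


def parse_depmap_label(label):
    """Return (symbol, entrez_id) parsed from a 'SYMBOL (ENTREZ ID)' label."""
    match = _LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"Unexpected column label: {label!r}")
    return match.group("symbol"), int(match.group("entrez"))


print(parse_depmap_label("A1CF (29974)"))
```

Whether the extra validation is worth it depends on how much the input file is trusted; for an unmodified DepMap release the simpler slicing is probably sufficient.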

In [5]:
gene_essentiality_df = gene_essentiality.T.sort_index()
gene_essentiality_df[:2]

Unnamed: 0,ACH-000004,ACH-000005,ACH-000007,ACH-000009,ACH-000011,ACH-000012,ACH-000013,ACH-000014,ACH-000015,ACH-000017,...,ACH-002283,ACH-002284,ACH-002285,ACH-002294,ACH-002295,ACH-002296,ACH-002297,ACH-002298,ACH-002304,ACH-002305
1,0.014633,-0.261566,-0.028717,0.000225,0.095791,-0.10898,-0.077777,-0.05374,-0.189235,-0.009789,...,-0.109261,-0.092109,-0.156263,0.026595,-0.036315,-0.073879,0.084735,-0.172365,-0.033065,0.076307
2,-0.151299,0.106526,0.030971,0.051248,0.022204,0.172384,0.026442,0.038028,-0.081227,-0.003561,...,0.034826,0.052046,0.006431,0.002214,0.114886,-0.077086,0.045093,0.055771,-0.044622,0.165187


In [6]:
# rank genes within each cell line (rank 1 = most negative gene effect = most essential)
rankings_list = []

for column in gene_essentiality_df.columns:
    rankings_list.append(gene_essentiality_df[column].rank(ascending=True).astype(int))

rankings_df = pd.concat(rankings_list, axis=1, keys=[f'{col}' for col in gene_essentiality_df.columns])

In [7]:
rankings_df = rankings_df.sort_index()
print(f'# of genes with essentiality: {rankings_df.shape[0]}')
display(rankings_df[:3])

# of genes with essentiality: 17453


Unnamed: 0,ACH-000004,ACH-000005,ACH-000007,ACH-000009,ACH-000011,ACH-000012,ACH-000013,ACH-000014,ACH-000015,ACH-000017,...,ACH-002283,ACH-002284,ACH-002285,ACH-002294,ACH-002295,ACH-002296,ACH-002297,ACH-002298,ACH-002304,ACH-002305
1,11172,3608,9393,10907,14687,5946,6845,8565,3954,10169,...,6166,6936,4347,11909,8975,7259,14336,4673,9301,13377
2,5346,14189,12274,13473,11632,16815,12207,12329,7173,10517,...,12532,13189,11139,10805,16031,7139,12611,12894,8725,15926
9,15919,17197,15246,10793,13188,12370,14169,15968,16420,13331,...,11234,14028,16416,13834,11591,14976,16967,14952,14104,10443
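
The per-column ranking loop can also be written as one vectorized call. The sketch below uses a toy matrix (the values and the `ACH-A`/`ACH-B` labels are made up) and makes two choices that differ from the notebook and are assumptions on my part: `method="min"` so tied genes share an integer rank, and the nullable `Int64` dtype so that a missing score would stay missing instead of raising during the integer cast. The notebook itself relies on pandas' default tie handling (`method="average"`), where `.astype(int)` truncates any fractional tied rank.

```python
import pandas as pd

# Toy gene-effect matrix: rows are genes (Entrez IDs), columns are cell lines.
toy_effect = pd.DataFrame(
    {"ACH-A": [-1.2, 0.1, -0.3, 0.1], "ACH-B": [0.0, -2.0, 0.5, -0.7]},
    index=[1, 2, 9, 10],
)

# Rank all columns at once; the most negative score in a column gets rank 1.
# method="min" assigns tied scores the same (lowest) rank.
# "Int64" is pandas' nullable integer dtype, so NaN ranks remain <NA>.
toy_ranks = toy_effect.rank(axis=0, ascending=True, method="min").astype("Int64")
print(toy_ranks)
```

With real gene-effect scores exact ties are probably rare, so the choice of tie method is unlikely to change the rankings much.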


In [8]:
# read the gene names mapping file
id_map = pd.read_csv(file_path_genenames)

# create dictionaries to map gene symbols to Entrez IDs
approved_sym_to_entrez_id = dict(zip(id_map['Approved symbol'], id_map['entrez_id']))

# create dictionaries to map previous gene symbols to Entrez IDs
id_map_cleaned = id_map.dropna(axis=0, how='any', subset=['Previous symbol', 'entrez_id']).reset_index(drop=True)
prev_sym_to_entrez_id = dict(zip(id_map_cleaned['Previous symbol'], id_map_cleaned['entrez_id']))
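
The two dictionaries are used below as a two-step lookup: approved symbols first, previous symbols as a fallback. A minimal sketch of that pattern on a toy mapping table follows; the rows are illustrative only and the variable names are hypothetical, although the column names match the mapping file used above.

```python
import pandas as pd

# Toy mapping table (illustrative rows, not read from the real mapping file).
toy_map = pd.DataFrame({
    "Approved symbol": ["AARS1", "SEM1"],
    "Previous symbol": ["AARS", "SHFM1"],
    "entrez_id": [16, 7979],
})
approved = dict(zip(toy_map["Approved symbol"], toy_map["entrez_id"]))
previous = dict(zip(toy_map["Previous symbol"], toy_map["entrez_id"]))

symbols = pd.Series(["AARS", "SEM1", "NOT_A_GENE"])

# Try approved symbols first, then fill the remaining gaps from previous symbols.
entrez = symbols.map(approved).fillna(symbols.map(previous))
print(entrez.tolist())
```

Symbols found in neither dictionary stay `NaN`, which is why the result is a float column until the unmapped rows are dropped and the IDs are cast to `int`.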

### Process PC9 Cell Line Data

**Load and process PC9 CRISPR screening data:**
- PC9 is a lung cancer cell line
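- The `Control/Luciferase_DMSO_CS_avg` column holds the averaged CRISPR score for the control condition; lower (more negative) scores are ranked as more essential below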

In [9]:
cols = ["Gene Id", "Control/Luciferase_DMSO_CS_avg"]  # use the control PC9-Cas9 isogenic cell line scores for the analysis
pc9_df = pd.read_excel(file_path_pc9, usecols=cols)
pc9_df.head()

Unnamed: 0,Gene Id,Control/Luciferase_DMSO_CS_avg
0,A1BG,-0.121698
1,A1CF,0.033594
2,A2M,0.152108
3,A2ML1,0.030653
4,A3GALT2,-0.183661


In [10]:
# check approved symbols
pc9_df = pc9_df.assign(
    entrez_id = pc9_df['Gene Id'].map(approved_sym_to_entrez_id))

print('# check the NA values in entrez_id (for Approved Symbols)')
display(pc9_df.loc[pc9_df['entrez_id'].isna(), ])

# check the NA values in entrez_id (for Approved Symbols)


Unnamed: 0,Gene Id,Control/Luciferase_DMSO_CS_avg,entrez_id
14,AAED1,-0.208831,
22,AARS,-1.351190,
178,ACPP,0.099550,
331,ADGRF2,0.076600,
375,ADPRHL2,-0.181583,
...,...,...,...
19186,ZNF705E,0.205061,
19200,ZNF720,0.016087,
19258,ZNF806,0.165149,
19306,ZNRD1,-0.763148,


In [11]:
# check previous symbols
pc9_df['entrez_id'] = pc9_df['entrez_id'].fillna(pc9_df['Gene Id'].map(prev_sym_to_entrez_id))

print('# check the NA values in entrez_id (for Previous Symbols)')
display(pc9_df.loc[pc9_df['entrez_id'].isna(), ])

# check the NA values in entrez_id (for Previous Symbols)


Unnamed: 0,Gene Id,Control/Luciferase_DMSO_CS_avg,entrez_id
499,AKAP2,0.083052,
3682,CRIPAK,-0.003651,
5125,ERCC6-PGBD3,-0.239371,
5471,FAM231A,-0.743895,
5830,FLJ44635,0.168323,
...,...,...,...
17221,TNFAIP8L2-SCNM1,-0.464511,
18422,WI2-2373I1.2,-0.056791,
18468,WTH3DI,-0.328936,
18582,ZASP,-0.134499,


In [12]:
pc9_df = pc9_df.dropna(subset=['entrez_id']).reset_index(drop=True)
pc9_df['entrez_id'] = pc9_df['entrez_id'].astype(int)
print(f"# of unique genes in pc9_df: {pc9_df['entrez_id'].nunique()}")
display(pc9_df[:3])

# of unique genes in pc9_df: 19022


Unnamed: 0,Gene Id,Control/Luciferase_DMSO_CS_avg,entrez_id
0,A1BG,-0.121698,1
1,A1CF,0.033594,29974
2,A2M,0.152108,2


In [13]:
# find out duplicated entrez_ids
dup = pc9_df.loc[pc9_df['entrez_id'].duplicated(), 'entrez_id']
dup_df = pc9_df.loc[pc9_df['entrez_id'].isin(dup), ]
dup_df = dup_df.sort_values(by=['entrez_id'], ascending=True)
print(f'# of duplicated entrez_id in pc9_df: {dup_df.shape[0]}')

# of duplicated entrez_id in pc9_df: 36


In [14]:
# get unique approved symbols from duplicated genes in dup_df 
unique_symbols_from_dup = id_map.loc[id_map['Approved symbol'].isin(dup_df['Gene Id']), 'Approved symbol'].unique()
print(f'# of unique approved symbols from dup_df: {len(unique_symbols_from_dup)}')
display(unique_symbols_from_dup)

# of unique approved symbols from dup_df: 17


array(['ARHGAP42', 'BTBD8', 'CC2D2B', 'DCDC1', 'LCOR', 'MACF1', 'MICAL2',
       'MYO18A', 'NEBL', 'PLEKHG7', 'RRM2', 'S1PR3', 'SEM1', 'SLITRK2',
       'TXNRD3', 'XAGE1B', 'ZFHX3'], dtype=object)

In [15]:
approved_sym_pc9_df = pc9_df.loc[pc9_df['Gene Id'].isin(unique_symbols_from_dup)]
removed_dup_df = dup_df.loc[~dup_df.index.isin(approved_sym_pc9_df.index)]
removed_dup_df

Unnamed: 0,Gene Id,Control/Luciferase_DMSO_CS_avg,entrez_id
1824,C16orf47,0.20673,463
2168,C9orf47,0.037705,1903
14444,SEPT4,0.093461,5414
1850,C17orf47,0.008789,5414
2009,C2orf48,0.114892,6241
2127,C7orf76,0.098195,7979
9598,MICALCL,-0.148669,9645
1719,C10orf113,0.073584,10529
8159,KIAA0754,-0.067068,23499
1720,C10orf12,-0.041877,84458


In [16]:
pc9_df = pc9_df.drop(removed_dup_df.index, axis=0)
pc9_df = pc9_df.rename(columns={'Gene Id':'gene_symbol', 
                                'Control/Luciferase_DMSO_CS_avg':'CRISPR_score_PC9'})
pc9_df = pc9_df[['entrez_id', 'gene_symbol', 'CRISPR_score_PC9']]
pc9_df.head()

Unnamed: 0,entrez_id,gene_symbol,CRISPR_score_PC9
0,1,A1BG,-0.121698
1,29974,A1CF,0.033594
2,2,A2M,0.152108
3,144568,A2ML1,0.030653
4,127550,A3GALT2,-0.183661
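
The duplicate handling in the preceding cells amounts to one rule: when several symbols map to the same Entrez ID, keep only the rows whose symbol is an approved symbol. The sketch below restates that rule on a toy table; the scores and the `approved_symbols` set are made up for illustration.

```python
import pandas as pd

# Toy screen table: two symbols resolve to the same Entrez ID (7979).
toy = pd.DataFrame({
    "gene_symbol": ["SEM1", "SHFM1", "A1BG"],
    "score": [-0.9, -0.3, -0.1],
    "entrez_id": [7979, 7979, 1],
})
approved_symbols = {"SEM1", "A1BG"}  # assumed set of approved symbols

# Flag every row whose Entrez ID occurs more than once.
is_dup = toy["entrez_id"].duplicated(keep=False)

# Among the flagged rows, drop those whose symbol is not an approved symbol.
to_drop = toy.index[is_dup & ~toy["gene_symbol"].isin(approved_symbols)]
deduped = toy.drop(to_drop)
print(deduped)
```

One consequence of this rule is that if none of the symbols sharing an ID is an approved symbol, every row for that ID is removed. This appears to be the case for Entrez ID 5414 in the removed rows shown above, so that gene would have no PC9 rank.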


In [17]:
pc9_df['PC9_rank'] = pc9_df['CRISPR_score_PC9'].rank(ascending=True).astype(int)
ranked_pc9_df = pc9_df.sort_values(by=['CRISPR_score_PC9'], ascending=True)
ranked_pc9_df = ranked_pc9_df.reset_index(drop=True)
ranked_pc9_df[:3]

Unnamed: 0,entrez_id,gene_symbol,CRISPR_score_PC9,PC9_rank
0,3336,HSPE1,-3.148294,1
1,10594,PRPF8,-2.868349,2
2,1603,DAD1,-2.841247,3


In [18]:
ranked_pc9_df = ranked_pc9_df.drop(['gene_symbol', 'CRISPR_score_PC9'], axis=1).sort_values(by=['entrez_id'], ascending=True)
ranked_pc9_df = ranked_pc9_df.set_index('entrez_id')
ranked_pc9_df.head()

Unnamed: 0_level_0,PC9_rank
entrez_id,Unnamed: 1_level_1
1,6495
2,17186
9,18089
10,18477
12,14699


### Process HeLa Cell Line Data

**Load and process HeLa Bayes factor data:**
- HeLa is a cervical cancer cell line
- Higher Bayes factors indicate stronger evidence that a gene is essential, so genes are ranked in descending order of `BF_hela`

In [19]:
cols = ["Gene", "BF_hela"]
hela_df = pd.read_excel(file_path_bayes_factor,
                        usecols=cols)
hela_df.head()

Unnamed: 0,Gene,BF_hela
0,A1BG,-28.842
1,A1CF,-42.187
2,A2M,-42.97
3,A2ML1,-64.316
4,A4GALT,-22.806


In [20]:
hela_df = hela_df.assign(
    entrez_id = hela_df['Gene'].map(approved_sym_to_entrez_id))

print('# check the NA values in entrez_id (for Approved Symbols)')
display(hela_df.loc[hela_df['entrez_id'].isna(), ])

# check the NA values in entrez_id (for Approved Symbols)


Unnamed: 0,Gene,BF_hela,entrez_id
13,AAED1,-12.788,
21,AARS,134.309,
104,ABP1,-59.213,
147,ACN9,-10.434,
168,ACPL2,-39.804,
...,...,...,...
17545,ZNF788,-20.840,
17560,ZNF812,-4.180,
17594,ZNRD1,63.087,
17609,ZRSR1,-12.165,


In [21]:
hela_df['entrez_id'] = hela_df['entrez_id'].fillna(hela_df['Gene'].map(prev_sym_to_entrez_id))

print('# check the NA values in entrez_id (for Previous Symbols)')
display(hela_df.loc[hela_df['entrez_id'].isna(), ])

# check the NA values in entrez_id (for Previous Symbols)


Unnamed: 0,Gene,BF_hela,entrez_id
458,AKAP2,-44.063,
1883,C1orf134,-25.569,
2178,C9orf38,-22.184,
3591,CRIPAK,-35.076,
6545,GVQW1,-24.468,
10895,PALM2,-57.725,
15487,TMEM155,-1.545,


In [22]:
hela_df = hela_df.dropna(subset=['entrez_id', 'BF_hela']).reset_index(drop=True)
hela_df['entrez_id'] = hela_df['entrez_id'].astype(int)
print(f"# of unique genes in HeLa_df: {hela_df['entrez_id'].nunique()}")

# of unique genes in HeLa_df: 17618


In [23]:
# find out duplicated entrez_ids
hdup = hela_df.loc[hela_df['entrez_id'].duplicated(), 'entrez_id']
hdup_df = hela_df.loc[hela_df['entrez_id'].isin(hdup), ]
hdup_df = hdup_df.sort_values(by=['entrez_id'], ascending=True)
print(f'# of duplicated entrez_id in hdup_df: {hdup_df.shape[0]}')

# of duplicated entrez_id in hdup_df: 45


In [24]:
# get unique approved symbols from duplicated genes in hdup_df
unique_symbols_from_hdup = id_map.loc[id_map['Approved symbol'].isin(hdup_df['Gene']), 'Approved symbol'].unique()
print(f'# of unique approved symbols from hdup_df: {len(unique_symbols_from_hdup)}')
display(unique_symbols_from_hdup)

# of unique approved symbols from hdup_df: 17


array(['ARHGAP42', 'BTBD8', 'CC2D2B', 'CCDC7', 'LCOR', 'MACF1', 'MIA2',
       'MICAL2', 'MYO18A', 'NAA38', 'NEBL', 'PLEKHG7', 'RRM2', 'S1PR3',
       'SLITRK2', 'TXNRD3', 'ZFHX3'], dtype=object)

In [25]:
approved_sym_hela_df = hela_df.loc[hela_df['Gene'].isin(unique_symbols_from_hdup)]
removed_hdup_df = hdup_df.loc[~hdup_df.index.isin(approved_sym_hela_df.index)]
removed_hdup_df

Unnamed: 0,Gene,BF_hela,entrez_id
1753,C16orf47,-50.976,463
2178,C9orf47,-8.755,1903
3693,CTAGE5,-20.832,4253
1783,C17orf47,-19.453,5414
13428,SEPT4,-62.989,5414
1983,C2orf48,-45.03,6241
2131,C7orf76,-31.182,7979
13617,SHFM1,-31.084,7979
9098,MICALCL,-58.581,9645
1609,C10orf113,1.293,10529


In [26]:
hela_df = hela_df.drop(removed_hdup_df.index, axis=0)
hela_df = hela_df.rename(columns={'Gene':'gene_symbol'})
hela_df = hela_df[['entrez_id', 'gene_symbol', 'BF_hela']]
hela_df.head()

Unnamed: 0,entrez_id,gene_symbol,BF_hela
0,1,A1BG,-28.842
1,29974,A1CF,-42.187
2,2,A2M,-42.97
3,144568,A2ML1,-64.316
4,53947,A4GALT,-22.806


In [27]:
hela_df['HeLa_rank'] = hela_df['BF_hela'].rank(ascending=False).astype(int)
ranked_hela_df = hela_df.sort_values(by=['BF_hela'], ascending=False)
ranked_hela_df = ranked_hela_df.reset_index(drop=True)
ranked_hela_df[:3]

Unnamed: 0,entrez_id,gene_symbol,BF_hela,HeLa_rank
0,90196,SYS1,283.27,1
1,3837,KPNB1,268.751,2
2,3692,EIF6,265.959,3


In [28]:
ranked_hela_df = ranked_hela_df.drop(['gene_symbol', 'BF_hela'], axis=1).sort_values(by=['entrez_id'], ascending=True)
ranked_hela_df = ranked_hela_df.set_index('entrez_id')
ranked_hela_df.head()

Unnamed: 0_level_0,HeLa_rank
entrez_id,Unnamed: 1_level_1
1,7497
2,10511
10,4584
12,13647
13,10976
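
The PC9 and HeLa sections rank in opposite directions because the two scores point in opposite directions. The toy example below (made-up values and gene labels) shows both conventions side by side, so that rank 1 means "most essential" in either case.

```python
import pandas as pd

genes = ["geneA", "geneB", "geneC"]
crispr_score = pd.Series([-3.1, -0.1, 0.2], index=genes)
bayes_factor = pd.Series([250.0, -60.0, -30.0], index=genes)

# CRISPR scores: most negative = most essential, so rank ascending.
crispr_rank = crispr_score.rank(ascending=True).astype(int)

# Bayes factors: largest = most essential, so rank descending.
bf_rank = bayes_factor.rank(ascending=False).astype(int)

print(pd.DataFrame({"crispr_rank": crispr_rank, "bf_rank": bf_rank}))
```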


### Combine All Essentiality Data

**Merge DepMap, PC9, and HeLa essentiality rankings:**
- Label the PC9 and HeLa rank columns with their DepMap model IDs (ACH-000779 and ACH-001086)
- Build a single ranking matrix with cell lines as rows and genes (Entrez IDs) as columns

In [29]:
all_df = pd.concat([rankings_df, ranked_pc9_df, ranked_hela_df], axis=1)
all_df = all_df.rename(columns={'PC9_rank':'ACH-000779', 'HeLa_rank':'ACH-001086'})
all_df = all_df.transpose()
all_df.head()

Unnamed: 0,1,2,9,10,12,13,14,15,16,18,...,100038246,100129278,100130771,100505621,100505705,100506742,100507249,100528020,102724536,105378803
ACH-000004,11172.0,5346.0,15919.0,14531.0,7827.0,15880.0,681.0,3342.0,238.0,11965.0,...,,,,,,,,,,
ACH-000005,3608.0,14189.0,17197.0,17118.0,4887.0,11519.0,785.0,9371.0,98.0,3221.0,...,,,,,,,,,,
ACH-000007,9393.0,12274.0,15246.0,16239.0,16454.0,16579.0,645.0,12509.0,91.0,16790.0,...,,,,,,,,,,
ACH-000009,10907.0,13473.0,10793.0,16025.0,12830.0,13096.0,655.0,4023.0,258.0,14615.0,...,,,,,,,,,,
ACH-000011,14687.0,11632.0,13188.0,10997.0,16413.0,11970.0,997.0,5040.0,430.0,7148.0,...,,,,,,,,,,
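
The `NaN` entries and the float formatting in the combined table follow from how `pd.concat(..., axis=1)` aligns on the index: the result holds the union of all Entrez IDs, and a gene absent from one source gets `NaN` there. A toy illustration, with made-up ranks and a hypothetical `ACH-X` column:

```python
import pandas as pd

# Two rank tables that share only gene 2.
depmap_ranks = pd.DataFrame({"ACH-X": [1, 2]}, index=[1, 2])
pc9_ranks = pd.DataFrame({"PC9_rank": [2, 1]}, index=[2, 9])

# Outer alignment on the index: genes missing from a table become NaN,
# which also turns the integer rank columns into floats.
combined = pd.concat([depmap_ranks, pc9_ranks], axis=1)
print(combined)
```

Note also that each source ranks a different number of genes, so the raw ranks are not on exactly the same scale across sources.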


In [30]:
# z-score each gene's ranks across all cell lines (population standard deviation, ddof=0)
zall_df = all_df.apply(lambda x: ((x-x.mean())/x.std(ddof=0)))
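
As a worked illustration of the normalization, the sketch below applies the same column-wise z-score to a toy rank matrix containing a missing value. The data and the `ACH-A` to `ACH-C` labels are made up. The broadcasting form used here should give the same result as the `apply` call above.

```python
import numpy as np
import pandas as pd

# Toy rank matrix: rows are cell lines, columns are genes (same layout as all_df).
toy_ranks = pd.DataFrame(
    {1: [10.0, 20.0, 30.0], 2: [5.0, np.nan, 15.0]},
    index=["ACH-A", "ACH-B", "ACH-C"],
)

# Column-wise z-score using the population standard deviation (ddof=0).
# mean() and std() skip NaN by default, so a missing rank stays NaN
# and does not affect the other cell lines of that gene.
toy_z = (toy_ranks - toy_ranks.mean()) / toy_ranks.std(ddof=0)
print(toy_z.round(3))
```

A gene whose rank is identical in every cell line would have zero standard deviation and therefore a `NaN` z-score, which downstream feature calculation may need to handle.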

### Save Processed Data

**Export both raw rankings and z-score normalized rankings:**

In [None]:
output_path = get_data_path(['data', 'output', 'ranked_essentiality'], '')
os.makedirs(output_path, exist_ok=True)  # create the output folder if it does not exist yet

all_df.to_csv(os.path.join(output_path, 'ranked_essentiality.csv'))
zall_df.to_csv(os.path.join(output_path, 'ranked_zessentiality.csv'))