## Identify synthetic lethal interactions between paralog genes

This project focuses on vulnerabilities that arise from synthetic lethal interactions between paralog genes.

We therefore filter both datasets further and keep only the data for paralog genes.

**Input**
- Sanger CNV data with mapped Entrez ID and BROAD ID (703 CCLs): cnv_sanger_tidy.csv
- BROAD CRISPR gene effect data with mapped Entrez ID and BROAD ID (703 CCLs): crispr_broad_tidy.csv


**Output**
- Sanger CNV data restricted to paralog genes: cnv_sanger_paralog.csv
- BROAD CRISPR gene effect data restricted to paralog genes: crispr_broad_paralog.csv

In [1]:
## Import modules
import numpy as np
import pandas as pd

In [2]:
## Load the datasets
## Paralog gene table from Barbara's data (DeKegel_TableS8.csv), plus the tidy CNV and CRISPR tables
paralog = pd.read_csv('/Users/amy/Desktop/SyntheticLethalityProject/sources/DeKegel_TableS8.csv', index_col = None)
cnv_sanger = pd.read_csv('/Users/amy/Desktop/SyntheticLethalityProject/1_data_processing/03_intersection_of_CCLs/cnv_sanger_tidy.csv', index_col = None, low_memory=False)
crispr_broad = pd.read_csv('/Users/amy/Desktop/SyntheticLethalityProject/1_data_processing/03_intersection_of_CCLs/crispr_broad_tidy.csv', index_col = None)
paralog[:2]

Unnamed: 0,prediction_rank,prediction_percentile,sorted_gene_pair,A1,A2,A1_entrez,A2_entrez,A1_ensembl,A2_ensembl,prediction_score,...,either_in_complex,mean_complex_essentiality,colocalisation,interact,n_total_ppi,fet_ppi_overlap,shared_ppi_mean_essentiality,gtex_spearman_corr,gtex_min_mean_expr,gtex_max_mean_expr
0,1,0.1,SMARCA2_SMARCA4,SMARCA2,SMARCA4,6595,6597,ENSG00000080503,ENSG00000127616,0.430886,...,True,0.387262,0.333333,True,302,114.614142,0.225382,0.627875,18.609973,34.302868
1,2,0.1,EXOC6_EXOC6B,EXOC6,EXOC6B,54536,23233,ENSG00000138190,ENSG00000144036,0.410447,...,True,0.486857,0.25,True,53,29.782706,0.285886,0.069456,6.390812,11.168367


In [3]:
## Combine the A1_entrez and A2_entrez columns into a list of unique paralog genes
A1_entrez = set(paralog['A1_entrez'])
A2_entrez = set(paralog['A2_entrez'])
paralog_union = A1_entrez.union(A2_entrez)
## Convert the Entrez IDs from int to str so they can be matched against column labels
paralog_union = [str(x) for x in paralog_union]
print("Number of unique paralog genes:", len(paralog_union))

Number of unique paralog genes: 13320
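
The union step can be checked on a small example. The sketch below is illustrative only: `toy_pairs` is a hypothetical stand-in for the paralog table (the column names match the cell above, but the pairings are made up), and it shows that a gene appearing in several pairs is counted once.

```python
import pandas as pd

# Hypothetical stand-in for the paralog pair table; the pairings are invented for illustration.
toy_pairs = pd.DataFrame({
    "A1_entrez": [6595, 54536, 6595],
    "A2_entrez": [6597, 23233, 23233],
})

# Union of both columns, converted to strings; sorted here only to make the printout stable.
toy_union = sorted(str(x) for x in set(toy_pairs["A1_entrez"]) | set(toy_pairs["A2_entrez"]))
print(len(toy_union), toy_union)
```

Six table entries collapse to four unique IDs because 6595 and 23233 each appear twice.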


In [4]:
## Filter CNV and CRISPR data based on the paralog gene list 
cnv_sanger_paralog = cnv_sanger.filter(items = paralog_union)
crispr_broad_paralog = crispr_broad.filter(items = paralog_union)

## Prepend the cell line identifiers (i.e., BROAD_ID and SangerModelID) to the filtered datasets
crispr_model = crispr_broad[['BROAD_ID', 'SangerModelID']]
cnv_model = cnv_sanger[['BROAD_ID', 'SangerModelID']]

crispr_broad_paralog = pd.concat([crispr_model, crispr_broad_paralog], axis = 1)
cnv_sanger_paralog = pd.concat([cnv_model, cnv_sanger_paralog], axis = 1)
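
One property of `DataFrame.filter(items=...)` is worth keeping in mind: labels that are not present in the table are ignored without an error, so a paralog gene with no column in the CNV or CRISPR data is dropped silently. The sketch below demonstrates this on a hypothetical toy table (`toy_data`, `toy_paralogs`, and all values are invented for illustration) and shows one possible way to list the missing genes.

```python
import pandas as pd

# Hypothetical wide table: one row per cell line, one column per Entrez ID (as strings).
toy_data = pd.DataFrame({
    "BROAD_ID": ["ACH-000001", "ACH-000002"],
    "SangerModelID": ["SIDM00001", "SIDM00002"],
    "6595": [0.1, -0.4],
    "6597": [-1.2, 0.3],
    "1000": [0.0, 0.2],  # not in the toy paralog list
})
toy_paralogs = ["6595", "6597", "23233"]  # "23233" has no column in toy_data

# filter(items=...) keeps only the labels that exist and skips the rest without raising.
kept = toy_data.filter(items=toy_paralogs)

# Explicitly list the requested genes that were not found.
missing = [g for g in toy_paralogs if g not in toy_data.columns]

print(list(kept.columns))
print(missing)
```

Applied to the real tables, the same comparison against `cnv_sanger.columns` or `crispr_broad.columns` would report how many of the paralog genes are absent from each dataset.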

In [5]:
## Save the paralog-filtered datasets
cnv_sanger_paralog.to_csv('/Users/amy/Desktop/SyntheticLethalityProject/1_data_processing/04_paralog_genes/cnv_sanger_paralog.csv', index=False)
crispr_broad_paralog.to_csv('/Users/amy/Desktop/SyntheticLethalityProject/1_data_processing/04_paralog_genes/crispr_broad_paralog.csv', index=False)