## Construct gene dictionary

Column names in the gene dependency file follow the format "gene symbol (entrez id)", for example `A1BG (1)`.

The `depmap_gene_meta.tsv` file lists the genes that passed an initial QC (see Pan et al. 2022).

This notebook builds a six-column table that separates the symbol from the Entrez ID, retains the original column name, records whether each gene passed each of two QC filters, and summarizes both in a final QC column.

Example:

| entrez_id | symbol_id | dependency_column | qc_pass_pan | qc_pass_other | qc_pass |
| :-------: | :-------: | :---------------: | :---------: | :-----------: | :-----: |
| 1 | A1BG | A1BG (1) | True | True | True |
| 29974 | A1CF | A1CF (29974) | True | False | False |
| 2 | A2M | A2M (2) | False | True | False |

*Note: the values in the example table above are illustrative and may not reflect the true QC status of these genes.*
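
The column-name format can be parsed with a single regular expression. The sketch below is an illustration, not the code this notebook runs: `parse_dependency_column` is a hypothetical helper, and it assumes every column name ends in a parenthesized, purely numeric Entrez ID.

```python
import re

# Assumed format: "SYMBOL (12345)"; anything else is treated as an error
_COLUMN_PATTERN = re.compile(r"^(?P<symbol>.+?)\s*\((?P<entrez>\d+)\)$")


def parse_dependency_column(column_name):
    """Split 'SYMBOL (ENTREZ)' into (entrez_id, symbol_id, dependency_column)."""
    match = _COLUMN_PATTERN.match(column_name.strip())
    if match is None:
        raise ValueError(f"Unexpected column name format: {column_name!r}")
    # The Entrez ID is kept as a string, matching how the notebook compares IDs
    return match.group("entrez"), match.group("symbol"), column_name


rows = [
    parse_dependency_column(name)
    for name in ["A1BG (1)", "A1CF (29974)", "A2M (2)"]
]
print(rows)
```

Raising on an unexpected name, rather than silently returning a partial split, makes it easier to notice a non-gene column that slipped into the input.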

### Quality control columns

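- `qc_pass_pan` is True for genes that passed the QC of Pan et al. 2022
- `qc_pass_other` is False for gene families that are filtered out because they saturate signals; a gene belongs to one of these families if its symbol contains:
    - `RPL` - large-subunit ribosomal proteins (including mitochondrial, e.g. `MRPL51`)
    - `RPS` - small-subunit ribosomal proteins
- `qc_pass` is True only for genes that pass all QC filters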
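
To make the family filter concrete, here is a minimal sketch of how a substring pattern like `RPL|RPS` behaves with pandas `str.contains`. The symbol list is a toy example chosen for illustration, not data from this notebook.

```python
import pandas as pd

# Toy symbols for illustration only
symbols = pd.Series(["RPL3", "RPS6", "MRPL51", "MRPS12", "RPS6KA1", "A1BG"])

# str.contains treats the pattern as a regular expression by default and
# matches anywhere in the string, so mitochondrial MRPL/MRPS genes match too
qc_pass_other = ~symbols.str.contains("RPL|RPS")

print(pd.DataFrame({"symbol_id": symbols, "qc_pass_other": qc_pass_other}))
```

Because the match is on substrings, any symbol that merely contains `RPL` or `RPS` is also flagged, such as `RPS6KA1` in the toy list. Whether that is desirable depends on the analysis; the sketch only shows the behavior.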

In [1]:
import pathlib
import pandas as pd

In [2]:
base_dir = pathlib.Path("data")

dependency_file = base_dir / "CRISPRGeneEffect.csv"
qc_gene_file = base_dir / "depmap_gene_meta.tsv"

output_gene_dict_file = base_dir / "CRISPR_gene_dictionary.tsv"

In [3]:
# Load gene dependency data
dependency_df = pd.read_csv(dependency_file, index_col=0)

print(dependency_df.shape)
dependency_df.head()

(1095, 17931)


ModelID,A1BG (1),A1CF (29974),A2M (2),A2ML1 (144568),A3GALT2 (127550),A4GALT (53947),A4GNT (51146),AAAS (8086),AACS (65985),AADAC (13),...,ZWILCH (55055),ZWINT (11130),ZXDA (7789),ZXDB (158586),ZXDC (79364),ZYG11A (440590),ZYG11B (79699),ZYX (7791),ZZEF1 (23140),ZZZ3 (26009)
ACH-000001,-0.102725,0.058595,0.058246,-0.041881,-0.088661,0.170335,-0.015254,-0.223691,0.218612,0.025719,...,-0.084055,-0.084184,0.131495,0.238702,0.201712,-0.250381,0.045612,0.044154,0.146801,-0.473583
ACH-000004,0.008878,-0.077633,-0.099297,0.03012,-0.080334,-0.112404,0.298774,-0.125139,0.218675,0.222941,...,-0.066673,-0.443145,0.183618,0.058936,0.108711,0.056322,-0.355712,0.13531,0.200408,-0.07615
ACH-000005,-0.11795,0.013989,0.164099,0.18457,-0.201766,-0.202198,0.207814,-0.089192,-0.082624,0.119679,...,-0.151588,-0.402572,-0.07332,-0.114402,-0.009449,-0.198378,-0.135007,0.014708,-0.065341,-0.196296
ACH-000007,-0.049135,-0.089991,0.084994,0.129586,-0.041561,-0.014555,0.045143,-0.263324,-0.135143,0.22904,...,-0.273444,-0.533265,-0.016257,0.222234,0.086937,-0.070598,-0.412361,-0.003722,-0.277756,-0.410805
ACH-000009,0.004969,-0.09817,0.092887,0.110913,0.028599,-0.087008,0.073032,-0.240147,0.072294,0.112749,...,-0.212287,-0.326986,-0.037498,0.235983,-0.070229,-0.061208,-0.537773,0.08463,0.018678,-0.307176


In [4]:
# Load depmap metadata
gene_meta_df = pd.read_csv(qc_gene_file, sep="\t")
gene_meta_df.entrezgene = gene_meta_df.entrezgene.astype(str)

print(gene_meta_df.shape)
gene_meta_df.head(3)

(2921, 19)


Unnamed: 0,Name,symbol,entrezgene,Function_1,Function_2,Function_3,Function_4,Loading_1,Loading_2,Loading_3,Loading_4,Recon_Pearson,Location,Location_URL,DepMap_URL,GeneCard_URL,NIH_Gene_URL,Pubmed_Count,Understudied
0,AAAS (8086),AAAS,8086,V105,V112,V148,V87,0.486503,0.319132,0.250985,-0.085859,0.564515,"mitochondrial outer membrane, peroxisome",https://humancellmap.org/explore/reports/prey?...,https://depmap.org/portal/gene/AAAS?tab=overview,https://www.genecards.org/cgi-bin/carddisp.pl?...,https://www.ncbi.nlm.nih.gov/gene/?term=8086,93.0,False
1,AAMP (14),AAMP,14,V16,V37,V78,V24,0.214734,0.196146,-0.17645,0.163906,0.386308,,,https://depmap.org/portal/gene/AAMP?tab=overview,https://www.genecards.org/cgi-bin/carddisp.pl?...,https://www.ncbi.nlm.nih.gov/gene/?term=14,49.0,False
2,AARS (16),AARS,16,V10,V63,V1,V98,0.417229,0.251385,0.142732,0.075179,0.561534,,,https://depmap.org/portal/gene/AARS?tab=overview,https://www.genecards.org/cgi-bin/carddisp.pl?...,https://www.ncbi.nlm.nih.gov/gene/?term=16,80.0,False
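
The `astype(str)` conversion above matters because the Entrez IDs parsed from column names are strings. The sketch below uses toy values and assumes the metadata column would otherwise be read as integers; it shows why a set intersection across mixed types comes back empty.

```python
# Toy values for illustration only
meta_ids_as_int = {8086, 14, 16}    # assumed: IDs read as integers
parsed_ids = {"8086", "14", "1"}    # IDs parsed from "SYMBOL (ID)" strings

# Integers never compare equal to strings, so nothing overlaps
mismatched_overlap = meta_ids_as_int & parsed_ids

# Converting to strings first gives the intended overlap
meta_ids_as_str = {str(entrez_id) for entrez_id in meta_ids_as_int}
overlap = meta_ids_as_str & parsed_ids

print(len(mismatched_overlap), sorted(overlap))
```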


## Obtain the intersection of the genes

Compare the genes in the current DepMap release with the gene set that passed the previous QC (based on the 19Q2 DepMap release), and find the Entrez IDs present in both.

In [5]:
# Parse entrez ids from the dependency column names
# (ModelID is already the index, so every column is a gene and none is skipped)
entrez_genes = [x[1].strip(")").strip() for x in dependency_df.columns.str.split("(")]

# Obtain intersection of entrez gene ids
entrez_intersection = list(
    set(gene_meta_df.entrezgene).intersection(set(entrez_genes))
)

print(f"The number of overlapping entrez gene ids: {len(entrez_intersection)}")

# Subset the gene metadata to the genes in common; these are the ones that passed QC
gene_passed_qc_df = (
    gene_meta_df
    .query("entrezgene in @entrez_intersection")
    .set_index("entrezgene")
    .reindex(entrez_intersection)
    .reset_index()
    .loc[:, ["entrezgene", "Name", "symbol"]]
)

gene_passed_qc_df.head()

The number of overlapping entrez gene ids: 2903


Unnamed: 0,entrezgene,Name,symbol
0,51258,MRPL51 (51258),MRPL51
1,26149,ZNF658 (26149),ZNF658
2,23262,PPIP5K2 (23262),PPIP5K2
3,6059,ABCE1 (6059),ABCE1
4,51594,NBAS (51594),NBAS


## Split the dependency column names into three parts

1. Entrez ID
2. Symbol
3. The full column name

In [6]:
entrez_genes = [x[1].strip(")").strip() for x in dependency_df.columns.str.split("(")]
symbol_genes = [x[0].strip() for x in dependency_df.columns.str.split("(")]

gene_dictionary_df = pd.DataFrame(
    [
        entrez_genes,
        symbol_genes,
        dependency_df.columns.tolist()
    ]
).transpose()

gene_dictionary_df.columns = ["entrez_id", "symbol_id", "dependency_column"]

print(gene_dictionary_df.shape)
gene_dictionary_df.head()

(17931, 3)


Unnamed: 0,entrez_id,symbol_id,dependency_column
0,1,A1BG,A1BG (1)
1,29974,A1CF,A1CF (29974)
2,2,A2M,A2M (2)
3,144568,A2ML1,A2ML1 (144568)
4,127550,A3GALT2,A3GALT2 (127550)


## Create the QC columns

In [7]:
# These gene families consistently oversaturate signals in latent representations
qc_fail_other_genes = "RPL|RPS"

In [8]:
gene_dictionary_qc_df = (
    # Merge gene dictionary with qc dataframe
    gene_dictionary_df.merge(
        gene_passed_qc_df,
        left_on="entrez_id",
        right_on="entrezgene",
        how="left"  # Note the left merge, to retain all genes from gene_dictionary_df
    )
    # Select only certain columns
    .loc[:, ["entrez_id", "symbol_id", "dependency_column", "entrezgene"]]
    # Values that are missing indicate genes that did not pass QC
    .fillna(value={"entrezgene": False})
    # Rename the column to be clearly defined
    .rename(columns={"entrezgene": "qc_pass_pan"})
)

# Any entrez id still present means the gene matched the QC list, so mark it as passing
gene_dictionary_qc_df.loc[gene_dictionary_qc_df.qc_pass_pan != False, "qc_pass_pan"] = True

# Create the qc_pass_other column
gene_dictionary_qc_df = (
    gene_dictionary_qc_df.assign(
        qc_pass_other=~gene_dictionary_qc_df.symbol_id.str.contains(qc_fail_other_genes)
    )
)

# Create qc_pass summary column
gene_dictionary_qc_df = (
    gene_dictionary_qc_df.assign(
        qc_pass=(gene_dictionary_qc_df.qc_pass_pan & gene_dictionary_qc_df.qc_pass_other)
    )
)

# Output file
gene_dictionary_qc_df.to_csv(output_gene_dict_file, index=False, sep="\t")

print(gene_dictionary_qc_df.qc_pass.value_counts())
print(gene_dictionary_qc_df.shape)

gene_dictionary_qc_df.head(3)

qc_pass
False    15163
True      2768
Name: count, dtype: int64
(17931, 6)


Unnamed: 0,entrez_id,symbol_id,dependency_column,qc_pass_pan,qc_pass_other,qc_pass
0,1,A1BG,A1BG (1),False,True,False
1,29974,A1CF,A1CF (29974),False,True,False
2,2,A2M,A2M (2),False,True,False
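
For comparison, the same three QC columns can be derived without a merge. This is a sketch of an alternative, not the approach used above: it relies on `Series.isin`, uses small toy frames in place of the real data, and assumes Entrez IDs are unique strings in both tables.

```python
import pandas as pd

# Toy stand-ins for the gene dictionary and the QC-passing gene list
toy_dictionary_df = pd.DataFrame(
    {
        "entrez_id": ["1", "8086", "51258"],
        "symbol_id": ["A1BG", "AAAS", "MRPL51"],
        "dependency_column": ["A1BG (1)", "AAAS (8086)", "MRPL51 (51258)"],
    }
)
toy_passed_qc_df = pd.DataFrame({"entrezgene": ["8086", "51258"]})

# Membership test yields a boolean column directly
qc_pass_pan = toy_dictionary_df.entrez_id.isin(toy_passed_qc_df.entrezgene)
# Same family filter as above: fail symbols containing RPL or RPS
qc_pass_other = ~toy_dictionary_df.symbol_id.str.contains("RPL|RPS")

toy_qc_df = toy_dictionary_df.assign(
    qc_pass_pan=qc_pass_pan,
    qc_pass_other=qc_pass_other,
    qc_pass=qc_pass_pan & qc_pass_other,
)
print(toy_qc_df)
```

With this approach all three QC columns come out with a boolean dtype, so no placeholder fill or later recoding is needed.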
