## Construct gene dictionary

The column names of the gene dependency files are of the format "gene symbol (entrez id)".
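
The "gene symbol (entrez id)" format can also be parsed with a single regular expression. The following is a minimal sketch, assuming every column name ends in a parenthesized numeric ID; the `names` series and the `pattern` variable are illustrative and not part of the original notebook.

```python
import pandas as pd

# Illustrative column names in the "gene symbol (entrez id)" format
names = pd.Series(["A1BG (1)", "A1CF (29974)", "A2M (2)"])

# Named groups become the column names of the extracted DataFrame
pattern = r"^(?P<symbol_id>.+?)\s*\((?P<entrez_id>\d+)\)$"
parsed = names.str.extract(pattern)

# Keep the original column name alongside the parsed parts
parsed["dependency_column"] = names

print(parsed)
```

Names that do not match the pattern come back as missing values, which makes malformed column names easy to spot.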

Additionally, the `depmap_gene_meta.tsv` file contains the genes that passed an initial QC (see Pan et al. 2022).

This notebook creates a six-column table that separates the symbol from the entrez id, retains the original column name, adds two columns indicating whether the gene passed each of two QC filters, and adds a QC summary column.

Example:

| entrez_id | symbol_id | dependency_column | qc_pass_pan | qc_pass_other | qc_pass |
| :-------: | :-------: | :---------------: | :---------: | :-----------: | :-----: |
| 1 | A1BG | A1BG (1) | True | True | True |
| 29974 | A1CF | A1CF (29974) | True | False | False |
| 2 | A2M | A2M (2) | False | True | False |

*Note: the QC values in the example above are illustrative and may not reflect the true values.*

### Quality control columns

- `qc_pass_pan` indicates whether the gene passed the QC of Pan et al. 2022
- `qc_pass_other` indicates whether the gene is outside the gene families filtered for saturating signals
    - RPL - Large subunit ribosomal proteins (including mitochondrial)
    - RPS - Small subunit ribosomal proteins
- `qc_pass` indicates whether the gene passes all QC metrics
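
The family filter is applied later in this notebook as a substring match on the pattern `RPL|RPS`. The sketch below, on a handful of illustrative symbols, shows what such a match catches: mitochondrial symbols such as MRPL12 are flagged because the pattern can match anywhere in the symbol, and so are symbols such as RPS6KA1 and TRPS1, which contain "RPS" without being ribosomal protein genes. Whether the latter is intended is not stated here.

```python
import pandas as pd

# Illustrative symbols chosen to show the substring behavior
symbols = pd.Series(
    ["RPL3", "RPS6", "MRPL12", "MRPS7", "RPS6KA1", "TRPS1", "A1BG"]
)

# Same pattern as the notebook; str.contains matches anywhere in the symbol
fails_other = symbols.str.contains("RPL|RPS")

print(pd.DataFrame({"symbol_id": symbols, "qc_pass_other": ~fails_other}))
```

Only `A1BG` keeps `qc_pass_other = True` in this toy example.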

In [1]:
import pathlib
import pandas as pd

In [2]:
base_dir = pathlib.Path("data")

dependency_file = base_dir / "CRISPRGeneEffect.csv"
qc_gene_file = base_dir / "depmap_gene_meta.tsv"

output_gene_dict_file = base_dir / "CRISPR_gene_dictionary.tsv"

In [3]:
# Load gene dependency data
dependency_df = pd.read_csv(dependency_file, index_col=0)

print(dependency_df.shape)
dependency_df.head()

(1150, 18443)


ModelID,A1BG (1),A1CF (29974),A2M (2),A2ML1 (144568),A3GALT2 (127550),A4GALT (53947),A4GNT (51146),AAAS (8086),AACS (65985),AADAC (13),...,ZWILCH (55055),ZWINT (11130),ZXDA (7789),ZXDB (158586),ZXDC (79364),ZYG11A (440590),ZYG11B (79699),ZYX (7791),ZZEF1 (23140),ZZZ3 (26009)
ACH-000001,-0.134132,0.029103,0.016454,-0.13754,-0.047273,0.181367,-0.082437,-0.059023,0.194592,0.035473,...,-0.123528,0.08514,0.181954,0.239474,0.172965,-0.230327,0.055657,0.044296,0.107361,-0.410449
ACH-000004,-0.001436,-0.080068,-0.125263,-0.027607,-0.053838,-0.151272,0.240094,-0.038922,0.186438,0.160221,...,-0.186899,-0.359257,0.202271,0.05774,0.089295,0.086703,-0.30493,0.086858,0.254538,-0.087671
ACH-000005,-0.14494,0.026541,0.160605,0.088015,-0.202605,-0.24342,0.133726,-0.034895,-0.126105,0.03603,...,-0.309668,-0.344502,-0.05616,-0.092447,-0.01555,-0.17038,-0.080934,-0.059685,0.030254,-0.145055
ACH-000007,-0.053334,-0.12042,0.047978,0.086984,-0.018987,-0.017309,-4.1e-05,-0.158419,-0.169559,0.201305,...,-0.323038,-0.387265,-0.013816,0.183228,0.038424,-0.051728,-0.383499,-0.012801,-0.294771,-0.431575
ACH-000009,-0.027684,-0.144202,0.052846,0.073833,0.038823,-0.108149,0.010811,-0.0886,0.032194,0.11427,...,-0.253057,-0.159965,-0.025342,0.1915,-0.071632,-0.077843,-0.525599,0.093219,-0.029515,-0.255204


In [4]:
# Load depmap metadata
gene_meta_df = pd.read_csv(qc_gene_file, sep="\t")
gene_meta_df.entrezgene = gene_meta_df.entrezgene.astype(str)

print(gene_meta_df.shape)
gene_meta_df.head(3)

(2921, 19)


Unnamed: 0,Name,symbol,entrezgene,Function_1,Function_2,Function_3,Function_4,Loading_1,Loading_2,Loading_3,Loading_4,Recon_Pearson,Location,Location_URL,DepMap_URL,GeneCard_URL,NIH_Gene_URL,Pubmed_Count,Understudied
0,AAAS (8086),AAAS,8086,V105,V112,V148,V87,0.486503,0.319132,0.250985,-0.085859,0.564515,"mitochondrial outer membrane, peroxisome",https://humancellmap.org/explore/reports/prey?...,https://depmap.org/portal/gene/AAAS?tab=overview,https://www.genecards.org/cgi-bin/carddisp.pl?...,https://www.ncbi.nlm.nih.gov/gene/?term=8086,93.0,False
1,AAMP (14),AAMP,14,V16,V37,V78,V24,0.214734,0.196146,-0.17645,0.163906,0.386308,,,https://depmap.org/portal/gene/AAMP?tab=overview,https://www.genecards.org/cgi-bin/carddisp.pl?...,https://www.ncbi.nlm.nih.gov/gene/?term=14,49.0,False
2,AARS (16),AARS,16,V10,V63,V1,V98,0.417229,0.251385,0.142732,0.075179,0.561534,,,https://depmap.org/portal/gene/AARS?tab=overview,https://www.genecards.org/cgi-bin/carddisp.pl?...,https://www.ncbi.nlm.nih.gov/gene/?term=16,80.0,False


## Obtain the intersection of the genes

Compare the genes in the current DepMap release with the previously QC'd gene set (19Q2 DepMap release).
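
The intersection only works if both sides hold entrez IDs of the same type, which is why the metadata column is cast to `str` when it is loaded. The sketch below uses small illustrative inputs (`meta` and `parsed_ids` are hypothetical stand-ins, not the real files) to show the difference.

```python
import pandas as pd

# Illustrative metadata with integer IDs, as a reader might infer from a TSV
meta = pd.DataFrame({"entrezgene": [8086, 14, 16]})

# IDs parsed from column names are strings
parsed_ids = ["1", "8086", "16"]

# Without casting, integers and strings never compare equal
no_cast = set(meta.entrezgene) & set(parsed_ids)

# After casting to str, the shared IDs are found
shared = set(meta.entrezgene.astype(str)) & set(parsed_ids)

print(len(no_cast), sorted(shared))
```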

In [5]:
# Extract entrez ids from the dependency column names
# (ModelID is already the index, so every column is a gene)
entrez_genes = [x[1].strip(")").strip() for x in dependency_df.columns.str.split("(")]

# Obtain intersection of entrez gene ids
entrez_intersection = list(
    set(gene_meta_df.entrezgene).intersection(set(entrez_genes))
)

print(f"The number of overlapping entrez gene ids: {len(entrez_intersection)}")

# Subset the gene metadata file to only those in common, which are ones that passed qc
gene_passed_qc_df = (
    gene_meta_df
    .query("entrezgene in @entrez_intersection")
    .set_index("entrezgene")
    .reindex(entrez_intersection)
    .reset_index()
    .loc[:, ["entrezgene", "Name", "symbol"]]
)

gene_passed_qc_df.head()

The number of overlapping entrez gene ids: 2907


Unnamed: 0,entrezgene,Name,symbol
0,4792,NFKBIA (4792),NFKBIA
1,166968,MIER3 (166968),MIER3
2,5708,PSMD2 (5708),PSMD2
3,4089,SMAD4 (4089),SMAD4
4,84060,RBM48 (84060),RBM48


## Split the dependency file column names into three parts

1. Entrez ID
2. Symbol
3. The full column name

In [6]:
entrez_genes = [x[1].strip(")").strip() for x in dependency_df.columns.str.split("(")]
symbol_genes = [x[0].strip() for x in dependency_df.columns.str.split("(")]

gene_dictionary_df = pd.DataFrame(
    [
        entrez_genes,
        symbol_genes,
        dependency_df.columns.tolist()
    ]
).transpose()

gene_dictionary_df.columns = ["entrez_id", "symbol_id", "dependency_column"]

print(gene_dictionary_df.shape)
gene_dictionary_df.head()

(18443, 3)


Unnamed: 0,entrez_id,symbol_id,dependency_column
0,1,A1BG,A1BG (1)
1,29974,A1CF,A1CF (29974)
2,2,A2M,A2M (2)
3,144568,A2ML1,A2ML1 (144568)
4,127550,A3GALT2,A3GALT2 (127550)


## Create the QC columns

In [7]:
# These gene families consistently oversaturate signals in latent representations
qc_fail_other_genes = "RPL|RPS"

In [8]:
gene_dictionary_qc_df = (
    # Merge gene dictionary with qc dataframe
    gene_dictionary_df.merge(
        gene_passed_qc_df,
        left_on="entrez_id",
        right_on="entrezgene",
        how="left"  # Note the left merge, to retain all genes from gene_dictionary_df
    )
    # Select only certain columns
    .loc[:, ["entrez_id", "symbol_id", "dependency_column", "entrezgene"]]
    # Values that are missing indicate genes that did not pass QC
    .fillna(value={"entrezgene": False})
    # Rename the column to be clearly defined
    .rename(columns={"entrezgene": "qc_pass_pan"})
)

# Any remaining entrez id means the gene passed QC; convert the column to boolean
gene_dictionary_qc_df["qc_pass_pan"] = gene_dictionary_qc_df.qc_pass_pan.ne(False)

# Create the qc_pass_other column
gene_dictionary_qc_df = (
    gene_dictionary_qc_df.assign(
        qc_pass_other=~gene_dictionary_qc_df.symbol_id.str.contains(qc_fail_other_genes)
    )
)

# Create qc_pass summary column
gene_dictionary_qc_df = (
    gene_dictionary_qc_df.assign(
        qc_pass=(gene_dictionary_qc_df.qc_pass_pan & gene_dictionary_qc_df.qc_pass_other)
    )
)

# Output file
gene_dictionary_qc_df.to_csv(output_gene_dict_file, index=False, sep="\t")

print(gene_dictionary_qc_df.qc_pass.value_counts())
print(gene_dictionary_qc_df.shape)

gene_dictionary_qc_df.head(3)

qc_pass
False    15671
True      2772
Name: count, dtype: int64
(18443, 6)


Unnamed: 0,entrez_id,symbol_id,dependency_column,qc_pass_pan,qc_pass_other,qc_pass
0,1,A1BG,A1BG (1),False,True,False
1,29974,A1CF,A1CF (29974),False,True,False
2,2,A2M,A2M (2),False,True,False
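
The same three QC flags can also be derived without a merge. This is a minimal sketch on toy rows, assuming entrez IDs are unique in the QC gene list; the rows, the IDs, and the `passed_pan_ids` list are illustrative and are not taken from the real files.

```python
import pandas as pd

# Toy gene dictionary (illustrative values only)
gene_dictionary = pd.DataFrame(
    {
        "entrez_id": ["1", "8086", "6125"],
        "symbol_id": ["A1BG", "AAAS", "RPL5"],
    }
)

# Hypothetical list of entrez IDs that passed the Pan et al. QC
passed_pan_ids = ["8086", "6125"]

flags = gene_dictionary.assign(
    # Membership test yields a boolean column directly
    qc_pass_pan=gene_dictionary.entrez_id.isin(passed_pan_ids),
    # Substring match on the same family pattern used in the notebook
    qc_pass_other=~gene_dictionary.symbol_id.str.contains("RPL|RPS"),
)

# A gene passes overall only if it passes both filters
flags["qc_pass"] = flags.qc_pass_pan & flags.qc_pass_other

print(flags)
```

Because `isin` returns booleans, no placeholder fill or later conversion is needed before combining the two flags.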
