## Download Dependency Data

Source: [Cancer Dependency Map resource](https://depmap.org/portal/download/).

- CRISPRGeneDependency.csv: The probability that knocking out a gene inhibits cell growth or causes cell death. These probabilities are derived from the data in CRISPRGeneEffect.csv using methods described [here](https://www.biorxiv.org/content/10.1101/720243v1).
- Model.csv: Metadata for all of DepMap’s cancer models/cell lines.

We also create a gene dictionary for future lookups and recoding.

In [1]:
import pathlib
import sys
import urllib

import pandas as pd

sys.path.append("../")
from utils import download_utils, load_utils

In [2]:
# Set output directory
output_dir = pathlib.Path("data")

# Make sure directory exists
output_dir.mkdir(exist_ok=True)

# Set output gene file
output_gene_dictionary = pathlib.Path(output_dir, "depmap_gene_dictionary.tsv")

# Set download constants
figshare_url = "https://ndownloader.figshare.com/files/"

download_dict = {"34990033": "CRISPRGeneDependency.csv", "35020903": "Model.csv"}

In [3]:
for figshare_id, file_name in download_dict.items():
    # Set output file
    output_file = pathlib.Path(output_dir, file_name)

    # Download the dependency data
    print(f"Downloading {output_file}...")

    download_utils.download_figshare(
        figshare_id=figshare_id, output_file=output_file, figshare_url=figshare_url
    )

Downloading data/CRISPRGeneDependency.csv...
Downloading data/Model.csv...
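
The `download_utils.download_figshare` helper comes from this repository and its implementation is not shown in this notebook. As a rough sketch of what such a helper might do, assuming it simply appends the file ID to the base URL and writes the response to disk, the step could look like the following. The names `build_figshare_url`, `download_figshare_sketch`, and the `fetch` parameter are hypothetical and are not part of the repository.

```python
import pathlib
import urllib.request

# Same base URL as the figshare_url constant defined in the notebook
DEFAULT_FIGSHARE_URL = "https://ndownloader.figshare.com/files/"


def build_figshare_url(figshare_id, figshare_url=DEFAULT_FIGSHARE_URL):
    """Join the base URL and a figshare file ID into one download link."""
    return f"{figshare_url.rstrip('/')}/{figshare_id}"


def download_figshare_sketch(
    figshare_id,
    output_file,
    figshare_url=DEFAULT_FIGSHARE_URL,
    fetch=urllib.request.urlretrieve,
):
    """Hypothetical stand-in for download_utils.download_figshare.

    `fetch` is injectable so the function can be exercised without
    network access; by default it downloads with urllib.
    """
    url = build_figshare_url(figshare_id, figshare_url)
    output_file = pathlib.Path(output_file)

    # Create the target directory if it does not exist yet
    output_file.parent.mkdir(parents=True, exist_ok=True)

    fetch(url, str(output_file))
    return output_file
```

Passing a different `fetch` callable (for example, one that only records its arguments) makes it possible to check the URL construction without downloading the files.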


## Process gene dictionary

In [4]:
# Load the GeneDependency data that was just downloaded
top_dir = ".."
data_dir = "depmap/data"

depmap_df = load_utils.load_depmap(top_dir=top_dir, data_dir=data_dir)

print(depmap_df.shape)
depmap_df.head(3)

(1086, 17387)


|  | DepMap_ID | A1BG (1) | A1CF (29974) | A2M (2) | A2ML1 (144568) | A3GALT2 (127550) | A4GALT (53947) | A4GNT (51146) | AAAS (8086) | AACS (65985) | ... | ZWILCH (55055) | ZWINT (11130) | ZXDA (7789) | ZXDB (158586) | ZXDC (79364) | ZYG11A (440590) | ZYG11B (79699) | ZYX (7791) | ZZEF1 (23140) | ZZZ3 (26009) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ACH-000001 | 0.094568 | 0.012519 | 0.02746 | 0.025962 | 0.073412 | 0.02734 | 0.020199 | 0.284733 | 0.022084 | ... | 0.037449 | 0.080585 | 0.034309 | 0.007142 | 0.004241 | 0.082956 | 0.012 | 0.003592 | 0.012679 | 0.324623 |
| 1 | ACH-000004 | 0.012676 | 0.049011 | 0.075933 | 0.033215 | 0.013176 | 0.097497 | 0.005015 | 0.153166 | 0.007358 | ... | 0.038768 | 0.230569 | 0.007125 | 0.021209 | 0.011203 | 0.060266 | 0.128375 | 0.005911 | 0.004645 | 0.04253 |
| 2 | ACH-000005 | 0.053957 | 0.027968 | 0.010139 | 0.005448 | 0.018599 | 0.081636 | 0.005457 | 0.159904 | 0.050884 | ... | 0.017479 | 0.274568 | 0.054525 | 0.033396 | 0.033416 | 0.034712 | 0.092832 | 0.012482 | 0.020843 | 0.050412 |


In [5]:
# The columns are of the format: symbol (NCBI Entrez ID)
# Transform them and write out the dictionary
genes = depmap_df.columns[1:].tolist()

gene_data = []
for gene in genes:
    gene_name, ncbi_entrez_gene = gene.split(" ")
    gene_data.append([gene, gene_name, ncbi_entrez_gene.strip("()")])

gene_df = pd.DataFrame(
    gene_data, columns=["depmap_column_name", "gene_symbol", "ncbi_entrez_id"]
)

gene_df.to_csv(output_gene_dictionary, sep="\t", index=False)

print(gene_df.shape)
gene_df.head(3)

(17386, 3)


|  | depmap_column_name | gene_symbol | ncbi_entrez_id |
| --- | --- | --- | --- |
| 0 | A1BG (1) | A1BG | 1 |
| 1 | A1CF (29974) | A1CF | 29974 |
| 2 | A2M (2) | A2M | 2 |
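
The cell above splits each column name on a single space, which works as long as every header has exactly the form `symbol (NCBI Entrez ID)`. As an illustrative alternative, and not the notebook's own method, a regular expression can validate the format explicitly and fail loudly on unexpected headers. The names `GENE_COLUMN_PATTERN`, `parse_gene_column`, and `column_to_symbol` below are hypothetical, and the three example columns are taken from the output shown above.

```python
import re

# Expected header format: gene symbol, one space, numeric Entrez ID in parentheses
GENE_COLUMN_PATTERN = re.compile(r"^(?P<symbol>.+?) \((?P<entrez>\d+)\)$")


def parse_gene_column(column_name):
    """Split a header such as 'A1BG (1)' into (gene_symbol, ncbi_entrez_id)."""
    match = GENE_COLUMN_PATTERN.match(column_name)
    if match is None:
        raise ValueError(f"Unexpected column format: {column_name!r}")
    return match.group("symbol"), match.group("entrez")


# Example headers copied from the DepMap output above
example_columns = ["A1BG (1)", "A1CF (29974)", "A2M (2)"]

# Build a lookup from the original column name to the gene symbol
column_to_symbol = {
    column: parse_gene_column(column)[0] for column in example_columns
}

print(column_to_symbol["A1CF (29974)"])  # prints A1CF
```

A mapping like `column_to_symbol` could then be passed to `depmap_df.rename(columns=column_to_symbol)` to recode the dependency columns to plain gene symbols.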
