## Download RNAseq Data

Source: [Cancer Dependency Map resource](https://depmap.org/portal/download/).

- `CRISPRGeneDependency.parquet`: This file gives the probability that knocking out a gene inhibits cell growth or causes cell death. The probabilities are derived from the data in `CRISPRGeneEffect.parquet` using the methods described [here](https://doi.org/10.1101/720243).

>Tsherniak A, Vazquez F, Montgomery PG, Weir BA, Kryukov G, Cowley GS, Gill S, Harrington WF, Pantel S, Krill-Burger JM, Meyers RM, Ali L, Goodale A, Lee Y, Jiang G, Hsiao J, Gerath WFJ, Howell S, Merkel E, Ghandi M, Garraway LA, Root DE, Golub TR, Boehm JS, Hahn WC. Defining a Cancer Dependency Map. Cell. 2017 Jul 27;170(3):564-576.

In [1]:
import pathlib
import urllib.request

import pandas as pd
import pyarrow as pa  # Parquet engine used by pandas.DataFrame.to_parquet

In [2]:
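def download_dependency_data(figshare_id, figshare_url, output_file):
    """
    Download the Figshare file with ID figshare_id from figshare_url to output_file
    """
    # Strip any trailing slash so the joined URL has exactly one separator
    url = f"{figshare_url.rstrip('/')}/{figshare_id}"
    urllib.request.urlretrieve(url, output_file)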
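
Re-running the notebook fetches the file again each time. A minimal sketch of a wrapper that skips the download when a non-empty target file already exists. `download_if_missing` is a hypothetical helper name and is not part of the original notebook.

```python
import pathlib
import urllib.request


def download_if_missing(url, output_file):
    """Download url to output_file unless a non-empty file is already there.

    Returns True if a download happened, False if it was skipped.
    """
    output_file = pathlib.Path(output_file)

    # Skip the network call when a previous run already produced the file
    if output_file.exists() and output_file.stat().st_size > 0:
        return False

    # Create the target directory if needed, then fetch the file
    output_file.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, output_file)
    return True
```

The size check is only a rough guard against empty files left by an interrupted run. It does not verify that the content is complete.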

In [3]:
# Set download constants
output_dir = pathlib.Path("data")
figshare_url = "https://ndownloader.figshare.com/files/"

download_dict = {
    # DepMap, Broad (2024). DepMap 24Q2 Public. Figshare+. Dataset. https://doi.org/10.25452/figshare.plus.25880521.v1
    "46493242": "RNASeq.csv",
}
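
Each key in `download_dict` is a Figshare file ID that is appended to `figshare_url`, and each value is the local file name written under `output_dir`. The sketch below lists the resulting URL and path pairs without downloading anything. `plan_downloads` is a hypothetical helper used only for illustration.

```python
import pathlib


def plan_downloads(download_dict, base_url, output_dir):
    """Return (url, output_path) pairs without touching the network."""
    # Strip any trailing slash so the joined URL has exactly one separator
    base = base_url.rstrip("/")
    return [
        (f"{base}/{file_id}", pathlib.Path(output_dir, file_name))
        for file_id, file_name in download_dict.items()
    ]


plan = plan_downloads(
    {"46493242": "RNASeq.csv"},
    "https://ndownloader.figshare.com/files/",
    "data",
)
for url, path in plan:
    print(url, "->", path)
```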

In [4]:
# Make sure directory exists
output_dir.mkdir(exist_ok=True)

In [5]:
for figshare_id, file_name in download_dict.items():
    # Set output file
    output_file = pathlib.Path(output_dir, file_name)

    # Download the RNAseq data
    print(f"Downloading {output_file}...")

    download_dependency_data(
        figshare_id=figshare_id, figshare_url=figshare_url, output_file=output_file
    )

Downloading data/RNASeq.csv...
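
Before loading the whole file, it can help to confirm that the download looks like a CSV. A small sketch that reads only the header row. `peek_csv_header` is a hypothetical helper, and the check says nothing about whether the rest of the file is complete.

```python
import csv
import pathlib


def peek_csv_header(csv_path, n_columns=5):
    """Return the first n_columns header names and the total column count."""
    with pathlib.Path(csv_path).open(newline="") as handle:
        # Read just the first row; an empty file yields an empty header
        header = next(csv.reader(handle), [])
    return header[:n_columns], len(header)
```

For example, `peek_csv_header("data/RNASeq.csv")` would return the first five column names and the number of columns, assuming the file was downloaded to that path.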


In [6]:
# Convert the downloaded CSV to Parquet
data_directory = "../6.RNAseq/data/"
csv_file = pathlib.Path(data_directory, "RNASeq.csv").resolve()

# Load the full CSV into memory
df = pd.read_csv(csv_file)

# Define the output Parquet file name
parquet_file = csv_file.with_suffix('.parquet')

# Save the DataFrame as a Parquet file
df.to_parquet(parquet_file, index=False)
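
The gene columns written to the Parquet file keep their CSV labels, which in the printed DataFrame combine a gene symbol with a numeric ID in parentheses, for example `TSPAN6 (7105)`. A sketch of splitting such labels, assuming that `SYMBOL (ID)` pattern holds for every gene column. `split_gene_label` and `LABEL_PATTERN` are hypothetical names.

```python
import re

# Assumed label pattern: gene symbol, a space, then a numeric ID in parentheses
LABEL_PATTERN = re.compile(r"^(?P<symbol>.+?) \((?P<gene_id>\d+)\)$")


def split_gene_label(label):
    """Split 'TSPAN6 (7105)' into ('TSPAN6', 7105); return None if no match."""
    match = LABEL_PATTERN.match(label)
    if match is None:
        return None
    return match.group("symbol"), int(match.group("gene_id"))


for label in ["TSPAN6 (7105)", "C1orf112 (55732)", "Unnamed: 0"]:
    print(label, "->", split_gene_label(label))
```

Labels that do not fit the pattern, such as the `Unnamed: 0` column of cell-line IDs, return `None` so they can be handled separately.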

In [7]:
print(df)

     Unnamed: 0  TSPAN6 (7105)  TNMD (64102)  DPM1 (8813)  SCYL3 (57147)  \
0    ACH-001113       4.361066      0.000000     7.393090       2.873813   
1    ACH-001289       4.578939      0.584963     7.116760       2.580145   
2    ACH-001339       3.160275      0.000000     7.388103       2.397803   
3    ACH-001538       5.094236      0.000000     7.160174       2.606442   
4    ACH-001794       3.889474      0.056584     6.777946       1.978196   
..          ...            ...           ...          ...            ...   
523  ACH-001743       4.074677      0.000000     6.909293       2.263034   
524  ACH-001578       6.361768      3.418190     7.227760       2.589763   
525  ACH-002669       3.122673      0.000000     7.045487       1.570463   
526  ACH-001858       4.400538      0.000000     7.022368       1.925999   
527  ACH-001997       5.076816      0.000000     7.834850       2.601697   

     C1orf112 (55732)  FGR (2268)  CFH (3075)  FUCA2 (2519)  GCLC (2729)  ...  \
0     