## Downloading Data
In this notebook, we download the dataset required for our compound prioritization analysis. The dataset is stored in the `../data` directory, one level above the notebook directory. This folder holds all the data and metadata needed for the analysis.

Specifically:
- The metadata and feature column names will be saved in a JSON file for easy reference.
- The profiles will be stored in Parquet format to ensure efficient storage and processing.

This setup ensures that all analytical notebooks have consistent access to the relevant data files.

**NOTE**: This analysis uses well-level aggregate profiles from the cell-injury dataset. These will be replaced once the CFReT HCS data becomes available. The goal is that swapping in new data profiles should not require changing the rest of the analysis pipeline.

In [1]:
import pathlib
import pandas as pd

In [2]:
# creating a data folder
data_dir = pathlib.Path("../data").resolve()
data_dir.mkdir(exist_ok=True)

In [3]:
# downloading labeled cell injury data from:
# https://github.com/axiomcura/predicting-cell-injury-compounds/blob/main/results/0.feature_selection/cell_injury_profile_fs.csv.gz
cell_injury_df = pd.read_csv(
    "https://github.com/axiomcura/predicting-cell-injury-compounds/raw/refs/heads/main/results/0.feature_selection/cell_injury_profile_fs.csv.gz",
    compression="gzip",
    low_memory=False,
)
cell_injury_df

| Unnamed: 0 | injury_code | injury_type | Plate | Well | Characteristics [Organism] | Term Source 1 REF | Term Source 1 Accession | Characteristics [Cell Line] | Term Source 2 REF | Term Source 2 Accession | ... | Nuclei_Intensity_MassDisplacement_DNA | Nuclei_AreaShape_Zernike_9_3 | Cytoplasm_AreaShape_Compactness | Cells_Intensity_MaxIntensityEdge_ER | Nuclei_AreaShape_Zernike_8_6 | Cells_RadialDistribution_MeanFrac_DNA_1of4 | Cells_Intensity_MeanIntensity_RNA | Nuclei_Intensity_StdIntensityEdge_DNA | Nuclei_Intensity_MassDisplacement_RNA | Cytoplasm_Intensity_MaxIntensity_Mito |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | Control | BR00110363 | B2 | Homo sapiens | NCBITaxon | NCBITaxon_9606 | U2OS | EFO | EFO_0002869 | ... | 0.067237 | -0.002523 | 0.058961 | 0.033801 | 0.038334 | -0.001711 | 0.126644 | -0.019250 | -0.049983 | -0.088455 |
| 1 | 0 | Control | BR00110363 | B3 | Homo sapiens | NCBITaxon | NCBITaxon_9606 | U2OS | EFO | EFO_0002869 | ... | 0.004589 | -0.025927 | 0.009821 | 0.079956 | 0.012352 | 0.018188 | 0.153907 | 0.011445 | -0.017450 | -0.069477 |
| 2 | 0 | Control | BR00110363 | B4 | Homo sapiens | NCBITaxon | NCBITaxon_9606 | U2OS | EFO | EFO_0002869 | ... | -0.018584 | -0.027216 | -0.024866 | -0.015023 | 0.008375 | 0.014888 | 0.202497 | -0.101534 | -0.033089 | -0.028255 |
| 3 | 0 | Control | BR00110363 | B5 | Homo sapiens | NCBITaxon | NCBITaxon_9606 | U2OS | EFO | EFO_0002869 | ... | 0.022421 | 0.020242 | 0.016581 | 0.073882 | -0.004098 | 0.009786 | 0.079221 | 0.078617 | -0.027086 | -0.082008 |
| 4 | 0 | Control | BR00110363 | B6 | Homo sapiens | NCBITaxon | NCBITaxon_9606 | U2OS | EFO | EFO_0002869 | ... | -0.044967 | -0.023699 | -0.003827 | -0.010467 | -0.015679 | 0.059695 | -0.021515 | 0.034501 | 0.001262 | -0.066839 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16698 | 14 | Nonspecific reactive | BR00114088 | G11 | Homo sapiens | NCBITaxon | NCBITaxon_9606 | U2OS | EFO | EFO_0002869 | ... | 0.022920 | -0.023663 | 0.056291 | 0.118636 | -0.019347 | -0.070072 | -0.067842 | -0.034829 | -0.016896 | 0.237271 |
| 16699 | 14 | Nonspecific reactive | BR00114088 | G12 | Homo sapiens | NCBITaxon | NCBITaxon_9606 | U2OS | EFO | EFO_0002869 | ... | 0.097635 | 0.028418 | 0.000393 | 0.099083 | -0.005465 | -0.110597 | -0.081523 | -0.099528 | -0.020128 | 0.052020 |
| 16700 | 14 | Nonspecific reactive | BR00114088 | G13 | Homo sapiens | NCBITaxon | NCBITaxon_9606 | U2OS | EFO | EFO_0002869 | ... | 0.077432 | 0.029835 | 0.069958 | 0.145948 | -0.048388 | -0.074193 | -0.069558 | 0.170924 | 0.004374 | 0.122368 |
| 16701 | 14 | Nonspecific reactive | BR00114088 | G14 | Homo sapiens | NCBITaxon | NCBITaxon_9606 | U2OS | EFO | EFO_0002869 | ... | 0.100701 | -0.008245 | 0.041126 | 0.107706 | -0.071207 | -0.056485 | 0.016940 | 0.134339 | 0.046722 | 0.063151 |


In [4]:
# Saving cell-injury well data as a parquet file
cell_injury_df.to_parquet(data_dir / "labeled_cell_injury_df.parquet", index=False)
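The metadata and feature column names described at the top of this notebook can be written to a JSON file alongside the Parquet profiles. The following is only a minimal sketch, not the project's actual implementation. It assumes that feature columns are the ones starting with a compartment prefix (`Cells_`, `Cytoplasm_`, `Nuclei_`) and that every other column is metadata. The helper names, the JSON keys (`meta_features`, `feature_cols`), and the file name `column_info.json` are hypothetical.

```python
import json
import pathlib
import tempfile

# Assumption: feature columns start with a compartment prefix;
# every other column is treated as metadata.
FEATURE_PREFIXES = ("Cells_", "Cytoplasm_", "Nuclei_")


def split_columns(columns):
    """Split column names into metadata and feature lists, preserving order."""
    names = [str(col) for col in columns]
    features = [col for col in names if col.startswith(FEATURE_PREFIXES)]
    metadata = [col for col in names if not col.startswith(FEATURE_PREFIXES)]
    # Hypothetical key names, chosen only for this illustration
    return {"meta_features": metadata, "feature_cols": features}


def save_column_info(columns, out_path):
    """Write the metadata/feature split to a JSON file and return the mapping."""
    info = split_columns(columns)
    pathlib.Path(out_path).write_text(json.dumps(info, indent=4))
    return info


# Example using a few column names from the table above
example_columns = [
    "injury_code",
    "injury_type",
    "Plate",
    "Well",
    "Nuclei_AreaShape_Zernike_9_3",
    "Cytoplasm_AreaShape_Compactness",
    "Cells_Intensity_MeanIntensity_RNA",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; a temporary directory keeps the example side-effect free
    json_path = pathlib.Path(tmp_dir) / "column_info.json"
    column_info = save_column_info(example_columns, json_path)
    reloaded = json.loads(json_path.read_text())
```

In this notebook, the same helper could be called as `save_column_info(cell_injury_df.columns, data_dir / "column_info.json")`. Downstream notebooks would then read the column lists from the JSON file instead of hard-coding them, which fits the goal of swapping data profiles without changing the pipeline.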