# Splitting Data

Here, we use the feature-selected profiles generated in the preceding module notebook [here](../0.feature_selection/0.feature_selection.ipynb) and divide the data into training, testing, and holdout sets for machine learning.


In [1]:
import sys
import json
import pathlib
import warnings

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

sys.path.append("../../")  # noqa
from src.utils import split_meta_and_features, get_injury_treatment_info  # noqa

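# ignoring warnings for the rest of the notebook
# (warnings.catch_warnings(...) on its own only builds a context manager and has no effect)
warnings.filterwarnings("ignore")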



## Parameters

Below are the parameters used in this notebook.


In [2]:
# setting seed constants
seed = 0
np.random.seed(seed)

# directory to get all the inputs for this notebook
data_dir = pathlib.Path("../../data").resolve(strict=True)
results_dir = pathlib.Path("../../results").resolve(strict=True)
fs_dir = (results_dir / "0.feature_selection").resolve(strict=True)

# directory to store all the output of this notebook
data_split_dir = (results_dir / "1.data_splits").resolve()
data_split_dir.mkdir(exist_ok=True)

In [3]:
# data paths
fs_profile_path = (fs_dir / "cell_injury_profile_fs.csv.gz").resolve(strict=True)

# load data
fs_profile_df = pd.read_csv(fs_profile_path)

# splitting meta and feature column names
fs_meta, fs_feats = split_meta_and_features(fs_profile_df)

# display
print("fs profile with control: ", fs_profile_df.shape)
fs_profile_df.head()

fs profile with control:  (16703, 254)


Unnamed: 0,injury_code,injury_type,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,...,Nuclei_RadialDistribution_RadialCV_DNA_4of4,Cytoplasm_AreaShape_Extent,Cytoplasm_AreaShape_Zernike_9_5,Nuclei_RadialDistribution_RadialCV_DNA_1of4,Cytoplasm_RadialDistribution_RadialCV_ER_1of4,Cells_RadialDistribution_FracAtD_DNA_3of4,Nuclei_Intensity_MeanIntensityEdge_RNA,Nuclei_RadialDistribution_RadialCV_Mito_4of4,Cells_RadialDistribution_RadialCV_AGP_3of4,Nuclei_Intensity_IntegratedIntensityEdge_ER
0,0,Control,BR00110363,B2,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.08926,-0.082892,-0.012755,0.00499,-0.039931,0.002696,0.108018,0.041612,0.102915,-0.048802
1,0,Control,BR00110363,B3,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.029568,-0.021502,0.027104,0.016882,-0.051326,-0.029957,0.034152,0.002146,-0.009631,0.020328
2,0,Control,BR00110363,B4,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.034188,0.014503,0.000802,-0.012599,-0.069699,-0.009395,0.111277,-0.155381,0.028251,-0.044589
3,0,Control,BR00110363,B5,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.011993,-0.032153,0.000531,0.009794,-0.020744,-0.013818,0.052255,-0.018047,0.012353,0.051871
4,0,Control,BR00110363,B6,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.014885,0.06177,0.016779,-0.029602,-0.024857,0.004196,-0.038872,-0.071742,0.092877,-0.001608
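`split_meta_and_features` is imported from this repository's `src.utils`, and its implementation is not shown in this notebook. As a rough sketch of the idea only, assuming that feature columns are the ones starting with a CellProfiler compartment prefix (`Cells_`, `Cytoplasm_`, `Nuclei_`), a hypothetical stand-in named `split_columns_sketch` could look like this:

```python
import pandas as pd

# Assumed compartment prefixes for morphology features (hypothetical default).
COMPARTMENTS = ("Cells_", "Cytoplasm_", "Nuclei_")


def split_columns_sketch(profile: pd.DataFrame, prefixes=COMPARTMENTS):
    """Return (metadata columns, feature columns) based on column-name prefixes."""
    feats = [col for col in profile.columns if col.startswith(prefixes)]
    meta = [col for col in profile.columns if not col.startswith(prefixes)]
    return meta, feats


# toy frame reusing a few column names from the table above
toy = pd.DataFrame(
    {
        "Plate": ["BR00110363"],
        "Well": ["B2"],
        "Cytoplasm_AreaShape_Extent": [-0.082892],
        "Nuclei_Intensity_MeanIntensityEdge_RNA": [0.108018],
    }
)
meta_cols_toy, feat_cols_toy = split_columns_sketch(toy)
print(meta_cols_toy)
print(feat_cols_toy)
```

The real helper may use a different rule, so treat the prefix list as an assumption rather than a description of the repository's code.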


## Exploring the data set

Below is an exploration of the feature-selected dataset. The aim is to identify treatments, extract metadata, and gain an understanding of the experiment's design.


Below we show the number of wells for each treatment.


In [4]:
# displaying the number of wells per treatment
well_treatments_counts_df = (
    fs_profile_df["Compound Name"].value_counts().to_frame().reset_index()
)

well_treatments_counts_df

| | Compound Name | count |
| --- | --- | --- |
| 0 | DMSO | 9855 |
| 1 | Wortmannin | 600 |
| 2 | Colchicine | 512 |
| 3 | Nocodazole | 504 |
| 4 | Radicicol | 504 |
| ... | ... | ... |
| 139 | Melphalan | 24 |
| 140 | Carmustine | 24 |
| 141 | Thio-TEPA | 24 |
| 142 | Chlorambucil | 24 |


Below we show the number of wells for each cellular injury type.


In [5]:
# displaying how many wells each cell injury has
cell_injury_well_counts = (
    fs_profile_df["injury_type"].value_counts().to_frame().reset_index()
)
cell_injury_well_counts

| | injury_type | count |
| --- | --- | --- |
| 0 | Control | 9855 |
| 1 | Cytoskeletal | 1472 |
| 2 | Miscellaneous | 1304 |
| 3 | Kinase | 1104 |
| 4 | Genotoxin | 944 |
| 5 | Hsp90 | 552 |
| 6 | Redox | 312 |
| 7 | Saponin | 288 |
| 8 | HDAC | 168 |
| 9 | Proteasome | 144 |


Next, we extract summary metadata: the number of compounds and the number of wells associated with each injury type.

This will be saved in the `results/1.data_splits` directory


In [6]:
# get summary information and save it
injury_before_holdout_info_df = get_injury_treatment_info(
    profile=fs_profile_df, groupby_key="injury_type"
).reset_index(drop=True)

# display
print("Shape:", injury_before_holdout_info_df.shape)
injury_before_holdout_info_df

Shape: (15, 5)


| | injury_type | injury_code | n_wells | n_compounds | compound_list |
| --- | --- | --- | --- | --- | --- |
| 0 | Control | 0 | 9855 | 1 | [DMSO] |
| 1 | Cytoskeletal | 1 | 1472 | 15 | [Nocodazole, Colchicine, Paclitaxel, Vinblasti... |
| 2 | Miscellaneous | 5 | 1304 | 39 | [L-Buthionine-(S,R)-sulfoximine, CDDO Im, Cino... |
| 3 | Kinase | 3 | 1104 | 13 | [Wortmannin, Staurosporine, PI-103, BEZ-235, A... |
| 4 | Genotoxin | 4 | 944 | 22 | [Camptothecin, CX-5461, Doxorubicin, Cladribin... |
| 5 | Hsp90 | 2 | 552 | 3 | [Radicicol, Geldanamycin, 17-AAG] |
| 6 | Redox | 6 | 312 | 12 | [Menadione, PKF118-310, 4-Amino-1-naphthol (HC... |
| 7 | Saponin | 10 | 288 | 11 | [Digitonin, Saikosaponin A, Polygalasaponin F,... |
| 8 | HDAC | 7 | 168 | 5 | [AR-42, SAHA, ITF 2357, Panobinostat, Apicidin] |
| 9 | Mitochondria | 11 | 144 | 4 | [Antimycin A, CCCP, Rotenone, Oligomycin A] |
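`get_injury_treatment_info` also comes from `src.utils` and is not shown here. Judging only from the columns in the output above, a summary of this kind can be approximated with a grouped aggregation. The function below is a hypothetical stand-in (it omits `injury_code` and may differ from the real helper in sorting and edge cases):

```python
import pandas as pd


def injury_summary_sketch(profile: pd.DataFrame, groupby_key: str) -> pd.DataFrame:
    """Summarize wells and compounds per group (hypothetical stand-in)."""
    return (
        profile.groupby(groupby_key)
        .agg(
            n_wells=("Compound Name", "size"),
            n_compounds=("Compound Name", "nunique"),
            compound_list=("Compound Name", lambda names: names.unique().tolist()),
        )
        .reset_index()
        .sort_values("n_wells", ascending=False, ignore_index=True)
    )


# toy data: three control wells and two Hsp90 wells
toy = pd.DataFrame(
    {
        "injury_type": ["Control", "Control", "Control", "Hsp90", "Hsp90"],
        "Compound Name": ["DMSO", "DMSO", "DMSO", "Radicicol", "17-AAG"],
    }
)
toy_summary = injury_summary_sketch(toy, groupby_key="injury_type")
print(toy_summary)
```

Each row of the result describes one injury type, which is the same granularity as the table above.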


Next, we construct the profile metadata. This provides a structured overview of how the treatments associated with each injury were applied, including the plates on which each injury type appears and the number of wells per treatment.

This will be saved in the `results/1.data_splits` directory


In [7]:
injury_meta_dict = {}
for injury, df in fs_profile_df.groupby("injury_type"):
    # collecting treatment metadata
    plates = df["Plate"].unique().tolist()
    treatment_meta = {}
    treatment_meta["n_plates"] = len(plates)
    treatment_meta["n_wells"] = df.shape[0]
    treatment_meta["n_treatments"] = len(df["Compound Name"].unique())
    treatment_meta["associated_plates"] = plates

    # counting wells per treatment
    treatment_counter = {}
    for treatment, df2 in df.groupby("Compound Name"):
        # defensive check; groupby drops NaN keys by default
        if pd.isna(treatment):
            continue
        n_wells = df2.shape[0]
        treatment_counter[treatment] = n_wells

    # storing treatment counts
    treatment_meta["treatments"] = treatment_counter
    injury_meta_dict[injury] = treatment_meta

# save dictionary into a json file
with open(data_split_dir / "injury_metadata.json", mode="w") as stream:
    json.dump(injury_meta_dict, stream)

Here we build the plate metadata, which records the treatments present on each plate and the number of wells that received each treatment.

This will be saved in `results/1.data_splits`


In [8]:
plate_meta = {}
for plate_id, df in fs_profile_df.groupby("Plate"):
    # counting wells per treatment on this plate
    treatment_counter = {}
    for treatment, df2 in df.groupby("Compound Name"):
        treatment_counter[treatment] = df2.shape[0]

    plate_meta[plate_id] = treatment_counter

# save dictionary into a json file
with open(data_split_dir / "cell_injury_plate_info.json", mode="w") as stream:
    json.dump(plate_meta, stream)
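The per-plate counting above can also be written as a single grouped count that is then reshaped into a nested dictionary. This is a minimal sketch on toy data with made-up plate IDs (`P1`, `P2`), not a replacement for the notebook's code:

```python
import json

import pandas as pd

# toy stand-in for the profile table (values are illustrative only)
toy = pd.DataFrame(
    {
        "Plate": ["P1", "P1", "P1", "P2"],
        "Compound Name": ["DMSO", "DMSO", "Wortmannin", "DMSO"],
    }
)

# one pass over the data: wells per (plate, compound) pair
pair_counts = toy.groupby(["Plate", "Compound Name"]).size()

# reshape into {plate: {compound: n_wells}}; cast to int so json can serialize it
plate_meta_sketch = {}
for (plate, compound), n_wells in pair_counts.items():
    plate_meta_sketch.setdefault(plate, {})[compound] = int(n_wells)

print(json.dumps(plate_meta_sketch, indent=2))
```

The explicit `int(...)` cast avoids relying on how a given pandas version converts NumPy integers before they reach `json`.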

## Data Splitting

---


### Holdout Dataset

Here we collect our holdout datasets. A holdout dataset is a subset of the data that is not used during model training or tuning. Instead, it is reserved solely for evaluating the model's performance after it has been trained.

In this notebook, we create three different types of holdout datasets before proceeding with machine learning training and evaluation.

- Plate holdout
- Treatment holdout
- Well holdout

Each of these holdout datasets will be stored in the `results/1.data_splits` directory
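Because every holdout is removed from the main dataframe before the next one is drawn, the resulting sets should never share a row. A small helper can make that explicit. The function name `assert_disjoint` and the toy data below are hypothetical and only illustrate the check:

```python
import pandas as pd


def assert_disjoint(**index_sets: pd.Index) -> None:
    """Raise AssertionError if any two named index sets share a row label."""
    names = list(index_sets)
    for i, first in enumerate(names):
        for second in names[i + 1 :]:
            overlap = index_sets[first].intersection(index_sets[second])
            assert overlap.empty, f"{first} and {second} share {len(overlap)} rows"


# toy data: hold out one plate, then two random wells from what remains
toy = pd.DataFrame({"Plate": ["P1", "P1", "P2", "P2", "P3", "P3"]})
plate_holdout = toy.loc[toy["Plate"] == "P3"]
remaining = toy.drop(plate_holdout.index)
well_holdout = remaining.sample(n=2, random_state=0)
training = remaining.drop(well_holdout.index)

assert_disjoint(
    plate=plate_holdout.index, well=well_holdout.index, train=training.index
)
print("all splits are disjoint")
```

This relies on the dataframe index being unique, which holds for a frame freshly loaded with `pd.read_csv`.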


### Plate Holdout

Plates are randomly selected by their Plate ID, and all wells from the selected plates are saved as our `plate_holdout` data.


In [9]:
# plate
seed = 0
n_plates = 10

# setting random seed globally
np.random.seed(seed)

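# selecting plates randomly from a list
# NOTE: np.random.choice samples with replacement by default, so fewer than
# n_plates unique plates can end up selected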
selected_plates = (
    np.random.choice(fs_profile_df["Plate"].unique().tolist(), (n_plates, 1))
    .flatten()
    .tolist()
)
plate_holdout_df = fs_profile_df.loc[fs_profile_df["Plate"].isin(selected_plates)]

# take the indices of the held out data frame and use it to drop those samples from
# the main dataset. And then check if those indices are dropped
plate_idx_to_drop = plate_holdout_df.index.tolist()
fs_profile_df = fs_profile_df.drop(plate_idx_to_drop)
assert not fs_profile_df.index.isin(
    plate_idx_to_drop
).any(), "index to be dropped found in the main dataframe"

# saving the holdout data
plate_holdout_df.to_csv(
    data_split_dir / "plate_holdout.csv.gz", index=False, compression="gzip"
)

# display
print("plate holdout shape:", plate_holdout_df.shape)
plate_holdout_df.head()

plate holdout shape: (1949, 254)


Unnamed: 0,injury_code,injury_type,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,...,Nuclei_RadialDistribution_RadialCV_DNA_4of4,Cytoplasm_AreaShape_Extent,Cytoplasm_AreaShape_Zernike_9_5,Nuclei_RadialDistribution_RadialCV_DNA_1of4,Cytoplasm_RadialDistribution_RadialCV_ER_1of4,Cells_RadialDistribution_FracAtD_DNA_3of4,Nuclei_Intensity_MeanIntensityEdge_RNA,Nuclei_RadialDistribution_RadialCV_Mito_4of4,Cells_RadialDistribution_RadialCV_AGP_3of4,Nuclei_Intensity_IntegratedIntensityEdge_ER
1044,0,Control,BR00110368,B2,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.091288,-0.009617,-0.026725,-0.049098,-0.020261,0.041325,0.067239,0.02298,0.105339,-0.012277
1045,0,Control,BR00110368,B3,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.01985,-0.063211,0.009658,-0.068078,-0.020326,0.014427,0.240782,-0.005732,0.018887,0.029887
1046,0,Control,BR00110368,B4,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.003181,-0.057414,0.005344,-0.005335,-0.01887,-0.000847,0.136477,-0.023482,-0.047374,-0.037377
1047,0,Control,BR00110368,B5,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.031318,-0.06566,0.002449,-0.00896,-0.065875,-0.031349,-0.03677,-0.013437,-0.104239,-0.066449
1048,0,Control,BR00110368,B6,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.038726,-0.017859,-0.015763,-0.063977,-0.038957,0.014147,0.028667,-0.06563,-0.044535,0.03389
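`np.random.choice` draws with replacement unless told otherwise, so the same plate can be picked more than once. If exactly `n_plates` distinct plates are wanted, one option is NumPy's `Generator.choice` with `replace=False`. The sketch below is an alternative, not the method used to produce the results in this notebook; the plate IDs are a few that appear in this notebook's outputs and `n_plates = 3` is chosen only for the toy list:

```python
import numpy as np

seed = 0
n_plates = 3
plate_ids = ["BR00110363", "BR00110368", "BR00109990", "BR00114093"]

# a seeded generator keeps the draw reproducible without touching global state
rng = np.random.default_rng(seed)

# replace=False guarantees that every selected plate is distinct
selected = rng.choice(plate_ids, size=n_plates, replace=False).tolist()
print(selected)
```

Switching to this approach would change which plates are held out, so every downstream count in the notebook would need to be regenerated.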


### Treatment holdout

To establish our treatment holdout, we first need to find the number of treatments and wells associated with a specific cell injury, considering the removal of randomly selected plates from the previous step.

To determine which cell injuries should be considered for a single treatment holdout, we establish a threshold of 10 unique compounds. This means that a cell injury type must have at least 10 unique compounds to qualify for selection in the treatment holdout. Any cell injury types failing to meet this criterion will be disregarded.

Once the cell injuries are identified for treatment holdout, we select our holdout treatment by grouping each injury type and choosing the treatment with the fewest wells. This becomes our treatment holdout dataset.


In [10]:
injury_treatment_metadata = (
    fs_profile_df.groupby(["injury_type", "Compound Name"])
    .size()
    .reset_index(name="n_wells")
)
injury_treatment_metadata

| | injury_type | Compound Name | n_wells |
| --- | --- | --- | --- |
| 0 | Control | DMSO | 8783 |
| 1 | Cytoskeletal | ARQ 621 | 12 |
| 2 | Cytoskeletal | Citreoviridin | 18 |
| 3 | Cytoskeletal | Citrinin | 18 |
| 4 | Cytoskeletal | Colchicine | 457 |
| ... | ... | ... | ... |
| 139 | Tannin | Corilagin | 18 |
| 140 | Tannin | Gallotannin | 24 |
| 141 | Tannin | Punicalagin | 18 |
| 142 | mTOR | Rapamycin | 42 |


In [11]:
# minimum number of unique treatments an injury type needs to qualify
min_treatments_per_injury = 10

# Filter out the injury types for which we can select a complete treatment.
# We are using a threshold of 10. If an injury type is associated with fewer than 10 compounds,
# we do not conduct treatment holdout on those injury types.
accepted_injuries = []
for injury_type, df in injury_treatment_metadata.groupby("injury_type"):
    n_treatments = df.shape[0]
    if n_treatments >= min_treatments_per_injury:
        accepted_injuries.append(df)

accepted_injuries = pd.concat(accepted_injuries)

# Next, we select the treatment that will be held out within each injury type.
# The stated intent is to hold out the treatment with the fewest wells.
# NOTE: `df.min()` takes a column-wise minimum, so `.iloc[1]` returns the
# alphabetically first compound name in each group, which is not necessarily
# the treatment with the fewest wells.
selected_treatments_to_holdout = []
for injury_type, df in accepted_injuries.groupby("injury_type"):
    held_treatment = df.min().iloc[1]
    selected_treatments_to_holdout.append([injury_type, held_treatment])

# convert to dataframe
selected_treatments_to_holdout = pd.DataFrame(
    selected_treatments_to_holdout, columns="injury_type held_treatment".split()
)

print("Below are the accepted cell injuries and treatments to be held out")
selected_treatments_to_holdout

Below are the accepted cell injuries and treatments to be held out


| | injury_type | held_treatment |
| --- | --- | --- |
| 0 | Cytoskeletal | ARQ 621 |
| 1 | Genotoxin | Aphidicolin |
| 2 | Kinase | AZD 1152-HQPA |
| 3 | Miscellaneous | Aloisine RP106 |
| 4 | Redox | 4-Amino-1-naphthol (HCl) |
| 5 | Saponin | Bacopasaponin C |
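The rule described in the prose, holding out the treatment with the fewest wells per injury type, can be expressed with `idxmin`. The sketch below uses hypothetical injury names, compound names, and counts chosen so that the two rules disagree. It also shows that a minimum taken over compound names returns the alphabetically first name, which can be a different treatment:

```python
import pandas as pd

# hypothetical well counts, chosen so the two selection rules disagree
toy_counts = pd.DataFrame(
    {
        "injury_type": ["InjuryA", "InjuryA", "InjuryB", "InjuryB"],
        "Compound Name": ["Alpha", "Beta", "Gamma", "Delta"],
        "n_wells": [30, 12, 18, 24],
    }
)

# row label of the smallest n_wells within each injury type
fewest_idx = toy_counts.groupby("injury_type")["n_wells"].idxmin()
fewest_wells = toy_counts.loc[fewest_idx].reset_index(drop=True)

# for comparison: a minimum over names picks the alphabetically first compound
alphabetical_first = toy_counts.groupby("injury_type")["Compound Name"].min()

print(fewest_wells)
print(alphabetical_first)
```

When several treatments tie for the fewest wells, `idxmin` keeps the first one it encounters, so the row order of the input decides the tie.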


In [12]:
# select all wells that received one of the treatments to be held out
treatment_holdout_df = fs_profile_df.loc[
    fs_profile_df["Compound Name"].isin(
        selected_treatments_to_holdout["held_treatment"]
    )
]

# take the indices of the held out data frame and use it to drop those samples from
# the main dataset. And then check if those indices are dropped
treatment_idx_to_drop = treatment_holdout_df.index.tolist()
fs_profile_df = fs_profile_df.drop(treatment_idx_to_drop)
assert not fs_profile_df.index.isin(
    treatment_idx_to_drop
).any(), "index to be dropped found in the main dataframe"
# saving the holdout data
treatment_holdout_df.to_csv(
    data_split_dir / "treatment_holdout.csv.gz", index=False, compression="gzip"
)

# display
print("Treatment holdout shape:", treatment_holdout_df.shape)
treatment_holdout_df.head()

Treatment holdout shape: (126, 254)


Unnamed: 0,injury_code,injury_type,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,...,Nuclei_RadialDistribution_RadialCV_DNA_4of4,Cytoplasm_AreaShape_Extent,Cytoplasm_AreaShape_Zernike_9_5,Nuclei_RadialDistribution_RadialCV_DNA_1of4,Cytoplasm_RadialDistribution_RadialCV_ER_1of4,Cells_RadialDistribution_FracAtD_DNA_3of4,Nuclei_Intensity_MeanIntensityEdge_RNA,Nuclei_RadialDistribution_RadialCV_Mito_4of4,Cells_RadialDistribution_RadialCV_AGP_3of4,Nuclei_Intensity_IntegratedIntensityEdge_ER
10865,1,Cytoskeletal,BR00114093,G17,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,2.238598,0.599463,-0.241712,0.742363,0.583174,-1.48074,-1.266283,0.704287,1.002289,-1.294129
10866,1,Cytoskeletal,BR00114093,G18,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.585216,0.077836,0.006345,0.209817,0.307122,-0.01229,0.079251,0.512768,0.480748,0.050227
10867,1,Cytoskeletal,BR00114093,G19,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.327163,-0.02969,-0.002012,0.195018,0.039966,-0.128593,-0.176903,0.204817,0.299985,-0.289519
10868,1,Cytoskeletal,BR00114093,G20,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.09157,0.031678,-0.001605,0.004097,0.046803,0.047007,-0.175203,0.183495,0.145047,-0.079209
10869,1,Cytoskeletal,BR00114093,G21,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.084292,0.024021,0.001634,0.068272,0.059479,0.02948,-0.20791,0.177868,0.058527,-0.020954


### Well holdout

To generate the well holdout data, we iterate over each plate and select wells at random. Because the controls create a large label imbalance, control wells and treated wells are separated first and sampled independently: 5 control wells and 10 treated wells are randomly selected from each plate.


In [13]:
# parameters
seed = 0
n_controls = 5
n_samples = 10

# setting random seed globally
np.random.seed(seed)

# collecting randomly selected wells from each plate
wells_heldout_df = []
for plate_id, df in fs_profile_df.groupby("Plate", as_index=False):
    # separate control wells from the rest of the wells since there is a huge label imbalance
    # select 5 control wells and 10 treated wells from the plate
    df_control = df.loc[df["Compound Name"] == "DMSO"].sample(
        n=n_controls, random_state=seed
    )
    df_treated = df.loc[df["Compound Name"] != "DMSO"].sample(
        n=n_samples, random_state=seed
    )

    # concatenate those together
    well_heldout = pd.concat([df_control, df_treated])

    wells_heldout_df.append(well_heldout)

# generate well holdout dataframe
wells_heldout_df = pd.concat(wells_heldout_df)

# take the indices of the held out data frame and use it to drop those samples from
# the main dataset. And then check if those indices are dropped
wells_idx_to_drop = wells_heldout_df.index.tolist()
fs_profile_df = fs_profile_df.drop(wells_idx_to_drop)
assert not fs_profile_df.index.isin(
    wells_idx_to_drop
).any(), "index to be dropped found in the main dataframe"

# saving the holdout data
wells_heldout_df.to_csv(
    data_split_dir / "wells_holdout.csv.gz", index=False, compression="gzip"
)

# display
print("Wells holdout shape:", wells_heldout_df.shape)
wells_heldout_df.head()

Wells holdout shape: (1125, 254)


Unnamed: 0,injury_code,injury_type,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,...,Nuclei_RadialDistribution_RadialCV_DNA_4of4,Cytoplasm_AreaShape_Extent,Cytoplasm_AreaShape_Zernike_9_5,Nuclei_RadialDistribution_RadialCV_DNA_1of4,Cytoplasm_RadialDistribution_RadialCV_ER_1of4,Cells_RadialDistribution_FracAtD_DNA_3of4,Nuclei_Intensity_MeanIntensityEdge_RNA,Nuclei_RadialDistribution_RadialCV_Mito_4of4,Cells_RadialDistribution_RadialCV_AGP_3of4,Nuclei_Intensity_IntegratedIntensityEdge_ER
4994,0,Control,BR00109990,B12,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.03499,0.029559,-0.008153,-0.038878,-0.063308,-0.029295,-0.096242,0.064756,-0.014525,-0.127187
5058,0,Control,BR00109990,K10,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.030488,0.091028,-0.035511,0.057293,0.094118,0.035673,0.237579,0.078312,0.082314,-0.124886
5050,0,Control,BR00109990,J18,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.007634,0.05287,-0.024943,0.0461,0.095658,0.009731,0.065111,-0.052656,0.052096,-0.00145
5035,0,Control,BR00109990,G9,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.044317,0.049637,0.032955,0.069834,-0.004913,0.010322,0.380662,0.00824,-0.044801,0.055468
4991,0,Control,BR00109990,B9,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.039394,0.022917,0.006583,-0.043584,-0.024157,-0.010921,-0.032003,-0.061641,-0.081841,0.011113
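With the default `replace=False`, `DataFrame.sample` raises a `ValueError` when `n` is larger than the number of available rows, for example on a plate with fewer than 10 treated wells. One way to guard against that is to cap `n` at the group size. The helper name `sample_up_to` and the toy plate below are hypothetical:

```python
import pandas as pd


def sample_up_to(df: pd.DataFrame, n: int, seed: int) -> pd.DataFrame:
    """Sample n rows, or every row when the frame has fewer than n."""
    return df.sample(n=min(n, len(df)), random_state=seed)


# toy plate with three control wells and only two treated wells
toy_plate = pd.DataFrame(
    {
        "Well": ["B2", "B3", "B4", "C2", "C3"],
        "Compound Name": ["DMSO", "DMSO", "DMSO", "Wortmannin", "Colchicine"],
    }
)
is_control = toy_plate["Compound Name"] == "DMSO"

controls = sample_up_to(toy_plate.loc[is_control], n=2, seed=0)
treated = sample_up_to(toy_plate.loc[~is_control], n=10, seed=0)  # capped at 2 rows

held_out = pd.concat([controls, treated])
print(held_out.shape)
```

Capping silently returns fewer wells than requested, so a stricter pipeline might prefer to keep the error and handle small plates explicitly.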


## Saving training dataset


Once the holdout datasets have been generated, the next step is to save the training dataset that will serve as the basis for training the multi-class logistic regression model.


In [14]:
# get summary cell injury dataset treatment and well info after holdouts
injury_after_holdout_info_df = get_injury_treatment_info(
    profile=fs_profile_df, groupby_key="injury_type"
)

# display
print("shape:", injury_after_holdout_info_df.shape)
injury_after_holdout_info_df

shape: (15, 5)


| | injury_type | injury_code | n_wells | n_compounds | compound_list |
| --- | --- | --- | --- | --- | --- |
| 0 | Control | 0 | 8408 | 1 | [DMSO] |
| 1 | Cytoskeletal | 1 | 1102 | 14 | [Nocodazole, Colchicine, Paclitaxel, Vinblasti... |
| 7 | Miscellaneous | 5 | 1007 | 38 | [L-Buthionine-(S,R)-sulfoximine, CDDO Im, Cino... |
| 6 | Kinase | 3 | 750 | 12 | [Wortmannin, Staurosporine, PI-103, BEZ-235, S... |
| 3 | Genotoxin | 4 | 737 | 21 | [Camptothecin, CX-5461, Doxorubicin, Cladribin... |
| 5 | Hsp90 | 2 | 418 | 3 | [Radicicol, Geldanamycin, 17-AAG] |
| 11 | Redox | 6 | 215 | 11 | [Menadione, PKF118-310, Dunnione, MGR2, SIN-1 ... |
| 12 | Saponin | 10 | 164 | 10 | [Digitonin, Saikosaponin A, Polygalasaponin F,... |
| 4 | HDAC | 7 | 138 | 5 | [AR-42, SAHA, ITF 2357, Panobinostat, Apicidin] |
| 10 | Proteasome | 9 | 117 | 4 | [Carfilzomib, Bortezomib, (S)-MG132, (R)-MG132] |


In [15]:
# shape of the updated training and testing dataset after removing holdouts
print("training shape after removing holdouts", fs_profile_df.shape)
fs_profile_df.head()

training shape after removing holdouts (13503, 254)


Unnamed: 0,injury_code,injury_type,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,...,Nuclei_RadialDistribution_RadialCV_DNA_4of4,Cytoplasm_AreaShape_Extent,Cytoplasm_AreaShape_Zernike_9_5,Nuclei_RadialDistribution_RadialCV_DNA_1of4,Cytoplasm_RadialDistribution_RadialCV_ER_1of4,Cells_RadialDistribution_FracAtD_DNA_3of4,Nuclei_Intensity_MeanIntensityEdge_RNA,Nuclei_RadialDistribution_RadialCV_Mito_4of4,Cells_RadialDistribution_RadialCV_AGP_3of4,Nuclei_Intensity_IntegratedIntensityEdge_ER
0,0,Control,BR00110363,B2,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.08926,-0.082892,-0.012755,0.00499,-0.039931,0.002696,0.108018,0.041612,0.102915,-0.048802
1,0,Control,BR00110363,B3,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.029568,-0.021502,0.027104,0.016882,-0.051326,-0.029957,0.034152,0.002146,-0.009631,0.020328
2,0,Control,BR00110363,B4,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.034188,0.014503,0.000802,-0.012599,-0.069699,-0.009395,0.111277,-0.155381,0.028251,-0.044589
3,0,Control,BR00110363,B5,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.011993,-0.032153,0.000531,0.009794,-0.020744,-0.013818,0.052255,-0.018047,0.012353,0.051871
4,0,Control,BR00110363,B6,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.014885,0.06177,0.016779,-0.029602,-0.024857,0.004196,-0.038872,-0.071742,0.092877,-0.001608
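The shapes printed throughout this notebook can be cross-checked with simple arithmetic: the full profile minus the three holdouts should equal the number of rows left for training and testing. The numbers below are copied from the printed shapes above:

```python
# row counts taken from the shapes printed in this notebook
total_wells = 16703
holdout_wells = {"plate": 1949, "treatment": 126, "well": 1125}
reported_remaining = 13503

# wells left after removing all three holdouts
remaining = total_wells - sum(holdout_wells.values())
print(remaining, remaining == reported_remaining)
```

A mismatch here would suggest that some rows were dropped twice or not at all.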


In [16]:
# split the data into training and testing sets
meta_cols, feat_cols = split_meta_and_features(fs_profile_df)
X = fs_profile_df[feat_cols]
y = fs_profile_df["injury_code"]

# splitting dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.80, random_state=seed, stratify=y
)

# saving training and testing datasets as csv files
X_train.to_csv(data_split_dir / "X_train.csv.gz", compression="gzip", index=False)
X_test.to_csv(data_split_dir / "X_test.csv.gz", compression="gzip", index=False)
y_train.to_csv(data_split_dir / "y_train.csv.gz", compression="gzip", index=False)
y_test.to_csv(data_split_dir / "y_test.csv.gz", compression="gzip", index=False)

# display data split sizes
print("X training size", X_train.shape)
print("X testing size", X_test.shape)
print("y training size", y_train.shape)
print("y testing size", y_test.shape)

X training size (10802, 221)
X testing size (2701, 221)
y training size (10802,)
y testing size (2701,)
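Passing `stratify=y` asks `train_test_split` to keep the class proportions of `y` as close as possible in both splits, which matters here because control wells dominate the data. The sketch below checks this on a small synthetic label vector; the class sizes (80/15/5) and feature names are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# synthetic, imbalanced labels and random features (illustrative only)
rng = np.random.default_rng(0)
y_toy = pd.Series([0] * 80 + [1] * 15 + [2] * 5, name="injury_code")
X_toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.80, random_state=0, stratify=y_toy
)

# class proportions in the full data and in each split
proportions = pd.DataFrame(
    {
        "all": y_toy.value_counts(normalize=True),
        "train": y_tr.value_counts(normalize=True),
        "test": y_te.value_counts(normalize=True),
    }
).sort_index()
print(proportions)
```

In this toy case the class counts divide evenly, so the three columns match exactly. With real data the proportions are only approximately equal, because class counts rarely split into exact 80/20 parts.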


In [17]:
# save metadata after holdout
cell_injury_metadata = fs_profile_df[fs_meta]
cell_injury_metadata.to_csv(
    data_split_dir / "cell_injury_metadata_after_holdout.csv.gz",
    compression="gzip",
    index=False,
)
# display
print("Metadata shape", cell_injury_metadata.shape)
cell_injury_metadata.head()

Metadata shape (13503, 33)


Unnamed: 0,injury_code,injury_type,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,...,Compound PubChem CID,Compound PubChem URL,Control Type,Channels,Comment [Image File Path],Comment [Image Prefix],Mahalanobis distance,Mahalanobis distance significant,Relative well cellcount,Relative well cellcount significant
0,0,Control,BR00110363,B2,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,679.0,https://pubchem.ncbi.nlm.nih.gov/compound/679,Negative,"Ch1 (blue): Nuclei, Ch2 (green): ER, Ch3 (yell...",/incoming/BR00110363/,r02c02,7.51,No,1.02,No
1,0,Control,BR00110363,B3,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,679.0,https://pubchem.ncbi.nlm.nih.gov/compound/679,Negative,"Ch1 (blue): Nuclei, Ch2 (green): ER, Ch3 (yell...",/incoming/BR00110363/,r02c03,6.21,No,1.11,No
2,0,Control,BR00110363,B4,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,679.0,https://pubchem.ncbi.nlm.nih.gov/compound/679,Negative,"Ch1 (blue): Nuclei, Ch2 (green): ER, Ch3 (yell...",/incoming/BR00110363/,r02c04,10.94,No,1.02,No
3,0,Control,BR00110363,B5,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,679.0,https://pubchem.ncbi.nlm.nih.gov/compound/679,Negative,"Ch1 (blue): Nuclei, Ch2 (green): ER, Ch3 (yell...",/incoming/BR00110363/,r02c05,7.59,No,1.06,No
4,0,Control,BR00110363,B6,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,679.0,https://pubchem.ncbi.nlm.nih.gov/compound/679,Negative,"Ch1 (blue): Nuclei, Ch2 (green): ER, Ch3 (yell...",/incoming/BR00110363/,r02c06,5.28,No,1.0,No


## Generating data split summary file 

In [18]:
def get_and_rename_injury_info(
    profile: pd.DataFrame, groupby_key: str, column_name: str
) -> pd.DataFrame:
    """Gets injury treatment information and renames the specified column.

    Parameters
    ----------
    profile : DataFrame
        The profile DataFrame containing data to be processed.
    groupby_key : str
        The key to group by in the injury treatment information.
    column_name : str
        The new name for the 'n_wells' column.

    Returns
    -------
    DataFrame
        A DataFrame with the injury treatment information and the 'n_wells' column renamed.
    """
    return get_injury_treatment_info(profile=profile, groupby_key=groupby_key).rename(
        columns={"n_wells": column_name}
    )


# name of the columns
data_col_name = [
    "Number of Wells (Total Data)",
    "Number of Wells (Train Split)",
    "Number of Wells (Test Split)",
    "Number of Wells (Plate Holdout)",
    "Number of Wells (Treatment Holdout)",
    "Number of Wells (Well Holdout)",
]


# Total amount summary
injury_before_holdout_info_df = injury_before_holdout_info_df.rename(
    columns={"n_wells": data_col_name[0]}
)

# Data splits train test summary
injury_train_info_df = get_and_rename_injury_info(
    profile=X_train.merge(
        fs_profile_df[meta_cols], how="left", right_index=True, left_index=True
    )[meta_cols + feat_cols],
    groupby_key="injury_type",
    column_name=data_col_name[1],
)

injury_test_info_df = get_and_rename_injury_info(
    profile=X_test.merge(
        fs_profile_df[meta_cols], how="left", right_index=True, left_index=True
    )[meta_cols + feat_cols],
    groupby_key="injury_type",
    column_name=data_col_name[2],
)

# Holdouts summary
injury_plate_holdout_info_df = get_and_rename_injury_info(
    profile=plate_holdout_df, groupby_key="injury_type", column_name=data_col_name[3]
)

injury_treatment_holdout_info_df = get_and_rename_injury_info(
    profile=treatment_holdout_df,
    groupby_key="injury_type",
    column_name=data_col_name[4],
)

injury_well_holdout_info_df = get_and_rename_injury_info(
    profile=wells_heldout_df, groupby_key="injury_type", column_name=data_col_name[5]
)

# Select columns of interest
total_data_summary = injury_before_holdout_info_df[["injury_type", data_col_name[0]]]
train_split_summary = injury_train_info_df[["injury_type", data_col_name[1]]]
test_split_summary = injury_test_info_df[["injury_type", data_col_name[2]]]
plate_holdout_info_df = injury_plate_holdout_info_df[["injury_type", data_col_name[3]]]
treatment_holdout_summary = injury_treatment_holdout_info_df[
    ["injury_type", data_col_name[4]]
]
well_holdout_summary = injury_well_holdout_info_df[["injury_type", data_col_name[5]]]

In [19]:
# merge the summary data splits into one, update data type to integers
merged_summary_df = (
    total_data_summary.merge(train_split_summary, on="injury_type", how="outer")
    .merge(test_split_summary, on="injury_type", how="outer")
    .merge(plate_holdout_info_df, on="injury_type", how="outer")
    .merge(treatment_holdout_summary, on="injury_type", how="outer")
    .merge(well_holdout_summary, on="injury_type", how="outer")
    .fillna(0)
    .set_index("injury_type")
)[data_col_name].astype(int)

# reset the index and rename 'injury_type' to "Cellular Injury"
merged_summary_df = merged_summary_df.reset_index().rename(
    columns={"injury_type": "Cellular Injury"}
)

# save as csv file
merged_summary_df.to_csv(data_split_dir / "summary_data_split.csv", index=False)

# display
merged_summary_df

| | Cellular Injury | Number of Wells (Total Data) | Number of Wells (Train Split) | Number of Wells (Test Split) | Number of Wells (Plate Holdout) | Number of Wells (Treatment Holdout) | Number of Wells (Well Holdout) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Control | 9855 | 6726 | 1682 | 1072 | 0 | 375 |
| 1 | Cytoskeletal | 1472 | 881 | 221 | 181 | 12 | 177 |
| 2 | Ferroptosis | 96 | 66 | 16 | 6 | 0 | 8 |
| 3 | Genotoxin | 944 | 590 | 147 | 73 | 48 | 86 |
| 4 | HDAC | 168 | 110 | 28 | 30 | 0 | 0 |
| 5 | Hsp90 | 552 | 334 | 84 | 54 | 0 | 80 |
| 6 | Kinase | 1104 | 600 | 150 | 120 | 12 | 222 |
| 7 | Miscellaneous | 1304 | 806 | 201 | 172 | 18 | 107 |
| 8 | Mitochondria | 144 | 92 | 23 | 12 | 0 | 17 |
| 9 | Nonspecific reactive | 128 | 84 | 21 | 19 | 0 | 4 |
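Since every well ends up in exactly one split, each row of the summary should satisfy: total = train + test + plate holdout + treatment holdout + well holdout. The sketch below checks this for three rows copied from the table above, using shortened column names for readability:

```python
import pandas as pd

# three rows copied from the summary table (column names shortened here)
summary = pd.DataFrame(
    {
        "Cellular Injury": ["Control", "Cytoskeletal", "Genotoxin"],
        "Total": [9855, 1472, 944],
        "Train": [6726, 881, 590],
        "Test": [1682, 221, 147],
        "Plate Holdout": [1072, 181, 73],
        "Treatment Holdout": [0, 12, 48],
        "Well Holdout": [375, 177, 86],
    }
)

split_cols = ["Train", "Test", "Plate Holdout", "Treatment Holdout", "Well Holdout"]
summary["Accounted"] = summary[split_cols].sum(axis=1)

# injury types whose split counts do not add up to the total
mismatched = summary.loc[
    summary["Accounted"] != summary["Total"], "Cellular Injury"
].tolist()
print(mismatched)
```

An empty list means every well in these rows is accounted for by exactly one split.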
