# Splitting Data
Here, we utilize the feature-selected profiles generated in the preceding module notebook [here](../0.feature_selection/0.feature_selection.ipynb), focusing on dividing the data into training, testing, and holdout sets for machine learning training.

In [1]:
import sys
import json
import pathlib
import warnings

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

sys.path.append("../../")  # noqa
from src.utils import split_meta_and_features  # noqa

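# ignoring warnings
# (warnings.catch_warnings only takes effect inside a `with` block,
# so filterwarnings is used to suppress warnings for the whole notebook)
warnings.filterwarnings("ignore")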



## Parameters

Below are the parameters used throughout this notebook.

In [2]:
# setting seed constants
seed = 0
np.random.seed(seed)

# directory to get all the inputs for this notebook
data_dir = pathlib.Path("../../data").resolve(strict=True)
results_dir = pathlib.Path("../../results").resolve(strict=True)
fs_dir = (results_dir / "0.feature_selection").resolve(strict=True)

# directory to store all the output of this notebook
data_split_dir = (results_dir / "1.data_splits").resolve()
data_split_dir.mkdir(exist_ok=True)

In [3]:
# data paths
fs_profile_path = (fs_dir / "cell_injury_profile_fs.csv.gz").resolve(strict=True)

# load data
fs_profile_df = pd.read_csv(fs_profile_path)

# splitting meta and feature column names
fs_meta, fs_feats = split_meta_and_features(fs_profile_df)

# display
print("fs profile with control: ", fs_profile_df.shape)
fs_profile_df.head()

fs profile with control:  (16703, 254)


Unnamed: 0,injury_code,injury_type,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,...,Cells_AreaShape_Zernike_8_2,Cytoplasm_AreaShape_FormFactor,Cells_AreaShape_Zernike_8_4,Cells_RadialDistribution_RadialCV_RNA_2of4,Cells_AreaShape_Solidity,Cytoplasm_RadialDistribution_MeanFrac_ER_2of4,Cytoplasm_AreaShape_Zernike_9_5,Cells_RadialDistribution_MeanFrac_DNA_2of4,Cells_RadialDistribution_FracAtD_RNA_3of4,Cells_AreaShape_Zernike_3_1
0,0,Control,BR00110363,B2,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.016041,-0.041107,-0.000347,0.103464,-0.096563,-0.015029,-0.012755,0.033795,-0.026427,0.051454
1,0,Control,BR00110363,B3,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.016385,-0.030383,0.028278,-0.005469,-0.004532,0.009526,0.027104,-0.004394,-0.070415,-0.020646
2,0,Control,BR00110363,B4,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.004361,0.072861,-0.002493,0.054249,0.046936,0.009864,0.000802,0.014793,-0.004474,0.01267
3,0,Control,BR00110363,B5,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.010156,-0.002715,-0.001161,-0.018706,0.01133,-0.00128,0.000531,0.018147,-0.037427,0.024888
4,0,Control,BR00110363,B6,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.022098,0.001851,0.014731,0.026536,0.026682,-0.017886,0.016779,0.069323,0.008786,0.029109


## Exploring the data set

Below is an exploration of the feature-selected dataset. The aim is to identify treatments, extract metadata, and gain an understanding of the experiment's design.

First, we count the number of wells for each treatment.

In [4]:
# displaying the number of wells per treatment
well_treatments_counts_df = (
    fs_profile_df["Compound Name"].value_counts().to_frame().reset_index()
)

well_treatments_counts_df

Unnamed: 0,Compound Name,count
0,DMSO,9855
1,Wortmannin,600
2,Colchicine,512
3,Nocodazole,504
4,Radicicol,504
...,...,...
139,Melphalan,24
140,Carmustine,24
141,Thio-TEPA,24
142,Chlorambucil,24


Below we show the number of wells associated with each cellular injury type.

In [5]:
# displaying the number of wells per cell injury type
cell_injury_well_counts = (
    fs_profile_df["injury_type"].value_counts().to_frame().reset_index()
)
cell_injury_well_counts

Unnamed: 0,injury_type,count
0,Control,9855
1,Cytoskeletal,1472
2,Miscellaneous,1304
3,Kinase,1104
4,Genotoxin,944
5,Hsp90,552
6,Redox,312
7,Saponin,288
8,HDAC,168
9,Proteasome,144


Next, we extract metadata on the number of wells and compounds associated with each injury type.

This will be saved in the `results/1.data_splits` directory.

In [6]:
meta_injury = []
for injury_type, df in fs_profile_df.groupby("injury_type"):
    # extract n_wells, n_compounds and unique compounds per injury_type
    n_wells = df.shape[0]
    unique_compounds = list(df["Compound Name"].unique())
    n_compounds = len(unique_compounds)

    # store information
    meta_injury.append([injury_type, n_wells, n_compounds, unique_compounds])

injury_meta_df = pd.DataFrame(
    meta_injury, columns=["injury_type", "n_wells", "n_compounds", "compound_list"]
).sort_values("n_wells", ascending=False)
injury_meta_df.to_csv(data_split_dir / "injury_well_counts_table.csv", index=False)

# display
print("shape:", injury_meta_df.shape)
injury_meta_df

shape: (15, 4)


Unnamed: 0,injury_type,n_wells,n_compounds,compound_list
0,Control,9855,1,[DMSO]
1,Cytoskeletal,1472,15,"[Nocodazole, Colchicine, Paclitaxel, Vinblasti..."
7,Miscellaneous,1304,39,"[L-Buthionine-(S,R)-sulfoximine, CDDO Im, Cino..."
6,Kinase,1104,13,"[Wortmannin, Staurosporine, PI-103, BEZ-235, A..."
3,Genotoxin,944,22,"[Camptothecin, CX-5461, Doxorubicin, Cladribin..."
5,Hsp90,552,3,"[Radicicol, Geldanamycin, 17-AAG]"
11,Redox,312,12,"[Menadione, PKF118-310, 4-Amino-1-naphthol (HC..."
12,Saponin,288,11,"[Digitonin, Saikosaponin A, Polygalasaponin F,..."
4,HDAC,168,5,"[AR-42, SAHA, ITF 2357, Panobinostat, Apicidin]"
8,Mitochondria,144,4,"[Antimycin A, CCCP, Rotenone, Oligomycin A]"


Next, we construct the profile metadata. This provides a structured overview of how the treatments associated with each injury were applied, detailing the plates and treatments linked to each injury type.

This will be saved in the `results/1.data_splits` directory.

In [7]:
injury_meta_dict = {}
for injury, df in fs_profile_df.groupby("injury_type"):
    # collecting treatment metadata
    plates = df["Plate"].unique().tolist()
    treatment_meta = {}
    treatment_meta["n_plates"] = len(plates)
    treatment_meta["n_wells"] = df.shape[0]
    treatment_meta["n_treatments"] = len(df["Compound Name"].unique())
    treatment_meta["associated_plates"] = plates

    # counting treatments
    treatment_counter = {}
    for treatment, df2 in df.groupby("Compound Name"):
        if pd.isna(treatment):
            continue
        n_treatments = df2.shape[0]
        treatment_counter[treatment] = n_treatments

    # storing treatment counts
    treatment_meta["treatments"] = treatment_counter
    injury_meta_dict[injury] = treatment_meta

# save dictionary into a json file
with open(data_split_dir / "injury_metadata.json", mode="w") as stream:
    json.dump(injury_meta_dict, stream)

Here we build the plate metadata, which records the treatments present on each plate and the number of wells that received each treatment.

This will be saved in the `results/1.data_splits` directory.

In [8]:
plate_meta = {}
for plate_id, df in fs_profile_df.groupby("Plate"):
    # counting the number of wells per treatment on this plate
    treatment_counter = {}
    for treatment, df2 in df.groupby("Compound Name"):
        treatment_counter[treatment] = df2.shape[0]

    plate_meta[plate_id] = treatment_counter

# save dictionary into a json file
with open(data_split_dir / "cell_injury_plate_info.json", mode="w") as stream:
    json.dump(plate_meta, stream)

## Data Splitting 
---

### Holdout Dataset

Here we collect our holdout datasets. A holdout dataset is a subset of the data that is not used during model training or tuning. Instead, it is reserved solely for evaluating the model's performance after it has been trained.

In this notebook, we create three different types of holdout datasets before proceeding with machine learning training and evaluation:
 - Plate holdout
 - Treatment holdout
 - Well holdout

Each of these holdout datasets is stored in the `results/1.data_splits` directory.
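
The three holdouts are removed from the main dataframe one after another, so every well should end up in exactly one set. The following is a minimal sketch of that sequential pattern on a toy dataframe; the plate IDs, compound names, and the variable `remaining` are illustrative assumptions and not part of this notebook's data.

```python
import pandas as pd

# toy profile table with hypothetical plates and compounds
toy_df = pd.DataFrame(
    {
        "Plate": ["P1", "P1", "P2", "P2", "P3", "P3"],
        "Compound Name": ["DMSO", "cmpd_1", "DMSO", "cmpd_2", "DMSO", "cmpd_1"],
    }
)

# 1. plate holdout: remove every well from the selected plate
plate_holdout = toy_df.loc[toy_df["Plate"] == "P3"]
remaining = toy_df.drop(plate_holdout.index)

# 2. treatment holdout: remove every well of the selected compound
treatment_holdout = remaining.loc[remaining["Compound Name"] == "cmpd_2"]
remaining = remaining.drop(treatment_holdout.index)

# 3. well holdout: remove randomly sampled wells from what is left
well_holdout = remaining.sample(n=1, random_state=0)
remaining = remaining.drop(well_holdout.index)

# every original row should appear in exactly one of the four sets
all_idx = pd.concat(
    [plate_holdout, treatment_holdout, well_holdout, remaining]
).index
print("disjoint:", all_idx.is_unique, "| complete:", len(all_idx) == len(toy_df))
```

Checking index uniqueness across the concatenated sets is a cheap way to confirm that no well leaks from a holdout set into the training data.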


### Plate Holdout

Plates are randomly selected by their Plate ID, and all wells from those plates are saved as our `plate_holdout` data.
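
The cell below calls `np.random.choice` with its default `replace=True`, so the same plate can be drawn more than once. As a hedged alternative, here is a minimal sketch of drawing plates without replacement, which yields exactly `n_plates` distinct plates. It uses NumPy's `Generator` API on a toy dataframe; the plate IDs and the names `toy_df`, `rng`, and `remaining` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# toy profile table: 6 hypothetical plates with 4 wells each
toy_df = pd.DataFrame(
    {
        "Plate": np.repeat([f"P{i}" for i in range(6)], 4),
        "Well": [f"W{j}" for j in range(4)] * 6,
    }
)

n_plates = 2
rng = np.random.default_rng(0)

# replace=False guarantees that no plate is drawn twice
selected_plates = rng.choice(
    toy_df["Plate"].unique().tolist(), size=n_plates, replace=False
).tolist()

# split rows into the held-out plates and everything else
is_heldout = toy_df["Plate"].isin(selected_plates)
plate_holdout = toy_df.loc[is_heldout]
remaining = toy_df.loc[~is_heldout]

print("held-out plates:", sorted(selected_plates))
print("holdout rows:", plate_holdout.shape[0], "| remaining rows:", remaining.shape[0])
```

Note that switching the notebook itself to sampling without replacement would change which plates are selected, so the recorded outputs would no longer match.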

In [9]:
# plate
seed = 0
n_plates = 10

# setting random seed globally
np.random.seed(seed)

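# selecting plates randomly from a list
# (np.random.choice samples with replacement by default, so fewer than
# `n_plates` unique plates may be selected)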
selected_plates = (
    np.random.choice(fs_profile_df["Plate"].unique().tolist(), (n_plates, 1))
    .flatten()
    .tolist()
)
plate_holdout_df = fs_profile_df.loc[fs_profile_df["Plate"].isin(selected_plates)]

# take the indices of the held out data frame and use it to drop those samples from
# the main dataset. And then check if those indices are dropped
plate_idx_to_drop = plate_holdout_df.index.tolist()
fs_profile_df = fs_profile_df.drop(plate_idx_to_drop)
assert not fs_profile_df.index.isin(
    plate_idx_to_drop
).any(), "index to be dropped found in the main dataframe"

# saving the holdout data
plate_holdout_df.to_csv(
    data_split_dir / "plate_holdout.csv.gz", index=False, compression="gzip"
)

# display
print("plate holdout shape:", plate_holdout_df.shape)
plate_holdout_df.head()

plate holdout shape: (1949, 254)


Unnamed: 0,injury_code,injury_type,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,...,Cells_AreaShape_Zernike_8_2,Cytoplasm_AreaShape_FormFactor,Cells_AreaShape_Zernike_8_4,Cells_RadialDistribution_RadialCV_RNA_2of4,Cells_AreaShape_Solidity,Cytoplasm_RadialDistribution_MeanFrac_ER_2of4,Cytoplasm_AreaShape_Zernike_9_5,Cells_RadialDistribution_MeanFrac_DNA_2of4,Cells_RadialDistribution_FracAtD_RNA_3of4,Cells_AreaShape_Zernike_3_1
1044,0,Control,BR00110368,B2,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.019895,-0.031427,-0.027356,0.040374,0.018828,-0.042929,-0.026725,0.056919,0.070751,0.050997
1045,0,Control,BR00110368,B3,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.045059,-0.043225,-0.002194,0.050981,0.01348,0.036595,0.009658,-0.00466,-0.03514,-0.00643
1046,0,Control,BR00110368,B4,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.01558,-0.054127,-0.01319,0.01041,-0.043693,0.018881,0.005344,0.032422,-0.042607,0.027212
1047,0,Control,BR00110368,B5,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.028573,-0.104911,-0.035428,-0.093552,-0.005486,0.025152,0.002449,0.007772,-0.092371,-0.026863
1048,0,Control,BR00110368,B6,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.05329,-0.066313,-0.014626,-0.011876,-0.016188,-0.024041,-0.015763,0.040085,-0.033968,-0.007589


### Treatment holdout

To establish our treatment holdout, we first need to find the number of treatments and wells associated with a specific cell injury, considering the removal of randomly selected plates from the previous step.

To determine which cell injuries should be considered for a single treatment holdout, we establish a threshold of 10 unique compounds. This means that a cell injury type must have at least 10 unique compounds to qualify for selection in the treatment holdout. Any cell injury types failing to meet this criterion will be disregarded.

Once the cell injuries are identified for treatment holdout, we select our holdout treatment by grouping each injury type and choosing the treatment with the fewest wells. This becomes our treatment holdout dataset.
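
The selection rule described above (keep injury types with enough unique compounds, then pick the compound with the fewest wells in each) can be sketched as follows. This is an illustrative example on toy counts, not the notebook's own cell; the injury labels, compound names, and the lowered threshold are assumptions.

```python
import pandas as pd

# toy per-treatment well counts (hypothetical injuries and compounds)
toy_counts = pd.DataFrame(
    {
        "injury_type": ["A", "A", "A", "B", "B"],
        "Compound Name": ["cmpd_1", "cmpd_2", "cmpd_3", "cmpd_4", "cmpd_5"],
        "n_wells": [30, 12, 18, 24, 40],
    }
)
min_treatments_per_injury = 3  # the notebook uses 10; lowered for the toy data

# keep only injury types with enough distinct treatments
n_treatments = toy_counts.groupby("injury_type")["Compound Name"].transform("nunique")
accepted = toy_counts.loc[n_treatments >= min_treatments_per_injury]

# within each accepted injury type, pick the row with the smallest n_wells
idx_fewest = accepted.groupby("injury_type")["n_wells"].idxmin()
held_out = accepted.loc[idx_fewest, ["injury_type", "Compound Name", "n_wells"]]

print(held_out)
```

`idxmin` returns the first row when several treatments tie for the fewest wells, so the result depends on row order in that case.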

In [10]:
injury_treatment_metadata = (
    fs_profile_df.groupby(["injury_type", "Compound Name"])
    .size()
    .reset_index(name="n_wells")
)
injury_treatment_metadata

Unnamed: 0,injury_type,Compound Name,n_wells
0,Control,DMSO,8783
1,Cytoskeletal,ARQ 621,12
2,Cytoskeletal,Citreoviridin,18
3,Cytoskeletal,Citrinin,18
4,Cytoskeletal,Colchicine,457
...,...,...,...
139,Tannin,Corilagin,18
140,Tannin,Gallotannin,24
141,Tannin,Punicalagin,18
142,mTOR,Rapamycin,42


In [11]:
# minimum number of treatments an injury type needs to qualify for treatment holdout
min_treatments_per_injury = 10

# Filter out the injury types for which we can select a complete treatment.
# We are using a threshold of 10. If an injury type is associated with fewer than 10 compounds,
# we do not conduct treatment holdout on those injury types.
accepted_injuries = []
for injury_type, df in injury_treatment_metadata.groupby("injury_type"):
    n_treatments = df.shape[0]
    if n_treatments >= min_treatments_per_injury:
        accepted_injuries.append(df)

accepted_injuries = pd.concat(accepted_injuries)

# Next, we select the treatment that will be held out within each injury type.
# NOTE: `df.min()` takes the column-wise minimum, so `.iloc[1]` returns the
# alphabetically first "Compound Name" in each injury type, which is not
# necessarily the treatment with the fewest wells. Selecting by well count would
# instead use `df.loc[df["n_wells"].idxmin(), "Compound Name"]`.
selected_treatments_to_holdout = []
for injury_type, df in accepted_injuries.groupby("injury_type"):
    held_treatment = df.min().iloc[1]
    selected_treatments_to_holdout.append([injury_type, held_treatment])

# convert to dataframe
selected_treatments_to_holdout = pd.DataFrame(
    selected_treatments_to_holdout, columns="injury_type held_treatment".split()
)

print("Below are the accepted cell injuries and treatments to be held out")
selected_treatments_to_holdout

Below are the accepted cell injuries and treatments to be held out


Unnamed: 0,injury_type,held_treatment
0,Cytoskeletal,ARQ 621
1,Genotoxin,Aphidicolin
2,Kinase,AZD 1152-HQPA
3,Miscellaneous,Aloisine RP106
4,Redox,4-Amino-1-naphthol (HCl)
5,Saponin,Bacopasaponin C


In [12]:
# select all wells that have the treatments to be held out
treatment_holdout_df = fs_profile_df.loc[
    fs_profile_df["Compound Name"].isin(
        selected_treatments_to_holdout["held_treatment"]
    )
]

# take the indices of the held out data frame and use it to drop those samples from
# the main dataset. And then check if those indices are dropped
treatment_idx_to_drop = treatment_holdout_df.index.tolist()
fs_profile_df = fs_profile_df.drop(treatment_idx_to_drop)
assert not fs_profile_df.index.isin(
    treatment_idx_to_drop
).any(), "index to be dropped found in the main dataframe"
# saving the holdout data
treatment_holdout_df.to_csv(
    data_split_dir / "treatment_holdout.csv.gz", index=False, compression="gzip"
)

# display
print("Treatment holdout shape:", treatment_holdout_df.shape)
treatment_holdout_df.head()

Treatment holdout shape: (126, 254)


Unnamed: 0,injury_code,injury_type,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,...,Cells_AreaShape_Zernike_8_2,Cytoplasm_AreaShape_FormFactor,Cells_AreaShape_Zernike_8_4,Cells_RadialDistribution_RadialCV_RNA_2of4,Cells_AreaShape_Solidity,Cytoplasm_RadialDistribution_MeanFrac_ER_2of4,Cytoplasm_AreaShape_Zernike_9_5,Cells_RadialDistribution_MeanFrac_DNA_2of4,Cells_RadialDistribution_FracAtD_RNA_3of4,Cells_AreaShape_Zernike_3_1
10865,1,Cytoskeletal,BR00114093,G17,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.069103,-1.401268,-0.161672,1.533565,0.021436,0.303084,-0.241712,0.49353,-1.259023,-0.356498
10866,1,Cytoskeletal,BR00114093,G18,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.072028,0.028541,0.057101,0.299031,0.241427,-0.109267,0.006345,-0.00186,0.368694,-0.085104
10867,1,Cytoskeletal,BR00114093,G19,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.046375,-0.094459,0.043656,0.198019,0.087601,-0.19929,-0.002012,-0.008229,0.252236,-0.031922
10868,1,Cytoskeletal,BR00114093,G20,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.003548,0.103991,-0.022332,0.046964,0.018746,-0.068135,-0.001605,-0.023014,0.080041,0.014965
10869,1,Cytoskeletal,BR00114093,G21,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.02357,0.034305,-0.046098,0.014641,-0.010664,-0.005398,0.001634,0.01464,0.039462,-0.017043


### Well holdout 

To generate the well holdout data, we iterate over each plate and randomly select wells. Because control wells vastly outnumber treated wells, control and treated wells are sampled separately: 5 control (DMSO) wells and 10 treated wells are randomly selected from each plate.
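
`DataFrame.sample` raises a `ValueError` when asked for more rows than are available (with the default `replace=False`), which could happen on a plate with few treated wells. Below is a minimal sketch of the per-plate sampling with a guard for that case; the toy plates, the smaller sample sizes, and the `min(...)` cap are assumptions added for illustration and are not part of the notebook's cell.

```python
import pandas as pd

# toy data: two hypothetical plates; P2 has only one treated well
toy_df = pd.DataFrame(
    {
        "Plate": ["P1"] * 6 + ["P2"] * 4,
        "Compound Name": ["DMSO"] * 3 + ["cmpd_1"] * 3 + ["DMSO"] * 3 + ["cmpd_1"],
    }
)
seed, n_controls, n_samples = 0, 2, 2

held = []
for plate_id, plate_df in toy_df.groupby("Plate"):
    # separate control wells from treated wells
    is_control = plate_df["Compound Name"] == "DMSO"
    controls, treated = plate_df.loc[is_control], plate_df.loc[~is_control]

    # cap the request at the number of available wells so that
    # DataFrame.sample does not raise on small plates
    held.append(controls.sample(n=min(n_controls, len(controls)), random_state=seed))
    held.append(treated.sample(n=min(n_samples, len(treated)), random_state=seed))

wells_holdout = pd.concat(held)
remaining = toy_df.drop(wells_holdout.index)

print(wells_holdout.groupby("Plate").size())
```

With the cap in place, a small plate contributes as many wells as it has instead of stopping the loop with an error.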

In [13]:
# parameters
seed = 0
n_controls = 5
n_samples = 10

# setting random seed globally
np.random.seed(seed)

# collecting randomly selected wells from each plate
wells_heldout_df = []
for plate_id, df in fs_profile_df.groupby("Plate", as_index=False):
    # separate control wells and rest of all wells since there is a huge label imbalance
    # selected 5 control wells and 10 random wells from the plate
    df_control = df.loc[df["Compound Name"] == "DMSO"].sample(
        n=n_controls, random_state=seed
    )
    df_treated = df.loc[df["Compound Name"] != "DMSO"].sample(
        n=n_samples, random_state=seed
    )

    # concatenate those together
    well_heldout = pd.concat([df_control, df_treated])

    wells_heldout_df.append(well_heldout)

# generate well holdout dataframe
wells_heldout_df = pd.concat(wells_heldout_df)

# take the indices of the held out data frame and use it to drop those samples from
# the main dataset. And then check if those indices are dropped
wells_idx_to_drop = wells_heldout_df.index.tolist()
fs_profile_df = fs_profile_df.drop(wells_idx_to_drop)
assert not fs_profile_df.index.isin(
    wells_idx_to_drop
).any(), "index to be dropped found in the main dataframe"

# saving the holdout data
wells_heldout_df.to_csv(
    data_split_dir / "wells_holdout.csv.gz", index=False, compression="gzip"
)

# display
print("Wells holdout shape:", wells_heldout_df.shape)
wells_heldout_df.head()

Wells holdout shape: (1125, 254)


Unnamed: 0,injury_code,injury_type,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,...,Cells_AreaShape_Zernike_8_2,Cytoplasm_AreaShape_FormFactor,Cells_AreaShape_Zernike_8_4,Cells_RadialDistribution_RadialCV_RNA_2of4,Cells_AreaShape_Solidity,Cytoplasm_RadialDistribution_MeanFrac_ER_2of4,Cytoplasm_AreaShape_Zernike_9_5,Cells_RadialDistribution_MeanFrac_DNA_2of4,Cells_RadialDistribution_FracAtD_RNA_3of4,Cells_AreaShape_Zernike_3_1
4994,0,Control,BR00109990,B12,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.040145,0.048424,-0.009433,0.075082,0.03502,-0.031018,-0.008153,-0.057286,0.007505,-0.012247
5058,0,Control,BR00109990,K10,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.012591,0.156842,0.029842,0.196149,0.042445,0.046887,-0.035511,0.007725,0.127208,0.000954
5050,0,Control,BR00109990,J18,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.036971,0.144002,-0.005104,0.043383,0.059112,0.045391,-0.024943,0.00437,0.095137,-0.014566
5035,0,Control,BR00109990,G9,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.014939,0.085469,0.013341,0.058608,0.035497,-0.02588,0.032955,0.045563,0.05216,-0.004972
4991,0,Control,BR00109990,B9,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.019574,-0.026158,-0.003867,-0.015022,-0.026718,-0.010295,0.006583,-0.018505,-0.011389,0.013764


## Saving training dataset

Once the holdout datasets have been generated, the next step is to save the training dataset that will serve as the basis for training the multi-class logistic regression model.

In [14]:
# Showing the amount of data we have after removing the holdout data
meta_injury = []
for injury_type, df in fs_profile_df.groupby("injury_type"):
    # extract n_wells, n_compounds and unique compounds per injury_type
    n_wells = df.shape[0]
    injury_code = df["injury_code"].unique()[0]
    unique_compounds = list(df["Compound Name"].unique())
    n_compounds = len(unique_compounds)

    # store information
    meta_injury.append(
        [injury_type, injury_code, n_wells, n_compounds, unique_compounds]
    )

# creating data frame
injury_meta_df = pd.DataFrame(
    meta_injury,
    columns=["injury_type", "injury_code", "n_wells", "n_compounds", "compound_list"],
).sort_values("n_wells", ascending=False)

# display
injury_meta_df

Unnamed: 0,injury_type,injury_code,n_wells,n_compounds,compound_list
0,Control,0,8408,1,[DMSO]
1,Cytoskeletal,1,1102,14,"[Nocodazole, Colchicine, Paclitaxel, Vinblasti..."
7,Miscellaneous,5,1007,38,"[L-Buthionine-(S,R)-sulfoximine, CDDO Im, Cino..."
6,Kinase,3,750,12,"[Wortmannin, Staurosporine, PI-103, BEZ-235, S..."
3,Genotoxin,4,737,21,"[Camptothecin, CX-5461, Doxorubicin, Cladribin..."
5,Hsp90,2,418,3,"[Radicicol, Geldanamycin, 17-AAG]"
11,Redox,6,215,11,"[Menadione, PKF118-310, Dunnione, MGR2, SIN-1 ..."
12,Saponin,10,164,10,"[Digitonin, Saikosaponin A, Polygalasaponin F,..."
4,HDAC,7,138,5,"[AR-42, SAHA, ITF 2357, Panobinostat, Apicidin]"
10,Proteasome,9,117,4,"[Carfilzomib, Bortezomib, (S)-MG132, (R)-MG132]"


In [15]:
# shape of the updated dataset after removing the holdouts
print("training shape after removing holdouts", fs_profile_df.shape)
fs_profile_df.head()

training shape after removing holdouts (13503, 254)


Unnamed: 0,injury_code,injury_type,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,...,Cells_AreaShape_Zernike_8_2,Cytoplasm_AreaShape_FormFactor,Cells_AreaShape_Zernike_8_4,Cells_RadialDistribution_RadialCV_RNA_2of4,Cells_AreaShape_Solidity,Cytoplasm_RadialDistribution_MeanFrac_ER_2of4,Cytoplasm_AreaShape_Zernike_9_5,Cells_RadialDistribution_MeanFrac_DNA_2of4,Cells_RadialDistribution_FracAtD_RNA_3of4,Cells_AreaShape_Zernike_3_1
0,0,Control,BR00110363,B2,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.016041,-0.041107,-0.000347,0.103464,-0.096563,-0.015029,-0.012755,0.033795,-0.026427,0.051454
1,0,Control,BR00110363,B3,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.016385,-0.030383,0.028278,-0.005469,-0.004532,0.009526,0.027104,-0.004394,-0.070415,-0.020646
2,0,Control,BR00110363,B4,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.004361,0.072861,-0.002493,0.054249,0.046936,0.009864,0.000802,0.014793,-0.004474,0.01267
3,0,Control,BR00110363,B5,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.010156,-0.002715,-0.001161,-0.018706,0.01133,-0.00128,0.000531,0.018147,-0.037427,0.024888
4,0,Control,BR00110363,B6,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.022098,0.001851,0.014731,0.026536,0.026682,-0.017886,0.016779,0.069323,0.008786,0.029109


In [16]:
# split the data into training and testing sets
meta_cols, feat_cols = split_meta_and_features(fs_profile_df)
X = fs_profile_df[feat_cols]
y = fs_profile_df["injury_code"]

# splitting dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.80, random_state=seed, stratify=y
)

# saving training dataset as csv file
X_train.to_csv(data_split_dir / "X_train.csv.gz", compression="gzip", index=False)
X_test.to_csv(data_split_dir / "X_test.csv.gz", compression="gzip", index=False)
y_train.to_csv(data_split_dir / "y_train.csv.gz", compression="gzip", index=False)
y_test.to_csv(data_split_dir / "y_test.csv.gz", compression="gzip", index=False)

# display data split sizes
print("X training size", X_train.shape)
print("X testing size", X_test.shape)
print("y training size", y_train.shape)
print("y testing size", y_test.shape)

X training size (10802, 221)
X testing size (2701, 221)
y training size (10802,)
y testing size (2701,)
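
The split above passes `stratify=y`, which asks `train_test_split` to keep the class proportions of `injury_code` roughly equal in the training and testing sets. The following minimal sketch shows that effect on toy labels; the 80/20 class imbalance and the single `feature` column are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# toy imbalanced labels: 80 samples of class 0 and 20 samples of class 1
y = pd.Series([0] * 80 + [1] * 20, name="injury_code")
X = pd.DataFrame({"feature": np.arange(len(y))})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.80, random_state=0, stratify=y
)

# class proportions in each split
print(y_train.value_counts(normalize=True).sort_index())
print(y_test.value_counts(normalize=True).sort_index())

# features and labels share the same row index after splitting
print("aligned:", X_train.index.equals(y_train.index))
```

Because the notebook saves each file with `index=False`, the `X_*` and `y_*` files can only be matched by row order when they are reloaded.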


In [17]:
# save metadata after holdout
cell_injury_metadata = fs_profile_df[fs_meta]
cell_injury_metadata.to_csv(
    data_split_dir / "cell_injury_metadata_after_holdout.csv.gz",
    compression="gzip",
    index=False,
)

# display
print("Metadata shape", cell_injury_metadata.shape)
cell_injury_metadata.head()

Metadata shape (13503, 33)


Unnamed: 0,injury_code,injury_type,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,...,Compound PubChem CID,Compound PubChem URL,Control Type,Channels,Comment [Image File Path],Comment [Image Prefix],Mahalanobis distance,Mahalanobis distance significant,Relative well cellcount,Relative well cellcount significant
0,0,Control,BR00110363,B2,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,679.0,https://pubchem.ncbi.nlm.nih.gov/compound/679,Negative,"Ch1 (blue): Nuclei, Ch2 (green): ER, Ch3 (yell...",/incoming/BR00110363/,r02c02,7.51,No,1.02,No
1,0,Control,BR00110363,B3,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,679.0,https://pubchem.ncbi.nlm.nih.gov/compound/679,Negative,"Ch1 (blue): Nuclei, Ch2 (green): ER, Ch3 (yell...",/incoming/BR00110363/,r02c03,6.21,No,1.11,No
2,0,Control,BR00110363,B4,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,679.0,https://pubchem.ncbi.nlm.nih.gov/compound/679,Negative,"Ch1 (blue): Nuclei, Ch2 (green): ER, Ch3 (yell...",/incoming/BR00110363/,r02c04,10.94,No,1.02,No
3,0,Control,BR00110363,B5,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,679.0,https://pubchem.ncbi.nlm.nih.gov/compound/679,Negative,"Ch1 (blue): Nuclei, Ch2 (green): ER, Ch3 (yell...",/incoming/BR00110363/,r02c05,7.59,No,1.06,No
4,0,Control,BR00110363,B6,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,679.0,https://pubchem.ncbi.nlm.nih.gov/compound/679,Negative,"Ch1 (blue): Nuclei, Ch2 (green): ER, Ch3 (yell...",/incoming/BR00110363/,r02c06,5.28,No,1.0,No
