# Splitting Data

Here, we use both the JUMP-aligned and non-aligned feature-selected cell-injury profiles generated in the preceding module notebook [here](../0.feature_selection/0.feature_selection.ipynb), and divide the data into training, testing, and holdout sets for machine learning.

In [1]:
import sys
import json
import pathlib
import warnings

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

sys.path.append("../../")  # noqa
from src.utils import split_meta_and_features, get_injury_treatment_info, load_json_file  # noqa

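# ignoring warnings (catch_warnings only builds a context manager, so it has
# no effect unless entered; filterwarnings applies the filter globally)
warnings.filterwarnings("ignore")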



## Helper functions

In [2]:
def get_and_rename_injury_info(
    profile: pd.DataFrame, groupby_key: str, column_name: str
) -> pd.DataFrame:
    """Gets injury treatment information and renames the specified column.

    Parameters
    ----------
    profile : DataFrame
        The profile DataFrame containing data to be processed.
    groupby_key : str
        The key to group by in the injury treatment information.
    column_name : str
        The new name for the 'n_wells' column.

    Returns
    -------
    DataFrame
        A DataFrame with the injury treatment information and the 'n_wells' column renamed.
    """
    return get_injury_treatment_info(profile=profile, groupby_key=groupby_key).rename(
        columns={"n_wells": column_name}
    )

Setting up parameters and file paths

In [3]:
# setting seed constants
seed = 0
np.random.seed(seed)

In [4]:
# directory to get all the inputs for this notebook
data_dir = pathlib.Path("../../data").resolve(strict=True)
JUMP_data_dir = (data_dir / "JUMP_data").resolve(strict=True)
results_dir = pathlib.Path("../../results").resolve(strict=True)
fs_dir = (results_dir / "0.feature_selection").resolve(strict=True)

# directory to store all the output of this notebook
data_split_dir = (results_dir / "1.data_splits").resolve()
data_split_dir.mkdir(exist_ok=True)

# feature space paths
fs_feature_space_path = (fs_dir / "fs_cell_injury_only_feature_space.json").resolve(
    strict=True
)
aligned_fs_feature_space_path = (
    fs_dir / "aligned_cell_injury_shared_feature_space.json"
).resolve(strict=True)

# data paths
raw_cell_injury_path = (
    JUMP_data_dir / "labeled_JUMP_all_plates_normalized_negcon.csv.gz"
).resolve(strict=True)
fs_profile_path = (fs_dir / "fs_cell_injury_only.csv.gz").resolve(strict=True)
aligned_fs_profile_path = (fs_dir / "aligned_cell_injury_profile_fs.csv.gz").resolve(
    strict=True
)

In [5]:
# loading in feature spaces and setting morphological feature spaces
fs_feature_space = load_json_file(fs_feature_space_path)
aligned_fs_feature_space = load_json_file(aligned_fs_feature_space_path)

fs_meta = fs_feature_space["meta_features"]
fs_features = fs_feature_space["features"]
aligned_fs_meta = aligned_fs_feature_space["meta_features"]
aligned_fs_features = aligned_fs_feature_space["features"]

# loading in both aligned and non aligned feature selected profiles
raw_cell_injury_profile_df = pd.read_csv(raw_cell_injury_path)
fs_profile_df = pd.read_csv(fs_profile_path)
aligned_fs_profile_df = pd.read_csv(aligned_fs_profile_path)
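
As a rough illustration of how a feature-space file separates metadata from morphology columns, here is a minimal sketch. Only the `meta_features` and `features` keys come from this notebook; the file name, the column names, and the toy values are hypothetical, and plain `json` is used in place of the project's `load_json_file` helper, whose implementation is not shown here.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# hypothetical feature-space content using the two keys read in this notebook
feature_space = {
    "meta_features": ["Plate", "Well", "injury_type"],
    "features": ["feat_a", "feat_b"],
}

# write and re-read the JSON file to mimic loading a stored feature space
with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = Path(tmp_dir) / "toy_feature_space.json"
    json_path.write_text(json.dumps(feature_space))
    loaded = json.loads(json_path.read_text())

# toy profile with one column that is not part of the feature space
toy_profile = pd.DataFrame(
    {
        "Plate": ["P1", "P1"],
        "Well": ["B2", "B3"],
        "injury_type": ["Control", "Kinase"],
        "feat_a": [0.1, -0.2],
        "feat_b": [1.5, 0.3],
        "feat_unselected": [9.0, 9.0],
    }
)

# column selection driven entirely by the feature-space file
meta_df = toy_profile[loaded["meta_features"]]
feature_df = toy_profile[loaded["features"]]
print(meta_df.shape, feature_df.shape)
```

Keeping the column lists in a file means every downstream notebook can select the same columns without recomputing them.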

## Exploring the data set

Below is an exploration of the selected features dataset. The aim is to identify treatments, extract metadata, and gain an understanding of the experiment's design.

Below, we show how many wells each treatment has.

In [6]:
well_treatments_counts_df = (
    raw_cell_injury_profile_df["Compound Name"].value_counts().to_frame().reset_index()
)

well_treatments_counts_df

Unnamed: 0,Compound Name,count
0,DMSO,9855
1,Wortmannin,600
2,Colchicine,512
3,Nocodazole,504
4,Radicicol,504
...,...,...
139,Carmustine,24
140,Thio-TEPA,24
141,Chlorambucil,24
142,Ebselen oxide,24


Below, we show how many wells each cellular injury type has.

In [7]:
# Displaying how many wells each cell injury type has
cell_injury_well_counts = (
    raw_cell_injury_profile_df["injury_type"].value_counts().to_frame().reset_index()
)
cell_injury_well_counts

Unnamed: 0,injury_type,count
0,Control,9855
1,Cytoskeletal,1472
2,Miscellaneous,1302
3,Kinase,1104
4,Genotoxin,944
5,Hsp90,552
6,Redox,312
7,Saponin,288
8,HDAC,168
9,Proteasome,144


Next, we extract metadata on how many wells and compounds are associated with each injury type.

This will be saved in the `results/1.data_splits` directory

In [8]:
# get summary information and save it
injury_before_holdout_info_df = get_injury_treatment_info(
    profile=raw_cell_injury_profile_df, groupby_key="injury_type"
).reset_index(drop=True)

# display
print("Shape:", injury_before_holdout_info_df.shape)
injury_before_holdout_info_df

Shape: (15, 5)


Unnamed: 0,injury_type,injury_code,n_wells,n_compounds,compound_list
0,Control,0,9855,1,[DMSO]
1,Cytoskeletal,1,1472,15,"[Nocodazole, Colchicine, Paclitaxel, Vinblasti..."
2,Miscellaneous,5,1302,39,"[L-Buthionine-(S,R)-sulfoximine, CDDO Im, Cino..."
3,Kinase,3,1104,13,"[Wortmannin, Staurosporine, PI-103, BEZ-235, A..."
4,Genotoxin,4,944,22,"[Camptothecin, CX-5461, Doxorubicin, Cladribin..."
5,Hsp90,2,552,3,"[Radicicol, Geldanamycin, 17-AAG]"
6,Redox,6,312,12,"[Menadione, PKF118-310, 4-Amino-1-naphthol (HC..."
7,Saponin,10,288,11,"[Digitonin, Saikosaponin A, Polygalasaponin F,..."
8,HDAC,7,168,5,"[AR-42, SAHA, ITF 2357, Panobinostat, Apicidin]"
9,Mitochondria,11,144,4,"[Antimycin A, CCCP, Rotenone, Oligomycin A]"


Next, we construct the profile metadata. This provides a structured overview of how the treatments associated with injuries were applied, detailing the treatments administered to each plate.

This will be saved in the `results/1.data_splits` directory

In [9]:
injury_meta_dict = {}
for injury, df in raw_cell_injury_profile_df.groupby("injury_type"):
    # collecting treatment metadata
    plates = df["Plate"].unique().tolist()
    treatment_meta = {}
    treatment_meta["n_plates"] = len(plates)
    treatment_meta["n_wells"] = df.shape[0]
    treatment_meta["n_treatments"] = len(df["Compound Name"].unique())
    treatment_meta["associated_plates"] = plates

    # counting treatments
    treatment_counter = {}
    for treatment, df2 in df.groupby("Compound Name"):
        if pd.isna(treatment):
            continue
        n_treatments = df2.shape[0]
        treatment_counter[treatment] = n_treatments

    # storing treatment counts
    treatment_meta["treatments"] = treatment_counter
    injury_meta_dict[injury] = treatment_meta

# save dictionary into a json file
with open(data_split_dir / "cell_injury_metadata.json", mode="w") as stream:
    json.dump(injury_meta_dict, stream)

Here we build the plate metadata, which records which treatments are present on each plate and how many wells received each treatment.

This will be saved in `results/1.data_splits`

In [10]:
plate_meta = {}
for plate_id, df in raw_cell_injury_profile_df.groupby("Plate"):
    # counting wells per treatment on this plate
    treatment_counter = {}
    for treatment, df2 in df.groupby("Compound Name"):
        treatment_counter[treatment] = df2.shape[0]

    plate_meta[plate_id] = treatment_counter

# save dictionary into a json file
with open(data_split_dir / "cell_injury_plate_info.json", mode="w") as stream:
    json.dump(plate_meta, stream)
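
The nested loops above can also be written more compactly. The following is a minimal sketch on a hypothetical toy frame (the plate IDs and counts are made up) that builds the same kind of `{plate: {compound: n_wells}}` mapping with `value_counts`. The counts are cast to plain `int` because NumPy integers are not JSON serializable.

```python
import json

import pandas as pd

# hypothetical toy data: two plates with a handful of wells each
toy_df = pd.DataFrame(
    {
        "Plate": ["P1", "P1", "P1", "P2", "P2"],
        "Compound Name": ["DMSO", "DMSO", "Colchicine", "DMSO", "Nocodazole"],
    }
)

# for each plate, count how many wells received each compound
plate_counts = {
    plate_id: {
        name: int(n_wells)
        for name, n_wells in plate_df["Compound Name"].value_counts().items()
    }
    for plate_id, plate_df in toy_df.groupby("Plate")
}

# the mapping serializes cleanly because all counts are plain Python ints
print(json.dumps(plate_counts, indent=2))
```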


## Data Splitting

---

In this section, we split the data into training, testing, and holdout sets. The process involves generating and splitting the holdout and train-test sets using the JUMP-aligned dataset. To ensure consistency, we extract the same samples from the non-aligned cell injury features, matching those used in the aligned dataset. This approach preserves sample variance and helps prevent errors due to sample discrepancies.

Each subsection will describe how the splits and holdouts were generated.

### Holdout dataset

Here we collect our holdout datasets. A holdout dataset is a subset of the data that is not used during model training or tuning. Instead, it is reserved solely for evaluating the model's performance after it has been trained.

In this notebook, we generate three different types of holdout datasets before proceeding with our machine learning training and evaluation:

- plate holdout
- treatment holdout
- well holdout

Each of these holdout datasets will be stored in the `results/1.data_splits` directory.



### Plate Holdout (JUMP aligned cell-injury profile)

Plates are randomly selected by their Plate ID, and all wells from the selected plates are saved as our `plate_holdout` data.

In [11]:
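# number of plates to hold out
n_plates = 10

# setting random seed globally
np.random.seed(seed)

# selecting plates randomly from the list of unique plate IDs
# NOTE: np.random.choice samples with replacement by default, so the same plate
# can be drawn more than once and fewer than n_plates unique plates may be held out
selected_plates = (
    np.random.choice(fs_profile_df["Plate"].unique().tolist(), (n_plates, 1))
    .flatten()
    .tolist()
)
# NOTE: rows are selected from fs_profile_df, so plate_holdout_df carries the
# fs_profile_df columns; dropping by index below relies on fs_profile_df and
# aligned_fs_profile_df sharing the same row index
plate_holdout_df = fs_profile_df.loc[fs_profile_df["Plate"].isin(selected_plates)]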

# take the indices of the held out data frame and use it to drop those samples from
# the main dataset. And then check if those indices are dropped
aligned_plate_holdout_idx = plate_holdout_df.index.tolist()
aligned_fs_profile_df = aligned_fs_profile_df.drop(aligned_plate_holdout_idx)
assert not aligned_fs_profile_df.index.isin(
    aligned_plate_holdout_idx
).any(), "index to be dropped found in the main dataframe"

# saving the holdout data
plate_holdout_df.to_csv(
    data_split_dir / "aligned_plate_holdout.csv.gz", index=False, compression="gzip"
)

# display
print("plate holdout shape:", plate_holdout_df.shape)
plate_holdout_df.head()

plate holdout shape: (1948, 385)


Unnamed: 0,injury_code,injury_type,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,...,Nuclei_Texture_InverseDifferenceMoment_DNA_20_0,Nuclei_Texture_InverseDifferenceMoment_DNA_5_0,Nuclei_Texture_InverseDifferenceMoment_RNA_5_0,Nuclei_Texture_SumAverage_AGP_5_0,Nuclei_Texture_SumAverage_DNA_10_0,Nuclei_Texture_SumAverage_Mito_5_0,Nuclei_Texture_SumAverage_RNA_5_0,Nuclei_Texture_SumEntropy_DNA_10_0,Nuclei_Texture_SumEntropy_DNA_20_0,Nuclei_Texture_SumVariance_DNA_20_0
1044,0,Control,BR00110368,B2,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.062886,0.011554,0.087596,0.163291,0.058129,-0.00701,0.100495,0.093309,0.108031,0.139935
1045,0,Control,BR00110368,B3,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.018239,0.05969,0.11874,0.031366,-0.00188,0.124516,-0.115299,0.06554,0.095688,0.097536
1046,0,Control,BR00110368,B4,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.054246,0.013374,0.010342,0.070002,0.002531,0.079322,0.127617,0.071349,0.025576,0.05115
1047,0,Control,BR00110368,B5,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.015467,0.01659,0.077109,0.00767,0.039326,0.022608,0.012423,0.076461,0.076174,0.098298
1048,0,Control,BR00110368,B6,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.00636,0.040196,0.104461,0.081361,0.013528,0.012501,-0.044112,0.043685,0.063887,0.06643
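
The cell above uses `np.random.choice` with its default `replace=True`, so a plate can be drawn twice. As a hedged alternative, the sketch below holds out whole plates using NumPy's `Generator.choice` with `replace=False`, which guarantees the requested number of distinct plates. The toy frame and the names `toy_df`, `toy_plate_holdout`, and `toy_remaining` are hypothetical, and this is not the sampling used to produce the outputs shown in this notebook.

```python
import numpy as np
import pandas as pd

# hypothetical toy profile: 6 plates with 4 wells each
toy_df = pd.DataFrame(
    {
        "Plate": [f"P{i}" for i in range(6) for _ in range(4)],
        "value": np.arange(24, dtype=float),
    }
)

# seeded generator so the selection is reproducible
rng = np.random.default_rng(0)
n_holdout_plates = 2

# draw distinct plate IDs (replace=False prevents duplicates)
holdout_plates = rng.choice(
    toy_df["Plate"].unique().tolist(), size=n_holdout_plates, replace=False
)

# every well on a selected plate goes to the holdout; the rest stays
is_holdout = toy_df["Plate"].isin(holdout_plates)
toy_plate_holdout = toy_df.loc[is_holdout]
toy_remaining = toy_df.loc[~is_holdout]

print("held-out plates:", sorted(holdout_plates.tolist()))
print(toy_plate_holdout.shape, toy_remaining.shape)
```

Splitting with a single boolean mask also makes the holdout and the remaining data disjoint by construction.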


### Plate Holdout (non aligned cell-injury profile)

The indices used to generate the plate holdout for the aligned dataset will also be applied to create the non-aligned plate holdout. 

In [12]:
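# select the feature-selected (non-aligned) columns, then the held-out rows.
# .loc selects by index label; this matches the original positional lookup
# because the raw profile keeps its default RangeIndex
fs_plate_holdout_df = raw_cell_injury_profile_df[fs_meta + fs_features]
fs_plate_holdout_df = fs_plate_holdout_df.loc[aligned_plate_holdout_idx]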
fs_plate_holdout_df.head()

Unnamed: 0,injury_code,injury_type,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,...,Nuclei_Texture_InverseDifferenceMoment_DNA_20_0,Nuclei_Texture_InverseDifferenceMoment_DNA_5_0,Nuclei_Texture_InverseDifferenceMoment_RNA_5_0,Nuclei_Texture_SumAverage_AGP_5_0,Nuclei_Texture_SumAverage_DNA_10_0,Nuclei_Texture_SumAverage_Mito_5_0,Nuclei_Texture_SumAverage_RNA_5_0,Nuclei_Texture_SumEntropy_DNA_10_0,Nuclei_Texture_SumEntropy_DNA_20_0,Nuclei_Texture_SumVariance_DNA_20_0
1044,0,Control,BR00110368,B2,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.062886,0.011554,0.087596,0.163291,0.058129,-0.00701,0.100495,0.093309,0.108031,0.139935
1045,0,Control,BR00110368,B3,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.018239,0.05969,0.11874,0.031366,-0.00188,0.124516,-0.115299,0.06554,0.095688,0.097536
1046,0,Control,BR00110368,B4,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.054246,0.013374,0.010342,0.070002,0.002531,0.079322,0.127617,0.071349,0.025576,0.05115
1047,0,Control,BR00110368,B5,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.015467,0.01659,0.077109,0.00767,0.039326,0.022608,0.012423,0.076461,0.076174,0.098298
1048,0,Control,BR00110368,B6,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.00636,0.040196,0.104461,0.081361,0.013528,0.012501,-0.044112,0.043685,0.063887,0.06643


Verify that the indices of the holdouts are identical between `fs_plate_holdout` and `aligned_plate_holdout`.

In [13]:
# check that both holdout sets contain the same rows
assert all(
    aligned_plate_holdout_idx == fs_plate_holdout_df.index
), "holdout indexes are not the same"

# save plate holdout for non aligned profile
fs_plate_holdout_df.to_csv(
    data_split_dir / "fs_plate_holdout.csv.gz", index=False, compression="gzip"
)

### Treatment holdout (JUMP aligned cell-injury profile)

To establish our treatment holdout, we first need to find the number of treatments and wells associated with a specific cell injury, considering the removal of randomly selected plates from the previous step.

To determine which cell injuries should be considered for a single treatment holdout, we establish a threshold of 10 unique compounds. This means that a cell injury type must have at least 10 unique compounds to qualify for selection in the treatment holdout. Any cell injury types failing to meet this criterion will be disregarded.

Once the cell injuries are identified for treatment holdout, we select our holdout treatment by grouping each injury type and choosing the treatment with the fewest wells. This becomes our treatment holdout dataset.

In [14]:
injury_treatment_metadata = (
    aligned_fs_profile_df.groupby(["injury_type", "Compound Name"])
    .size()
    .reset_index(name="n_wells")
)
injury_treatment_metadata

Unnamed: 0,injury_type,Compound Name,n_wells
0,Control,DMSO,8783
1,Cytoskeletal,ARQ 621,12
2,Cytoskeletal,Citreoviridin,18
3,Cytoskeletal,Citrinin,18
4,Cytoskeletal,Colchicine,457
...,...,...,...
139,Tannin,Corilagin,18
140,Tannin,Gallotannin,24
141,Tannin,Punicalagin,18
142,mTOR,Rapamycin,42


In [15]:
# minimum number of unique compounds an injury type needs to qualify
min_treatments_per_injury = 10

# Filter out the injury types for which we can select a complete treatment.
# We are using a threshold of 10. If an injury type is associated with fewer than 10 compounds,
# we do not conduct treatment holdout on those injury types.
accepted_injuries = []
for injury_type, df in injury_treatment_metadata.groupby("injury_type"):
    n_treatments = df.shape[0]
    if n_treatments >= min_treatments_per_injury:
        accepted_injuries.append(df)

accepted_injuries = pd.concat(accepted_injuries)

# Next, we select the treatment that will be held out within each injury type.
# NOTE: df.min() takes the column-wise minimum, so .iloc[1] is the
# lexicographically smallest "Compound Name" in the group, which is not
# necessarily the treatment with the fewest wells.
selected_treatments_to_holdout = []
for injury_type, df in accepted_injuries.groupby("injury_type"):
    held_treatment = df.min().iloc[1]
    selected_treatments_to_holdout.append([injury_type, held_treatment])

# convert to dataframe
selected_treatments_to_holdout = pd.DataFrame(
    selected_treatments_to_holdout, columns=["injury_type", "held_treatment"]
)

print("Below are the accepted cell injuries and treatments to be held out")
selected_treatments_to_holdout

Below are the accepted cell injuries and treatments to be held out


Unnamed: 0,injury_type,held_treatment
0,Cytoskeletal,ARQ 621
1,Genotoxin,Aphidicolin
2,Kinase,AZD 1152-HQPA
3,Miscellaneous,Aloisine RP106
4,Redox,4-Amino-1-naphthol (HCl)
5,Saponin,Bacopasaponin C
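
Note that `df.min().iloc[1]` in the cell above returns the lexicographically smallest compound name per injury type, which may differ from the compound with the fewest wells. A minimal sketch of selecting by well count with `idxmin` is shown below. The toy counts are hypothetical, and this is one possible way to implement the stated rule, not the code that produced the table above.

```python
import pandas as pd

# hypothetical per-treatment well counts for two injury types
toy_counts = pd.DataFrame(
    {
        "injury_type": ["Kinase", "Kinase", "Kinase", "Redox", "Redox"],
        "Compound Name": ["AZD", "Wortmannin", "PI-103", "Menadione", "Dunnione"],
        "n_wells": [30, 600, 12, 48, 24],
    }
)

# row label of the treatment with the fewest wells in each injury type
fewest_idx = toy_counts.groupby("injury_type")["n_wells"].idxmin()
fewest_wells = (
    toy_counts.loc[fewest_idx, ["injury_type", "Compound Name"]]
    .rename(columns={"Compound Name": "held_treatment"})
    .reset_index(drop=True)
)

# for comparison: the lexicographically smallest name per injury type
first_by_name = toy_counts.groupby("injury_type")["Compound Name"].min()

print(fewest_wells)
print(first_by_name)
```

In this toy data the two rules agree for Redox but disagree for Kinase, which is the case worth checking on the real profile.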


In [16]:
# select all wells that have the treatments to be held out
# (the mask is built from the same dataframe that is being filtered)
treatment_holdout_df = aligned_fs_profile_df.loc[
    aligned_fs_profile_df["Compound Name"].isin(
        selected_treatments_to_holdout["held_treatment"]
    )
]

# take the indices of the held out data frame and use it to drop those samples from
# the main dataset. And then check if those indices are dropped
aligned_treatment_holdout_idx = treatment_holdout_df.index.tolist()
aligned_fs_profile_df = aligned_fs_profile_df.drop(aligned_treatment_holdout_idx)
assert not aligned_fs_profile_df.index.isin(
    aligned_treatment_holdout_idx
).any(), "index to be dropped found in the main dataframe"
# saving the holdout data
treatment_holdout_df.to_csv(
    data_split_dir / "aligned_treatment_holdout.csv.gz", index=False, compression="gzip"
)

# display
print("Treatment holdout shape:", treatment_holdout_df.shape)
treatment_holdout_df.head()

Treatment holdout shape: (126, 254)


Unnamed: 0,injury_code,injury_type,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,...,Cytoplasm_RadialDistribution_RadialCV_Mito_2of4,Nuclei_AreaShape_Zernike_8_4,Cytoplasm_RadialDistribution_MeanFrac_Mito_3of4,Nuclei_RadialDistribution_RadialCV_Mito_2of4,Cytoplasm_RadialDistribution_RadialCV_DNA_1of4,Nuclei_Intensity_IntegratedIntensityEdge_RNA,Cytoplasm_AreaShape_Solidity,Cells_RadialDistribution_RadialCV_AGP_3of4,Cytoplasm_Intensity_MassDisplacement_DNA,Cytoplasm_Intensity_MaxIntensity_AGP
10865,1,Cytoskeletal,BR00114093,G17,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.157707,-0.424077,0.238654,0.47888,0.155788,-1.686765,0.32206,1.002289,-0.694047,0.411874
10866,1,Cytoskeletal,BR00114093,G18,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.369202,-0.104745,0.004876,0.725775,0.976211,0.275408,0.080979,0.480748,0.326442,0.672092
10867,1,Cytoskeletal,BR00114093,G19,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.112051,-0.108013,-0.147719,0.391773,0.635461,-0.028438,-0.013401,0.299985,0.171236,0.447289
10868,1,Cytoskeletal,BR00114093,G20,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.07178,0.00311,0.007107,0.116435,0.108253,-0.105663,0.121813,0.145047,0.073096,0.023514
10869,1,Cytoskeletal,BR00114093,G21,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.094077,-0.019577,0.055251,0.078307,0.102966,-0.129248,0.064776,0.058527,0.058303,0.135508


### Treatment Holdout (non aligned cell-injury profile)

The indices used to generate the treatment holdout for the aligned dataset will also be applied to create the non-aligned treatment holdout.

In [17]:
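# select the feature-selected (non-aligned) columns, then the held-out rows.
# .loc selects by index label; this matches the original positional lookup
# because the raw profile keeps its default RangeIndex
fs_treatment_holdout_df = raw_cell_injury_profile_df[fs_meta + fs_features]
fs_treatment_holdout_df = fs_treatment_holdout_df.loc[aligned_treatment_holdout_idx]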
fs_treatment_holdout_df.head()

Unnamed: 0,injury_code,injury_type,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,...,Nuclei_Texture_InverseDifferenceMoment_DNA_20_0,Nuclei_Texture_InverseDifferenceMoment_DNA_5_0,Nuclei_Texture_InverseDifferenceMoment_RNA_5_0,Nuclei_Texture_SumAverage_AGP_5_0,Nuclei_Texture_SumAverage_DNA_10_0,Nuclei_Texture_SumAverage_Mito_5_0,Nuclei_Texture_SumAverage_RNA_5_0,Nuclei_Texture_SumEntropy_DNA_10_0,Nuclei_Texture_SumEntropy_DNA_20_0,Nuclei_Texture_SumVariance_DNA_20_0
10865,1,Cytoskeletal,BR00114093,G17,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.718839,0.049142,-1.740741,0.253751,-2.058566,2.938515,1.600429,0.843067,0.216721,0.809478
10866,1,Cytoskeletal,BR00114093,G18,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.014927,0.100368,0.080819,0.173918,-0.397879,0.155645,0.373518,0.133463,0.120955,0.235244
10867,1,Cytoskeletal,BR00114093,G19,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.051267,-0.009367,0.037545,-0.041294,-0.375493,0.09888,0.095514,-0.027942,-0.001075,0.017841
10868,1,Cytoskeletal,BR00114093,G20,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.015987,0.000596,-0.035705,0.055867,-0.018554,-0.020018,0.131029,0.050271,0.022755,0.033448
10869,1,Cytoskeletal,BR00114093,G21,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.017609,0.066929,0.08337,-0.186292,-0.096957,-0.063968,-0.073064,0.061575,0.050377,0.04906


In [18]:
# check that both holdout sets contain the same rows
assert all(
    aligned_treatment_holdout_idx == fs_treatment_holdout_df.index
), "holdout indexes are not the same"

# save treatment holdout for non aligned profile
fs_treatment_holdout_df.to_csv(
    data_split_dir / "fs_treatment_holdout.csv.gz", index=False, compression="gzip"
)

### Well holdout (JUMP aligned cell-injury profile)

To generate the well holdout data, we iterate over each plate and randomly select wells. Because controls heavily outnumber treated wells, control and treated wells are sampled separately: 5 control (DMSO) wells and 10 treated wells are randomly selected from each plate.


In [19]:
# parameters
n_controls = 5
n_samples = 10

# setting random seed globally
np.random.seed(seed)

# collecting randomly selected wells from each plate
wells_heldout_df = []
for plate_id, df in aligned_fs_profile_df.groupby("Plate", as_index=False):
    # separate control wells and rest of all wells since there is a huge label imbalance
    # select 5 control wells and 10 treated wells at random from the plate
    df_control = df.loc[df["Compound Name"] == "DMSO"].sample(
        n=n_controls, random_state=seed
    )
    df_treated = df.loc[df["Compound Name"] != "DMSO"].sample(
        n=n_samples, random_state=seed
    )

    # concatenate those together
    well_heldout = pd.concat([df_control, df_treated])

    wells_heldout_df.append(well_heldout)

# generate well holdout dataframe
wells_heldout_df = pd.concat(wells_heldout_df)

# take the indices of the held out data frame and use it to drop those samples from
# the main dataset. And then check if those indices are dropped
aligned_wells_holdout_idx = wells_heldout_df.index.tolist()
aligned_fs_profile_df = aligned_fs_profile_df.drop(aligned_wells_holdout_idx)
assert not aligned_fs_profile_df.index.isin(
    aligned_wells_holdout_idx
).any(), "index to be dropped found in the main dataframe"

# saving the holdout data
wells_heldout_df.to_csv(
    data_split_dir / "aligned_wells_holdout.csv.gz", index=False, compression="gzip"
)

# display
print("Wells holdout shape:", wells_heldout_df.shape)
wells_heldout_df.head()

Wells holdout shape: (1125, 254)


Unnamed: 0,injury_code,injury_type,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,...,Cytoplasm_RadialDistribution_RadialCV_Mito_2of4,Nuclei_AreaShape_Zernike_8_4,Cytoplasm_RadialDistribution_MeanFrac_Mito_3of4,Nuclei_RadialDistribution_RadialCV_Mito_2of4,Cytoplasm_RadialDistribution_RadialCV_DNA_1of4,Nuclei_Intensity_IntegratedIntensityEdge_RNA,Cytoplasm_AreaShape_Solidity,Cells_RadialDistribution_RadialCV_AGP_3of4,Cytoplasm_Intensity_MassDisplacement_DNA,Cytoplasm_Intensity_MaxIntensity_AGP
4994,0,Control,BR00109990,B12,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.060639,-0.009071,-0.056602,0.01002,-0.07531,-0.10236,0.043993,-0.014525,-0.054584,-0.186213
5058,0,Control,BR00109990,K10,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.077503,-0.01391,0.075362,0.084203,0.013832,0.169926,0.130327,0.082314,0.056664,0.248564
5050,0,Control,BR00109990,J18,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.010995,0.026215,0.034111,0.05588,-0.025792,0.109645,0.138761,0.052096,-0.050448,0.211491
5035,0,Control,BR00109990,G9,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.014911,0.001269,-0.029966,-0.006748,-0.053485,0.323106,0.03827,-0.044801,-0.072149,0.058953
4991,0,Control,BR00109990,B9,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.013263,0.023094,-0.01962,-0.077216,-0.060231,-0.016611,0.001895,-0.081841,-0.11298,-0.319118
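
`DataFrame.sample(n=...)` raises a `ValueError` when a group has fewer rows than `n` (with the default `replace=False`), which could happen on a plate that lost wells to the earlier holdouts. The sketch below shows the same per-plate control/treated sampling on a hypothetical toy frame, with a `min(...)` guard added as an assumption of this example; the guard is not part of the cell above.

```python
import pandas as pd

seed = 0
n_controls = 2  # control wells to hold out per plate (toy value)
n_treated = 3  # treated wells to hold out per plate (toy value)

# hypothetical toy data: 2 plates, each with 6 DMSO wells and 4 treated wells
rows = []
for plate in ["P1", "P2"]:
    for i in range(6):
        rows.append({"Plate": plate, "Well": f"A{i}", "Compound Name": "DMSO"})
    for i in range(4):
        rows.append({"Plate": plate, "Well": f"B{i}", "Compound Name": f"cmpd_{i}"})
toy_df = pd.DataFrame(rows)

held = []
for plate_id, plate_df in toy_df.groupby("Plate"):
    controls = plate_df.loc[plate_df["Compound Name"] == "DMSO"]
    treated = plate_df.loc[plate_df["Compound Name"] != "DMSO"]
    # guard: never request more wells than the plate actually has
    held.append(controls.sample(n=min(n_controls, len(controls)), random_state=seed))
    held.append(treated.sample(n=min(n_treated, len(treated)), random_state=seed))

toy_wells_holdout = pd.concat(held)
toy_remaining = toy_df.drop(toy_wells_holdout.index)

print(toy_wells_holdout.shape, toy_remaining.shape)
```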


### Well holdout (non aligned cell-injury profile)

The indices used to generate the well holdout for the aligned dataset will also be applied to create the non-aligned well holdout.

In [20]:
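# select the feature-selected (non-aligned) columns, then the held-out rows.
# .loc selects by index label; this matches the original positional lookup
# because the raw profile keeps its default RangeIndex
fs_wells_holdout_df = raw_cell_injury_profile_df[fs_meta + fs_features]
fs_wells_holdout_df = fs_wells_holdout_df.loc[aligned_wells_holdout_idx]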
fs_wells_holdout_df.head()

Unnamed: 0,injury_code,injury_type,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,...,Nuclei_Texture_InverseDifferenceMoment_DNA_20_0,Nuclei_Texture_InverseDifferenceMoment_DNA_5_0,Nuclei_Texture_InverseDifferenceMoment_RNA_5_0,Nuclei_Texture_SumAverage_AGP_5_0,Nuclei_Texture_SumAverage_DNA_10_0,Nuclei_Texture_SumAverage_Mito_5_0,Nuclei_Texture_SumAverage_RNA_5_0,Nuclei_Texture_SumEntropy_DNA_10_0,Nuclei_Texture_SumEntropy_DNA_20_0,Nuclei_Texture_SumVariance_DNA_20_0
4994,0,Control,BR00109990,B12,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.033483,-0.030215,-0.000435,-0.079334,0.020319,0.165035,-0.031037,0.02598,-0.000591,0.000401
5058,0,Control,BR00109990,K10,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.020964,-0.018817,0.024526,-0.005947,0.038915,0.069078,-0.073076,-0.027233,-0.000291,0.014113
5050,0,Control,BR00109990,J18,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.094199,0.064575,0.161873,-0.041443,-0.06339,0.193849,-0.124109,0.034587,0.073446,0.099643
5035,0,Control,BR00109990,G9,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.034564,-0.03705,-0.037351,0.060268,0.012708,0.061491,-0.040638,-0.093101,-0.116318,-0.111381
4991,0,Control,BR00109990,B9,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.037054,0.029132,-0.070626,0.132449,-0.002434,-0.023357,0.01658,-0.085577,-0.061243,-0.080604


In [21]:
# check that both holdout sets contain the same rows
assert all(
    aligned_wells_holdout_idx == fs_wells_holdout_df.index
), "holdout indexes are not the same"

# save well holdout for non aligned profile
fs_wells_holdout_df.to_csv(
    data_split_dir / "fs_well_holdout.csv.gz", index=False, compression="gzip"
)

## Saving training dataset

Once the holdout sets have been generated, the next step is to save the training dataset that will serve as the basis for training the multi-class logistic regression model.

In [22]:
# get summary cell injury dataset treatment and well info after holdouts
injury_after_holdout_info_df = get_injury_treatment_info(
    profile=aligned_fs_profile_df, groupby_key="injury_type"
)

# display
print("shape:", injury_after_holdout_info_df.shape)
injury_after_holdout_info_df

shape: (15, 5)


Unnamed: 0,injury_type,injury_code,n_wells,n_compounds,compound_list
0,Control,0,8408,1,[DMSO]
1,Cytoskeletal,1,1102,14,"[Nocodazole, Colchicine, Paclitaxel, Vinblasti..."
7,Miscellaneous,5,1006,38,"[L-Buthionine-(S,R)-sulfoximine, CDDO Im, Cino..."
6,Kinase,3,750,12,"[Wortmannin, Staurosporine, PI-103, BEZ-235, S..."
3,Genotoxin,4,738,21,"[Camptothecin, CX-5461, Doxorubicin, Cladribin..."
5,Hsp90,2,418,3,"[Radicicol, Geldanamycin, 17-AAG]"
11,Redox,6,215,11,"[Menadione, PKF118-310, Dunnione, MGR2, SIN-1 ..."
12,Saponin,10,163,10,"[Digitonin, Saikosaponin A, Polygalasaponin F,..."
4,HDAC,7,138,5,"[AR-42, SAHA, ITF 2357, Panobinostat, Apicidin]"
10,Proteasome,9,117,4,"[Carfilzomib, Bortezomib, (S)-MG132, (R)-MG132]"


In [23]:
# shape of the remaining (training and testing) data after removing holdouts
print("training shape after removing holdouts", aligned_fs_profile_df.shape)
fs_profile_df.head()

training shape after removing holdouts (13502, 254)


Unnamed: 0,injury_code,injury_type,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,...,Nuclei_Texture_InverseDifferenceMoment_DNA_20_0,Nuclei_Texture_InverseDifferenceMoment_DNA_5_0,Nuclei_Texture_InverseDifferenceMoment_RNA_5_0,Nuclei_Texture_SumAverage_AGP_5_0,Nuclei_Texture_SumAverage_DNA_10_0,Nuclei_Texture_SumAverage_Mito_5_0,Nuclei_Texture_SumAverage_RNA_5_0,Nuclei_Texture_SumEntropy_DNA_10_0,Nuclei_Texture_SumEntropy_DNA_20_0,Nuclei_Texture_SumVariance_DNA_20_0
0,0,Control,BR00110363,B2,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.011258,9.8e-05,0.057244,0.160847,-0.083034,-0.02329,-0.066369,-0.015235,-0.035909,-0.032067
1,0,Control,BR00110363,B3,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.064689,0.025857,0.099848,0.017477,0.0213,0.058137,-0.09728,-0.073545,-0.044883,-0.01524
2,0,Control,BR00110363,B4,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.020937,0.04106,0.119247,0.111741,0.041592,0.224199,-0.088845,0.000327,-0.003115,-0.014406
3,0,Control,BR00110363,B5,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.006589,0.022156,0.036473,-0.013141,0.00869,0.06086,0.044924,0.040528,0.070877,0.072871
4,0,Control,BR00110363,B6,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.028361,0.007213,0.023068,0.110361,0.054405,0.030157,0.06648,0.03891,0.048559,0.056829


Generating the training and testing sets for both the aligned and non-aligned feature-selected profiles.

In [24]:
# split the data into training and testing sets
meta_cols, _ = split_meta_and_features(aligned_fs_profile_df)
X = aligned_fs_profile_df[aligned_fs_features]
y = aligned_fs_profile_df["injury_code"]

# splitting dataset
aligned_X_train, aligned_X_test, aligned_y_train, aligned_y_test = train_test_split(
    X, y, train_size=0.80, random_state=seed, stratify=y
)

# saving training dataset as csv file
aligned_X_train.to_csv(
    data_split_dir / "aligned_X_train.csv.gz", compression="gzip", index=False
)
aligned_X_test.to_csv(
    data_split_dir / "aligned_X_test.csv.gz", compression="gzip", index=False
)
aligned_y_train.to_csv(
    data_split_dir / "aligned_y_train.csv.gz", compression="gzip", index=False
)
aligned_y_test.to_csv(
    data_split_dir / "aligned_y_test.csv.gz", compression="gzip", index=False
)

# display data split sizes
print("aligned X training size", aligned_X_train.shape)
print("aligned X testing size", aligned_X_test.shape)
print("aligned y training size", aligned_y_train.shape)
print("aligned y testing size", aligned_y_test.shape)

aligned X training size (10801, 221)
aligned X testing size (2701, 221)
aligned y training size (10801,)
aligned y testing size (2701,)
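
Passing `stratify=y` to `train_test_split` keeps the class proportions of `injury_code` roughly equal in the training and testing sets, which matters here because control wells dominate the data. A minimal sketch on hypothetical toy labels (80 wells of class 0 and 20 of class 1) illustrates the effect:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical imbalanced toy data: 80% class 0, 20% class 1
toy_X = pd.DataFrame({"feat_a": np.arange(100, dtype=float)})
toy_y = pd.Series([0] * 80 + [1] * 20, name="injury_code")

# same call pattern as above: 80/20 split, fixed seed, stratified on labels
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, train_size=0.80, random_state=0, stratify=toy_y
)

# class counts in each split mirror the 80/20 class ratio of the full data
print("train counts:", y_train.value_counts().to_dict())
print("test counts:", y_test.value_counts().to_dict())
```

The returned frames keep the original index labels, which is what allows the same rows to be looked up in another profile afterwards.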


Next, using the indexes produced from the data splits, we will generate the training and testing sets for the non-aligned (feature-selected only) cell injury profiles. These indexes are derived from the raw labeled cell injury dataset, but we will apply them only to the feature space of the feature-selected cell injury profiles.

In [25]:
# generating the train-test split for the non-aligned cell injury profile
# fs_features = features from the feature-selected-only cell injury profile
# .loc selects by index label; this matches the original positional lookup
# because the raw profile keeps its default RangeIndex
fs_X_train = raw_cell_injury_profile_df.loc[aligned_X_train.index, fs_features]
fs_X_test = raw_cell_injury_profile_df.loc[aligned_X_test.index, fs_features]
fs_y_train = raw_cell_injury_profile_df.loc[aligned_y_train.index, "injury_code"]
fs_y_test = raw_cell_injury_profile_df.loc[aligned_y_test.index, "injury_code"]

# saving the feature-selected only training and testing datasets
# as gzip-compressed csv files
fs_X_train.to_csv(data_split_dir / "fs_X_train.csv.gz", compression="gzip", index=False)
fs_X_test.to_csv(data_split_dir / "fs_X_test.csv.gz", compression="gzip", index=False)
fs_y_train.to_csv(data_split_dir / "fs_y_train.csv.gz", compression="gzip", index=False)
fs_y_test.to_csv(data_split_dir / "fs_y_test.csv.gz", compression="gzip", index=False)

# display data split sizes
print("feature selected only X training size", fs_X_train.shape)
print("feature selected only X testing size", fs_X_test.shape)
print("feature selected only y training size", fs_y_train.shape)
print("feature selected only y testing size", fs_y_test.shape)

feature selected only X training size (10801, 352)
feature selected only X testing size (2701, 352)
feature selected only y training size (10801,)
feature selected only y testing size (2701,)
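
As a small illustration of reusing split indexes across feature spaces, the sketch below looks up rows by index label in a toy table. All names and values here (`toy_raw_df`, `toy_train_index`, the `feat_*` columns) are hypothetical. It also shows that position-based `.iloc` returns the same rows as label-based `.loc` only when the frame carries a default `0..n-1` index.

```python
import pandas as pd

# hypothetical raw table whose index labels are not 0..n-1
toy_raw_df = pd.DataFrame(
    {
        "feat_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "feat_b": [10, 20, 30, 40, 50, 60],
    },
    index=[10, 11, 12, 13, 14, 15],
)

# pretend these labels came back from a split made in another feature space
toy_train_index = pd.Index([15, 10, 13])

# label-based lookup recovers the same wells, restricted to the wanted columns
toy_train_other_space = toy_raw_df.loc[toy_train_index, ["feat_b"]]
print(toy_train_other_space)

# with a default RangeIndex, labels and positions coincide,
# so .loc and .iloc select identical rows
toy_range_df = toy_raw_df.reset_index(drop=True)
toy_labels = pd.Index([5, 0, 3])
same_rows = toy_range_df.loc[toy_labels].equals(toy_range_df.iloc[toy_labels])
print("loc and iloc agree on a RangeIndex:", same_rows)
```

Whether the two lookups agree for the real data depends on how the raw profile was indexed when it was loaded, which this sketch does not assume.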


In [26]:
# save metadata after holdout
cell_injury_metadata = aligned_fs_profile_df[aligned_fs_meta]
cell_injury_metadata.to_csv(
    data_split_dir / "aligned_cell_injury_metadata_after_holdout.csv.gz",
    compression="gzip",
    index=False,
)
# display
print("Metadata shape", cell_injury_metadata.shape)
cell_injury_metadata.head()

Metadata shape (13502, 33)


Unnamed: 0,injury_code,injury_type,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,...,Compound PubChem CID,Compound PubChem URL,Control Type,Channels,Comment [Image File Path],Comment [Image Prefix],Mahalanobis distance,Mahalanobis distance significant,Relative well cellcount,Relative well cellcount significant
0,0,Control,BR00110363,B2,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,679.0,https://pubchem.ncbi.nlm.nih.gov/compound/679,Negative,"Ch1 (blue): Nuclei, Ch2 (green): ER, Ch3 (yell...",/incoming/BR00110363/,r02c02,7.51,No,1.02,No
1,0,Control,BR00110363,B3,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,679.0,https://pubchem.ncbi.nlm.nih.gov/compound/679,Negative,"Ch1 (blue): Nuclei, Ch2 (green): ER, Ch3 (yell...",/incoming/BR00110363/,r02c03,6.21,No,1.11,No
2,0,Control,BR00110363,B4,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,679.0,https://pubchem.ncbi.nlm.nih.gov/compound/679,Negative,"Ch1 (blue): Nuclei, Ch2 (green): ER, Ch3 (yell...",/incoming/BR00110363/,r02c04,10.94,No,1.02,No
3,0,Control,BR00110363,B5,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,679.0,https://pubchem.ncbi.nlm.nih.gov/compound/679,Negative,"Ch1 (blue): Nuclei, Ch2 (green): ER, Ch3 (yell...",/incoming/BR00110363/,r02c05,7.59,No,1.06,No
4,0,Control,BR00110363,B6,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,679.0,https://pubchem.ncbi.nlm.nih.gov/compound/679,Negative,"Ch1 (blue): Nuclei, Ch2 (green): ER, Ch3 (yell...",/incoming/BR00110363/,r02c06,5.28,No,1.0,No


## Generating data split summary file

In [27]:
# name of the columns
data_col_name = [
    "Number of Wells (Total Data)",
    "Number of Wells (Train Split)",
    "Number of Wells (Test Split)",
    "Number of Wells (Plate Holdout)",
    "Number of Wells (Treatment Holdout)",
    "Number of Wells (Well Holdout)",
]


# Total amount summary
injury_before_holdout_info_df = injury_before_holdout_info_df.rename(
    columns={"n_wells": data_col_name[0]}
)
# Data Splitting: Train-Test Summary
# For each split (train and test), we rebuild the full profile by
# re-attaching the metadata columns to the split features by index,
# then compare it to the same rows of the raw data. If the two match,
# the row indices were left unchanged by the split and still point
# to the same wells.

# full aligned fs profile feature space
full_aligned_fs_space = meta_cols + aligned_fs_features

# generate profile summary for aligned_X_train data
profile = aligned_X_train.merge(
    aligned_fs_profile_df[meta_cols], how="left", right_index=True, left_index=True
)
profile = profile[full_aligned_fs_space]

# check that the indices have not changed
assert profile.equals(
    raw_cell_injury_profile_df[full_aligned_fs_space].loc[profile.index]
)

# generating summary for aligned train data
injury_train_info_df = get_and_rename_injury_info(
    profile=profile,
    groupby_key="injury_type",
    column_name=data_col_name[1],
)

# generate profile summary for aligned_X_test data
profile = aligned_X_test.merge(
    aligned_fs_profile_df[meta_cols], how="left", right_index=True, left_index=True
)
profile = profile[full_aligned_fs_space]

# check that the indices have not changed
assert profile.equals(
    raw_cell_injury_profile_df[full_aligned_fs_space].loc[profile.index]
)

# generating summary for aligned test data
injury_test_info_df = get_and_rename_injury_info(
    profile=profile,
    groupby_key="injury_type",
    column_name=data_col_name[2],
)

# Holdouts summary
injury_plate_holdout_info_df = get_and_rename_injury_info(
    profile=plate_holdout_df, groupby_key="injury_type", column_name=data_col_name[3]
)

injury_treatment_holdout_info_df = get_and_rename_injury_info(
    profile=treatment_holdout_df,
    groupby_key="injury_type",
    column_name=data_col_name[4],
)

injury_well_holdout_info_df = get_and_rename_injury_info(
    profile=wells_heldout_df, groupby_key="injury_type", column_name=data_col_name[5]
)

# select the columns of interest
total_data_summary = injury_before_holdout_info_df[["injury_type", data_col_name[0]]]
train_split_summary = injury_train_info_df[["injury_type", data_col_name[1]]]
test_split_summary = injury_test_info_df[["injury_type", data_col_name[2]]]
plate_holdout_info_df = injury_plate_holdout_info_df[["injury_type", data_col_name[3]]]
treatment_holdout_summary = injury_treatment_holdout_info_df[
    ["injury_type", data_col_name[4]]
]
well_holdout_summary = injury_well_holdout_info_df[["injury_type", data_col_name[5]]]
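
The index consistency check used for the train and test profiles can be sketched on toy data as follows. The `toy_*` names and values are hypothetical, and the plain `groupby` at the end is only a stand-in for `get_injury_treatment_info`, whose implementation is not shown in this notebook.

```python
import pandas as pd

# hypothetical raw profile: one metadata column plus two feature columns
toy_meta_cols = ["injury_type"]
toy_feature_cols = ["feat_a", "feat_b"]
toy_raw_df = pd.DataFrame(
    {
        "injury_type": ["Control", "Control", "Kinase", "Kinase", "Redox"],
        "feat_a": [0.1, 0.2, 0.3, 0.4, 0.5],
        "feat_b": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)

# stand-in for a split: feature columns only, rows in shuffled order
toy_X_split = toy_raw_df.loc[[4, 0, 2], toy_feature_cols]

# re-attach the metadata by index, then restore the metadata-first column order
toy_profile = toy_X_split.merge(
    toy_raw_df[toy_meta_cols], how="left", left_index=True, right_index=True
)[toy_meta_cols + toy_feature_cols]

# passes only if every index label still maps to the same row as in the raw data
assert toy_profile.equals(
    toy_raw_df.loc[toy_profile.index, toy_meta_cols + toy_feature_cols]
)

# per-injury well counts (assumed equivalent of the 'n_wells' summary)
toy_counts = (
    toy_profile.groupby("injury_type").size().rename("n_wells").reset_index()
)
print(toy_counts)
```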

In [28]:
# merge the summary data splits into one, update data type to integers
merged_summary_df = (
    total_data_summary.merge(train_split_summary, on="injury_type", how="outer")
    .merge(test_split_summary, on="injury_type", how="outer")
    .merge(plate_holdout_info_df, on="injury_type", how="outer")
    .merge(treatment_holdout_summary, on="injury_type", how="outer")
    .merge(well_holdout_summary, on="injury_type", how="outer")
    .fillna(0)
    .set_index("injury_type")
)[data_col_name].astype(int)

# reset the index and rename the 'injury_type' column to "Cellular Injury"
merged_summary_df = merged_summary_df.reset_index().rename(
    columns={"injury_type": "Cellular Injury"}
)

# save as csv file
merged_summary_df.to_csv(data_split_dir / "aligned_summary_data_split.csv", index=False)

# display
merged_summary_df

Unnamed: 0,Cellular Injury,Number of Wells (Total Data),Number of Wells (Train Split),Number of Wells (Test Split),Number of Wells (Plate Holdout),Number of Wells (Treatment Holdout),Number of Wells (Well Holdout)
0,Control,9855,6726,1682,1072,0,375
1,Cytoskeletal,1472,882,220,181,12,177
2,Miscellaneous,1302,805,201,171,18,107
3,Kinase,1104,600,150,120,12,222
4,Genotoxin,944,590,148,73,48,85
5,Hsp90,552,334,84,54,0,80
6,Redox,312,172,43,54,24,19
7,Saponin,288,130,33,102,12,11
8,HDAC,168,110,28,30,0,0
9,Mitochondria,144,92,23,12,0,17
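
In the rows shown above, the train, test, and holdout counts add up to the total for each injury type (for example, Control: 6726 + 1682 + 1072 + 0 + 375 = 9855). A check of that kind could be sketched as below. The toy frames and short column names are hypothetical, and this is one possible sanity check rather than a step taken in this notebook.

```python
import pandas as pd

# hypothetical per-split summaries; HDAC has no holdout wells
toy_total = pd.DataFrame({"injury_type": ["Control", "HDAC"], "total": [10, 4]})
toy_train = pd.DataFrame({"injury_type": ["Control", "HDAC"], "train": [6, 3]})
toy_test = pd.DataFrame({"injury_type": ["Control", "HDAC"], "test": [2, 1]})
toy_holdout = pd.DataFrame({"injury_type": ["Control"], "holdout": [2]})

# outer merges keep injury types missing from a split; fillna(0) marks them as zero wells
toy_summary = (
    toy_total.merge(toy_train, on="injury_type", how="outer")
    .merge(toy_test, on="injury_type", how="outer")
    .merge(toy_holdout, on="injury_type", how="outer")
    .fillna(0)
    .set_index("injury_type")
    .astype(int)
)

# wells not assigned to any split; every entry should be zero
toy_split_cols = ["train", "test", "holdout"]
toy_unaccounted = toy_summary["total"] - toy_summary[toy_split_cols].sum(axis=1)
print(toy_unaccounted)
```

A nonzero entry would indicate wells that were dropped or counted twice somewhere between the holdout and train-test steps.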


In [29]:
aligned_X_train

Unnamed: 0,Cells_AreaShape_Center_Y,Nuclei_RadialDistribution_MeanFrac_AGP_3of4,Cytoplasm_Intensity_IntegratedIntensityEdge_RNA,Cytoplasm_Intensity_MassDisplacement_Mito,Cells_RadialDistribution_MeanFrac_DNA_4of4,Cells_RadialDistribution_MeanFrac_DNA_1of4,Cells_Intensity_IntegratedIntensity_DNA,Cytoplasm_AreaShape_Zernike_9_7,Cytoplasm_AreaShape_Zernike_1_1,Nuclei_Intensity_MeanIntensity_DNA,...,Cytoplasm_RadialDistribution_RadialCV_Mito_2of4,Nuclei_AreaShape_Zernike_8_4,Cytoplasm_RadialDistribution_MeanFrac_Mito_3of4,Nuclei_RadialDistribution_RadialCV_Mito_2of4,Cytoplasm_RadialDistribution_RadialCV_DNA_1of4,Nuclei_Intensity_IntegratedIntensityEdge_RNA,Cytoplasm_AreaShape_Solidity,Cells_RadialDistribution_RadialCV_AGP_3of4,Cytoplasm_Intensity_MassDisplacement_DNA,Cytoplasm_Intensity_MaxIntensity_AGP
2161,0.009605,-0.024311,0.176434,0.114215,0.061683,-0.036071,-0.109576,0.033731,-0.002682,-0.137098,...,0.099895,-0.039828,0.009790,0.136396,-0.048611,0.172864,0.036852,-0.016312,-0.022189,0.178365
9408,-0.010438,-0.074035,-0.292777,0.028674,-0.001566,0.005958,0.050170,0.005739,-0.011285,0.049422,...,0.011872,-0.014659,-0.014107,0.008597,0.011793,-0.291528,-0.016250,-0.030798,-0.030012,-0.176677
15831,0.014844,0.956912,-0.857596,-0.029655,-0.913831,0.333799,0.025361,-0.490821,-0.036708,0.571568,...,0.101591,-0.394279,-0.004591,-0.433273,1.625553,-0.122598,-1.739672,-0.572896,-0.230745,0.066949
16076,0.012248,0.095890,0.092654,0.008163,-0.004674,-0.022394,0.100521,0.003571,0.051459,0.213723,...,-0.004355,0.007787,0.021782,-0.030538,-0.117717,0.082624,0.020435,0.059771,-0.025888,0.028344
16208,0.009151,1.471051,-0.588496,-0.013725,-0.793533,0.502853,-0.226770,-0.248307,0.039840,-0.118319,...,-0.157560,-0.332592,-0.604735,2.968704,0.481544,0.823948,-0.339561,0.090877,-0.542140,0.832465
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14846,-0.066363,-0.047576,-0.054420,0.054178,0.044821,-0.046987,0.024665,-0.032386,0.021676,0.064961,...,0.060937,0.000368,-0.018849,-0.027572,0.096787,0.021070,0.087825,0.053350,0.100450,0.202643
7949,0.059814,0.009167,-0.271565,0.069975,-0.036996,0.040264,0.135613,-0.003203,-0.005058,0.236390,...,0.071867,0.027161,0.008176,0.005304,0.076478,-0.288609,0.025572,0.045990,0.082247,0.074327
13790,0.005665,0.158275,0.255886,-0.048889,0.048809,0.018521,0.149582,-0.024412,0.014413,0.078695,...,-0.042644,-0.040146,0.003253,-0.012581,-0.133286,0.217894,-0.023678,0.095027,-0.039866,-0.038256
10352,-0.026096,0.560357,0.049673,0.666035,-0.079224,-0.075563,0.771144,-0.093712,0.265943,0.179701,...,0.176196,-0.123266,-0.241621,0.673708,0.706570,1.156130,0.298863,0.335104,0.349594,0.327419
