# Feature Processing and Selection

This notebook focuses on exploration using two essential files: the annotations data extracted from the actual screening profile (available in the [IDR repository](https://github.com/IDR/idr0133-dahlin-cellpainting/tree/main/screenA)) and the metadata retrieved from the supplementary section of the [research paper](https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-023-36829-x/MediaObjects/41467_2023_36829_MOESM5_ESM.xlsx).

We explore the number of unique compounds associated with each cell injury and subsequently cross-reference this information with the screening profile. The aim is to assess the feasibility of using the data for training a machine learning model to predict cell injury.

We apply feature selection through [pycytominer](https://github.com/cytomining/pycytominer) to capture the most informative features representing various cellular injury types within the morphology space. Then, we utilize the selected feature profiles for machine learning applications.

In [13]:
import sys
import json
import pathlib
from collections import defaultdict

import pandas as pd
from pycytominer import feature_select

sys.path.append("../../")
from src import utils

Setting up paths

In [14]:
# data directory
data_dir = pathlib.Path("../../data").resolve(strict=True)
results_dir = pathlib.Path("../../results").resolve(strict=True)
fs_dir = (results_dir / "0.feature_selection").resolve()
fs_dir.mkdir(exist_ok=True)

# jump feature space path
jump_feature_space_path = (data_dir / "JUMP_data/jump_feature_space.json").resolve(
    strict=True
)

# data paths
suppl_meta_path = (data_dir / "41467_2023_36829_MOESM5_ESM.csv.gz").resolve(strict=True)
screen_anno_path = (data_dir / "idr0133-screenA-annotation.csv.gz").resolve(strict=True)
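
The path setup above relies on `resolve(strict=True)` to fail early when an input file is missing. The following is a minimal sketch of that behavior in a temporary directory; the file name `not_there.csv.gz` is hypothetical and used only for illustration.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    base = pathlib.Path(tmp)
    # hypothetical file name, used only for this illustration
    missing = base / "not_there.csv.gz"

    # strict=True raises FileNotFoundError when the path does not exist
    try:
        missing.resolve(strict=True)
        strict_raised = False
    except FileNotFoundError:
        strict_raised = True

    # strict=False (the default) resolves without checking existence
    lenient_path = missing.resolve()
    lenient_exists = lenient_path.exists()

    # mkdir(exist_ok=True) can be called repeatedly without raising
    out_dir = base / "0.feature_selection"
    out_dir.mkdir(exist_ok=True)
    out_dir.mkdir(exist_ok=True)
    out_dir_created = out_dir.is_dir()
```

This is why the input paths use `strict=True` while the output directory is resolved without it and then created.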

Loading cell-injury well aggregated profiles

In [15]:
# loading jump feature space
jump_feature_space = utils.load_json_file(jump_feature_space_path)

# loading in cell-injury dataset
image_profile_df = pd.read_csv(screen_anno_path)

# split columns and only get metadata dataframe
meta, feature = utils.split_meta_and_features(image_profile_df)
meta_df = image_profile_df[meta]

compounds_df = meta_df[["Compound Name", "Compound Class"]]

suppl_meta_df = pd.read_csv(suppl_meta_path)
cell_injury_df = suppl_meta_df[["Cellular injury category", "Compound alias"]]

print("Cell injury screen shape:", image_profile_df.shape)

Cell injury screen shape: (23111, 403)


## Labeling Cell Injury data

Here, we are collecting all the samples treated solely with DMSO. Any well treated with DMSO will be labeled as "Control."

In [16]:
# Get all wells treated with DMSO and label them as "Control" as the injury_type
control_df = image_profile_df.loc[image_profile_df["Compound Name"] == "DMSO"]
control_df.insert(0, "injury_type", "Control")

# display
print("Shape of the control:", control_df.shape)
control_df.head()

Shape of the control: (9855, 404)


Unnamed: 0,injury_type,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,Experimental Condition [Treatment time (h)],...,Nuclei_Texture_InverseDifferenceMoment_DNA_5_0,Nuclei_Texture_InverseDifferenceMoment_RNA_5_0,Nuclei_Texture_SumAverage_AGP_5_0,Nuclei_Texture_SumAverage_DNA_10_0,Nuclei_Texture_SumAverage_Mito_5_0,Nuclei_Texture_SumAverage_RNA_5_0,Nuclei_Texture_SumEntropy_DNA_10_0,Nuclei_Texture_SumEntropy_DNA_20_0,Nuclei_Texture_SumEntropy_DNA_5_0,Nuclei_Texture_SumVariance_DNA_20_0
0,Control,BR00110363,B2,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,24,...,9.8e-05,0.057244,0.160847,-0.083034,-0.02329,-0.066369,-0.015235,-0.035909,-0.013321,-0.032067
1,Control,BR00110363,B3,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,24,...,0.025857,0.099848,0.017477,0.0213,0.058137,-0.09728,-0.073545,-0.044883,-0.089842,-0.01524
2,Control,BR00110363,B4,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,24,...,0.04106,0.119247,0.111741,0.041592,0.224199,-0.088845,0.000327,-0.003115,0.016075,-0.014406
3,Control,BR00110363,B5,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,24,...,0.022156,0.036473,-0.013141,0.00869,0.06086,0.044924,0.040528,0.070877,0.038779,0.072871
4,Control,BR00110363,B6,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,24,...,0.007213,0.023068,0.110361,0.054405,0.030157,0.06648,0.03891,0.048559,0.050371,0.056829


Next, we generate `injured_df`, which contains the control wells plus the wells treated with a compound that induces an injury. The supplemental data lists which compounds cause which injuries. We cross-reference that list with the image-based profile to find the wells treated with those compounds and label each one with the associated injury.
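
The cross-referencing described above can be sketched on toy data with plain Python structures. All injury names, compound names, and wells below are made up for illustration; the notebook itself operates on pandas DataFrames.

```python
from collections import defaultdict

# made-up (injury, compound) pairs standing in for the supplementary table
injury_compound_pairs = [
    ("injury_a", "compound_1"),
    ("injury_a", "compound_2"),
    ("injury_b", "compound_3"),
]

# made-up wells standing in for the screening profile
wells = [
    {"Well": "B2", "Compound Name": "compound_1"},
    {"Well": "B3", "Compound Name": "compound_3"},
    {"Well": "B4", "Compound Name": "compound_9"},  # no known injury
]

# group compounds by injury, as in the notebook
injury_and_compounds = defaultdict(list)
for injury, compound in injury_compound_pairs:
    injury_and_compounds[injury].append(compound)

# invert the mapping so each compound points at its injury
compound_to_injury = {
    compound: injury
    for injury, compounds in injury_and_compounds.items()
    for compound in compounds
}

# label wells whose compound has a known injury; skip the rest
labeled_wells = [
    {"injury_type": compound_to_injury[w["Compound Name"]], **w}
    for w in wells
    if w["Compound Name"] in compound_to_injury
]
```

The inverted mapping assumes each compound belongs to exactly one injury category. If a compound were listed under two injuries, the dictionary would keep only the last one, whereas a per-injury loop would label the same well once per injury.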

In [17]:
# creating a dictionary that contains the {injury_type : [list of treatments]}
injury_and_compounds = defaultdict(list)
for injury, compound in cell_injury_df.values.tolist():
    injury_and_compounds[injury].append(compound)

# cross reference injury and associated treatments into the screen image-based profile
injury_profiles = []
for injury_type, compound_list in injury_and_compounds.items():
    # selecting data frame with the treatments associated with the injury
    sel_profile = image_profile_df[
        image_profile_df["Compound Name"].isin(compound_list)
    ]

    # add a column to the data subset indicating what type of injury it is
    # and store it
    sel_profile.insert(0, "injury_type", injury_type)
    injury_profiles.append(sel_profile)

# concat the control and all injured labeled wells into a single data frame
injured_df = pd.concat(
    [
        control_df,
        pd.concat(injury_profiles).dropna(subset="injury_type").reset_index(drop=True),
    ]
)

# creating cell injury encoder and decoder dictionaries
cell_injuries = injured_df["injury_type"].unique()
injury_codes = defaultdict(dict)
for idx, injury in enumerate(cell_injuries):
    injury_codes["encoder"][injury] = idx
    injury_codes["decoder"][idx] = injury

# update injured_df with injury codes
injured_df.insert(
    0,
    "injury_code",
    injured_df["injury_type"].apply(lambda injury: injury_codes["encoder"][injury]),
)

# split meta and feature column names
injury_meta, injury_feats = utils.split_meta_and_features(injured_df)

# save the injury codes json file
with open(fs_dir / "injury_codes.json", mode="w") as f:
    json.dump(injury_codes, f)

# display
print("Shape of cell injury dataframe", injured_df.shape)
print("Number of meta features", len(injury_meta))
print("Number of features", len(injury_feats))
print("Number of plates", len(injured_df["Plate"].unique()))
print("Number of injuries", len(injured_df["injury_type"].unique()))
print("Number of treatments", len(injured_df["Compound Name"].unique()))
print("List of Compounds", injured_df["Compound Name"].unique())
print("List of Injuries", injured_df["injury_type"].unique())
injured_df.head()

Shape of cell injury dataframe (16703, 405)
Number of meta features 33
Number of features 372
Number of plates 84
Number of injuries 15
Number of treatments 145
List of Compounds ['DMSO' 'Nocodazole' 'Colchicine' 'Paclitaxel' 'Vinblastine' 'Ispinesib'
 'ARQ 621' 'SB-743921' 'Epothilone B' 'Cytochalasin B' 'Monastrol'
 'Cytochalasin D' 'Latrunculin B' 'Citrinin' 'Podophyllotoxin'
 'Citreoviridin' 'Radicicol' 'Geldanamycin' '17-AAG' 'Wortmannin'
 'Staurosporine' 'PI-103' 'BEZ-235' 'AZD 1152-HQPA' 'Saracatinib'
 'PKC 412' 'Lestaurtinib' 'Dasatinib' 'LY294002' 'Sorafenib' 'KW 2449'
 'Sunitinib' 'Camptothecin' 'CX-5461' 'Doxorubicin' 'Cladribine'
 'Etoposide' 'Aphidicolin' 'Gemcitabine' 'Cisplatin' 'Oxaliplatin'
 'Carboplatin' 'Dacarbazine' 'Lomustine' 'SN-38' 'Decitabine' 'Busulfan'
 'Irinotecan' 'Chlorambucil' 'Thio-TEPA' 'Carmustine' 'Melphalan'
 'Cyclophosphamide' 'β-Amanitin' 'L-Buthionine-(S,R)-sulfoximine'
 'CDDO Im' 'Cinobufagin' 'Puromycin' 'Brefeldin A' 'Tetrandrine'
 'Pristimerin

Unnamed: 0,injury_code,injury_type,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,...,Nuclei_Texture_InverseDifferenceMoment_DNA_5_0,Nuclei_Texture_InverseDifferenceMoment_RNA_5_0,Nuclei_Texture_SumAverage_AGP_5_0,Nuclei_Texture_SumAverage_DNA_10_0,Nuclei_Texture_SumAverage_Mito_5_0,Nuclei_Texture_SumAverage_RNA_5_0,Nuclei_Texture_SumEntropy_DNA_10_0,Nuclei_Texture_SumEntropy_DNA_20_0,Nuclei_Texture_SumEntropy_DNA_5_0,Nuclei_Texture_SumVariance_DNA_20_0
0,0,Control,BR00110363,B2,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,9.8e-05,0.057244,0.160847,-0.083034,-0.02329,-0.066369,-0.015235,-0.035909,-0.013321,-0.032067
1,0,Control,BR00110363,B3,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.025857,0.099848,0.017477,0.0213,0.058137,-0.09728,-0.073545,-0.044883,-0.089842,-0.01524
2,0,Control,BR00110363,B4,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.04106,0.119247,0.111741,0.041592,0.224199,-0.088845,0.000327,-0.003115,0.016075,-0.014406
3,0,Control,BR00110363,B5,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.022156,0.036473,-0.013141,0.00869,0.06086,0.044924,0.040528,0.070877,0.038779,0.072871
4,0,Control,BR00110363,B6,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.007213,0.023068,0.110361,0.054405,0.030157,0.06648,0.03891,0.048559,0.050371,0.056829
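
One detail worth keeping in mind about the `injury_codes.json` file saved above: JSON object keys are always strings, so the integer keys of the `decoder` dictionary come back as strings when the file is reloaded. A minimal sketch with a two-entry toy dictionary in the same layout:

```python
import json

# toy encoder/decoder in the same layout as injury_codes
codes = {
    "encoder": {"Control": 0, "Cytoskeletal": 1},
    "decoder": {0: "Control", 1: "Cytoskeletal"},
}

# round trip through JSON text, as happens when saving and reloading
reloaded = json.loads(json.dumps(codes))

# integer keys come back as strings
decoder_keys = list(reloaded["decoder"].keys())

# convert the keys back to int before looking up integer labels
decoder = {int(code): injury for code, injury in reloaded["decoder"].items()}
```

The `encoder` values are unaffected because only keys are coerced to strings.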


After generating the complete cell injury dataframe, we will check for any rows containing NaN values and remove them if found.

In [18]:
# next, drop rows that contain NaNs in any feature column
df = injured_df[injury_feats]
nan_idx_to_drop = df[df.isna().any(axis=1)].index

# display
print(f"shape of dataframe before drop NaN rows {injured_df.shape}")
print(f"There are {len(nan_idx_to_drop)} rows to drop that contains NaN's")

# update
injured_df = injured_df.drop(nan_idx_to_drop)
print(injured_df.shape)
injured_df.head()

shape of dataframe before drop NaN rows (16703, 405)
There are 2 rows to drop that contains NaN's
(16701, 405)


Unnamed: 0,injury_code,injury_type,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,...,Nuclei_Texture_InverseDifferenceMoment_DNA_5_0,Nuclei_Texture_InverseDifferenceMoment_RNA_5_0,Nuclei_Texture_SumAverage_AGP_5_0,Nuclei_Texture_SumAverage_DNA_10_0,Nuclei_Texture_SumAverage_Mito_5_0,Nuclei_Texture_SumAverage_RNA_5_0,Nuclei_Texture_SumEntropy_DNA_10_0,Nuclei_Texture_SumEntropy_DNA_20_0,Nuclei_Texture_SumEntropy_DNA_5_0,Nuclei_Texture_SumVariance_DNA_20_0
0,0,Control,BR00110363,B2,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,9.8e-05,0.057244,0.160847,-0.083034,-0.02329,-0.066369,-0.015235,-0.035909,-0.013321,-0.032067
1,0,Control,BR00110363,B3,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.025857,0.099848,0.017477,0.0213,0.058137,-0.09728,-0.073545,-0.044883,-0.089842,-0.01524
2,0,Control,BR00110363,B4,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.04106,0.119247,0.111741,0.041592,0.224199,-0.088845,0.000327,-0.003115,0.016075,-0.014406
3,0,Control,BR00110363,B5,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.022156,0.036473,-0.013141,0.00869,0.06086,0.044924,0.040528,0.070877,0.038779,0.072871
4,0,Control,BR00110363,B6,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.007213,0.023068,0.110361,0.054405,0.030157,0.06648,0.03891,0.048559,0.050371,0.056829


Save the labeled cell-injury dataset into the ./data directory


In [19]:
injured_df.to_csv(
    data_dir / "labeled_cell_injury_profile.csv.gz",
    index=False,
    compression="gzip",
)

## Feature Selection with Cell-Injury Data

Here, we perform feature selection using Pycytominer on the labeled cell-injury dataset to identify morphological features that are indicative of cellular damage. By selecting these key features, we aim to enhance our understanding of the biological mechanisms underlying cellular injuries. The selected features will be used to train a multi-class logistic regression model, allowing us to determine which morphological characteristics are most significant in discerning various types of cellular injuries.

In this section, we generate two files:
- the feature-selected cell injury profiles
- the feature space associated with these profiles
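
One of the operations used in this section, `correlation_threshold`, removes features that are highly correlated with another feature. The idea can be sketched in plain Python as follows. This is a simplified illustration on made-up columns, not pycytominer's implementation: it keeps the first feature of each correlated group, and pycytominer's own rule for choosing which feature of a pair to drop may differ.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def greedy_correlation_filter(columns, threshold=0.9):
    """Keep a feature only if it is not highly correlated with any kept one."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# made-up feature columns: feat_b is almost a multiple of feat_a
toy_columns = {
    "feat_a": [1.0, 2.0, 3.0, 4.0],
    "feat_b": [2.0, 4.1, 6.0, 8.2],
    "feat_c": [1.0, -1.0, 1.0, -1.0],
}

kept = greedy_correlation_filter(toy_columns, threshold=0.9)
```

Here `feat_b` is dropped because its correlation with `feat_a` exceeds 0.9, while `feat_c` is kept.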

In [20]:
# conduct feature selection using pycytominer
fs_cell_injury_profile = feature_select(
    profiles=injured_df,
    features=injury_feats,
    operation=[
        "correlation_threshold",
        "variance_threshold",
        "drop_outliers",
        "drop_na_columns",
    ],
    corr_threshold=0.9,
    corr_method="pearson",
    freq_cut=0.05,
    outlier_cutoff=500,
    na_cutoff=0.05,
)

# split meta and morphology feature columns
fs_cell_injury_meta, fs_cell_injury_feats = utils.split_meta_and_features(
    fs_cell_injury_profile
)

# display
print(f"N features cell-injury profile {len(injury_feats)}")
print(f"N features fs-cell-injury profile {len(fs_cell_injury_feats)}")
print(f"N features dropped {len(injury_feats) - len(fs_cell_injury_feats)}")

N features cell-injury profile 372
N features fs-cell-injury profile 352
N features dropped 20


After generating the feature-selected cell-injury profiles, we save both the selected feature space and the profiles in the `results/0.feature_selection/` directory.

In [21]:
# if the feature space JSON file does not exist, create one and use this feature space downstream
cell_injury_selected_feature_space_path = (
    fs_dir / "fs_cell_injury_only_feature_space.json"
).resolve()
if not cell_injury_selected_feature_space_path.exists():
    # saving morphology feature space in JSON file
    print("Feature space file does not exist, creating one...")
    fs_cell_injury_feature_space = {}
    fs_cell_injury_feature_space["name"] = "fs_cell_injury"
    fs_cell_injury_feature_space["n_plates"] = len(
        fs_cell_injury_profile["Plate"].unique()
    )
    fs_cell_injury_feature_space["n_meta_features"] = len(fs_cell_injury_meta)
    fs_cell_injury_feature_space["n_features"] = len(fs_cell_injury_feats)
    fs_cell_injury_feature_space["meta_features"] = fs_cell_injury_meta
    fs_cell_injury_feature_space["features"] = fs_cell_injury_feats
    with open(fs_dir / "fs_cell_injury_only_feature_space.json", mode="w") as stream:
        json.dump(fs_cell_injury_feature_space, stream)

# saving feature selected cell-injury profile
fs_cell_injury_profile.to_csv(fs_dir / "fs_cell_injury_only.csv.gz", index=False)

print(fs_cell_injury_profile.shape)
fs_cell_injury_profile.head()

(16701, 385)


Unnamed: 0,injury_code,injury_type,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,...,Nuclei_Texture_InverseDifferenceMoment_DNA_20_0,Nuclei_Texture_InverseDifferenceMoment_DNA_5_0,Nuclei_Texture_InverseDifferenceMoment_RNA_5_0,Nuclei_Texture_SumAverage_AGP_5_0,Nuclei_Texture_SumAverage_DNA_10_0,Nuclei_Texture_SumAverage_Mito_5_0,Nuclei_Texture_SumAverage_RNA_5_0,Nuclei_Texture_SumEntropy_DNA_10_0,Nuclei_Texture_SumEntropy_DNA_20_0,Nuclei_Texture_SumVariance_DNA_20_0
0,0,Control,BR00110363,B2,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.011258,9.8e-05,0.057244,0.160847,-0.083034,-0.02329,-0.066369,-0.015235,-0.035909,-0.032067
1,0,Control,BR00110363,B3,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.064689,0.025857,0.099848,0.017477,0.0213,0.058137,-0.09728,-0.073545,-0.044883,-0.01524
2,0,Control,BR00110363,B4,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.020937,0.04106,0.119247,0.111741,0.041592,0.224199,-0.088845,0.000327,-0.003115,-0.014406
3,0,Control,BR00110363,B5,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.006589,0.022156,0.036473,-0.013141,0.00869,0.06086,0.044924,0.040528,0.070877,0.072871
4,0,Control,BR00110363,B6,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.028361,0.007213,0.023068,0.110361,0.054405,0.030157,0.06648,0.03891,0.048559,0.056829


## Identifying Shared Features between JUMP and Cell Injury Datasets

In this section, we identify the features present in both the normalized cell-injury dataset and the JUMP pilot dataset. We then restrict the cell-injury profiles to these shared features and use the result for feature selection in the next step.

In [22]:
# Grab all JUMP morphological features
jump_feats = set(jump_feature_space["features"])

# find shared features and create data frame
shared_features = list(jump_feats.intersection(set(injury_feats)))
shared_features_df = pd.concat(
    [injured_df[injury_meta], injured_df[shared_features].fillna(0)], axis=1
)

# split meta and feature column
shared_meta, shared_feats = utils.split_meta_and_features(shared_features_df)

# display
print("Number of features in Cell Injury", len(injury_feats))
print("Number of features in JUMP", len(jump_feats))
print("Number of shared feats", len(shared_features))
print(
    "Number of features that are not overlapping",
    len(injury_feats) - len(shared_features),
)
print("N features in shared injured profile", len(shared_feats))
print("Shape of shared cell injury profile", shared_features_df.shape)
shared_features_df.head()

Number of features in Cell Injury 372
Number of features in JUMP 5792
Number of shared feats 221
Number of features that are not overlapping 151
N features in shared injured profile 221
Shape of shared cell injury profile (16701, 254)


Unnamed: 0,injury_code,injury_type,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,...,Cells_RadialDistribution_MeanFrac_ER_3of4,Nuclei_Intensity_StdIntensityEdge_ER,Nuclei_RadialDistribution_RadialCV_DNA_4of4,Cells_RadialDistribution_FracAtD_DNA_3of4,Nuclei_Intensity_IntegratedIntensityEdge_ER,Nuclei_Intensity_MinIntensity_ER,Cells_RadialDistribution_RadialCV_Mito_3of4,Nuclei_RadialDistribution_RadialCV_DNA_2of4,Cytoplasm_AreaShape_Zernike_7_5,Cells_RadialDistribution_MeanFrac_DNA_1of4
0,0,Control,BR00110363,B2,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.001427,-0.051765,0.08926,0.002696,-0.048802,-0.006978,0.064019,0.095432,0.02186,-0.001711
1,0,Control,BR00110363,B3,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.050403,-0.004125,-0.029568,-0.029957,0.020328,0.067573,-0.040367,0.017204,-0.019313,0.018188
2,0,Control,BR00110363,B4,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.03847,-0.078511,-0.034188,-0.009395,-0.044589,0.035589,-0.165888,-0.032308,0.010803,0.014888
3,0,Control,BR00110363,B5,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,-0.023922,-0.030677,0.011993,-0.013818,0.051871,0.103215,-0.039797,-0.011596,0.002641,0.009786
4,0,Control,BR00110363,B6,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,...,0.051581,-0.033829,-0.014885,0.004196,-0.001608,0.013377,-0.042161,-0.048675,-0.019703,0.059695
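
Because Python sets are unordered, converting a set intersection to a list does not guarantee a stable column order between runs. If a reproducible order matters, one option is to filter the original feature list against the set instead. A minimal sketch with made-up feature names (in the notebook these would be `injury_feats` and `jump_feature_space["features"]`):

```python
# made-up feature names for illustration
injury_feats_toy = ["Cells_feat_1", "Cells_feat_2", "Nuclei_feat_1", "Nuclei_feat_2"]
jump_feats_toy = {"Nuclei_feat_2", "Cells_feat_1", "Cytoplasm_feat_1"}

# keep the order of the injury feature list instead of the set's order
shared_ordered = [feat for feat in injury_feats_toy if feat in jump_feats_toy]

# sorting the intersection is another deterministic option
shared_sorted = sorted(jump_feats_toy.intersection(injury_feats_toy))
```

Both variants return the same members as the set intersection; they differ only in how the order is fixed.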


## Applying Feature Selection with Pycytominer

In this section, we apply Pycytominer's feature selection function to the JUMP-aligned cell injury profiles. This process generates two key outputs:

- A feature-selected, aligned cell injury profile
- The aligned selected feature space, saved in a JSON file
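
As a rough illustration of what the `freq_cut` parameter of the `variance_threshold` operation measures: as we understand pycytominer's documentation, it is the ratio between the count of the second most common value and the count of the most common value, and features whose ratio falls below the cutoff are removed. The sketch below is a simplified stand-in on made-up columns, not pycytominer's code, and it ignores the companion `unique_cut` criterion.

```python
from collections import Counter


def frequency_ratio(values):
    """Count of the 2nd most common value divided by count of the most common."""
    top_two = Counter(values).most_common(2)
    if len(top_two) < 2:
        # a constant feature has no second value; treat the ratio as 0
        return 0.0
    return top_two[1][1] / top_two[0][1]


FREQ_CUT = 0.05  # same value as passed to feature_select in this notebook

# made-up feature columns
toy_features = {
    "nearly_constant": [0.0] * 99 + [1.0],  # ratio 1/99, about 0.01
    "varied": [0.1, 0.2, 0.3, 0.4],  # every value appears once, ratio 1.0
    "constant": [5.0] * 10,  # ratio 0.0
}

ratios = {name: frequency_ratio(vals) for name, vals in toy_features.items()}
kept_features = [name for name, ratio in ratios.items() if ratio >= FREQ_CUT]
```

Only `varied` survives in this toy example, since the other two columns are dominated by a single value.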

In [23]:
# Applying feature selection using pycytominer
aligned_cell_injury_fs_df = feature_select(
    profiles=shared_features_df,
    features=shared_feats,
    operation=[
        "variance_threshold",
        "drop_outliers",
        "drop_na_columns",
    ],
    freq_cut=0.05,
    outlier_cutoff=500,
    na_cutoff=0.05,
)

# split meta and feature column names
fs_injury_meta, fs_injury_feats = utils.split_meta_and_features(
    aligned_cell_injury_fs_df
)

# counting number of cell injuries
cell_injuries = aligned_cell_injury_fs_df["injury_type"].unique()

# display
print("Number of meta features", len(fs_injury_meta))
print("Number of features", len(fs_injury_feats))
print("Shape of fs shared profile", aligned_cell_injury_fs_df.shape)
print("number of cell injury types", len(cell_injuries))
print(cell_injuries)
print(aligned_cell_injury_fs_df.shape)
aligned_cell_injury_fs_df.head()

# save shared feature selected profile
aligned_cell_injury_fs_df.to_csv(
    fs_dir / "aligned_cell_injury_profile_fs.csv.gz",
    index=False,
    compression="gzip",
)

Number of meta features 33
Number of features 221
Shape of fs shared profile (16701, 254)
number of cell injury types 15
['Control' 'Cytoskeletal' 'Hsp90' 'Kinase' 'Genotoxin' 'Miscellaneous'
 'Redox' 'HDAC' 'mTOR' 'Proteasome' 'Saponin' 'Mitochondria' 'Ferroptosis'
 'Tannin' 'Nonspecific reactive']
(16701, 254)


Save the aligned feature space information while maintaining feature space order

In [24]:
# split meta and feature column names
fs_injury_meta, fs_injury_feats = utils.split_meta_and_features(
    aligned_cell_injury_fs_df
)

# saving info of feature space
jump_feature_space = {
    "name": "cell_injury",
    "n_plates": len(aligned_cell_injury_fs_df["Plate"].unique()),
    "n_meta_features": len(fs_injury_meta),
    "n_features": len(fs_injury_feats),
    "meta_features": fs_injury_meta,
    "features": fs_injury_feats,
}

# if the feature space file does not exist, create one and use this feature space downstream
selected_feature_space_path = (
    fs_dir / "aligned_cell_injury_shared_feature_space.json"
).resolve()
if not selected_feature_space_path.exists():
    print("Feature space file does not exist, creating one...")
    with open(selected_feature_space_path, mode="w") as f:
        json.dump(jump_feature_space, f)

# if it does exist, check that the features selected in this notebook match the saved ones
loaded_selected_feature_space = utils.load_json_file(selected_feature_space_path)[
    "features"
]

# check that the saved and current feature lists contain the same elements (order is not compared)
all_in_list2 = all(item in fs_injury_feats for item in loaded_selected_feature_space)
all_in_list1 = all(item in loaded_selected_feature_space for item in fs_injury_feats)
assert all_in_list2 and all_in_list1, "The lists do not contain the same elements."
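
The two `all(...)` checks above verify membership in both directions, which is equivalent to comparing the two lists as sets. Neither form checks order. If the feature order must also match the saved file, a direct list comparison can be added. A minimal sketch with made-up feature names:

```python
# made-up feature lists for illustration
saved_feats = ["feat_a", "feat_b", "feat_c"]
current_same_order = ["feat_a", "feat_b", "feat_c"]
current_shuffled = ["feat_c", "feat_a", "feat_b"]

# same members regardless of order (what the two all(...) checks test)
same_members = set(saved_feats) == set(current_shuffled)

# list equality is stricter: members and order must both match
same_order = saved_feats == current_shuffled
identical = saved_feats == current_same_order
```

Note that a set comparison also ignores duplicate entries, so it is best suited to lists that are known to contain unique feature names.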