# Improved Metadata Search

This notebook focuses on metadata search using two essential files: the annotations data extracted from the actual screening profile (available in the [IDR repository](https://github.com/IDR/idr0133-dahlin-cellpainting/tree/main/screenA)) and the metadata retrieved from the supplementary section of the [research paper](https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-023-36829-x/MediaObjects/41467_2023_36829_MOESM5_ESM.xlsx).

The objective is to identify the number of unique compounds associated with each cell injury type and then cross-reference this information with the screening profile. The aim is to assess whether the data can be used to train a machine learning model that predicts cell injury.


In [1]:
import pathlib
from collections import defaultdict

import pandas as pd

Setting up parameters below:


In [2]:
# data directory
data_dir = pathlib.Path("./data").resolve(strict=True)
results_dir = pathlib.Path("./results")
results_dir.mkdir(exist_ok=True)

# data paths
suppl_meta_path = data_dir / "41467_2023_36829_MOESM5_ESM.csv"
screen_anno_path = data_dir / "idr0133-screenA-annotation.csv"

# load data
image_profile_df = pd.read_csv(screen_anno_path)
meta_df = image_profile_df[image_profile_df.columns[:31]]
compounds_df = meta_df[["Compound Name", "Compound Class"]]

suppl_meta_df = pd.read_csv(suppl_meta_path)
cell_injury_df = suppl_meta_df[["Cellular injury category", "Compound alias"]]

Here we collect the injury types listed in the supplementary metadata, together with the compounds known to induce each of them.
We then cross-reference those compounds against the screening profile and keep the wells treated with a matching compound.


In [3]:
# getting profiles based on injury and compound type
injury_and_compounds = defaultdict(list)
for injury, compound in cell_injury_df.values.tolist():
    injury_and_compounds[injury].append(compound)

# cross-reference each injury type's compounds against the screen profile
injury_profiles = []
for injury_type, compound_list in injury_and_compounds.items():
    # copy the selection so the new column is not inserted into a slice of the original
    sel_profile = image_profile_df[
        image_profile_df["Compound Name"].isin(compound_list)
    ].copy()
    sel_profile.insert(0, "injury_type", injury_type)
    injury_profiles.append(sel_profile)
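
Because the cross-reference relies on exact string matches between `Compound alias` and `Compound Name`, an alias spelled differently in the two files would silently match no wells. The following is a minimal sketch of a sanity check for that case. The toy tables and the helper name `find_unmatched_compounds` are assumptions made for illustration; they are not part of the original notebook or of either data file.

```python
import pandas as pd

# Toy stand-ins for the two tables; the values are illustrative only.
toy_injury_df = pd.DataFrame(
    {
        "Cellular injury category": ["Cytoskeletal", "Cytoskeletal", "Hsp90"],
        "Compound alias": ["Nocodazole", "Colchicine", "Radicicol"],
    }
)
toy_screen_df = pd.DataFrame(
    {
        "Well": ["B2", "B3", "B4"],
        "Compound Name": ["Nocodazole", "DMSO", "Radicicol"],
    }
)


def find_unmatched_compounds(injury_df, screen_df):
    """Return rows of injury_df whose compound alias never appears in screen_df."""
    screened = screen_df["Compound Name"].dropna().unique()
    is_missing = ~injury_df["Compound alias"].isin(screened)
    return injury_df.loc[is_missing].reset_index(drop=True)


# Hypothetical usage on the toy tables: only Colchicine has no matching well.
unmatched_df = find_unmatched_compounds(toy_injury_df, toy_screen_df)
print(unmatched_df)
```

On the real data the same helper could be called with `cell_injury_df` and `image_profile_df`; an empty result would mean that every listed compound matched at least one well.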

In [4]:
# creating a dataframe that contains stratified screen data
strat_screen_df = pd.concat(injury_profiles)
strat_screen_df.to_csv(results_dir / "stratified_plate_screen_profile.csv", index=False)

# display df
strat_screen_df.head()

Unnamed: 0,injury_type,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,Experimental Condition [Treatment time (h)],...,Nuclei_Texture_InverseDifferenceMoment_DNA_5_0,Nuclei_Texture_InverseDifferenceMoment_RNA_5_0,Nuclei_Texture_SumAverage_AGP_5_0,Nuclei_Texture_SumAverage_DNA_10_0,Nuclei_Texture_SumAverage_Mito_5_0,Nuclei_Texture_SumAverage_RNA_5_0,Nuclei_Texture_SumEntropy_DNA_10_0,Nuclei_Texture_SumEntropy_DNA_20_0,Nuclei_Texture_SumEntropy_DNA_5_0,Nuclei_Texture_SumVariance_DNA_20_0
158,Cytoskeletal,BR00110363,E17,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,24,...,0.561075,0.139535,0.188096,-1.035562,0.655389,0.182888,-0.004066,0.130472,-0.418286,0.283484
159,Cytoskeletal,BR00110363,E18,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,24,...,0.642707,0.052501,0.130166,-1.304556,0.438742,0.187985,0.088121,0.289709,-0.451626,0.461128
160,Cytoskeletal,BR00110363,E19,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,24,...,0.599857,0.184587,0.111444,-1.462714,0.821791,0.22949,0.121207,0.165713,-0.342221,0.388047
161,Cytoskeletal,BR00110363,E20,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,24,...,0.513671,0.137843,0.165498,-1.005157,0.264772,0.169579,0.142331,0.264883,-0.161366,0.337277
162,Cytoskeletal,BR00110363,E21,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,24,...,0.402869,0.083364,0.181626,-1.068167,0.469826,0.411077,0.427186,0.45869,-0.012347,0.658387


> **Table 1:** Wells from the screening profile, stratified by injury type (first five rows shown).
> The new `injury_type` column gives the injury type assigned to each well.
> The assignment is determined by the compound with which the well was treated.
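
Since the stratified table is built by concatenating one selection per injury type, a compound listed under more than one category would place the same well in the table several times. Whether this happens in the supplementary metadata is not stated here, so the sketch below only shows how one might check. The toy table and the helper name `wells_with_multiple_injuries` are hypothetical.

```python
import pandas as pd

# Toy stratified table: well B3 on plate P1 appears under two injury types.
toy_strat_df = pd.DataFrame(
    {
        "injury_type": ["Kinase", "Kinase", "mTOR"],
        "Plate": ["P1", "P1", "P1"],
        "Well": ["B2", "B3", "B3"],
    }
)


def wells_with_multiple_injuries(strat_df):
    """Return the rows of wells that are listed under more than one injury type."""
    # Count the distinct injury types per (Plate, Well) and broadcast back to rows.
    n_types = strat_df.groupby(["Plate", "Well"])["injury_type"].transform("nunique")
    return strat_df.loc[n_types > 1]


multi_df = wells_with_multiple_injuries(toy_strat_df)
print(multi_df)
```

An empty result on `strat_screen_df` would indicate that each well carries exactly one injury label, which matters if the labels are later used as single-class training targets.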


Next, we summarize each injury type: how many wells are associated with it and how many unique compounds were used to treat those wells.


In [5]:
# getting meta information of the collected data
meta_injury = []
for df in injury_profiles:
    injury_type = df["injury_type"].unique()[0]
    n_wells = df.shape[0]
    compound_list = df["Compound Name"].unique().tolist()
    n_compounds = len(compound_list)

    meta_injury.append([injury_type, n_wells, n_compounds, compound_list])


injury_meta_df = pd.DataFrame(
    meta_injury, columns=["injury_type", "n_wells", "n_compounds", "compound_list"]
)
injury_meta_df.to_csv(results_dir / "injury_metadata.csv", index=False)
injury_meta_df

Unnamed: 0,injury_type,n_wells,n_compounds,compound_list
0,Cytoskeletal,1472,15,"[Nocodazole, Colchicine, Paclitaxel, Vinblasti..."
1,Hsp90,552,3,"[Radicicol, Geldanamycin, 17-AAG]"
2,Kinase,1104,13,"[Wortmannin, Staurosporine, PI-103, BEZ-235, A..."
3,Genotoxin,944,22,"[Camptothecin, CX-5461, Doxorubicin, Cladribin..."
4,Miscellaneous,1304,39,"[L-Buthionine-(S,R)-sulfoximine, CDDO Im, Cino..."
5,Redox,312,12,"[Menadione, PKF118-310, 4-Amino-1-naphthol (HC..."
6,HDAC,168,5,"[AR-42, SAHA, ITF 2357, Panobinostat, Apicidin]"
7,mTOR,96,2,"[Torin 1, Rapamycin]"
8,Proteasome,144,4,"[Carfilzomib, Bortezomib, (S)-MG132, (R)-MG132]"
9,Saponin,288,11,"[Digitonin, Saikosaponin A, Polygalasaponin F,..."


> **Table 2:** Summary of the wells associated with each injury type.
> For every injury type it lists the number of wells, the number of unique compounds, and the compounds responsible for that injury.
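
The per-injury counts in Table 2 can also be obtained from the stratified table with a single `groupby`, which may serve as a cross-check on the loop. This is a sketch on toy data, not the notebook's own method; the toy values are made up, and the compound list column is omitted for brevity.

```python
import pandas as pd

# Toy stratified table; the wells and assignments are illustrative only.
toy_strat_df = pd.DataFrame(
    {
        "injury_type": ["Hsp90", "Hsp90", "Hsp90", "mTOR"],
        "Well": ["B2", "B3", "B4", "C2"],
        "Compound Name": ["Radicicol", "Radicicol", "17-AAG", "Rapamycin"],
    }
)

# One row per injury type: number of wells and number of unique compounds.
summary_df = (
    toy_strat_df.groupby("injury_type", sort=False)
    .agg(
        n_wells=("Compound Name", "size"),
        n_compounds=("Compound Name", "nunique"),
    )
    .reset_index()
)
print(summary_df)
```

Passing `sort=False` keeps the injury types in their order of first appearance, which matches the ordering produced by iterating over `injury_profiles`.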


Lastly, we extract all control wells, which are treated with DMSO.


In [6]:
# getting only control wells
control_df = image_profile_df.loc[image_profile_df["Compound Name"] == "DMSO"]
control_df.to_csv(results_dir / "control_wells.csv", index=False)
control_df

Unnamed: 0,Plate,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,Experimental Condition [Treatment time (h)],Experimental Condition [Experimental Batch],...,Nuclei_Texture_InverseDifferenceMoment_DNA_5_0,Nuclei_Texture_InverseDifferenceMoment_RNA_5_0,Nuclei_Texture_SumAverage_AGP_5_0,Nuclei_Texture_SumAverage_DNA_10_0,Nuclei_Texture_SumAverage_Mito_5_0,Nuclei_Texture_SumAverage_RNA_5_0,Nuclei_Texture_SumEntropy_DNA_10_0,Nuclei_Texture_SumEntropy_DNA_20_0,Nuclei_Texture_SumEntropy_DNA_5_0,Nuclei_Texture_SumVariance_DNA_20_0
0,BR00110363,B2,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,24,1,...,0.000098,0.057244,0.160847,-0.083034,-0.023290,-0.066369,-0.015235,-0.035909,-0.013321,-0.032067
1,BR00110363,B3,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,24,1,...,0.025857,0.099848,0.017477,0.021300,0.058137,-0.097280,-0.073545,-0.044883,-0.089842,-0.015240
2,BR00110363,B4,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,24,1,...,0.041060,0.119247,0.111741,0.041592,0.224199,-0.088845,0.000327,-0.003115,0.016075,-0.014406
3,BR00110363,B5,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,24,1,...,0.022156,0.036473,-0.013141,0.008690,0.060860,0.044924,0.040528,0.070877,0.038779,0.072871
4,BR00110363,B6,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,24,1,...,0.007213,0.023068,0.110361,0.054405,0.030157,0.066480,0.038910,0.048559,0.050371,0.056829
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22962,BR00114088,O19,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,24,2,...,-0.012598,-0.070136,-0.000996,0.034415,0.134332,0.076276,0.046929,0.041168,0.038829,0.068569
22963,BR00114088,O20,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,24,2,...,-0.009897,-0.057176,-0.052223,0.026655,-0.012996,0.026551,-0.013322,-0.022516,-0.020001,-0.018153
22964,BR00114088,O21,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,24,2,...,0.039103,-0.000412,-0.226914,-0.000051,0.034653,-0.051597,-0.044620,-0.008060,-0.076127,0.003774
22965,BR00114088,O22,Homo sapiens,NCBITaxon,NCBITaxon_9606,U2OS,EFO,EFO_0002869,24,2,...,-0.037776,-0.034531,-0.204716,-0.037208,0.051886,0.002383,0.138127,0.129287,0.124966,0.164324


> **Table 3:** This DataFrame contains only the control wells.
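
When judging whether the controls are usable as a baseline, it can help to see how they are distributed across plates. The sketch below counts DMSO wells per plate on a toy table. The plate identifiers and well assignments are invented for illustration; only the `Plate` and `Compound Name` column names come from the data shown above.

```python
import pandas as pd

# Toy screening profile with two plates; values are illustrative only.
toy_profile_df = pd.DataFrame(
    {
        "Plate": ["P1", "P1", "P1", "P2", "P2"],
        "Well": ["B2", "B3", "B4", "B2", "B3"],
        "Compound Name": ["DMSO", "DMSO", "Nocodazole", "DMSO", "Radicicol"],
    }
)

# Same selection as for control_df, followed by a per-plate count.
toy_control_df = toy_profile_df.loc[toy_profile_df["Compound Name"] == "DMSO"]
controls_per_plate = toy_control_df.groupby("Plate").size()
print(controls_per_plate)
```

Applied to `control_df`, a plate missing from the resulting index would have no DMSO wells at all.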
