# Load DINO profiles

The unnormalized ("raw") DINO well profiles are in the "no_postprocessing" subfolder. Three additional subfolders hold the profiles normalized with MAD robustize, sphering, and sphering + MAD robustize. The best-performing normalization in the preprint was sphering + MAD robustize.
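The normalization code itself is not part of this notebook. As a rough illustration only, the sketch below shows one common definition of MAD robustize (center each feature on its median and divide by the scaled median absolute deviation). The function name `mad_robustize`, the `epsilon` value, and the toy data are assumptions for illustration, not the preprint's implementation.

```python
import pandas as pd


def mad_robustize(features: pd.DataFrame, epsilon: float = 1e-18) -> pd.DataFrame:
    """Center each column on its median and scale by its scaled MAD.

    Hypothetical helper: one common definition, not necessarily the one
    used to produce the normalized subfolders.
    """
    median = features.median(axis=0)
    # 1.4826 makes the MAD comparable to a standard deviation for normal data.
    mad = (features - median).abs().median(axis=0) * 1.4826
    # epsilon guards against division by zero for constant columns.
    return (features - median) / (mad + epsilon)


# Toy features (made-up values); the outlier 100.0 barely affects the scaling.
toy = pd.DataFrame(
    {
        "emb_0000": [1.0, 2.0, 3.0, 4.0, 100.0],
        "emb_0001": [0.0, 1.0, 2.0, 3.0, 4.0],
    }
)
robust = mad_robustize(toy)
```

In practice such a transform is often fitted per plate or on control wells only; which reference population was used here is not stated in this notebook.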

Some notes:

- Only 241 'plate_type = orf' plates were processed.
- 758 wells have all-NaN embedding values, indicating that they were empty wells.
  - These all-NaN wells were omitted before normalization, but the raw embedding file still contains them.
- The full embedding dimensionality is 384 (standard ViT-S).
- The normalized embeddings have fewer features (< 384) because a variance threshold was applied before normalization.
  - This was necessary for sphering and sphering + MAD robustize.
- In the normalized embeddings, a small fraction of wells dropped out because their metadata was missing from the previous version of the CellProfiler data in S3. You can recover the missing wells from the raw DINO embeddings ("no_postprocessing"), which have only ['batch', 'plate', 'well'] metadata columns.
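The notes on empty wells and dropped wells can be sketched on toy data as follows. The two small tables stand in for the raw and normalized files, and the values are made up. The `batch`, `plate`, `well`, and `emb_` column names follow the notes above. Using a merge with `indicator=True` as an anti-join is just one way to list the wells present in the raw table but absent from the normalized one.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw ("no_postprocessing") table.
raw = pd.DataFrame(
    {
        "batch": ["b1"] * 4,
        "plate": ["P1"] * 4,
        "well": ["A01", "A02", "A03", "A04"],
        "emb_0000": [0.1, np.nan, 0.3, 0.4],
        "emb_0001": [1.0, np.nan, 3.0, 4.0],
    }
)

# Toy stand-in for a normalized table in which well A04 dropped out.
normalized = pd.DataFrame(
    {"batch": ["b1", "b1"], "plate": ["P1", "P1"], "well": ["A01", "A03"]}
)

keys = ["batch", "plate", "well"]
emb_cols = [c for c in raw.columns if c.startswith("emb_")]

# Empty wells: every embedding value is NaN.
all_nan = raw[emb_cols].isna().all(axis=1)
non_empty = raw.loc[~all_nan]

# Anti-join: non-empty raw wells with no match in the normalized table.
merged = non_empty.merge(normalized[keys], on=keys, how="left", indicator=True)
missing = merged.loc[merged["_merge"] == "left_only", keys].reset_index(drop=True)
```

Here `missing` lists the wells that would need to be renormalized and re-annotated before they could be added back to a normalized table.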

In [1]:
import pathlib
import pandas as pd

In [2]:
data_level = "sphering_mad_robustize"

# aws s3 sync --no-sign-request s3://cellpainting-gallery/test-cpg0016-jump/source_4/workspace_dl/consensus_original input/DINO_MorphMap/
data_dir = pathlib.Path(f"input/DINO_MorphMap/{data_level}")

profiles = pd.read_csv(data_dir / "well_features.csv.gz")


In [3]:
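# Let pandas infer nullable dtypes (e.g. string) for the metadata columns.
profiles = profiles.convert_dtypes()

# Keep the embedding features as plain floats.
for column in profiles.columns:
    if column.startswith("emb_"):
        profiles[column] = profiles[column].astype(float)

# Capitalize the key columns, then prefix every non-embedding column with "Metadata_".
profiles = profiles.rename(columns={"well": "Well", "plate": "Plate", "batch": "Batch"})

profiles.columns = [
    "Metadata_" + column if not column.startswith("emb_") else column
    for column in profiles.columns
]

# Keep only the embeddings plus the plate and well identifiers.
profiles = profiles.filter(regex="^emb|^Metadata_Plate|^Metadata_Well")

# Tag every row with its data source (source_4, as in the S3 path above).
profiles.insert(0, "Metadata_Source", "source_4")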

profiles.head()

,Metadata_Source,Metadata_Well,Metadata_Plate,emb_0000,emb_0001,emb_0003,emb_0005,emb_0007,emb_0008,emb_0010,...,emb_0370,emb_0371,emb_0372,emb_0374,emb_0376,emb_0377,emb_0378,emb_0379,emb_0380,emb_0381
0,source_4,A01,BR00126394,-2.561849,-1.042774,0.413569,-2.040438,1.974785,0.628373,2.825115,...,1.391231,1.072742,1.070098,-2.524422,-2.203632,0.822729,1.619326,-2.723403,-3.069529,1.083985
1,source_4,A02,BR00126394,-0.779505,0.463716,0.796192,-0.394054,-0.344869,0.007658,0.145809,...,0.273735,0.975894,0.611825,-0.50713,-0.011007,0.568512,2.185838,0.376129,-0.257927,-0.892569
2,source_4,A03,BR00126394,-0.239596,-0.074533,-0.038069,-2.328869,0.824137,-1.349908,0.282123,...,1.018676,-0.347542,0.407167,-0.809827,-0.458211,-0.326297,-0.310142,-0.068883,-1.314599,-0.136548
3,source_4,A04,BR00126394,0.979042,0.447647,-1.249208,-1.112229,0.060602,0.840982,-0.926625,...,0.45722,0.385088,-0.243921,0.723454,1.068393,0.718475,-0.126813,-0.651767,-1.129037,1.112781
4,source_4,A05,BR00126394,-0.132727,-0.27365,0.093046,-1.131262,1.41714,1.043113,0.565158,...,-0.26044,0.262585,1.013874,0.053362,-0.268365,-0.752636,0.714139,-0.197719,-0.417125,-0.677995
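The gaps in the `emb_` column names above (for example, there is no `emb_0002` or `emb_0004`) are presumably the features removed before normalization. A minimal sketch for listing them, assuming the four-digit suffix is the index into the original 384-dimensional embedding; the toy frame below only mimics the first few column names:

```python
import pandas as pd

# Toy stand-in for the normalized profiles: only the column names matter here.
profiles = pd.DataFrame(
    columns=[
        "Metadata_Source",
        "Metadata_Well",
        "Metadata_Plate",
        "emb_0000",
        "emb_0001",
        "emb_0003",
        "emb_0005",
    ]
)

full_dim = 384  # full ViT-S embedding size, as stated in the notes

# Parse the numeric suffix of every embedding column.
kept = sorted(int(c.split("_")[1]) for c in profiles.columns if c.startswith("emb_"))

# Indices in 0..383 with no corresponding column were dropped.
dropped = sorted(set(range(full_dim)) - set(kept))
```

On the real table, `len(kept)` gives the number of retained features and `dropped` the indices that did not survive.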


In [4]:
pathlib.Path("output").mkdir(parents=True, exist_ok=True)  # make sure the target folder exists
profiles.to_parquet(f"output/{data_level}_profiles.parquet", index=False)
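A few sanity checks can be worth running on `profiles` around this step. The sketch below uses a toy table with made-up values and checks that the key columns identify each well uniquely, that no NaN embedding values remain, and that all embedding columns are floats. Which checks are appropriate for the real data is an assumption, not something stated in this notebook.

```python
import pandas as pd

# Toy stand-in for the final `profiles` table (values are made up).
profiles = pd.DataFrame(
    {
        "Metadata_Source": ["source_4"] * 3,
        "Metadata_Well": ["A01", "A02", "A01"],
        "Metadata_Plate": ["P1", "P1", "P2"],
        "emb_0000": [0.1, -0.2, 0.3],
        "emb_0001": [1.0, 0.5, -0.5],
    }
)

key_cols = ["Metadata_Source", "Metadata_Plate", "Metadata_Well"]
emb_cols = [c for c in profiles.columns if c.startswith("emb_")]

# Each (source, plate, well) combination should appear exactly once.
n_duplicate_keys = int(profiles.duplicated(subset=key_cols).sum())

# Normalized profiles are expected to contain no NaN embedding values.
n_nan_values = int(profiles[emb_cols].isna().sum().sum())

# Embedding columns should all be floating point after the cast in cell 3.
all_float = all(pd.api.types.is_float_dtype(profiles[c]) for c in emb_cols)
```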