# Loading profiles from the JUMP Cell Painting Datasets  
This notebook loads precomputed features (profiles) and the associated metadata for a small number of plates.
## Import libraries

In [1]:
import io
import pandas as pd
import plotly.express as px
import plotly.io as pio
import numpy as np

pio.renderers.default = "notebook_connected"  # Set to "svg" or "png" for static plots or "notebook_connected" for interactive plots


## Helper functions

In [4]:
profile_formatter = (
    "s3://cellpainting-gallery/cpg0016-jump/"
    "{Metadata_Source}/workspace/profiles/"
    "{Metadata_Batch}/{Metadata_Plate}/{Metadata_Plate}.parquet"
)

loaddata_formatter = (
    "s3://cellpainting-gallery/cpg0016-jump/"
    "{Metadata_Source}/workspace/load_data_csv/"
    "{Metadata_Batch}/{Metadata_Plate}/load_data_with_illum.parquet"
)
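
As a quick illustration of how these templates are meant to be filled in, the sketch below formats the profile template with one plate's metadata. The values are those of an `ORF` plate listed in `plate.csv`; extra keys such as `Metadata_PlateType` are simply ignored by `str.format`.

```python
# Same template as profile_formatter above, repeated so the snippet is self-contained.
profile_formatter = (
    "s3://cellpainting-gallery/cpg0016-jump/"
    "{Metadata_Source}/workspace/profiles/"
    "{Metadata_Batch}/{Metadata_Plate}/{Metadata_Plate}.parquet"
)

# One row of plate metadata, as a plain dict.
plate_row = {
    "Metadata_Source": "source_4",
    "Metadata_Batch": "2021_04_26_Batch1",
    "Metadata_Plate": "BR00117035",
    "Metadata_PlateType": "ORF",  # not used by the template; extra keys are ignored
}

example_path = profile_formatter.format(**plate_row)
print(example_path)
```

This is the same call that the loading loop later makes with `row.to_dict()`.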


## Load metadata

The following files contain the metadata information for the entire dataset.
The schema is [here](metadata/README.md).

In [5]:
plates = pd.read_csv("metadata/plate.csv")
wells = pd.read_csv("metadata/well.csv")
compound = pd.read_csv("metadata/compound.csv")
orf = pd.read_csv("metadata/orf.csv")


In [6]:
np.unique(plates.Metadata_PlateType)

array(['COMPOUND', 'COMPOUND_EMPTY', 'DMSO', 'ORF', 'POSCON8', 'TARGET1',
       'TARGET2'], dtype=object)

## Sample plates
Let's select all plates of a certain type (encoded in `Metadata_PlateType`); each plate also records its data-generating center in `Metadata_Source`. A random subset of these plates is drawn later, before loading the images. Note that only 10 out of the 13 sources are currently available and `source_1` does not have the plate type being queried below.

In [7]:
sample = (
    plates.query('Metadata_PlateType=="ORF"')
)
sample

Unnamed: 0,Metadata_Source,Metadata_Batch,Metadata_Plate,Metadata_PlateType
989,source_4,2021_04_26_Batch1,BR00117035,ORF
990,source_4,2021_04_26_Batch1,BR00117036,ORF
991,source_4,2021_04_26_Batch1,BR00117037,ORF
992,source_4,2021_04_26_Batch1,BR00117038,ORF
993,source_4,2021_04_26_Batch1,BR00117039,ORF
...,...,...,...,...
1256,source_4,2021_08_30_Batch13,BR00124781,ORF
1257,source_4,2021_08_30_Batch13,BR00124782,ORF
1258,source_4,2021_08_30_Batch13,BR00126062,ORF
1259,source_4,2021_08_30_Batch13,BR00126404,ORF
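
To draw a fixed number of plates from each source rather than keeping every match, one option is `DataFrameGroupBy.sample` (pandas 1.1 or later). The sketch below uses a small made-up plate table with hypothetical source, batch, and plate names; it is not the selection used to produce the outputs in this notebook.

```python
import pandas as pd

# Toy stand-in for plate.csv; all names are hypothetical.
toy_plates = pd.DataFrame(
    {
        "Metadata_Source": ["source_A"] * 3 + ["source_B"] * 3,
        "Metadata_Batch": ["batch_1"] * 6,
        "Metadata_Plate": [f"plate_{i}" for i in range(6)],
        "Metadata_PlateType": ["ORF", "ORF", "ORF", "ORF", "ORF", "DMSO"],
    }
)

# Keep one plate type, then draw two plates per source (reproducibly).
per_source_sample = (
    toy_plates.query('Metadata_PlateType == "ORF"')
    .groupby("Metadata_Source")
    .sample(2, random_state=42)
)
print(per_source_sample)
```

A source with fewer than two plates of the requested type would raise a `ValueError` unless `replace=True` is passed.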


In [19]:
np.unique(sample.Metadata_Batch)

array(['2021_04_26_Batch1', '2021_05_10_Batch3', '2021_05_17_Batch4',
       '2021_05_31_Batch2', '2021_06_07_Batch5', '2021_06_14_Batch6',
       '2021_06_21_Batch7', '2021_07_12_Batch8', '2021_07_26_Batch9',
       '2021_08_02_Batch10', '2021_08_09_Batch11', '2021_08_23_Batch12',
       '2021_08_30_Batch13'], dtype=object)

`TARGET2` plates are "sentinel" plates that are run in each batch. More on all this in future updates.

## Loading profiles
Now let's load the profiles from these plates.

Setting `columns = None` below will load all of the features.

<div class="alert alert-warning">
WARNING: Files are located in S3. This loop loads only two features for each sampled plate; loading many features and/or many plates can take several minutes.
</div>

In [15]:
dframes = []
columns = [
    "Metadata_Source",
    "Metadata_Plate",
    "Metadata_Well",
    "Cells_AreaShape_Eccentricity",
    "Nuclei_AreaShape_Area",
]
for _, row in sample.iterrows():
    s3_path = profile_formatter.format(**row.to_dict())
    dframes.append(
        pd.read_parquet(s3_path, storage_options={"anon": True}, columns=columns)
    )
dframes = pd.concat(dframes)


In [16]:
len(dframes)

768

Each row in `dframes` is a well-level profile. The full profile contains thousands of features (n=4762), each averaged over (typically) a couple of thousand cells per well; only the two features selected above were loaded here.
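
To make the idea of a well-level profile concrete, the sketch below averages made-up single-cell measurements within each well. The cell values are invented, and a plain mean is an assumption here; the aggregation actually used to build the JUMP profiles may differ.

```python
import pandas as pd

# Invented single-cell measurements for two wells.
cells = pd.DataFrame(
    {
        "Metadata_Plate": ["plate_0"] * 5,
        "Metadata_Well": ["A01", "A01", "A01", "A02", "A02"],
        "Nuclei_AreaShape_Area": [1000.0, 1100.0, 1200.0, 900.0, 1100.0],
    }
)

# One row per well: the mean of each feature over the cells in that well.
well_profiles = cells.groupby(
    ["Metadata_Plate", "Metadata_Well"], as_index=False
).mean()
print(well_profiles)
```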

## Join features with metadata

The profiles are annotated with only three columns of metadata (source, plate, well).

Let's add more metadata!

In [17]:
metadata = compound.merge(wells, on="Metadata_JCP2022")
ann_dframe = metadata.merge(
    dframes, on=["Metadata_Source", "Metadata_Plate", "Metadata_Well"]
)


We now know a little bit more about each profile:

In [18]:
ann_dframe.sample(2, random_state=42)


Unnamed: 0,Metadata_JCP2022,Metadata_InChIKey,Metadata_InChI,Metadata_Source,Metadata_Plate,Metadata_Well,Cells_AreaShape_Eccentricity,Nuclei_AreaShape_Area
29,JCP2022_999999,,,source_4,BR00126709,O22,0.78466,1055.0
15,JCP2022_999999,,,source_4,BR00123528B,L21,0.7824,1107.5


More metadata information will be added in the future. 
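
Because `merge` defaults to an inner join, profiles without a matching metadata row are dropped silently. One way to check for this is pandas' `indicator` option, sketched below on made-up tables; the IDs, wells, and plate names are hypothetical.

```python
import pandas as pd

keys = ["Metadata_Source", "Metadata_Plate", "Metadata_Well"]

# Hypothetical well-level profiles (three wells).
toy_profiles = pd.DataFrame(
    {
        "Metadata_Source": ["source_A"] * 3,
        "Metadata_Plate": ["plate_0"] * 3,
        "Metadata_Well": ["A01", "A02", "A03"],
        "Nuclei_AreaShape_Area": [1055.0, 1107.5, 1077.3],
    }
)

# Hypothetical metadata that only covers two of the three wells.
toy_metadata = pd.DataFrame(
    {
        "Metadata_Source": ["source_A"] * 2,
        "Metadata_Plate": ["plate_0"] * 2,
        "Metadata_Well": ["A01", "A02"],
        "Metadata_JCP2022": ["JCP2022_000001", "JCP2022_000002"],
    }
)

# A left join keeps every profile; `_merge` records whether metadata was found.
checked = toy_profiles.merge(toy_metadata, on=keys, how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(checked)} profiles have no metadata")
```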

## Plot features


The scatterplot below contains every well in the sampled dataset.

In the interactive plot (see the `pio.renderers.default` setting above), you can hover over the points to see the JCP ID and the InChIKey for a given compound.

<div class="alert alert-warning">
NOTE: Because these are raw, unnormalized features, you will notice discernible clusters corresponding to each source due to batch effects.
Upcoming data releases will include normalized features, where these effects are mitigated to some extent.
</div>

In [19]:
px.scatter(
    ann_dframe,
    x="Cells_AreaShape_Eccentricity",
    y="Nuclei_AreaShape_Area",
    color="Metadata_Source",
    hover_name="Metadata_JCP2022",
    hover_data=["Metadata_InChIKey"],
)


So that's just a couple of (raw) measurements from the sampled plates, for the principal dataset alone.
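
One simple way to see how much of the separation between sources is a per-source offset is to standardize each feature within its source before plotting. This is only an illustrative sketch on invented numbers with hypothetical source names; it is not the normalization planned for the upcoming data releases, which is not described here.

```python
import pandas as pd

# Invented raw feature values from two hypothetical sources with different offsets.
raw = pd.DataFrame(
    {
        "Metadata_Source": ["source_A"] * 3 + ["source_B"] * 3,
        "Nuclei_AreaShape_Area": [1000.0, 1100.0, 1200.0, 500.0, 600.0, 700.0],
    }
)

# Z-score within each source: subtract the source mean, divide by the source std.
grouped = raw.groupby("Metadata_Source")["Nuclei_AreaShape_Area"]
raw["Nuclei_AreaShape_Area_z"] = (
    raw["Nuclei_AreaShape_Area"] - grouped.transform("mean")
) / grouped.transform("std")
print(raw)
```

After this step both toy sources are centered at zero with unit spread, so any remaining structure is not a simple shift or scale between sources.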

## Load images

[LoadData](https://cellprofiler-manual.s3.amazonaws.com/CPmanual/LoadData.html) CSV files provide the metadata associated with the images to be processed.

In [9]:
sample = sample.sample(2)
load_data = []
for _, row in sample.iterrows():
    s3_path = loaddata_formatter.format(**row.to_dict())
    load_data.append(pd.read_parquet(s3_path, storage_options={"anon": True}))
load_data = pd.concat(load_data)


Let's pick a row at random and inspect it.

In [10]:
sample_loaddata = load_data.sample(1, random_state=42)
pd.melt(sample_loaddata)

Unnamed: 0,variable,value
0,Metadata_Source,source_4
1,Metadata_Batch,2021_08_23_Batch12
2,Metadata_Plate,BR00126709
3,Metadata_Well,K05
4,Metadata_Site,7
5,FileName_IllumAGP,BR00126709_IllumAGP.npy
6,FileName_IllumDNA,BR00126709_IllumDNA.npy
7,FileName_IllumER,BR00126709_IllumER.npy
8,FileName_IllumMito,BR00126709_IllumMito.npy
9,FileName_IllumRNA,BR00126709_IllumRNA.npy


The `Metadata_` columns can be used to link the images to profiles.
Let's pick a profile and view its corresponding image.

In [20]:
sample_profile = ann_dframe.sample(1, random_state=22)
sample_profile.melt()

Unnamed: 0,variable,value
0,Metadata_JCP2022,JCP2022_999999
1,Metadata_InChIKey,
2,Metadata_InChI,
3,Metadata_Source,source_4
4,Metadata_Plate,BR00126709
5,Metadata_Well,O21
6,Cells_AreaShape_Eccentricity,0.78095
7,Nuclei_AreaShape_Area,1077.3


First, link the profile to its images.
These are well-level profiles, and each well typically has 9 imaged sites.

In [21]:
sample_linked = pd.merge(
    load_data, sample_profile, on=["Metadata_Source", "Metadata_Plate", "Metadata_Well"]
)
sample_linked[['Metadata_Well', 'Metadata_Site']]

Unnamed: 0,Metadata_Well,Metadata_Site
0,O21,1
1,O21,2
2,O21,3
3,O21,4
4,O21,5
5,O21,6
6,O21,7
7,O21,8
8,O21,9
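
A quick sanity check on the link is to count the imaged sites per well. The sketch below does this on an invented LoadData-like table; the source, plate, and well names and the missing site are hypothetical.

```python
import pandas as pd

keys = ["Metadata_Source", "Metadata_Plate", "Metadata_Well"]

# Invented LoadData-like rows: well A01 has 9 sites, well A02 is missing one.
toy_load_data = pd.DataFrame(
    {
        "Metadata_Source": ["source_A"] * 17,
        "Metadata_Plate": ["plate_0"] * 17,
        "Metadata_Well": ["A01"] * 9 + ["A02"] * 8,
        "Metadata_Site": list(range(1, 10)) + list(range(1, 9)),
    }
)

# Number of distinct sites per well, then the wells that deviate from 9.
sites_per_well = toy_load_data.groupby(keys)["Metadata_Site"].nunique()
incomplete = sites_per_well[sites_per_well != 9]
print(incomplete)
```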


Inspect details of a single site for this profile

In [22]:
sample_linked.iloc[:1].melt()

Unnamed: 0,variable,value
0,Metadata_Source,source_4
1,Metadata_Batch,2021_08_23_Batch12
2,Metadata_Plate,BR00126709
3,Metadata_Well,O21
4,Metadata_Site,1
5,FileName_IllumAGP,BR00126709_IllumAGP.npy
6,FileName_IllumDNA,BR00126709_IllumDNA.npy
7,FileName_IllumER,BR00126709_IllumER.npy
8,FileName_IllumMito,BR00126709_IllumMito.npy
9,FileName_IllumRNA,BR00126709_IllumRNA.npy


Now load and display a single channel (DNA) of this 5-channel image.

In [24]:
from io import BytesIO

import boto3
from botocore import UNSIGNED
from botocore.config import Config
from matplotlib import image as mpimg
from matplotlib import pyplot as plt

# Join the S3 folder and file name with "/" (os.path.join would use "\" on Windows).
first_site = sample_linked.iloc[0]
image_url = first_site.PathName_OrigDNA.rstrip("/") + "/" + first_site.FileName_OrigDNA
print(image_url)

# The bucket is public, so use an unsigned (anonymous) client; a default client
# raises NoCredentialsError when no AWS credentials are configured.
s3_client = boto3.client("s3", config=Config(signature_version=UNSIGNED))
bucket, key = image_url.replace("s3://", "", 1).split("/", 1)
response = s3_client.get_object(Bucket=bucket, Key=key)
image = mpimg.imread(BytesIO(response["Body"].read()), format="tiff")

plt.imshow(image, cmap="gray")


s3://cellpainting-gallery/cpg0016-jump/source_4/images/2021_08_23_Batch12/images/BR00126709__2021-09-02T03_20_56-Measurement1/Images/r15c21f01p01-ch5sk1fk1fl1.tiff
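
Splitting an `s3://` URL into the bucket and key expected by `get_object` can also be done with the standard library's `urlparse` instead of manual string splitting. A minimal sketch, using the image URL printed above:

```python
from urllib.parse import urlparse

image_url = (
    "s3://cellpainting-gallery/cpg0016-jump/source_4/images/"
    "2021_08_23_Batch12/images/"
    "BR00126709__2021-09-02T03_20_56-Measurement1/Images/"
    "r15c21f01p01-ch5sk1fk1fl1.tiff"
)

parsed = urlparse(image_url)
bucket = parsed.netloc         # the bucket name sits where a hostname would be
key = parsed.path.lstrip("/")  # everything after the bucket, without the leading slash
print(bucket)
print(key)
```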

There's a lot more to come! We will add more example notebooks as we go.