
Add whitening normalization to this repo #38

Closed
gwaybio opened this issue May 15, 2020 · 15 comments

@gwaybio
Member

gwaybio commented May 15, 2020

The profiles deposited in #34 do not include whitening normalization. Previously (see #4 (comment)), I elected to leave the whitened data to a future data upload because of this caveat:

Pycytominer currently does have a whiten implementation, and I applied it to the two 4a profiles in a test case. The test case did not go smoothly, so it is likely I will need to tinker with the pycytominer implementation a bit (hard to estimate how long the delay will be).

@shntnu also notes in #4 (comment)

Going forward, we will very likely produce at least two different Level 4a profiles

  • whole-well z-scored
  • DMSO z-scored

because depending on the layout, one might be better than the other.

We will then produce corresponding 4b (normalized feature selected) versions of the two 4a profiles.
We will also produce corresponding 4w (normalized and whitened) versions of the two 4a profiles.
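For concreteness, the two 4a variants differ only in which wells the z-score location and scale are estimated from. A toy pandas sketch (the column names are illustrative, not this repo's pipeline code):

```python
import numpy as np
import pandas as pd

# Toy plate: 384 wells, 24 of them DMSO (column names are made up)
rng = np.random.default_rng(1)
features = ["Cells_AreaShape_Area", "Nuclei_Intensity_MeanIntensity"]
df = pd.DataFrame(rng.normal(loc=5.0, scale=2.0, size=(384, 2)), columns=features)
df["Metadata_broad_sample"] = np.where(np.arange(384) % 16 == 0, "DMSO", "compound")

# Whole-well z-score: mean and std estimated from every well on the plate
whole_well = (df[features] - df[features].mean()) / df[features].std()

# DMSO z-score: mean and std estimated from the DMSO wells only,
# then applied to all wells
dmso = df.loc[df["Metadata_broad_sample"] == "DMSO", features]
dmso_scored = (df[features] - dmso.mean()) / dmso.std()
```

With a well-behaved layout the two give similar results; with plate effects or unbalanced layouts they can diverge, which is why both variants are planned.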

@gwaybio
Member Author

gwaybio commented Sep 15, 2020

Whitening has been fixed in pycytominer version

@AdeboyeML

@shntnu @gwaygenomics - What whitening method should we use for the normalization of the profiles?

  • There are four whitening methods: PCA, PCA-cor, ZCA-cor and ZCA to choose from.

  • So far, I have tried all four methods on two profiles. I realized that, in order to use PCA-cor and ZCA-cor without getting the error "Divide by zero error, make sure low variance columns are removed", all columns containing only zero (0.0) values have to be dropped prior to whitening.

  • Also, I realized that after applying PCA and ZCA, a few (1 or 2) of the normalized columns came back with all zeros (0.0) as their values.
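For reference, these four procedures correspond to the whitening families described by Kessy, Lewandowski, and Strimmer in "Optimal Whitening and Decorrelation"; ZCA-cor applies W = P^{-1/2} V^{-1/2}, where P is the correlation matrix and V the diagonal matrix of variances. A minimal NumPy sketch of the math (an illustration only, not pycytominer's implementation) also makes the divide-by-zero failure mode explicit: V^{-1/2} divides by each feature's standard deviation, so constant columns must be dropped first.

```python
import numpy as np

def zca_cor_whiten(X):
    """ZCA-cor whitening: W = P^{-1/2} V^{-1/2} (illustrative sketch)."""
    Xc = X - X.mean(axis=0)
    variances = Xc.var(axis=0, ddof=1)
    if np.any(variances == 0):
        # This is the failure mode behind the divide-by-zero error:
        # a constant (zero-variance) column has no defined 1/sqrt(var)
        raise ValueError("remove zero-variance columns before whitening")
    V_inv_sqrt = np.diag(1.0 / np.sqrt(variances))
    P = np.corrcoef(Xc, rowvar=False)
    # P^{-1/2} via the eigendecomposition of the correlation matrix
    theta, G = np.linalg.eigh(P)
    P_inv_sqrt = G @ np.diag(1.0 / np.sqrt(theta)) @ G.T
    W = P_inv_sqrt @ V_inv_sqrt
    return Xc @ W.T

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # correlated features
Xw = zca_cor_whiten(X)
# The whitened covariance is the identity (up to floating point error)
```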

@gwaybio
Member Author

gwaybio commented Sep 22, 2020

getting this error "Divide by zero error, make sure low variance columns are removed", all columns with zero (0.0) values have to be dropped prior to whitening.

@AdeboyeML and I walked through this issue yesterday. The error is raised here. As I was writing the whitening methods, I noticed that the transformation fails when there are low variance features present.

Decision

Because of this error, let's form the whitening profiles using level 4b data instead of level 4a data (description of data levels).

@gwaybio
Member Author

gwaybio commented Sep 22, 2020

Pending

What whitening method should we use for the normalization of the profiles

This is also a two-part question (the answer is probably the same for both):

  1. What whitening method should we use in this repo?
  2. What whitening method should we use as a reasonable default in pycytominer? (see Modify normalize.py for easier usage of whitening method cytomining/pycytominer#96)

@niranjchandrasekaran - I know you've done extensive testing on whitening variations. I also have UMAP profiles from one plate transformed using the different strategies (see below). Do you have a strong recommendation?

UMAP Coordinates of Four Whitening Methods

test_whitening.pdf

Click to see code that generated the pdf of figures
import umap
import pandas as pd
import plotnine as gg
from pycytominer import normalize
from pycytominer.cyto_utils import infer_cp_features

# Load data
commit = "da8ae6a3bc103346095d61b4ee02f08fc85a5d98"
batch = "2016_04_01_a549_48hr_batch1"
plate = "SQ00014812"
profile_file = f"{plate}_normalized_feature_select.csv.gz"

base_url = "https://github.com/broadinstitute/lincs-cell-painting/raw/"
url = f"{base_url}{commit}/profiles/{batch}/{plate}/{profile_file}"
df = pd.read_csv(url)

# Apply transformations, UMAP transform, and plot
plotlist = []
for method in ["PCA", "ZCA", "PCA-cor", "ZCA-cor", "mad_robustize"]:
    for dmso_norm in [True, False]:
        if dmso_norm:
            samples = "Metadata_broad_sample == 'DMSO'"
            label = "DMSO normalized"
        else:
            samples = "all"
            label = "All samples normalized"

        if method == "mad_robustize":
            transform = "mad_robustize"
            label = f"MAD Robustize\n{label}"
        else:
            transform = "whiten"
            label = f"{method} Whitening\n{label}"

        normalize_df = normalize(
            df,
            features="infer",
            meta_features="infer",
            samples=samples,
            method=transform,
            output_file="none",
            compression=None,
            float_format=None,
            whiten_center=False,
            whiten_method=method
        )

        cp_features = infer_cp_features(normalize_df)
        meta_features = infer_cp_features(normalize_df, metadata=True)
        
        # Apply UMAP
        reducer = umap.UMAP(random_state=123)
        embedding_df = reducer.fit_transform(normalize_df.loc[:, cp_features])

        embedding_df = pd.DataFrame(embedding_df)
        embedding_df.columns = ["x", "y"]
        embedding_df = pd.concat(
            [
                normalize_df.loc[:, meta_features],
                embedding_df
            ],
            axis="columns"
        )
        embedding_df = embedding_df.assign(dmso_label="DMSO")
        embedding_df.loc[embedding_df.Metadata_broad_sample != "DMSO", "dmso_label"] = "compound"
        embedding_gg = (
            gg.ggplot(embedding_df, gg.aes(x="x", y="y")) +
            gg.geom_point(gg.aes(size="Metadata_mg_per_ml", color="Metadata_broad_sample"), alpha=0.5) +
            gg.facet_grid("~dmso_label") +
            gg.ggtitle(label) +
            gg.theme_bw() +
            gg.theme(legend_position="none", strip_background=gg.element_rect(colour="black", fill="#fdfff4"))
        )
        plotlist.append(embedding_gg)
        
gg.save_as_pdf_pages(plotlist, "test_whitening.pdf")

Based on these qualitative results, I think we should definitely normalize using DMSO profiles. The other options (PCA, PCA-cor, ZCA, ZCA-cor) are less clear.

@niranjchandrasekaran
Member

@gwaygenomics ZCA-cor has been my go-to method in the JUMP-CP pilots. DMSO-based standardization has also worked quite well. PCA-based whitening hasn't worked well in my hands, though it may have something to do with how it was applied to the data (plate-wise, platemap-wise, or across all plates).

Based on my experience with the pilots, I would say that either DMSO-based standardization or ZCA-cor can be offered as the default method in pycytominer.

@gwaybio
Member Author

gwaybio commented Sep 23, 2020

Thanks @niranjchandrasekaran

it may have something to do with how it was applied to the data (plate-wise, platemap-wise, or across all plates).

Interesting! We are planning on doing plate-wise whitening - I don't see a benefit of platemap-wise normalization, but perhaps I am missing a key piece.
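To make "plate-wise" concrete: the transform is fit separately within each plate and applied only to that plate's wells. A toy sketch with a pandas groupby (plain ZCA for brevity; Metadata_Plate and the feature names are illustrative, and this is not the repo's actual pipeline code):

```python
import numpy as np
import pandas as pd

def zca_whiten_block(features: pd.DataFrame) -> pd.DataFrame:
    """ZCA-whiten one plate's feature block: W = Sigma^{-1/2} (illustrative)."""
    Xc = features - features.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    W = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    out = Xc.to_numpy() @ W.T
    return pd.DataFrame(out, index=features.index, columns=features.columns)

# Toy data: two plates stacked in one frame
rng = np.random.default_rng(42)
feature_cols = ["f1", "f2", "f3"]
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=feature_cols)
df["Metadata_Plate"] = np.repeat(["SQ00014812", "SQ00014813"], 100)

# Plate-wise whitening: each plate's covariance is estimated separately
whitened = (
    df.groupby("Metadata_Plate", group_keys=False)[feature_cols]
    .apply(zca_whiten_block)
)
```

Platemap-wise application would simply group by the platemap identifier instead, pooling all plates that share a layout.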

Based on my experience with the pilots, I would say that either DMSO based standardization or ZCA-cor can be offered as the default method in pycytominer.

Cool. I believe ZCA-cor = good performance and PCA = bad performance is also what @AdeboyeML observed. We should also apply ZCA-cor using only DMSO profiles in the lincs dataset. Like this:

whitened_df = normalize(
    normalized_feature_selected_df,  # For each of the two level 4b profiles
    features="infer",
    meta_features="infer",
    samples="Metadata_broad_sample == 'DMSO'",  # This is the key arg to learn the whiten transform using only DMSO
    method="whiten",
    whiten_center=False,
    whiten_method="ZCA-cor"
)

@gwaybio
Member Author

gwaybio commented Sep 23, 2020

We decided today at profiling checkin that ZCA-cor against DMSO profiles per-plate is the way to go

@AdeboyeML

@gwaygenomics Yes, ZCA-cor will be used as default for the whitening.

  • I just want to clarify that when I set samples="Metadata_broad_sample == 'DMSO'" and whiten_center=False as the normalization parameters for the level 4b data (the normalized_feature_select_DMSO and normalized_feature_select profiles), it doesn't give the expected decorrelation result.

I think it is best to set samples="all" and whiten_center=True as the defaults.

@gwaybio
Member Author

gwaybio commented Sep 23, 2020

@AdeboyeML - when you visualize the heatmaps, are you looking at only the DMSO profiles? We do not expect to see a decorrelated result in the full plate.

Also, can you post the resulting heatmap in this issue? It'll be great to refer back to in the future, for our future selves!

@AdeboyeML

@gwaygenomics - So I am looking at both the normalized_feature_select_DMSO and normalized_feature_select profiles.

  • For the heatmap, I selected the first 14 features from each profile, since it is not feasible to view all of the morphological features in one heatmap; it would be too cluttered.

- Before Whitening:

newplot (67)

- After ZCA-cor Whitening -- using the below parameters:

whitened_df = normalize(
    normalized_feature_selected_df,  # For each of the two level 4b profiles
    features="infer",
    meta_features="infer",
    samples="Metadata_broad_sample == 'DMSO'",  # This is the key arg to learn the whiten transform using only DMSO
    method="whiten",
    whiten_center=False,
    whiten_method="ZCA-cor"
)

newplot (65)

- After ZCA-cor Whitening -- using the below parameters:

whitened_df = normalize(
    normalized_feature_selected_df,  # For each of the two level 4b profiles
    features="infer",
    meta_features="infer",
    samples="all",
    method="whiten",
    whiten_center=True,
    whiten_method="ZCA-cor"
)

newplot (66)

@gwaybio
Member Author

gwaybio commented Sep 23, 2020

So I am looking at both the normalized_feature_select_DMSO and normalized_feature_select profiles.

great, this is exactly what we want to ultimately do.

My question is, which profiles are you using to generate the heatmap? In each of the two files (normalized_feature_select_DMSO and normalized_feature_select profiles) there are 384 profiles. Only a small portion of them (~20 I think) are treated with DMSO (negative control). We should be building the heatmap with the two profiles subset to only DMSO treatment wells when using normalize() with samples="Metadata_broad_sample == 'DMSO'".
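As a toy illustration of this point (Metadata_broad_sample matches the LINCS naming; the data are simulated): when the whitening transform is learned on DMSO wells, only the DMSO subset is expected to show a near-identity correlation structure, so the heatmap should be built from that subset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a whitened plate: the transform was learned on DMSO wells,
# so decorrelation is only expected within the DMSO subset
rng = np.random.default_rng(7)
n_feats = 6
dmso_feats = rng.normal(size=(24, n_feats))  # ~decorrelated, like whitened DMSO
shared = rng.normal(size=(360, 1))           # compound wells share strong structure
cpd_feats = shared + 0.1 * rng.normal(size=(360, n_feats))

df = pd.DataFrame(
    np.vstack([dmso_feats, cpd_feats]),
    columns=[f"Cells_feature_{i}" for i in range(n_feats)],
)
df["Metadata_broad_sample"] = ["DMSO"] * 24 + ["compound"] * 360

# Subset to the DMSO treatment wells before building the correlation heatmap
dmso_corr = (
    df.query("Metadata_broad_sample == 'DMSO'")
    .drop(columns="Metadata_broad_sample")
    .corr()
)
full_corr = df.drop(columns="Metadata_broad_sample").corr()

# Off-diagonal correlations stay small in the DMSO subset but not plate-wide
off_diag = ~np.eye(n_feats, dtype=bool)
```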

Does this make sense?

@AdeboyeML

@gwaygenomics Yes, I think I now understand your question. There are 24 wells in each profile that are treated with DMSO. These 24 DMSO-treated wells give the same correlation results as the samples="Metadata_broad_sample == 'DMSO'" subset.

@gwaybio gwaybio moved this from To Do to In Progress in Version 1 Release Sep 25, 2020
@gwaybio
Member Author

gwaybio commented Sep 29, 2020

Note that we should also update the pycytominer version (#53), and that we are rebranding whiten to spherize (they are synonyms; see cytomining/pycytominer#102)

@gwaybio
Member Author

gwaybio commented Mar 9, 2021

I added a first pass spherize implementation for batch 1 and batch 2 data in #60

@gwaybio
Member Author

gwaybio commented Mar 22, 2021

#60 is now merged

@gwaybio gwaybio closed this as completed Mar 22, 2021
Version 1 Release automation moved this from In Progress to Done Mar 22, 2021