
Add whitening normalization to this repo #38

Closed
gwaybio opened this issue May 15, 2020 · 15 comments

@gwaybio
Member

gwaybio commented May 15, 2020

The profiles deposited in #34 do not include whitening normalization. Previously (see #4 (comment)), I elected to leave the whitened data to a future data upload because of this caveat:

Pycytominer currently does have a whiten implementation, and I applied it to the two 4a profiles in a test case. The test case did not go smoothly, so it is likely I will need to tinker with the pycytominer implementation a bit (hard to estimate how long the delay will be).

@shntnu also notes in #4 (comment)

Going forward, we will very likely produce at least two different Level 4a profiles

  • whole-well z-scored
  • DMSO z-scored

because depending on the layout, one might be better than the other.

We will then produce corresponding 4b (normalized feature selected) versions of the two 4a profiles.
We will also produce corresponding 4w (normalized and whitened) versions of the two 4a profiles.
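For concreteness, the two 4a variants differ only in which wells the z-score location and scale are estimated from. A toy pandas sketch (the column names are illustrative, not this repo's pipeline code):

```python
import numpy as np
import pandas as pd

# Toy plate: 384 wells, 24 of them DMSO (column names are made up)
rng = np.random.default_rng(1)
features = ["Cells_AreaShape_Area", "Nuclei_Intensity_MeanIntensity"]
df = pd.DataFrame(rng.normal(loc=5.0, scale=2.0, size=(384, 2)), columns=features)
df["Metadata_broad_sample"] = np.where(np.arange(384) % 16 == 0, "DMSO", "compound")

# Whole-well z-score: mean and std estimated from every well on the plate
whole_well = (df[features] - df[features].mean()) / df[features].std()

# DMSO z-score: mean and std estimated from the DMSO wells only,
# then applied to all wells
dmso = df.loc[df["Metadata_broad_sample"] == "DMSO", features]
dmso_scored = (df[features] - dmso.mean()) / dmso.std()
```

With a well-behaved layout the two give similar results; with plate effects or unbalanced layouts they can diverge, which is why both variants are planned.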

@gwaybio
Member Author

gwaybio commented Sep 15, 2020

Whitening has been fixed in pycytominer version

@AdeboyeML

@shntnu @gwaygenomics - What whitening method should we use for the normalization of the profiles?

  • There are four whitening methods: PCA, PCA-cor, ZCA-cor and ZCA to choose from.

  • So far, I have tried all four methods on two profiles. I realized that, in order to use PCA-cor and ZCA-cor without getting the error "Divide by zero error, make sure low variance columns are removed", all columns containing only zero (0.0) values have to be dropped prior to whitening.

  • Also, I realized that after applying PCA and ZCA, a few (1 or 2) of the normalized columns came back with all zeros (0.0) as their values.
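For reference, these four procedures correspond to the whitening families described by Kessy, Lewandowski, and Strimmer in "Optimal Whitening and Decorrelation"; ZCA-cor applies W = P^{-1/2} V^{-1/2}, where P is the correlation matrix and V the diagonal matrix of variances. A minimal NumPy sketch of the math (an illustration only, not pycytominer's implementation) also makes the divide-by-zero failure mode explicit: V^{-1/2} divides by each feature's standard deviation, so constant columns must be dropped first.

```python
import numpy as np

def zca_cor_whiten(X):
    """ZCA-cor whitening: W = P^{-1/2} V^{-1/2} (illustrative sketch)."""
    Xc = X - X.mean(axis=0)
    variances = Xc.var(axis=0, ddof=1)
    if np.any(variances == 0):
        # This is the failure mode behind the divide-by-zero error:
        # a constant (zero-variance) column has no defined 1/sqrt(var)
        raise ValueError("remove zero-variance columns before whitening")
    V_inv_sqrt = np.diag(1.0 / np.sqrt(variances))
    P = np.corrcoef(Xc, rowvar=False)
    # P^{-1/2} via the eigendecomposition of the correlation matrix
    theta, G = np.linalg.eigh(P)
    P_inv_sqrt = G @ np.diag(1.0 / np.sqrt(theta)) @ G.T
    W = P_inv_sqrt @ V_inv_sqrt
    return Xc @ W.T

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # correlated features
Xw = zca_cor_whiten(X)
# The whitened covariance is the identity (up to floating point error)
```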

@gwaybio
Member Author

gwaybio commented Sep 22, 2020

getting this error "Divide by zero error, make sure low variance columns are removed", all columns with zero (0.0) values have to be dropped prior to whitening.

@AdeboyeML and I walked through this issue yesterday. The error is raised here. As I was writing the whitening methods, I noticed that the transformation fails when there are low variance features present.

Decision

Because of this error, let's form the whitening profiles using level 4b data instead of level 4a data (description of data levels).

@gwaybio
Member Author

gwaybio commented Sep 22, 2020

Pending

What whitening method should we use for the normalization of the profiles

This is also a two-part question (the answer is probably the same for both):

  1. What whitening method should we use in this repo?
  2. What whitening method should we use as a reasonable default in pycytominer? (see Modify normalize.py for easier usage of whitening method cytomining/pycytominer#96)

@niranjchandrasekaran - I know you've done extensive testing on whitening variations. I also have UMAP profiles from one plate transformed using the different strategies (see below). Do you have a strong recommendation?

UMAP Coordinates of Four Whitening Methods

test_whitening.pdf

Click to see code that generated the pdf of figures
import umap
import pandas as pd
import plotnine as gg
from pycytominer import normalize
from pycytominer.cyto_utils import infer_cp_features

# Load data
commit = "da8ae6a3bc103346095d61b4ee02f08fc85a5d98"
batch = "2016_04_01_a549_48hr_batch1"
plate = "SQ00014812"
profile_file = f"{plate}_normalized_feature_select.csv.gz"

base_url = "https://github.com/broadinstitute/lincs-cell-painting/raw/"
url = f"{base_url}{commit}/profiles/{batch}/{plate}/{profile_file}"
df = pd.read_csv(url)

# Apply transformations, UMAP transform, and plot
plotlist = []
for method in ["PCA", "ZCA", "PCA-cor", "ZCA-cor", "mad_robustize"]:
    for dmso_norm in [True, False]:
        if dmso_norm:
            samples = "Metadata_broad_sample == 'DMSO'"
            label = "DMSO normalized"
        else:
            samples = "all"
            label = "All samples normalized"

        if method == "mad_robustize":
            transform = "mad_robustize"
            label = f"MAD Robustize\n{label}"
        else:
            transform = "whiten"
            label = f"{method} Whitening\n{label}"

        normalize_df = normalize(
            df,
            features="infer",
            meta_features="infer",
            samples=samples,
            method=transform,
            output_file="none",
            compression=None,
            float_format=None,
            whiten_center=False,
            whiten_method=method
        )

        cp_features = infer_cp_features(normalize_df)
        meta_features = infer_cp_features(normalize_df, metadata=True)
        
        # Apply UMAP
        reducer = umap.UMAP(random_state=123)
        embedding_df = reducer.fit_transform(normalize_df.loc[:, cp_features])

        embedding_df = pd.DataFrame(embedding_df)
        embedding_df.columns = ["x", "y"]
        embedding_df = pd.concat(
            [
                normalize_df.loc[:, meta_features],
                embedding_df
            ],
            axis="columns"
        )
        embedding_df = embedding_df.assign(dmso_label="DMSO")
        embedding_df.loc[embedding_df.Metadata_broad_sample != "DMSO", "dmso_label"] = "compound"
        embedding_gg = (
            gg.ggplot(embedding_df, gg.aes(x="x", y="y")) +
            gg.geom_point(gg.aes(size="Metadata_mg_per_ml", color="Metadata_broad_sample"), alpha=0.5) +
            gg.facet_grid("~dmso_label") +
            gg.ggtitle(label) +
            gg.theme_bw() +
            gg.theme(legend_position="none", strip_background=gg.element_rect(colour="black", fill="#fdfff4"))
        )
        plotlist.append(embedding_gg)
        
gg.save_as_pdf_pages(plotlist, "test_whitening.pdf")

Based on these qualitative results, I think we should definitely normalize using DMSO profiles. The other options (PCA, PCA-cor, ZCA, ZCA-cor) are less clear.

@niranjchandrasekaran
Member

@gwaygenomics ZCA-cor has been my go-to method in the JUMP-CP pilots. DMSO-based standardization has also worked quite well. PCA-based whitening hasn't worked well in my hands, though it may have something to do with how it was applied to the data (plate-wise, platemap-wise, or across all plates).

Based on my experience with the pilots, I would say that either DMSO-based standardization or ZCA-cor can be offered as the default method in pycytominer.

@gwaybio
Member Author

gwaybio commented Sep 23, 2020

Thanks @niranjchandrasekaran

it may have something to do with how it was applied to the data (plate-wise, platemap-wise, or across all plates).

Interesting! We are planning on doing plate-wise whitening - I don't see a benefit of platemap-wise normalization, but perhaps I am missing a key piece.
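To make "plate-wise" concrete: the transform is fit separately within each plate and applied only to that plate's wells. A toy sketch with a pandas groupby (plain ZCA for brevity; Metadata_Plate and the feature names are illustrative, and this is not the repo's actual pipeline code):

```python
import numpy as np
import pandas as pd

def zca_whiten_block(features: pd.DataFrame) -> pd.DataFrame:
    """ZCA-whiten one plate's feature block: W = Sigma^{-1/2} (illustrative)."""
    Xc = features - features.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    W = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    out = Xc.to_numpy() @ W.T
    return pd.DataFrame(out, index=features.index, columns=features.columns)

# Toy data: two plates stacked in one frame
rng = np.random.default_rng(42)
feature_cols = ["f1", "f2", "f3"]
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=feature_cols)
df["Metadata_Plate"] = np.repeat(["SQ00014812", "SQ00014813"], 100)

# Plate-wise whitening: each plate's covariance is estimated separately
whitened = (
    df.groupby("Metadata_Plate", group_keys=False)[feature_cols]
    .apply(zca_whiten_block)
)
```

Platemap-wise application would simply group by the platemap identifier instead, pooling all plates that share a layout.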

Based on my experience with the pilots, I would say that either DMSO based standardization or ZCA-cor can be offered as the default method in pycytominer.

Cool. I believe ZCA-cor = good performance and PCA = bad performance is also what @AdeboyeML observed. We should also apply ZCA-cor using only DMSO profiles in the lincs dataset. Like this:

whitened_df = normalize(
    normalized_feature_selected_df,  # For each of the two level 4b profiles
    features="infer",
    meta_features="infer",
    samples="Metadata_broad_sample == 'DMSO'",  # This is the key arg to learn the whiten transform using only DMSO
    method="whiten",
    whiten_center=False,
    whiten_method="ZCA-cor"
)

@gwaybio
Member Author

gwaybio commented Sep 23, 2020

We decided today at profiling checkin that ZCA-cor against DMSO profiles per-plate is the way to go

@AdeboyeML

@gwaygenomics Yes, ZCA-cor will be used as default for the whitening.

  • I just want to clarify that when I set samples="Metadata_broad_sample == 'DMSO'" and whiten_center=False as the normalization parameters for the level 4b data (the normalized_feature_select_DMSO and normalized_feature_select profiles), it doesn't give the expected decorrelation result.

I think it is best to set samples="all" and whiten_center=True as the defaults.

@gwaybio
Member Author

gwaybio commented Sep 23, 2020

@AdeboyeML - when you visualize the heatmaps, are you looking at only the DMSO profiles? We do not expect to see a decorrelated result in the full plate.

Also, can you post the resulting heatmap in this issue? It'll be great to refer back to in the future, for our future selves!

@AdeboyeML

@gwaygenomics - So I am looking at both the normalized_feature_select_DMSO and normalized_feature_select profiles.

  • For the heatmap, I selected the first 14 features from each profile, since it is not feasible to view all of the morphological features in one heatmap; it would be too cluttered.

- Before Whitening:

newplot (67)

- After ZCA-cor Whitening -- using the below parameters:

whitened_df = normalize(
    normalized_feature_selected_df,  # For each of the two level 4b profiles
    features="infer",
    meta_features="infer",
    samples="Metadata_broad_sample == 'DMSO'",  # This is the key arg to learn the whiten transform using only DMSO
    method="whiten",
    whiten_center=False,
    whiten_method="ZCA-cor"
)

newplot (65)

- After ZCA-cor Whitening -- using the below parameters:

whitened_df = normalize(
    normalized_feature_selected_df,  # For each of the two level 4b profiles
    features="infer",
    meta_features="infer",
    samples="all",
    method="whiten",
    whiten_center=True,
    whiten_method="ZCA-cor"
)

newplot (66)

@gwaybio
Member Author

gwaybio commented Sep 23, 2020

So I am looking at both the normalized_feature_select_DMSO and normalized_feature_select profiles.

great, this is exactly what we want to ultimately do.

My question is, which profiles are you using to generate the heatmap? In each of the two files (normalized_feature_select_DMSO and normalized_feature_select profiles) there are 384 profiles. Only a small portion of them (~20 I think) are treated with DMSO (negative control). We should be building the heatmap with the two profiles subset to only DMSO treatment wells when using normalize() with samples="Metadata_broad_sample == 'DMSO'".
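As a toy illustration of this point (Metadata_broad_sample matches the LINCS naming; the data are simulated): when the whitening transform is learned on DMSO wells, only the DMSO subset is expected to show a near-identity correlation structure, so the heatmap should be built from that subset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a whitened plate: the transform was learned on DMSO wells,
# so decorrelation is only expected within the DMSO subset
rng = np.random.default_rng(7)
n_feats = 6
dmso_feats = rng.normal(size=(24, n_feats))  # ~decorrelated, like whitened DMSO
shared = rng.normal(size=(360, 1))           # compound wells share strong structure
cpd_feats = shared + 0.1 * rng.normal(size=(360, n_feats))

df = pd.DataFrame(
    np.vstack([dmso_feats, cpd_feats]),
    columns=[f"Cells_feature_{i}" for i in range(n_feats)],
)
df["Metadata_broad_sample"] = ["DMSO"] * 24 + ["compound"] * 360

# Subset to the DMSO treatment wells before building the correlation heatmap
dmso_corr = (
    df.query("Metadata_broad_sample == 'DMSO'")
    .drop(columns="Metadata_broad_sample")
    .corr()
)
full_corr = df.drop(columns="Metadata_broad_sample").corr()

# Off-diagonal correlations stay small in the DMSO subset but not plate-wide
off_diag = ~np.eye(n_feats, dtype=bool)
```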

Does this make sense?

@AdeboyeML

@gwaygenomics Yes, I think I now understand your question. There are 24 wells in each profile that are treated with DMSO. These 24 DMSO-treated wells give the same correlation results as the samples="Metadata_broad_sample == 'DMSO'" subset.

@gwaybio gwaybio moved this from To Do to In Progress in Version 1 Release Sep 25, 2020
@gwaybio
Member Author

gwaybio commented Sep 29, 2020

Note that we should also update the pycytominer version (#53), and that we are rebranding whiten to spherize (they are synonyms; see cytomining/pycytominer#102)

@gwaybio
Member Author

gwaybio commented Mar 9, 2021

I added a first pass spherize implementation for batch 1 and batch 2 data in #60

@gwaybio
Member Author

gwaybio commented Mar 22, 2021

#60 is now merged

@gwaybio gwaybio closed this as completed Mar 22, 2021
Version 1 Release automation moved this from In Progress to Done Mar 22, 2021