Add whitening normalization to this repo #38
Whitening has been fixed in pycytominer version
@shntnu @gwaygenomics - What whitening method should we use for the normalization of the profiles?
@AdeboyeML and I walked through this issue yesterday. The error is raised here. As I was writing the whitening methods, I noticed that the transformation fails when low variance features are present. Decision: because of this error, let's form the whitening profiles using level 4b data instead of level 4a data (description of data levels).
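For intuition on why low variance features break the transform, here is a small NumPy sketch (not the pycytominer implementation; the data and thresholds are made up for illustration). Whitening needs the inverse square root of the covariance matrix, and a near-constant feature makes that matrix effectively singular:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = 1.0 + 1e-12 * rng.normal(size=100)  # a near-constant (low variance) feature

cov = np.cov(X, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)

# The smallest eigenvalue is ~0, so cov is effectively singular and the
# cov^(-1/2) that whitening needs is numerically meaningless
print(eigvals.min())
print(np.linalg.cond(cov))  # condition number explodes
```

This is why whitening either errors out or produces wildly amplified noise unless near-zero-variance features are removed first, which is one motivation for forming the transform on feature-selected (level 4b) data.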
Pending
This is also a two-part question (the answer is probably the same for both parts).
@niranjchandrasekaran - I know you've done extensive testing on whitening variations. I also have UMAP profiles from one plate transformed using the different strategies (see below). Do you have a strong recommendation?

**UMAP Coordinates of Four Whitening Methods**

Click to see the code that generated the pdf of figures:

```python
import umap
import pandas as pd
import plotnine as gg
from pycytominer import normalize
from pycytominer.cyto_utils import infer_cp_features

# Load data
commit = "da8ae6a3bc103346095d61b4ee02f08fc85a5d98"
batch = "2016_04_01_a549_48hr_batch1"
plate = "SQ00014812"
profile_file = f"{plate}_normalized_feature_select.csv.gz"
base_url = "https://github.com/broadinstitute/lincs-cell-painting/raw/"
url = f"{base_url}{commit}/profiles/{batch}/{plate}/{profile_file}"

df = pd.read_csv(url)

# Apply transformations, UMAP transform, and plot
plotlist = []
for method in ["PCA", "ZCA", "PCA-cor", "ZCA-cor", "mad_robustize"]:
    for dmso_norm in [True, False]:
        if dmso_norm:
            samples = "Metadata_broad_sample == 'DMSO'"
            label = "DMSO normalized"
        else:
            samples = "all"
            label = "All samples normalized"

        if method == "mad_robustize":
            transform = "mad_robustize"
            label = f"MAD Robustize\n{label}"
        else:
            transform = "whiten"
            label = f"{method} Whitening\n{label}"

        normalize_df = normalize(
            df,
            features="infer",
            meta_features="infer",
            samples=samples,
            method=transform,
            output_file="none",
            compression=None,
            float_format=None,
            whiten_center=False,
            whiten_method=method
        )

        cp_features = infer_cp_features(normalize_df)
        meta_features = infer_cp_features(normalize_df, metadata=True)

        # Apply UMAP
        reducer = umap.UMAP(random_state=123)
        embedding_df = reducer.fit_transform(normalize_df.loc[:, cp_features])

        embedding_df = pd.DataFrame(embedding_df)
        embedding_df.columns = ["x", "y"]
        embedding_df = pd.concat(
            [
                normalize_df.loc[:, meta_features],
                embedding_df
            ],
            axis="columns"
        )

        embedding_df = embedding_df.assign(dmso_label="DMSO")
        embedding_df.loc[embedding_df.Metadata_broad_sample != "DMSO", "dmso_label"] = "compound"

        embedding_gg = (
            gg.ggplot(embedding_df, gg.aes(x="x", y="y")) +
            gg.geom_point(gg.aes(size="Metadata_mg_per_ml", color="Metadata_broad_sample"), alpha=0.5) +
            gg.facet_grid("~dmso_label") +
            gg.ggtitle(label) +
            gg.theme_bw() +
            gg.theme(legend_position="none", strip_background=gg.element_rect(colour="black", fill="#fdfff4"))
        )
        plotlist.append(embedding_gg)

gg.save_as_pdf_pages(plotlist, "test_whitening.pdf")
```

Based on these qualitative results, I think we should definitely normalize using DMSO profiles. The other options (PCA, PCA-cor, ZCA, ZCA-cor) are less clear.
@gwaygenomics ZCA-cor has been my go-to method in the JUMP-CP pilots. DMSO-based standardization has also worked quite well. PCA-based whitening hasn't worked well in my hands, though that may have something to do with how it was applied to the data (plate-wise, platemap-wise, or all plates). Based on my experience with the pilots, I would say that either DMSO-based standardization or ZCA-cor can be offered as the default method in pycytominer.
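For reference, here is a minimal NumPy sketch of what ZCA-cor whitening computes (following the Kessy, Lewin, and Strimmer naming; this is an illustration, not pycytominer's implementation). It whitens on the correlation scale, which keeps each whitened feature maximally similar to its standardized original:

```python
import numpy as np

def zca_cor_whiten(X, eps=1e-8):
    """ZCA-cor sketch: standardize, then apply P^(-1/2) where P is the correlation matrix."""
    Xc = X - X.mean(axis=0)                 # center features
    Z = Xc / Xc.std(axis=0, ddof=1)         # standardize -> correlation scale
    P = np.corrcoef(X, rowvar=False)        # correlation matrix
    theta, G = np.linalg.eigh(P)            # eigendecomposition of P
    W = G @ np.diag(1.0 / np.sqrt(theta + eps)) @ G.T  # P^(-1/2), symmetric
    return Z @ W                            # whitened data: identity covariance

rng = np.random.default_rng(1)
A = rng.normal(size=(500, 4))
X = A @ rng.normal(size=(4, 4))             # deliberately correlated features
Xw = zca_cor_whiten(X)
print(np.cov(Xw, rowvar=False).round(2))    # approximately the identity matrix
```

ZCA vs. PCA whitening produce the same decorrelated covariance; they differ only in the rotation applied afterward, which is why the choice matters mainly for interpretability of individual features.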
Thanks @niranjchandrasekaran
Interesting! We are planning on doing plate-wise whitening - I don't see a benefit of platemap-wise normalization, but perhaps I am missing a key piece.
Cool. I believe the call would look something like this:

```python
whitened_df = normalize(
    normalized_feature_selected_df,  # For each of the two level 4A profiles
    features="infer",
    meta_features="infer",
    samples="Metadata_broad_sample == 'DMSO'",  # This is the key arg to learn the whiten transform using only DMSO
    method="whiten",
    whiten_center=False,
    whiten_method="ZCA-cor"
)
```
We decided today at profiling check-in that
@gwaygenomics Yes, ZCA-cor will be used as the default for the whitening.
@AdeboyeML - when you visualize the heatmaps, are you looking at only the DMSO profiles? We do not expect to see a decorrelated result in the full plate. Also, can you post the resulting heatmap in this issue? It'll be great for our future selves to refer back to!
@gwaygenomics - So I am looking at both the normalized_feature_select_DMSO and normalized_feature_select profiles.
- Before Whitening:
- After ZCA-cor Whitening -- using the below parameters:
Great, this is exactly what we want to ultimately do. My question is: which profiles are you using to generate the heatmap? In each of the two files (normalized_feature_select_DMSO and normalized_feature_select) there are 384 profiles. Only a small portion of them (~20, I think) are treated with DMSO (negative control). We should be building the heatmap with the two profile sets subset to only the DMSO treatment wells when using `samples="Metadata_broad_sample == 'DMSO'"`. Does this make sense?
@gwaygenomics Yes, I think I now understand your question. There are 24 wells on each plate that are treated with DMSO. These 24 DMSO-treated wells have the same correlation results as the
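The check being discussed can be sketched with simulated data (hypothetical well counts and feature dimensions; this is not the actual plate data). When the whitening transform is learned on DMSO wells only, the DMSO subset should yield a near-identity correlation heatmap, while the full plate generally will not:

```python
import numpy as np

rng = np.random.default_rng(2)
n_dmso, n_trt, n_feat = 24, 360, 5  # e.g. 24 DMSO wells on a 384-well plate

# Simulated correlated DMSO profiles and (independent) treatment profiles
dmso = rng.normal(size=(n_dmso, n_feat)) @ rng.normal(size=(n_feat, n_feat))
trt = dmso.mean(axis=0) + 3 * rng.normal(size=(n_trt, n_feat))

# Learn a ZCA whitening transform from the DMSO wells only
cov = np.cov(dmso, rowvar=False)
vals, vecs = np.linalg.eigh(cov)
W = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T  # cov^(-1/2)

dmso_w = (dmso - dmso.mean(axis=0)) @ W
trt_w = (trt - dmso.mean(axis=0)) @ W

# Heatmap built from DMSO wells only: near-identity correlation
dmso_corr = np.corrcoef(dmso_w, rowvar=False)
print(np.abs(dmso_corr - np.eye(n_feat)).max())  # near 0

# Heatmap built from the full plate: not expected to be decorrelated
full_corr = np.corrcoef(np.vstack([dmso_w, trt_w]), rowvar=False)
print(np.abs(full_corr - np.eye(n_feat)).max())
```

This is why the heatmap should be computed on the DMSO-subset profiles when judging whether the whitening worked.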
Note that we should also update the pycytominer version (#53) and that we are rebranding
I added a first pass spherize implementation for batch 1 and batch 2 data in #60 |
#60 is now merged |
The profiles deposited in #34 do not include whitening normalization. Previously (see #4 (comment)), I elected to leave the whitened data to a future data upload because of this caveat:
@shntnu also notes in #4 (comment)