
Required Steps for Depositing Profiles #4

Closed
gwaybio opened this issue Mar 8, 2020 · 8 comments

@gwaybio (Member) commented Mar 8, 2020

I am working towards processing all Drug Repurposing data and adding the results in this repository. The cell health project (https://github.com/broadinstitute/cell-health) now requires that the data are uniformly processed, documented, and made available here.

I will outline below the necessary steps required to get the data and processing pipelines uploaded.

  1. Confirm that there are only small floating point differences between cytominer-derived and pycytominer-derived profiles.
  2. Implement broad_sample-specific annotations.
  3. Rerun the "all" profiles pipeline described in broadinstitute/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad#3 (currently a private repo).
    • This needs to be rerun with the updated robustize_mad normalization strategy, which will also require a decision between whole-plate and DMSO-specific normalization.
  4. Rerun the 4.apply module in cell-health.
    • Only after steps 1-3 are complete can I rerun the 4.apply module.
    • I will explore whether or not to make the lincs-cell-painting profile repository a submodule of the cell-health project.
@shntnu (Collaborator) commented Apr 3, 2020

> Implement broad_sample-specific annotations

@gwaygenomics Can you remind me what input you need for this? I'll use cytotools/annotate as a reference to provide inputs.

@gwaybio (Member, Author) commented Apr 3, 2020

> Can you remind me what input you need for this? I'll use cytotools/annotate as a reference to provide inputs.

Ah, that is a good reference, thanks for the pointer.

I wasn't sure about the cytominer strategy of splitting core functionality from cyto-specific functionality, so I put cytominer progress on hold. The primary reason for the pause was so that the LINCS data could be processed with a more stable (and thus more reproducible) tool.

However, it sounds like cytominer (and pycytominer) will stabilize on a longer timeframe than we need for the LINCS profiles. A potential intermediate solution could be to freeze a pycytominer version using conda (after confirming the floating point differences) for LINCS-specific processing. What do you think?
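Freezing the tool could look roughly like the following conda environment file. This is a sketch: the environment name, Python version, and commit placeholder are all assumptions, not a real pin.

```yaml
# environment.yml -- sketch of pinning pycytominer to an exact commit
# (<commit-hash> below is a placeholder, not a real pycytominer commit)
name: lincs-profiling
channels:
  - conda-forge
dependencies:
  - python=3.7
  - pandas
  - pip
  - pip:
      # pinning to a specific commit makes reprocessing reproducible
      - git+https://github.com/cytomining/pycytominer@<commit-hash>
```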

@shntnu (Collaborator) commented Apr 3, 2020

> Rerun the "all" profiles pipeline described in broadinstitute/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad#3 (currently a private repo)
>
>   • This needs to be rerun with the updated robustize_mad normalization strategy, which will also require a decision on whole-plate or DMSO-specific normalization.

Going forward, we will very likely produce at least two different Level 4a profiles:

  • whole-well z-scored
  • DMSO z-scored

because, depending on the plate layout, one might be better than the other.

We will then produce corresponding 4b (normalized feature selected) versions of the two 4a profiles.

We will also produce corresponding 4w (normalized and whitened) versions of the two 4a profiles.

Which of these profiles is best for a given application is still an open research question; until then, we will simply produce them all.
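The two Level 4a variants amount to robust z-scoring against two different reference populations. A minimal sketch, assuming each plate is a DataFrame whose DMSO wells are labeled "DMSO" in a Metadata_broad_sample column (both names are assumptions, not the pipeline's API):

```python
# Sketch: MAD-robustized z-scores against two reference populations
# (whole plate vs. DMSO wells). Column names are assumptions.
import pandas as pd

def robustize_mad(plate_df, features, reference="all"):
    """Center by the reference median and scale by 1.4826 * MAD."""
    if reference == "dmso":
        ref = plate_df[plate_df["Metadata_broad_sample"] == "DMSO"]
    else:  # whole-plate ("whole-well") reference
        ref = plate_df

    center = ref[features].median()
    # 1.4826 makes the MAD consistent with the standard deviation for normal data
    scale = 1.4826 * (ref[features] - center).abs().median()

    out = plate_df.copy()
    out[features] = (plate_df[features] - center) / scale
    return out
```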

@gwaygenomics Does that sound reasonable?

This does complicate the analysis for cell-health because you now need to decide which of the two 4a profiles you should use for predictions. For that case, I'd go with whole-plate because that makes it similar to the way you've processed the CRISPR data, IIRC.

@shntnu (Collaborator) commented Apr 3, 2020

> A potential intermediate solution could be to freeze a pycytominer version using conda (after confirming floating point differences) for lincs-specific processing. What do you think?

That sounds good to me, and it will very likely be the strategy we use for all data processing with pycytominer, right?

@gwaybio (Member, Author) commented Apr 3, 2020

@shntnu and I chatted about this offline. I will summarize our decisions below:

  • I will confirm floating point differences in pycytominer (compared to current cytominer profiles)
  • I will apply the two normalization schemes (whole-well and DMSO)
  • These two normalization schemes will propagate to two separate feature selected files and two separate consensus files

Also, here are answers to the specific questions:

> For that case, I'd go with whole-plate because that makes it similar to the way you've processed the CRISPR data, IIRC.

I normalize profiles by EMPTY CRISPR perturbations. See here.

> That sounds good to me, and it will very likely be the strategy we use for all data processing with pycytominer, right?

Similar, but not exactly the same. Eventually pycytominer will be traditionally versioned on PyPI and conda. Currently, pycytominer is versioned by GitHub hash (see here). It is also worth noting that we can always reprocess the profiles again. This is the beauty of versioned data!

@gwaybio (Member, Author) commented Apr 28, 2020

@shntnu I have a couple of follow-up questions now that I've started adding the processing code in #21 (cc @niranjchandrasekaran)

Question 1 - Should we use z-score normalization or robustize_mad?

> Going forward, we will very likely produce at least two different Level 4a profiles:
>
>   • whole-well z-scored
>   • DMSO z-scored
>
> because, depending on the plate layout, one might be better than the other.

The default in cytominer_scripts/normalize.R is robustize. I assume that I should continue using this method.

Question 2 - Is it ok to leave the whitened version for a future update?

> We will also produce corresponding 4w (normalized and whitened) versions of the two 4a profiles.

Pycytominer currently does have a whiten implementation, and I applied it to the two 4a profiles in a test case. The test did not go smoothly, so I will likely need to tinker with the pycytominer implementation a bit (it is hard to estimate how long the delay will be).
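For reference, whitening a feature matrix can be sketched as a ZCA-style transform via an eigendecomposition of the covariance. This is a generic sketch and independent of pycytominer's own implementation, which may differ:

```python
# Sketch: ZCA whitening -- decorrelate features and scale to unit variance.
# Generic implementation; pycytominer's whiten may differ in detail.
import numpy as np

def zca_whiten(X, eps=1e-8):
    """Return X transformed so its features are decorrelated with unit variance."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eps guards against near-zero eigenvalues, a common failure mode in practice
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return X_centered @ W
```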

Question 3 - How should I form the level 5 consensus data?

My current plan is as follows:

  1. Process each plate independently.
  2. Generate an across-plate consensus signature grouped by broad_sample and dose.
  3. Base the consensus signature on the median.
  4. Output a single file for the full consensus signature.
  5. Output a separate file for a feature-selected consensus signature (derived after calculating the consensus).
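Steps 2-3 of this plan amount to a grouped median. A minimal sketch, where the metadata column names are assumptions:

```python
# Sketch: median consensus signatures grouped by perturbation and dose.
# Column names (Metadata_broad_sample, Metadata_dose) are assumptions.
import pandas as pd

def build_consensus(profiles, features,
                    group_cols=("Metadata_broad_sample", "Metadata_dose")):
    """Collapse replicate-level profiles to one median signature per group."""
    return (
        profiles
        .groupby(list(group_cols))[list(features)]
        .median()
        .reset_index()
    )
```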

@shntnu (Collaborator) commented Apr 29, 2020

> The default in cytominer_scripts/normalize.R is robustize. I assume that I should continue using this method.

Yes. The rationale is mostly empirical: robustize resulted in higher replicate correlations of Level 4 profiles (compared to standardize) across a few experiments in which we tested this.
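The empirical criterion mentioned here can be sketched as the median pairwise Pearson correlation among replicate profiles; the grouping column name below is an assumption:

```python
# Sketch: median within-group pairwise Pearson correlation of replicates --
# the kind of metric used to compare robustize vs. standardize.
import numpy as np
import pandas as pd

def replicate_correlation(profiles, features, group_col="Metadata_broad_sample"):
    """Median of pairwise Pearson correlations between replicate profiles."""
    corrs = []
    for _, group in profiles.groupby(group_col):
        if len(group) < 2:
            continue
        # Correlate replicate profiles (rows) against each other
        mat = np.corrcoef(group[features].to_numpy())
        corrs.extend(mat[np.triu_indices_from(mat, k=1)].tolist())
    return float(np.median(corrs))
```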

> Question 2 - Is it ok to leave the whitened version for a future update?

Yes, definitely ok.

> How should I form the level 5 consensus data?

Your plan sounds good.

There's an incompatibility that I need to address in the handbook (cytomining/profiling-handbook#53). Ugh. So glad we are thinking through provenance and reproducibility via this project!

@gwaybio (Member, Author) commented May 15, 2020

Closing this issue in favor of project management in https://github.com/broadinstitute/lincs-cell-painting/projects/1
