Frozen data version 1 #63

Merged
merged 22 commits into from
May 21, 2021

Conversation

gwaybio
Member

@gwaybio gwaybio commented Mar 30, 2021

This PR updates pycytominer and adds the associated fixes described in #62.

TODO

  • Check example batch 2 data
  • Run full pipeline for both batches, add profiles
  • Rerun cytominer vs. pycytominer comparison notebook
  • Rerun consensus signature notebook - note that I will no longer need to recode dose
  • Rerun spherize notebook

In the next PR, I will migrate from git lfs to dvc

@gwaybio gwaybio marked this pull request as ready for review April 20, 2021 20:14
@gwaybio gwaybio changed the title Frozen data Frozen data version 1 Apr 20, 2021
@gwaybio gwaybio requested a review from shntnu April 20, 2021 20:14
@gwaybio
Member Author

gwaybio commented Apr 20, 2021

@shntnu - this is good to go. Sorry for the HUGE number of files (most are just profiles).

Please pay extra attention to any updated documentation. Any code changes will require a complete rerun (which I'll only do if absolutely necessary). If necessary, I can address #65 simultaneously.

The next step will be to update to dvc!

]

# Output option
float_format = "%5g"
compression = "gzip"
compression_options = {"method": "gzip", "mtime": 1}

Collaborator

hooray!
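
For context, here is a minimal sketch of what the output options above control, using only the Python standard library rather than the pipeline's own writer. Pinning the gzip header timestamp (`mtime`) makes repeated compression of identical content byte-identical, which is presumably why the setting is used here, and `%5g` keeps at most six significant digits. The helper name `gzip_bytes` and the sample values are illustrative assumptions, not code from this PR.

```python
import gzip
import io
import struct


def gzip_bytes(text: str, mtime: int = 1) -> bytes:
    """Compress text with a fixed header timestamp so the output is reproducible."""
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb", mtime=mtime) as handle:
        handle.write(text.encode("utf-8"))
    return buffer.getvalue()


# "%5g" formats with up to six significant digits and a minimum width of five.
values = [0.123456789, 12345678.9]
formatted = ["%5g" % value for value in values]

first = gzip_bytes(",".join(formatted))
second = gzip_bytes(",".join(formatted))

# Bytes 4-7 of a gzip header hold the modification time (little-endian).
header_mtime = struct.unpack("<I", first[4:8])[0]
```

Without a fixed `mtime`, the gzip header records the current time, so two runs over the same data would produce different bytes.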

python profiling_pipeline.py --batch "2017_12_05_Batch2" --plate_prefix "BR" --well_col "Metadata_Well" --plate_col "Metadata_Plate" --extract_cell_line
```bash
# Make sure you are in the profiles/ directory
./run.sh
```

Collaborator

cool!

if batch == "2017_12_05_Batch2":
    spherize_df = (
        profile_df
        .groupby("Metadata_cell_line")

Collaborator

No need to edit, but I'm pretty sure you don't need this logic (grouping by cell line is trivially valid in batch 1 as well).

Collaborator

Regarding spherizing overall, we will make a lot of changes over time which are likely to improve the quality:

  1. Also group by timepoint and not just cell line because if we don't we might be effectively factoring out subspaces we care about
  2. Figure out epsilon
  3. Drop outliers prior to spherizing
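
To make the grouping and epsilon points concrete, here is a rough NumPy sketch of ZCA-style sphering applied separately within each (cell line, timepoint) group. It is not pycytominer's implementation; the function names, the `epsilon` default, and the toy data are assumptions for illustration only.

```python
import numpy as np


def zca_whiten(features, epsilon=1e-6):
    """Center the rows and rotate/scale them so their covariance is close to identity."""
    features = np.asarray(features, dtype=float)
    centered = features - features.mean(axis=0)
    covariance = centered.T @ centered / (features.shape[0] - 1)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    # epsilon regularizes near-zero eigenvalues so the inverse square root stays finite.
    scale = 1.0 / np.sqrt(eigenvalues + epsilon)
    transform = eigenvectors @ np.diag(scale) @ eigenvectors.T
    return centered @ transform


def spherize_by_group(features, group_keys, epsilon=1e-6):
    """Fit and apply a separate whitening transform for each distinct group key."""
    features = np.asarray(features, dtype=float)
    result = np.empty_like(features)
    for key in set(group_keys):
        mask = np.array([k == key for k in group_keys])
        result[mask] = zca_whiten(features[mask], epsilon)
    return result


rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.5]])
profiles = rng.normal(size=(120, 3)) @ mixing
# Hypothetical metadata: two cell lines crossed with two timepoints, 30 profiles each.
keys = [(line, hours) for line in ("A", "B") for hours in (24, 48) for _ in range(30)]
sphered = spherize_by_group(profiles, keys)
```

Passing only the cell line as the key would instead fit one transform across all timepoints, which is the situation item 1 warns about.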

Member Author

Also group by timepoint and not just cell line because if we don't we might be effectively factoring out subspaces we care about

nice catch, i'll update this

Figure out epsilon

I think I'll skip this one for data freeze version 1. It seems too far off in the future to wait on.

Drop outliers prior to spherizing

Yep, we already do this. We're discussing in #65 (comment), and once we decide there, I'll run this notebook again with all the suggested changes (I'll also add outlier dropping to the consensus signature generation notebook).

Collaborator

Great! I agree you should skip epsilon optimization, and it's great you're updating 1. as well

Collaborator

And great that the outlier features will be gone! For dropping outlier samples, we need new functionality in pycytominer, right? See cytomining/pycytominer#140

Member Author

pycytominer currently has a very crude outlier removal strategy (here), which can be specified as an operation in feature_select().

It should easily handle Michael's features as defined in #65, but you're right, the method needs to be improved in the future. Thanks for opening that pycytominer issue!

Collaborator

IIUC, it needs to be a different operation altogether (a row filter, not a column filter; drop_outlier_features is the latter), but we can discuss this in cytomining/pycytominer#140

All set here from my end.
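
The distinction drawn above between a column filter and a row filter can be illustrated with a small standard-library sketch. The cutoff, the function names, and the toy table are hypothetical and are not pycytominer's API; they only show which axis each kind of filter removes.

```python
def drop_outlier_columns(rows, cutoff):
    """Column filter: remove every feature whose absolute value exceeds cutoff in any row."""
    features = list(rows[0])
    keep = [f for f in features if all(abs(row[f]) <= cutoff for row in rows)]
    return [{f: row[f] for f in keep} for row in rows]


def drop_outlier_rows(rows, cutoff):
    """Row filter: remove every sample that has any feature beyond cutoff."""
    return [row for row in rows if all(abs(v) <= cutoff for v in row.values())]


table = [
    {"feature_a": 0.2, "feature_b": 1.1},
    {"feature_a": -0.4, "feature_b": 250.0},  # one extreme measurement
    {"feature_a": 0.1, "feature_b": -0.9},
]

by_column = drop_outlier_columns(table, cutoff=15)
by_row = drop_outlier_rows(table, cutoff=15)
```

The column filter keeps all three samples but loses `feature_b`; the row filter keeps both features but loses the second sample.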

@shntnu
Collaborator

shntnu commented Apr 20, 2021

Please pay extra attention to any updated documentation.

I focused only on the .py and .md files.

I didn't see any documentation changes other than the README.md

Did I miss any documentation?

Any code changes will require a complete rerun (which I'll only do if absolutely necessary). If necessary, I can address #65 simultaneously.

No need to do #65 except perhaps this thing you suggested:

I can add a prominent note to make sure these are dropped in all downstream analyses in a README in #63

@gwaybio
Member Author

gwaybio commented Apr 21, 2021

Did I miss any documentation?

Nope, I think you got it all. Thank you!

I can make all of these changes, and we should be good to merge soon

adding dose info to profile readme, adding outlier feature drop to consensus readme
@gwaybio
Member Author

gwaybio commented Apr 28, 2021

Alright @shntnu - this is ready for your eyes again. Here is what changed:

  • Adding a note on how we decided to remove features 8a794f9
  • Rearranging documentation and minor fixes 6adb1c3
  • Adding the new blocklist features 510fe92
  • Using the new blocklist feature file in feature selection 2201875
  • Adding the updated consensus files 49ec1c7
  • Fixing sphering (sphering based on time as well and dropping new blocklist features) 7c96161
  • Adding spherized profiles 3a47efd
  • Adding a UMAP visualization of spherized profiles acb3e89
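
As a rough illustration of how a blocklist file might be applied during feature selection, the sketch below drops any profile column named in a one-column blocklist. The column header `blocklist`, the feature names, and the in-memory CSV text are assumptions; the actual file layout in this repository may differ.

```python
import csv
import io

# Hypothetical blocklist file contents: one feature name per row under a header.
blocklist_csv = "blocklist\nCells_Example_Feature_1\nNuclei_Example_Feature_2\n"
profiles_csv = (
    "Metadata_Well,Cells_Example_Feature_1,Cells_Example_Feature_3,Nuclei_Example_Feature_2\n"
    "A01,0.5,1.2,-0.3\n"
    "A02,0.7,0.9,0.1\n"
)

# Collect the blocked feature names into a set for fast membership checks.
blocked = {row["blocklist"] for row in csv.DictReader(io.StringIO(blocklist_csv))}

# Keep every column (metadata included) that is not on the blocklist.
reader = csv.DictReader(io.StringIO(profiles_csv))
kept_columns = [name for name in reader.fieldnames if name not in blocked]
filtered = [{name: row[name] for name in kept_columns} for row in reader]
```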

@gwaybio gwaybio requested a review from shntnu May 19, 2021 16:02
Collaborator

@shntnu shntnu left a comment

Yikes, I totally missed the notification 23 days ago, sorry!

Everything looks good. Thank you for documenting everything point-by-point.

Just one q:

  • Adding the new blocklist features 510fe92

Do you want to update this as well? https://figshare.com/articles/dataset/Blacklist_Features_-_Cell_Profiler/10255811

(but no need to wait for that of course)

@gwaybio
Member Author

gwaybio commented May 21, 2021

Do you want to update this as well? https://figshare.com/articles/dataset/Blacklist_Features_-_Cell_Profiler/10255811

I don't think so... Although I do think that we want to update this figshare document to include other version-specific CellProfiler blocklists. At the very least, much more thought needs to go into updating it (much more thought for me at least!). As a separate but related note: I really want to do a deep dive into CellProfiler features... I think it's the first step to understanding generic morphology features, which we'll want to annotate with more interpretable biology. It'll also help us with interpreting DeepProfiler features in the future.

@gwaybio gwaybio merged commit 60d85f6 into broadinstitute:master May 21, 2021
@gwaybio gwaybio deleted the data-freeze branch May 21, 2021 20:13
@shntnu
Collaborator

shntnu commented May 21, 2021

As a separate but related note: I really want to do a deep dive into CellProfiler features... I think it's the first step to understanding generic morphology features, which we'll want to annotate with more interpretable biology. It'll also help us with interpreting DeepProfiler features in the future.

There are two resources that I can think of that will be relevant for this effort

  1. A well-documented readme https://github.com/carpenterlab/2016_bray_natprot/wiki/What-do-Cell-Painting-features-mean%3F
  2. An incomplete notebook: https://github.com/cytomining/cytominergallery/blob/105be87878d13024ef283e284cf085351b498883/notebooks/empty_readouts.Rmd (knit here: https://rpubs.com/shantanu/cp_feature_stats)
