Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cell clustering steps in #420

Merged
merged 36 commits into from
Jul 9, 2021
Merged

Add cell clustering steps in #420

merged 36 commits into from
Jul 9, 2021

Conversation

alex-l-kong
Copy link
Contributor

@alex-l-kong alex-l-kong commented May 5, 2021

What is the purpose of this PR?

Closes #365. Closes #417. Cell clustering is the final major non-visualization-heavy step, which this PR addresses.

How did you implement your changes

We can piggyback off of the results we get from pixel clustering since we use the SOM and meta clusters assigned there. Also, there is no need for batch processing or subsetting because there are far fewer cells than there are pixels.

Remaining issues

A few things I'm waiting from Candace:

  • 99.9% normalization: do we need it for the consensus clustering step or not? UPDATE: yes, we will.
  • Decay rate parameter: do we need the alpha to be customizable, or will a default of 0.05 to 0.01 work? UPDATE: yes, alpha should be customizable.
  • Possibly more to come

Testing will be held off until these are resolved.

@alex-l-kong alex-l-kong self-assigned this May 5, 2021
@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@alex-l-kong alex-l-kong requested review from ngreenwald and removed request for ngreenwald May 11, 2021 03:09
@alex-l-kong
Copy link
Contributor Author

Not completely done verified with Candace's outputs yet, but the cell clustering structure has been solidified. I've also put the requests Candace has made for the time being in.

Copy link
Member

@ngreenwald ngreenwald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall a good start, it's clear you put in a lot of work. The main issue is with the comments/documentation. Many of the docstrings and comments are not sufficiently descriptive to actually help the reader understand what is being done. For many of the functions, I honestly couldn't figure out what they were supposed to do based on what you had written.

It's super important to provide a sufficient level of detail in your comments so that someone doesn't have to examine the underlying code directly or hop down three levels of helper functions to figure out what is going on. The point of comments is to avoid that mental work. As you continue to submit PRs with code that's more and more separate from what we've done before, this type of communication will only become more crucial. When it was stuff related to the segmentation code I could basically figure it out even if the comments weren't helpful since I had written the underlying stuff, but this isn't the case anymore.

ark/phenotyping/create_cell_som.R Show resolved Hide resolved
ark/phenotyping/run_cell_som.R Outdated Show resolved Hide resolved
ark/phenotyping/run_cell_som.R Show resolved Hide resolved
ark/phenotyping/cell_consensus_cluster.R Outdated Show resolved Hide resolved
ark/phenotyping/cell_consensus_cluster.R Outdated Show resolved Hide resolved
ark/phenotyping/som_utils.py Outdated Show resolved Hide resolved
ark/phenotyping/som_utils.py Outdated Show resolved Hide resolved
ark/phenotyping/som_utils.py Outdated Show resolved Hide resolved
ark/phenotyping/som_utils.py Outdated Show resolved Hide resolved
ark/phenotyping/som_utils.py Outdated Show resolved Hide resolved
@alex-l-kong
Copy link
Contributor Author

For the time being, I'm going to leave 99.9% normalization in. We can always remove it if needed.

Copy link
Member

@ngreenwald ngreenwald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, a lot of miscellaneous comments. @cliu72 can correct me, but I don't think there's any scenario where we would want to train a cell SOM on only a subset of the pixel metaclusters, correct? The logical workflow is to either include all of the pixel clusters, or all of the pixel metaclusters, but never a subset of one of them.

If that's correct, I think we should remove those options, since right now they're not implemented correctly anyway.

ark/phenotyping/cell_consensus_cluster.R Outdated Show resolved Hide resolved
ark/phenotyping/cell_consensus_cluster.R Outdated Show resolved Hide resolved
ark/phenotyping/cell_consensus_cluster.R Outdated Show resolved Hide resolved
ark/phenotyping/create_cell_som.R Outdated Show resolved Hide resolved
ark/phenotyping/create_cell_som.R Outdated Show resolved Hide resolved
ark/phenotyping/som_utils.py Outdated Show resolved Hide resolved
ark/phenotyping/som_utils.py Show resolved Hide resolved
ark/phenotyping/som_utils.py Show resolved Hide resolved
ark/phenotyping/som_utils.py Outdated Show resolved Hide resolved
ark/phenotyping/som_utils.py Outdated Show resolved Hide resolved
@alex-l-kong
Copy link
Contributor Author

Still haven't added testing functions (added and restructured a lot of functions), but cluster_pixels/run_trained_som are several orders faster when we precompute the row sums and remove the cbind call.

Copy link
Member

@ngreenwald ngreenwald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, just a few comments about the input validation

ark/phenotyping/run_pixel_som.R Outdated Show resolved Hide resolved
ark/phenotyping/som_utils.py Outdated Show resolved Hide resolved
ark/phenotyping/som_utils.py Outdated Show resolved Hide resolved
ark/phenotyping/som_utils.py Show resolved Hide resolved
@alex-l-kong
Copy link
Contributor Author

In addition to the code review comments being addressed, the tests have (finally) been added!

Copy link
Member

@ngreenwald ngreenwald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewing just the new code first, there's a ton of tests, need to spend more time to go through those.

Have you run everything through from start to finish?

ark/phenotyping/som_utils.py Show resolved Hide resolved
ark/phenotyping/som_utils.py Show resolved Hide resolved
Copy link
Member

@ngreenwald ngreenwald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Some organization stuff, but I think it makes more sense to revisit once everything is on the branch and ready to merge to master


for fov in fovs:
# read the pixel data for the fov
pixel_data = feather.read_dataframe(os.path.join(preprocessed_path,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good, makes it much easier to read with the helper function

@ngreenwald ngreenwald merged commit 962112d into new_som Jul 9, 2021
@ngreenwald ngreenwald deleted the cell_cluster branch July 9, 2021 22:58
ngreenwald pushed a commit that referenced this pull request Mar 9, 2022
* Initial commit of new SOM branch

* R SOM integration (#375)

* Initial commit of new cluster boost step

* Add RST files

* Try explicitly downloading cytolib

* Try directly installing cytolib

* See if changes to cytolib library made it in

* Keep only the TIFs we absolutely need for testing

* Remove old preprocessing.py and cluster.py files: everything now in som_utils.py and som_runner.R

* Decrease size of FlowSOM images to speed up testing

* Reduced size of images for segmentation testing too, and adjusted function arguments accordingly

* Update to testing: only unit test example_flowsom_clustering.ipynb, and mock up create_pixels instead of directly running it

* Add new testing framework

* Change the directory structure of the channel data and the segmentation labels, which also requires tests to be updated

* Fix a typo in segmentation_dir for testing example_flowsom_clustering.ipynb

* Change path in som_utils.py

* Remove comments

* Adjust path name parameter passing for cluster_pixels and som_runner.R

* Add coverage by checking path names in cluster_pixels

* Basic path modifications

* Add framework for subsetting SOM (#379)

* Initial commit of new subsetting strategy

* Finalize subsetting method

* Fix notebook testing for subsetted som

* Add a ton of testing

* Nuke references to point

* Add tables library to fix import errors, and fix PYCODESTYLE

* Remove MIBItiff tests for flowsom clustering and mock up cluster_pixels in notebook test to write to HDF5 file

* PYCODESTYLE

* Reorganize SOM

* Remove old functions and adjust notebook tests for new structure

* Correct typo

* Add next version of SOM

* Need to add feather to autodoc

* Documentation fix for cluster_pixels

* Documentation fix for train_som

* Add seed and num_passes default arguments

* Remove extraneous comments

* Address code review comments outside of the normalization

* Fix indentation in conf.py

* Address code review comments

* Nuke another h5py

* Add consensus clustering process (#394)

* Initial commit of consensus clustering

* Update notebooks testing to include dummy consensus cluster results creation

* Address code review comments

* Add 99.9% normalization using subsetted data to run_trained_som.R (#396)

* Batch the FlowSOM process (#399)

* Batch preprocessing step

* Adjust notebook to handle new preprocessing

* Add implementation with multiple fovs batched in run_trained_som.R

* Remove old cluster batching code

* Batch segmentation label loading

* Change segmentation labels loading to numpy

* Remove old commented code, change create_fov_pixel_mat to return matrices and create_pixel_matrix to save matrices

* Fix consensus clustering so it assigns hclust values to each pixel (#407)

* Fix hierarchical clustering to assign to every element

* Add testing for new consensus clustering method

* Remove extra comma

* Add heatmap visualizations in for SOM results (#410)

* Initial results for visualizing the SOM heatmap

* Adjust tests to account for new directory setting in example_flowsom_clustering

* Add random seeds

* Add seed call to ConsensusClusterPlus

* Add numPasses back

* Remove channels argument from create_pixel_matrix

* Add save directories based on datetime to separate runs

* Address code review comments (part 1)

* Add numeric random seed

* Code review comments part 1 (no directory structure fix yet)

* Add new directory setup and notebook testing

* Fix typo

* Fix directory creation bugs in notebook tests

* Change cell ordering

* Add metadata columns to SOM cluster and consensus cluster data (#418)

* Add Candace's suggestions

* Update tests to reflect new method of creating and reading .feathers

* Change cluster assignment step from cbind to $ for minor speedup

* Fix notebooks_test

* Add _feature_1 generation

* Re-add notebook tests

* Add cell clustering steps in (#420)

* Initial setup of train_cell_som

* Add initial cell clustering results

* Forgot to update tests with new function names

* More function fixes

* PYCODESTYLE

* Fix documentation of cell_consensus_cluster

* Another docstring fix

* Add parameters to set learning rate and normalize in run_cell_som.R

* Remove unnecessary subset

* Fix 99.9% normalization in run_cell_som.R

* Typo

* Add row sum normalization and removal of zero sum rows to R script files

* Confirm cell clustering

* Make comments clearer

* Remove redundant fov from fov_pixel_data

* Fix parameters in example_flowsom_clustering.ipynb

* Remove subtracting by 1 error

* Clarify comments for cell clustering

* Almost all code review comments addressed

* Add efficiencies to pixel clustering and cell table merging

* Fix comments and print outs

* PYCODESTYLE

* Tests, lot's of tests...

* Add check to verify_same_elements that tests for order, and make sure ordering is ensured before run_pixel_som and run_cell_som

* Fix col_index -> column_index

* More index fixing

* Index fixes for clustering cells

* Add updated FlowSOM notebook with results

* Add metadata for cluster_cells test

* Move row sum normalization to Python, keep column percentile normalization in R

* Update FlowSOM notebook, and update notebook tests (hold off on cell clustering tests until we see what Brian discovers about Travis testing with R)

* Update error message printed by verify_same_elements if enforce_order=True

* PYCODESTYLE

* Remove _norm from file paths

* Add more descriptive notes to the pixel/cell clustering notebook (#428)

* Add descriptive notes to the FlowSOM notebook, add more argument setting parameters, and update notebook tests accordingly

* Fix call to execute in notebooks_test_utils

* Rename example_flowsom_clustering to example_pixel_cell_clustering

* Address code review comments

* Fix notebook merge

* Create pixel and cell cluster overlays (#431)

* Initial commit

* Add overlay visualizations

* Add tests for pixel overlay and fix segmentation mask generation

* Add cluster_col and column_prefix checks where needed

* Add code to save pixel and cell masks as .tiff files in example_pixel_cell_clustering

* Create datasets that integrate with remapping GUI (#447)

* Add functions to generate data formats for Zak's visualization

* Add option to keep the count column when aggregating the pixel cluster data

* Address more code review comments

* PYCODESTYLE

* Separate pixel and cell clustering processes, add code to generate data files for Zak

* Reassign pixel_consensus_dir back to None in example_cell_clustering.ipynb

* Add more descriptive visualizations for pixel and cell clustering (#460)

* Visualizations 1-4 in progress

* Fix docstring

* Fix a few tests

* Typo in docstring

* Docstring fix

* Docfix in right location

* Visualizations 5-6 added in

* Adjust tests to account for cell_size normalization

* Visualizations 7-8 done

* Docstring fix in compute_cell_cluster_counts

* Version commit of cell clustering

* Address a lot of Candace requests

* Update pipeline to include weighted cell channel expression heatmaps

* Notebook documentation link sweetener

* Initial addressing of code review comments

* More code review comments addressed

* Remove comments

* Update notebooks to include changes made in addressing code review comments

* Remove stray line in data_utils.py that clipped all segmentation label 1's from Candace's dataset

* Remove viz arguments to former viz functions that now just return the data

* Convert visualization into pure computational functions and add more testing

* Update cell clustering notebook to separate computation from visualization

* Update pixel clustering notebook

* Update cell clustering notebook

* Address a TON of code review comments

* Docstring

* Add img_sub_folder argument to create_pixel_matrix

* Suffix fix in data_utils

* Add img_sub_folder argument into example_pixel_clustering

* Replace cluster_prefix with None

* Initial work on .csv saving (likely to be retconned)

* Address final code review comment (saving to CSV)

* Doc fix

* Add new notebooks

* Clarify comment in cell clustering notebook

* Add FINAL final correction (normalize by cell size prior to running R)

* Update notebook documentation: saving to .csv, not .feather

* First Candace comments addressed (everything not in the notebook, and NOTE in pixel clustering

* Doc fix

* Add prefixes to all variables in the notebooks

* Address a hella ton of code review comments

* Doc fix in data_utils

* Adjust data_utils tests for new column names in pixel and cell cluster mask generation

* Fix last code review comment (do not save two versions of weighted channel average per cell cluster)

* Tested cell and pixel clustering on various cluster column pairs

* Address code review comments except for no segmentation label support

* Doc fix

* Doc fix (round 2)

* Doc fix (round 3)

* Fix notebook test for pixel clustering

* Support pixel clustering without segmentation images

* Typo in doc

* Adjust pipeline to accommodate segmentation_dir = None

* Doc fix

* More doc fixes

* Fix notebook test

* Explicitly specify max_k and cap in notebook

* Documentation fixes

* Fix a comment

* Remove preprocessed_dir and subsetted_dir from documentation at the top

* Remove TODO from pixel clustering

* Update documentation in cell clustering

* Update weighted cell channel average computation to take file path to weighted cell table

* Adjust cell pipeline for weighted cell channel avg

* Remove cell outputs from cell clustering notebook

* Fix column subset

* Condense cells in pixel and cell clustering process

* Update notebook utils

* Address one final comment

* Address Candace's review comments

* Address next set of Candace review comments (biggest change: z-score bug fixed)

* Minor change to documentation of JSON params saved in pixel clustering

* Add lower-bound capping using pmax

* clusterCountsNorm to clusterCountsNormSub for consistency

* Include both cell SOM and meta visualizations for weighted channel average, and nuke more mentions of hCluster_cap

* Add option to specify dtype when creating the pixel matrix

* Doc fixes

* Add more detail for cell_table_name in cell clustering notebook

* Fix comments in cell clustering

* Code review comments addressed

* Move z-scoring back to R, inputs to Zak's visualization now un-z-scored

* Add remapping scheme to som_utils.py

* Add re-mapping scheme

* Fix cell clustering name

* Need to add cell size normalization for weighted cell channel table

* Fix tick_range None

* Change default plot back to 'tab20'

* Add the cmap arg back to the pixel and cell clustering notebooks

* Nuke z-score in notebook

* Add a test to ensure the cell table creation fails if the user provides a cell table without the required columns

* Doc fixes in the cell clustering notebook

* Update to take into account user renaming of meta clusters, and ensure the colors for each metacluster are consistent across every visualization

* Add tests for remapping cell clusters

* Fix notebook typo

* Notebook update

* Add test to ensure columns and SOM clusters are valid for remapping purposes

* Add a test for the pixel and cell cluster overlay

* Add tests to ensure the pixel meta clusters are in the right order when remapped

* Add tests to ensure the cell meta clusters are in the right order

* Clarify comments

* Fix some docstrings related to the pixel SOM/meta cluster counts per cell

* One more minor fix

* More doc fixes

* More notebook documentation fixes

* More doc fixes

* Reduce the time window for automatically rebuilding Docker on new requirements.txt to 30 minutes

* Integrate interactive remapping workflow into the pixel and cell clustering pipeline (#497)

* Initial commit of remapping addition to clustering pipeline

* Add heatmap testing

* Rework tests for file reading in metacluster_remap_gui

* Add sample notebooks with re-visualization scheme

* Relabel barchart, fix off-by-one meta cluster colormap error, ensure heatmap meta cluster colors are grouped together

* Don't divide pixel and cell cluster counts by 1000 when displaying the bar chart

* Add sample notebooks with new updated visualizations

* Clarifying comment

* Remap doc fixes

* Remove duplicate comment about referring to documentation for cell weighting mechanism

* Mega OCD

* Prevent loading of .DS_Store files when generating pixel cluster masks

* Bug fix in io_utils.list_files

* Address all code review comments except for column names in interactive remapping

* Add notebooks with seg_suffix in JSON params and cell heatmap without pixel_meta_cluster_rename_ prefix

* Add test for optional prefix trimming of column names in metacluster_from_files

* Patch file loading

* Need to specify loading .tif and .tiff files only

* Fix path

* Fix path to pixel clustering

* Fix notebook tests and deploy new image to Docker

* Use travis_wait to force 20 minute timeout

* Don't deploy on PR check anymore

* Add specific pixel_output_dir and cell_output_dir folders to store intermediate files

* Remove extraneous files

* Remove mibi

* Remove QC metric notebook test

* Change function calls in notebooks_test.py

* Add cell clustering test pipeline

* Adjust pixel cluster test pipeline to be similar to cell cluster test pipeline

* Add check_contrast = False

* Remove commented notebook tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants