Boost clustering process (aka integrate R SOM into clustering pipeline) #368

Closed
wants to merge 398 commits

Conversation

@alex-l-kong (Contributor) commented Dec 10, 2020:

What is the purpose of this PR?

Closes #363. One of our current bottlenecks is the cluster assignment step: we assign clusters row-by-row, and it doesn't need to be that way. This PR transitions that step to a batch-processing approach.

How did you implement your changes

Batch processing means playing with numpy broadcasting, since many of these operations require reshaping arrays and expanding dimensions. Broadcasting lets us process thousands, even millions, of rows at once, although the returns diminish as the batch size grows. A rough sketch of the idea is below.
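
For reference, here is a minimal sketch of the batched-assignment idea (the function name, argument shapes, and default batch size are illustrative, not the exact code in this PR): each batch of rows is compared against every SOM node at once via broadcasting, instead of looping row-by-row.

```python
import numpy as np

def assign_clusters_batched(data, weights, batch_size=100_000):
    """Assign each row of `data` to its nearest SOM node.

    data: (n_rows, n_channels) preprocessed pixel matrix
    weights: (n_nodes, n_channels) flattened SOM weight vectors
    """
    labels = np.empty(data.shape[0], dtype=np.int64)
    for start in range(0, data.shape[0], batch_size):
        batch = data[start:start + batch_size]
        # broadcast (b, 1, c) against (1, n_nodes, c) -> (b, n_nodes, c)
        diffs = batch[:, np.newaxis, :] - weights[np.newaxis, :, :]
        dists = np.linalg.norm(diffs, axis=2)  # (b, n_nodes)
        labels[start:start + batch_size] = dists.argmin(axis=1)
    return labels
```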

UPDATE: we will no longer be using our own implementation, but rather R's. It runs dramatically faster thanks to precompiled C code and much better memory management than MiniSOM, and it's easy to integrate since R plays fairly nicely with Docker.

Remaining issues

With a batch size of 1,000,000, we see roughly a 1/3 speedup over going row-by-row. That's good, but it may not be good enough, considering Candace's dataset is actually on the smaller side. There are a few things we should consider:

  • Reducing the number of operations: there are currently a lot of reshape and swapaxes calls because of how broadcasting works in numpy, and some of them can likely be eliminated. Feel free to tackle this low-hanging fruit in the comments section!
  • Calculating an optimal batch size to use. I'm not sure whether this is possible, but if you know of a way, please let me know!
  • Multithreading/multiprocessing: for an even larger boost, we could also process the batches in parallel (see the sketch after this list). This depends on how many cores the user has, and the memory intensity could kill kernels. Still, if numpy broadcasting alone has hit a lower bound on time, parallelization could (in an ideal world) 2x or 4x the process.
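
The parallel idea could look roughly like this (a sketch only; `_assign_batch` reuses the broadcasting step from the earlier snippet, and the worker count and batch size are placeholders). Note that each worker receives its own copy of the weights and its batch, which is exactly the memory pressure called out above.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np

def _assign_batch(args):
    batch, weights = args
    diffs = batch[:, np.newaxis, :] - weights[np.newaxis, :, :]
    return np.linalg.norm(diffs, axis=2).argmin(axis=1)

def assign_clusters_parallel(data, weights, batch_size=100_000, max_workers=4):
    # each submitted batch is pickled to a worker process, so memory use
    # scales with max_workers * batch_size
    batches = [(data[i:i + batch_size], weights)
               for i in range(0, data.shape[0], batch_size)]
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return np.concatenate(list(pool.map(_assign_batch, batches)))
```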

UPDATE: with R SOM integration, we no longer need to worry about these.

ackagel and others added 30 commits August 13, 2020 16:05
* compute_close_cell_num_random speedup

* removed redundancy from compute_close_cell_num_random and corrected other calls

* refactored close_cell_num computation

* forced mark1poslables to ints

* corrected enumerated for loop logic

* fixed some pep8 mistakes

Co-authored-by: ackagel <kagelosaurus@gmail.com>
Co-authored-by: Noah Greenwald <noahfgreenwald@gmail.com>
* code cleanup

* cleanup fix

* new edits

* other fixes

* docstring fixes

* final fixes

* marker_num fix

* neighborhood fix

Co-authored-by: Jaiveer Singh <jaiveers.24@gmail.com>
* change package name

* renamed references

* updated notebooks

* update travis command

* removed remaining reference to segmentation

* pep8 from copy/paste

* reordered test functions, removed old code

* more sneaky pep8

* typo
* update docker commands for new name

* Update README.md
* New and improved (like 2-4 lines changed haha) spatial analysis testing with random seed for no enrich and removal of duplicate tests

* Just had to add that comment about 42 to botch PEP8...

* Hopefully changing the random seed works (goodbye 42...)

Co-authored-by: Noah Greenwald <noahfgreenwald@gmail.com>
* exclude master from branch build

* logic for travis

* add PRs

* add branch
* requirements test file

* removed test packages from requirements
* test driven development

* fixed indexing error
* Increase coverage of data_utils by adding test functions for combine_point_directories, stitch_images, and split_img_stack, also add documentation where missing

* Fix tests

* Remove extraneous print statement comments

* Address code review comments about documentation and stitching test

* Add test of output shape for test_split_img_stack

* Fix dimensions test and annoying bug that causes io.imread to flip the image to (N, N, 3) if saved as (3, N, N)
* Fix non-descriptive variable naming and add comments describing cell size normalization and arcsinh transformations

* Update comment which describes normalized and transformed data

* Make comments more clear about normalizing and transforming

Co-authored-by: Noah Greenwald <noahfgreenwald@gmail.com>
* added comments to start_docker.sh

* added update flag (untested)

* minor syntax error fixes in start_docker.sh

* removed updated_ script naming system

* updated readme

* clarified warning in readme

* moved template/scripts updating to separate shell script for testing

* added testing function for notebook updating bash script

* Clarify update logic

Co-authored-by: ackagel <kagelosaurus@gmail.com>
Co-authored-by: Noah Greenwald <noahfgreenwald@gmail.com>
* Initial directory setup of Sphinx

* Add missing folders to docs directory

* Fix requirements.txt typo for recommonmark

* Add HTML directory from Makefile

* Remove sphinxcontrib from requirements.txt

* Update with more detailed examples and parsing

* Remove extraneous files from _build

* Add /_build to the .gitignore so it won't try to commit files from there (use Makefile to generate them OYO)

* Add more config parameters and libraries needed to requirements.txt

* Update index.rst to handle new Markdown files

* Add _markdown files used to generate code

* Add more files to markdown generated

* Fix annoying numpy.core.multiarray import error and add mock library to requirements.txt

* Remove extraneous markdown files and separate documentation-specific libraries into separate requirements.txt file

* Check what happens when requirements.txt is re-added

* Fix requirements and add new markdown without module content

* Add xarray intersphinx (still need to address potential issues with this), mostly to see if readthedocs now passes

* Forgot to add ipython to rtd-requirements.txt

* Make description in ark.analysis.md clearer
* Make contributing file

* Create pull_request_template.md

* Add links to git and github explainers
* disable ignore E501 in pytest.ini

* fixed line length errors in analysis folder

* undid mega tabs in analysis

* full google style guidification of docstings in analysis

* line length fixes in segmentation + some docstring revions

* removed line length errors in utils

* fixed statsmodels import error and corrected calc_dist_matrix docstring

* added more accurate type info to docstrings

* fixed return+type docstrings

* fixed mislabeled types in utils files

* added pandas to intersphinx autodoc

* updated testing tools

* fixed tox.ini keyword

* add skip explainations to style testing

* correct capitalization and test md in return docstring

* test return docstring

* another docstring test

* test new returns: docstring standard

* test codeblock in docstring

* another codeblock test

* add markdown coversion to docstrings

* another try at return docstrings in spatial analysis

* forgot to save and then commit

* restyled md in docstring

* prevent accidental indented codeblocks

* praying for correct formating

* last attempt at md fix

* undid md to rst in docstrings and editted analysis+segmentation

* switch mds to rsts (might be causeing some doc skips)

* undid md to rst

* mock mibidata imports so data_utils and marker_quant get built

* added doclink to readme and standardized return docstrings

* apply Noah's formatting recommendations

* fix transform_expression_matrix docstring

* removed redunant description on tupled return docstrings

Co-authored-by: ackagel <kagelosaurus@gmail.com>
Co-authored-by: Noah Greenwald <noahfgreenwald@gmail.com>
* started jupyter notebook

* Updated and Added Function Calls

* Added data for 4 patients

* added more example calls

* switched to new name

* documentation

* init commit

* added comments

* tested

* ran test

* Fixed UMAP import

* generated graphs

* ran example spatial w/ error

* reran it

* no more errors

* fixed data

* Fixed x-axis labels

* New Graphs working

* Added save directory

* Fixed spelling error

* added heading

* generated graphs

* Fixed typo

* Fixed File Directory

* coverage

* removed generated files

Co-authored-by: Neeraj Rattehalli <neeraj.rattehalli@gmail.com>
Co-authored-by: Kevin Chu <kevinlc221@gmail.com>
Co-authored-by: William Xu <williamxu132@gmail.com>
* Fix randomization error in synthetic spatial gen and begin updating type of distance matrix to xarray

* Update notebooks and spatial_analysis_utils functions to handle xarray data type for distance matrices

* Forgot to change spatial analysis script outside of Docker

* Further integration

* Fix closenum tests

* Address minor changes (aka other than dict passing)

* Fix centroid labelling with prop.label, which makes life hella easier

* Change Erin's data and adjust accordingly

* Pycodestyle and xarray import crap

* Last fixes

* Update expression matrix, calc_dist_matrix, and notebook to reflect string-only fov indexing

* Forgot to add new segmentation labels

* Change list to .values for indexing purposes

* Break up logic and update comments

* Fix summing error in closenum random

* Remove old data files
* better test_cols in dim reduction test

* created test utils and move some data_utils_test functions in

* paired xr, tiff generation added to test_utils

* restructured fill variable in test utils

* added clear directory method to test utils

* swap one test in load_imgs_from_mibitiff with new format

* switch to set order filling and remove unneeded functionality from test_utils

* test_load_imgs_from_mibitiff uses test_utils

* unabreviate test utils import and clarify xarray comparison function

* moved all channels testing for mibitiffs into load_imgs_from_mibitiff test function

* rewrite test_load_imgs_from_multitiff

* use test_utils in test_load_imgs_from_tree

* added label generation and testing for load_imgs_from_dir

* refactor generate deepcell input tests

* refactor test_combine_xarrays

* refactor crop test functions

* reduce sizes in test_split_img_stack

* refactored image stitching and img stack splitting test functions

* segmented df generator added to test utils and included in analysis testing

* added recursion to xarray comparison to loop over multiple dims

* move create_test_extraction_data to test_utils.py

* small refactor of test_compute_marker_counts

* chaned np.alls to np.array_equals and clarified args in image xarray creation function

* fixed all zero checks in data_utils_test

* clarified args in xarray label making function in test_compute_marker_counts

* small refactor of expression matrix generation test functions

* refactor of complete extraction matrix generation tests

* removed .DS_Stores (idk how they got added)

* corrected errors in refactor

* got tired of writing f strings for test points and channel names

* added some doc strings to test_utils

* image and xarrays have default row and col size in test utils generators

* clarified 4D to 3D conversion in marker_quantification_test

* default arg support for xarray and image generation in test_utils

* reduced need for expand dims in marker quant tests

* alphabetical sorting moved into data loading functions (as opposed to testing functions)

* renamed _create_* functions to _write_* for clarity

* added most of the docstrings to test_utils

* added module to doc md

* added arg names to gen_fov_chan_names calls in data_utils_tests

* add last docstrings to test_utils

Co-authored-by: ackagel <kagelosaurus@gmail.com>
Co-authored-by: Noah Greenwald <noahfgreenwald@gmail.com>
* Add file for context dependent spatial analysis

* Move functionality to spatial_analysis

* Split up compute_pos_cell_labels and add starting functionality for context-dependent spatial analysis

* Forgot to remove old function

* Add clarifying comment for mark1labels_per_id

* This is why you also run the spatial_analysis_utils_test before pushing

* Remove None types for get_pos_cell_labels functions and add documentation to them

* Remove old get_pos_cell_labels function

* Line length oopsies

* Need to rebuild documentation

* Add new documentation which doesn't add submodule and ignores test files

* Reset documentation

* Remove quotations from sphinx-apidoc

* Handle more refactoring of spatial analysis and make Makefile ignore parameter simpler

* Add some get_pos_cell_labels tests and a cluster test for compute_close_cell_num

* Add test utils back to ark.utils.md

* Blank lines error fix'

* Add headers for modules back in
* See if adding fail_on_warning in a readthedocs yml file does the trick

* Change to 4 spaces indent for .readthedocs.yml

* Update documentation: remove extraneous md files and change Makefile to fail if warning caught

* Intentionally put a mistake in the documentation to try and force error

* Get configuration right to ensure we fail on warning

* Fix version error (removed)

* Add an intentional indentation error (appeared on local make html) to see if readthedocs fails

* Fix error when trying to break spatial_analysis.py

* Change .readthedocs.yml

* Add more configuration

* See if changing to version 2 changes anything

* Now fix intentional issue and see if readthedocs passes now

* This should fix it...

* Fix docstring in test_utils and configure conf.py to ignore ark.md

* Try out new conf.py settings to see if Sphinx makes the documentation for us

* Re-add _static and _templates and update params in conf.py to build documentation in correct locations

* Fix regex to ignore testing files

* Update .gitignore to ensure people do not try to add the _markdown files generated if testing locally

* Configure conf.py to catch docstring errors, and intentionally push a bad build to see if ReadTheDocs actually catches them

* Fix argument-list errors caught by autodoc-process-docstring

* conf.py now works to catch basic return errors not caught by arglist checking

* conf.py now handles argument-less function docstring checks, and fixed bulleted list issue in test_utils.py

* Remove TODOs in conf.py

* Remove extra arguments checks, leave just 1 all-encompassing one
* Fix error with nbsphinx that causes a color error on readthedocs: we don't need these libraries

* Change requirement to mibilib
* Add boxplot viz function

* Add test function for draw_boxplot and change save_dir functionality so it actually acts like a save_dir

* Fix leftover PYCODESTYLE errors

* Increase coverage with error tests, and fix split_vals check

* PYCODESTYLE

* Add new changes, including demo of draw_boxplot

* Fix visualize cells test and add plot barchart tests

* Finish up visualization cleanup, and adding creation of viz directory in notebook

* Fix documentation for get_sorted_data

* Change column to visualize boxplot for

* Make get_sorted_data args and variable names more clear for documentation purposes
@alex-l-kong (Contributor, Author) commented Jan 18, 2021:

This PR now contains the following:

  • A fully integrated R SOM implementation, with a separate R script running the process in the background. It reads in the preprocessed pixel matrix and can use the MapDataToCodes function to map cluster assignments onto the entire dataset using weights trained on just a small subset (see the hand-off sketch after this list).
  • Complete integration with Docker.
  • Complete integration with Travis, which runs the entire example_flowsom_clustering notebook and runs unit tests against some of its parameters.
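
For anyone skimming, the Python-to-R hand-off is conceptually just a subprocess call to the R script. The actual argument list of som_runner.R is defined in this PR, so the file names and ordering below are placeholders, not the real interface:

```python
import subprocess

# Placeholder invocation -- see som_runner.R in this PR for the actual CLI.
subprocess.run(
    ["Rscript", "som_runner.R",
     "preprocessed_pixel_mat.feather",   # input: preprocessed pixel matrix
     "som_weights.feather",              # output: trained SOM node weights
     "pixel_cluster_labels.feather"],    # output: per-row cluster assignments
    check=True,
)
```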

This PR does not address:

  • Subsetting, which needs to be handled as a separate command-line argument to som_runner.R; another PR will handle that. Currently, we simply train on, and run MapDataToCodes against, the same dataset passed in.
  • Preprocessing memory optimization to prevent Docker crashes; another PR will handle that as well.

Concerns:

  • Due to the heavy R setup required on Travis, the tests take a very long time to run. In particular, FlowSOM is a massive installation that requires a ton of dependencies to be compiled. I have not yet found a way to cache the required R packages on Travis, largely because Travis has very poor multi-language support (for example, you cannot specify separate caches for Python and R packages). If anyone has experience in this regard, I would appreciate the help.

@ngreenwald (Member):

Looks like you picked a name for a branch that existed previously, which is leading to a super messy history with 300+ commits. Can you copy your branch under a new name and reopen the PR?

That's pretty annoying re: the Travis installation. Worst case scenario, we'll have to mock FlowSOM and just test the preprocessing/postprocessing parts of the codebase.

@alex-l-kong (Contributor, Author):

@ngreenwald the reason the commit history is so long is that I ran into an issue after committing a file larger than 1 GB to the repo, and I messed up when trying to remove it from the commit history. The same bloated history showed up when I tried to create a separate branch (r-som). I'll have to figure out how to remove all the unnecessary commits without causing a ton of merge conflicts.

@ngreenwald (Member):

Got it. You should be able to rebase all of the commits in your current branch that come before the 1 GB file onto a new branch, then rebase all of the commits after the 1 GB commit onto that same new branch.

@alex-l-kong (Contributor, Author):

Now that there's a cleaner PR without all of the old commit history, this one's officially nuked.

@alex-l-kong deleted the cluster_boost branch March 12, 2021 22:44