Boost clustering process (aka integrate R SOM into clustering pipeline) #368
Conversation
* compute_close_cell_num_random speedup * removed redundancy from compute_close_cell_num_random and corrected other calls * refactored close_cell_num computation * forced mark1poslables to ints * corrected enumerated for loop logic * fixed some pep8 mistakes Co-authored-by: ackagel <kagelosaurus@gmail.com> Co-authored-by: Noah Greenwald <noahfgreenwald@gmail.com>
* code cleanup * cleanup fix * new edits * other fixes * docstring fixes * final fixes * marker_num fix * neighborhood fix Co-authored-by: Jaiveer Singh <jaiveers.24@gmail.com>
* change package name * renamed references * updated notebooks * update travis command * removed remaining reference to segmentation * pep8 from copy/paste * reordered test functions, removed old code * more sneaky pep8 * typo
* update docker commands for new name * Update README.md
* New and improved (like 2-4 lines changed haha) spatial analysis testing with random seed for no enrich and removal of duplicate tests * Just had to add that comment about 42 to botch PEP8... * Hopefully changing the random seed works (goodbye 42...) Co-authored-by: Noah Greenwald <noahfgreenwald@gmail.com>
* exclude master from branch build * logic for travis * add PRs * add branch
* requirements test file * removed test packages from requirements
* test driven development * fixed indexing error
* Increase coverage of data_utils by adding test functions for combine_point_directories, stitch_images, and split_img_stack, also add documentation where missing * Fix tests * Remove extraneous print statement comments * Address code review comments about documentation and stitching test * Add test of output shape for test_split_img_stack * Fix dimensions test and annoying bug that causes io.imread to flip the image to (N, N, 3) if saved as (3, N, N)
* Fix non-descriptive variable naming and add comments describing cell size normalization and arcsinh transformations * Update comment which describes normalized and transformed data * Make comments more clear about normalizing and transforming Co-authored-by: Noah Greenwald <noahfgreenwald@gmail.com>
* added comments to start_docker.sh * added update flag (untested) * minor syntax error fixes in start_docker.sh * removed updated_ script naming system * updated readme * clarified warning in readme * moved template/scripts updating to separate shell script for testing * added testing function for notebook updating bash script * Clarify update logic Co-authored-by: ackagel <kagelosaurus@gmail.com> Co-authored-by: Noah Greenwald <noahfgreenwald@gmail.com>
* Initial directory setup of Sphinx * Add missing folders to docs directory * Fix requirements.txt typo for recommonmark * Add HTML directory from Makefile * Remove sphinxcontrib from requirements.txt * Update with more detailed examples and parsing * Remove extraneous files from _build * Add /_build to the .gitignore so it won't try to commit files from there (use Makefile to generate them OYO) * Add more config parameters and libraries needed to requirements.txt * Update index.rst to handle new Markdown files * Add _markdown files used to generate code * Add more files to markdown generated * Fix annoying numpy.core.multiarray import error and add mock library to requirements.txt * Remove extraneous markdown files and separate documentation-specific libraries into separate requirements.txt file * Check what happens when requirements.txt is re-added * Fix requirements and add new markdown without module content * Add xarray intersphinx (still need to address potential issues with this), mostly to see if readthedocs now passes * Forgot to add ipython to rtd-requirements.txt * Make description in ark.analysis.md clearer
* Make contributing file * Create pull_request_template.md * Add links to git and github explainers
* disable ignore E501 in pytest.ini * fixed line length errors in analysis folder * undid mega tabs in analysis * full google style guidification of docstings in analysis * line length fixes in segmentation + some docstring revions * removed line length errors in utils * fixed statsmodels import error and corrected calc_dist_matrix docstring * added more accurate type info to docstrings * fixed return+type docstrings * fixed mislabeled types in utils files * added pandas to intersphinx autodoc * updated testing tools * fixed tox.ini keyword * add skip explainations to style testing * correct capitalization and test md in return docstring * test return docstring * another docstring test * test new returns: docstring standard * test codeblock in docstring * another codeblock test * add markdown coversion to docstrings * another try at return docstrings in spatial analysis * forgot to save and then commit * restyled md in docstring * prevent accidental indented codeblocks * praying for correct formating * last attempt at md fix * undid md to rst in docstrings and editted analysis+segmentation * switch mds to rsts (might be causeing some doc skips) * undid md to rst * mock mibidata imports so data_utils and marker_quant get built * added doclink to readme and standardized return docstrings * apply Noah's formatting recommendations * fix transform_expression_matrix docstring * removed redunant description on tupled return docstrings Co-authored-by: ackagel <kagelosaurus@gmail.com> Co-authored-by: Noah Greenwald <noahfgreenwald@gmail.com>
* started jupyter notebook * Updated and Added Function Calls * Added data for 4 patients * added more example calls * switched to new name * documentation * init commit * added comments * tested * ran test * Fixed UMAP import * generated graphs * ran example spatial w/ error * reran it * no more errors * fixed data * Fixed x-axis labels * New Graphs working * Added save directory * Fixed spelling error * added heading * generated graphs * Fixed typo * Fixed File Directory * coverage * removed generated files Co-authored-by: Neeraj Rattehalli <neeraj.rattehalli@gmail.com> Co-authored-by: Kevin Chu <kevinlc221@gmail.com> Co-authored-by: William Xu <williamxu132@gmail.com>
* Fix randomization error in synthetic spatial gen and begin updating type of distance matrix to xarray * Update notebooks and spatial_analysis_utils functions to handle xarray data type for distance matrices * Forgot to change spatial analysis script outside of Docker * Further integration * Fix closenum tests * Address minor changes (aka other than dict passing) * Fix centroid labelling with prop.label, which makes life hella easier * Change Erin's data and adjust accordingly * Pycodestyle and xarray import crap * Last fixes * Update expression matrix, calc_dist_matrix, and notebook to reflect string-only fov indexing * Forgot to add new segmentation labels * Change list to .values for indexing purposes * Break up logic and update comments * Fix summing error in closenum random * Remove old data files
* better test_cols in dim reduction test * created test utils and move some data_utils_test functions in * paired xr, tiff generation added to test_utils * restructured fill variable in test utils * added clear directory method to test utils * swap one test in load_imgs_from_mibitiff with new format * switch to set order filling and remove unneeded functionality from test_utils * test_load_imgs_from_mibitiff uses test_utils * unabreviate test utils import and clarify xarray comparison function * moved all channels testing for mibitiffs into load_imgs_from_mibitiff test function * rewrite test_load_imgs_from_multitiff * use test_utils in test_load_imgs_from_tree * added label generation and testing for load_imgs_from_dir * refactor generate deepcell input tests * refactor test_combine_xarrays * refactor crop test functions * reduce sizes in test_split_img_stack * refactored image stitching and img stack splitting test functions * segmented df generator added to test utils and included in analysis testing * added recursion to xarray comparison to loop over multiple dims * move create_test_extraction_data to test_utils.py * small refactor of test_compute_marker_counts * chaned np.alls to np.array_equals and clarified args in image xarray creation function * fixed all zero checks in data_utils_test * clarified args in xarray label making function in test_compute_marker_counts * small refactor of expression matrix generation test functions * refactor of complete extraction matrix generation tests * removed .DS_Stores (idk how they got added) * corrected errors in refactor * got tired of writing f strings for test points and channel names * added some doc strings to test_utils * image and xarrays have default row and col size in test utils generators * clarified 4D to 3D conversion in marker_quantification_test * default arg support for xarray and image generation in test_utils * reduced need for expand dims in marker quant tests * alphabetical sorting moved into data loading functions (as opposed to testing functions) * renamed _create_* functions to _write_* for clarity * added most of the docstrings to test_utils * added module to doc md * added arg names to gen_fov_chan_names calls in data_utils_tests * add last docstrings to test_utils Co-authored-by: ackagel <kagelosaurus@gmail.com> Co-authored-by: Noah Greenwald <noahfgreenwald@gmail.com>
* Add file for context dependent spatial analysis * Move functionality to spatial_analysis * Split up compute_pos_cell_labels and add starting functionality for context-dependent spatial analysis * Forgot to remove old function * Add clarifying comment for mark1labels_per_id * This is why you also run the spatial_analysis_utils_test before pushing * Remove None types for get_pos_cell_labels functions and add documentation to them * Remove old get_pos_cell_labels function * Line length oopsies * Need to rebuild documentation * Add new documentation which doesn't add submodule and ignores test files * Reset documentation * Remove quotations from sphinx-apidoc * Handle more refactoring of spatial analysis and make Makefile ignore parameter simpler * Add some get_pos_cell_labels tests and a cluster test for compute_close_cell_num * Add test utils back to ark.utils.md * Blank lines error fix' * Add headers for modules back in
* See if adding fail_on_warning in a readthedocs yml file does the trick * Change to 4 spaces indent for .readthedocs.yml * Update documentation: remove extraneous md files and change Makefile to fail if warning caught * Intentionally put a mistake in the documentation to try and force error * Get configuration right to ensure we fail on warning * Fix version error (removed) * Add an intentional indentation error (appeared on local make html) to see if readthedocs fails * Fix error when trying to break spatial_analysis.py * Change .readthedocs.yml * Add more configuration * See if changing to version 2 changes anything * Now fix intentional issue and see if readthedocs passes now * This should fix it... * Fix docstring in test_utils and configure conf.py to ignore ark.md * Try out new conf.py settings to see if Sphinx makes the documentation for us * Re-add _static and _templates and update params in conf.py to build documentation in correct locations * Fix regex to ignore testing files * Update .gitignore to ensure people do not try to add the _markdown files generated if testing locally * Configure conf.py to catch docstring errors, and intentionally push a bad build to see if ReadTheDocs actually catches them * Fix argument-list errors caught by autodoc-process-docstring * conf.py now works to catch basic return errors not caught by arglist checking * conf.py now handles argument-less function docstring checks, and fixed bulleted list issue in test_utils.py * Remove TODOs in conf.py * Remove extra arguments checks, leave just 1 all-encompassing one
* Fix error with nbsphinx that causes a color error on readthedocs: we don't need these libraries * Change requirement to mibilib
* Add boxplot viz function * Add test function for draw_boxplot and change save_dir functionality so it actually acts like a save_dir * Fix leftover PYCODESTYLE errors * Increase coverage with error tests, and fix split_vals check * PYCODESTYLE * Add new changes, including demo of draw_boxplot * Fix visualize cells test and add plot barchart tests * Finish up visualization cleanup, and adding creation of viz directory in notebook * Fix documentation for get_sorted_data * Change column to visualize boxplot for * Make get_sorted_data args and variable names more clear for documentation purposes
… it from example_visualization.ipynb
This PR now contains the following:
This PR does not address:
Concerns:
Looks like you picked a name for a branch that existed previously, which is leading to a super messy history with 300+ commits. Can you copy your branch under a new name and reopen the PR? That's pretty annoying re: the Travis installation. Worst case scenario, we'll have to mock FlowSOM and just test the preprocessing/postprocessing parts of the codebase.
@ngreenwald so the reason the commit history is so long is that I accidentally committed a file larger than 1 GB to the repo, and I messed up trying to remove the file from the commit history. The same bloated history carried over when I tried to create a separate branch (r-som). I'll have to figure out how to remove all the unnecessary commits without causing a ton of merge conflicts.
Got it. You should be able to rebase all of the commits in your current branch that come before the 1 GB file onto a new branch, then rebase all of the commits after the 1 GB commit onto that same new branch.
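For instance, here's a sketch of that suggestion with placeholder refs: `<base>` stands for the fork point from master, `<bigfile>` for the commit that added the 1 GB file, `old-branch` for the current messy branch, and `r-som-clean` is a made-up branch name.

```bash
git checkout -b r-som-clean <base>       # clean branch starting at the fork point
git cherry-pick <base>..<bigfile>^       # replay the commits before the 1 GB file
git cherry-pick <bigfile>..old-branch    # replay the commits after it, skipping it
```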
Force-pushed from d4ea67e to c5330a3.
With a better PR opened that doesn't drag along a ton of the older commit history, this one's officially nuked.
What is the purpose of this PR?
Closes #363. One of the bottlenecks we're currently facing is the cluster assignment step: we do it row-by-row, and it doesn't need to be this way. This PR transitions cluster assignment to a batch-processing approach.
How did you implement your changes?
We have to play around with `numpy` broadcasting because many of these operations require reshaping arrays and expanding dimensions. `numpy` broadcasting lets us process thousands, even millions, of rows at once, although there are diminishing returns as the batch size grows.

UPDATE: we will not be using our own implementation anymore, but rather R's SOM implementation. It runs dramatically faster thanks to pre-compiled C code and much better memory management than MiniSOM. It's also easy to integrate, as R actually plays fairly nicely with Docker.
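For concreteness, here's a minimal sketch of the batched broadcasting idea described above, assuming an already-trained SOM weight matrix; the function and variable names are hypothetical, not taken from the codebase:

```python
import numpy as np

def assign_clusters_batched(pixel_mat, som_weights, batch_size=1_000_000):
    """Assign each row of pixel_mat to its nearest SOM node, one batch at a time.

    pixel_mat: (n_rows, n_channels) array of pixel expression values.
    som_weights: (n_nodes, n_channels) array of trained SOM node weights.
    Returns an (n_rows,) array of winning-node indices.
    """
    labels = np.empty(pixel_mat.shape[0], dtype=np.int64)
    for start in range(0, pixel_mat.shape[0], batch_size):
        batch = pixel_mat[start:start + batch_size]
        # Broadcasting: (batch, 1, channels) - (1, nodes, channels) yields a
        # (batch, nodes, channels) array of differences in one shot, instead
        # of looping over rows.
        diffs = batch[:, np.newaxis, :] - som_weights[np.newaxis, :, :]
        # Sum squared differences over the channel axis -> squared Euclidean
        # distance from every row in the batch to every SOM node.
        sq_dists = np.einsum('bnc,bnc->bn', diffs, diffs)
        labels[start:start + batch_size] = sq_dists.argmin(axis=1)
    return labels
```

The intermediate `diffs` array is the memory hot spot (`batch_size * n_nodes * n_channels` floats), which is one place the diminishing returns mentioned above come from: past a certain batch size, the intermediates no longer fit in cache and memory traffic dominates.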
Remaining issues
With a batch size of 1,000,000, we see roughly a 1/3 speed improvement over going row-by-row, which is good, but it might not be good enough considering Candace's dataset is actually on the smaller side. There are a few things we should consider:

- There are still some expensive operations in `numpy`; however, we can potentially reduce some of these. Feel free to tackle these low-hanging fruit in the comments section!
- Even with `numpy` broadcasting, we can consider 2x-ing or 4x-ing (in an ideal world) this process with parallelization.

UPDATE: with R SOM integration, we no longer need to worry about these.
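For reference, handing the SOM step off to R from the Python pipeline can be as simple as shelling out to an R script; the script name, arguments, and file format below are hypothetical placeholders, not the actual interface of this PR:

```python
import subprocess

# A minimal sketch: write the preprocessed pixel matrix to disk, let an R
# script (e.g., one built on the kohonen package) train the SOM and assign
# clusters, then read the labels back. Script name and paths are made up.
subprocess.run(
    ["Rscript", "run_pixel_som.R", "pixel_mat.feather", "cluster_labels.feather"],
    check=True,  # raise if the R process exits with a non-zero status
)
```

Since the R process runs inside the same Docker image, the only contract between the two sides is the on-disk file format, which is what makes the integration straightforward.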