Add notebook to run generic cell clustering #892

alex-l-kong · 2023-01-24T21:06:59Z

What is the purpose of this PR?

Addresses #881. Closes #853.

How did you implement your changes

This adds the generic cell clustering notebook to the pipeline, which allows users to start at step 3 without needing to run the full Pixie pipeline prior.

Only additional functionality needed internally in cell_cluster_utils.py is the ability to skip weighted cell channel processing steps.

review-notebook-app · 2023-01-24T21:07:03Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

…ark-analysis into start_cell_cluster

…re generalized inputs run without cell_size col

alex-l-kong · 2023-01-31T01:19:59Z

@ngreenwald @srivarra only thing left before review is to check in on the example flowchart. I'm not the most artistically strong, so would appreciate any inputs on how to best organize it after adding this option in!

Alternatively, we can choose to leave it out and allow the README.md text to do the talking.

ngreenwald · 2023-01-31T03:48:16Z

We can wait on the flow chart, it's not yet clear how this fits into the overall workflow, will depend on how we end up running the model. I would leave it out of the readme for now, and just have a brief explanation at the top of the notebook. Once we decide how to integrate it, we'll update all the documentation.

ngreenwald

See below, overall looking quite good. The preprocessing and postprocessing for generic clustering won't be the same as for Pixie, and won't always be the same depending on the inputs

ark/phenotyping/cell_cluster_utils.py

templates/3b_Generic_Cluster_Cells.ipynb

…ring notebook

…eneric cell clustering)

…ll clustering

ark/phenotyping/cell_cluster_utils.py

cliu72 · 2023-02-02T19:27:24Z

ark/phenotyping/cell_cluster_utils.py

+       os.path.exists(meta_cluster_channel_avg_path):
+        print("Already generated average weighted channel expression files, skipping")
+        return
+
    print("Compute average weighted channel expression across cell SOM clusters")
    cell_som_cluster_channel_avg = compute_cell_cluster_channel_avg(


It seems to be that compute_cell_cluster_channel_avg and compute_cell_cluster_expr_avg are doing very similar things (and have very similar names) - any way to just combine those 2?

We can spin up a combination solution, however there are three reasons I've left them separated:

Running compute_cell_cluster_channel_avg involves an additional file. This leads to enough components being different between the two (validation, subsetting, etc.) that I'd argue it's better to separate the two to avoid code bloat.

Code maintainability: if a future bug fix/patch/etc. involves the averaging step, it'd be (just a bit) easier to isolate the problem to one of these functions, as opposed to a single function that handles multiple cases.

Documentation: the way the functions are, it's clear what the column expression and channel averaging functions do and what arguments they take. I fear it'd be more difficult to understand if we combined all of this in one function.

@ngreenwald let me know what you think.

I could see it going either way honestly, up to you guys

I'm not going to die on this hill, but I think the functionality is so similar that it makes it more confusing for the user that they are two functions (especially with such similar names). Like when I was reviewing the code, I spent quite a bit of time trying to understand why it's 2 separate functions. In my view, code maintainability would actually be easier if it's one function since the functionality is basically the same. Basically all the function does is average over cell som/meta clusters for a set of input columns. The reason why compute_cell_cluster_channel_avg takes an additional input is because the clusters and the input features are in two different files - it's only a few lines to first combine those into one table, those lines could easily be moved to before wherever the function is being called.

But like I said, if you truly think it'll be annoying to combine them, we can leave it. But at the very lease, I think the function names should be changed so they're not so similar.

@cliu72 how about we make a separate issue for this and revisit it later this month? I do agree it's worth exploring options here, but I think this is extending beyond the scope of this particular PR.

At the very least, can we change the name of the functions as a part of this one? compute_cell_cluster_channel_avg and compute_cell_cluster_expr_avg have basically the same meaning.

cliu72

Looks pretty good to me, just a few small things

…t at intermediate steps

cliu72 · 2023-02-03T22:39:14Z

Also can we wait to merge this until my paper is accepted? Or remove the "3" from the notebook name until after my paper is accepted? Lorenz or whoever wants to use it can probably just go to this branch or tell them to use the appropriate notebook. I don't want to confuse my reviewers about why we're offering 2 notebooks for cell clustering.

ngreenwald · 2023-02-03T23:00:31Z

Yes, we'll definitely pick a different name in the meantime, we don't want people getting confused

alex-l-kong · 2023-02-04T01:38:22Z

@cliu72 @ngreenwald ok this PR should have addressed most everything so far. Let me know if there are any additional issues to resolve before merging this into #885 (not main). We can create additional PRs off #885 if any more issues come up (ex. adding option to turn off 99.9% normalization for generic cell clustering, renaming the notebooks).

Anything that involves the Pixie pipeline as a whole (ex. condensing the two averaging functions) will be addressed in a PR separate from #885.

ngreenwald

LGTM

…ustering process (#885) * Include explicit preprocessing steps for cell SOM input data * Add tests for helper cell cluster functions that changed with new process * Make a minor comment change to trigger a new build with updated package cache * Explicitly install charset-normalizer to fix dependency issues * Save unnormalized cluster counts for reference purposes * Add notebook to run generic cell clustering (#892) * First draft of generic cell clustering process * Ensure the None checks run properly for weighted cell channel validation * Attempt to add tests to 3b notebook * Ensure generic cell clustering integrated with notebooks * Add description of generalized cell inputs to data_types.md, and ensure generalized inputs run without cell_size col * Add generic cell clustering process to the README * Remove 3b notebook from README for now * Patch up remaining errors * Move cell cluster summary file generation to separate functions * Remove extraneous comments from notebook * Relocate segmentation variable settings to end of generic cell clustering notebook * Remove test for column mismatch in cluster_cells (needed to support generic cell clustering) * Massive renaming to avoid confusion of counts referring to generic cell clustering * PYCODESTYLE * Normalize new naming conventions; remove more counts refs * Fix parameter to test_generate_wc_avg_files * Update cell cluster pipeline to save Pixie results at the end, and not at intermediate steps * Remove comments from notebook 3 * Rename averaging functions to avoid ambiguity * Temporarily rename 3b_Generic_Cluster_Cells for Candace's paper * Remove unnecessary comment * Address hidden merge conflicts errors * Purge old HuggingFace references * Add seed param to cell_cluster_utils * Purge rmtree call * Change from tmi to alpineer * Change example dataset version back to main

First draft of generic cell clustering process

5324938

alex-l-kong self-assigned this Jan 24, 2023

alex-l-kong marked this pull request as draft January 24, 2023 21:07

alex-l-kong added 6 commits January 24, 2023 13:54

Ensure the None checks run properly for weighted cell channel validation

70726f6

Attempt to add tests to 3b notebook

8789699

Ensure generic cell clustering integrated with notebooks

7be48ca

Merge branch 'preprocess_pixie_cell' of https://github.com/angelolab/…

baeb3b9

…ark-analysis into start_cell_cluster

Add description of generalized cell inputs to data_types.md, and ensu…

2b13b99

…re generalized inputs run without cell_size col

Add generic cell clustering process to the README

b4b7851

alex-l-kong added 2 commits January 31, 2023 11:43

Remove 3b notebook from README for now

8a4110a

Patch up remaining errors

ef7642c

alex-l-kong requested review from ngreenwald, cliu72 and JLrumberger January 31, 2023 20:14

ngreenwald requested changes Jan 31, 2023

View reviewed changes

ark/phenotyping/cell_cluster_utils.py Show resolved Hide resolved

ark/phenotyping/cell_cluster_utils.py Outdated Show resolved Hide resolved

ark/phenotyping/cell_cluster_utils.py Outdated Show resolved Hide resolved

ngreenwald reviewed Jan 31, 2023

View reviewed changes

alex-l-kong added 6 commits February 1, 2023 14:55

Move cell cluster summary file generation to separate functions

698d7ba

Remove extraneous comments from notebook

39e1306

Relocate segmentation variable settings to end of generic cell cluste…

c741216

…ring notebook

Remove test for column mismatch in cluster_cells (needed to support g…

921c6e3

…eneric cell clustering)

Massive renaming to avoid confusion of counts referring to generic ce…

fc2951e

…ll clustering

PYCODESTYLE

7b24f0e