Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add notebook to run generic cell clustering #892

Merged
merged 20 commits into from
Feb 6, 2023

Conversation

alex-l-kong
Copy link
Contributor

@alex-l-kong alex-l-kong commented Jan 24, 2023

What is the purpose of this PR?

Addresses #881. Closes #853.

How did you implement your changes

This adds the generic cell clustering notebook to the pipeline, which allows users to start at step 3 without needing to run the full Pixie pipeline prior.

Only additional functionality needed internally in cell_cluster_utils.py is the ability to skip weighted cell channel processing steps.

@alex-l-kong alex-l-kong self-assigned this Jan 24, 2023
@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@alex-l-kong alex-l-kong marked this pull request as draft January 24, 2023 21:07
@alex-l-kong
Copy link
Contributor Author

@ngreenwald @srivarra only thing left before review is to check in on the example flowchart. I'm not the most artistically strong, so would appreciate any inputs on how to best organize it after adding this option in!

Alternatively, we can choose to leave it out and allow the README.md text to do the talking.

@ngreenwald
Copy link
Member

We can wait on the flow chart, it's not yet clear how this fits into the overall workflow, will depend on how we end up running the model. I would leave it out of the readme for now, and just have a brief explanation at the top of the notebook. Once we decide how to integrate it, we'll update all the documentation.

Copy link
Member

@ngreenwald ngreenwald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See below, overall looking quite good. The preprocessing and postprocessing for generic clustering won't be the same as for Pixie, and won't always be the same depending on the inputs

ark/phenotyping/cell_cluster_utils.py Show resolved Hide resolved
ark/phenotyping/cell_cluster_utils.py Outdated Show resolved Hide resolved
ark/phenotyping/cell_cluster_utils.py Outdated Show resolved Hide resolved
os.path.exists(meta_cluster_channel_avg_path):
print("Already generated average weighted channel expression files, skipping")
return

print("Compute average weighted channel expression across cell SOM clusters")
cell_som_cluster_channel_avg = compute_cell_cluster_channel_avg(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be that compute_cell_cluster_channel_avg and compute_cell_cluster_expr_avg are doing very similar things (and have very similar names) - any way to just combine those 2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can spin up a combination solution, however there are three reasons I've left them separated:

  1. Running compute_cell_cluster_channel_avg involves an additional file. This leads to enough components being different between the two (validation, subsetting, etc.) that I'd argue it's better to separate the two to avoid code bloat.
  2. Code maintainability: if a future bug fix/patch/etc. involves the averaging step, it'd be (just a bit) easier to isolate the problem to one of these functions, as opposed to a single function that handles multiple cases.
  3. Documentation: the way the functions are, it's clear what the column expression and channel averaging functions do and what arguments they take. I fear it'd be more difficult to understand if we combined all of this in one function.

@ngreenwald let me know what you think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could see it going either way honestly, up to you guys

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not going to die on this hill, but I think the functionality is so similar that it makes it more confusing for the user that they are two functions (especially with such similar names). Like when I was reviewing the code, I spent quite a bit of time trying to understand why it's 2 separate functions. In my view, code maintainability would actually be easier if it's one function since the functionality is basically the same. Basically all the function does is average over cell som/meta clusters for a set of input columns. The reason why compute_cell_cluster_channel_avg takes an additional input is because the clusters and the input features are in two different files - it's only a few lines to first combine those into one table, those lines could easily be moved to before wherever the function is being called.

But like I said, if you truly think it'll be annoying to combine them, we can leave it. But at the very lease, I think the function names should be changed so they're not so similar.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cliu72 how about we make a separate issue for this and revisit it later this month? I do agree it's worth exploring options here, but I think this is extending beyond the scope of this particular PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the very least, can we change the name of the functions as a part of this one? compute_cell_cluster_channel_avg and compute_cell_cluster_expr_avg have basically the same meaning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Copy link
Contributor

@cliu72 cliu72 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good to me, just a few small things

@alex-l-kong alex-l-kong marked this pull request as ready for review February 2, 2023 23:42
@cliu72
Copy link
Contributor

cliu72 commented Feb 3, 2023

Also can we wait to merge this until my paper is accepted? Or remove the "3" from the notebook name until after my paper is accepted? Lorenz or whoever wants to use it can probably just go to this branch or tell them to use the appropriate notebook. I don't want to confuse my reviewers about why we're offering 2 notebooks for cell clustering.

@ngreenwald
Copy link
Member

Yes, we'll definitely pick a different name in the meantime, we don't want people getting confused

@alex-l-kong
Copy link
Contributor Author

alex-l-kong commented Feb 4, 2023

@cliu72 @ngreenwald ok this PR should have addressed most everything so far. Let me know if there are any additional issues to resolve before merging this into #885 (not main). We can create additional PRs off #885 if any more issues come up (ex. adding option to turn off 99.9% normalization for generic cell clustering, renaming the notebooks).

Anything that involves the Pixie pipeline as a whole (ex. condensing the two averaging functions) will be addressed in a PR separate from #885.

Copy link
Member

@ngreenwald ngreenwald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alex-l-kong alex-l-kong merged commit 98f8a0c into preprocess_pixie_cell Feb 6, 2023
@alex-l-kong alex-l-kong deleted the start_cell_cluster branch February 6, 2023 19:39
alex-l-kong added a commit that referenced this pull request Feb 17, 2023
…ustering process (#885)

* Include explicit preprocessing steps for cell SOM input data

* Add tests for helper cell cluster functions that changed with new process

* Make a minor comment change to trigger a new build with updated package cache

* Explicitly install charset-normalizer to fix dependency issues

* Save unnormalized cluster counts for reference purposes

* Add notebook to run generic cell clustering (#892)

* First draft of generic cell clustering process

* Ensure the None checks run properly for weighted cell channel validation

* Attempt to add tests to 3b notebook

* Ensure generic cell clustering integrated with notebooks

* Add description of generalized cell inputs to data_types.md, and ensure generalized inputs run without cell_size col

* Add generic cell clustering process to the README

* Remove 3b notebook from README for now

* Patch up remaining errors

* Move cell cluster summary file generation to separate functions

* Remove extraneous comments from notebook

* Relocate segmentation variable settings to end of generic cell clustering notebook

* Remove test for column mismatch in cluster_cells (needed to support generic cell clustering)

* Massive renaming to avoid confusion of counts referring to generic cell clustering

* PYCODESTYLE

* Normalize new naming conventions; remove more counts refs

* Fix parameter to test_generate_wc_avg_files

* Update cell cluster pipeline to save Pixie results at the end, and not at intermediate steps

* Remove comments from notebook 3

* Rename averaging functions to avoid ambiguity

* Temporarily rename 3b_Generic_Cluster_Cells for Candace's paper

* Remove unnecessary comment

* Address hidden merge conflicts errors

* Purge old HuggingFace references

* Add seed param to cell_cluster_utils

* Purge rmtree call

* Change from tmi to alpineer

* Change example dataset version back to main
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants