-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add notebook to run generic cell clustering #892
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
…ark-analysis into start_cell_cluster
…re generalized inputs run without cell_size col
@ngreenwald @srivarra only thing left before review is to check in on the example flowchart. I'm not the most artistically strong, so would appreciate any inputs on how to best organize it after adding this option in! Alternatively, we can choose to leave it out and allow the |
We can wait on the flow chart, it's not yet clear how this fits into the overall workflow, will depend on how we end up running the model. I would leave it out of the readme for now, and just have a brief explanation at the top of the notebook. Once we decide how to integrate it, we'll update all the documentation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See below, overall looking quite good. The preprocessing and postprocessing for generic clustering won't be the same as for Pixie, and won't always be the same depending on the inputs
…eneric cell clustering)
os.path.exists(meta_cluster_channel_avg_path): | ||
print("Already generated average weighted channel expression files, skipping") | ||
return | ||
|
||
print("Compute average weighted channel expression across cell SOM clusters") | ||
cell_som_cluster_channel_avg = compute_cell_cluster_channel_avg( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to be that compute_cell_cluster_channel_avg
and compute_cell_cluster_expr_avg
are doing very similar things (and have very similar names) - any way to just combine those 2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can spin up a combination solution, however there are three reasons I've left them separated:
- Running
compute_cell_cluster_channel_avg
involves an additional file. This leads to enough components being different between the two (validation, subsetting, etc.) that I'd argue it's better to separate the two to avoid code bloat. - Code maintainability: if a future bug fix/patch/etc. involves the averaging step, it'd be (just a bit) easier to isolate the problem to one of these functions, as opposed to a single function that handles multiple cases.
- Documentation: the way the functions are, it's clear what the column expression and channel averaging functions do and what arguments they take. I fear it'd be more difficult to understand if we combined all of this in one function.
@ngreenwald let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could see it going either way honestly, up to you guys
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not going to die on this hill, but I think the functionality is so similar that it makes it more confusing for the user that they are two functions (especially with such similar names). Like when I was reviewing the code, I spent quite a bit of time trying to understand why it's 2 separate functions. In my view, code maintainability would actually be easier if it's one function since the functionality is basically the same. Basically all the function does is average over cell som/meta clusters for a set of input columns. The reason why compute_cell_cluster_channel_avg
takes an additional input is because the clusters and the input features are in two different files - it's only a few lines to first combine those into one table, those lines could easily be moved to before wherever the function is being called.
But like I said, if you truly think it'll be annoying to combine them, we can leave it. But at the very lease, I think the function names should be changed so they're not so similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cliu72 how about we make a separate issue for this and revisit it later this month? I do agree it's worth exploring options here, but I think this is extending beyond the scope of this particular PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the very least, can we change the name of the functions as a part of this one? compute_cell_cluster_channel_avg
and compute_cell_cluster_expr_avg
have basically the same meaning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good to me, just a few small things
Also can we wait to merge this until my paper is accepted? Or remove the "3" from the notebook name until after my paper is accepted? Lorenz or whoever wants to use it can probably just go to this branch or tell them to use the appropriate notebook. I don't want to confuse my reviewers about why we're offering 2 notebooks for cell clustering. |
Yes, we'll definitely pick a different name in the meantime, we don't want people getting confused |
@cliu72 @ngreenwald ok this PR should have addressed most everything so far. Let me know if there are any additional issues to resolve before merging this into #885 (not main). We can create additional PRs off #885 if any more issues come up (ex. adding option to turn off 99.9% normalization for generic cell clustering, renaming the notebooks). Anything that involves the Pixie pipeline as a whole (ex. condensing the two averaging functions) will be addressed in a PR separate from #885. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…ustering process (#885) * Include explicit preprocessing steps for cell SOM input data * Add tests for helper cell cluster functions that changed with new process * Make a minor comment change to trigger a new build with updated package cache * Explicitly install charset-normalizer to fix dependency issues * Save unnormalized cluster counts for reference purposes * Add notebook to run generic cell clustering (#892) * First draft of generic cell clustering process * Ensure the None checks run properly for weighted cell channel validation * Attempt to add tests to 3b notebook * Ensure generic cell clustering integrated with notebooks * Add description of generalized cell inputs to data_types.md, and ensure generalized inputs run without cell_size col * Add generic cell clustering process to the README * Remove 3b notebook from README for now * Patch up remaining errors * Move cell cluster summary file generation to separate functions * Remove extraneous comments from notebook * Relocate segmentation variable settings to end of generic cell clustering notebook * Remove test for column mismatch in cluster_cells (needed to support generic cell clustering) * Massive renaming to avoid confusion of counts referring to generic cell clustering * PYCODESTYLE * Normalize new naming conventions; remove more counts refs * Fix parameter to test_generate_wc_avg_files * Update cell cluster pipeline to save Pixie results at the end, and not at intermediate steps * Remove comments from notebook 3 * Rename averaging functions to avoid ambiguity * Temporarily rename 3b_Generic_Cluster_Cells for Candace's paper * Remove unnecessary comment * Address hidden merge conflicts errors * Purge old HuggingFace references * Add seed param to cell_cluster_utils * Purge rmtree call * Change from tmi to alpineer * Change example dataset version back to main
What is the purpose of this PR?
Addresses #881. Closes #853.
How did you implement your changes
This adds the generic cell clustering notebook to the pipeline, which allows users to start at step 3 without needing to run the full Pixie pipeline prior.
Only additional functionality needed internally in
cell_cluster_utils.py
is the ability to skip weighted cell channel processing steps.