
Subset pixel channel averaging to remove bottleneck #823

Merged · 31 commits into main · Dec 7, 2022

Conversation

@alex-l-kong (Contributor) commented Nov 10, 2022

What is the purpose of this PR?

Closes #622. Closes #799. With a large number of FOVs, compute_pixel_cluster_channel_avg can become extremely slow because multiple DataFrame summary stats have to be aggregated, concatenated, then aggregated again. This is unavoidable because we can't load the entire dataset into memory due to Docker limitations.

However, subsetting the pixels from each FOV that get used for averaging allows us to generate computationally similar results while providing a massive performance boost.

How did you implement your changes?

We add a parameter subset_proportion which works the same way as its counterpart in create_pixel_matrix. Prior to averaging per pixel SOM or meta cluster, we randomly sample a subset_proportion fraction of the pixels from each FOV's data.

An additional speed boost can be obtained by collecting the computed DataFrames in a list and concatenating them once at the end, rather than concatenating repeatedly.
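The two optimizations described here (per-FOV pixel sampling, plus building a list of DataFrames and concatenating once) could be sketched roughly as follows. The function and column names are hypothetical, not the actual ark-analysis code:

```python
import pandas as pd

def average_pixel_channels(fov_dfs, channels, cluster_col,
                           subset_proportion=0.1, seed=42):
    """Average channel expression per cluster, sampling a fraction of
    pixels from each FOV before aggregating. Illustrative sketch only."""
    sampled = []
    for fov_df in fov_dfs:
        # randomly sample a subset_proportion fraction of this FOV's pixels
        sampled.append(fov_df.sample(frac=subset_proportion, random_state=seed))

    # build the list first, then concatenate once (faster than repeated concat)
    all_pixels = pd.concat(sampled, ignore_index=True)

    # mean expression of each channel per SOM/meta cluster
    return all_pixels.groupby(cluster_col)[channels].mean()
```

With subset_proportion=1.0 this reduces to the full (unsampled) average, which matches the no-subsetting default discussed below.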

Remaining issues

One potential problem is that the subsetting is not stratified across SOM or meta clusters. As a result, a user could end up losing clusters (especially SOM clusters) during the averaging process. This may not be a major issue, but it is still worth noting.

@alex-l-kong alex-l-kong self-assigned this Nov 10, 2022
@alex-l-kong alex-l-kong marked this pull request as draft November 10, 2022 23:48
@alex-l-kong alex-l-kong marked this pull request as ready for review November 15, 2022 20:11
@ngreenwald (Member) left a comment

I thought part of the slowness was loading each individual FOV? How much does this speed things up? If IO is the bottleneck, then perhaps processing a subset of FOVs, rather than a subset of each FOV, would be faster.

@alex-l-kong (Contributor, Author)

@ngreenwald yes, this will be more optimal. I'll update the logic.

@cliu72 (Contributor) left a comment

I think it would be better to leave the default for the subsetting as 1 (i.e. no subsetting) in the function definitions, and expose it as an option in the notebook for users to change. A lot of people who haven't used Pixie before are going to first run the notebook as-is with a small test set, and subsetting 10% of a small test set of FOVs leaves very few pixels, which might throw off the averaging and lead people to think that Pixie sucks.

I think the point you made in "Remaining Issues" is a good one - perhaps after subsetting and averaging, we can add a check that the number of clusters is what the users specified. And if it's not for whatever reason, have a warning that tells the user to increase the subset proportion. Or something like that, open to other suggestions.
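The safeguard suggested here could look something like the following sketch. The function name, column name, and warning wording are hypothetical, not the actual implementation:

```python
import warnings

import pandas as pd

def check_cluster_count(avg_df, cluster_col, expected_num_clusters):
    """Warn if subsetting caused clusters to be missing from the
    averaged output. Returns True if all expected clusters are present."""
    found = avg_df[cluster_col].nunique()
    if found < expected_num_clusters:
        warnings.warn(
            f"Averaged data contains just {found} clusters out of "
            f"{expected_num_clusters}. Increase your subset proportion."
        )
        return False
    return True
```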

@ngreenwald (Member)

@alex-l-kong, I thought we talked yesterday about changing the default behavior to address this issue, specifying the number of FOVs to keep instead of a percentage of the total.

@cliu72 (Contributor) commented Nov 16, 2022

Ah, if that's the case, I think it'd still be best if the default behavior was to use all the FOVs. I found that a lot of people who have been asking me about Pixie so far just run the entire notebook without paying much attention to the parameters they can change...

@ngreenwald (Member)

I was thinking of setting it to something like 100 FOVs, where we think it wouldn't make a difference. But if you want the default to be everything to make sure there are no issues, then the subsetting method doesn't really matter and we can leave it as is.

@cliu72 (Contributor) commented Nov 16, 2022

I guess if the default is 100, we just need to add a check for datasets that have fewer than 100 FOVs. I think it'd be easier to set it to everything, but I also don't have strong opinions about this.

@alex-l-kong (Contributor, Author)

@ngreenwald yup we did, just hadn't had the opportunity to meet with @cliu72 about it yet. @cliu72 tomorrow I'll be in lab during the afternoon. If there's anything else to discuss, does 3 PM work?

@alex-l-kong (Contributor, Author)

Personally, I like the idea of setting it by default to min(100, len(fovs)). I think we're at a stage where some people are just starting out with a small dataset while others will have massive cohorts. This should be a nice middle ground (so anyone with few FOVs can use them all, while those with thousands can limit their analysis to 100).

Also don't think we need an explicit notebook variable to change this unless even 100 proves problematic for some people.
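The min(100, len(fovs)) default could be sketched like this. num_fovs_subset is the parameter name referenced elsewhere in this PR; the helper itself is hypothetical:

```python
import random

def select_fovs(fovs, num_fovs_subset=100, seed=42):
    """Pick min(num_fovs_subset, len(fovs)) FOVs at random, so small
    datasets use every FOV while large cohorts are capped at 100."""
    num_to_keep = min(num_fovs_subset, len(fovs))
    rng = random.Random(seed)
    return rng.sample(fovs, num_to_keep)
```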

@cliu72 (Contributor) commented Nov 17, 2022

Sure, default to min(100, len(fovs)) sounds good to me. I don't think there's a need to meet, but I will also be around if you want to discuss anything.

@cliu72 (Contributor) commented Nov 28, 2022

Also, food for thought but doesn't necessarily need to be addressed in this PR. Getting the means of the metaclusters doesn't actually require re-reading in individual FOVs. We could just use the SOM cluster means and the counts of each SOM cluster to do that calculation (like if SOM clusters 1, 2, and 3 are in metacluster 1, just use the expression profiles and counts of those SOM clusters to calculate metacluster 1 mean by multiplying each cluster mean by its count to get sum, add up all the clusters, then divide by total count). Could potentially speed those steps up, if the bottleneck is reading in individual FOVs.

@alex-l-kong (Contributor, Author)

> Also, food for thought but doesn't necessarily need to be addressed in this PR. Getting the means of the metaclusters doesn't actually require re-reading in individual FOVs. We could just use the SOM cluster means and the counts of each SOM cluster to do that calculation (like if SOM clusters 1, 2, and 3 are in metacluster 1, just use the expression profiles and counts of those SOM clusters to calculate metacluster 1 mean by multiplying each cluster mean by its count to get sum, add up all the clusters, then divide by total count). Could potentially speed those steps up, if the bottleneck is reading in individual FOVs.

I think that's a good idea, we lose far less information this way. I think it's best for the next PR.
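The count-weighted averaging idea being discussed could be sketched as follows. The column names (meta_cluster, count) are assumptions for illustration, not the actual schema:

```python
import pandas as pd

def meta_cluster_means(som_avg_df, channels):
    """Compute metacluster channel means from per-SOM-cluster means and
    counts, avoiding a re-read of the individual FOV files.

    som_avg_df: one row per SOM cluster, with per-channel mean columns,
    a 'count' column (pixels in that SOM cluster), and a 'meta_cluster'
    assignment column."""
    weighted = som_avg_df.copy()

    # mean * count recovers the per-channel sum for each SOM cluster
    weighted[channels] = weighted[channels].multiply(weighted["count"], axis=0)

    grouped = weighted.groupby("meta_cluster")

    # add up the per-cluster sums, then divide by the total pixel count
    return grouped[channels].sum().div(grouped["count"].sum(), axis=0)
```

For example, SOM clusters with means 2.0 and 4.0 and counts 1 and 3 in the same metacluster yield (2*1 + 4*3) / 4 = 3.5, exactly the mean a full re-read would produce.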


@alex-l-kong (Contributor, Author)

Tested and verified that this new process works. I think it's going to be easier to merge this one in first prior to consensus clustering. Gonna be a less painful merge this way.

@ngreenwald (Member) left a comment

Looks good, I defer to candace

@cliu72 (Contributor) left a comment

Looks good to me. Can we change the error message to:

'Averaged data contains just %d clusters out of %d, average expression file not written. '
'Increase your num_fovs_subset value.'

"Removing those values" is confusing to me.

@alex-l-kong (Contributor, Author)

@cliu72 error message has been fixed.

@cliu72 (Contributor) left a comment

Looks good

@alex-l-kong (Contributor, Author)

@ngreenwald just need your go-ahead to merge in

@ngreenwald ngreenwald merged commit 306af19 into main Dec 7, 2022
@ngreenwald ngreenwald deleted the subset_channel_avg branch December 7, 2022 01:22
@srivarra srivarra added the enhancement New feature or request label Jan 10, 2023