
Subset pixel channel averaging to remove bottleneck #823

Merged · 31 commits into main · Dec 7, 2022

Conversation

@alex-l-kong (Contributor) commented Nov 10, 2022

What is the purpose of this PR?

Closes #622. Closes #799. With a large number of FOVs, compute_pixel_cluster_channel_avg can become extremely slow because multiple DataFrame summary stats have to be aggregated, concatenated, then aggregated again. This is unavoidable because we can't load the entire dataset into memory due to Docker limitations.

However, subsetting the pixels from each FOV that get used for averaging allows us to generate computationally similar results while providing a massive performance boost.

How did you implement your changes?

We add a parameter subset_proportion which works the same way as its counterpart in create_pixel_matrix. Prior to averaging per pixel SOM or meta cluster, we randomly sample a subset_proportion fraction of the pixels from each FOV's data.

An additional speed boost can be obtained by collecting the computed DataFrames in a list and concatenating them once at the end, rather than concatenating repeatedly.
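The two optimizations described here (per-FOV pixel sampling, plus building a list of DataFrames and concatenating once) could be sketched roughly as follows. The function and column names are hypothetical, not the actual ark-analysis code:

```python
import pandas as pd

def average_pixel_channels(fov_dfs, channels, cluster_col,
                           subset_proportion=0.1, seed=42):
    """Average channel expression per cluster, sampling a fraction of
    pixels from each FOV before aggregating. Illustrative sketch only."""
    sampled = []
    for fov_df in fov_dfs:
        # randomly sample a subset_proportion fraction of this FOV's pixels
        sampled.append(fov_df.sample(frac=subset_proportion, random_state=seed))

    # build the list first, then concatenate once (faster than repeated concat)
    all_pixels = pd.concat(sampled, ignore_index=True)

    # mean expression of each channel per SOM/meta cluster
    return all_pixels.groupby(cluster_col)[channels].mean()
```

With subset_proportion=1.0 this reduces to the full (unsampled) average, which matches the no-subsetting default discussed below.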

Remaining issues

One potential problem is that the subsetting is not stratified across SOM or meta clusters. As a result, a user could end up losing clusters (especially SOM clusters) during the averaging process. This may not be a major issue, but it is still worth noting.

@alex-l-kong alex-l-kong self-assigned this Nov 10, 2022
@alex-l-kong alex-l-kong marked this pull request as draft November 10, 2022 23:48
@alex-l-kong alex-l-kong marked this pull request as ready for review November 15, 2022 20:11
@ngreenwald (Member) left a comment

I thought part of the slowness was loading each individual FOV? How much does this speed things up? If IO is the bottleneck, then perhaps processing a subset of FOVs, rather than a subset of each FOV, would be faster.

@alex-l-kong (Contributor, Author)

@ngreenwald yes, this will be more optimal. I'll update the logic.

@cliu72 (Contributor) left a comment

I think it would be better to leave the default for the subsetting as 1 (i.e. no subsetting) in the function definitions, and expose it as an option in the notebook for users to change. A lot of people who haven't used Pixie before are going to first run the notebook as-is with a small test set, and subsetting 10% of a small test set of FOVs leaves very few pixels, which might throw off the averaging and lead people to think that Pixie sucks.

I think the point you made in "Remaining Issues" is a good one - perhaps after subsetting and averaging, we can add a check that the number of clusters is what the users specified. And if it's not for whatever reason, have a warning that tells the user to increase the subset proportion. Or something like that, open to other suggestions.
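The safeguard suggested here could look something like the following sketch. The function name, column name, and warning wording are hypothetical, not the actual implementation:

```python
import warnings

import pandas as pd

def check_cluster_count(avg_df, cluster_col, expected_num_clusters):
    """Warn if subsetting caused clusters to be missing from the
    averaged output. Returns True if all expected clusters are present."""
    found = avg_df[cluster_col].nunique()
    if found < expected_num_clusters:
        warnings.warn(
            f"Averaged data contains just {found} clusters out of "
            f"{expected_num_clusters}. Increase your subset proportion."
        )
        return False
    return True
```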

@ngreenwald (Member)

@alex-l-kong, I thought we talked yesterday about changing the default behavior to address this issue, specifying the number of FOVs to keep instead of a percentage of the total.

@cliu72 (Contributor) commented Nov 16, 2022

Ah, if that's the case, I think it'd still be best if the default behavior was to use all the FOVs. I found that a lot of people who have been asking me about Pixie so far just run the entire notebook without paying much attention to the parameters they can change...

@ngreenwald (Member)

I was thinking of setting it to something like 100 FOVs, where we think it wouldn't make a difference. But if you want the default to be everything to make sure there are no issues, then the subsetting method doesn't really matter and we can leave it as is.

@cliu72 (Contributor) commented Nov 16, 2022

I guess if the default is 100, we just need to add a check for datasets that have fewer than 100 FOVs. I think it'd be easier to set it to everything, but I also don't have strong opinions about this.

@alex-l-kong (Contributor, Author)

@ngreenwald yup we did, just hadn't had the opportunity to meet with @cliu72 about it yet. @cliu72 tomorrow I'll be in lab during the afternoon. If there's anything else to discuss, does 3 PM work?

@alex-l-kong (Contributor, Author)

Personally, I like the idea of setting it by default to min(100, len(fovs)). I think we're at a stage where some people are just starting out with a small dataset while others will have massive cohorts. This should be a nice middle ground (so anyone with few FOVs can use them all, while those with thousands can limit their analysis to 100).

Also don't think we need an explicit notebook variable to change this unless even 100 proves problematic for some people.
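The min(100, len(fovs)) default could be sketched like this. num_fovs_subset is the parameter name referenced elsewhere in this PR; the helper itself is hypothetical:

```python
import random

def select_fovs(fovs, num_fovs_subset=100, seed=42):
    """Pick min(num_fovs_subset, len(fovs)) FOVs at random, so small
    datasets use every FOV while large cohorts are capped at 100."""
    num_to_keep = min(num_fovs_subset, len(fovs))
    rng = random.Random(seed)
    return rng.sample(fovs, num_to_keep)
```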

@cliu72 (Contributor) commented Nov 17, 2022

Sure, default to min(100, len(fovs)) sounds good to me. I don't think there's a need to meet, but I will also be around if you want to discuss anything.

@cliu72 (Contributor) commented Nov 28, 2022

Also, food for thought but doesn't necessarily need to be addressed in this PR. Getting the means of the metaclusters doesn't actually require re-reading in individual FOVs. We could just use the SOM cluster means and the counts of each SOM cluster to do that calculation (like if SOM clusters 1, 2, and 3 are in metacluster 1, just use the expression profiles and counts of those SOM clusters to calculate metacluster 1 mean by multiplying each cluster mean by its count to get sum, add up all the clusters, then divide by total count). Could potentially speed those steps up, if the bottleneck is reading in individual FOVs.

@alex-l-kong (Contributor, Author)

> Also, food for thought but doesn't necessarily need to be addressed in this PR. Getting the means of the metaclusters doesn't actually require re-reading in individual FOVs. We could just use the SOM cluster means and the counts of each SOM cluster to do that calculation (like if SOM clusters 1, 2, and 3 are in metacluster 1, just use the expression profiles and counts of those SOM clusters to calculate metacluster 1 mean by multiplying each cluster mean by its count to get sum, add up all the clusters, then divide by total count). Could potentially speed those steps up, if the bottleneck is reading in individual FOVs.

I think that's a good idea, we lose far less information this way. I think it's best for the next PR.
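The count-weighted averaging idea being discussed could be sketched as follows. The column names (meta_cluster, count) are assumptions for illustration, not the actual schema:

```python
import pandas as pd

def meta_cluster_means(som_avg_df, channels):
    """Compute metacluster channel means from per-SOM-cluster means and
    counts, avoiding a re-read of the individual FOV files.

    som_avg_df: one row per SOM cluster, with per-channel mean columns,
    a 'count' column (pixels in that SOM cluster), and a 'meta_cluster'
    assignment column."""
    weighted = som_avg_df.copy()

    # mean * count recovers the per-channel sum for each SOM cluster
    weighted[channels] = weighted[channels].multiply(weighted["count"], axis=0)

    grouped = weighted.groupby("meta_cluster")

    # add up the per-cluster sums, then divide by the total pixel count
    return grouped[channels].sum().div(grouped["count"].sum(), axis=0)
```

For example, SOM clusters with means 2.0 and 4.0 and counts 1 and 3 in the same metacluster yield (2*1 + 4*3) / 4 = 3.5, exactly the mean a full re-read would produce.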


@alex-l-kong (Contributor, Author)

Tested and verified that this new process works. I think it's going to be easier to merge this one in first prior to consensus clustering. Gonna be a less painful merge this way.

@ngreenwald (Member) left a comment

Looks good, I defer to candace

@cliu72 (Contributor) left a comment

Looks good to me. Can we change the error message to:

'Averaged data contains just %d clusters out of %d, average expression file not written. '
'Increase your num_fovs_subset value.'

"Removing those values" is confusing to me.

@alex-l-kong (Contributor, Author)

@cliu72 error message has been fixed.

@cliu72 (Contributor) left a comment

Looks good

@alex-l-kong (Contributor, Author)

@ngreenwald just need your go-ahead to merge in

@ngreenwald ngreenwald merged commit 306af19 into main Dec 7, 2022
@ngreenwald ngreenwald deleted the subset_channel_avg branch December 7, 2022 01:22
@srivarra srivarra added the enhancement New feature or request label Jan 10, 2023