Subset pixel channel averaging to remove bottleneck #823
Conversation
…rify we're subsetting prior to averaging
I thought part of the slowness was loading each individual FOV? How much does this speed things up? If IO is the bottleneck, then perhaps processing a subset of FOVs, rather than a subset of each FOV, would be faster.
@ngreenwald yes, this will be more optimal. I'll update the logic.
I think it would be better to leave the default for the subsetting as 1 (i.e. no subsetting) in the function definitions, and have it as an option in the notebook for users to change it. I think a lot of people who haven't used Pixie before are going to first run the notebook as is with a small test set, and subsetting 10% of a small test set of FOVs is a small number - which might throw off the averaging, leading people to think that Pixie sucks.
I think the point you made in "Remaining Issues" is a good one - perhaps after subsetting and averaging, we can add a check that the number of clusters is what the users specified. And if it's not for whatever reason, have a warning that tells the user to increase the subset proportion. Or something like that, open to other suggestions.
@alex-l-kong, I thought we talked yesterday about changing the default behavior to address this issue, specifying the number of FOVs to keep instead of a percentage of the total.
Ah, if that's the case, I think it'd still be best if the default behavior was to use all the FOVs. I found that a lot of people who have been asking me about Pixie so far just run the entire notebook without paying much attention to the parameters they can change...
I was thinking we'd set it to something like 100 FOVs, where we think it wouldn't make a difference. But if you want the default to be everything to make sure there are no issues, then the subsetting method doesn't really matter and we can leave it as is.
I guess if the default is 100, we need to just add a check for datasets that have fewer than 100. I think it'd be easier to set it to everything - but I also don't have strong opinions about this.
@ngreenwald yup we did, just hadn't had the opportunity to meet with @cliu72 about it yet. @cliu72 tomorrow I'll be in lab during the afternoon. If there's anything else to discuss, does 3 PM work?
Personally, I like the idea of setting it by default to 100. I also don't think we need an explicit notebook variable to change this unless even 100 proves problematic for some people.
Sure, default to 100 it is.
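Not the PR's actual code, but a minimal sketch of how a default of 100 FOVs might be clamped so small test datasets simply use every FOV (the helper name and seed handling here are illustrative assumptions):

```python
import random

def subset_fovs(fovs, num_fovs_subset=100, seed=42):
    """Randomly pick at most num_fovs_subset FOVs; small datasets keep everything.

    Illustrative helper only: in practice this logic would live inside the
    averaging function rather than a standalone utility.
    """
    if len(fovs) <= num_fovs_subset:
        # fewer FOVs than the subset size: just use all of them
        return list(fovs)

    rng = random.Random(seed)
    return rng.sample(list(fovs), num_fovs_subset)

# example usage
fovs = [f"fov{i}" for i in range(250)]
print(len(subset_fovs(fovs)))        # 100
print(len(subset_fovs(fovs[:30])))   # 30: small dataset keeps every FOV
```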
…verify final number of pixel clusters included
Also, food for thought, but it doesn't necessarily need to be addressed in this PR. Getting the means of the metaclusters doesn't actually require re-reading the individual FOVs. We could just use the SOM cluster means and the counts of each SOM cluster to do that calculation (e.g. if SOM clusters 1, 2, and 3 are in metacluster 1, multiply each of those SOM clusters' mean expression profiles by its count to get a sum, add the sums across the clusters, then divide by the total count to get the metacluster 1 mean). Could potentially speed those steps up, if the bottleneck is reading in individual FOVs.
I think that's a good idea; we lose far less information this way. I think it's best left for the next PR.
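For reference, a rough sketch of the count-weighted calculation suggested above, assuming a per-SOM-cluster table of channel means plus pixel counts and a SOM-to-metacluster mapping (all column names and values are made up for illustration, not the project's actual schema):

```python
import pandas as pd

# hypothetical per-SOM-cluster summary: channel means plus pixel counts
som_means = pd.DataFrame({
    "pixel_som_cluster": [1, 2, 3, 4],
    "CD45": [0.2, 0.4, 0.6, 0.1],
    "CD8": [0.5, 0.3, 0.7, 0.2],
    "count": [100, 50, 25, 200],
})

# hypothetical SOM cluster -> metacluster assignment
som_to_meta = {1: 1, 2: 1, 3: 1, 4: 2}
som_means["pixel_meta_cluster"] = som_means["pixel_som_cluster"].map(som_to_meta)

channels = ["CD45", "CD8"]

# turn each SOM cluster mean back into a sum (mean * count), add the sums
# within each metacluster, then divide by the metacluster's total count
weighted = som_means[channels].multiply(som_means["count"], axis=0)
weighted["count"] = som_means["count"]
weighted["pixel_meta_cluster"] = som_means["pixel_meta_cluster"]

grouped = weighted.groupby("pixel_meta_cluster").sum()
meta_means = grouped[channels].divide(grouped["count"], axis=0)
print(meta_means)
```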
Tested and verified that this new process works. I think it's going to be easier to merge this one in first, prior to consensus clustering. It'll be a less painful merge this way.
Looks good, I defer to Candace.
Looks good to me. Can we change the error message to:
'Averaged data contains just %d clusters out of %d, average expression file not written. '
'Increase your num_fovs_subset value.'
"Removing those values" is confusing to me.
@cliu72 the error message has been fixed.
Looks good
@ngreenwald just need your go-ahead to merge in.
What is the purpose of this PR?
Closes #622. Closes #799. With many FOVs, `compute_pixel_cluster_channel_avg` can become extremely slow, as multiple DataFrame summary stats have to be aggregated, concatenated, then aggregated again. This is unavoidable because we can't load the entire dataset into memory due to Docker limitations. However, subsetting the pixels of each FOV that get used for averaging allows us to generate computationally similar results and provides a massive performance boost.
How did you implement your changes
We add a parameter `subset_proportion`, which works the same way as it does in `create_pixel_matrix`. Prior to averaging per pixel SOM or meta cluster, we randomly sample a `subset_proportion` fraction of the pixels from each FOV's data. An additional speed boost is obtained by storing the computed DataFrames in a list prior to concatenating.
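Below is a hedged sketch of that flow, assuming the per-FOV pixel data is already available as DataFrames in memory (the function name, column names, and toy data are illustrative assumptions, not the PR's actual implementation, which reads FOV files from disk):

```python
import pandas as pd

def average_channels_by_cluster(fov_frames, subset_proportion=0.1,
                                cluster_col="pixel_som_cluster", seed=42):
    """Subset each FOV's pixels, then average channel values per cluster."""
    sampled = []
    for fov_data in fov_frames:
        # randomly keep subset_proportion of this FOV's pixels before any aggregation
        sampled.append(fov_data.sample(frac=subset_proportion, random_state=seed))

    # build the list first and concatenate once, rather than appending repeatedly
    all_pixels = pd.concat(sampled, ignore_index=True)
    return all_pixels.groupby(cluster_col).mean()

# toy usage: two tiny "FOVs" with one channel each
fov1 = pd.DataFrame({"pixel_som_cluster": [1, 1, 2] * 20, "CD45": range(60)})
fov2 = pd.DataFrame({"pixel_som_cluster": [2, 3, 3] * 20, "CD45": range(60)})
print(average_channels_by_cluster([fov1, fov2], subset_proportion=0.2))
```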
Remaining issues
One potential problem is that the subsetting is not stratified across all SOM or meta clusters. As a result, a user could end up losing clusters (especially SOM clusters) during the averaging process. This may not be a major issue, but it is still worth noting.
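One possible mitigation, sketched here purely as an idea rather than something this PR implements, is to stratify the per-FOV sample by cluster so that every cluster present contributes at least a minimum number of pixels (names and defaults below are hypothetical):

```python
import pandas as pd

def stratified_pixel_sample(fov_data, subset_proportion=0.1, min_per_cluster=1,
                            cluster_col="pixel_som_cluster", seed=42):
    """Sample pixels within each cluster so small clusters aren't dropped entirely."""
    def take(group):
        # keep at least min_per_cluster pixels, otherwise subset_proportion of the cluster
        n = max(min_per_cluster, int(len(group) * subset_proportion))
        return group.sample(n=min(n, len(group)), random_state=seed)

    return (
        fov_data.groupby(cluster_col, group_keys=False)
        .apply(take)
        .reset_index(drop=True)
    )

# toy usage: cluster 3 has only two pixels but still survives the subsetting
pixels = pd.DataFrame({
    "pixel_som_cluster": [1] * 50 + [2] * 50 + [3] * 2,
    "CD45": range(102),
})
print(stratified_pixel_sample(pixels)["pixel_som_cluster"].value_counts())
```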