Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

corrects randomization implementation in spatial_analysis_utils #452

Merged
merged 1 commit into from
Aug 27, 2021

Conversation

ackagel
Copy link
Contributor

@ackagel ackagel commented Aug 25, 2021

Purpose

Corrects the randomization strategy in the spatial analysis code.

Implementation

Random sampling happens without replacement on each individual bootstrap iteration. Previously, sampling with replacement was performed to bootstrap in one iteration per comparison. It was thought this was faster and on average the same as the former strategy, but as it turns out, neither is the case.

Quick benchmarks for old/new strategy:

Old implementation

bootstrap_num = 5

Screen Shot 2021-08-25 at 6 33 33 PM

bootstrap_num = 10

Screen Shot 2021-08-25 at 6 34 25 PM

New implementation

bootstrap_num = 5

Screen Shot 2021-08-25 at 6 30 39 PM

bootstrap_num = 10

Screen Shot 2021-08-25 at 6 31 23 PM

@ackagel
Copy link
Contributor Author

ackagel commented Aug 25, 2021

@cliu72

Tagged you for review here since we had talked about correcting the randomization strategy earlier (as in months ago, but just imagine it was sooner....)

Copy link
Member

@ngreenwald ngreenwald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the way that it was originally done and it's a tiny bit faster, makes sense to revert back

Copy link
Contributor

@alex-l-kong alex-l-kong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, but how much of a slowdown are we looking at here? I remember the tests would often time out on a smaller dataset using this method.

Copy link
Contributor

@cliu72 cliu72 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@ackagel
Copy link
Contributor Author

ackagel commented Aug 27, 2021

@alex-l-kong the "old old" version was the same algorithm as this implementation, but it recomputed the binarized distance matrix on each bootstrap iteration, along with some other needless 'recomputes'.

Awhile ago, I made the mistake of correcting that while also changing the algorithm. I thought the algo change made it faster, while preserving the correct randomization strategy. Both of those conclusions were incorrect, and it turns out only the removal of the 'recomputes' led to the speed up, while the algorithm change actually slowed it down due to a memory bottleneck.

In other words, for n=num_cells this corrected implementation is O(n^2) in memory, and O(r) in 'python loops', while the previous method is O(r * n^2) in memory, and O(1) in 'python loops'. Because of the high number of cells, the memory allocation for the previous method ends up being slower than the O(r) 'python loops' in the corrected method.

tl;dr past Adam did bad and should feel bad. also python loops are less evil than excessive memory allocations.

@ackagel ackagel merged commit fb8fd28 into spatial-analysis-nb-update Aug 27, 2021
@ackagel ackagel deleted the fix-close-cell-random branch August 27, 2021 15:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants