Updates to DBSCAN fitting #302

nickjcroucher · 2024-02-16T15:12:23Z

Motivated by trying to fit a DBSCAN model to a large dataset. Problems were:

* indistinct clustering criterion; this was very strict (separation between within and between strain cluster on both axes required), rejecting some sensible fits; now relaxed (separation only required on one axis) - let me know if you think the stricter option should still be available through a flag, or if you're happy with the change across the board
* slow fitting to large datasets; implemented GPU version of DBSCAN, which is fast; the problem is then assigning all distances, which is slow, because the model takes up a lot of GPU memory, and copying over batches of distances into the variable amount of remaining GPU memory (customisable with the new `--assign-subsample` option) negates the speed-up of the initial fit
* slow assignment of distances to model fit; this is inefficient, as we typically don't use the assignments of points to the initial model fit, and it takes ages on a large dataset. Instead I have added a `--no-assign` flag, which skips the assignment, labels the model appropriately, and allows a refined model fit that then assigns all points
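
To make the first point concrete, here is a minimal sketch of the strict versus relaxed separation check. It is illustrative only and is not the code in `PopPUNK/dbscan.py`: the function name `clusters_are_distinct`, its arguments, and the example values are all hypothetical, and it assumes each cluster is summarised by its extent on the (core, accessory) axes.

```python
import numpy as np

def clusters_are_distinct(within_max, between_min, require_both_axes=False):
    """Hypothetical check, not PopPUNK's implementation.

    within_max:  (core, accessory) maxima of the within-strain cluster
    between_min: (core, accessory) minima of a between-strain cluster
    """
    within_max = np.asarray(within_max, dtype=float)
    between_min = np.asarray(between_min, dtype=float)
    # An axis counts as separated if the within-strain cluster ends
    # before the between-strain cluster begins on that axis
    separated = within_max < between_min
    if require_both_axes:
        return bool(separated.all())   # old, strict criterion
    return bool(separated.any())       # relaxed criterion

# Made-up example: clusters separated on the core axis but overlapping
# on the accessory axis
within_max = (0.002, 0.15)
between_min = (0.010, 0.12)
strict_ok = clusters_are_distinct(within_max, between_min, require_both_axes=True)
relaxed_ok = clusters_are_distinct(within_max, between_min)
```

With these made-up values the strict check rejects the fit and the relaxed check accepts it, which is the kind of "sensible fit" the old criterion was discarding.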

If you approve these changes conceptually, then I'll add tests and docs.

Improvements to DBSCAN fitting - moving from fork to main repo

johnlees · 2024-02-16T15:16:22Z

Will take a look before end of today!

nickjcroucher · 2024-02-16T15:17:49Z

Thanks, but it's not urgent, I just wanted to fix my broken access issues and get this onto the right repo before my machine melts!

johnlees · 2024-02-16T15:36:19Z

> Motivated by trying to fit a DBSCAN model to a large dataset. Problems were:

Sounds like a good thing to improve, I've typically been using refined boundaries as everything else seems to struggle. Assigning to DBSCAN models is slow -- apparently CUDA was going to add something to make this much faster?

> * indistinct clustering criterion; this was very strict (separation between within and between strain cluster on both axes required), rejecting some sensible fits; now relaxed (separation only required on one axis) - let me know if you think the stricter option should still be available through a flag, or if you're happy with the change across the board

No, sounds like a good change! I think that's caused frustrations before (and heaven knows we don't need more options)

> * slow fitting to large datasets; implemented GPU version of DBSCAN, which is fast; the problem is then assigning all distances, which is slow, because the model takes up a lot of GPU memory, and copying over batches of distances into the variable amount of remaining GPU memory (customisable with the new `--assign-subsample` option) negates the speed-up of the initial fit

Interesting, will look at this part
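
For anyone following along, the block-wise assignment described in the quoted point can be sketched roughly as below. This is a CPU-only NumPy stand-in under stated assumptions, not the PR's GPU code: `assign_in_batches`, `core_points`, `core_labels`, `eps` and `batch_size` are hypothetical names, and assignment here is a plain nearest-core-point rule rather than whatever prediction routine the fitted model uses.

```python
import numpy as np

def assign_in_batches(points, core_points, core_labels, eps, batch_size=1000):
    """Hypothetical sketch: label points in fixed-size blocks so that only
    one block of pairwise distances is held in memory at a time."""
    points = np.asarray(points, dtype=float)
    core_points = np.asarray(core_points, dtype=float)
    core_labels = np.asarray(core_labels, dtype=int)
    labels = np.full(points.shape[0], -1, dtype=int)  # -1 marks noise
    for start in range(0, points.shape[0], batch_size):
        batch = points[start:start + batch_size]
        # Euclidean distances between this block and every core point
        diff = batch[:, None, :] - core_points[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=2))
        nearest = dist.argmin(axis=1)
        nearest_dist = dist[np.arange(batch.shape[0]), nearest]
        # Take the nearest core point's label only if it lies within eps
        labels[start:start + batch.shape[0]] = np.where(
            nearest_dist <= eps, core_labels[nearest], -1
        )
    return labels

# Made-up data: two core points and three points to assign
core_points = [[0.0, 0.0], [1.0, 1.0]]
core_labels = [0, 1]
points = [[0.01, 0.0], [0.98, 1.0], [0.5, 0.5]]
labels = assign_in_batches(points, core_points, core_labels, eps=0.05, batch_size=2)
```

A smaller `batch_size` lowers peak memory at the cost of more iterations, which is the trade-off the batch-size option is there to tune.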

> * slow assignment of distances to model fit; this is inefficient, as we typically don't use the assignments of points to the initial model fit, and it takes ages on a large dataset. Instead I have added a `--no-assign` flag, which skips the assignment, labels the model appropriately, and allows a refined model fit that then assigns all points

Would it be appropriate to make this the default or perhaps remove the former behaviour as an option altogether? Aware we have a very long CLI

> If you approve these changes conceptually, then I'll add tests and docs.

Great, will have a quick look through the code now

johnlees

Looks like a really helpful change

I've made a few comments from a UI and maintenance perspective

PopPUNK/__main__.py

PopPUNK/assign.py

PopPUNK/dbscan.py

PopPUNK/models.py

PopPUNK/utils.py

johnlees · 2024-02-16T15:48:12Z

Looks like no issues with tests & mandrake here?

nickjcroucher · 2024-02-16T16:16:18Z

Thanks for the comments! Nope, no mandrake issues here, seems like it is a local installation issue. I will adjust the CLI and cascade the changes through the code.

johnlees

Had a look through tests and docs and those look good to me too

johnlees · 2024-02-27T13:28:18Z

One final thing – maybe want to bump the version if not already past current release

nickjcroucher · 2024-02-27T16:18:03Z

> One final thing – maybe want to bump the version if not already past current release

Glad someone was paying attention, done in 2920b00.

nickjcroucher added 30 commits January 17, 2024 14:26

Allow changes to model fitting subsample

ea9229b

Add GPU HDBSCAN algorithm

a4d98af

Update GPU memory management

d302974

Update min samples limit

de8b76f

Enable GPU model assignments

edce431

Cluster counting with cupy

ab859c3

Fix cupy abbreviation

d314fd8

Get max from cupy array

3685677

Change max function

6c031eb

Change cast to int

7b7d30b

Convert data to cupy

aa8d686

Relax constraint on DBSCAN fits

54b5ab0

Simplify fitting code

78336b7

Manage GPU memory

26e9511

Cupy indexing fix

c5ab3d8

Cupy indexing fix again

50ac745

Fix prediction parsing

507cfa3

Ignore CPU only cache

da70f27

Use python

5651d2f

Assign by GPU in blocks

b5abfdc

Specify GPU assignment

aafaa1c

Improve indices

e6cbf90

Allow flexible assignment batch size

539d939

Argument typo

176b323

Fix assignment command

8ebbb89

Separate fitting and assigning batch size

8511ae8

Parse batch size

03aef56

Use cupy in plot function

fadc54d

Convert to numpy before plotting

bceaa69

Add colon

a199403

nickjcroucher added 12 commits January 23, 2024 21:35

Update assignment in fitting

758e0b1

GPU specification in save and load

3c92279

Explicit numpy conversion in predict

e857fca

Enable fitting without assigning all points

101a910

Fix bracket

6baf0a0

Clarify no-assign use

b5d04ef

Simplify DBSCAN fitting

f85bc75

Exit before network construction if not assigning

94a88d9

Do not allow assignment with an incompletely fitting model

00b8651

Fix model assignment checks

65035ba

Correct attribute checking

bc2ba19

Merge pull request #301 from nickjcroucher/update_model_fitting

c080e2a

Improvements to DBSCAN fitting - moving from fork to main repo

johnlees reviewed Feb 16, 2024


nickjcroucher added 7 commits February 23, 2024 09:44

Improve CLI

0382227

Add reassuring exit message

843f7be

Improve error message

3f1ea31

Fix awful min_sample condition

5454f7a

Clarify awful min_sample condition

f721ca1

Add tests

9a229c6

Update docs with GPU fitting info

6795025

johnlees approved these changes Feb 27, 2024


Bump version

2920b00

nickjcroucher merged commit 27e7f85 into master Feb 27, 2024
3 checks passed
