Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updates to DBSCAN fitting #302

Merged
merged 63 commits into from Feb 27, 2024
Merged

Updates to DBSCAN fitting #302

merged 63 commits into from Feb 27, 2024

Conversation

nickjcroucher
Copy link
Collaborator

Motivated by trying to fit a DBSCAN model to a large dataset. Problems were:

  • indistinct clustering criterion; this was very strict (separation between within and between strain cluster on both axes required), rejecting some sensible fits; now relaxed (separation only required on one axis) - let me know if you think the stricter option should still be available though a flag, or if you're happy with change across the board
  • slow fitting to large datasets; implemented GPU version of DBSCAN, which is fast; the problem is then assigning all distances, which is slow, because the model takes up a lot of GPU memory, and copying over batches of distances into the variable amount of remaining GPU memory (customisable with the new --assign-subsample option) negates the speed up of the initial fit
  • slow assignment of distances to model fit; this is inefficient, as we typically don't use the assignments of points to the initial model fit, and it takes ages on a large dataset. Instead I have added a --no-assign flag, which skips the assignment, labels the model appropriately, and allows a refined model fit that then assigns all points

If you approve these changes conceptually, then I'll add tests and docs.

@johnlees
Copy link
Member

Will take a look before end of today!

@nickjcroucher
Copy link
Collaborator Author

Thanks, but it's not urgent, I just wanted to fix my broken access issues and get this onto the right repo before my machine melts!

@johnlees
Copy link
Member

Motivated by trying to fit a DBSCAN model to a large dataset. Problems were:

Sounds like a good thing to improve, I've typically been using refined boundaries as everything else seems to struggle. Assigning to DBSCAN models is slow -- apparently CUDA was going to add something to make this much faster?

* indistinct clustering criterion; this was very strict (separation between within and between strain cluster on both axes required), rejecting some sensible fits; now relaxed (separation only required on one axis) - let me know if you think the stricter option should still be available though a flag, or if you're happy with change across the board

No, sounds like a good change! I think that's caused frustrations before (and heaven knows we don't need more options)

* slow fitting to large datasets; implemented GPU version of DBSCAN, which is fast; the problem is then assigning all distances, which is slow, because the model takes up a lot of GPU memory, and copying over batches of distances into the variable amount of remaining GPU memory (customisable with the new `--assign-subsample` option) negates the speed up of the initial fit

Interesting, will look at this part

* slow assignment of distances to model fit; this is inefficient, as we typically don't use the assignments of points to the initial model fit, and it takes ages on a large dataset. Instead I have added a `--no-assign` flag, which skips the assignment, labels the model appropriately, and allows a refined model fit that then assigns all points

Would it be appropriate to make this the default or perhaps remove the former behaviour as an option altogether? Aware we have a very long CLI

If you approve these changes conceptually, then I'll add tests and docs.

Great, will have a quick look through the code now

Copy link
Member

@johnlees johnlees left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a really helpful change

I've made a few comments from a UI and maintenance perspective

PopPUNK/__main__.py Outdated Show resolved Hide resolved
PopPUNK/__main__.py Show resolved Hide resolved
PopPUNK/__main__.py Outdated Show resolved Hide resolved
PopPUNK/__main__.py Outdated Show resolved Hide resolved
PopPUNK/assign.py Outdated Show resolved Hide resolved
PopPUNK/dbscan.py Show resolved Hide resolved
PopPUNK/models.py Outdated Show resolved Hide resolved
PopPUNK/models.py Show resolved Hide resolved
PopPUNK/models.py Show resolved Hide resolved
PopPUNK/utils.py Show resolved Hide resolved
@johnlees
Copy link
Member

Looks like no issues with tests & mandrake here?

@nickjcroucher
Copy link
Collaborator Author

Thanks for the comments! Nope, no mandrake issues here, seems like it is a local installation issue. I will adjust the CLI and cascade the changes through the code.

Copy link
Member

@johnlees johnlees left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a look through tests and docs and those look good to me too

@johnlees
Copy link
Member

One final thing – maybe want to bump the version if not already past current release

@nickjcroucher
Copy link
Collaborator Author

One final thing – maybe want to bump the version if not already past current release

Glad someone was paying attention, done in 2920b00.

@nickjcroucher nickjcroucher merged commit 27e7f85 into master Feb 27, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants