Fix: dynamically increase query params for higher k values (#131)
Conversation
Hey @bclavie, I think this sets ncells to a crazy high value.
ncells is usually O(sqrt(num_embeddings)), so for medium or large datasets ncells=4 or so is all you need, or 8. For tiny things it's a bit different.
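For a rough sense of scale (an illustrative sketch, not from the PR itself; it assumes k ends up equal to the collection size, as in the highlighted line below):

```python
# Hypothetical illustration: what the patched branch would configure
# for different collection sizes, assuming k = len(searcher.collection).
for collection_size in (800, 10_000, 100_000, 1_000_000):
    ncells = collection_size // 32 + 2
    print(f"{collection_size:>9} docs -> ncells={ncells}")
# Output:
#       800 docs -> ncells=27
#     10000 docs -> ncells=314
#    100000 docs -> ncells=3127
#   1000000 docs -> ncells=31252
```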
ragatouille/models/colbert.py
Outdated
)
k = len(self.searcher.collection)
if k > (32 * self.searcher.config.ncells):
    self.searcher.configure(ncells=k // 32 + 2)
This in particular
Thanks for pointing this out... I was optimising it with tiny datasets in mind (for a ~800-doc collection we are already returning fewer documents than expected at k=256), but I think you're right that it might be a bit unwise for larger collections 🤔 ... maybe keep this logic, but only if the collection is smaller than 10000 docs? (It's probably not the optimal logic, but empirically it did well on a bunch of tests.)
i.e.:
ncells = min(k // 32 + 2, 16) for collection_size < 10000; 8 for 10000 < collection_size < 100000; 4 above that.
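A minimal sketch of that tiered logic (hypothetical helper name; the thresholds are the ones proposed above, not necessarily what was merged):

```python
def ncells_for(k: int, collection_size: int) -> int:
    """Hypothetical helper implementing the tiering proposed above."""
    if collection_size < 10_000:
        # Tiny collections: scale with k, but cap at 16.
        return min(k // 32 + 2, 16)
    elif collection_size < 100_000:
        return 8
    else:
        return 4

# e.g. self.searcher.configure(ncells=ncells_for(k, len(self.searcher.collection)))
```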
(I'll merge it as is for now, but please do flag it if you have any issues with this implementation... So far I find that in practice most users have really small datasets (a few hundred to a few thousand documents), so I'm keen to strike a good balance in that area.)