Fix: dynamically increase query params for higher k values (#131)
Conversation
Hey @bclavie, I think this sets ncells to a crazy high value.
ncells is usually O(sqrt(num_embeddings)), so for medium or large datasets ncells=4 or so is all you need, or 8. For tiny things it's a bit different.
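For a rough sense of scale (an illustrative sketch, not from the PR itself; it assumes k ends up equal to the collection size, as in the highlighted line below):

```python
# Hypothetical illustration: what the patched branch would configure
# for different collection sizes, assuming k = len(searcher.collection).
for collection_size in (800, 10_000, 100_000, 1_000_000):
    ncells = collection_size // 32 + 2
    print(f"{collection_size:>9} docs -> ncells={ncells}")
# Output:
#       800 docs -> ncells=27
#     10000 docs -> ncells=314
#    100000 docs -> ncells=3127
#   1000000 docs -> ncells=31252
```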
ragatouille/models/colbert.py
Outdated
)
k = len(self.searcher.collection)
if k > (32 * self.searcher.config.ncells):
    self.searcher.configure(ncells=k // 32 + 2)
This in particular
Thanks for pointing this out... I was optimising it with tiny datasets in mind (for a ~800-doc collection we are already returning fewer documents than expected at k=256), but I think you're right that it might be a bit unwise for larger collections 🤔 ... maybe keep this logic, but only if the collection is smaller than 10000 docs? (It's probably not the optimal logic, but empirically it did well on a bunch of tests.)
i.e.:
ncells = min(k // 32 + 2, 16) for collection_size < 10000; 8 for 10000 < collection_size < 100000; 4 above that.
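A minimal sketch of that tiered logic (hypothetical helper name; the thresholds are the ones proposed above, not necessarily what was merged):

```python
def ncells_for(k: int, collection_size: int) -> int:
    """Hypothetical helper implementing the tiering proposed above."""
    if collection_size < 10_000:
        # Tiny collections: scale with k, but cap at 16.
        return min(k // 32 + 2, 16)
    elif collection_size < 100_000:
        return 8
    else:
        return 4

# e.g. self.searcher.configure(ncells=ncells_for(k, len(self.searcher.collection)))
```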
(I'll merge it as is for now, but please do flag it if you have any issues with this implementation... So far I find that in practice most users have really small datasets (a few hundred to a few thousand documents), so I'm keen to strike a good balance in that area.)