
Feat/indexing faissless #173

Merged: 19 commits into main, Mar 18, 2024
Conversation

bclavie (Owner) commented Mar 15, 2024:

Faiss is the source of a ridiculous amount of issues with this repository, and is becoming increasingly hard to work with in wide-userbase repositories, as the versions compatible with recent CUDA drivers are only available via conda or by building from source.

This PR introduces:

  • A slight fix for RAGatouille: Dedicated ModelIndex #158, to avoid force-reloading the searcher
  • An implementation of K-Means in raw PyTorch
  • Indexing defaulting to this implementation for any collection smaller than 500k documents (the assumption being that users with massive collections are likely to be attached to the canonical, research behaviour of colbert-ai)
  • A warning message presented to users about the behaviour change
  • The possibility to pass use_faiss=True to force faiss to be used
  • A fallback: any exception raised while indexing with the torch K-Means reverts to the existing faiss behaviour (see the sketch after this list)
  • If this is successful, v0.9.0 will remove faiss as a dependency
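
A minimal sketch of the dispatch and fallback logic described above; the wrapper, the helper names, and the threshold constant are illustrative assumptions, not RAGatouille's actual API:

```python
# Hypothetical sketch of the dispatch described in this PR; the helper names
# and the wrapper are illustrative only.

def _train_kmeans_faiss(**kwargs):
    """Placeholder for the canonical faiss-based colbert-ai K-Means."""
    ...

def _train_kmeans_torch(**kwargs):
    """Placeholder for the new pure-PyTorch K-Means."""
    ...

FAISS_THRESHOLD = 500_000  # collections at or above this keep the faiss behaviour

def train_kmeans(collection_size: int, use_faiss: bool = False, **kwargs):
    if use_faiss or collection_size >= FAISS_THRESHOLD:
        return _train_kmeans_faiss(**kwargs)
    try:
        return _train_kmeans_torch(**kwargs)
    except Exception as exc:
        # Fallback: any failure in the torch path reverts to the existing behaviour.
        print(f"PyTorch K-Means failed ({exc!r}); falling back to faiss.")
        return _train_kmeans_faiss(**kwargs)
```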

@bclavie bclavie requested review from okhat and Anmol6 March 15, 2024 13:25
@bclavie bclavie requested a review from jlscheerer March 15, 2024 16:56
bclavie (Owner, Author) commented Mar 15, 2024:

@jlscheerer will merge if this looks good to you!

Review thread on ragatouille/models/index.py (resolved). Diff context:
"If you're confident with FAISS working issue-free on your machine, pass use_faiss=True to revert to the FAISS-using behaviour."
)
print("--------------------")
CollectionIndexer._original_train_kmeans = CollectionIndexer._train_kmeans
Collaborator:

can we use another variable to track this? Would avoid directly setting object attributes!

bclavie (Owner, Author):

We could, yeah! This was mostly for the convenience of checking with hasattr later on, but it might be better practice to set it on a new object instead. I'll change it.

bclavie (Owner, Author):

Mentioned on Discord but I'm actually thinking this is a relatively sane way of doing it, because we need to keep it alive for the entirety of the session -- we're monkey-patching the colbert-ai indexer itself and we want to be able to revert anytime someone needs to use faiss, so local variables wouldn't cut it.

Collaborator:

right yea, makes sense!

In this case, can we assign the faiss and non-faiss K-Means functions as class attributes of PLAIDModelIndex? Then, at build time, we can just toggle between them (to set CollectionIndexer._train_kmeans) based on the monkey_patch flag. Wdyt?
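
A rough sketch of what that class-attribute toggle could look like; the import path follows colbert-ai's layout, while the attribute names, the build() signature, and the placeholder torch K-Means are assumptions rather than RAGatouille's actual code:

```python
# Illustrative sketch of the suggested class-attribute toggle.
from colbert.indexing.collection_indexer import CollectionIndexer


def _torch_kmeans(self, *args, **kwargs):
    """Placeholder for the pure-PyTorch K-Means introduced in this PR."""
    ...


class PLAIDModelIndex:
    # Keep both implementations alive for the whole session as class attributes.
    _train_kmeans_faiss = staticmethod(CollectionIndexer._train_kmeans)
    _train_kmeans_torch = staticmethod(_torch_kmeans)

    def build(self, use_faiss: bool = False, **kwargs):
        # Toggle the monkey-patch once, at build time, based on the flag.
        CollectionIndexer._train_kmeans = (
            self._train_kmeans_faiss if use_faiss else self._train_kmeans_torch
        )
        # ... rest of the indexing logic
```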

Collaborator:

Curious to get your thoughts here too @jlscheerer !

Collaborator:

I like the idea of having it be a class attribute on PLAIDModelIndex and toggling it on use (+ persisting the flag). This would perhaps provide more consistent behaviour when rebuilding/adding to an already persisted index (e.g., if we decide to rebuild as part of add_to_index).

bclavie (Owner, Author) commented Mar 18, 2024:

Great suggestion! Implemented now.

ragatouille/models/index.py (review thread resolved)
ragatouille/models/torch_kmeans.py (review thread resolved)
ragatouille/models/index.py (review thread resolved, outdated)
jlscheerer (Collaborator) left a comment:

Looks awesome! Besides some minor inconsistencies for low-score documents, everything works well for me locally (albeit on a small corpus). I really only have some very minor nits!

ragatouille/models/torch_kmeans.py (review thread resolved, outdated)
bclavie (Owner, Author) commented Mar 15, 2024:

I'll be off now and probably away most of the weekend, but pointing it out here:

@jlscheerer found out that we can get empty results for queries with no relevant docs. I'm not 100% sure what the cause is, but one potential reason is that faiss doesn't allow any empty clusters at the end of an iteration, whereas I'm pretty sure our approach isn't safe from empty clusters. Thankfully, it'll be pretty trivial to add empty-cluster handling and see if it fixes it!
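
For reference, a minimal sketch of one empty-cluster handling strategy in a PyTorch K-Means loop, re-seeding empty centroids from the points furthest from their assigned centroid; this is an illustrative heuristic, not necessarily the handling that landed in the PR:

```python
import torch

def reseed_empty_clusters(points: torch.Tensor,
                          assignments: torch.Tensor,
                          centroids: torch.Tensor) -> torch.Tensor:
    """Give every empty cluster a new seed (illustrative heuristic)."""
    k = centroids.shape[0]
    counts = torch.bincount(assignments, minlength=k)
    empty = (counts == 0).nonzero(as_tuple=True)[0]
    if empty.numel() == 0:
        return centroids
    # Points that are worst-served by their current centroid become the new seeds.
    dists = (points - centroids[assignments]).pow(2).sum(dim=-1)
    farthest = dists.argsort(descending=True)[: empty.numel()]
    centroids = centroids.clone()
    centroids[empty] = points[farthest]
    return centroids
```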

bclavie (Owner, Author) commented Mar 18, 2024:

Re-implemented the K-Means to be closer to faiss' implementation. I was struggling to reproduce benchmarks (JaColBERT on JSQuAD was losing 4pts recall@1 and 3pts recall@5). The new implementation gets virtually identical results (91.99 R@1 vs 92.07 for FAISS, 97.659 R@5 vs 97.658 for FAISS). Running a larger benchmark now to make sure as I apply the other fixes, but it's likely looking good enough to release today.

@jlscheerer could you quickly try again on your dataset? There's now empty cluster handling implemented, so empty results shouldn't occur anymore.

Anmol6 (Collaborator) left a comment:

🔥

bclavie (Owner, Author) commented Mar 18, 2024:

Stepping back on my enthusiasm a tiny bit: it OOMs pretty easily, so I'm trying to improve batching/memory usage without impacting performance 😄
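
For context, the memory pressure in the assignment step comes from the full (n_points × n_centroids) distance matrix; one common mitigation is to compute assignments chunk by chunk. A minimal sketch under that assumption (function name and batch size are illustrative, not the PR's code):

```python
import torch

def assign_in_batches(points: torch.Tensor, centroids: torch.Tensor,
                      batch_size: int = 16_384) -> torch.Tensor:
    """Nearest-centroid assignment computed chunk by chunk to bound peak memory."""
    assignments = torch.empty(points.shape[0], dtype=torch.long, device=points.device)
    for start in range(0, points.shape[0], batch_size):
        chunk = points[start:start + batch_size]
        # The (batch, k) distance matrix is only materialised for one chunk at a time.
        dists = torch.cdist(chunk, centroids)
        assignments[start:start + batch_size] = dists.argmin(dim=-1)
    return assignments
```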

Commits:

  • feat: rework kmeans to be closer to FAISS
  • chore: store kmeans functions as class attributes
  • fix: method assignment
  • chore: more memory efficient
  • lint
  • chore: lower bsize, results unaffected
  • feat: better batching, slower max doc count
  • chore: batch size safe for 8gb GPUs
  • chore: more elaborate warning
  • chore: use external lib to support minibatching, revert to homebrew later

@bclavie bclavie enabled auto-merge (squash) March 18, 2024 19:20
@bclavie bclavie merged commit d27b693 into main Mar 18, 2024
2 checks passed
@bclavie bclavie deleted the feat/indexing_faissless branch March 18, 2024 19:46