Using ColBERT purely as a re-ranker #6
That would be so cool! I have some code for this, @bclavie, I can get it. QQ for @hiranya911: do you want the docs to be pre-encoded, or supplied at query time?
@okhat I think I want to pass the documents as raw text, similar to how the MS MARCO cross-encoder API is set up. But I'm sure passing pre-encoded docs is a valid use case too.
Hey @hiranya911, this is definitely something that I'll be adding to the roadmap (@okhat, please do share the code you have 😄). Thanks for the suggestion!
@bclavie When you're done, ping me here, and I'll PR weaviate/reranker-transformers to add RAGatouille there!
Will do, @timothepearce! This is (probably) the last feature I'll push before spending some time on important housekeeping (setting up CI, tests, better documentation, tutorials for training on a new language, etc.), but I'm hoping to have it out next week (in beta, just like everything else in RAGatouille at the moment 😄)!
Hey @hiranya911 @timothepearce, closing this issue as it's now available in 0.0.4a1 #31 🥳 |
This is working like a charm. Thanks for the quick turnaround 🙏 A couple of questions when you have a moment:
This is a good question and could do with more explanation. They're non-normalised MaxSim scores, which is how ColBERT scores documents: for each query token, compute the cosine similarity with every document token and keep only the maximum; the document's total score is the sum of those per-token maxima. (A good, slightly longer explanation can be found here.) This could be normalised to give a "relevance" estimate.
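The scoring described above can be sketched in a few lines of numpy. This is a toy illustration of the MaxSim idea, not RAGatouille's actual implementation, and the function names are made up:

```python
import numpy as np

def maxsim_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """Non-normalised MaxSim: for each query-token embedding, take the best
    cosine similarity against any document-token embedding, then sum."""
    # Normalise rows to unit length so dot products are cosine similarities.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = q @ d.T                      # (n_query_tokens, n_doc_tokens)
    return float(sims.max(axis=1).sum())  # best doc token per query token

def normalised_maxsim(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """A simple "relevance"-style estimate: divide by the number of query
    tokens, which bounds the score to [-1, 1]."""
    return maxsim_score(query_embs, doc_embs) / len(query_embs)
```

Note that the raw score grows with the number of query tokens, which is why scores from different queries aren't directly comparable without some normalisation like the above.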
Anything up to your ColBERT base model's maximum length (512 for ColBERTv2, which uses bert-base-uncased) is fine, but the longer the documents, the slower the process. I think it's mostly about finding the sweet spot between document length and your efficiency constraints!
Is there a way to use this library only as a re-ranker? Something like: