
Using Colbert purely as a re-ranker #6

Closed
hiranya911 opened this issue Jan 4, 2024 · 9 comments
@hiranya911
Is there a way to use this library only as a re-ranker? Something like:

results = RAG.rerank(question = '...', docs = [...])
@okhat
Collaborator

okhat commented Jan 4, 2024

That would be so cool! I have some code for this, @bclavie, I can get it to you.

QQ for @hiranya911: do you want the docs to be pre-encoded, or supplied at query time?

@santhnm2

santhnm2 commented Jan 4, 2024

The search function in ColBERT accepts a pids argument which can be used to rank only the given documents.
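To illustrate the idea of restricting ranking to a given set of passage ids, here is a toy sketch in plain Python. This is not the real ColBERT API; `rank_subset` and the word-overlap scorer are hypothetical stand-ins for the actual `search(..., pids=...)` call.

```python
def rank_subset(score_fn, query, corpus, pids):
    """Score only the passages whose ids appear in pids, best first."""
    scored = [(pid, score_fn(query, corpus[pid])) for pid in pids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def overlap(query, doc):
    """Trivial word-overlap scorer, just for the demo."""
    return len(set(query.split()) & set(doc.split()))

corpus = {
    0: "colbert late interaction retrieval",
    1: "cooking pasta at home",
    2: "colbert reranking documents",
}
# Only pids 0 and 2 are scored; pid 1 is never touched.
ranked = rank_subset(overlap, "colbert reranking", corpus, pids=[0, 2])
```

The point is that only the supplied ids are scored, so the rest of the index is never consulted, which is exactly what a reranking workflow needs.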

@hiranya911
Author

@okhat I think I want to pass the documents as raw text, kind of similar to how the MS MARCO cross-encoder API is set up. But I'm sure passing pre-encoded docs is a valid use case too.

@bclavie bclavie added the enhancement New feature or request label Jan 5, 2024
@bclavie
Collaborator

bclavie commented Jan 6, 2024

Hey @hiranya911, this is definitely something that I'll be adding to the roadmap (@okhat, please do share the code you have 😄), thanks for the suggestion!

@timothepearce

@bclavie When you're done, ping me here, and I'll PR weaviate/reranker-transformers to add RAGatouille there!

@bclavie
Collaborator

bclavie commented Jan 6, 2024

Will do @timothepearce

This is (probably) the last feature I'll push before spending some time on important housekeeping (setting up CI, tests, better documentation, tutorials for training on a new language, etc.), but I'm hoping to have it out next week (in beta, just like everything else in RAGatouille at the moment 😄)!

@bclavie bclavie added the ongoing Feature is currently being worked on label Jan 7, 2024
@bclavie bclavie self-assigned this Jan 7, 2024
@bclavie
Collaborator

bclavie commented Jan 10, 2024

Hey @hiranya911 @timothepearce, closing this issue as it's now available in 0.0.4a1 #31 🥳

@bclavie bclavie closed this as completed Jan 10, 2024
@bclavie bclavie removed the ongoing Feature is currently being worked on label Jan 10, 2024
@hiranya911
Author

hiranya911 commented Jan 29, 2024

This is working like a charm. Thanks for the quick turnaround 🙏

A couple of questions when you have a moment:

  1. What are the scores returned by the rerank() API? Are they logits (log probabilities) or some other scaled values?
  2. Are there any recommendations on the content length of documents passed into rerank()?

@bclavie
Collaborator

bclavie commented Jan 29, 2024

What are the scores returned by the rerank() API? Are they logits (log probabilities) or some other scaled values?

This is a good question and could do with more explaining. They're non-normalised MaxSim scores, which is how ColBERT scores documents: for each query token, compute the cosine similarity with every document token and keep only the maximum, then sum those per-query-token maxima to get the total score. (A good, slightly longer explanation can be found here.) These scores could be normalised to give a "relevance" estimate.
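For the curious, the MaxSim sum described above can be sketched in a few lines of plain Python. The 2-D embeddings below are toy values for readability, not real ColBERT token vectors, and the normalisation at the end is just one naive option, not what RAGatouille does.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def maxsim(query_embs, doc_embs):
    """ColBERT-style late interaction: for each query token embedding,
    keep the best cosine similarity over all document token embeddings,
    then sum those per-query-token maxima."""
    return sum(max(cosine(q, d) for d in doc_embs) for q in query_embs)

# Two query tokens, two document tokens (unit vectors).
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.6, 0.8]]

score = maxsim(query, doc)      # 1.0 (first token) + 0.8 (second token)
relevance = score / len(query)  # naive normalisation into [-1, 1]
```

Dividing by the number of query tokens is one simple way to squash the raw sum into a bounded "relevance" estimate, since each per-token maximum is itself a cosine in [-1, 1].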

Are there any recommendations on the content length of documents passed into rerank()?

Anything up to the maximum sequence length of your ColBERT model's underlying base model (for ColBERTv2, that's bert-base-uncased, so 512 tokens) is fine, but the longer the documents, the slower the process. I think it's mostly about finding the sweet spot between doc length and your efficiency constraints!
