docs: Add docs for MultiRetriever and TextEmbeddingRetriever#11219
docs: Add docs for MultiRetriever and TextEmbeddingRetriever#11219
MultiRetriever and TextEmbeddingRetriever#11219Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Code reviewNo issues found. Checked for bugs and CLAUDE.md compliance. |
julian-risch
left a comment
There was a problem hiding this comment.
Looks good to me! The only thing I am wondering about is the deduplication, in particular the ranking. As a user, I would like to know more details about the deduplication. So adding something like "keeping the duplicate with the highest score if a score is present" to the docs could help. Related to that I thought that reciprocal rank fusion might actually make more sense here? For example, having a deduplicate_with_rff utility function in
haystack/haystack/utils/misc.py
Line 129 in 3cf91d9
|
Related to the above, I guess the ranking of results depends right now on which retriever in the ThreadPool returns results first? So not reproducible? Or are the results appended always in the order of how the retrievers were set in |
I think the comment about the highest score is misleading even if it's true. The scores coming from different retrievers can't be compared so I don't want to imply to the user that the score matters.
That's fair it could be worth adding some built-in ranking methods like rrf. I'll see about extracting a utility function that can be reused as you say. |
That's true the order would depend on which retrievers finish first. I wasn't concerned about a rank or an order for this component since users should use Rankers afterwards if they want definitive ranking of their documents. I do agree adding something like RRF can make sense in this component since it's not possible to run that ranking after merging all results into a single list. We can have this on by default, but if off I think random order is fine since ranking by dict key order in But for all other rankings I'd say we point users to our Rankers. |
|
Okay, thanks for looking into this! Feel free to open an issue or directly a PR. I would argue rff should be the default result of this new component. Mainly because of reproducibility, easy to calculate and any other ranking doesn't have a better meaning. |
👍 |
I agree setting it on by default makes sense. |
Related Issues
TextEmbeddingRetrieverandMultiRetriever#10872Proposed Changes:
Adds docs for
MultiRetrieverandTextEmbeddingRetrieverHow did you test it?
Notes for the reviewer
Checklist
fix:,feat:,build:,chore:,ci:,docs:,style:,refactor:,perf:,test:and added!in case the PR includes breaking changes.