Description
Describe the bug
Retrieval quality degrades sharply when using high top_k values with OpensearchDocumentStore. In some situations you get only about 20 quality results when you requested 100 (top_k).
Error message
Low number of top-k responses
Expected behavior
Haystack should return approximately as many quality results as requested in top_k.
Additional context
Today, the ef_search parameter is hard-coded at index creation (it is set to 20 for faiss and hnsw). With the default top_k of 10 this is fine, but when you request wider result sets, the low ef_search value degrades search quality.
The ef_search parameter trades result quality against search speed, but that decision should be left to the user. On high-end cloud instances, ef_search values around 500 are not a problem: they still give good search speed and great results.
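To illustrate what a user-configurable value could look like, here is a minimal sketch of a helper that builds an OpenSearch create-index body with ef_search passed in by the caller. The function name `build_knn_index_body`, the field name, the dimension, and the `ef_construction`, `m`, and `space_type` values are placeholders chosen for the example; they are not Haystack's actual code or defaults. The `knn.algo_param.ef_search` index setting and the `knn_vector` mapping follow the OpenSearch k-NN plugin conventions.

```python
from typing import Any, Dict


def build_knn_index_body(
    embedding_field: str = "embedding",  # hypothetical field name
    embedding_dim: int = 768,            # hypothetical embedding size
    ef_search: int = 20,                 # 20 mirrors the hard-coded value reported above
) -> Dict[str, Any]:
    """Build a create-index request body with a caller-chosen ef_search."""
    if ef_search < 1:
        raise ValueError("ef_search must be a positive integer")
    return {
        "settings": {
            "index": {
                "knn": True,
                # Size of the candidate list explored at query time.
                "knn.algo_param.ef_search": ef_search,
            }
        },
        "mappings": {
            "properties": {
                embedding_field: {
                    "type": "knn_vector",
                    "dimension": embedding_dim,
                    "method": {
                        "name": "hnsw",
                        "engine": "nmslib",
                        "space_type": "l2",
                        # Illustrative build-time parameters only.
                        "parameters": {"ef_construction": 80, "m": 64},
                    },
                }
            }
        },
    }


# A user who wants top_k=100 could request a larger candidate list:
body = build_knn_index_body(ef_search=500)
```

The resulting dictionary would be passed as the body of the create-index call; how it is wired into OpensearchDocumentStore is left open here.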
The ef_search parameter can only be set during index creation, so I would like to propose letting the user change it via a parameter. In addition, when top_k is higher than ef_search, a warning should say that result quality will degrade.
Indeed, Malkov (lead author of the HNSW algorithm, which nmslib implements) suggests that ef_search should always be higher than top_k.
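The proposed warning could be as simple as the following sketch. The function name `check_top_k_against_ef_search` is hypothetical and not part of Haystack; the check only compares the two numbers and emits a standard Python warning.

```python
import warnings


def check_top_k_against_ef_search(top_k: int, ef_search: int) -> bool:
    """Return True if ef_search covers top_k; otherwise warn and return False."""
    if top_k > ef_search:
        warnings.warn(
            f"top_k ({top_k}) is higher than ef_search ({ef_search}); "
            "retrieval quality may degrade. Consider recreating the index "
            "with a larger ef_search.",
            UserWarning,
            stacklevel=2,
        )
        return False
    return True


# With the hard-coded value of 20, a request for 100 results triggers the warning.
check_top_k_against_ef_search(top_k=100, ef_search=20)
```

A warning rather than an exception keeps existing pipelines running while still making the degradation visible.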
To Reproduce
Create an OpensearchDocumentStore with default parameters, then call query_by_embedding with top_k set to a high value (for example, 100).
Collect the results.
FAQ Check
- Have you had a look at our new FAQ page?
System:
- OS: Linux
- GPU/CPU: Intel i7
- Haystack version (commit or version number): 1.7.2
- DocumentStore: OpensearchDocumentStore