On top-k queries, Lucene is now competitive with Tantivy/PISA on https://tantivy-search.github.io/bench/, but it's still quite a bit slower on counting queries. This made me want to run an experiment similar to Tony-X/search-benchmark-game#44, though with a few more changes to how skipping works:
Single level of skip lists.
Skip data and impacts are inlined between blocks of postings.
Less overhead:
no separate SkipReader abstraction that gets lazily instantiated: the skipping logic is more lightweight and lives within the postings/impacts enum logic,
checking whether to skip and decode a new block is now a single check on BlockDocsEnum, whereas it requires two different checks today.
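To make the layout concrete, here is a toy sketch of the idea (this is not Lucene's actual BlockDocsEnum; the class, block size, and encoding are simplified assumptions for illustration). Each block of postings carries an inlined skip entry holding the last doc ID of the block, so advancing past a block boils down to a single comparison against that entry, with no separate multi-level SkipReader to consult:

```java
// Hypothetical sketch of single-level, inlined skip data between postings
// blocks. Not Lucene code: a real implementation stores compressed blocks
// and impacts; here blocks are plain int arrays to keep the logic visible.
class InlinedSkipPostings {
    static final int BLOCK_SIZE = 4; // tiny for illustration; Lucene uses 128

    final int[] docs;        // all doc IDs, laid out block by block
    final int[] blockMaxDoc; // inlined skip entry: last doc ID of each block

    int block = 0;  // current block index
    int pos = -1;   // position within the current block
    int doc = -1;

    InlinedSkipPostings(int[] docs) {
        this.docs = docs;
        int numBlocks = (docs.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
        this.blockMaxDoc = new int[numBlocks];
        for (int b = 0; b < numBlocks; b++) {
            int end = Math.min((b + 1) * BLOCK_SIZE, docs.length) - 1;
            blockMaxDoc[b] = docs[end];
        }
    }

    int advance(int target) {
        // The single check: skip (and avoid decoding) whole blocks whose
        // inlined max doc is below the target.
        while (block < blockMaxDoc.length && blockMaxDoc[block] < target) {
            block++;
            pos = -1;
        }
        if (block == blockMaxDoc.length) {
            return doc = Integer.MAX_VALUE; // postings exhausted
        }
        // Linear scan within the (conceptually decoded) current block.
        int start = block * BLOCK_SIZE;
        if (pos < 0) pos = 0;
        while (docs[start + pos] < target) pos++;
        return doc = docs[start + pos];
    }
}
```

With a single level, a long skip over many blocks costs one comparison per block rather than one jump per skip level, which trades asymptotic skip cost for less per-advance overhead.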
A hacky implementation of this can be found at https://github.com/apache/lucene/compare/main...jpountz:lucene:skip_experiment?expand=1. It doesn't replace the existing skip data; it just adds additional skip data and impacts inlined between blocks.
Only BlockDocsEnum and BlockImpactsDocsEnum switched to this new skip data; other impls still use the existing skip data. So term queries will see a change, but not phrase queries.
It's quite naive; we could probably do something a bit more efficient. Yet the results on wikibigall are interesting:
CountAndHighHigh and CountAndHighMed became almost 20% faster! These are the main queries I was targeting with this change, so it's good to see them get a significant speedup. This confirms that we have some non-negligible overhead for skipping today, though it's not easy to tell how much comes from the additional abstractions vs. the multiple levels of skip lists.
OrNotHighLow and OrNotHighMed are faster. This is because the bottleneck of these queries is advancing the MUST_NOT clause, which is not scoring. So it's very similar to the speedup we're seeing on the counting queries.
AndHighLow and AndHighMed are 8%-11% faster. Again, I would attribute this to the faster skipping logic, since these queries combine clauses with different doc frequencies, so the higher-frequency clause needs to do a lot of skipping to catch up with the leading clause. Interestingly, the fact that we are storing a single level of impact data doesn't hurt.
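The catch-up pattern in a conjunction can be sketched as a leapfrog intersection (a simplified sketch, not Lucene's ConjunctionDISI: advances are one doc at a time here, whereas real postings would jump whole blocks via skip data at exactly the marked step):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative leapfrog intersection of two sorted postings lists, as in
// AndHighLow: the low-frequency clause leads, and the high-frequency clause
// must repeatedly advance to catch up, so the cost of each advance dominates.
class Leapfrog {
    static List<Integer> intersect(int[] low, int[] high) {
        List<Integer> hits = new ArrayList<>();
        int i = 0, j = 0;
        while (i < low.length && j < high.length) {
            if (high[j] < low[i]) {
                // Catch-up step for the high-frequency clause: with real
                // postings this advance would leapfrog over whole blocks via
                // skip data rather than stepping one doc at a time.
                j++;
            } else if (high[j] > low[i]) {
                i++; // advance the leading (rarer) clause
            } else {
                hits.add(low[i]); // doc matches both clauses
                i++;
                j++;
            }
        }
        return hits;
    }
}
```

Since the high-frequency clause executes the catch-up step far more often than any other, shaving overhead off each advance translates directly into query-level speedups.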
AndHighHigh and OrHighHigh are slightly slower (or is it noise?). I could believe that there is a small performance hit here due to having a single level of impact data. This forces Lucene to use the maximum score across the entire doc ID space as a score upper bound for the clause that has the higher cost. Maybe computing global impacts would be enough to get better performance on these queries, by providing slightly better score upper bounds for that clause.
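The effect of the looser bound can be illustrated with a small hypothetical (the class and numbers below are made up for illustration; this is the pruning criterion from block-max-style scoring, not Lucene's actual Impacts API): a block can be skipped when its score upper bound falls below the minimum competitive score of the top-k heap, and a single global bound lets far fewer blocks qualify than per-block bounds do.

```java
// Hypothetical illustration: counting how many blocks a scorer could skip
// given either per-block score upper bounds (as multi-level impacts allow)
// or one global upper bound (all that a single level of impact data gives
// over a large doc ID range).
class ImpactPruning {
    static int prunableBlocks(float[] blockMaxScore, float globalBound,
                              float minCompetitive, boolean perBlock) {
        int pruned = 0;
        for (float blockBound : blockMaxScore) {
            float bound = perBlock ? blockBound : globalBound;
            if (bound < minCompetitive) {
                pruned++; // upper bound can't beat the heap: skip the block
            }
        }
        return pruned;
    }
}
```

With per-block bounds {0.5, 2.0, 0.8, 3.0} and a minimum competitive score of 1.0, two blocks are prunable; with only the global bound 3.0, none are, so the scorer must visit every block.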
HighTerm, MedTerm, OrHighNotLow, OrHighNotMed, OrHighNotHigh and OrNotHighHigh are slower. This is expected, as these are queries that have a single positive clause, which in turn are queries where the score upper bounds that we compute are very close to the actual produced scores, which in turn enables these queries to take advantage of the higher levels of impacts to skip more docs at once.
HighTermMonthSort is slower. This is because the sort dynamically introduces a filter that is so selective that the term query can take advantage of skip data at higher levels to skip more docs at once.
CountOrHighHigh is a bit slower because there's a bit more overhead to collect postings lists exhaustively now that skip data and impacts are inlined.
It's not a net win, but this suggests that we have some room for improvement here.
Whoa, very cool @jpountz! This reminds me of this longstanding issue/paper which also inlined skip data directly in the postings, but maybe was still multi-level?