Skip to content

Improve cost estimation in SortedSetDocValuesRangeQuery when using DocValuesSkipper#16061

Merged
iverase merged 10 commits into
apache:mainfrom
iverase:SortedSetDocValuesRangeQuery
May 14, 2026
Merged

Improve cost estimation in SortedSetDocValuesRangeQuery when using DocValuesSkipper#16061
iverase merged 10 commits into
apache:mainfrom
iverase:SortedSetDocValuesRangeQuery

Conversation

@iverase
Copy link
Copy Markdown
Contributor

@iverase iverase commented May 13, 2026

I noticed that the current implementation always return values.cost() because we are delaying the construction of the iterator until the last moment. This is because we need to do some lookups for resolving the ordinals of the query.

Still when there is a skipper and the index is dense and used as primary sort, we have a very fast way to compute the iterator and the exact cost so let's move that conditions when building the score supplier. This is inline to what we are doing in SortedNumericDocValuesRangeQuery.

@romseygeek
Copy link
Copy Markdown
Contributor

Might be me not grokking things properly but I don't see where this is changing the cost() implementation?

@iverase
Copy link
Copy Markdown
Contributor Author

iverase commented May 13, 2026

          final DocIdSetIterator psIterator =
              getDocIdSetIteratorForDensePrimarySort(
                  context.reader(), singleton, skipper, minOrd, maxOrd);
          return ConstantScoreScorerSupplier.fromIterator(
              psIterator, score(), scoreMode, context.reader().maxDoc());

This supplier will use the cost of the iterator which will be maxDocId - minDocId.

score(), scoreMode, context.reader().maxDoc());
}
final DocIdSetIterator psIterator =
getDocIdSetIteratorForDensePrimarySort(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the bit that is called out as expensive below so I'm not sure if we should be doing it in the scorerSupplier call directly? A couple of weeks back we were discussing ways of improving cost estimates using skippers, so I wonder if that's a better way to go?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that comment predates skippers and it is because the resolution of the ordinals. But sure, if we don't wan to to do any IO during the construction of the supplier we shouldn't do this.

My question is, should we then modify SortedNumericDocValuesRangeQuery?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been thinking how to estimate the cost inside the ScorerSupplier without having to create the full iterator. The idea is that in cost we will only visit the skipper and estimate the cost, similar to what we do in PointValues. When building the iterator we will visit the doc values if necessary.

We need to keep a bit of state around but it is not too bad, let me know what you think.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this a lot! I wonder if its possible to use SkipBlockRangeIterator to simplify the code that finds bounds a bit?

Copy link
Copy Markdown
Contributor Author

@iverase iverase May 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to use the skipper directly here, I noticed we are being a bit inefficient here. For example:

                skipper.advance(minOrd, Long.MAX_VALUE);
                skipperMinDocId = skipper.minDocID(0);
                skipperMinDocIdExact = false;

we are visiting always the doc values but it can be change by:

                skipper.advance(minOrd, Long.MAX_VALUE);
                skipperMinDocId = skipper.minDocID(0);
                skipperMinDocIdExact = skipper.minValue(0) == minOrd;

So we only visit the doc values if the block does not start from the minOrd. This should be true very often because of the way we are building the blocks. There are other tricks from the maxOrd too!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed the optimizations I was thinking about in 2d89064

Copy link
Copy Markdown
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@iverase iverase merged commit 091b35a into apache:main May 14, 2026
13 checks passed
@iverase iverase deleted the SortedSetDocValuesRangeQuery branch May 14, 2026 15:13
iverase added a commit that referenced this pull request May 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants