Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some minor code cleanup in IndexSortSortedNumericDocValuesRangeQuery #12003

Conversation

gsmiller
Copy link
Contributor

@gsmiller gsmiller commented Dec 8, 2022

  • Leverage DISI static factory methods more over custom DISI impl where possible
  • Assert points field is a single-dim in a couple places
  • Bound cost estimate by the cost of the doc values field (for sparse fields)

Description

This PR contains some minor cleanup I thought might be useful after recently spending a little time looking at this code.

* Leverage DISI static factory methods more over custom DISI impl where possible.
* Assert points field is a single-dim.
* Bound cost estimate by the cost of the doc values field (for sparse fields).
Copy link
Contributor

@vigyasharma vigyasharma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making these changes Greg, they seem to simplify the overall code flow. Changes look good to me, I only have a couple of questions for my understanding.

@@ -692,7 +697,7 @@ public int advance(int target) throws IOException {

@Override
public long cost() {
return lastDoc - firstDoc;
return Math.min(delegate.cost(), lastDoc - firstDoc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my understanding, why can't we just return delegate.cost() here? It seems to me that we're returning DocIdSetIterator.range() wherever cost would be equal to lastDoc - firstDoc. Is that not the case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. So delegate.cost() will approximate the number of documents that could be provided in total by the delegate (in this case, it's NumericDocValues). lastDoc - firstDoc of course will provide the correct number of docs if every doc has a value. In this case, delegate tells us what docs actually have values, and because we're using this, we know not all docs have values, so lastDoc - firstDoc may overestimate. If the number of docs containing a value are very sparse though, it's possible we'll over-estimate by a lot, so providing a ceiling using delegate.cost() could be useful in certain cases. Hope that makes sense?

if (matchAll(points, queryLowerPoint, queryUpperPoint)) {
int maxDoc = context.reader().maxDoc();
if (points.getDocCount() == maxDoc) {
return delegate;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious if we can return DocIdSetIterator.all(maxDoc) here. Does it break correctness is some way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion! That should be functionally correct (and more efficient than relying on delegate here). Thanks!

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is much better this way indeed! There is just one bit I'm not comfortable with regarding relying on cost() being accurate but maybe we can work around it by providing a match count separately?

return disi.lastDoc - disi.firstDoc;
DocIdSetIterator disi = getDocIdSetIteratorOrNull(context);
if (disi != null && disi instanceof BoundedDocIdSetIterator == false) {
return Math.toIntExact(disi.cost());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry that this might be a bit fragile since cost() has no guarantee to be accurate. I wonder if we could make getDocIdSetIteratorOrNull() return both a DocIdSetIterator and a number of matches (possibly -1 when unknown) to make this less trappy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a great point. I'll work on making this less fragile. Thanks!

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@gsmiller gsmiller merged commit 8671e29 into apache:main Dec 10, 2022
@gsmiller gsmiller deleted the explore/IndexSortSortedNumericDocValuesRangeQuery-tweaks branch December 10, 2022 20:23
gsmiller added a commit that referenced this pull request Dec 10, 2022
…12003)

* Leverage DISI static factory methods more over custom DISI impl where possible.
* Assert points field is a single-dim.
* Bound cost estimate by the cost of the doc values field (for sparse fields).
@gsmiller
Copy link
Contributor Author

Thanks @jpountz. Merged/backported.

@LuXugang
Copy link
Member

Thanks @gsmiller , a new syntactic sugar record to me and first time appeard in lucene code.

@rmuir rmuir added this to the 9.5.0 milestone Jan 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants