Skip to content

Implement method to add all stream elements into a PriorityQueue.#15823

Merged
gsmiller merged 23 commits into
apache:mainfrom
vsop-479:add_PriorityQueue#addNoShift
Mar 31, 2026
Merged

Implement method to add all stream elements into a PriorityQueue.#15823
gsmiller merged 23 commits into
apache:mainfrom
vsop-479:add_PriorityQueue#addNoShift

Conversation

@vsop-479
Copy link
Copy Markdown
Contributor

@vsop-479 vsop-479 commented Mar 14, 2026

In DisjunctionMaxBulkScorer's constructor all values to compare is 0, so we can skip the shift.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a specialized PriorityQueue insertion method that skips heap adjustment and uses it to optimize initialization of DisjunctionMaxBulkScorer, where all initial queue keys are equal (next == 0).

Changes:

  • Add PriorityQueue#addNoShift(T) to append elements without upHeap.
  • Use addNoShift when building the scorer queue in DisjunctionMaxBulkScorer’s constructor.
  • Minor local refactor in DisjunctionMaxBulkScorer.collect to reuse the computed delta.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
lucene/core/src/java/org/apache/lucene/util/PriorityQueue.java Adds addNoShift as a new insertion API that bypasses heap restoration.
lucene/core/src/java/org/apache/lucene/search/DisjunctionMaxBulkScorer.java Uses addNoShift during queue initialization; small delta reuse cleanup.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

}

/**
* Adds an Object to a PriorityQueue without upHeap. Note: only use it when all elements have a
Comment on lines +193 to +194
* Adds an Object to a PriorityQueue without upHeap. Note: only use it when all elements have a
* same value.
Comment on lines +197 to +199
// TODO: Check element equalsTo lastElement, if we can get the comparator.

int index = size + 1;
* same value.
*/
public final void addNoShift(T element) {
// TODO: Check element equalsTo lastElement, if we can get the comparator.
* Adds an Object to a PriorityQueue without upHeap. Note: only use it when all elements have a
* same value.
*/
public final void addNoShift(T element) {
@gsmiller
Copy link
Copy Markdown
Contributor

I'm a bit uncomfortable with adding a "no shift" API to PQ. It has potential to be pretty trappy. Is there a better way to get at what you're trying to do? Maybe we could leverage #addAll from the DisjunctionMaxBulkScorer ctor? I get that we don't need the downHeap operation if everything is actually equal, but it doesn't seem particularly expensive?

@vsop-479
Copy link
Copy Markdown
Contributor Author

Thanks for your reply @gsmiller !

but it doesn't seem particularly expensive?

No, it doesn't.

Maybe we could leverage #addAll from the DisjunctionMaxBulkScorer ctor?

I will try.

…rityQueue, and leverage it in DisjunctionMaxBulkScorer's constructor.
@github-actions github-actions Bot added this to the 11.0.0 milestone Mar 19, 2026
@vsop-479
Copy link
Copy Markdown
Contributor Author

vsop-479 commented Mar 19, 2026

Maybe we could leverage #addAll from the DisjunctionMaxBulkScorer ctor?

I used a variant: add an ElementSupplier to avoid duplicate loop.

@vsop-479 vsop-479 changed the title Add PriorityQueue#addNoShift, and use it in DisjunctionMaxBulkScorer's constructor. Implement method to convert and add all collection elements to a PriorityQueue, and leverage it in DisjunctionMaxBulkScorer's constructor. Mar 19, 2026
@gsmiller
Copy link
Copy Markdown
Contributor

I used an variant: add an ElementSupplier to avoid duplicate loop.

Oh, that works too. That's a creative way to avoid the comparison check on initialization :)

Copy link
Copy Markdown
Contributor

@gsmiller gsmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I'm supportive of adding this but I'm a bit on-the-fence with whether-or-not it's worth expanding the API of PriorityQueue for this case. I'm leaning towards adding it though. It does seem like a nice-to-have feature for cases where calling code needs to do some "element conversion" when populating a PQ instance.

Added a few small comments. Thanks!

Comment thread lucene/core/src/java/org/apache/lucene/util/PriorityQueue.java Outdated
}

/**
* Similar to {@link #addAll(Collection)}, but supply an {@link ElementSupplier} to convert origin
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could be more explicit in the documentation that the value of this version of addAll is for cases where you need to convert between element types since it doesn't require the caller to do the conversion at the call site?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added more documentation, not sure it is explicit enough.

* Similar to {@link #addAll(Collection)}, but supply an {@link ElementSupplier} to convert origin
* element to target element. This is useful when source elements does not fit target elements.
*/
public <S> void addAll(Collection<S> elements, ElementSupplier<T, S> elementSupplier) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add test coverage for this method?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

* Similar to {@link #addAll(Collection)}, but supply an {@link ElementSupplier} to convert origin
* element to target element. This is useful when source elements does not fit target elements.
*/
public <S> void addAll(Collection<S> elements, ElementSupplier<T, S> elementSupplier) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be more natural for addAll to accept Iterable, followed by an utility method Iterables.transforming(Iterable, Function<Y,X>). Many libraries do this kind of view thing - we can't reuse them but it should be trivial to implement. Just a thought.

Copy link
Copy Markdown
Contributor Author

@vsop-479 vsop-479 Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use Guava's Iterables implemented it, it works too (transform in iterating). Here is the code(uncommitted):

public <S> void addAll(
      Collection<S> elements, com.google.common.base.Function<S, T> elementConverter) {
    if (this.size + elements.size() > this.maxSize) {
      throw new ArrayIndexOutOfBoundsException(
          "Cannot add "
              + elements.size()
              + " elements to a queue with remaining capacity: "
              + (maxSize - size));
    }

    Iterable<T> transform = Iterables.transform(elements, elementConverter);

    // Heap with size S always takes first S elements of the array,
    // and thus it's safe to fill array further - no actual non-sentinel value will be overwritten.
    for (T element : transform) {
      this.heap[size + 1] = element;
      this.size++;
    }

    // The loop goes down to 1 as heap is 1-based not 0-based.
    for (int i = (size >>> 1); i >= 1; i--) {
      downHeap(i);
    }
  }

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many libraries do this kind of view thing - we can't reuse them but it should be trivial to implement.

If we prefer this way(I think current committed way is ok too:), do you mean we should implement lucene's Iterables, Iterators, etc. Ranther than use these libraries directly?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect the idea here would be:

  1. Not use a 3rd party library directly (we don't want that dependency in core)
  2. Have an addAll method that simply takes a java.lang.Iterable as a single input but create our own implementation of Iterable that applies the transformation internally. Much like what Guava's Iterables#transform method does. This is probably something we'd put in core (maybe in o.a.l.util)? We could probably deprecate the current addAll method that takes a Collection in favor of this new API as well.

This approach makes a lot of sense to me, and I don't think it would be too much work to do. (Hopefully I'm understanding the suggestion correctly).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not use a 3rd party library directly (we don't want that dependency in core).

Thanks for explicit that.

create our own implementation of Iterable that applies the transformation internally.

OK, I will try this approach.

@vsop-479
Copy link
Copy Markdown
Contributor Author

vsop-479 commented Mar 23, 2026

I implemented it with Iterables/Iterators (they only have transform function for now).

Please take a look when you get a chance @dweiss, @gsmiller .

I like the design of Iterables/Iterators. But, there are some notes i think we may care in Guava's Iterables:

Java 8+ users: several common uses for this class are now more comprehensively addressed by the new Stream library. Read the method documentation below for comparisons. This class is not being deprecated, but we gently encourage you to migrate to streams.

Note on Iterables#transform:

Stream equivalent: Stream.map.

So, I test it with Stream, it works too. The code with stream likes:

// Use stream to transform and iterate.
    elements.stream().map(elementTransformer).forEach(e -> {
      this.heap[size + 1] = e;
      this.size++;
    });

* param elements is different from the type of this {@link PriorityQueue}, since it doesn't
* require the caller to do the conversion at the call site.
*/
public <S> void addAll(Collection<S> elements, Function<S, T> elementTransformer) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think the proposed idea is a little different actually. The idea would be for this method signature to simply be:
public void addAll(Iterable<T> elements).

Then, from your calling code in DisjunctionMaxBulkScorer, you'd call like this:
this.scorers.addAll(Iterables.transform(scorers, BulkScorerAndNext::new));

The goal is that PriorityQueue doesn't need to know anything about the translation logic. That all gets encapsulated behind the Iterator that is setup by the calling code. WDYT?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal is that PriorityQueue doesn't need to know anything about the translation logic. That all gets encapsulated behind the Iterator that is setup by the calling code. WDYT?

+1. There is a minor issue with public void addAll(Iterable<T> elements), the original addAll validate the size with Collection#size firstly. If we want keep this validation, maybe the caller need input the size of elements? like this:public void addAll(Iterable<T> elements, int size)?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how important that validation is. It's an exceptional case (the caller really shouldn't be overflowing their PQ capacity). We'll hit the same AIOOB exception anyway in the for-each loop that's populating the array, so the outcome is really the same if you overflow the capacity, so I'm not sure that early check adds a ton of value. It's nice to do up-front, but I don't think I'd design the API around it. So I guess I don't think it's a problem to remove this up-front sizing check, but maybe you're considering something I'm not? Do you think it's critical? Thanks again for iterating!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll hit the same AIOOB exception anyway in the for-each loop that's populating the array

Indeed. That is what PriorityQueue#add do (when caller iterate their elements to call ). So i removed the validation and implmented public void addAll(Iterable<T> elements).

@vsop-479 vsop-479 changed the title Implement method to convert and add all collection elements to a PriorityQueue, and leverage it in DisjunctionMaxBulkScorer's constructor. Implement Iterables#transform function and change the input type of elements in PriorityQueue#addAll from Collection to Iterable. Mar 26, 2026
@uschindler
Copy link
Copy Markdown
Contributor

uschindler commented Mar 26, 2026

Hi,
I don't think we should add those self made Spliterators and all this internal collection wrapper stuff.

What I generally do when I want a filtered Iterator: get a stream from the collection/iterable (for iterables there's a trick to get it, but I'd prefer a collection instead of Iterables). The use all stream filtering or mapping and finally wrap the thing into an iterator.

@uschindler
Copy link
Copy Markdown
Contributor

So, -1 to add all this collection code that already has native support in jdk.

@vsop-479
Copy link
Copy Markdown
Contributor Author

Hi @uschindler , you prefer approach like this?

So, I test it with Stream, it works too. The code with stream likes:
// Use stream to transform and iterate.
elements.stream().map(elementTransformer).forEach(e -> {
this.heap[size + 1] = e;
this.size++;
});

@uschindler
Copy link
Copy Markdown
Contributor

Yes, or instead of foreach call iterator()

@vsop-479
Copy link
Copy Markdown
Contributor Author

This is how it's used in the disj scorer, so only allow to initialize. Later calling addAll would lead to trouble.
So basically a factory method like: produce a PQ with that size and take the elements from that stream.
AddAll should not be allowed when heap is already populated because code may misuse it and fail suddenly.

@uschindler - I don't completely follow your comment about addAll only working on a newly-constructed/empty heap. Can you elaborate a bit? I believe both of the addAll methods (existing and this proposed one) should work correctly on a heap that already contains some elements, but I might be missing something? Or maybe I'm misunderstanding your point? Oh wait, after re-reading, maybe you're describing a different option that would appropriately size a new heap for a provided stream? One that would handle the overflow case by essentially allocating a bigger PQ?

I am trying to understand this too :)

@uschindler
Copy link
Copy Markdown
Contributor

uschindler commented Mar 29, 2026

Hi,
Sorry. Add is also without overflow. So naming is fine.
In that case it's fine. Ignore my previous comment.
You can also revert the exception change. It's consistent with add, which also throws AIOOBE.

@vsop-479
Copy link
Copy Markdown
Contributor Author

vsop-479 commented Mar 29, 2026

I think there is a difference between add and addAll, with add the caller can use the element to compare to PQ’s top and maybe pop the top and reAdd the element after got an AIOOBE(maybe the normal usage is judge wether the queue is full firstly, and call updateTop?), but if we want addAll behave the same after AIOOBE, we need keep remaining elements in the stream, so the caller can add them intersect(maybe stream already works like this? I will verify it)

@uschindler
Copy link
Copy Markdown
Contributor

That's not working with streams. If it works it is an implementation detail and could break with different types of streams.
After the exception the stream is by definition no longer useable. I have to lookup the spec. The correct reuse case would be if addAll would return the number of added entries.

Then one could redo the add with a new stream using stream.skip(return value). In that case it should not throw exception. It is all not easy to decide. For now I think the current impl is fine if we add enough documentation.

I figured out that add() also throws AIOOBE. Maybe we should keep the exception; I would remove the catch block and only keep the finally intact.

Uwe

@uschindler
Copy link
Copy Markdown
Contributor

Quote from stream Java docs:

A stream should be operated on (invoking an intermediate or terminal stream operation) only once. This rules out, for example, "forked" streams, where the same source feeds two or more pipelines, or multiple traversals of the same stream. A stream implementation may throw IllegalStateException if it detects that the stream is being reused. However, since some stream operations may return their receiver rather than a new stream object, it may not be possible to detect reuse in all cases.

@vsop-479
Copy link
Copy Markdown
Contributor Author

Oh, I also noticed you're targeting 11.0 to release this. Is there any reason not to target 10.5 instead?

I figured out that add() also throws AIOOBE. Maybe we should keep the exception; I would remove the catch block and only keep the finally intact.

I moved the change entity to 10.5.0, and removed the catch block.

// Heap with size S always takes first S elements of the array,
// and thus it's safe to fill array further - no actual non-sentinel value will be overwritten.
try {
elements.forEach(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: Better to use forEachOrdered(). This makes practically no difference unless stream is parallel, but we should be correct here. Especially when we have the skip(long) mentioned in the javadcos.

@gsmiller
Copy link
Copy Markdown
Contributor

This looks mostly good to me, thanks!

I left an earlier comment regarding consistency differences between this new addAll method and the existing addAll(Collection) method that I'd like to resolve (let's either, (a) decide the inconsistency is OK, (b) change the current addAll(Collection) behavior, or (c) open a follow-up issue to address it). The inconsistent behavior relates to how we handle an "overflow" case where more elements are provided than PQ capacity:

  • Current addAll(Collection) method: Detects the issue up-front and will not add any elements to the PQ.
  • New addAll(Stream) method: Because it cannot detect the issue up-front, it partially completes the operation by adding elements until capacity.

I'd propose: Modify the current addAll(Collection) method to be consistent with the newly added addAll(Stream) method (this is the only option for consistency). (I think it's OK to do this in a follow-up issue if we want to separate the changes).

I recognize this is fairly nitpicky, but I think it's helpful to maintain as much consistency as we can here. WDYT?

Comment thread lucene/core/src/java/org/apache/lucene/util/PriorityQueue.java
int left = i << 1;
int right = left + 1;
if (right < heapArray.length) {
assert (Integer) heapArray[i] <= (Integer) heapArray[right];
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This test will silently not work if asserts are disabled in the jvm. Let's use the idiomatic juint assert methods instead. I'd have this assertHeap method return a boolean then use assertTrue from the calling method.

() -> pq.addAll(elements.stream().map(String::hashCode)));
// Partly added.
assertEquals(10, pq.size());
assertHeap(pq);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach works, but you could also pop elements off the heap in a loop and assert the ordered nature of what comes out. It's nice that you avoid mutating the heap here with your check, but since it's the last thing we're doing in a unit test, it might be simpler to just pop stuff off the heap and check rather than writing custom heap traversal logic? I don't have strong opinions I suppose. Take-it-or-leave it :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but you could also pop elements off the heap in a loop and assert the ordered nature of what comes out. It's nice that you avoid mutating the heap here with your check

Yes, pop and assert the order came to my mind firstly. And in order to avoid mutate the heap, I picked this apporach.

but since it's the last thing we're doing in a unit test, it might be simpler to just pop stuff off the heap and check rather than writing custom heap traversal logic?

But, you are right.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, totally fine by me either way :) Thanks!

Copy link
Copy Markdown
Contributor

@uschindler uschindler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making the two methods consistent if great! Thanks.

@gsmiller
Copy link
Copy Markdown
Contributor

Thanks for making the two addAll methods consistent (and for the other minor revisions)! I'll go ahead and merge this since I don't see any other outstanding feedback. Thanks again!

@gsmiller gsmiller merged commit a68055a into apache:main Mar 31, 2026
13 checks passed
@gsmiller gsmiller modified the milestones: 11.0.0, 10.5.0 Mar 31, 2026
@gsmiller
Copy link
Copy Markdown
Contributor

Oh and @vsop-479, just to respond to your earlier comment ("I will move it to 10.5. To be honest, I am not sure where it should be."), I would generally release a change in the next minor version unless it has to wait for the next major version because of something like a backwards compatibility issue. There more here if you're interested: https://cwiki.apache.org/confluence/display/LUCENE/BackwardsCompatibility

@vsop-479 vsop-479 deleted the add_PriorityQueue#addNoShift branch April 1, 2026 01:38
@vsop-479
Copy link
Copy Markdown
Contributor Author

vsop-479 commented Apr 1, 2026

Oh and @vsop-479, just to respond to your earlier comment ("I will move it to 10.5. To be honest, I am not sure where it should be."), I would generally release a change in the next minor version unless it has to wait for the next major version because of something like a backwards compatibility issue. There more here if you're interested: https://cwiki.apache.org/confluence/display/LUCENE/BackwardsCompatibility

Thanks, it is helpful to me. And thanks everyone for the review!

@uschindler
Copy link
Copy Markdown
Contributor

I think there is a bit of backwards compatibility issue because the already existing Collection based version now behaves a bit different on exception. But I don't know if this is really an issue (does anybody adds element with a collection to an already full PQ)?

@uschindler
Copy link
Copy Markdown
Contributor

I had a similar issue with changing an Exception type for the caller checks #15877, but I decided then: It's not an issue, because nobody is allowed the method from externally so the thrown exception won't matter, because there is no way to recover from that issue.

Here it is similar: If you hit an exception the difference is now that heap is sane afterwards, before it was not sane. So it is better afterwards. Anybody who tried to cleanup from it in the past would have made the heap sane, but doing that twice is not an issue.

@vsop-479
Copy link
Copy Markdown
Contributor Author

vsop-479 commented Apr 1, 2026

I think there is a bit of backwards compatibility issue because the already existing Collection based version now behaves a bit different on exception.

Oh, I missed this.

Here it is similar: If you hit an exception the difference is now that heap is sane afterwards, before it was not sane. So it is better afterwards. Anybody who tried to cleanup from it in the past would have made the heap sane, but doing that twice is not an issue.

+1.

@gsmiller
Copy link
Copy Markdown
Contributor

gsmiller commented Apr 1, 2026

@uschindler do you know what our policy generally is on @lucene.internal tagged classes w.r.t. backwards compatibility? My understanding was that we can be a little more flexibility within reason on these tagged classes. Either way, I still think we should bring this to 10.5 personally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants