Implement method to add all stream elements into a PriorityQueue.#15823
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces a specialized PriorityQueue insertion method that skips heap adjustment and uses it to optimize initialization of DisjunctionMaxBulkScorer, where all initial queue keys are equal (next == 0).
Changes:
- Add
PriorityQueue#addNoShift(T)to append elements withoutupHeap. - Use
addNoShiftwhen building the scorer queue inDisjunctionMaxBulkScorer’s constructor. - Minor local refactor in
DisjunctionMaxBulkScorer.collectto reuse the computeddelta.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| lucene/core/src/java/org/apache/lucene/util/PriorityQueue.java | Adds addNoShift as a new insertion API that bypasses heap restoration. |
| lucene/core/src/java/org/apache/lucene/search/DisjunctionMaxBulkScorer.java | Uses addNoShift during queue initialization; small delta reuse cleanup. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } | ||
|
|
||
| /** | ||
| * Adds an Object to a PriorityQueue without upHeap. Note: only use it when all elements have a |
| * Adds an Object to a PriorityQueue without upHeap. Note: only use it when all elements have a | ||
| * same value. |
| // TODO: Check element equalsTo lastElement, if we can get the comparator. | ||
|
|
||
| int index = size + 1; |
| * same value. | ||
| */ | ||
| public final void addNoShift(T element) { | ||
| // TODO: Check element equalsTo lastElement, if we can get the comparator. |
| * Adds an Object to a PriorityQueue without upHeap. Note: only use it when all elements have a | ||
| * same value. | ||
| */ | ||
| public final void addNoShift(T element) { |
|
I'm a bit uncomfortable with adding a "no shift" API to PQ. It has potential to be pretty trappy. Is there a better way to get at what you're trying to do? Maybe we could leverage |
|
Thanks for your reply @gsmiller !
No, it doesn't.
I will try. |
…rityQueue, and leverage it in DisjunctionMaxBulkScorer's constructor.
I used a variant: add an |
Oh, that works too. That's a creative way to avoid the comparison check on initialization :) |
gsmiller
left a comment
There was a problem hiding this comment.
In general, I'm supportive of adding this but I'm a bit on-the-fence with whether-or-not it's worth expanding the API of PriorityQueue for this case. I'm leaning towards adding it though. It does seem like a nice-to-have feature for cases where calling code needs to do some "element conversion" when populating a PQ instance.
Added a few small comments. Thanks!
| } | ||
|
|
||
| /** | ||
| * Similar to {@link #addAll(Collection)}, but supply an {@link ElementSupplier} to convert origin |
There was a problem hiding this comment.
Maybe we could be more explicit in the documentation that the value of this version of addAll is for cases where you need to convert between element types since it doesn't require the caller to do the conversion at the call site?
There was a problem hiding this comment.
I added more documentation, not sure it is explicit enough.
| * Similar to {@link #addAll(Collection)}, but supply an {@link ElementSupplier} to convert origin | ||
| * element to target element. This is useful when source elements does not fit target elements. | ||
| */ | ||
| public <S> void addAll(Collection<S> elements, ElementSupplier<T, S> elementSupplier) { |
There was a problem hiding this comment.
Can you add test coverage for this method?
| * Similar to {@link #addAll(Collection)}, but supply an {@link ElementSupplier} to convert origin | ||
| * element to target element. This is useful when source elements does not fit target elements. | ||
| */ | ||
| public <S> void addAll(Collection<S> elements, ElementSupplier<T, S> elementSupplier) { |
There was a problem hiding this comment.
Wouldn't it be more natural for addAll to accept Iterable, followed by an utility method Iterables.transforming(Iterable, Function<Y,X>). Many libraries do this kind of view thing - we can't reuse them but it should be trivial to implement. Just a thought.
There was a problem hiding this comment.
I use Guava's Iterables implemented it, it works too (transform in iterating). Here is the code(uncommitted):
public <S> void addAll(
Collection<S> elements, com.google.common.base.Function<S, T> elementConverter) {
if (this.size + elements.size() > this.maxSize) {
throw new ArrayIndexOutOfBoundsException(
"Cannot add "
+ elements.size()
+ " elements to a queue with remaining capacity: "
+ (maxSize - size));
}
Iterable<T> transform = Iterables.transform(elements, elementConverter);
// Heap with size S always takes first S elements of the array,
// and thus it's safe to fill array further - no actual non-sentinel value will be overwritten.
for (T element : transform) {
this.heap[size + 1] = element;
this.size++;
}
// The loop goes down to 1 as heap is 1-based not 0-based.
for (int i = (size >>> 1); i >= 1; i--) {
downHeap(i);
}
}
There was a problem hiding this comment.
Many libraries do this kind of view thing - we can't reuse them but it should be trivial to implement.
If we prefer this way(I think current committed way is ok too:), do you mean we should implement lucene's Iterables, Iterators, etc. Ranther than use these libraries directly?
There was a problem hiding this comment.
I suspect the idea here would be:
- Not use a 3rd party library directly (we don't want that dependency in core)
- Have an addAll method that simply takes a
java.lang.Iterableas a single input but create our own implementation ofIterablethat applies the transformation internally. Much like what Guava'sIterables#transformmethod does. This is probably something we'd put in core (maybe in o.a.l.util)? We could probably deprecate the current addAll method that takes a Collection in favor of this new API as well.
This approach makes a lot of sense to me, and I don't think it would be too much work to do. (Hopefully I'm understanding the suggestion correctly).
There was a problem hiding this comment.
Not use a 3rd party library directly (we don't want that dependency in core).
Thanks for explicit that.
create our own implementation of Iterable that applies the transformation internally.
OK, I will try this approach.
|
I implemented it with Please take a look when you get a chance @dweiss, @gsmiller . I like the design of
Note on
So, I test it with |
| * param elements is different from the type of this {@link PriorityQueue}, since it doesn't | ||
| * require the caller to do the conversion at the call site. | ||
| */ | ||
| public <S> void addAll(Collection<S> elements, Function<S, T> elementTransformer) { |
There was a problem hiding this comment.
So I think the proposed idea is a little different actually. The idea would be for this method signature to simply be:
public void addAll(Iterable<T> elements).
Then, from your calling code in DisjunctionMaxBulkScorer, you'd call like this:
this.scorers.addAll(Iterables.transform(scorers, BulkScorerAndNext::new));
The goal is that PriorityQueue doesn't need to know anything about the translation logic. That all gets encapsulated behind the Iterator that is setup by the calling code. WDYT?
There was a problem hiding this comment.
The goal is that PriorityQueue doesn't need to know anything about the translation logic. That all gets encapsulated behind the Iterator that is setup by the calling code. WDYT?
+1. There is a minor issue with public void addAll(Iterable<T> elements), the original addAll validate the size with Collection#size firstly. If we want keep this validation, maybe the caller need input the size of elements? like this:public void addAll(Iterable<T> elements, int size)?
There was a problem hiding this comment.
I wonder how important that validation is. It's an exceptional case (the caller really shouldn't be overflowing their PQ capacity). We'll hit the same AIOOB exception anyway in the for-each loop that's populating the array, so the outcome is really the same if you overflow the capacity, so I'm not sure that early check adds a ton of value. It's nice to do up-front, but I don't think I'd design the API around it. So I guess I don't think it's a problem to remove this up-front sizing check, but maybe you're considering something I'm not? Do you think it's critical? Thanks again for iterating!
There was a problem hiding this comment.
We'll hit the same AIOOB exception anyway in the for-each loop that's populating the array
Indeed. That is what PriorityQueue#add do (when caller iterate their elements to call ). So i removed the validation and implmented public void addAll(Iterable<T> elements).
|
Hi, What I generally do when I want a filtered Iterator: get a stream from the collection/iterable (for iterables there's a trick to get it, but I'd prefer a collection instead of Iterables). The use all stream filtering or mapping and finally wrap the thing into an iterator. |
|
So, -1 to add all this collection code that already has native support in jdk. |
|
Hi @uschindler , you prefer approach like this?
|
|
Yes, or instead of foreach call iterator() |
I am trying to understand this too :) |
|
Hi, |
|
I think there is a difference between add and addAll, with add the caller can use the element to compare to PQ’s top and maybe pop the top and reAdd the element after got an AIOOBE(maybe the normal usage is judge wether the queue is full firstly, and call |
|
That's not working with streams. If it works it is an implementation detail and could break with different types of streams. Then one could redo the add with a new stream using I figured out that Uwe |
|
Quote from stream Java docs:
|
I moved the change entity to 10.5.0, and removed the catch block. |
| // Heap with size S always takes first S elements of the array, | ||
| // and thus it's safe to fill array further - no actual non-sentinel value will be overwritten. | ||
| try { | ||
| elements.forEach( |
There was a problem hiding this comment.
Nitpick: Better to use forEachOrdered(). This makes practically no difference unless stream is parallel, but we should be correct here. Especially when we have the skip(long) mentioned in the javadcos.
|
This looks mostly good to me, thanks! I left an earlier comment regarding consistency differences between this new addAll method and the existing addAll(Collection) method that I'd like to resolve (let's either, (a) decide the inconsistency is OK, (b) change the current addAll(Collection) behavior, or (c) open a follow-up issue to address it). The inconsistent behavior relates to how we handle an "overflow" case where more elements are provided than PQ capacity:
I'd propose: Modify the current addAll(Collection) method to be consistent with the newly added addAll(Stream) method (this is the only option for consistency). (I think it's OK to do this in a follow-up issue if we want to separate the changes). I recognize this is fairly nitpicky, but I think it's helpful to maintain as much consistency as we can here. WDYT? |
| int left = i << 1; | ||
| int right = left + 1; | ||
| if (right < heapArray.length) { | ||
| assert (Integer) heapArray[i] <= (Integer) heapArray[right]; |
There was a problem hiding this comment.
nit: This test will silently not work if asserts are disabled in the jvm. Let's use the idiomatic juint assert methods instead. I'd have this assertHeap method return a boolean then use assertTrue from the calling method.
| () -> pq.addAll(elements.stream().map(String::hashCode))); | ||
| // Partly added. | ||
| assertEquals(10, pq.size()); | ||
| assertHeap(pq); |
There was a problem hiding this comment.
This approach works, but you could also pop elements off the heap in a loop and assert the ordered nature of what comes out. It's nice that you avoid mutating the heap here with your check, but since it's the last thing we're doing in a unit test, it might be simpler to just pop stuff off the heap and check rather than writing custom heap traversal logic? I don't have strong opinions I suppose. Take-it-or-leave it :)
There was a problem hiding this comment.
but you could also pop elements off the heap in a loop and assert the ordered nature of what comes out. It's nice that you avoid mutating the heap here with your check
Yes, pop and assert the order came to my mind firstly. And in order to avoid mutate the heap, I picked this apporach.
but since it's the last thing we're doing in a unit test, it might be simpler to just pop stuff off the heap and check rather than writing custom heap traversal logic?
But, you are right.
There was a problem hiding this comment.
Yeah, totally fine by me either way :) Thanks!
uschindler
left a comment
There was a problem hiding this comment.
Making the two methods consistent if great! Thanks.
|
Thanks for making the two addAll methods consistent (and for the other minor revisions)! I'll go ahead and merge this since I don't see any other outstanding feedback. Thanks again! |
|
Oh and @vsop-479, just to respond to your earlier comment ("I will move it to 10.5. To be honest, I am not sure where it should be."), I would generally release a change in the next minor version unless it has to wait for the next major version because of something like a backwards compatibility issue. There more here if you're interested: https://cwiki.apache.org/confluence/display/LUCENE/BackwardsCompatibility |
Thanks, it is helpful to me. And thanks everyone for the review! |
|
I think there is a bit of backwards compatibility issue because the already existing Collection based version now behaves a bit different on exception. But I don't know if this is really an issue (does anybody adds element with a collection to an already full PQ)? |
|
I had a similar issue with changing an Exception type for the caller checks #15877, but I decided then: It's not an issue, because nobody is allowed the method from externally so the thrown exception won't matter, because there is no way to recover from that issue. Here it is similar: If you hit an exception the difference is now that heap is sane afterwards, before it was not sane. So it is better afterwards. Anybody who tried to cleanup from it in the past would have made the heap sane, but doing that twice is not an issue. |
Oh, I missed this.
+1. |
|
@uschindler do you know what our policy generally is on |
In DisjunctionMaxBulkScorer's constructor all values to compare is 0, so we can skip the shift.