-
Notifications
You must be signed in to change notification settings - Fork 968
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid duplicate sorting in KeywordField#newSetQuery (alternative approach) #12141
Avoid duplicate sorting in KeywordField#newSetQuery (alternative approach) #12141
Conversation
@@ -104,6 +105,16 @@ public PrefixCodedTerms finish() { | |||
} | |||
} | |||
|
|||
public static Collector<BytesRef, Builder, PrefixCodedTerms> collector(String field) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry I placed this here, because I found no other common place.
this(field, Arrays.stream(terms)); | ||
} | ||
|
||
private SortedSetDocValuesSetQuery(String field, Stream<BytesRef> stream) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we really need 3 ctors or could we just have a single package private ctor taking stream. The XXXField classes could just call .stream method? I just don't know if we need so much "sugar" for classes that are entirely package private.
Unfortunately, the TermInSetQuery is a public class, so maybe there are different concerns there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thats fine. I just made this query with same ctors like TermInSetQuery. But yes as it is pkg private we can make it stupid simple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked the code, we have several callers now, some using the array, some the stream one. I'd like to keep the stream abstraction here.
If we change to use streams in general (when we have the possibility to improve IndexOrDocavaluesQuery by lazy initializing the child queries I am happy to refactor this.
If a user has points+dv, they may encounter the same issue. So we may want to consider looking at |
Before proceeding from here we should really do a benchmark here. Maybe Greg can test this with his code at Amzn. I don't know how to benchmark this without having a huge database and join queries. |
Right, leaving the PointInSetQuery unoptimized but optimizing the slower one (TermInSetQuery) illustrates the issues with "database features". Anywhere else in the lucene search engine, you'd use the faster, more appropriate datastructure (Points in this case) But with database features user has their |
PointInSetQuery has a lovely homemade "stream" in ctor already: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java#L66-L75 |
I am still not happy with this idea at all, because next to sorting we still build the PrefixCodedTerms instance 2 times on construction. I think the original idea of @gsmiller is good and in combination with It would be good to find a way to share the PrefixCodedTerms instance between both queries (also for memory usage). The problem is that both queries (TermInSet and SortedDocValuesSetQuery) live in different packages. If both would be in same package we could make a "pkg private" ctor vor both so they can share their PrefixCodedTerms without making it public. My idea would be to have TermInSetQuery in search package be deprecated and like for the other queries now create a factory only. Maybe just have a public subclass in the serach package. I will think a bit about it. |
i want to see any proposed solution work for Points, too. Will hold these database features to the fire! |
Thanks @uschindler for the alternate approach. It helped me understand your earlier suggestion to use streams, which I wasn't totally clear on (I thought you were originally suggesting to do away with prefix-encoding altogether and reference the streams directly inside the query implementations to iterate the terms, which was confusing). I'm not setup to re-run our internal benchmarks at the moment (where we see a large amount of time spent sorting terms), but I at least ran my simple "benchmark" test case that does some simple timing over the query initialization (see below). The results for this PR were as good as my initial proposal to share prefix-encoded terms. So, from a pure performance point-of-view, this appears to be just as efficient as what I'd come up with initially. Simple test case "benchmark":
|
+1 that it would be nice to avoid the duplicate prefix-encoding (and storing that duplicate data for two queries in memory). Ideally, it would be nice if Less drastic than essentially deprecating the public definition of |
Let me work on making a clone of TermInSetQuery in the store package, also pkg-private. We can make the public one still available in the search package but keep the weight hidden. |
more abstractions, copying queries, hope you take PointInSetQuery into account here. I dont want anyone to be surprised, if i see a goddamn mess that doesnt consistently fix the issue across point+dv, too, i'm gonna veto. sorry. |
and please, please, no Instead, if you add a new X, it can use as its code implementation, PrefixCodedTerms. that way the onus is backwards, on the join stuff to implement its own interface. |
Sorry I was not planning anything like that. Just move the query and leave the "public" backwards layer available. |
If you do not agree you can veto for sure, but arguing with that before you have seen anything is not acceptable. |
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
We have another solution for this already. |
This is an alternative approach for #12135
Like in @gsmiller's code, the removal of "clone()" is ok here, as the query only consumes the terms array, but does not modify it or saves it anywhere.