Avoid duplicate sorting in KeywordField#newSetQuery (alternative approach) #12141

uschindler · 2023-02-10T13:44:49Z

This is an alternative approach for #12135

Like in @gsmiller's code, the removal of "clone()" is ok here, as the query only consumes the terms array, but does not modify it or saves it anywhere.

uschindler · 2023-02-10T13:45:42Z

lucene/core/src/java/org/apache/lucene/index/PrefixCodedTerms.java

@@ -104,6 +105,16 @@ public PrefixCodedTerms finish() {
    }
  }

+  public static Collector<BytesRef, Builder, PrefixCodedTerms> collector(String field) {


sorry I placed this here, because I found no other common place.

….Collectors)

rmuir · 2023-02-10T13:57:26Z

lucene/core/src/java/org/apache/lucene/document/SortedSetDocValuesSetQuery.java

+    this(field, Arrays.stream(terms));
+  }
+
+  private SortedSetDocValuesSetQuery(String field, Stream<BytesRef> stream) {


do we really need 3 ctors or could we just have a single package private ctor taking stream. The XXXField classes could just call .stream method? I just don't know if we need so much "sugar" for classes that are entirely package private.

Unfortunately, the TermInSetQuery is a public class, so maybe there are different concerns there.

thats fine. I just made this query with same ctors like TermInSetQuery. But yes as it is pkg private we can make it stupid simple.

I checked the code, we have several callers now, some using the array, some the stream one. I'd like to keep the stream abstraction here.

If we change to use streams in general (when we have the possibility to improve IndexOrDocavaluesQuery by lazy initializing the child queries I am happy to refactor this.

rmuir · 2023-02-10T14:01:56Z

If a user has points+dv, they may encounter the same issue. So we may want to consider looking at PointInSetQuery too. Maybe it should have a similar streams fix that TermInSetQuery has?

uschindler · 2023-02-10T14:07:21Z

Before proceeding from here we should really do a benchmark here. Maybe Greg can test this with his code at Amzn. I don't know how to benchmark this without having a huge database and join queries.

rmuir · 2023-02-10T14:22:24Z

Right, leaving the PointInSetQuery unoptimized but optimizing the slower one (TermInSetQuery) illustrates the issues with "database features". Anywhere else in the lucene search engine, you'd use the faster, more appropriate datastructure (Points in this case)

But with database features user has their varchar ID in some database somewhere they are using for joins, which will be slower, and no intention of fixing that :) Hopefully it helps to explain my frustration with such features.

rmuir · 2023-02-10T14:48:04Z

PointInSetQuery has a lovely homemade "stream" in ctor already: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java#L66-L75

uschindler · 2023-02-11T08:58:18Z

I am still not happy with this idea at all, because next to sorting we still build the PrefixCodedTerms instance 2 times on construction.

I think the original idea of @gsmiller is good and in combination with

It would be good to find a way to share the PrefixCodedTerms instance between both queries (also for memory usage). The problem is that both queries (TermInSet and SortedDocValuesSetQuery) live in different packages. If both would be in same package we could make a "pkg private" ctor vor both so they can share their PrefixCodedTerms without making it public.

My idea would be to have TermInSetQuery in search package be deprecated and like for the other queries now create a factory only. Maybe just have a public subclass in the serach package. I will think a bit about it.

rmuir · 2023-02-11T11:25:40Z

i want to see any proposed solution work for Points, too.

Will hold these database features to the fire!

gsmiller · 2023-02-11T14:44:31Z

Thanks @uschindler for the alternate approach. It helped me understand your earlier suggestion to use streams, which I wasn't totally clear on (I thought you were originally suggesting to do away with prefix-encoding altogether and reference the streams directly inside the query implementations to iterate the terms, which was confusing).

I'm not setup to re-run our internal benchmarks at the moment (where we see a large amount of time spent sorting terms), but I at least ran my simple "benchmark" test case that does some simple timing over the query initialization (see below). The results for this PR were as good as my initial proposal to share prefix-encoded terms. So, from a pure performance point-of-view, this appears to be just as efficient as what I'd come up with initially.

Simple test case "benchmark":

  public void testSortPerformance() {
    int len = 50000;
    BytesRef[] terms = new BytesRef[len];
    for (int i = 0; i < len; i++) {
      String s = TestUtil.randomSimpleString(random(), 10, 20);
      terms[i] = new BytesRef(s);
    }

    int iters = 300;
    for (int i = 0; i < iters; i++) {
      KeywordField.newSetQuery("foo", terms);
    }

    long minTime = Long.MAX_VALUE;
    for (int i = 0; i < iters; i++) {
      long t0 = System.nanoTime();
      KeywordField.newSetQuery("foo", terms);
      minTime = Math.min(minTime, System.nanoTime() - t0);
    }

    System.err.println("Time: " + minTime / 1_000_000);
  }

gsmiller · 2023-02-11T14:52:14Z

+1 that it would be nice to avoid the duplicate prefix-encoding (and storing that duplicate data for two queries in memory).

Ideally, it would be nice if TermInSetQuery where pkg-private, just acting as an implementation detail behind factory methods like KeywordField#newSetQuery. It might be a bit too extreme, but I wonder if we should consider making this change and asking our users to get their "term in set" query functionality behind exposed factory methods.

Less drastic than essentially deprecating the public definition of TermInSetQuery could be to create a new class definition for a "set of terms" that wraps PrefixCodedTerms and is used in the ctor. This could provide an abstraction on top of the PrefixCodedTerms implementation detail, allowing us to change that in the future without changing our interface.

uschindler · 2023-02-12T11:38:21Z

Let me work on making a clone of TermInSetQuery in the store package, also pkg-private. We can make the public one still available in the search package but keep the weight hidden.

rmuir · 2023-02-12T12:46:14Z

more abstractions, copying queries, hope you take PointInSetQuery into account here.

I dont want anyone to be surprised, if i see a goddamn mess that doesnt consistently fix the issue across point+dv, too, i'm gonna veto. sorry.

rmuir · 2023-02-12T12:51:39Z

and please, please, no PrefixCodedTerms extends X where X is some new abstraction used. that's wrong and it puts the onus on PrefixCodedTerms to now deliver on some interface for these crappy join queries.

Instead, if you add a new X, it can use as its code implementation, PrefixCodedTerms. that way the onus is backwards, on the join stuff to implement its own interface.

uschindler · 2023-02-12T13:32:31Z

Sorry I was not planning anything like that. Just move the query and leave the "public" backwards layer available.

uschindler · 2023-02-12T13:33:04Z

If you do not agree you can veto for sure, but arguing with that before you have seen anything is not acceptable.

github-actions · 2024-01-08T12:24:37Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

uschindler · 2024-01-08T12:57:07Z

We have another solution for this already.

This is an alternative approach for apache#12135

58ba54b

uschindler requested review from rmuir and gsmiller February 10, 2023 13:44

uschindler self-assigned this Feb 10, 2023

uschindler commented Feb 10, 2023

View reviewed changes

uschindler mentioned this pull request Feb 10, 2023

Avoid duplicate sorting in KeywordField#newSetQuery #12135

Closed

uschindler added 2 commits February 10, 2023 14:52

do not expose internal implementation class (like in java.util.stream…

9a1f3c4

….Collectors)

Add javadocs

94a3072

rmuir reviewed Feb 10, 2023

View reviewed changes

gsmiller mentioned this pull request Feb 22, 2023

Clone the BytesRef[] values in KeywordField#newSetQuery #12158

Merged

github-actions bot added the Stale label Jan 8, 2024

uschindler closed this Jan 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid duplicate sorting in KeywordField#newSetQuery (alternative approach) #12141

Avoid duplicate sorting in KeywordField#newSetQuery (alternative approach) #12141

uschindler commented Feb 10, 2023

uschindler Feb 10, 2023

rmuir Feb 10, 2023

uschindler Feb 10, 2023

uschindler Feb 10, 2023

rmuir commented Feb 10, 2023

uschindler commented Feb 10, 2023

rmuir commented Feb 10, 2023 •

edited

rmuir commented Feb 10, 2023

uschindler commented Feb 11, 2023

rmuir commented Feb 11, 2023

gsmiller commented Feb 11, 2023

gsmiller commented Feb 11, 2023

uschindler commented Feb 12, 2023

rmuir commented Feb 12, 2023

rmuir commented Feb 12, 2023

uschindler commented Feb 12, 2023

uschindler commented Feb 12, 2023

github-actions bot commented Jan 8, 2024

uschindler commented Jan 8, 2024

Avoid duplicate sorting in KeywordField#newSetQuery (alternative approach) #12141

Avoid duplicate sorting in KeywordField#newSetQuery (alternative approach) #12141

Conversation

uschindler commented Feb 10, 2023

uschindler Feb 10, 2023

Choose a reason for hiding this comment

rmuir Feb 10, 2023

Choose a reason for hiding this comment

uschindler Feb 10, 2023

Choose a reason for hiding this comment

uschindler Feb 10, 2023

Choose a reason for hiding this comment

rmuir commented Feb 10, 2023

uschindler commented Feb 10, 2023

rmuir commented Feb 10, 2023 • edited

rmuir commented Feb 10, 2023

uschindler commented Feb 11, 2023

rmuir commented Feb 11, 2023

gsmiller commented Feb 11, 2023

gsmiller commented Feb 11, 2023

uschindler commented Feb 12, 2023

rmuir commented Feb 12, 2023

rmuir commented Feb 12, 2023

uschindler commented Feb 12, 2023

uschindler commented Feb 12, 2023

github-actions bot commented Jan 8, 2024

uschindler commented Jan 8, 2024

rmuir commented Feb 10, 2023 •

edited