Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid duplicate sorting in KeywordField#newSetQuery (alternative approach) #12141

Closed

Conversation

uschindler
Copy link
Contributor

This is an alternative approach for #12135

Like in @gsmiller's code, the removal of "clone()" is ok here, as the query only consumes the terms array, but does not modify it or saves it anywhere.

@uschindler uschindler self-assigned this Feb 10, 2023
@@ -104,6 +105,16 @@ public PrefixCodedTerms finish() {
}
}

public static Collector<BytesRef, Builder, PrefixCodedTerms> collector(String field) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry I placed this here, because I found no other common place.

this(field, Arrays.stream(terms));
}

private SortedSetDocValuesSetQuery(String field, Stream<BytesRef> stream) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need 3 ctors or could we just have a single package private ctor taking stream. The XXXField classes could just call .stream method? I just don't know if we need so much "sugar" for classes that are entirely package private.

Unfortunately, the TermInSetQuery is a public class, so maybe there are different concerns there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thats fine. I just made this query with same ctors like TermInSetQuery. But yes as it is pkg private we can make it stupid simple.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the code, we have several callers now, some using the array, some the stream one. I'd like to keep the stream abstraction here.

If we change to use streams in general (when we have the possibility to improve IndexOrDocavaluesQuery by lazy initializing the child queries I am happy to refactor this.

@rmuir
Copy link
Member

rmuir commented Feb 10, 2023

If a user has points+dv, they may encounter the same issue. So we may want to consider looking at PointInSetQuery too. Maybe it should have a similar streams fix that TermInSetQuery has?

@uschindler
Copy link
Contributor Author

Before proceeding from here we should really do a benchmark here. Maybe Greg can test this with his code at Amzn. I don't know how to benchmark this without having a huge database and join queries.

@rmuir
Copy link
Member

rmuir commented Feb 10, 2023

Right, leaving the PointInSetQuery unoptimized but optimizing the slower one (TermInSetQuery) illustrates the issues with "database features". Anywhere else in the lucene search engine, you'd use the faster, more appropriate datastructure (Points in this case)

But with database features user has their varchar ID in some database somewhere they are using for joins, which will be slower, and no intention of fixing that :) Hopefully it helps to explain my frustration with such features.

@rmuir
Copy link
Member

rmuir commented Feb 10, 2023

PointInSetQuery has a lovely homemade "stream" in ctor already: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java#L66-L75

@uschindler
Copy link
Contributor Author

I am still not happy with this idea at all, because next to sorting we still build the PrefixCodedTerms instance 2 times on construction.

I think the original idea of @gsmiller is good and in combination with

It would be good to find a way to share the PrefixCodedTerms instance between both queries (also for memory usage). The problem is that both queries (TermInSet and SortedDocValuesSetQuery) live in different packages. If both would be in same package we could make a "pkg private" ctor vor both so they can share their PrefixCodedTerms without making it public.

My idea would be to have TermInSetQuery in search package be deprecated and like for the other queries now create a factory only. Maybe just have a public subclass in the serach package. I will think a bit about it.

@rmuir
Copy link
Member

rmuir commented Feb 11, 2023

i want to see any proposed solution work for Points, too.

Will hold these database features to the fire!

@gsmiller
Copy link
Contributor

Thanks @uschindler for the alternate approach. It helped me understand your earlier suggestion to use streams, which I wasn't totally clear on (I thought you were originally suggesting to do away with prefix-encoding altogether and reference the streams directly inside the query implementations to iterate the terms, which was confusing).

I'm not setup to re-run our internal benchmarks at the moment (where we see a large amount of time spent sorting terms), but I at least ran my simple "benchmark" test case that does some simple timing over the query initialization (see below). The results for this PR were as good as my initial proposal to share prefix-encoded terms. So, from a pure performance point-of-view, this appears to be just as efficient as what I'd come up with initially.

Simple test case "benchmark":

  public void testSortPerformance() {
    int len = 50000;
    BytesRef[] terms = new BytesRef[len];
    for (int i = 0; i < len; i++) {
      String s = TestUtil.randomSimpleString(random(), 10, 20);
      terms[i] = new BytesRef(s);
    }

    int iters = 300;
    for (int i = 0; i < iters; i++) {
      KeywordField.newSetQuery("foo", terms);
    }

    long minTime = Long.MAX_VALUE;
    for (int i = 0; i < iters; i++) {
      long t0 = System.nanoTime();
      KeywordField.newSetQuery("foo", terms);
      minTime = Math.min(minTime, System.nanoTime() - t0);
    }

    System.err.println("Time: " + minTime / 1_000_000);
  }

@gsmiller
Copy link
Contributor

+1 that it would be nice to avoid the duplicate prefix-encoding (and storing that duplicate data for two queries in memory).

Ideally, it would be nice if TermInSetQuery where pkg-private, just acting as an implementation detail behind factory methods like KeywordField#newSetQuery. It might be a bit too extreme, but I wonder if we should consider making this change and asking our users to get their "term in set" query functionality behind exposed factory methods.

Less drastic than essentially deprecating the public definition of TermInSetQuery could be to create a new class definition for a "set of terms" that wraps PrefixCodedTerms and is used in the ctor. This could provide an abstraction on top of the PrefixCodedTerms implementation detail, allowing us to change that in the future without changing our interface.

@uschindler
Copy link
Contributor Author

Let me work on making a clone of TermInSetQuery in the store package, also pkg-private. We can make the public one still available in the search package but keep the weight hidden.

@rmuir
Copy link
Member

rmuir commented Feb 12, 2023

more abstractions, copying queries, hope you take PointInSetQuery into account here.

I dont want anyone to be surprised, if i see a goddamn mess that doesnt consistently fix the issue across point+dv, too, i'm gonna veto. sorry.

@rmuir
Copy link
Member

rmuir commented Feb 12, 2023

and please, please, no PrefixCodedTerms extends X where X is some new abstraction used. that's wrong and it puts the onus on PrefixCodedTerms to now deliver on some interface for these crappy join queries.

Instead, if you add a new X, it can use as its code implementation, PrefixCodedTerms. that way the onus is backwards, on the join stuff to implement its own interface.

@uschindler
Copy link
Contributor Author

Sorry I was not planning anything like that. Just move the query and leave the "public" backwards layer available.

@uschindler
Copy link
Contributor Author

If you do not agree you can veto for sure, but arguing with that before you have seen anything is not acceptable.

Copy link

github-actions bot commented Jan 8, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jan 8, 2024
@uschindler
Copy link
Contributor Author

We have another solution for this already.

@uschindler uschindler closed this Jan 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants