Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#12276: rename DaciukMihovAutomatonBuilder to StringsToAutomaton #12310

Merged
merged 2 commits into from
May 25, 2023

Conversation

mikemccand
Copy link
Member

This is just a rote rename of this helpful class. I plan 10.0 only since it is technically a public API break. I added a line to MIGRATE.txt too.

Closes #12276

@mikemccand mikemccand added this to the 10.0.0 milestone May 18, 2023
Copy link
Contributor

@gsmiller gsmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Apologies for the merge conflict I created for you (but thanks for the review on that PR!).

@rmuir
Copy link
Member

rmuir commented May 18, 2023

Yeah we should explore a binary version. Even if it doesn't speedup TermInSetQuery.

TermInSetQuery has/had a super trappy visitor method that builds an automaton from it's sorted terms in a very slow way for some visitor method. Pretty sure I added a comment along the lines of "we should not do this". We could at least fix that trap. Check my assumptions and memory as I am on a phone

@mikemccand
Copy link
Member Author

LGTM. Apologies for the merge conflict I created for you (but thanks for the review on that PR!).

No worries! I'll resolve and merge soon! Thanks for the review @gsmiller.

@mikemccand
Copy link
Member Author

Yeah we should explore a binary version. Even if it doesn't speedup TermInSetQuery.

+1

Pretty sure I added a comment along the lines of "we should not do this"

Oh yeah! I think you are referring to this super scary code!

  @Override
  public void visit(QueryVisitor visitor) {
    if (visitor.acceptField(field) == false) {
      return;
    }
    if (termData.size() == 1) {
      visitor.consumeTerms(this, new Term(field, termData.iterator().next()));
    }
    if (termData.size() > 1) {
      visitor.consumeTermsMatching(this, field, this::asByteRunAutomaton);
    }
  }

  // TODO: this is extremely slow. we should not be doing this.
  private ByteRunAutomaton asByteRunAutomaton() {
    TermIterator iterator = termData.iterator();
    List<Automaton> automata = new ArrayList<>();
    for (BytesRef term = iterator.next(); term != null; term = iterator.next()) {
      automata.add(Automata.makeBinary(term));
    }
    Automaton automaton =
        Operations.determinize(
            Operations.union(automata), Operations.DEFAULT_DETERMINIZE_WORK_LIMIT);
    return new CompiledAutomaton(automaton).runAutomaton;
  }

We should indeed make a BytesRefsToAutomaton! I'll open a spinoff.

@mikemccand
Copy link
Member Author

Actually, the terms must be sorted in Unicode code point order, and, we do have the builder for BytesRef already: public static Automaton build(Collection<BytesRef> input) {. So I think we should just fix TermInSetQuery's scary code to use this binary builder?

@gsmiller
Copy link
Contributor

gsmiller commented May 20, 2023

@mikemccand where's the build method you're referencing? I took a pass at creating a "direct to binary" version of the Daciuk-Mihov algorithm in #12312. We could fold that into this work if we want? Or treat it as a separate follow-up task. But maybe there's already some direct binary building logic I wasn't aware of and my changes aren't needed?

Edit: Ah, never mind @mikemccand. Found the method you're referencing.

@gsmiller
Copy link
Contributor

@mikemccand the build method you reference above in DaciukMihovAutomatonBuilder build an automaton with code points as transition labels, but I think we need a "compiled" binary automaton for this visitor? We can have CompiledAutomaton do this conversion, but it's pretty wasteful to build one way then convert it when we could build directly to binary? I opened a separate PR that can do a direct binary build over in #12320. I might be completely overlooking a simpler way to do this with code we already have though, so I'm happy to close that PR out in favor of a simpler approach if one exists. Thanks!

@gsmiller
Copy link
Contributor

Separately, in terms of renaming this class, I'm still in favor of it but I also wonder if we should just consider making it pkg-private and proxying the build functionality through Automata#makeStringUnion? That's a pretty clean entry-point for users, and I'm not sure there's any value in having this class be public? I created a spin-off issue #12321

@mikemccand mikemccand merged commit 7da7c43 into apache:main May 25, 2023
5 checks passed
@mikemccand mikemccand deleted the 12276 branch May 25, 2023 14:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Maybe rename DaciukMihovAutomatonBuilder?
3 participants