Terminate automaton after matched the whole prefix for PrefixQuery. #13072

vsop-479 · 2024-02-05T07:55:18Z

For PrefixQuery, we can terminate the automaton on the current term once the whole prefix has been matched, and accept the term directly.
Furthermore, if there is a sub-block, we can match all of its sub-terms.
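
The idea in the description can be pictured with a toy model. The sketch below is not Lucene's block tree or IntersectTermsEnum: `Block`, `collect`, and the `checks` counter are hypothetical names, and terms are plain strings rather than bytes. It only shows that once a block's shared prefix covers the whole query prefix, the terms in that block and in all nested sub-blocks can be accepted without further per-term checks.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: not Lucene's block tree or IntersectTermsEnum. */
final class PrefixBlockSketch {

  /** Hypothetical block: a shared prefix, term suffixes stored here, and nested sub-blocks. */
  static final class Block {
    final String prefix;
    final List<String> suffixes = new ArrayList<>();
    final List<Block> subBlocks = new ArrayList<>();

    Block(String prefix) {
      this.prefix = prefix;
    }
  }

  /** Counts how many terms needed an explicit prefix check. */
  static int checks = 0;

  static void collect(Block block, String queryPrefix, boolean matchAll, List<String> out) {
    // Once the block's shared prefix covers the whole query prefix,
    // every term below it matches and no further checks are needed.
    boolean all = matchAll || block.prefix.startsWith(queryPrefix);
    for (String suffix : block.suffixes) {
      String term = block.prefix + suffix;
      if (all) {
        out.add(term);
      } else {
        checks++;
        if (term.startsWith(queryPrefix)) {
          out.add(term);
        }
      }
    }
    for (Block sub : block.subBlocks) {
      // Descend only if the sub-block can still contain matching terms.
      if (all || sub.prefix.startsWith(queryPrefix) || queryPrefix.startsWith(sub.prefix)) {
        collect(sub, queryPrefix, all, out);
      }
    }
  }
}
```

In this model, only the terms stored outside the matching block are checked individually; everything under the block whose prefix already contains the query prefix is added directly.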

Modify comment. Modify comment.

github-actions · 2024-02-20T00:16:44Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

vsop-479 · 2024-02-21T09:57:09Z

@jpountz Please take a look when you get a chance.

jpountz · 2024-02-23T16:15:16Z

IntersectTermsEnum is a bit scary to me; maybe @mikemccand can take a look, since I expect him to be more familiar with it.

mikemccand · 2024-02-27T15:58:22Z

I will have a look -- thanks for the ping @jpountz.

mikemccand

This is a clever optimization! You recognize that this Automaton will match all possible suffixes in this state, and so can more efficiently enumerate all terms from the block tree under that state.

I have concerns about storing this in Automaton itself, and the naming was confusing to me :) Could we instead store it in RunAutomaton? Or, possibly, do it on the fly in IntersectEnum by detecting a state that is both accept and has a .* transition back onto itself?

Have you tried to measure any performance change with this? E.g. you could run a luceneutil benchy with just PrefixQuery, or, Regexp/WildcardQuery that also have this property (match-all states in their automata).
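
As a rough illustration of the "accept state with a .* transition back onto itself" check suggested above, here is a minimal self-contained sketch. It does not use Lucene's Automaton or RunAutomaton classes; `Edge`, `isMatchAllSuffix`, and the `alphabetSize` parameter (256 for byte labels) are assumptions made for the example.

```java
import java.util.List;

/** Hypothetical stand-in for an automaton state check; not Lucene's API. */
final class MatchAllSuffixDetector {

  /** Hypothetical transition: labels in [min, max] lead to state dest. */
  static final class Edge {
    final int min;
    final int max;
    final int dest;

    Edge(int min, int max, int dest) {
      this.min = min;
      this.max = max;
      this.dest = dest;
    }
  }

  /**
   * Returns true if the state is accepting and has a single self-loop
   * covering every label in [0, alphabetSize - 1].
   */
  static boolean isMatchAllSuffix(int state, boolean accept, List<Edge> edges, int alphabetSize) {
    if (!accept) {
      return false;
    }
    for (Edge e : edges) {
      if (e.dest == state && e.min == 0 && e.max == alphabetSize - 1) {
        return true;
      }
    }
    return false;
  }
}
```

A loop that covers only part of the label range (for example [0, 127] with byte labels) is deliberately not treated as match-all here, since some suffixes would leave the state.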

lucene/core/src/java/org/apache/lucene/util/automaton/Automaton.java

lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java

vsop-479 · 2024-02-28T02:32:27Z

@mikemccand Thanks for your suggestion, I will try to implement it.

> Have you tried to measure any performance change with this? E.g. you could run a luceneutil benchy with just PrefixQuery, or, Regexp/WildcardQuery that also have this property (match-all states in their automata).

I am working on this.

@jpountz Thanks for your reply.

rmuir · 2024-02-28T02:42:41Z

I think the optimization may be similar to the one done in AutomatonTermsEnum?
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/AutomatonTermsEnum.java#L149-L153

When "ping-ponging" the term dictionary against the automaton, it tracks a visited bitset and looks for such loops in the automaton. When it finds one, it temporarily acts like a TermRangeQuery.

I think it is a bit more general than just PrefixQuery, and it also helps with regex and wildcard queries.

vsop-479 · 2024-02-28T05:53:39Z

> I think the optimization may be similar to the one done in AutomatonTermsEnum?

Thanks for the reminder. I will dig into AutomatonTermsEnum's optimization.

vsop-479 · 2024-03-01T09:46:16Z

There is still a problem with the match-all-suffix state in IntersectTermsEnumFrame. I am trying to figure it out.

On the other hand, I will dig into AutomatonTermsEnum's optimization.

vsop-479 · 2024-03-02T15:56:02Z

@mikemccand
I renamed the field used to indicate whether an accept state can match all suffixes, and now detect it in RunAutomaton.
Please take a look when you get a chance.

vsop-479 · 2024-03-06T06:45:32Z

> I think the optimization may be similar to the one done in AutomatonTermsEnum?

It seems AutomatonTermsEnum's optimization changes the term iteration mode: instead of seeking bytes from the DFA and then seeking the termsEnum (seekCeil), it simply reads the termsEnum sequentially after finding a loop (setLinear).
In both modes, AutomatonTermsEnum still needs to check whether each term is accepted by running the automaton.

If I am not mistaken, this optimization is different from AutomatonTermsEnum's. It directly matches all remaining suffixes and sub-blocks after detecting an accept state with a .* transition back onto itself.

Or maybe you mean we can improve AutomatonTermsEnum's optimization to achieve this optimization's effect? @rmuir

mikemccand

Thanks @vsop-479 -- this looks closer. I like that the opto is now contained under RunAutomaton, but I'm confused/concerned about sometimes checking for 255 max label and other times 127 depending on which query.

lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/IntersectTermsEnum.java

mikemccand · 2024-03-06T11:02:46Z

lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java

    transitions = new int[size * points.length];
    Arrays.fill(transitions, -1);
    Transition transition = new Transition();
    for (int n = 0; n < size; n++) {
      if (a.isAccept(n)) {
        accept.set(n);
+        if (canMatchAllSuffix(n)) {


Maybe rename to isMatchAllSuffix?

> Maybe rename to isMatchAllSuffix?

I don't like that name either. However, there is already a method named isMatchAllSuffix, which indicates whether a state can accept all remaining suffixes (similar to isAccept).
Maybe we can rename it to something else?

public final boolean isMatchAllSuffix(int state) { return matchAllSuffix.get(state); }

> Maybe rename to isMatchAllSuffix?

Since there already is a method named isMatchAllSuffix, I renamed it to detectMatchAllSuffix.

lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java

mikemccand · 2024-03-06T11:05:43Z

lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java

+    assert automaton.isAccept(state);
+    int numTransitions = automaton.getNumTransitions(state);
+    // Apply to PrefixQuery, TermRangeQuery.
+    if (numTransitions == 1) {


Can we remove this special case? Just let the for loop below handle the 1-transition case too?

Edit: hmm, I see, it is subtly different: this is checking for max label 255 but the loop below is checking 127, hmmm. This is a bit messy -- this low level of code shouldn't be specializing to different automata that come from the high level queries. Can we use alphabetSize-1 as the transition.max check instead? But, separately, we need to figure out why Regexp/WildcardQuery are compiling down to 127 as their max on .* suffix transitions? That is not even correct for matching UTF-8 encoded terms.

Perhaps we could also add test cases for custom Automata passed to AutomatonQuery matching sometimes binary (non-UTF8) terms?

> we need to figure out why Regexp/WildcardQuery are compiling down to 127 as their max on .* suffix transitions?

The automata for these queries (including AutomatonQuery) look like this: 3 -> 3: [0, 127]; 3 -> 4: [194, 194]; 4 -> 3: [128, 191], assuming 3 is an accept state.
It is more complex to detect whether a state can accept all remaining suffixes for these queries, because the accept state's outgoing labels are split across many transitions, such as [0, 127], [194, 223], [224, 239], [240, 243], [244], etc.
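
To make the three example transitions concrete, the following sketch hard-codes just those edges (3 -> 3 on [0, 127], 3 -> 4 on [194, 194], 4 -> 3 on [128, 191]) and runs byte sequences through them. The class name, the `DEAD` state, and the `accepts` helper are hypothetical; a real automaton for these queries would have more transitions than the three quoted here.

```java
/** Runs only the three example transitions quoted above; not a full UTF-8 automaton. */
final class Utf8LoopExample {
  static final int ACCEPT = 3; // state 3, assumed to be the accept state
  static final int CONT = 4;   // state 4, waiting for a continuation byte
  static final int DEAD = -1;  // hypothetical sink for labels with no transition

  static int step(int state, int label) {
    if (state == ACCEPT) {
      if (label <= 127) {
        return ACCEPT; // 3 -> 3: [0, 127]
      }
      if (label == 194) {
        return CONT; // 3 -> 4: [194, 194]
      }
      return DEAD;
    }
    if (state == CONT && label >= 128 && label <= 191) {
      return ACCEPT; // 4 -> 3: [128, 191]
    }
    return DEAD;
  }

  /** Returns true if the suffix, read from the accept state, ends in the accept state. */
  static boolean accepts(byte[] suffix) {
    int state = ACCEPT;
    for (byte b : suffix) {
      state = step(state, b & 0xFF); // treat bytes as unsigned labels 0..255
      if (state == DEAD) {
        return false;
      }
    }
    return state == ACCEPT;
  }
}
```

With these edges, an ASCII suffix or one containing U+00A9 (bytes 194, 169) returns to the accept state, while a lone byte 255 has no transition. This is why a self-loop with max label 127 is not, by itself, the same thing as accepting every byte sequence.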

I am still working on this, any suggestion is welcome @mikemccand.

> Perhaps we could also add test cases for custom Automata passed to AutomatonQuery matching sometimes binary (non-UTF8) terms?

Added.

> The automata for these queries (including AutomatonQuery) look like this: 3 -> 3: [0, 127]; 3 -> 4: [194, 194]; 4 -> 3: [128, 191], assuming 3 is an accept state.

@mikemccand
I can track an accept state's other transitions to check whether they eventually end on an accept state (typically via a [128, 191] transition). But I am not sure whether that is enough to conclude that a state can match all suffixes, or even whether it is necessary, since it may be equivalent to just checking that the [0, 127] transition's destination is an accept state.

> we need to figure out why Regexp/WildcardQuery are compiling down to 127 as their max on .* suffix transitions?

I think the [0, 1114111] transition is split into UTF-8 edges in UTF32ToUTF8.convertOneEdge.
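
The byte ranges mentioned in this thread line up with the lead bytes that UTF-8 produces at the boundaries between encoded lengths. The sketch below uses only the JDK's own UTF-8 encoder to print them; it does not call UTF32ToUTF8, and `Utf8LeadBytes` and its helpers are names invented for the example.

```java
import java.nio.charset.StandardCharsets;

/** Prints the UTF-8 lead byte and encoded length at code point boundaries (JDK only). */
final class Utf8LeadBytes {

  static byte[] encode(int codePoint) {
    return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
  }

  /** First byte of the UTF-8 encoding, as an unsigned value 0..255. */
  static int leadByte(int codePoint) {
    return encode(codePoint)[0] & 0xFF;
  }

  static int encodedLength(int codePoint) {
    return encode(codePoint).length;
  }

  public static void main(String[] args) {
    // Boundaries between 1-, 2-, 3- and 4-byte encodings, up to 1114111 (0x10FFFF).
    int[] boundaries = {0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0xFFFFF, 0x100000, 0x10FFFF};
    for (int cp : boundaries) {
      System.out.printf("U+%04X -> lead byte %d, %d byte(s)%n", cp, leadByte(cp), encodedLength(cp));
    }
  }
}
```

Lead bytes run from 0 to 127 for one-byte encodings, 194 to 223 for two bytes, 224 to 239 for three bytes, and 240 to 244 for four bytes, which is consistent with the split transitions listed earlier in the thread.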

@mikemccand
I think I can detect a match-all-suffix state for Regexp/WildcardQuery in UTF32ToUTF8.convert, right after convertOneEdge, like this:

    // Writes new transitions into pendingTransitions:
    convertOneEdge(utf8State, destUTF8, scratch.min, scratch.max);
    // Set match all suffix state.
    if (scratch.min == 0
        && scratch.max == 1114111
        && utf8.isAccept(utf8State)
        && utf8.isAccept(destUTF8)) {
      utf8.setMatchAllSuffix(utf8State, true);
    }

Which is simple and reliable, but will violate the rule below:

> Everything else about Automaton today is fundamental (states, transitions, isAccept) and necessary, but this new member is more a best effort optimization?

Another plan: check whether a candidate state can eventually end on an accept state via the [128, 191] transition, which is added in UTF32ToUTF8.all:

utf8.addTransition(lastN, end, 128, 191); // type = all*

lucene/core/src/java/org/apache/lucene/util/automaton/RunAutomaton.java

vsop-479 · 2024-03-07T10:01:46Z

> Have you tried to measure any performance change with this? E.g. you could run a luceneutil benchy with just PrefixQuery, or, Regexp/WildcardQuery that also have this property (match-all states in their automata).

I measured the current implementation with wikimedium1m:

Task     QPS baseline  StdDev  QPS my_modified_version  StdDev               Pct diff  p-value
Prefix3        879.20  (9.1%)                   995.37  (4.2%)  13.2% (   0% -   29%)    0.062
Prefix3        924.98  (9.9%)                  1042.17  (7.9%)  12.7% (  -4% -   33%)    0.083

Task     QPS baseline  StdDev  QPS my_modified_version  StdDev               Pct diff  p-value
Prefix3       1480.94  (8.8%)                  1559.89  (6.4%)   5.3% (  -9% -   22%)    0.195
Prefix3       1242.80  (6.9%)                  1307.30  (5.3%)   5.2% (  -6% -   18%)    0.299
Prefix3        177.54  (1.3%)                   202.74  (6.3%)  14.2% (   6% -   22%)    0.000
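
The Pct diff values in these rows appear consistent with a plain relative change of the modified QPS against the baseline QPS. The helper below is only that arithmetic, with a made-up class name; it is not taken from luceneutil, and it does not reproduce the range or the p-value.

```java
/** Relative QPS change in percent; plain arithmetic, not luceneutil code. */
final class PctDiff {

  static double pctDiff(double baselineQps, double modifiedQps) {
    return 100.0 * (modifiedQps - baselineQps) / baselineQps;
  }

  public static void main(String[] args) {
    // First and last rows from the results above.
    System.out.printf("%.1f%%%n", pctDiff(879.20, 995.37));
    System.out.printf("%.1f%%%n", pctDiff(177.54, 202.74));
  }
}
```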

github-actions · 2024-04-13T00:15:48Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

github-actions · 2024-04-30T00:17:32Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

Terminate after matched the whole prefix for PrefixQuery.

529399a

Modify comment. Modify comment.

github-actions bot added the Stale label Feb 20, 2024

Use terminable flag instead of value check.

a5f5e43

github-actions bot removed the Stale label Feb 22, 2024

Match sub block's entry directly.

181529e

Set minTermBlockSize to 2, maxTermBlockSize to 3, to generate subBlock

d6e69ca

mikemccand reviewed Feb 27, 2024

Detect accept state can match all suffix in RunAutomaton.

e997599

Reset frame's matchAllSuffix state.

b5e977e

mikemccand reviewed Mar 6, 2024

Add AutomatonQuery test case and improve code.

f447f8a

Merge branch 'main' into optimize_prefix_query

46036f6

github-actions bot added Stale and removed Stale labels Apr 13, 2024

github-actions bot added the Stale label Apr 30, 2024

vsop-479 added 2 commits May 8, 2024 14:13

Merge branch 'main' into optimize_prefix_query

9b40381

Set isBinary to true.

2579c57

vsop-479 added 2 commits May 8, 2024 17:07

Fix comment.

a1c587c

Add matchAllSuffix to equals and ramBytesUsed.

0199f4a

github-actions bot removed the Stale label May 9, 2024

Rename canMatchAllSuffix to detectMatchAllSuffix.

a5f30bc
