remove implicit determinization from RegexpQuery #15939

Merged
rmuir merged 5 commits into apache:main from rmuir:regexp-query-remove-det
Apr 12, 2026
Conversation

@rmuir
Member

@rmuir rmuir commented Apr 7, 2026

For RegexpQuery only, remove implicit determinization. If your regular expression happens to be a DFA already, then you get DFA execution. Otherwise, if it is an NFA, you get NFA execution that only compiles as needed.

If a user really wants to force DFA execution, they can call determinize() themselves to "compile" it up front and pass that automaton to AutomatonQuery.

NFARunAutomaton was previously thread-unsafe, and would not work with certain MultiTermQuery rewrite methods. @drempapis fixed it in this PR to be thread-safe.

We can follow up with similar changes to WildcardQuery, intervals, analyzers, suggesters, wherever else it makes sense.
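
To make "NFA execution that only compiles as-needed" concrete, here is a minimal sketch of on-demand subset construction in plain Java. It has no Lucene dependency: the class `LazyDfa`, its fields, and the array encoding of the NFA are all hypothetical and are not NFARunAutomaton's API. A DFA state (a set of NFA states) is only created the first time some input actually reaches it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical illustration only: runs an NFA by building DFA states on demand. */
final class LazyDfa {
  // nfa[state][symbol] = target NFA states (this sketch has no epsilon transitions)
  private final int[][][] nfa;
  private final boolean[] nfaAccept;
  // Each DFA state is a set of NFA states; ids are assigned in discovery order.
  private final Map<TreeSet<Integer>, Integer> ids = new HashMap<>();
  private final List<TreeSet<Integer>> sets = new ArrayList<>();
  private final List<int[]> transitions = new ArrayList<>(); // -1 = not computed yet

  LazyDfa(int[][][] nfa, boolean[] nfaAccept, int start) {
    this.nfa = nfa;
    this.nfaAccept = nfaAccept;
    TreeSet<Integer> initial = new TreeSet<>();
    initial.add(start);
    intern(initial); // the initial DFA state always gets id 0
  }

  /** Returns the id of the DFA state for this NFA-state set, creating it if new. */
  private int intern(TreeSet<Integer> set) {
    Integer id = ids.get(set);
    if (id != null) return id;
    int newId = sets.size();
    ids.put(set, newId);
    sets.add(set);
    int[] row = new int[nfa[0].length];
    Arrays.fill(row, -1);
    transitions.add(row);
    return newId;
  }

  /** Follows one symbol, computing and caching the transition on first use. */
  private int step(int dState, int symbol) {
    int cached = transitions.get(dState)[symbol];
    if (cached != -1) return cached;
    TreeSet<Integer> next = new TreeSet<>();
    for (int s : sets.get(dState)) {
      for (int target : nfa[s][symbol]) next.add(target);
    }
    int id = intern(next);
    transitions.get(dState)[symbol] = id;
    return id;
  }

  boolean accepts(int[] input) {
    int d = 0;
    for (int symbol : input) d = step(d, symbol);
    for (int s : sets.get(d)) if (nfaAccept[s]) return true;
    return false;
  }

  /** Number of DFA states built so far. */
  int materializedStates() {
    return sets.size();
  }
}
```

Note that this sketch mutates its caches without any synchronization, so it is not safe for concurrent use as written.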

for RegexpQuery only, remove explicit determinization. If your regular
expression happens to be a DFA already, then you get a DFA execution.
Otherwise if it is NFA, you get NFA execution that only compiles
as-needed.

Some tests fail with assertion errors and may need to be debugged. There
is also a TODO about certain rewrite methods being incompatible; it might
be related to the failures.

So this change is not yet ready, but it demonstrates the idea.

If we can fix this in WildcardQuery too, we can remove the limit
parameter from QueryParsers completely, which is also a big bonus.
@rmuir
Member Author

rmuir commented Apr 8, 2026

Feel free to toss this PR if anyone wants to iterate elsewhere, I mainly wanted to:

  • prototype what it would look like to remove these parameters from queries/parsers api
  • see how the tests behaved when I took away the DFA and try to understand what else needs to be done

If we could get this "implicit" determinization out of all places where it can go exponential, then it simplifies the problem: determinization only happens if the user themself does it.

e.g. for RegexpQuery, if a user really really wants a DFA for some special purpose, they can still do it, but only because they are calling determinize() themselves:

var automaton = regexp.toAutomaton();
var dfa = Operations.determinize(automaton);
new AutomatonQuery(dfa, ...)

This simplifies the problem because then we can think about how determinize() should look without concerning ourselves with how it will "cascade" across a ton of other APIs.
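
For readers unfamiliar with why determinization "can go exponential", here is a small self-contained sketch (plain Java, no Lucene; the class `BlowupDemo` is hypothetical). It runs eager subset construction on the textbook NFA for `(a|b)*a(a|b){n}`, which has n+2 states, and counts the reachable DFA states, which number 2^(n+1).

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration: counts DFA states produced by subset construction. */
final class BlowupDemo {
  /**
   * NFA for (a|b)*a(a|b){n}: state 0 loops on both symbols and also moves to
   * state 1 on 'a'; state i (1..n) moves to i+1 on either symbol; state n+1 accepts.
   * A subset of NFA states is encoded as a bitmask, so n must stay well below 62.
   */
  static int countDfaStates(int n) {
    Set<Long> seen = new HashSet<>();
    ArrayDeque<Long> queue = new ArrayDeque<>();
    seen.add(1L); // initial subset {0}
    queue.add(1L);
    while (!queue.isEmpty()) {
      long subset = queue.poll();
      for (int symbol = 0; symbol < 2; symbol++) { // 0 = 'a', 1 = 'b'
        long next = step(subset, symbol, n);
        if (seen.add(next)) queue.add(next);
      }
    }
    return seen.size();
  }

  static long step(long subset, int symbol, int n) {
    long next = 1L; // state 0 is in every reachable subset and loops on both symbols
    if (symbol == 0) next |= 2L; // on 'a', state 0 also moves to state 1
    for (int i = 1; i <= n; i++) {
      if ((subset & (1L << i)) != 0) next |= 1L << (i + 1);
    }
    return next;
  }
}
```

Each extra repetition in the pattern adds one NFA state but doubles the DFA, which is the kind of growth a determinization work limit has to guard against.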

@drempapis
Contributor

@rmuir, before going further, I wanted to check with you, would you prefer that follow-up work happens on this branch/PR, or should I fork this and open a separate PR building on your approach? Happy to iterate either way, just want to align with how you’d like this to evolve.

@rmuir
Member Author

rmuir commented Apr 8, 2026

> @rmuir, before going further, I wanted to check with you, would you prefer that follow-up work happens on this branch/PR, or should I fork this and open a separate PR building on your approach? Happy to iterate either way, just want to align with how you’d like this to evolve.

However you like. This was mainly just a way to try to demonstrate the problem: we already have a lot of "telescoping" in query/queryparser apis for this determinization, and if we were to add more, it would be pretty wild.

It is unfortunate about the tests. If you make your own branch we can close this one out, just ping me and I can try to help debug some of those regexp tests.

@github-actions github-actions Bot added this to the 10.5.0 milestone Apr 9, 2026
@drempapis
Contributor

@rmuir The prototype for RegexpQuery LGTM. I've made only one update in NFARunAutomaton. The previous implementation had shared mutable state without proper synchronization, so concurrent access could corrupt internal state and lead to random failures. This update makes (or is intended to make) NFARunAutomaton safe for concurrent use.

I also experimented with the WildcardQuery, but it caused test failures, so I decided to handle it in a separate PR.
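
The kind of race described above can be sketched with a lazily filled cache shared between threads. The class below is a hypothetical stand-in, not the NFARunAutomaton code; it shows the coarse-grained approach of marking the mutating method `synchronized`, and assumes the computed values are never negative so that -1 can mark a missing entry.

```java
import java.util.Arrays;
import java.util.function.IntUnaryOperator;

/** Hypothetical illustration: a lazily populated table that is safe to share. */
final class SharedLazyTable {
  private static final int MISSING = -1;
  private final int[] cache;
  private final IntUnaryOperator compute;
  private int computations; // guarded by "this"

  SharedLazyTable(int size, IntUnaryOperator compute) {
    this.cache = new int[size];
    Arrays.fill(cache, MISSING);
    this.compute = compute;
  }

  /** The check and the fill happen under one lock, so each entry is computed once. */
  synchronized int get(int key) {
    if (cache[key] == MISSING) {
      cache[key] = compute.applyAsInt(key);
      computations++;
    }
    return cache[key];
  }

  synchronized int computations() {
    return computations;
  }
}
```

Locking the whole method favors correctness over throughput: threads that only read already-filled entries still contend on the same monitor.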

@rmuir
Member Author

rmuir commented Apr 9, 2026

> @rmuir The prototype for RegexpQuery LGTM. I've made only one update in NFARunAutomaton. The previous implementation had shared mutable state without proper synchronization, so concurrent access could corrupt internal state and lead to random failures. This update makes (or is intended to make) NFARunAutomaton safe for concurrent use.

@drempapis thank you so much for tackling this! I'm gonna run the tests a few times and see if we can get them mostly stable.

I don't think we need to achieve perfection to move forwards and merge these changes, but I also don't want to cause a flurry of build failures if there are easy fixes. Some of these tests are pretty tough.

@rmuir
Member Author

rmuir commented Apr 9, 2026

For this one I'd like to tackle main (11.0) initially. The change is a biggish one in its current form and impacts several key apis such as query/queryparser.

This isn't saying we shouldn't backport it to a 10.x, just not initially? Would also be good to bake it in the CI a bit, give these randomized tests some time to do their thing.

@rmuir rmuir marked this pull request as ready for review April 9, 2026 14:01
@github-actions github-actions Bot modified the milestones: 10.5.0, 11.0.0 Apr 9, 2026
@drempapis
Contributor

> For this one I'd like to tackle main (11.0) initially. The change is a biggish one in its current form and impacts several key apis such as query/queryparser.

You are right. I've also started prototyping the WildcardQuery and will share a draft PR shortly.

@rmuir
Member Author

rmuir commented Apr 9, 2026

I ran 500 iterations of TestRegexpRandom2 successfully with these changes!

Separately, maybe we can dig into NFARunAutomaton as a followup at some later point and see if we can improve it for concurrent execution, but let's start with correctness. I think this PR is a big enough step already.

@rmuir rmuir changed the title from "remove explicit determinization from RegexpQuery" to "remove implicit determinization from RegexpQuery" Apr 9, 2026
@rmuir rmuir requested review from dweiss, mikemccand and zhaih April 9, 2026 14:45
@zhaih
Contributor

zhaih commented Apr 9, 2026

Does this PR mean NFA will become the default (unless the user's regexp happens to be translated to a DFA directly)?

Contributor

@zhaih zhaih left a comment

LGTM, several comments for performance, but I think we can fix it separately.

Comment on lines +239 to +252
+    synchronized (dState) {
+      dState.determinize();
+      int outgoingTransitions = -1;
+      t.transitionUpto = -1;
+      t.source = state;
+      while (outgoingTransitions < index && t.transitionUpto < points.length - 1) {
+        if (dState.transitions[++t.transitionUpto] != MISSING) {
+          outgoingTransitions++;
+        }
       }
-    }
-    assert outgoingTransitions == index;
+      assert outgoingTransitions == index;

-    setTransitionAccordingly(t);
+      setTransitionAccordingly(t, dState);
+    }
Contributor

This sync block is probably not needed? dState.determinize(); is sync'd and once it's fully determinized there'll be no further modification necessary to this state?
Let's add a TODO for now? Since I guess correctness is more important for this PR

   }

-  private int nextState(int charClass) {
+  private synchronized int nextState(int charClass) {
Contributor

We probably are able to have a finer grained sync control in this method as well in the future.

    */
   private DState step(int c) {
-    statesSet.reset(); // TODO: fork IntHashSet from hppc instead?
+    StateSet statesSet = new StateSet(5);
Contributor

Do we want to make this per DState as anyway all operations using this one will be locked? Per step seems a bit too much allocation?

Comment thread lucene/CHANGES.txt Outdated
@rmuir
Member Author

rmuir commented Apr 9, 2026

> Does this PR mean NFA will become the default (unless the user's regexp happens to be translated to a DFA directly)?

That's correct. We fixed a lot of inefficiencies in the Regexp parser and some of the automaton helper functions though, so they produce fewer NFAs than before. Previously you'd get an NFA for silly reasons.

If the user wants to FORCE a DFA, they should call determinize() themselves? I'm sure some use-cases might get slower, but other ones might get faster too. RegexpQuery is kinda like using a non-compiled "pattern": it is a convenience for a one-off. If you have a single pattern that is going to be heavily reused, you can use AutomatonQuery.
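
The "non-compiled pattern" comparison has a close analogue in `java.util.regex`, sketched below as an analogy only (this is not Lucene code, and the class `PatternReuse` is hypothetical). `String.matches` compiles the regex on every call, which is fine for a one-off; a pattern that is reused heavily is better compiled once with `Pattern.compile`.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration of one-off versus precompiled regex use. */
final class PatternReuse {
  /** One-off convenience: the regex is recompiled inside every String.matches call. */
  static long countOneOff(List<String> terms, String regex) {
    return terms.stream().filter(t -> t.matches(regex)).count();
  }

  /** Reuse: the caller pays the compilation cost once, then matches many times. */
  static long countPrecompiled(List<String> terms, Pattern compiled) {
    return terms.stream().filter(t -> compiled.matcher(t).matches()).count();
  }
}
```

Both methods return the same count; they differ only in who pays for compilation and how often.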

Contributor

@dweiss dweiss left a comment

LGTM.

Comment thread lucene/CHANGES.txt Outdated
Contributor

@zhaih zhaih left a comment

LGTM

@rmuir rmuir merged commit 650684e into apache:main Apr 12, 2026
13 checks passed
5 participants