remove implicit determinization from RegexpQuery#15939
Conversation
for RegexpQuery only, remove explicit determinization. If your regular expression happens to be a DFA already, then you get a DFA execution. Otherwise if it is NFA, you get NFA execution that only compiles as-needed. Some tests fail with assertion errors and may need to be debugged. There is also a TODO about certain rewritemethods being incompatible, it might be related to the failures. So this change is not yet ready, but it demonstrates the idea. If we can fix this in WildcardQuery too, we can remove the limit parameter from QueryParsers completely, which is also a big bonus.
|
Feel free to toss this PR if anyone wants to iterate elsewhere, I mainly wanted to:
If we could get this "implicit" determinization out of all places where it can go exponential, then it simplifies the problem: determinization only happens if the user themself does it. e.g. for RegexpQuery, if a user really really wants a DFA for some special purpose, they can still do it, but it is because they are calling themselves calling determinize(): var automaton = regexp.toAutomation();
var dfa = Operations.determinize(automaton);
new AutomatonQuery(dfa, ...)This simplifies the problem because then we can think about how |
|
@rmuir, before going further, I wanted to check with you, would you prefer that follow-up work happens on this branch/PR, or should I fork this and open a separate PR building on your approach? Happy to iterate either way, just want to align with how you’d like this to evolve. |
However you like. This was mainly just a way to try to demonstrate the problem, that we already have a lot of "telescoping" in query/queryparser apis for this determinization, if we were to add more, it would be pretty wild. It is unfortunate about the tests. If you make your own branch we can close this one out, just ping me and I can try to help debug some of those regexp tests. |
|
@rmuir The prototype for I also experimented with the |
@drempapis thank you so much for tackling this! I'm gonna run the tests a few times and see if we can get them mostly stable. I don't think we need to achieve perfection to move forwards and merge these changes, but I also don't want to cause a flurry of build failures if there are easy fixes. Some of these tests are pretty tough. |
|
For this one I'd like to tackle main (11.0) initially. The change is a biggish one in its current form and impacts several key apis such as query/queryparser. This isn't saying we shouldn't backport it to a 10.x, just not initially? Would also be good to bake it in the CI a bit, give these randomized tests some time to do their thing. |
You are right. I've also started prototyping the |
|
I ran 500 iterations of TestRegexpRandom2 successfully with these changes! Separately, maybe we can dig into NFARunAutomaton as a followup at some later point and see if we can improve it for concurrent execution, but let's start with correctness. I think this PR is a big enough step already. |
|
Does this PR means NFA will become default (unless the user's regexp happens to be translated to a DFA directly)? |
zhaih
left a comment
There was a problem hiding this comment.
LGTM, several comments for performance, but I think we can fix it separately.
| synchronized (dState) { | ||
| dState.determinize(); | ||
| int outgoingTransitions = -1; | ||
| t.transitionUpto = -1; | ||
| t.source = state; | ||
| while (outgoingTransitions < index && t.transitionUpto < points.length - 1) { | ||
| if (dState.transitions[++t.transitionUpto] != MISSING) { | ||
| outgoingTransitions++; | ||
| } | ||
| } | ||
| } | ||
| assert outgoingTransitions == index; | ||
| assert outgoingTransitions == index; | ||
|
|
||
| setTransitionAccordingly(t); | ||
| setTransitionAccordingly(t, dState); | ||
| } |
There was a problem hiding this comment.
This sync block is probably not needed? dState.determinize(); is sync'd and once it's fully determinized there'll be no further modification necessary to this state?
Let's add a TODO for now? Since I guess correctness is more important for this PR
| } | ||
|
|
||
| private int nextState(int charClass) { | ||
| private synchronized int nextState(int charClass) { |
There was a problem hiding this comment.
We probably are able to have a finer grained sync control in this method as well in the future.
| */ | ||
| private DState step(int c) { | ||
| statesSet.reset(); // TODO: fork IntHashSet from hppc instead? | ||
| StateSet statesSet = new StateSet(5); |
There was a problem hiding this comment.
Do we want to make this per DState as anyway all operations using this one will be locked? Per step seems a bit too much allocation?
That's correct. We fixed a lot of inefficient problems with Regexp parser and some of the automaton helpers functions though, so they make less NFAs than before. Previously you'd get an NFA for silly reasons. If the user wants to FORCE a DFA, they should call |
for
RegexpQueryonly, remove implicit determinization. If your regular expression happens to be a DFA already, then you get a DFA execution. Otherwise if it is NFA, you get NFA execution that only compiles as-needed.If a user really wants to force a DFA execution, they can call
determinize()themselves to "compile" it up-front, pass that automaton to AutomatonQuery.NFARunAutomatonwas previously thread-unsafe, and would not work with certain MultiTermQuery rewrite methods. @drempapis fixed it in this PR to be thread-safe.We can followup with similar changes to WildcardQuery, intervals, analyzers, suggesters, whereever else it makes sense.