Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NonBacktracking locking fixes and cleanup #71234

Merged
merged 3 commits into from
Jul 8, 2022

Conversation

olsaarik
Copy link
Contributor

@olsaarik olsaarik commented Jun 23, 2022

This PR addresses #70753. SymbolicRegexBuilder has mutable state that has to be protected by locks when used concurrently from SymbolicRegexMatcher. The builder has some locking, but a bug had crept in. This PR introduces the following to make the code more clearly correct:

  • Move matching-specific state such as arrays of DFA/NFA states and transitions from the builder to the matcher. This allows removing conditional initialization of these in uses of the builder outside the matcher. The state left in the builder is now caches for SymbolicRegexNode instances and memoization caches for functions in SymbolicRegexNode.
  • Remove the _builder variable from SymbolicRegexNode to avoid accidental concurrent use through concurrent calls to functions in the nodes. The builder now has to be passed into all functions that need it, which makes it obvious which functions are thread safe.
  • Move all synchronization from the builder to the matcher. Locking is used now to protect all concurrent usage of _builder in the matcher as well as state caches and transition arrays (as necessary: transition arrays are still read without locking or Volatile.Read for performance, see Document the memory model guaranteed by dotnet/runtime runtimes #63474 and Unnecessary LdelemaRef call not elided with Volatile.Read #65789).
  • Create a ConcurrentArrayResize function to replace Array.Resize usage with the difference that the array is copied to a larger one first and then published with Volatile.Write. This addresses a concern that other threads might see an inconsistent copy when reading from the arrays being resized without holding a lock. I think the code was correct without this as 1. all of these arrays were ones with atomic reads/writes and 2. any read that saw a zero value would acquire the lock and re-read the array after. However, the code could easily break if any of the arrays change, e.g., to a struct element type and generally it seems better to not cause other threads to spuriously have to acquire a lock.

Then the actual issue in #70753 is addressed by asserting that the lock is held in GetOrCreateState and calling a GetOrCreateStateUnsafe variant in the constructor, where locking is not required yet.

In addition to these changes, the following cleanup is done (mostly because I just happened to be looking at the code again):

  • Simplify logic related to figuring out character kinds. There is now much less duplication of logic for 1. mapping characters/minterms to character kinds, 2. handling an \n at the very end of input for the \Z anchor, and 3. giving the BeginningEnd kind to indices outside the input span. All of the logic is in the main SymbolicRegexMatcher code file. The logic for mapping minterms back to character kinds in DfaMatchingState is now gone.
  • Make state and transition arrays grow by factor of 2 instead of a fixed 1024 elements (this had a todo on it). This mirrors what List<T> does. The reason we're not using List<T> itself is concerns with concurrent usage and ability to resize cheaply.

One potentially performance-impactful change (among a few others) is that the "ascii characters to character kinds" array is now gone, in favor of a "minterm IDs to character kinds" array. I'll be evaluating/debugging any performance regressions before merging.

@ghost
Copy link

ghost commented Jun 23, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

This PR addresses #70753. SymbolicRegexBuilder has mutable state that has to be protected by locks when used concurrently from SymbolicRegexMatcher. The builder has some locking, but a bug had crept in. This PR introduces the following changes to address the issue and make the code more clearly correct:

  • Move matching-specific state such as arrays of DFA/NFA states and transitions from the builder to the matcher. This allows removing conditional initialization of these in uses of the builder outside the matcher. The state left in the builder is now caches for SymbolicRegexNode instances and memoization caches for functions in SymbolicRegexNode.
  • Remove the _builder variable from SymbolicRegexNode to avoid accidental concurrent use through concurrent calls to functions in the nodes. The builder now has to be passed into all functions that need it, which makes it obvious which functions are thread safe.
  • Move all synchronization from the builder to the matcher. Locking is used now to protect all concurrent usage of _builder in the matcher as well as state caches and transition arrays (as necessary: transition arrays are still read without locking or Volatile.Read for performance, see Document the memory model guaranteed by dotnet/runtime runtimes #63474 and Unnecessary LdelemaRef call not elided with Volatile.Read #65789).

In addition to these changes, the following cleanup is done:

  • Char kinds and minterms
  • State length growth factor 2
Author: olsaarik
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: -

@olsaarik olsaarik force-pushed the nonbacktracking-locking-cleanup branch 3 times, most recently from 58adf2e to d947ae4 Compare June 29, 2022 00:43
Removed builder reference from SymbolicRegexNode instances; builder now
has to be passed in. Since the builder is not thread safe this clarifies
the locking required in the matcher when using it.
Moved matching specific state from the builder to the matcher. This
includes state and transition arrays.
Simplify character kind code by eliminating duplication of logic.
@olsaarik olsaarik force-pushed the nonbacktracking-locking-cleanup branch from d947ae4 to 9fa528f Compare June 29, 2022 00:56
@olsaarik olsaarik marked this pull request as ready for review June 29, 2022 00:57
@olsaarik
Copy link
Contributor Author

Code should be in a good state now. I've checked the performance and there should be no significant performance regressions (I'm seeing minor improvements rather).

The char kind logic changes required some performance work, which motivated the IInputReader interface in the main matcher code file. The handling of the end-of-line as the last character is slightly less selective now, but also patterns with no end-of-line anchors get to skip the whole logic completely.

DfaMatchingState is now just MatchingState
Copy link
Member

@stephentoub stephentoub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Thanks!

@stephentoub stephentoub merged commit 6ebcb3d into dotnet:main Jul 8, 2022
olsaarik added a commit that referenced this pull request Jul 8, 2022
Fixes issue (#71808) where deeply nested structures caused the start set
computation to overflow the stack.
Re-introduces bottom-up computation of start sets that (#71234) had reworked.
As an optimization, the Singleton node's set field is reused as the start set
field, since for Singleton nodes the two concepts coincide and other node types
do not need the set.
@ghost ghost locked as resolved and limited conversation to collaborators Aug 7, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants