Readme for regex implementation #1945
Merged
14 commits
5cbb8b3 Readme for regex implementation (danmoseley)
749c463 typo (danmoseley)
497e8d7 Update src/libraries/System.Text.RegularExpressions/src/dotnet-regula… (danmoseley)
8a643a3 Update src/libraries/System.Text.RegularExpressions/src/dotnet-regula… (danmoseley)
b3f853f Update src/libraries/System.Text.RegularExpressions/src/dotnet-regula… (danmoseley)
857ae83 Update src/libraries/System.Text.RegularExpressions/src/dotnet-regula… (danmoseley)
8658a1b Update src/libraries/System.Text.RegularExpressions/src/dotnet-regula… (danmoseley)
9109759 Update src/libraries/System.Text.RegularExpressions/src/dotnet-regula… (danmoseley)
c50aeb8 Update src/libraries/System.Text.RegularExpressions/src/dotnet-regula… (danmoseley)
b09a8ec Updates (danmoseley)
8de1411 More (danmoseley)
8c87c18 note (danmoseley)
057a676 Merge branch 'docs3' of https://github.com/danmosemsft/runtime into d… (danmoseley)
7523f65 rename (danmoseley)
237 changes: 237 additions & 0 deletions
src/libraries/System.Text.RegularExpressions/src/dotnet-regular-expressions.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,237 @@ | ||
# Implementation of System.Text.RegularExpressions | ||
|
||
The implementation uses a typical NFA approach that supports back references. Patterns are parsed into a tree (`RegexTree`), translated into an abstract program (`RegexCode`) by a writer (`RegexWriter`), and then either run by an interpreter (`RegexInterpreter`) or compiled to IL, which is then executed (`CompiledRegexRunner`). Both of these are instances of `RegexRunner`. In the compiled case, a factory generated from the `RegexCode` creates a new runner whenever the pattern is to be executed and no idle runner is available. | |
|
||
Regex engines differ in their features: .NET regular expressions have a couple that others do not, such as `Capture`s (distinct from `Group`s). The implementation does not support searching UTF-8 text, nor searching a `Span` over a buffer. | |
|
||
Unlike with some DFA-based engines, patterns must be trusted. Untrusted text can be searched, provided a timeout is used to guard against catastrophic backtracking. | |
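As a minimal sketch of searching untrusted text with a timeout: the pattern, the input, and the 100 ms budget below are illustrative assumptions, and `TimeoutExample` is a hypothetical helper. Only the `Regex(string, RegexOptions, TimeSpan)` constructor and `RegexMatchTimeoutException` come from the public API.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TimeoutExample
{
    // Returns true/false for a completed search, or null if the search
    // was abandoned because it exceeded the timeout.
    public static bool? IsMatchWithTimeout(string pattern, string untrustedText, TimeSpan timeout)
    {
        // The pattern is trusted (written by us); only the text is untrusted.
        var regex = new Regex(pattern, RegexOptions.None, timeout);
        try
        {
            return regex.IsMatch(untrustedText);
        }
        catch (RegexMatchTimeoutException)
        {
            // Treat a timeout as "could not decide" rather than letting
            // a pathological input monopolize the CPU.
            return null;
        }
    }

    public static void Main()
    {
        // Nested quantifiers are a classic source of heavy backtracking
        // when the text almost matches.
        string text = new string('a', 40) + "!";
        bool? result = IsMatchWithTimeout(@"^(a+)+$", text, TimeSpan.FromMilliseconds(100));
        Console.WriteLine(result.HasValue ? result.Value.ToString() : "timed out");
    }
}
```

Whether this particular input actually times out depends on the optimizations in the runtime version in use, so the sketch handles both outcomes.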
|
||
Performance is important and we welcome optimizations so long as they preserve the public contract. | ||
|
||
## Extensibility | ||
|
||
Key types have significant protected surface area. This is probably not intended as a general extensibility point, but rather as an implementation detail of saving a compiled regex to disk. Saving to disk is implemented by saving an assembly containing three types, one deriving from each of `Regex`, `RegexRunnerFactory`, and `RegexRunner`. This mechanism accounts for all the protected methods (and even protected fields) on these classes. If we were designing them today, we would likely limit their public surface more carefully, and possibly not rely on derived types. | |
|
||
Protected members are part of the public API, which cannot be broken, so they may make some future optimizations more difficult. In particular, we must keep them stable in order to remain compatible with regexes saved by .NET Framework. | |
|
||
`RegexCompiler` is abstract for a different reason: to share implementation between `RegexLWCGCompiler` and `RegexAssemblyCompiler`. It is based around a field of type `System.Reflection.Emit.ILGenerator` and has protected utility methods and fields to work with it. External extension by derivation from `RegexCompiler` would likely be clumsy, as it contains knowledge of `RegexLWCGCompiler` and `RegexAssemblyCompiler`. | |
|
||
## Key types - General | ||
|
||
### Regex (public) | ||
|
||
* Represents an executable regular expression with some utility static methods | ||
* Several protected fields and methods but no derived classes exist in this implementation (see [Extensibility](#Extensibility) section above). | ||
* Constructor sets `RegexCode` using `RegexParser` and `RegexWriter`; then, if `RegexOptions.Compiled`, compiles and holds a `RegexRunnerFactory` and clears `RegexCode`; these steps only need to be done once for this `Regex` object | ||
* Various public entry points converge on `Run()` which uses the held `RegexRunner` if any; if none or in use, creates another with the held `RegexRunnerFactory` if any; if none, interprets with held `RegexCode` | ||
* All static methods (such as `Regex.Match`) attempt to find a pre-existing `Regex` object for the requested pattern and options in the `RegexCache`. This is legitimate because, after construction, `Regex` objects are thread-safe. If there is a cache hit, execution can begin immediately. | |
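From the caller's side, the last two points might look like the following sketch: one `Regex` instance constructed once and shared across threads, next to the static entry point that goes through the cache. The class name, pattern, and inputs are illustrative assumptions.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;

public static class SharedRegexExample
{
    // Constructed once: parsing (and compilation, since Compiled is requested)
    // happens here rather than on each use.
    private static readonly Regex s_digits = new Regex(@"\d+", RegexOptions.Compiled);

    // Counts the inputs that contain digits, using the same Regex object
    // from many threads at once.
    public static int CountWithInstance(string[] inputs)
    {
        int count = 0;
        Parallel.ForEach(inputs, input =>
        {
            if (s_digits.IsMatch(input))
            {
                Interlocked.Increment(ref count);
            }
        });
        return count;
    }

    // Same result through the static entry point, which finds or creates
    // a Regex for this pattern and options on each call.
    public static int CountWithStatic(string[] inputs) =>
        inputs.Count(input => Regex.IsMatch(input, @"\d+"));

    public static void Main()
    {
        string[] inputs = { "a1", "b", "22", "c" };
        Console.WriteLine(CountWithInstance(inputs)); // 2
        Console.WriteLine(CountWithStatic(inputs));   // 2
    }
}
```

Holding the instance avoids the cache lookup entirely, while the static form is convenient for patterns used only occasionally.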
|
||
### RegexOptions (public) | ||
|
||
* `RightToLeft` is supported throughout, but as the less common case it is less optimized. | ||
* `ExplicitCapture` is off by default. This is relevant to performance, as patterns often contain parentheses purely as a grouping mechanism: for example, `(something){1,3}` is easier to type than the non-capturing form `(?:something){1,3}`. Because explicit capture is off by default, the engine will in this case capture `something` even though the capture is not needed. | |
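The effect of `ExplicitCapture` can be observed through the public API by counting groups. This is a small sketch; the helper name `GroupCount` and the sample pattern are illustrative assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExplicitCaptureExample
{
    // Number of groups in the match, including group 0 (the whole match).
    public static int GroupCount(string pattern, string text, RegexOptions options) =>
        Regex.Match(text, pattern, options).Groups.Count;

    public static void Main()
    {
        // Default options: the parentheses create capturing group 1.
        Console.WriteLine(GroupCount(@"(ab){1,3}", "ababab", RegexOptions.None));            // 2

        // ExplicitCapture: unnamed parentheses only group, they do not capture.
        Console.WriteLine(GroupCount(@"(ab){1,3}", "ababab", RegexOptions.ExplicitCapture)); // 1

        // The non-capturing form gives the same result without the option.
        Console.WriteLine(GroupCount(@"(?:ab){1,3}", "ababab", RegexOptions.None));          // 1
    }
}
```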
|
||
### MatchEvaluator (public) | ||
|
||
### RegexCompilationInfo (public) | ||
|
||
* Parameters to use for regex compilation to disk | ||
* Passed in by app to `Regex.CompileToAssembly(..)` - which is not currently implemented. | ||
|
||
## Key types - Parsing | ||
|
||
### RegexParser | ||
|
||
* Converts pattern string to `RegexTree` of `RegexNode`s | ||
* Invoked with `RegexTree Parse(string pattern, RegexOptions options...) {}` | ||
* Also has `Escape(..)` and `Unescape(..)` methods, and parses into `RegexReplacement`s | ||
* Does a partial prescan to prepare capture slots | |
* As each `RegexNode` is added, it attempts to reduce (optimize) the newly formed subtree. When parsing completes, there is a final optimization of the whole tree. | ||
|
||
### RegexReplacement | ||
|
||
* Parsed replacement pattern | ||
* Created by `RegexParser`, used in `Regex.Replace`/`Match.Result(..)` | ||
|
||
### RegexCharClass | ||
|
||
* Representation of a single character, a character range, or a character class | |
* Created by `RegexParser` | ||
* Creates packed string to be held on `RegexNode` | ||
* Has utility methods for examining the packed string, in particular for testing membership of the class (`CharInClass(..)`) | ||
|
||
### RegexNode | ||
|
||
* Node in regex parse tree | ||
* Created by `RegexParser` | ||
* Some nodes represent subsequent optimizations, rather than individual elements of the pattern | ||
* Holds `Children` and `Next` | ||
* Holds a char or string (which may be a char class), and `M` and `N` constants (e.g. loop bounds) | |
* Note: polymorphism was not used here: the interpretation of its fields depends on the integer Type field | ||
|
||
### RegexTree | ||
|
||
* Simple holder for root `RegexNode`, options, and a captures data structure | ||
* Created by `RegexParser` | ||
|
||
### RegexWriter | ||
|
||
* Responsible for translating a `RegexTree` to a `RegexCode` | ||
* Invoked by `Regex` | ||
* Creates an instance of itself in `RegexCode Write(RegexTree tree){}` | |
|
||
### RegexFCD | ||
|
||
* Responsible for static pattern prefixes | ||
* Created by `RegexWriter` | ||
* Creates `RegexFC`s | ||
* `FirstChars()` creates `RegexPrefix` from `RegexTree` | ||
* FC means "First chars": not clear what D means... | ||
Let's just rename the type ;) |
||
|
||
### RegexPrefix | ||
|
||
* Literal string that a match must begin with | |
|
||
### RegexBoyerMoore | ||
|
||
* Supports searching the text for literals | ||
* Constructed by `RegexWriter` | ||
* Singleton held on `RegexCode` | ||
* `RegexInterpreter` uses it to perform Boyer-Moore search | ||
* `RegexCompiler` uses the tables from this object, but generates its own code for the Boyer-Moore search | ||
|
||
### RegexCode | ||
|
||
* Abstract representation of the "program" for a particular pattern | ||
* Created by `RegexWriter` | ||
* Code is an array of integers. Within the array, op-codes' types are indicated by integer consts analogous to those on `RegexNode`. | ||
* Has several related data structures such as a string table, a captures table, and prefixes | ||
|
||
## Key types - Compilation (if not interpreted) | ||
|
||
### RegexCompiler (public abstract) | ||
|
||
* Responsible for compiling `RegexCode` to a `RegexRunnerFactory` | ||
* As implemented, uses `RegexLWCGCompiler` | ||
* Has utility method `CompileToAssembly` that invokes `RegexParser` and `RegexWriter` directly, then uses `RegexAssemblyCompiler` (see note for that type) | |
* Key protected methods are `GenerateFindFirstChar()` and `GenerateGo()` | ||
* Created and used only from `RegexRunnerFactory Regex.Compile(RegexCode code, RegexOptions options...)` | ||
* Implements `RegexRunnerFactory RegexCompiler.Compile(RegexCode code, RegexOptions options...)` | ||
|
||
### RegexLWCGCompiler (is a RegexCompiler) | ||
|
||
* Creates a `CompiledRegexRunnerFactory` using `RegexRunnerFactory FactoryInstanceFromCode(RegexCode .. )` | ||
|
||
### RegexRunnerFactory (public pure abstract) | ||
|
||
* Reusable: creates `RegexRunner`s on demand with `RegexRunner CreateInstance()` | |
* Not relevant to interpreted mode | ||
* Must be thread-safe, as each `Regex` holds one, and `Regex` is thread-safe | ||
|
||
### CompiledRegexRunnerFactory (is a RegexRunnerFactory) | ||
|
||
* Created by `RegexLWCGCompiler` | ||
* Creates `CompiledRegexRunner` on request | ||
|
||
### RegexAssemblyCompiler | ||
|
||
* Created and used by `RegexCompiler.CompileToAssembly(...)` to write compiled regex to disk: at present, writing to disk is not implemented, because Reflection.Emit does not support it. | ||
|
||
## Key types - Execution | ||
|
||
### RegexRunner (public abstract) | ||
|
||
* Responsible for executing a regular expression: not thread-safe | ||
* Reusable: each call to `Scan(..)` begins a new execution | |
* Lots of protected members: tracking position, execution stacks, and captures: | ||
* `protected abstract void Go()` | ||
* `protected abstract bool FindFirstChar()` | ||
* `public Match? Scan(System.Text.RegularExpressions.Regex regex, string text...)` calls `FindFirstChar()` and `Go()` | ||
* Has a "quick" mode that does not instantiate any captures: used by `Regex.IsMatch(..)` which does not expose captures to the caller | ||
* Concrete instances created by `Match? Regex.Run(...)` calling either `RegexRunner CompiledRegexRunnerFactory.CreateInstance()` or newing up a `RegexInterpreter` | ||
|
||
### RegexInterpreter (is a RegexRunner) | ||
|
||
* See above. Note that this is sealed. | ||
|
||
### CompiledRegexRunner (is a RegexRunner) | ||
|
||
* See above. | ||
|
||
## Results | ||
|
||
### Match (public, is a Group) | ||
|
||
* Represents one match of the pattern: there may be several | ||
* Holds a `Regex` in order to call `NextMatch()` | ||
* Created by `RegexRunner` | ||
|
||
### Group (public, is a Capture) | ||
|
||
* Represents one capturing group from the match | ||
* Simple data holder | ||
|
||
### Capture (public) | ||
|
||
* Represents one of the potentially several captures from a capturing group; this is a .NET-only concept. | ||
* Simple data holder | ||
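The distinction between a `Group` and its `Capture`s shows up when a capturing group sits inside a loop. The sketch below uses only public API; the helper name and the sample pattern are illustrative assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptureExample
{
    // Returns every capture made by group 1 during the match, in order.
    public static string[] CapturesOfGroup1(string pattern, string text)
    {
        Match match = Regex.Match(text, pattern);
        Group group = match.Groups[1];
        var values = new string[group.Captures.Count];
        for (int i = 0; i < values.Length; i++)
        {
            values[i] = group.Captures[i].Value;
        }
        return values;
    }

    public static void Main()
    {
        // Group 1 is matched once per loop iteration.
        string pattern = @"(?:(\d+),?)+";
        string text = "10,20,30";

        // All iterations are retained as captures.
        Console.WriteLine(string.Join(", ", CapturesOfGroup1(pattern, text))); // 10, 20, 30

        // The group itself reports only the last capture.
        Console.WriteLine(Regex.Match(text, pattern).Groups[1].Value);         // 30
    }
}
```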
|
||
### MatchCollection (public) | ||
|
||
* Created by `Regex.Matches` | ||
* Lazily provides `Match`es | ||
|
||
### GroupCollection (public) | ||
|
||
* Created by `Match.Groups` | ||
* Lazily creates `Group`s | ||
|
||
### CaptureCollection (public) | ||
|
||
* Created by `Group.Captures` | ||
* Lazily creates `Capture`s | ||
|
||
### RegexParseException (is an ArgumentException) | |
|
||
* Thrown when pattern is invalid | ||
* Contains `RegexParseError` | ||
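Because the parse exception derives from `ArgumentException`, callers can detect an invalid pattern by catching that base type. A minimal sketch follows; `IsValidPattern` is a hypothetical helper, not part of the library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternValidationExample
{
    // Returns true if the pattern parses, false if the parser rejects it.
    public static bool IsValidPattern(string pattern)
    {
        try
        {
            // Construction runs the parser; the result is discarded.
            _ = new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            // Invalid patterns surface as ArgumentException (or a derived type).
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(IsValidPattern(@"(a|b)+")); // True
        Console.WriteLine(IsValidPattern(@"(a|b"));   // False: unterminated group
    }
}
```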
|
||
### RegexMatchTimeoutException (public) | ||
|
||
* Thrown when timeout expires | ||
|
||
## Optimizations | ||
|
||
### Tree optimization | ||
|
||
* Every `RegexNode.AddChild()` calls `Reduce()` to attempt to optimize the subtree as it is being assembled, and parsing ends with a call to `RegexNode.FinalOptimize()` for some optimizations that require the entire tree. The goal is to make a functionally equivalent tree that can produce a more efficient program. With more detailed analysis of the tree and some creativity, more could be done here. | |
|
||
### Testing character classes | ||
|
||
* Testing a character for membership of a character class can take a significant time in aggregate. Numerous optimizations have been made here. For example, originally it used a binary search, and now it attempts to use a bitmap where possible. More improvements here would likely be worthwhile. | ||
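To illustrate the general idea of a bitmap fast path in front of a binary search, here is a self-contained sketch. It is not the code in `RegexCharClass`: the type `CharSetSketch`, its range representation, and the choice of a 128-bit ASCII bitmap are all assumptions made for illustration.

```csharp
using System;

public sealed class CharSetSketch
{
    private readonly (char Lo, char Hi)[] _ranges; // sorted by Lo, non-overlapping
    private readonly ulong _asciiLow;   // bit i set => (char)i is in the set, for i in 0..63
    private readonly ulong _asciiHigh;  // bit i set => (char)(64 + i) is in the set

    public CharSetSketch(params (char Lo, char Hi)[] sortedRanges)
    {
        _ranges = sortedRanges;

        // Precompute the answer for every ASCII character once.
        for (int c = 0; c < 128; c++)
        {
            if (ContainsSlow((char)c))
            {
                if (c < 64) _asciiLow |= 1UL << c;
                else _asciiHigh |= 1UL << (c - 64);
            }
        }
    }

    // Fast path: a shift and a mask for ASCII, binary search otherwise.
    public bool Contains(char c)
    {
        if (c < 64) return (_asciiLow & (1UL << c)) != 0;
        if (c < 128) return (_asciiHigh & (1UL << (c - 64))) != 0;
        return ContainsSlow(c);
    }

    // Slow path: binary search over the sorted ranges.
    public bool ContainsSlow(char c)
    {
        int lo = 0, hi = _ranges.Length - 1;
        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (c < _ranges[mid].Lo) hi = mid - 1;
            else if (c > _ranges[mid].Hi) lo = mid + 1;
            else return true;
        }
        return false;
    }
}
```

For example, `new CharSetSketch(('0', '9'), ('a', 'f'))` answers `Contains('7')` with a single mask test, and only falls back to the search for characters outside ASCII.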
|
||
### Prefix matching | ||
|
||
* If the pattern begins with a literal, `FindFirstChar()` is used to run quickly to the next point in the text that matches that literal, without using the engine. If the literal is a single character, this can use `IndexOf()` which is vectorized; otherwise it uses `RegexBoyerMoore`. Future optimizations could, for example, handle an alternation of leading literals using the Aho–Corasick algorithm; or use `IndexOf` to find a low-probability char before matching the whole literal. These optimizations are likely to most help in the case of a large text, perhaps with few matches, and a pattern with leading large literals. | ||
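The idea of skipping ahead with `IndexOf` and then verifying the literal can be sketched as below. This is an illustration only, not the engine's `FindFirstChar()`: `PrefixScanSketch` and `FindNextCandidate` are hypothetical names, and the sketch anchors on the first character of the literal, whereas the refinement mentioned above would pick a rarer character.

```csharp
using System;

public static class PrefixScanSketch
{
    // Returns the index of the first occurrence of `prefix` in `text`
    // at or after `start`, or -1 if there is none.
    public static int FindNextCandidate(string text, string prefix, int start)
    {
        if (prefix.Length == 0)
        {
            return start <= text.Length ? start : -1;
        }

        int last = text.Length - prefix.Length; // last index where prefix could begin
        int i = start;
        while (i <= last)
        {
            // Jump to the next occurrence of the anchor character.
            i = text.IndexOf(prefix[0], i);
            if (i < 0 || i > last)
            {
                return -1;
            }

            // Verify the whole literal at the candidate position.
            if (string.CompareOrdinal(text, i, prefix, 0, prefix.Length) == 0)
            {
                return i;
            }

            i++;
        }
        return -1;
    }

    public static void Main()
    {
        Console.WriteLine(FindNextCandidate("xxabcxabd", "abd", 0)); // 6
    }
}
```

Only from the returned position would the matching engine itself need to run, which is why this helps most on large texts with few matches.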
|
||
// TODO - more here | ||
|
||
# Tracing and dumping output | ||
|
||
If the engine is built in debug configuration, and `RegexOptions.Debug` is passed, some internal data structures will be written out with `Debug.Write()`. This includes the pattern itself, then `RegexWriter` will write out the input `RegexTree` with its nodes, and the output `RegexCode`. The `RegexBoyerMoore` dumps its tables; this would likely be relevant only if there was a bug in that class. `RegexRunner`s also dump their state as they execute the pattern. `Match` also has the ability to dump state. | |
|
||
For example, if you are working to optimize the `RegexTree` generated from a pattern, this can be a convenient way to visualize the tree without concerning yourself with the subsequent execution. | ||
|
||
When you compile your test program, `RegexOptions.Debug` may not be visible to the compiler: you can use `(RegexOptions)0x0080` instead. | ||
|
||
# Debugging | ||
|
||
// TODO | ||
|
||
# Profiling and benchmarks | ||
|
||
// TODO | ||
|
||
# Test strategy | ||
|
||
// TODO |
Nit: why did you choose dotnet-regular-expressions.md as the file name rather than, say, README.md?
changed