Explore using Aho-Corasick in Regex's FindFirstChar #62447

Closed
stephentoub opened this issue Dec 6, 2021 · 20 comments

@stephentoub (Member)

Given an alternation like "abc|def|ghi", we will currently do an IndexOfAny to find the next position for a possible match. At that point, we try to avoid false positives by doing a few set lookups, e.g. we IndexOfAny('a', 'd', 'g'), and then we test whether the next character is in the set [beh]. But specifically for expressions that are small alternations, or that begin with small alternations, or that begin with constructs we can convert to small alternations, we could instead use Aho-Corasick to actually validate whether any of the strings in the alternation are a prefix of the text at the current location. Rust uses a vectorized implementation named Teddy.
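To illustrate, here is a rough sketch of the two-step prefilter described above for "abc|def|ghi"; the helper and its shape are illustrative, not the actual Regex internals:

```csharp
using System;

internal static class PrefilterSketch
{
    // Hypothetical standalone version of the scan described above.
    internal static int FindPossibleStart(ReadOnlySpan<char> text)
    {
        int pos = 0;
        while (pos < text.Length)
        {
            // Step 1: vectorized search for any candidate first character.
            int i = text.Slice(pos).IndexOfAny('a', 'd', 'g');
            if (i < 0)
                return -1;
            pos += i;

            // Step 2: cheap set test on the following character ([beh])
            // to weed out false positives before the full match attempt.
            if (pos + 1 < text.Length && text[pos + 1] is 'b' or 'e' or 'h')
                return pos;

            pos++; // false positive; keep scanning
        }
        return -1;
    }
}
```

An Aho-Corasick-based prefilter would replace step 2 with an actual check that one of the alternation's branches begins at the candidate position.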

@stephentoub stephentoub added this to the 7.0.0 milestone Dec 6, 2021

@danmoseley (Member)

@stephentoub I remember we discussed whether a string.IndexOfAny(string[] ...) would make sense, rather than being regex-internal. If I recall correctly, a concern is that the interesting algorithms have some setup cost, such as a table or fingerprint, which might show up as a perf trap if the text is quite short. (They may also benefit from "policy", such as a table of probable character frequencies, which could show up as a deoptimization depending on the text.) Possible ways to address this could include (1) different paths depending on pattern/text size, (2) accepting multiple texts to amortize the setup cost, or (3) possibly even an API that generates a reusable blob representing the setup (likely not on string). (1) seems appealing to me. One could imagine cases where such an API could be faster, even over a single text, than the simple scan that string.IndexOf(string, ...) does. I'm guessing this API could have fairly wide use beyond ourselves, and could be a great place where we could add value (e.g., vectorization) to speed up a common pattern. Do you have any other thoughts? I'm inclined to create an issue to start a discussion as I don't see an existing one.

@stephentoub (Member, Author)

> If I recall correctly, a concern is that the interesting algorithms have some setup cost, such as a table or fingerprint, which might show up as a perf trap if the text is quite short.

Yes.

> Do you have any other thoughts? I'm inclined to create an issue to start a discussion as I don't see an existing one.

I'd rather focus on getting something working and validated inside of Regex. Regex can do all the precomputation and optimize the provided expression with whatever mechanism it can throw at it. Creating a custom API for a specific class of pattern would necessitate data showing it's so common on such hot paths that any additional overhead associated with Regex is prohibitive.

@danmoseley (Member)

@teo-tsirpanis if you're looking for a meaty project that could give us significant regex perf improvements in 7.0 -- perhaps you're interested in creating a vectorized implementation of Aho-Corasick. Teddy is apparently permissively licensed, so it could be a starting point. We would like to do this, but it doesn't fit on the schedule right now.

@teo-tsirpanis (Contributor)

Thanks for letting me know @danmoseley. It is indeed a very interesting project, but I am pretty busy these days with pressing obligations, including my university classes that started this week and the work on my thesis.

Speaking of this, I had the thought of making this my thesis' subject. I talked with some professors (that was the reason for my delay) and they were willing to accept it, but I was warned that "this is much more than an undergraduate thesis, maybe even a postgraduate thesis" and that "it's not something that will finish in 1-2 months" (I linked your message but forgot to explicitly tell them about Teddy).

To better understand the magnitude of this undertaking, a more detailed list of what needs to be done, as well as a more informed guess of the required time (by adding a Cost: label), would be very helpful for me. If it turns out to be too much work, I will start studying the algorithm and the Regex codebase on a casual basis until my schedule gets lighter (perhaps after the 7.0 snap), and when that time comes I would be more than glad to take a more serious look into this.

@danmoseley (Member)

Yes, to be clear: no pressure whatsoever, just flagging as maybe interesting to you or someone. It's a significant project, but I do not think it should be huge work. For the most part it would be isolated from Regex: it would be implementing an (internal) API essentially like String.IndexOf(string, string, string...) using an already understood algorithm, with a reference implementation. That can be worked on and tested in isolation. The tricky bit will be finding the best way to express it in terms of .NET's vectorization APIs, as well as performance measurement and tuning; such changes often need iteration to try different approaches. Once all that is done, it will need wiring up for Regex to call it, but that should be relatively localized work and we would give pointers.

@stephentoub thoughts?

@stephentoub (Member, Author)

> Once all that is done, it will need wiring up for Regex to call it, but that should be relatively localized work and we would give pointers.

Yes...ish. Source generation is a wrinkle. For that, we'd either need to emit the code for the algorithm, which I don't think we want to do, or we would need to expose some public API the generated code could use. And while we could leave such a thing to just be for the non-source generated regexes, that would complicate the story for when to pick source generation over not.

I think the actual story would need to be some sort of API for it, which we'd then use from all the regex engines.

But, just getting to the point where we could test and demonstrate the potential would be valuable. Figuring out the API from there shouldn't be a big deal. It would likely be something where you pass to a factory all the target strings, and then the resulting object would provide a search method that took a span, returned an index, and potentially also provided the target string (or its id) that was found.

@danmoseley (Member)

That makes sense -- thinking of this as an isolated experiment to implement a high performance String.IndexOf(string, string, string...). Once that is proven to work well, we can stop and choose next steps. It may be that we simply add such an API for the benefit of all.

@danmoseley (Member)

(obviously not that signature)

@danmoseley (Member) commented May 9, 2022

Making this concrete: in discussion with @stephentoub, we arrived at this shape:

```csharp
internal class MultiStringSearcher
{
    public MultiStringSearcher(string[] strings);
    public (int Index, int StringNumber) Find(ReadOnlySpan<char> text);
}
```

Our first implementation would be to use Aho-Corasick with the whole DFA precomputed (looks like #45697 was a start). There would be no grow-up NFA mode: we would presumably fall back to naive search if there are "too many" search strings. We assume that this would be beneficial enough to Regex to stand alone.
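For reference, a compact sketch of the classic construction (a trie plus BFS-computed failure links, i.e. the NFA form rather than the fully precomputed DFA mentioned above); this is illustrative, not the proposed implementation:

```csharp
using System.Collections.Generic;

// Illustrative Aho-Corasick construction. (The DFA form mentioned above
// would additionally precompute a full transition table per state.)
internal sealed class AhoCorasick
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new();
        public Node? Fail;
        public int WordId = -1;   // pattern index if a pattern ends here
        public int WordLength;
    }

    private readonly Node _root = new();

    public AhoCorasick(string[] patterns)
    {
        // Phase 1: build the trie of all patterns.
        for (int id = 0; id < patterns.Length; id++)
        {
            Node node = _root;
            foreach (char c in patterns[id])
            {
                if (!node.Children.TryGetValue(c, out Node? next))
                    node.Children[c] = next = new Node();
                node = next;
            }
            node.WordId = id;
            node.WordLength = patterns[id].Length;
        }

        // Phase 2: BFS to compute failure links -- for each node, the
        // longest proper suffix of its path that is also a trie path.
        var queue = new Queue<Node>();
        foreach (Node child in _root.Children.Values)
        {
            child.Fail = _root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            Node node = queue.Dequeue();
            foreach ((char c, Node child) in node.Children)
            {
                Node fail = node.Fail!;
                while (fail != _root && !fail.Children.ContainsKey(c))
                    fail = fail.Fail!;
                child.Fail = fail.Children.TryGetValue(c, out Node? f) && f != child
                    ? f
                    : _root;
                queue.Enqueue(child);
            }
        }
    }
}
```

A search loop over this structure appears later in the thread.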

Teddy is a natural next step, cutting over from Aho-Corasick when the hardware supports it and there are relatively few search strings (Rust's regex seems to cap at 32). As a detail, the Rust Teddy crate itself uses Rabin-Karp for very short strings or (I guess) leftovers.
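Sketching the kind of tiering described here (the 32-pattern Teddy cap mirrors Rust's regex crate; the Aho-Corasick cutoff is a placeholder guess, not a measured value):

```csharp
using System.Runtime.Intrinsics.X86;

// Hypothetical engine selection; names and thresholds are illustrative.
internal enum MultiStringEngine { Teddy, AhoCorasick, NaiveIndexOf }

internal static class EngineSelector
{
    internal static MultiStringEngine Choose(string[] patterns)
    {
        // Teddy needs SIMD byte shuffles (SSSE3 on x64) and works best
        // with relatively few patterns.
        if (Ssse3.IsSupported && patterns.Length <= 32)
            return MultiStringEngine.Teddy;

        return patterns.Length <= 1024
            ? MultiStringEngine.AhoCorasick    // DFA memory grows with pattern count
            : MultiStringEngine.NaiveIndexOf;  // fall back to per-pattern IndexOf
    }
}
```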

Note, great credit is due to @BurntSushi and others for their work on the impressive Rust crates which help us chart our basic path here.

@danmoseley danmoseley self-assigned this May 9, 2022
@BurntSushi
Since I was pinged, I'll share some very high level thoughts. Please take these with a heap of salt because I have approximately zero familiarity with .NET's regex engine or constraints. :-)

  • Rust's regex crate Teddy implementation is part of the aho-corasick crate. The teddy crate itself is not used.
  • Remember that I am not the author of Teddy. I pulled it out of the Hyperscan project (which is where it was born). Its original author is Geoffrey Langdale.
  • In terms of the various constants I used (like when exactly to switch over from Teddy to Aho-Corasick, or when to use Rabin-Karp), I'd definitely recommend doing your own experimentation there and finding the thresholds that work for you. I do recommend having a diverse benchmark set. It is really easy to get stuck in local optima with things like Teddy for specific benchmarks, only to see that speed has regressed quite significantly on some other benchmark.
  • It seems quite plausible to me that Aho-Corasick's NFA formulation would be useful to you as well. The DFA formulation will be faster, but also eat up quite a bit of memory. The NFA is still likely to be faster than any non-DFA regex engine though.
  • Beware that an implementation based on the textbook description of Aho-Corasick will report all possible matches, and those matches will not necessarily be consistent with what would be reported by a backtracking regex engine. For example, consider the regex samwise|sam against the haystack samwise. A backtracking regex engine will report samwise as the match, but Aho-Corasick's standard formulation will report sam (or even sam in addition to samwise if you permit overlapping matches). You can actually change the construction of the Aho-Corasick automaton itself to report matches that are consistent with how a backtracking regex engine will report them. See aho-corasick::MatchKind for more details. The source has more details in terms of implementation. (Perl also uses Aho-Corasick, although I've been unable to make heads or tails of the code.)
  • The above commentary about how Aho-Corasick reports matches is also relevant for Teddy. (A short illustration follows this list.)
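A quick illustration of the semantics gap described in the bullets above, using .NET's backtracking engine (output shown in comments):

```csharp
using System;
using System.Text.RegularExpressions;

// .NET's backtracker is leftmost-first: alternation order decides the match.
Console.WriteLine(Regex.Match("samwise", "samwise|sam").Value); // prints "samwise"
Console.WriteLine(Regex.Match("samwise", "sam|samwise").Value); // prints "sam"
// Textbook Aho-Corasick instead reports "sam" (the match that ends first)
// regardless of pattern order, unless construction is adjusted as described above.
```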

That's all I've got for now. Happy to answer questions!

@teo-tsirpanis (Contributor)

I've started working on this. I will first create a separate project that implements the Teddy algorithm using the cross-platform vector intrinsics, falling back to Aho-Corasick. Once benchmarks show an improvement over the existing Regex engines, we will look at incorporating the changes.

@teo-tsirpanis teo-tsirpanis self-assigned this May 21, 2022
@danmoseley (Member)

@teo-tsirpanis sounds great. For the sake of maximizing the chance of success and making digestible PRs, I wonder whether it would make sense to put up a change with just A-C first, which I expect would be fast enough to stand alone.

A general purpose API is proposed above. If it's to satisfy the needs of regex it will presumably need to return the leftmost match (as BurntSushi alludes to above; you'll need to figure out from the Rust crate how he achieves this leftmost match behavior, since it's not the textbook A-C behavior). Presumably leftmost match is a reasonable behavior for a general purpose API.

Also, regex (at least as it currently exists) only needs the start index of the leftmost match. It doesn't need the matched string, and it certainly doesn't need any other strings (overlapping) matched at that position. Given that, a general purpose API might best only return the index (the sketch API above suggested both index and match). If I understand A-C correctly, this means building it can be quicker/simpler: the automaton will not need "suffix links" (aka "dictionary links"), only the "failure links".

So the ordering I'm imagining would be something like:

  1. Implement an API like the above, but only returning the (leftmost) index.

  2. Demonstrate it is fast enough for some interesting subset of inputs (where the construction time is amortized such that it is significantly better than looping over IndexOf(string), as one must do today) to be worth adding.

  3. Propose the API and get it approved. Merge the new API and implementation.

  4. If time allows, update regex to take advantage. Demonstrate it's an improvement. Merge that.

  5. If time and interest allow, continue to improve the implementation behind this API, either by experimenting with DFA vs NFA, with Teddy, or whatever.

Getting the public API in first is a nice sub-step, and it also avoids the need to emit the implementation from the regex source generator, which can only call public API.

@danmoseley danmoseley removed their assignment May 22, 2022
@danmoseley (Member)

(Of course feel free to disagree with the suggestion above if you think there's a better ordering that could still avoid one big PR.)

@teo-tsirpanis (Contributor) commented May 22, 2022

Yes, starting with AC and leaving Teddy for later is a good idea.

> If it's to satisfy the needs of regex it will presumably need to return the leftmost match (as BurntSushi alludes to above; you'll need to figure out from the Rust crate how he achieves this leftmost match behavior, since it's not the textbook A-C behavior).

My implementation remembers the match's location upon encountering one and keeps looping over the characters, accepting subsequent matches only if they occur before the last one accepted. If we return to the root node and have seen a match, we stop the loop and return it. The Rust implementation does this by modifying the trie construction; I will investigate whether it's worth it.
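Roughly, a search loop along those lines might look like the following, extending the AhoCorasick sketch earlier in the thread (illustrative, not teo-tsirpanis's actual code; the failure chain is walked at each step so suffix matches are not missed):

```csharp
// Illustrative search: track the leftmost match start seen so far, and
// stop early once we fall back to the root with a match in hand (no
// in-progress partial match can start earlier at that point).
public (int Index, int WordId) Find(ReadOnlySpan<char> text)
{
    Node state = _root;
    int bestIndex = -1, bestId = -1;

    for (int i = 0; i < text.Length; i++)
    {
        char c = text[i];
        while (state != _root && !state.Children.ContainsKey(c))
            state = state.Fail!;

        if (state == _root && !state.Children.ContainsKey(c) && bestIndex >= 0)
            return (bestIndex, bestId); // back at root with a match recorded

        if (state.Children.TryGetValue(c, out Node? next))
            state = next;

        // Walk the failure chain so patterns ending here as suffixes
        // (e.g. "oob" while on the "foob" path) are also considered.
        for (Node t = state; t != _root; t = t.Fail!)
        {
            if (t.WordId >= 0)
            {
                int start = i - t.WordLength + 1;
                if (bestIndex < 0 || start < bestIndex)
                    (bestIndex, bestId) = (start, t.WordId);
            }
        }
    }
    return (bestIndex, bestId);
}
```

On the [oob, foobar] example raised below: this loop records oob (start 1) while on the foob path, but since it never falls back to the root it keeps scanning and ends up returning foobar's start (0).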

> Given that, a general purpose API might best only return the index (the sketch API above suggested both index and match).

I think that in a general purpose API returning the matched string is useful; it will be included in my proposal which will be filed soon.


I also ran some benchmarks. I took a random Project Gutenberg book and counted occurrences of the $N \in \{10, 100, 500, 1000\}$ most common English words using interpreted and compiled regexes, and Aho-Corasick. The results are promising:

```
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1706 (21H2)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100-preview.5.22267.11
  [Host]     : .NET 7.0.0 (7.0.22.26611), X64 RyuJIT
  DefaultJob : .NET 7.0.0 (7.0.22.26611), X64 RyuJIT
```

| Method | WordCount | Mean | Error | StdDev | Median | Ratio | RatioSD |
|--- |---:|---:|---:|---:|---:|---:|---:|
| CountWordsRegex | 10 | 4,043.2 us | 42.82 us | 40.05 us | 4,048.1 us | 1.00 | 0.00 |
| CountWordsCompiledRegex | 10 | 843.0 us | 15.47 us | 14.47 us | 838.6 us | 0.21 | 0.00 |
| CountWordsRegal | 10 | 2,158.4 us | 42.78 us | 109.67 us | 2,109.3 us | 0.53 | 0.03 |
| CountWordsRegex | 100 | 22,434.8 us | 443.59 us | 865.19 us | 22,075.0 us | 1.00 | 0.00 |
| CountWordsCompiledRegex | 100 | 4,408.6 us | 86.45 us | 146.80 us | 4,350.1 us | 0.20 | 0.01 |
| CountWordsRegal | 100 | 3,042.6 us | 57.39 us | 110.57 us | 3,044.9 us | 0.14 | 0.01 |
| CountWordsRegex | 500 | 25,349.7 us | 460.47 us | 384.51 us | 25,325.1 us | 1.00 | 0.00 |
| CountWordsCompiledRegex | 500 | 8,939.9 us | 108.08 us | 101.10 us | 8,928.6 us | 0.35 | 0.01 |
| CountWordsRegal | 500 | 3,526.6 us | 70.30 us | 65.75 us | 3,508.4 us | 0.14 | 0.00 |
| CountWordsRegex | 1000 | 24,369.6 us | 167.28 us | 130.60 us | 24,360.5 us | 1.00 | 0.00 |
| CountWordsCompiledRegex | 1000 | 53,773.9 us | 768.32 us | 681.10 us | 53,771.9 us | 2.21 | 0.03 |
| CountWordsRegal | 1000 | 3,598.2 us | 67.88 us | 63.49 us | 3,587.1 us | 0.15 | 0.00 |

@BurntSushi commented May 22, 2022

> My implementation remembers the match's location upon encountering one and keeps looping over the characters, accepting subsequent matches only if they occur before the last one accepted. If we return to the root node and have seen a match, we stop the loop and return it. The Rust implementation does this by modifying the trie construction; I will investigate whether it's worth it.

How does it handle the case when you have the patterns [oob, foobar] and search against foobar? In that case, oob is found as a match before foobar is, but foobar is the correct "leftmost longest" match. So you have to know to keep looking in cases like that. Maybe it works because you aren't returning to the root node in this case?

Another thing to consider is whether you plan on building the DFA variant of AC. In that case, you can't really reason about root nodes and failure transitions at search time.
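To make the DFA point concrete, here is a minimal sketch of a DFA-form search loop (the transition-table construction is elided, and the byte-wide alphabet mapping is a toy assumption):

```csharp
using System;

// In the DFA form, failure links are compiled into a dense transition
// table at build time, so the scan is one table lookup per character and
// there is no notion of "falling back" at search time.
internal static class AcDfaSketch
{
    internal static int FindFirstMatchEnd(int[,] next, bool[] isMatch, ReadOnlySpan<char> text)
    {
        int state = 0; // root
        for (int i = 0; i < text.Length; i++)
        {
            state = next[state, text[i] & 0xFF]; // toy byte-wide alphabet mapping
            if (isMatch[state])
                return i; // offset where some pattern ends
        }
        return -1;
    }
}
```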

@teo-tsirpanis (Contributor)

> Maybe it works because you aren't returning to the root node in this case?

Yes, exactly; it returns if it reaches the root or the input ends. We can reconsider this once we come to implementing the DFA variant.

@joperezr (Member)

Even though this is being worked on now, it is not tracked as a must-have for 7.0, so changing the milestone accordingly.

@joperezr joperezr modified the milestones: 7.0.0, Future Jul 18, 2022
@joperezr joperezr modified the milestones: Future, 8.0.0 Sep 30, 2022
@steveharter (Member)

Moving to 9.0

@steveharter steveharter modified the milestones: 8.0.0, 9.0.0 Aug 7, 2023
@stephentoub (Member, Author)

Superseded by #85693.
