Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: Efficiently matching multiple strings. #69682

Closed
teo-tsirpanis opened this issue May 23, 2022 · 17 comments
Closed

[API Proposal]: Efficiently matching multiple strings. #69682

teo-tsirpanis opened this issue May 23, 2022 · 17 comments
Assignees
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Text.RegularExpressions
Milestone

Comments

@teo-tsirpanis
Copy link
Contributor

Background and motivation

As explained in #62447, we want to use more sophisticated string searching algorithms such as Aho-Corasick and the vectorized Teddy in regexes.

Since these algorithms are quite complex, to avoid duplicating their logic in source-generated Regex code, I propose to add a class that performs such searching efficiently, with the more complex pre-processing being performed at construction-time, which will essentially serve the role of an optimized string.IndexOf(string[]).

API Proposal

namespace System.Text;

public sealed class MultiStringMatcher
{
    public ReadOnlySpan<string> Strings { get; }

    public MultiStringMatcher(ReadOnlySpan<string> strings);
    public MultiStringMatcher(params string[] strings);
    public MultiStringMatcher(MultiStringMatcherOptions options, ReadOnlySpan<string> strings);
    public MultiStringMatcher(MultiStringMatcherOptions options, params string[] strings);

    public (int Index, int StringNumber) Find(ReadOnlySpan<char> text);
    public (int Index, int StringNumber) Find(string text);
}

[Flags]
public enum MultiStringMatcherOptions
{
    None = 0,
    CaseInsensitive = 1
}

Creating a MultiStringMatcher accepts an array or span of the strings it will recognize. The Find method accepts a string or read-only span of characters, and returns the position of the longest leftfost match of one of the strings passed to the constructor, and which of the strings it found. If it does not find anything, it will return (-1, -1). These strings are also available for later inspection through the Strings property.

The constructors are overloaded to accept a MultiStringMatcherOptions enum. Currently it allows case-insensitive searching (can be easily done in Aho-Corasick by adding additional edges to the trie, but Teddy won't be used if enabled), and in the future it may be used to enable dynamically generated code for the Aho-Corasick state transitions, similar to RegexOptions.Compiled.

If an empty array of strings is passed to the constructors, the resulting matcher's Find method will always return (-1, -1). If the same string appears in the constructor many times, the Find method could either return the index of the first, the index of the last, or throw at construction time (I'm not a fan of this option).

API Usage

var matcher = new MultiStringMatcher("foo", "bar", "baz");

Console.WriteLine(matcher.Find("foobar")); // Will print (0, 0)

Alternative Designs

An important design decision lies on whether we want this API to be general-purpose and usable by itself, or to provide only what source-generated Regexes need. I imagine the API reviewers want to pursue the latter direction. In this case the API would be simplified to:

// The namespace changed.
namespace System.Text.RegularExpressions;

public sealed class MultiStringMatcher
{
    // It does not cost much to provide this API but could be removed as well.
    public ReadOnlySpan<string> Strings { get; }

    // We remove the user-friendly superfluous overloads.
    // If we support code-generated Aho-Corasick in the future, this overload will allow trimming it away.
    public MultiStringMatcher(ReadOnlySpan<string> strings);
    public MultiStringMatcher(ReadOnlySpan<string> strings, MultiStringMatcherOptions options);

    // Regexes only need the index, not the specific string that was matched.
    public int Find(ReadOnlySpan<char> text);
}

/// MultiStringMatcherOptions stays the same.

If we want this to be a general-purpose API, we have to also think of the following:

  • Is System.Text really a good fit for this? It mostly has to do with text encoding and formatting. My other thought is System but it's already pretty bloated.
  • How about supporting UTF-8? Teddy will particularly enjoy it since twice the characters could fit in the SIMD registers, but it would make supporting case-insensitive comparisons more complicated though.

And do we actually need case-insensitivity?

Risks

The idea that we would construct an object that performs expensive initialization in source-generated code seems a bit paradoxical, given that a purpose of source generators is to avoid expensive initialization. There's no good answer to this, we have to make sure it's worth the performance benefits and tune it appropriately. It would be great if Roslyn allowed generators adding files private for themselves, but that's a big if.

@teo-tsirpanis teo-tsirpanis added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label May 23, 2022
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label May 23, 2022
@ghost
Copy link

ghost commented May 23, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

As explained in #62447, we want to use more sophisticated string searching algorithms such as Aho-Corasick and the vectorized Teddy in regexes.

Since these algorithms are quite complex, to avoid duplicating their logic in source-generated Regex code, I propose to add a class that performs such searching efficiently, with the more complex pre-processing being performed at construction-time, which will essentially serve the role of an optimized string.IndexOf(string[]).

API Proposal

namespace System.Text;

public sealed class MultiStringMatcher
{
    public ReadOnlySpan<string> Strings { get; }

    public MultiStringMatcher(ReadOnlySpan<string> strings);
    public MultiStringMatcher(params string[] strings);
    public MultiStringMatcher(MultiStringMatcherOptions options, ReadOnlySpan<string> strings);
    public MultiStringMatcher(MultiStringMatcherOptions options, params string[] strings);

    public (int Index, int StringNumber) Find(ReadOnlySpan<char> text);
    public (int Index, int StringNumber) Find(string text);
}

[Flags]
public enum MultiStringMatcherOptions
{
    None = 0,
    CaseInsensitive = 1
}

Creating a MultiStringMatcher accepts an array or span of the strings it will recognize. The Find method accepts a string or read-only span of characters, and returns the position of the longest leftfost match of one of the strings passed to the constructor, and which of the strings it found. If it does not find anything, it will return (-1, -1). These strings are also available for later inspection through the Strings property.

The constructors are overloaded to accept a MultiStringMatcherOptions enum. Currently it allows case-insensitive searching (can be easily done in Aho-Corasick by adding additional edges to the trie, but Teddy won't be used if enabled), and in the future it may be used to enable dynamically generated code for the Aho-Corasick state transitions, similar to RegexOptions.Compiled.

If an empty array of strings is passed to the constructors, the resulting matcher's Find method will always return (-1, -1). If the same string appears in the constructor many times, the Find method could either return the index of the first, the index of the last, or throw at construction time (I'm not a fan of this option).

API Usage

var matcher = new MultiStringMatcher("foo", "bar", "baz");

Console.WriteLine(matcher.Find("foobar")); // Will print (0, 0)

Alternative Designs

An important design decision lies on whether we want this API to be general-purpose and usable by itself, or to provide only what source-generated Regexes need. I imagine the API reviewers want to pursue the latter direction. In this case the API would be simplified to:

// The namespace changed.
namespace System.Text.RegularExpressions;

public sealed class MultiStringMatcher
{
    // It does not cost much to provide this API but could be removed as well.
    public ReadOnlySpan<string> Strings { get; }

    // We remove the user-friendly superfluous overloads.
    // If we support code-generated Aho-Corasick in the future, this overload will allow trimming it away.
    public MultiStringMatcher(ReadOnlySpan<string> strings);
    public MultiStringMatcher(ReadOnlySpan<string> strings, MultiStringMatcherOptions options);

    // Regexes only need the index, not the specific string that was matched.
    public int Find(ReadOnlySpan<char> text);
}

/// MultiStringMatcherOptions stays the same.

If we want this to be a general-purpose API, we have to also think of the following:

  • Is System.Text really a good fit for this? It mostly has to do with text encoding and formatting. My other thought is System but it's already pretty bloated.
  • How about supporting UTF-8? Teddy will particularly enjoy it since twice the characters could fit in the SIMD registers, but it would make supporting case-insensitive comparisons more complicated though.

And do we actually need case-insensitivity?

Risks

The idea that we would construct an object that performs expensive initialization in source-generated code seems a bit paradoxical, given that a purpose of source generators is to avoid expensive initialization. There's no good answer to this, we have to make sure it's worth the performance benefits and tune it appropriately. It would be great if Roslyn allowed generators adding files private for themselves, but that's a big if.

Author: teo-tsirpanis
Assignees: -
Labels:

api-suggestion, area-System.Text.RegularExpressions, untriaged

Milestone: -

@teo-tsirpanis
Copy link
Contributor Author

c.c. @stephentoub @danmoseley

@teo-tsirpanis teo-tsirpanis self-assigned this May 23, 2022
@danmoseley
Copy link
Member

My preference would be the Regex motivated API, as that's what we have the clearest motivation for. We will be able to immediately use and validate that API. It doesn't prevent us extending later if necessary.

One other thought, Find may not be the best name for the method since you're passing in the "thing to be found in". FindIn, perhaps? IndexIn?

@danmoseley
Copy link
Member

danmoseley commented May 23, 2022

Is there a way we can make this generic, so I could search for multiple byte sequences in a span of bytes? Internally, it would not be able to use any algorithm vectorized for char, but Aho-Corasick could work the same.

I haven't thought this through, not considered whether it would be used, just wanted to throw it out there before we bake string into the type name.

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented May 24, 2022

How about supporting UTF-8? Teddy will particularly enjoy it since twice the characters could fit in the SIMD registers, but it would make supporting case-insensitive comparisons more complicated though.

I don't think we should add UTF-8 support at this time. As you point out, UTF-8 casing support is far more complicated than UTF-16 casing support, and there are significant performance considerations with it. We should trial balloon other simpler UTF-8 APIs (like regular case mapping) before taking on something as ambitious as this. That will give us the public feedback we need as to whether the usability and performance is acceptable for our consumers, and it allows us to get a feel for what API shapes have the greatest success rate.

@danmoseley
Copy link
Member

How are things looking @teo-tsirpanis ?

@teo-tsirpanis
Copy link
Contributor Author

I have done an initial implementation of AC and benchmarks are very promising. I've struggled over these days to implement the optimized leftmost matching done in the Rust aho-corasick package, but that could come later (I have implemented in a different way as explained in the linked comment above).

Until this API gets approved, we can try it in non-source-generated regexes. Thankfuly there is #45697 to see how to integrate it in the regex engine.

@danmoseley
Copy link
Member

Thanks for the update @teo-tsirpanis. To confirm, you plan to continue with your next step to wire it up to regex? If so that sounds good.
Do you think we should get this API into review meantime, or wait?

BTW the team in this repo will be transitioning into mostly bug fix mode in mid July (this need not affect community contributions, but ideally anything that needs stabilization time will be in by then), then main will become a .NET 8 branch in mid August - the same plan we followed last year. It's totally fine if this ends up having to be .NET 8, just wanted you to be aware of the timelines in case we can get part of it into .NET 7!

@teo-tsirpanis
Copy link
Contributor Author

Yes, I'm studying the Regex code to figure out how to integrate it. The proposal can be marked as ready for review in the meantime.

@danmoseley
Copy link
Member

Thanks @teo-tsirpanis . The API review will be interested to see what the consumption in regex looks like, in order to demonstrate the API is the right shape. When you have that code working, we can mark ready for review, and expedite if it's necessary.

@joperezr joperezr removed the untriaged New issue has not been triaged by the area owner label Jun 29, 2022
@joperezr joperezr added this to the Future milestone Jun 29, 2022
@teo-tsirpanis
Copy link
Contributor Author

Thinking of this again, this API might not be actually needed, at least for Aho-Corasick. The algorithm's search logic is pretty simple and the regex source generator could emit something like this:

private readonly struct TrieNode {
    public Dictionary<char, int> Children { get; init; }
    public int SuffixLink { get; init; }
    public int DictionaryLink { get; init; }
}

private static readonly TrieNode[] s_trieNodes = new TrieNode[] {
    new TrieNode() {
        Children = new Dictionary() {
            ['a'] = 1,
            ['b'] = 2
        }
    },
    // ...
};

and emit a standard algorithm code that works over s_trieNodes. We also reduce the runtime initialization cost. I haven't looked very carefully into Teddy and can't say if the same applies to it as well.


As a sidenote, the non-source-generated Aho-Corasick implementation is going very well, and I am in the stage of testing and fixing bugs. I expect to open a PR by the end of the week.

@stephentoub
Copy link
Member

stephentoub commented Jun 30, 2022

I expect to open a PR by the end of the week.

Thanks for working on this. It might be counterintuitive, but I would like to avoid adding optimizations to the interpreter that aren't also in Compiled and the source generator, and similarly optimizations to Compiled that aren't in the source generator. It otherwise complicates the story and decision tree about which to use when. With such optimizations we've delayed adding them until they can be added everywhere, even if that means delaying their inclusion in some.

in the stage of testing and fixing bugs

Have you run any perf tests? Just from skimming your linked commits, I expect this will actually regress some patterns due to the lack of any vectorization; given a pattern like "abc|def" today we'll do something like IndexOfAny('c', 'f') to jump ahead, and if those are relatively infrequent in the input text, that will process much faster than processing a trie every character.

@teo-tsirpanis
Copy link
Contributor Author

Thanks for the feedback. Compiled is very easy to support and I'm currently adding this, and the source generator work might take a bit longer. I will submit the PR after all three are supported.

As explained in #62447 (comment), I have compared the existing Regex engine with a separate AC implementation. I yesterday ran it with 2, 5, 8 and 10 strings, and here are the results:

Method WordCount Mean Error StdDev Median Ratio RatioSD
CountWordsRegex 2 1,199.5 us 17.43 us 28.14 us 1,202.1 us 1.00 0.00
CountWordsCompiledRegex 2 180.4 us 4.93 us 14.54 us 175.6 us 0.16 0.01
CountWordsRegal 2 1,443.4 us 28.34 us 38.79 us 1,440.0 us 1.20 0.05
CountWordsRegex 5 2,479.7 us 42.85 us 37.98 us 2,489.3 us 1.00 0.00
CountWordsCompiledRegex 5 589.3 us 11.77 us 31.61 us 587.9 us 0.23 0.01
CountWordsRegal 5 1,284.4 us 25.11 us 44.63 us 1,276.1 us 0.53 0.03
CountWordsRegex 8 4,069.5 us 140.35 us 413.83 us 3,994.8 us 1.00 0.00
CountWordsCompiledRegex 8 825.2 us 16.30 us 16.74 us 820.8 us 0.22 0.01
CountWordsRegal 8 2,005.1 us 39.90 us 99.36 us 2,015.3 us 0.50 0.06
CountWordsRegex 10 4,673.7 us 226.28 us 638.22 us 4,440.4 us 1.00 0.00
CountWordsCompiledRegex 10 874.6 us 18.00 us 53.06 us 872.9 us 0.19 0.02
CountWordsRegal 10 1,985.6 us 37.63 us 33.36 us 1,978.2 us 0.45 0.01

Based on that, if the trie has less than five matches, it will not be used. The trie will be traversed in case we find a common prefix and use it in the LeadingString_*To* mode (at the time of writing this the relevant code has a bug which will be fixed), and if there isn't we try the rest of the RegexFindOptimizations modes.

@stephentoub
Copy link
Member

stephentoub commented Jun 30, 2022

Based on that, if the trie has less than five matches, it will not be used

Have you tried a pattern like (ab|cd|ef|gh|ij|kl)[mn]? The current code should vectorize the search for that [mn], at least for Compiled / source generator, e.g.

private string _str = new string(' ', 1_000_000);

[Benchmark(Baseline = true)]
public bool IsMatch1() => Regex.IsMatch(_str, "(ab|cd|ef|gh|ij|kl)", RegexOptions.Compiled);

[Benchmark]
public bool IsMatch2() => Regex.IsMatch(_str, "(ab|cd|ef|gh|ij|kl)[mn]", RegexOptions.Compiled);
Method Mean Error StdDev Ratio
IsMatch1 730.25 us 11.901 us 13.705 us 1.00
IsMatch2 54.87 us 1.095 us 1.737 us 0.08

@stephentoub
Copy link
Member

stephentoub commented Jun 30, 2022

(My general point is anything currently vectorized should typically be favored over the trie. That's more than just starting sets.)

@teo-tsirpanis
Copy link
Contributor Author

#71588 implements Aho-Corasick without this API by directly emitting the algorithm's logic, bringing the initialization cost to zero and the matching performance to a level this API could never reach. I haven't investigated how easy it would be to do the same with Teddy, but a public API with a surface area specific to regexes that searches over multiple strings does not seem like the right answer, both design-wise and because we wouldn't get away with the initialization cost. Closing.

@teo-tsirpanis teo-tsirpanis closed this as not planned Won't fix, can't repro, duplicate, stale Jul 19, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Aug 19, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Text.RegularExpressions
Projects
None yet
Development

No branches or pull requests

5 participants