[API Proposal]: Regex.EnumerateSplits #100369

Open
stephentoub opened this issue Mar 27, 2024 · 3 comments
Labels
api-ready-for-review API is ready for review, it is NOT ready for implementation area-System.Text.RegularExpressions
@stephentoub (Member) commented Mar 27, 2024

Background and motivation

In .NET 7, we added the EnumerateMatches methods to enable amortized allocation-free matching. However, Regex.Split is handy for finding the gaps between matches, and achieving that with EnumerateMatches is non-trivial, so developers fall back to the more expensive Split.
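To illustrate the bookkeeping involved, here is a minimal sketch of finding the gaps with the existing EnumerateMatches API. GapsBetweenMatches is a hypothetical helper name, not part of any library; the sketch assumes left-to-right matching and collects the ranges into a List purely for readability, which gives up the allocation-free property.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not a library API): returns the ranges between matches.
// Assumes the regex was not constructed with RegexOptions.RightToLeft.
static List<Range> GapsBetweenMatches(Regex regex, ReadOnlySpan<char> input)
{
    var gaps = new List<Range>();
    int previousEnd = 0;

    foreach (ValueMatch match in regex.EnumerateMatches(input))
    {
        // The gap runs from the end of the previous match to the start of this one.
        gaps.Add(previousEnd..match.Index);
        previousEnd = match.Index + match.Length;
    }

    // Whatever follows the last match is the final gap.
    gaps.Add(previousEnd..input.Length);
    return gaps;
}

string text = "a, b,c";
foreach (Range gap in GapsBetweenMatches(new Regex(@",\s*"), text))
{
    Console.WriteLine(text[gap]);
}
```

The caller has to track the end of the previous match and remember the trailing gap, which is the part a dedicated split enumerator would take over.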

API Proposal

namespace System.Text.RegularExpressions;

public class Regex
{
    public static ValueSplitEnumerator EnumerateSplits(ReadOnlySpan<char> input, [StringSyntax(StringSyntaxAttribute.Regex)] string pattern);
    public static ValueSplitEnumerator EnumerateSplits(ReadOnlySpan<char> input, [StringSyntax(StringSyntaxAttribute.Regex, nameof(options))] string pattern, RegexOptions options);
    public static ValueSplitEnumerator EnumerateSplits(ReadOnlySpan<char> input, [StringSyntax(StringSyntaxAttribute.Regex, nameof(options))] string pattern, RegexOptions options, TimeSpan matchTimeout);

    public ValueSplitEnumerator EnumerateSplits(ReadOnlySpan<char> input);
    public ValueSplitEnumerator EnumerateSplits(ReadOnlySpan<char> input, int count);
    public ValueSplitEnumerator EnumerateSplits(ReadOnlySpan<char> input, int count, int startat);
    
    public ref struct ValueSplitEnumerator
    {
        public readonly ValueSplitEnumerator GetEnumerator();
        public bool MoveNext();
        public readonly Range Current { get; }
    }
}

API Usage

Regex regex = ...;
ReadOnlySpan<char> input = ...;
foreach (Range range in regex.EnumerateSplits(input))
{
    ReadOnlySpan<char> word = input[range];
    ...
}

Alternative Designs

  • Whereas EnumerateMatches yields custom ValueMatch instances, this yields Ranges. We designed ValueMatch to accommodate a future where it could expose capture information (it doesn't today), but for splits there's no such additional info, just the range between matches. Using Range also matches the new span.Split methods added in .NET 8.
  • The overloads exactly match what's exposed for EnumerateMatches, just returning a ValueSplitEnumerator instead of a ValueMatchEnumerator.
  • There are two behavioral differences from Split. 1) For some reason, Split includes not only the splits between matches but also any capture groups from the matches; that is unintuitive and would add a lot of overhead and complication to a span/enumerator-based API that's amortized allocation-free, so it's not included in EnumerateSplits. 2) If RightToLeft is specified, Split reverses the array so that the results are still left-to-right, but because EnumerateSplits yields the splits as they're found, its results are right-to-left with that option.
  • The overloads accepting int count are less important with EnumerateSplits, as a caller can always choose to stop iterating. However, they're included for two reasons: 1) to keep the overload shape the same as Split's, so that someone switching from the input, count overload to EnumerateSplits doesn't implicitly start calling an input, startat overload, and 2) to keep the behavior the same for the last split, which, when count is smaller than the actual number of splits, ends up including all of the remainder of the input.
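For reference, the existing Split behaviors mentioned in these points can be observed with a short sketch. It uses only the current Regex.Split API; the inputs and patterns are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// A capture group in the pattern makes Split interleave the captured text with the pieces.
string[] withCaptures = Regex.Split("a-b-c", "(-)");
Console.WriteLine(string.Join("|", withCaptures)); // a|-|b|-|c

// With RightToLeft, Split still returns the pieces in left-to-right order.
string[] rightToLeft = Regex.Split("a-b-c", "-", RegexOptions.RightToLeft);
Console.WriteLine(string.Join("|", rightToLeft)); // a|b|c

// With a count, the last element keeps the whole remainder of the input.
string[] limited = new Regex("-").Split("a-b-c-d", 2);
Console.WriteLine(string.Join("|", limited)); // a|b-c-d
```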

Risks

No response

@stephentoub stephentoub added api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Text.RegularExpressions labels Mar 27, 2024
@stephentoub stephentoub added this to the 9.0.0 milestone Mar 27, 2024
@stephentoub stephentoub self-assigned this Mar 27, 2024
Contributor

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

@iSazonov (Contributor)

Looking at this example, I feel that this range (1) makes sense only for this input, (2) lives no longer than the input, and (3) is only needed to get a piece of the input; in the end it's just a loop over all the elements. So would the following be possible?

Regex regex = ...;
ReadOnlySpan<char> input = ...;
foreach (ReadOnlySpan<char> word in regex.EnumerateSplits(input))
{
    ...
}

@stephentoub (Member, Author)

That unnecessarily restricts what you can do if the input is a string, a ReadOnlyMemory<char>, etc. This is a Range for the same reasons https://learn.microsoft.com/en-us/dotnet/api/system.memoryextensions.split?view=net-8.0 uses Range.
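As a rough illustration of that flexibility, the sketch below uses the existing .NET 8 MemoryExtensions.Split overload that fills a Span<Range>, then applies the same ranges to a string, a ReadOnlyMemory<char>, and a span. The sample input is made up for the example.

```csharp
using System;

string text = "alpha,beta,gamma";
ReadOnlyMemory<char> memory = text.AsMemory();

// MemoryExtensions.Split (.NET 8) writes the range of each piece into a caller-supplied span.
Span<Range> ranges = stackalloc Range[3];
int count = text.AsSpan().Split(ranges, ',');

for (int i = 0; i < count; i++)
{
    Range range = ranges[i];
    (int offset, int length) = range.GetOffsetAndLength(text.Length);

    // The same range slices whichever representation of the input the caller holds.
    ReadOnlySpan<char> fromSpan = text.AsSpan(offset, length);
    ReadOnlyMemory<char> fromMemory = memory.Slice(offset, length);
    string fromString = text[range];

    Console.WriteLine($"{fromString} (span: {fromSpan.Length} chars, memory: {fromMemory.Length} chars)");
}
```

An enumerator that yielded ReadOnlySpan<char> directly would only cover the first of these three cases.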

@stephentoub stephentoub added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Mar 31, 2024