Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimized repeated string search API #44794

Closed
GSPP opened this issue Nov 17, 2020 · 12 comments
Closed

Optimized repeated string search API #44794

GSPP opened this issue Nov 17, 2020 · 12 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Runtime
Milestone

Comments

@GSPP
Copy link

GSPP commented Nov 17, 2020

Problem:

It is a fairly common operation to search for the same string many times. For example, you might want so search 1 million text documents for the same substring.

Currently, this is easy to do by simply calling IndexOf repeatedly. But there is a way to do this much faster: There are string search algorithms which are super fast but which require a preprocessing step specific for the string to be found. Some examples:

This is especially relevant since .NET Core switched to ICU for string processing. ICU has a high per-search cost with reduced per-character cost. This has led to performance regressions (#40942). With the right API, it could lead to performance gains instead.

Proposal:

I propose adding a new API that acknowledges this two phase prepare/search model:

public abstract class StringSearcher
{
    //Prepare search algorithm:
    public static StringSearcher Create(ReadOnlySpan<char> stringToFind, StringComparison stringComparison);
    
    //Execute search:
    public abstract int IndexOf(ReadOnlySpan<char> stringToSearch);
    public abstract int LastIndexOf(ReadOnlySpan<char> stringToSearch);
    public abstract bool Contains(ReadOnlySpan<char> stringToSearch);
    public abstract bool StartsWith(ReadOnlySpan<char> stringToSearch);
    public abstract bool EndsWith(ReadOnlySpan<char> stringToSearch);
}

There would be no functional change. This is purely a performance benefit. This benefit can be large when the same string is to be found many times. In particular, the performance loss from switching to ICU can be alleviated in certain cases.

I made the StringSearcher class abstract so that user code can provide its own algorithms. There could be rich community innovation.

Possible extension: It could be useful to be able to search for many strings at once. Certain algorithms can do that efficiently (e.g. trie based ones).

Finding all matches in a document is a fairly common task. This can be made fast and convenient:

//Framework code:
public abstract class StringSearcher {
    //OccurrenceEnumerator equivalent to IEnumerable<(int Index, int Length)>
    public OccurrenceEnumerator EnumerateOccurrences(ReadOnlySpan<char> stringToSearch);
}


//Initialize in application code:
static readonly StringSearcher searcher = ...; 


//Run search algorithm:
foreach ((int index, int length) in searcher.EnumerateOccurrences("abc"))
    ...

I believe that string searching is such a common and performance-sensitive task that it would warrant a new optimized API.

@GSPP GSPP added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Nov 17, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Nov 17, 2020
@Dotnet-GitSync-Bot
Copy link
Collaborator

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@huoyaoyuan
Copy link
Member

It is a fairly common operation to search for the same string many times.

I'd argue about this. Huge text could be structured or tokenized.

Have you examined compiled regex for this case? This is a subset of regex.

@GSPP
Copy link
Author

GSPP commented Nov 17, 2020

Regular expressions are a logical suggestion, but:

  1. I believe that compiled regular expressions are not as fast as IndexOf by any means.
  2. There are escaping issues.
  3. Constructing a compiled regex is extremely expensive (does it not create and load a fresh assembly?). I imagine StringSearcher to be far cheaper to instantiate.
  4. Operations such as StartsWith (and the others I listed) are not directly supported. They require generating regex syntax.

@KalleOlaviNiemitalo
Copy link

public int IndexOf(ReadOnlySpan<char> stringToSearch);
public int LastIndexOf(ReadOnlySpan<char> stringToSearch);
public bool Contains(ReadOnlySpan<char> stringToSearch);
public bool StartsWith(ReadOnlySpan<char> stringToSearch);
public bool EndsWith(ReadOnlySpan<char> stringToSearch);

Some of these should presumably be abstract.

I'd like out int matchLength or equivalent, as in CompareInfo.IndexOf. That is not needed for ordinal search but can be useful for linguistic search.

@GSPP
Copy link
Author

GSPP commented Nov 17, 2020

@KalleOlaviNiemitalo Correct, all of them should be abstract. I fixed it.

You are also right about the out int matchLength. There might well be more methods and overloads needed. For now, I'll keep this proposal simple so that it's easy to understand. If the team signals interest, there can be a formal API proposal.

@tannergooding tannergooding removed the untriaged New issue has not been triaged by the area owner label Jul 12, 2021
@tannergooding tannergooding added this to the Future milestone Jul 12, 2021
@stephentoub
Copy link
Member

stephentoub commented Mar 14, 2022

I believe that compiled regular expressions are not as fast as IndexOf by any means.

RegexCompiler / source generator just use IndexOf now, e.g. for:

[RegexGenerator(@"abc")]
private static partial Regex SearchAbc();

you get:

private sealed class Runner : RegexRunner
{
    // Description:
    // ○ Match the string "abc".

    /// <summary>Scan the <paramref name="inputSpan"/> starting from base.runtextstart for the next match.</summary>
    /// <param name="inputSpan">The text being scanned by the regular expression.</param>
    protected override void Scan(ReadOnlySpan<char> inputSpan)
    {
        if (TryFindNextPossibleStartingPosition(inputSpan))
        {
            // The search in TryFindNextPossibleStartingPosition performed the entire match.
            int start = base.runtextpos;
            int end = base.runtextpos = start + 3;
            base.Capture(0, start, end);
        }
    }

    /// <summary>Search <paramref name="inputSpan"/> starting from base.runtextpos for the next location a match could possibly start.</summary>
    /// <param name="inputSpan">The text being scanned by the regular expression.</param>
    /// <returns>true if a possible match was found; false if no more matches are possible.</returns>
    private bool TryFindNextPossibleStartingPosition(ReadOnlySpan<char> inputSpan)
    {
        int pos = base.runtextpos;

        // Validate that enough room remains in the input to match.
        // Any possible match is at least 3 characters.
        if (pos <= inputSpan.Length - 3)
        {
            // The pattern begins with a literal "abc". Find the next occurrence.
            // If it can't be found, there's no match.
            int i = inputSpan.Slice(pos).IndexOf("abc");
            if (i >= 0)
            {
                base.runtextpos = pos + i;
                return true;
            }
        }

        // No match found.
        NoStartingPositionFound:
        base.runtextpos = inputSpan.Length;
        return false;
    }
}

and RegexCompiler spits out almost an identical implementation, just with IL instead of C#.

Constructing a compiled regex is extremely expensive (does it not create and load a fresh assembly?).

It does not create and load a new assembly. It uses DynamicMethod.

I don't currently see enough value in this proposal over what's already possible.

@danmoseley
Copy link
Member

Also the regex generator is free to take advantage in future of string searching algorithms that have a significant set up cost, since it can pay that cost at compilation time.

I think this use case is satisfied by the new generator. I will close.

@danmoseley
Copy link
Member

For fairness I should have said: I'm assuming a constant string, otherwise the generator of course won't help. @GSPP is that a fair assumption?

@stephentoub
Copy link
Member

assuming a constant string, otherwise the generator of course won't help

But RegexCompiler would, as would RegexInterpreter, which also uses IndexOf.

@GSPP
Copy link
Author

GSPP commented Mar 16, 2022

@danmoseley It could be a constant, a once-initialized string, or a rarely changing string. Of course, it's also possible to use this "create once use often" idea to search for multiple strings (this was not proposed here).

If the team does not want to pursue this I'm perfectly fine with closing this issue. I have noticed many closures over the past few days and it seems like a cleanup is underway.

Although I have to say, I found this to be a pretty useful idea. I have so often needed to repeatedly search for the same string (or the same set of strings). I've done lots of text analysis over the years.

@danmoseley
Copy link
Member

@GSPP thanks. I believe we should wait until we get clear use cases where regex is not sufficient.

@danmoseley
Copy link
Member

@GSPP it doesn't fit your scenario here but you may be interested in #62447.

@ghost ghost locked as resolved and limited conversation to collaborators Apr 15, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Runtime
Projects
None yet
Development

No branches or pull requests

8 participants