
Huge performance regression in Regex.Replace(string, string) #44808

Closed
iSazonov opened this issue Nov 17, 2020 · 29 comments

Comments

@iSazonov
Contributor

Description

In the PowerShell repository we received a report that a user script runs significantly slower on PowerShell 7.1 (based on .NET 5.0) than on PowerShell 7.0 (based on .NET Core 3.1).

After some investigation, we found that the script makes many calls to the Regex.Replace(string, string) method and puts heavy pressure on the GC with ReadOnlyMemory allocations.

Please see PowerShell/PowerShell#14087 for details. PerfView files are attached there.

Regression?

Yes, it is a regression in .NET 5.0 relative to .NET Core 3.1.

@iSazonov iSazonov added the tenet-performance Performance related issue label Nov 17, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Text.RegularExpressions untriaged New issue has not been triaged by the area owner labels Nov 17, 2020
@ghost

ghost commented Nov 17, 2020

Tagging subscribers to this area: @eerhardt, @pgovind, @jeffhandley
See info in area-owners.md if you want to be subscribed.


@stephentoub
Member

How many replacements are being performed? If we're seeing tons of ReadOnlyMemory<char>[] being created from Regex.Replace, that suggests the size of the array is too big for ArrayPool to cache, which suggests over a million replacements are being performed as part of a single Regex.Replace?
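For context, a rough sketch of the pooling behavior being described (the exact size threshold of ArrayPool&lt;T&gt;.Shared is an implementation detail, so the numbers below are illustrative assumptions):

```csharp
using System;
using System.Buffers;

class PoolLimitSketch
{
    static void Main()
    {
        // Small buffers are cached: after Return, a later Rent of the
        // same size can hand back the same array instance.
        char[] small = ArrayPool<char>.Shared.Rent(1024);
        ArrayPool<char>.Shared.Return(small);
        char[] smallAgain = ArrayPool<char>.Shared.Rent(1024);
        Console.WriteLine(ReferenceEquals(small, smallAgain)); // typically True

        // Requests above the pool's maximum bucket size are not cached:
        // every Rent allocates a fresh array the GC must collect, which
        // matches the "tons of ReadOnlyMemory<char>[]" symptom above.
        char[] huge = ArrayPool<char>.Shared.Rent(10_000_000);
        ArrayPool<char>.Shared.Return(huge);
        char[] hugeAgain = ArrayPool<char>.Shared.Rent(10_000_000);
        Console.WriteLine(ReferenceEquals(huge, hugeAgain)); // typically False
    }
}
```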

@stephentoub
Member

stephentoub commented Nov 17, 2020

Also, can you put together a standalone, simple C# repro focused on direct use of Regex.Replace? Such investigations are much easier when they come in such a form.
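A minimal standalone repro in that spirit might look like the following; the patterns and input line are hypothetical stand-ins for the PowerShell script's data:

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class ReplaceRepro
{
    static void Main()
    {
        // A few simple literal-style patterns, mimicking the real script's
        // chain of -replace operators; each result feeds the next call.
        string[] patterns = { @"127\.0\.0\.1", @"0\.0\.0\.0", @"www\.", @"\|\|", @"\^" };

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 100_000; i++)
        {
            string line = "127.0.0.1 www.example.com||^tracker";
            foreach (string pattern in patterns)
                line = Regex.Replace(line, pattern, ""); // static overload, as in the report
        }
        sw.Stop();
        Console.WriteLine(sw.Elapsed);
    }
}
```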

@iSazonov
Contributor Author

For the PerfView results I used a data file with 100,000 lines (the first lines of the original file shared in the PowerShell issue by its author), and the test script has 54 replace operators called per line. Line size is < 70 chars.

I could try to prepare C# code but only tomorrow (my time zone is UTC+5).

@stephentoub
Member

I could try to prepare C# code but only tomorrow (my time zone is UTC+5).

That'd be helpful. Thanks.

@stephentoub
Member

stephentoub commented Nov 17, 2020

the test script has 54 replace operators called per line. Line size is < 70 chars

It's calling Regex.Replace per line, so the input to each Replace call is at most 70 characters? If so this is very surprising. Is there any parallelism being employed, or is all use of regex sequential?

@iSazonov
Contributor Author

the test script has 54 replace operators called per line. Line size is < 70 chars

It's calling Regex.Replace per line, so the input to each Replace call is at most 70 characters?

Here are the statistics for the input file:

Count Length
----- ----
    1 0
    2 1
   14 5
  135 6
  226 7
  539 8
 1029 9
 1487 10
 2058 11
 3904 12
 4415 13
 4255 14
 5490 15
 4947 16
 5059 17
 4703 18
 5490 19
 6525 20
 4354 21
 7114 22
 3285 23
 4032 24
 2443 25
 1532 26
 8881 27
 2661 28
 1747 29
 1498 30
 1179 31
 2989 32
  684 33
  917 34
 1195 35
  502 36
  439 37
  313 38
  364 39
  296 40
  237 41
  147 42
  405 43
   85 44
  185 45
  263 46
   73 47
   83 48
   85 49
   62 50
   54 51
   99 52
  221 53
   40 54
   33 55
   90 56
   95 57
   22 58
   41 59
  106 60
  275 61
    8 62
   24 63
   10 64
   14 65
   58 66
   14 67
    8 68
   11 69
   12 70
   13 71
    6 72
   76 73
    3 74
   11 75
  252 76
    1 77
    6 78
    2 79
    1 80
    4 81
    5 82
    5 83
    4 84
    3 85
    3 86
    1 88
    1 89
    1 90
    2 91
    1 92
    1 93
    1 94
    1 95
    1 100
    2 105
    2 106
    1 107
    1 108
    1 109
    1 110
    1 111
    2 112
    3 113
    4 114
    1 115
    1 118
    3 121
    1 124
    3 125
    1 126
    1 132
    1 141
    1 253

@stephentoub
Member

Here are the statistics for the input file:

Thanks, but to clarify my question, it's invoking Regex.Replace once per input line per replacement pattern, and every string input to Regex.Replace is going to be relatively short, e.g. less than a few hundred characters? I suspect that's not actually the case, or if it is, something is very wrong.

@iSazonov
Contributor Author

The script reads the data file line by line; for each line it applies 54 replace operations, each applied to the result of the previous one.
Most lines are short (the maximum length is 253 characters, for a single line).

@daxian-dbw
Contributor

@stephentoub The repro script calls -replace (which calls to regex.Replace(string, string)) 54 times per input line with different replacement patterns.

        # Removes junk lines of text
        $data = Get-Content $aioTextFile -Encoding UTF8

        # Removes some substrings within the data
        $data = $data | ForEach-Object { `
                $_ -replace '127\.0\.0\.1', '' -replace '0\.0\.0\.0', '' `
                -replace ' ', '' -replace '\t', '' -replace '\|\|', '' `
                -replace '\^', '' -replace 'www\.', '' `
                -replace '#Shock-Porn', '' -replace '#AAAEcommerce', '' `
                -replace '#(Vietnam)', '' -replace '#Acoustic', '' `
                -replace '#Albert', '' -replace '#Admedo', '' `
                -replace '#AdTriba', '' -replace '#AgentStudio', '' `
                -replace '#Algolia', '' -replace '#Race', '' `
                -replace '#Email,SMS,Popups', '' -replace '#Appier', '' `
                -replace '#ZineOne', '' -replace '#DeadBabies(Anti-Abortion)', '' `
                -replace '#AdobeDigitalMarketingSuite', '' -replace '#Race', '' `
                -replace '#Attentive', '' -replace '#Auctiva', '' `
                -replace '#Beelert', '' -replace '#Granify', '' `
                -replace '#Gore', '' -replace '#ExtremistGroup(Uncertain)', '' `
                -replace '#SixBit', '' -replace '#DigitalMarketing', '' `
                -replace '#Race/Gender(SimonSheppard)', '' -replace '#BoldCommerce', '' `
                -replace '#BridgeWell', '' -replace '#BuckSense', '' `
                -replace '#Candid.io', '' -replace '#SailThru', '' `
                -replace '#PCM,Inc.', '' -replace '#Frooition', '' `
                -replace '#RichContext', '' -replace '#SearchSpring', '' `
                -replace '#SendPulse', '' -replace '#Kibo', '' `
                -replace '#Satanism', '' -replace '#ClearLink', '' `
                -replace '#CleverTap', '' -replace '#ClickFirstMarketing', '' `
                -replace '\&visitor=\*\&referrer=', '' -replace '://', '' `
                -replace '\:\:1ip6-localhostip6-loopbacklocalhost6localhost6\.localdomain6', '' `
                -replace '\:\:1localhost', '' -replace '\?action=log_promo_impression', '' `
                -replace '\[hidethisheader(viewthelistonitsown)', '' `
                -replace '\[viewthelistinadifferentformat\|aboutthislist\|viewtheblocklistofIPaddresses', ''
        }

@GrabYourPitchforks
Member

Looking at the patterns, most of them (except where parens are involved) seem to be looking for exact string literals? I wonder if a possible optimization could be for Regex to realize that the input pattern doesn't contain anything "interesting" from a regex matching perspective and to forward down to the normal string.Replace API in those cases.
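A rough sketch of that idea, using a simplified, hypothetical metacharacter scan rather than the real parse tree the runtime would consult:

```csharp
using System;
using System.Text.RegularExpressions;

static class LiteralFastPath
{
    // Characters that give a pattern regex semantics. This list is a
    // simplification for illustration; a real implementation would
    // inspect the parsed pattern instead of scanning characters.
    static readonly char[] Metacharacters = @"\*+?|{}[]()^$.#".ToCharArray();

    public static string Replace(string input, string pattern, string replacement)
    {
        // Pure literal pattern with no '$' substitutions in the
        // replacement: ordinal string.Replace is equivalent (under
        // default RegexOptions) and skips the regex machinery.
        if (pattern.IndexOfAny(Metacharacters) < 0 && replacement.IndexOf('$') < 0)
            return input.Replace(pattern, replacement);

        return Regex.Replace(input, pattern, replacement);
    }
}
```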

@danmoseley
Member

That would also be a case where it would be nice to have an API to avoid allocating intermediate strings.

@stephentoub stephentoub self-assigned this Nov 17, 2020
@stephentoub stephentoub removed the untriaged New issue has not been triaged by the area owner label Nov 17, 2020
@stephentoub stephentoub added this to the 6.0.0 milestone Nov 17, 2020
@stephentoub
Member

if it is, something is very wrong.

Yeah, something was very wrong: when there were no replacements to be made, we were neglecting to return a buffer to the pool. I'll put up a PR.
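The class of bug described here, sketched in isolation (this is not the actual runtime code): a pooled buffer rented on entry must be returned on every path, including the "nothing replaced" early exit.

```csharp
using System;
using System.Buffers;

static class RentReturnSketch
{
    public static string RemoveAll(string input, char target)
    {
        char[] buffer = ArrayPool<char>.Shared.Rent(input.Length);
        try
        {
            int written = 0;
            foreach (char c in input)
                if (c != target)
                    buffer[written++] = c;

            // A buggy version of this shape would bail out with the
            // original string on the no-changes path without ever
            // returning the rented array; keeping the Return in a
            // finally block guarantees it runs on every path.
            return written == input.Length ? input : new string(buffer, 0, written);
        }
        finally
        {
            ArrayPool<char>.Shared.Return(buffer);
        }
    }
}
```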

@stephentoub
Member

Fix is at #44833

@stephentoub
Member

Looking at the patterns, most of them (except where parens are involved) seem to be looking for exact string literals? I wonder if a possible optimization could be for Regex to realize that the input pattern doesn't contain anything "interesting" from a regex matching perspective and to forward down to the normal string.Replace API in those cases.

It's possible, if the regex parse tree contains just a single character or a "multi" (just a text string), if the options are configured appropriately (default would be appropriate), and with the string-based overload (rather than the callback-based overload). I did a quick hack to see what it would save, and for this specific repro, it would appx double throughput.

@pgovind
Contributor

pgovind commented Nov 17, 2020

if the regex parse tree contains just a single character or a "multi"

I was going through this just yesterday. This would go down the Boyer-Moore path, right? I'm guessing the speedup is because vectorized string search is much better than non-vectorized BM. I'm wondering if our BM implementation can be optimized/vectorized further. Anyway, this is just an aside; I'm looking at your PR now :)

@stephentoub
Member

I was going through this just yesterday. This would go down the Boyer-Moore path, right?

It does, yes.

I'm guessing the speedup is because vectorized string search is much better than non-vectorized BM.

That's likely a contributor, though for many of these the advance distance is pretty large, so I suspect it's not the bulk of the difference. There's non-trivial code to traverse through to get from the public Regex.Replace down to the inner Boyer-Moore search loop in FindFirstChar; if I had to guess, I'd wager a non-trivial chunk of the difference came from my just replacing Regex.Replace with string.Replace in my hack, and thus skipping all of that overhead just getting to the place where the search happens.

@iSazonov
Contributor Author

@stephentoub I see you found and fixed one problem. Thanks!
In my original test, the result was 112 seconds on all 5.0 preview versions, but on the final release it was over 300 seconds. I'm afraid something else was changed that we still haven't covered here.

@stephentoub
Member

In my original test, the result was 112 seconds on all 5.0 preview versions, but on the final release it was over 300 seconds. I'm afraid something else was changed that we still haven't covered here.

What test?

@iSazonov
Contributor Author

I mean PowerShell/PowerShell#14087 (comment): the same test takes ~112 sec on all preview versions and ~307 sec on GA.

@stephentoub
Member

Does PowerShell execute the regex exactly as provided, or does it prepend anything to it, like .* ?

@iSazonov
Contributor Author

I don't see any modification of the input regex string for the replace operator in the PowerShell code.

@stephentoub
Member

In that case I don't know why there would be any difference between the previews and the release. If you can come up with a repro for a different regression from 3.1, please open a new issue. Thanks.

@iSazonov
Contributor Author

@stephentoub When can we get #44833? Will it be in 6.0 Preview 1? After that we could re-run the test in PowerShell and report back if needed.

@stephentoub
Member

@iSazonov, it's already merged into master, so any 6.0 preview would have it; it should also be in nightly builds starting today if you wanted to try running with one of those. And #44837 is for possibly getting it into 5.0.x servicing.

@danmoseley
Member

Returning to my original comment -- if this pattern of heavily chained replacements is a PowerShell idiom, I could imagine PowerShell could recognize it and use some API optimized for this purpose (to avoid intermediates). Perhaps that would look like part of #23602, which we already discussed.

@iSazonov
Contributor Author

@danmosemsft Most PowerShell users are not experienced developers (not even with PowerShell) and can create the most incredible constructs. Of course, a chain of 54 replace operators is not a typical case, but from my experience such chains are used; this is one of the simple and pleasant features of PowerShell. If we consider that this is how users process modern logs, which can be tens of gigabytes in size, then performance is undoubtedly important.
If there is a new API that allows chaining, PowerShell will not be able to benefit from it immediately; we would need to improve either the PowerShell engine or the operator itself. We could also think about searching a string with one pattern and then replacing in the result with another pattern.

I'd speculate that regex in general is too complex for end users, and most of the regex patterns PowerShell users create are very, very simple. If we look at the 54 patterns here, they are all simple. I guess the .NET team could use that fact to create allocation-free Regex optimizations.

@stephentoub
Member

I guess the .NET team could use that fact to create allocation-free Regex optimizations.

The allocation here is the string returned from Regex.Replace; if anything is replaced, it has to return a new string (it returns the original string if nothing is replaced).
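That behavior can be observed directly; a small check (assuming current framework behavior, since returning the original instance is an optimization rather than a documented contract):

```csharp
using System;
using System.Text.RegularExpressions;

class NoMatchAllocation
{
    static void Main()
    {
        string input = "no digits here";

        // No match: Regex.Replace hands back the original string instance.
        string unchanged = Regex.Replace(input, @"\d+", "");
        Console.WriteLine(ReferenceEquals(input, unchanged)); // True

        // A match forces a new string allocation for the result.
        string changed = Regex.Replace("room 42", @"\d+", "");
        Console.WriteLine(changed); // "room "
    }
}
```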

@iSazonov
Contributor Author

I guess the .NET team could use that fact to create allocation-free Regex optimizations.

The allocation here is the string returned from Regex.Replace; if anything is replaced, it has to return a new string (it returns the original string if nothing is replaced).

My note was about the chained scenario.


I can confirm the fix works. (I didn't compile PowerShell with a nightly .NET 6.0 build; I only copied the DLLs over. After that, the original script works as fast as with .NET 5.0.)

@ghost ghost locked as resolved and limited conversation to collaborators Dec 23, 2020
@stephentoub stephentoub modified the milestones: 6.0.0, 5.0.1 Jan 5, 2021

7 participants