The .NET regular expression engine's capturing behavior is not the same as the ECMAScript standard. #24

Open
otac0n opened this issue Apr 2, 2011 · 7 comments

@otac0n
Collaborator

otac0n commented Apr 2, 2011

For regular expressions such as this:
((a+)?(b+)?c+)*

There are 3 capturing groups (one for each left-parenthesis).

If this is matched against a string like the following:
bbbccaac

The .NET implementation will list the following capture groups:
((a+)?(b+)?c+) = "aac"
(a+) = "aa"
(b+) = "bbb"

Whereas the ECMAScript spec specifies the following capturing behavior:
((a+)?(b+)?c+) = "aac"
(a+) = "aa"
(b+) = undefined

The .NET implementation gives no indication that the (b+) capturing group did not participate in its most recent match attempt.
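For comparison, here is a minimal sketch of the same pattern and input run through a JavaScript engine, which follows the ECMAScript rule that captures inside a quantified group are reset at the start of each iteration. The variable names are only for illustration.

```javascript
// Pattern and input taken from the report above.
const pattern = /((a+)?(b+)?c+)*/;
const match = pattern.exec("bbbccaac");

// Iteration 1 of the outer group matches "bbbcc" (group 3 = "bbb").
// Iteration 2 matches "aac"; group 3 does not participate, and because
// captures are reset on each iteration it ends up undefined rather
// than keeping the stale "bbb" from iteration 1.
console.log(match[0]); // "bbbccaac"
console.log(match[1]); // "aac"
console.log(match[2]); // "aa"
console.log(match[3]); // undefined
```

The last line is the difference described in this issue: .NET reports "bbb" for that group.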

@hakanson

hakanson commented Apr 2, 2011

@otac0n
Collaborator Author

otac0n commented Apr 2, 2011

@hakanson: We are already using the ECMAScript option, which works well for the most part. It is just this little piece that is different.

@fholm
Owner

fholm commented Apr 3, 2011

I think this is something we'll have to live with for now. Writing a custom regular expression implementation for this small detail is too much work for too little gain at the moment. I'll leave the ticket open, and we'll look into it eventually.

@hakanson

-1 for me for not looking in the code in Core.fs

    let options = (options ||| RegexOptions.ECMAScript) &&& ~~~RegexOptions.Compiled
    let key = (options, pattern)
    this.RegExp <- env.RegExpCache.Lookup key (fun () -> new Regex(pattern, options ||| RegexOptions.Compiled))

I'm new to F#; does this mean you are implementing your own compiled RegExp cache? I ask because there is a Regex.CacheSize property that controls an internal cache of compiled regular expressions. I assume having your own cache gave you more control, but I thought I would mention it for completeness (at the risk of looking uninformed a second time on the same issue).

http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.cachesize.aspx

@fholm
Owner

fholm commented Apr 24, 2011

Yes, we do maintain our own regexp cache; we found it to be faster, actually.

@otac0n
Collaborator Author

otac0n commented Apr 25, 2011

We found that in a loop like this...

while (true)
{
    var r = new RegExp("...");
}

...that .NET's regex cache was not helping.

When we implemented the regexp cache shown above, we saw a 50% reduction in the time on the SunSpider regexp test.
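To illustrate the general idea only (this is not the IronJS code, which is the F# shown earlier and keys on `(options, pattern)`), a rough JavaScript sketch of a cache keyed on pattern plus flags might look like the following. `regExpCache` and `cachedRegExp` are hypothetical names.

```javascript
// Hypothetical cache: maps "flags/pattern" to an already-constructed RegExp.
const regExpCache = new Map();

function cachedRegExp(pattern, flags = "") {
  // Valid flags are letters only, so "/" cleanly separates the two key parts.
  const key = flags + "/" + pattern;
  let re = regExpCache.get(key);
  if (re === undefined) {
    // Cache miss: pay the construction cost once and remember the result.
    re = new RegExp(pattern, flags);
    regExpCache.set(key, re);
  }
  return re;
}

// Repeated requests for the same pattern and flags reuse one object.
const first = cachedRegExp("a+b", "i");
const second = cachedRegExp("a+b", "i");
console.log(first === second); // true
```

One caveat with this sketch: a shared RegExp object also shares `lastIndex`, so handing out the same instance is only safe without further care for patterns that lack the `g` and `y` flags.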

@ChaosPandion
Collaborator

@otac0n - From the looks of it, the BCL only caches regular expressions used through the static methods of the Regex class, so the increase in performance makes sense.
