Skip to content
This repository was archived by the owner on Dec 15, 2022. It is now read-only.

Enable unicode support in pcre#61

Merged
smashwilson merged 4 commits intomasterfrom
aw/unicode
Mar 29, 2018
Merged

Enable unicode support in pcre#61
smashwilson merged 4 commits intomasterfrom
aw/unicode

Conversation

@smashwilson
Copy link
Copy Markdown
Contributor

@smashwilson smashwilson commented Mar 27, 2018

Enable Unicode support in pcre and compile regular expressions with PCRE2_UTF.

Fixes #56.

@smashwilson
Copy link
Copy Markdown
Contributor Author

This does have a not insignificant performance and size impact, hence the failing test:

Without SUPPORT_UNICODE
  • Size of superstring.node: 768K
  • Duration of "can be called repeatedly between buffer mutations without harming performance": 121ms
  • Size of browser.js: 808K
  • Duration of "can be called repeatedly between buffer mutations without harming performance" with browser tests: 715ms
With SUPPORT_UNICODE
  • Size of superstring.node: 896K
  • Duration of "can be called repeatedly between buffer mutations without harming performance": 440ms
  • Size of browser.js: 1.1M
  • Duration of "can be called repeatedly between buffer mutations without harming performance" with browser tests: 3059ms

All recorded on macOS, naturally.

I'm not sure what to do with this one from here. That's a pretty big hit and this is a performance-critical part of Atom, but not handling unicode surrogates and character attributes at all is bad for internationalization.

One idea I have is to remember which Text instances contain surrogate pairs and which do not, and only compile a Regex with PCRE_UTF if it will actually be necessary. That would at least limit the penalty to cases where it's actually necessary... although if this applies to a user it likely apples to most of the buffers they open.

Another option might be to handle this during pattern preprocessing: replace á with [áÁ], for example. I suspect the end game there is us reimplementing what PCRE_UTF does, but poorly.

@nathansobo @maxbrunsfeld: What do you think is the right thing to do here? Is this why SUPPORT_UNICODE was off in the first place?

@maxbrunsfeld
Copy link
Copy Markdown
Contributor

maxbrunsfeld commented Mar 28, 2018

@smashwilson I don't recall intentionally turning unicode support off; I think I just wasn't aware of (and didn't realize that we weren't handling) regexes with /u.

I'm assuming that the performance degradation only occurs when the PCRE2_UTF flag is set at runtime (as opposed to occurring no matter what just because SUPPORT_UNICODE was set during compilation)? If that's the case, then can we only set that flag when the JavaScript RegExp has the /u flag and contains non-ascii characters?

@smashwilson
Copy link
Copy Markdown
Contributor Author

I'm assuming that the performance degradation only occurs when the PCRE2_UTF flag is set?

Verifying that presently.

If that's the case, then can we only set that flag when the JavaScript regex has the /u flag and contains non-ascii characters?

Ah! I didn't know JavaScript regexes had a /u flag. Okay, I'll go that route. I might put the "detect when to add /u to the regex" logic in find-and-replace then.

@maxbrunsfeld
Copy link
Copy Markdown
Contributor

Actually it's a bit more complicated than detecting non-ascii characters. It seems like /u affects the behavior of negated character classes and the . operator, and adds a \u syntax.

https://mathiasbynens.be/notes/es6-unicode-regex

@smashwilson
Copy link
Copy Markdown
Contributor Author

I'm assuming that the performance degradation only occurs when the PCRE2_UTF flag is set?

Looks good: about 100ms with node, 1000ms in browser.

https://mathiasbynens.be/notes/es6-unicode-regex

I just found that too 😄

Well that's a bit of a mess. I'd suggest adding another toggle to find-and-replace to flip /u but that would just confuse everyone.

Of the /u changes I'd guess that we would generally want the new . behavior and the \u syntax. The negated character classes and some of the wrinkles of canonicalization and case folding feel riskier, but they at least seem to be the same thing that pcre2 does: https://www.pcre.org/current/doc/html/pcre2unicode.html.

@smashwilson
Copy link
Copy Markdown
Contributor Author

Offhand I'm leaning toward "set /u on find-and-replace regexes (and therefore pcre2 regexes)" when the buffer text or the pattern source contain astral-plane characters or \u{..} sequences. That seems like a reasonable rule of thumb, and tracking when a buffer contains or does not contain astral-plane characters should be doable as part of any other O(n) scan we have to do on the content anyway.

@smashwilson
Copy link
Copy Markdown
Contributor Author

tracking when a buffer contains or does not contain astral-plane characters

I'm punting this to a separate PR because "respecting /u" stands alone well and because I'm much less confident about the implementation there 😄

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants