Enable unicode support in pcre by smashwilson · Pull Request #61 · atom/superstring

smashwilson · 2018-03-27T21:37:06Z

Enable Unicode support in pcre and compile regular expressions with PCRE2_UTF.

Fixes #56.

smashwilson · 2018-03-28T19:14:14Z

This does have a not insignificant performance and size impact, hence the failing test:

Without `SUPPORT_UNICODE`

Size of superstring.node: 768K
Duration of "can be called repeatedly between buffer mutations without harming performance": 121ms
Size of browser.js: 808K
Duration of "can be called repeatedly between buffer mutations without harming performance" with browser tests: 715ms

With `SUPPORT_UNICODE`

Size of superstring.node: 896K
Duration of "can be called repeatedly between buffer mutations without harming performance": 440ms
Size of browser.js: 1.1M
Duration of "can be called repeatedly between buffer mutations without harming performance" with browser tests: 3059ms

All recorded on macOS, naturally.

I'm not sure what to do with this one from here. That's a pretty big hit (roughly 3.6x slower under Node and 4.3x slower in the browser tests, going by the numbers above), and this is a performance-critical part of Atom, but not handling Unicode surrogates and character attributes at all is bad for internationalization.

One idea I have is to remember which Text instances contain surrogate pairs and which do not, and only compile a Regex with PCRE2_UTF when it is actually needed. That would at least limit the penalty to the cases that require it... although if this applies to a user, it likely applies to most of the buffers they open.

Another option might be to handle this during pattern preprocessing: replace á with [áÁ], for example. I suspect the end game there is us reimplementing what PCRE2_UTF does, but poorly.
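To make the preprocessing idea concrete, here is a deliberately naive sketch in TypeScript. The `expandCase` helper is hypothetical and not part of superstring or find-and-replace; it only handles single-code-unit letters and ignores escapes, existing character classes, and multi-character case mappings, which is exactly the kind of gap that makes this route drift toward reimplementing Unicode case folding.

```typescript
// Hypothetical helper: rewrite each non-ASCII cased letter as a two-member
// character class so that a byte-oriented matcher sees both cases.
function expandCase(pattern: string): string {
  let out = "";
  for (let i = 0; i < pattern.length; i++) {
    const ch = pattern.charAt(i);
    const lower = ch.toLowerCase();
    const upper = ch.toUpperCase();
    const isNonAscii = ch.charCodeAt(0) > 0x7f;
    // Skip letters whose case mapping changes length (e.g. "ß" -> "SS").
    const hasSimpleCasePair =
      lower !== upper && lower.length === 1 && upper.length === 1;
    out += isNonAscii && hasSimpleCasePair ? "[" + lower + upper + "]" : ch;
  }
  return out;
}

const expanded = expandCase("caf\u00e9"); // "café"
const matchesUpper = new RegExp(expanded).test("caf\u00c9"); // "cafÉ"
```

Note that a pattern such as `[á]` would be rewritten incorrectly by this sketch, because it has no notion of regex syntax.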

@nathansobo @maxbrunsfeld: What do you think is the right thing to do here? Is this why SUPPORT_UNICODE was off in the first place?

maxbrunsfeld · 2018-03-28T19:20:59Z

@smashwilson I don't recall intentionally turning unicode support off; I think I just wasn't aware of (and didn't realize that we weren't handling) regexes with /u.

I'm assuming that the performance degradation only occurs when the PCRE2_UTF flag is set at runtime (as opposed to occurring no matter what just because SUPPORT_UNICODE was set during compilation)? If that's the case, then can we only set that flag when the JavaScript RegExp has the /u flag and contains non-ascii characters?
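A minimal sketch of that check, written in TypeScript. The function name `wantsUtfMode` and its signature are assumptions made for illustration; nothing here is superstring's actual API. It takes the pattern source and flags as strings and reports whether both conditions hold.

```typescript
// Hypothetical helper: decide whether the native regex should be compiled
// in UTF mode, based only on the JavaScript pattern's source and flags.
function wantsUtfMode(source: string, flags: string): boolean {
  // The pattern was written with the /u flag.
  const hasUnicodeFlag = flags.indexOf("u") !== -1;
  // The pattern source contains at least one code unit above 0x7F.
  const hasNonAscii = /[^\x00-\x7F]/.test(source);
  return hasUnicodeFlag && hasNonAscii;
}

const accented = wantsUtfMode("\u00e1", "u");     // "á" with /u
const escaped = wantsUtfMode("\\u{1F600}", "u");  // escape is pure ASCII text
```

The second call returns `false` because the escape sequence is spelled entirely in ASCII, so a non-ASCII test on the source alone misses it.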

smashwilson · 2018-03-28T19:31:05Z

> I'm assuming that the performance degradation only occurs when the PCRE2_UTF flag is set?

Verifying that presently.

> If that's the case, then can we only set that flag when the JavaScript regex has the /u flag and contains non-ascii characters?

Ah! I didn't know JavaScript regexes had a /u flag. Okay, I'll go that route. I might put the "detect when to add /u to the regex" logic in find-and-replace then.

maxbrunsfeld · 2018-03-28T19:34:46Z

Actually it's a bit more complicated than detecting non-ascii characters. It seems like /u affects the behavior of negated character classes and the . operator, and adds a \u syntax.

https://mathiasbynens.be/notes/es6-unicode-regex
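The three differences mentioned above can be observed directly in JavaScript's own engine. The snippet below is only an illustration in TypeScript of standard `RegExp` behavior; it says nothing about how pcre2 behaves. It uses U+1F600, which is stored as a surrogate pair in a JavaScript string.

```typescript
// U+1F600 written as its two UTF-16 code units.
const grin = "\uD83D\uDE00";

// `.` matches one code unit without /u, one code point with /u.
const dotPlain = new RegExp("^.$").test(grin);
const dotUnicode = new RegExp("^.$", "u").test(grin);

// A negated class matches each surrogate half separately without /u.
const negPlain = (grin.match(new RegExp("[^a]", "g")) || []).length;
const negUnicode = (grin.match(new RegExp("[^a]", "gu")) || []).length;

// `\u{61}` is a code point escape for "a" only when /u is set.
const escapePlain = new RegExp("^\\u{61}$").test("a");
const escapeUnicode = new RegExp("^\\u{61}$", "u").test("a");

console.log({ dotPlain, dotUnicode, negPlain, negUnicode, escapePlain, escapeUnicode });
```

Without /u, the `\u{61}` pattern is read as the letter `u` repeated 61 times, which is why the plain variant does not match "a".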

smashwilson · 2018-03-28T19:47:26Z

> I'm assuming that the performance degradation only occurs when the PCRE2_UTF flag is set?

Looks good: about 100ms with node, 1000ms in browser.

> https://mathiasbynens.be/notes/es6-unicode-regex

I just found that too 😄

Well that's a bit of a mess. I'd suggest adding another toggle to find-and-replace to flip /u but that would just confuse everyone.

Of the /u changes I'd guess that we would generally want the new . behavior and the \u syntax. The negated character classes and some of the wrinkles of canonicalization and case folding feel riskier, but they at least seem to be the same thing that pcre2 does: https://www.pcre.org/current/doc/html/pcre2unicode.html.

smashwilson · 2018-03-28T19:51:11Z

Offhand I'm leaning toward "set /u on find-and-replace regexes (and therefore pcre2 regexes)" when the buffer text or the pattern source contains astral-plane characters or \u{..} sequences. That seems like a reasonable rule of thumb, and tracking whether a buffer contains astral-plane characters should be doable as part of any other O(n) scan we have to do on the content anyway.
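As a rough sketch of that rule of thumb in TypeScript: the helper names below (`containsAstral`, `hasCodePointEscape`, `shouldSetUnicodeFlag`) are hypothetical and do not come from find-and-replace or superstring. Astral-plane characters are detected by looking for a high surrogate immediately followed by a low surrogate, which is a single O(n) pass over the text.

```typescript
// True when the text holds at least one well-formed surrogate pair,
// i.e. a character outside the Basic Multilingual Plane.
function containsAstral(text: string): boolean {
  for (let i = 0; i + 1 < text.length; i++) {
    const hi = text.charCodeAt(i);
    const lo = text.charCodeAt(i + 1);
    if (hi >= 0xd800 && hi <= 0xdbff && lo >= 0xdc00 && lo <= 0xdfff) {
      return true;
    }
  }
  return false;
}

// True when the pattern source spells out a \u{...} escape.
// Limitation: an escaped backslash followed by "u{..}" is also reported.
function hasCodePointEscape(patternSource: string): boolean {
  return /\\u\{[0-9a-fA-F]+\}/.test(patternSource);
}

// The rule of thumb: use /u if either input needs code point semantics.
function shouldSetUnicodeFlag(bufferText: string, patternSource: string): boolean {
  return (
    containsAstral(bufferText) ||
    containsAstral(patternSource) ||
    hasCodePointEscape(patternSource)
  );
}

const plainCase = shouldSetUnicodeFlag("caf\u00e9", "caf.");      // BMP only
const astralCase = shouldSetUnicodeFlag("hi \uD83D\uDE00", "hi"); // emoji in buffer
```

In a real buffer the astral check would presumably be cached and updated on edits rather than recomputed per search, but that bookkeeping is outside this sketch.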

smashwilson · 2018-03-29T12:04:52Z

> tracking when a buffer contains or does not contain astral-plane characters

I'm punting this to a separate PR because "respecting /u" stands alone well and because I'm much less confident about the implementation there 😄

Enable unicode support in pcre

e72589f

smashwilson added 3 commits March 28, 2018 16:35

Set NEWLINE_DEFAULT instead of using a compile context

c8995ae

Add a unicode argument to Regex construction

d790cf5

Use /u for our test fixture regexp

415cae5

maxbrunsfeld approved these changes Mar 28, 2018

smashwilson merged commit eb4a1ff into master Mar 29, 2018

smashwilson deleted the aw/unicode branch March 29, 2018 12:05

smashwilson mentioned this pull request Mar 29, 2018

Create unicode regexps when necessary atom/find-and-replace#1009

Merged

maxbrunsfeld mentioned this pull request Jun 2, 2018

Find and Replace throws an invalid regex escape, even when regex is turned off atom/find-and-replace#1022

Closed
