New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle regexes with unicode escape sequences in .find and .findAll #43

Merged
merged 2 commits into from Nov 8, 2017

Conversation

Projects
None yet
3 participants
@maxbrunsfeld
Contributor

maxbrunsfeld commented Nov 8, 2017

Fixes atom/atom#16126

Background

The TextBuffer.find family of methods use PCRE to perform regex matching directly on the buffer's native contents. This has a number of advantages: it eliminates a lot of string copying, removes to need for conversions between raw character indices and row/column coordinates, and allows searching to be done on a background thread if desired. See #5, #35.

Problem

There are some differences between PCRE's and ECMAScript's regex syntax. One difference is that PCRE does not support the \u00df syntax for specifying UTF16 character codes.

Solution

Luckily, it's trivial to convert these character sequences into their actual UTF16 values. I've added that conversion in this PR.

Future Work

There may be other incompatibilities between PCRE and ECMAScript regexes. One other example I know of is the unicode code point escape sequence, which are slightly different than the UTF16 character code escape sequences I've dealt with here. They should be easy to support as well, though I have not done so here.

/cc @t9md

@maxbrunsfeld maxbrunsfeld merged commit 6694dd5 into master Nov 8, 2017

3 checks passed

ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@maxbrunsfeld maxbrunsfeld deleted the mb-handle-unicode-escape-sequences branch Nov 8, 2017

@nathansobo

This comment has been minimized.

Show comment
Hide comment
@nathansobo

nathansobo Nov 8, 2017

Contributor

💖

Contributor

nathansobo commented Nov 8, 2017

💖

@leroix

This comment has been minimized.

Show comment
Hide comment
@leroix

leroix Nov 8, 2017

Contributor

It would be nice if there was a standalone c++ module version of v8's Irregexp.

Contributor

leroix commented Nov 8, 2017

It would be nice if there was a standalone c++ module version of v8's Irregexp.

@nathansobo

This comment has been minimized.

Show comment
Hide comment
@nathansobo

nathansobo Nov 8, 2017

Contributor

It would be nice if there was a standalone c++ module version of v8's Irregexp.

I have lamented this as well.

Contributor

nathansobo commented Nov 8, 2017

It would be nice if there was a standalone c++ module version of v8's Irregexp.

I have lamented this as well.

@maxbrunsfeld

This comment has been minimized.

Show comment
Hide comment
@maxbrunsfeld

maxbrunsfeld Nov 8, 2017

Contributor

Yeah, we looked into that a little bit, but concluded that Irregexp is totally coupled to V8's internals - I think it generates native code using the same JIT machinery that V8 uses for compiling JS code.

Luckily, PCRE is a pretty lightweight dependency.

Contributor

maxbrunsfeld commented Nov 8, 2017

Yeah, we looked into that a little bit, but concluded that Irregexp is totally coupled to V8's internals - I think it generates native code using the same JIT machinery that V8 uses for compiling JS code.

Luckily, PCRE is a pretty lightweight dependency.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment