Handle quirk in how PHP parses identifiers #302

Merged
merged 3 commits into from Dec 7, 2017

Member

50Wliu commented Nov 29, 2017

Requirements

  • Filling out the template is required. Any pull request that does not include enough information to be reviewed in a timely manner may be closed at the maintainers' discretion.
  • All new code requires tests to ensure against regressions

Description of the Change

While the PHP manual documents the range of valid identifiers as [a-z_\x{7f}-\x{ff}][a-z0-9_\x{7f}-\x{ff}]*, PHP's parser applies that range to each byte rather than to each character. So although the characters in سن (U+0633 U+0646) are technically outside that range, their UTF-8 encoding is d8 b3 d9 86, every byte of which falls within 7f-ff, and PHP recognizes the name as a valid identifier. To handle this behavior in Atom, where the grammar matches code points rather than bytes, we use the modified regex [a-z_\x{7f}-\x{7fffffff}][a-z0-9_\x{7f}-\x{7fffffff}]*. Thanks to @Ingramz for the fix.
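The difference between the two ranges can be illustrated outside of Atom. The following is a minimal Python sketch, not the grammar's actual code: it uses Python's `re` module in place of Oniguruma, and because `re` cannot express `\x{7fffffff}`, the widened range stops at U+10FFFF. The names `BYTE_IDENT`, `NARROW_IDENT` and `WIDE_IDENT` are made up for this illustration.

```python
import re

# The manual's range applied to raw bytes, which is how PHP reads source code.
BYTE_IDENT = re.compile(rb"[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*")

# The same range applied to decoded text, where each element is a code point.
NARROW_IDENT = re.compile(r"[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*")

# The widened range. Python's re tops out at U+10FFFF, so that value stands in
# for Oniguruma's \x{7fffffff} here (an assumption of this sketch).
WIDE_IDENT = re.compile(r"[a-zA-Z_\x7f-\U0010ffff][a-zA-Z0-9_\x7f-\U0010ffff]*")

name = "\u0633\u0646"  # the Arabic identifier from the description above

print(name.encode("utf-8").hex())                         # UTF-8 bytes as hex
print(bool(BYTE_IDENT.fullmatch(name.encode("utf-8"))))   # byte-wise view
print(bool(NARROW_IDENT.fullmatch(name)))                 # code points, old range
print(bool(WIDE_IDENT.fullmatch(name)))                   # code points, new range
```

Only the narrow code point pattern rejects the name, which is the mismatch this pull request addresses.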

Alternate Designs

Don't fix this at all, since it really is an odd quirk of PHP.

Benefits

Arabic characters, along with other non-ASCII characters, will be tokenized as identifiers where valid.

Possible Drawbacks

Unknown.

Applicable Issues

Fixes #301

Specs coming later.

Ingramz Nov 29, 2017

Contributor

To clarify, the reason Unicode code points work in PHP without any Unicode support is quite simple: UTF-8 encodes every code point past 0x7F using only bytes in the range 0x80-0xFF, leaving the original ASCII part (0x00-0x7F, defined the same way in all locales) intact.

Whoever decided that the PHP parser should match bytes 0x7F-0xFF was either quite smart or got lucky. Regardless of which character a specific byte or byte sequence from the range 0x80-0xFF represents (that is usually defined by the locale), PHP simply consumes it. This means that whether ä arrives in ISO-8859-1 encoding (e4) or UTF-8 encoding (c3 a4) makes little difference to PHP, as in both cases the bytes are in the range 0x80-0xFF.
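As a rough illustration of the ä example, the sketch below encodes the character both ways and checks the byte ranges. It is Python rather than PHP, the helper `all_high_bytes` is a hypothetical name, and `händler` is just an example identifier chosen for this sketch.

```python
import re

# Byte-oriented identifier pattern from the PHP manual, compiled as a bytes regex.
BYTE_IDENT = re.compile(rb"[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*")


def all_high_bytes(data: bytes) -> bool:
    """Return True if every byte lies in the range 0x80-0xFF."""
    return all(b >= 0x80 for b in data)


latin1 = "\u00e4".encode("iso-8859-1")  # ä as one byte: e4
utf8 = "\u00e4".encode("utf-8")         # ä as two bytes: c3 a4

print(latin1.hex(), all_high_bytes(latin1))
print(utf8.hex(), all_high_bytes(utf8))

# An example identifier containing ä passes the byte-wise pattern in either encoding.
word = "h\u00e4ndler"
print(bool(BYTE_IDENT.fullmatch(word.encode("iso-8859-1"))))
print(bool(BYTE_IDENT.fullmatch(word.encode("utf-8"))))
```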

The regular expression found in the PHP manual, [a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*, is therefore correct if matching is done per byte.

Where we went wrong is that the text we match with regular expressions has already been decoded to Unicode, so we are not presented with the byte sequence c3 a4; it has been converted to the single code point e4 (U+00E4). Extending that idea to code points greater than 0xFF, for instance U+2620 (☠️): PHP sees the 3 bytes e2 98 a0, but Oniguruma sees one code point, U+2620, which cannot be matched using \xe2\x98\xa0 (\x{2620} has to be used instead). Notice that the 3-byte version uses only bytes from the range 0x80-0xFF.
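The same effect can be reproduced with any regex engine that works on decoded text. Here is a small Python sketch, with Python's `re` standing in for Oniguruma and `\u2620` playing the role of Oniguruma's `\x{2620}`; the variable names are made up for the example.

```python
import re

raw = bytes.fromhex("e298a0")   # what PHP sees: three bytes
text = raw.decode("utf-8")      # what a Unicode-aware engine sees: one code point

print(len(raw), len(text), hex(ord(text)))

# On decoded text, \xe2\x98\xa0 means the three code points U+00E2 U+0098 U+00A0,
# so it does not match the single code point U+2620.
per_byte_pattern = re.fullmatch(r"\xe2\x98\xa0", text)

# Naming the code point directly does match.
code_point_pattern = re.fullmatch(r"\u2620", text)

print(per_byte_pattern is None, code_point_pattern is not None)

# The two-byte case from above collapses the same way: c3 a4 becomes U+00E4.
print(hex(ord(bytes.fromhex("c3a4").decode("utf-8"))))
```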

Because Unicode code points outside ASCII are always built from bytes in the range 0x80-0xFF (when encoded in UTF-8), we can match these arbitrary-length byte sequences by also matching every Unicode code point past 0x7F. Per its documentation, Oniguruma's highest code point is 7fffffff, far beyond U+10FFFF, the highest code point UTF-8 can encode; if that ever becomes an issue, we can fix it by lowering the upper bound of the range.
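The claim that every non-ASCII code point encodes to bytes in 0x80-0xFF can be checked by brute force, and the byte-wise and code point patterns can be compared on a few samples. This is only a Python sketch under stated assumptions: the upper bound is U+10FFFF instead of Oniguruma's 7fffffff, the sample names are arbitrary, and `utf8_bytes_all_high` and `patterns_agree` are hypothetical helpers.

```python
import re

BYTE_IDENT = re.compile(rb"[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*")
# U+10FFFF stands in for \x{7fffffff}, which Python's re cannot express.
WIDE_IDENT = re.compile(r"[a-zA-Z_\x7f-\U0010ffff][a-zA-Z0-9_\x7f-\U0010ffff]*")


def utf8_bytes_all_high(first: int = 0x80, last: int = 0x10FFFF) -> bool:
    """Check that each code point in [first, last] encodes to bytes >= 0x80."""
    for cp in range(first, last + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        if any(b < 0x80 for b in chr(cp).encode("utf-8")):
            return False
    return True


def patterns_agree(name: str) -> bool:
    """Compare the byte-wise verdict with the widened code point verdict."""
    by_bytes = BYTE_IDENT.fullmatch(name.encode("utf-8")) is not None
    by_code_points = WIDE_IDENT.fullmatch(name) is not None
    return by_bytes == by_code_points


samples = ["foo", "_bar9", "9lives", "a-b", "", "caf\u00e9_1", "\u0633\u0646", "\u2620"]

print(utf8_bytes_all_high())
print(all(patterns_agree(s) for s in samples))
```

If the upper bound ever needs lowering as suggested above, U+10FFFF would be the natural value, since nothing past it can appear in UTF-8 text.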

Now you might ask: what about UTF-16, UTF-32, or some other obscure encoding? I have some good news. When zend.multibyte is enabled (a prerequisite for such encodings), you can still use Unicode code points in identifiers. If that ever changes, we might need to rethink our strategy, but since zend.multibyte is almost always turned off and disabling Unicode is undesirable, we don't have to worry about this anytime soon.

It should be noted, though, that just because you can use emoji or other Unicode code points in your variable and function names doesn't mean you should. Some locale-dependent downcasing is performed, which can make the code non-portable. It's still a neat party trick, though (or a way to localize variable and function names for educational use).

@Ingramz

I took a look at the rest of the word boundaries that touch the modified parts. I left the ones that have nothing to do with this pull request as they are now.


@50Wliu 50Wliu merged commit 11b2057 into master Dec 7, 2017

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed

@50Wliu 50Wliu deleted the wl-identifier-quirks branch Dec 7, 2017
