Handle quirk in how PHP parses identifiers #302
Description of the Change
While the PHP manual documents the range of valid identifiers as
Just don't fix this as it's really an odd quirk of PHP.
Arabic characters will be tokenized as identifiers where valid.
Specs coming later.
To clarify, the reason Unicode code points work without Unicode support in PHP is quite simple: UTF-8 encodes every Unicode code point past 0x7F using only bytes from the range 0x80-0xFF, leaving the original ASCII part (0x00-0x7F, defined the same way in all locales) intact.
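As a quick check of that property, here is a minimal Python sketch (Python is used only for illustration; the helper names `utf8_bytes` and `only_high_bytes` are made up for this example). It encodes a few sample characters and confirms that everything outside ASCII ends up as bytes in 0x80-0xFF.

```python
def utf8_bytes(ch):
    """Return the UTF-8 encoding of a string as a list of byte values."""
    return list(ch.encode("utf-8"))


def only_high_bytes(ch):
    """True if every UTF-8 byte of `ch` lies in the range 0x80-0xFF."""
    return all(0x80 <= b <= 0xFF for b in utf8_bytes(ch))


if __name__ == "__main__":
    # ASCII letter, underscore, Latin-1 letter, Arabic letter, emoji.
    for ch in ["a", "_", "\u00e9", "\u0639", "\U0001F600"]:
        encoded = " ".join(f"0x{b:02X}" for b in utf8_bytes(ch))
        print(f"U+{ord(ch):04X} -> {encoded} (high bytes only: {only_high_bytes(ch)})")
```

ASCII characters stay as a single byte below 0x80, while every other code point becomes a sequence of two to four bytes that never dips into the ASCII range.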
Whoever decided that the PHP parser should still match bytes 0x7F-0xFF was either quite smart or got lucky. Regardless of which character a specific byte or byte sequence from the range 0x80-0xFF represents (usually defined by the locale), PHP simply consumes it. This means that if we try matching
The regular expression, found from PHP manual,
Where we went wrong is that the text we match using regular expressions is in Unicode, meaning that we are not presented with a byte sequence
Because Unicode code points outside ASCII are always built from bytes in the range 0x80-0xFF (when encoded in UTF-8), we can match these arbitrary-length byte sequences by also matching every Unicode code point past 0x7F. Per documentation, Oniguruma's highest code point is
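The difference between matching bytes and matching code points can be sketched in Python as follows. This is an illustration only, not the grammar change itself: it uses Python's `re` module rather than Oniguruma, it assumes the byte-oriented identifier pattern as the PHP manual has documented it (`[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*`), and it picks U+10FFFF (the highest Unicode code point) as the upper bound for the text pattern. The names `BYTE_IDENT`, `NARROW_TEXT`, and `WIDE_TEXT` are hypothetical.

```python
import re

# What PHP effectively does: match raw bytes, where 0x7F-0xFF are allowed.
BYTE_IDENT = re.compile(rb"[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*")

# The same pattern applied to decoded text: here \x7f-\xff means the
# code points U+007F..U+00FF, so anything beyond Latin-1 is rejected.
NARROW_TEXT = re.compile("[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*")

# Widened for decoded text: accept every code point past 0x7F.
# The upper bound U+10FFFF is an assumption made for this sketch.
WIDE_TEXT = re.compile("[a-zA-Z_\x7f-\U0010ffff][a-zA-Z0-9_\x7f-\U0010ffff]*")


def is_identifier(pattern, subject):
    """True if `subject` is matched in full by `pattern`."""
    return pattern.fullmatch(subject) is not None


if __name__ == "__main__":
    name = "\u0645\u062a\u063a\u064a\u0631"  # an Arabic word used as an identifier
    print("bytes :", is_identifier(BYTE_IDENT, name.encode("utf-8")))
    print("narrow:", is_identifier(NARROW_TEXT, name))
    print("wide  :", is_identifier(WIDE_TEXT, name))
```

The byte pattern accepts the UTF-8 encoding of the name, the narrow text pattern rejects the decoded name, and the widened text pattern accepts it again, which is the behaviour the grammar needs to reproduce.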
Now you might ask: what about UTF-16, UTF-32, or any other obscure encoding? I have some good news. When zend.multibyte is enabled (a prerequisite for such encodings), you can still use Unicode code points in identifiers. If that ever changes, we might need to rethink our strategy, but since this setting is almost always turned off and disabling Unicode is undesirable, we don't have to worry about it anytime soon.
It should be noted, though, that just because you can use emojis or other Unicode code points in your variable and function names doesn't mean you should. There is some locale-dependent downcasing being done, which can make the code non-portable. It's still a neat party trick, though (or a way to localize variable and function names for educational use).
I took a look at the rest of the word boundaries that touch the modified parts. I left the ones that have nothing to do with this pull request as they are.