Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for ES6-style identifier characters #215

Closed
marijnh opened this issue Mar 4, 2015 · 16 comments
Closed

Support for ES6-style identifier characters #215

marijnh opened this issue Mar 4, 2015 · 16 comments

Comments

@marijnh
Copy link
Member

@marijnh marijnh commented Mar 4, 2015

Our current big fat regexps only recognize ES5-style identifier characters. See https://mathiasbynens.be/notes/javascript-identifiers-es6 . We need a separate set of, even bigger, regexps to properly parse ES6 identifiers. I am not really happy about just pasting those in and bloating up the library size. A clever alternative trick would be nice.

@mathiasbynens
Copy link
Contributor

@mathiasbynens mathiasbynens commented Mar 4, 2015

In terms of BMP characters in identifiers, the only difference between ES5 and ES6 is the one in the first example here: https://mathiasbynens.be/notes/javascript-identifiers-es6#acceptable-unicode-symbols

// Valid in ES5 & Unicode v5.1.0+, but invalid in ES6:
var ; // U+2E2F VERTICAL TILDE
var \u2E2F; // U+2E2F VERTICAL TILDE

In other words, once you add the ES6 regex you don’t need the ES5 regex anymore. You could just special-case U+2E2F and astral symbols.

@marijnh
Copy link
Member Author

@marijnh marijnh commented Mar 4, 2015

Okay, cool. I'll look into compressing down your ES6 regexp by extracting duplicated pieces.

marijnh referenced this issue Mar 4, 2015
We still can't properly recognize code points as ES6-style
identifier chars.

Issue #214
@mathiasbynens
Copy link
Contributor

@mathiasbynens mathiasbynens commented Mar 4, 2015

You could apply the same trick you’re using in Acorn at the moment, i.e. having separate IdentifierStart and IdentifierPartOnly parts and then dynamically concatenating the regex at runtime. This can easily be done by tweaking https://gist.github.com/mathiasbynens/6334847 (cfr. https://github.com/marijnh/acorn/blob/891d5d07ddbe587d605ff4e00538bc05ed482dd3/tools/generate-identifier-regex.js). However, keep in mind that by doing so, the resulting regex will be larger (in memory) than the current output. You’re trading library size for performance / memory usage.

@marijnh
Copy link
Member Author

@marijnh marijnh commented Mar 4, 2015

If I didn't make any mistake in constructing the regexps (see below), there appear to be more differences than just the one in the example you give. These identifiers from the tests are no longer considered valid once I drop in the new regexps:

T‿ = []
price_9̶9̶_89

Here's how I'm generating them:

var regenerate = require('regenerate');
// Which Unicode version should be used?
var version = '7.0.0';
var ID_Start = require('unicode-' + version + '/properties/ID_Start/code-points');
var ID_Continue = require('unicode-' + version + '/properties/ID_Continue/code-points');

var identifierStart = regenerate(ID_Start).add("$", "_");
var identifierPart = regenerate(ID_Continue).add("\u200c", "\u200d").remove(identifierStart);

console.log(identifierStart.toString());
console.log(identifierPart.toString());

@mathiasbynens
Copy link
Contributor

@mathiasbynens mathiasbynens commented Mar 4, 2015

Both those examples are still valid in ES6. Here’s my validator that uses the regular expression I mentioned:

https://mothereff.in/js-variables#T%E2%80%BF
https://mothereff.in/js-variables#price%5f9%CC%B69%CC%B6%5f89

Your script looks good to me, so there must be something wrong with the way you use the regexps. Can I view that code anywhere?

@marijnh
Copy link
Member Author

@marijnh marijnh commented Mar 4, 2015

You're quite right, I still had the assumption that these were simple character ranges elsewhere in the code, though they now use the | operator. Fixing that, the new regexps work. Now to update the identifier reader to understand multibyte characters.

@mathiasbynens
Copy link
Contributor

@mathiasbynens mathiasbynens commented Mar 4, 2015

Ah yes, wrapping each part in (?:x) is now needed. Sorry for not mentioning that explicitly.

marijnh added a commit that referenced this issue Mar 5, 2015
Add a data structure to recognize astral identifier chars. Parse whole
code points when looking for identifiers.

Issue #215
@marijnh
Copy link
Member Author

@marijnh marijnh commented Mar 5, 2015

Attached patch fixes this, using a non-regexp approach for testing the astral characters.

@mathiasbynens
Copy link
Contributor

@mathiasbynens mathiasbynens commented Jan 13, 2016

Looks like the implementation is broken:

$ cat foo.js
// Random example from https://mathiasbynens.be/notes/javascript-identifiers-es6
// Invalid in ES5, but valid in ES6:
var 𐊧; // U+102A7 CARIAN LETTER A2
var \u{102A7}; // U+102A7 CARIAN LETTER A2

$ acorn foo.js
Unexpected character '𐊧' (2:4)

https://mathiasbynens.be/notes/javascript-identifiers-es6 has more examples that may be useful when testing.

@RReverser
Copy link
Member

@RReverser RReverser commented Jan 13, 2016

@mathiasbynens No, it's just that Acorn parses ES5 by default. acorn --ecma6 foo.js seems to work.

@mathiasbynens
Copy link
Contributor

@mathiasbynens mathiasbynens commented Jan 13, 2016

Thanks! That’s a strange default… Anyway, sorry for not RTFM’ing :)

@marijnh
Copy link
Member Author

@marijnh marijnh commented Jan 13, 2016

ES6 wasn't really a thing (and was certainly not supported) when the API was initially designed. Hence.

@RReverser
Copy link
Member

@RReverser RReverser commented Jan 13, 2016

@marijnh Btw, I think now that it's a released & established spec, we can bump defaults in a major update.

@mathiasbynens
Copy link
Contributor

@mathiasbynens mathiasbynens commented Jan 13, 2016

ES6 wasn't really a thing (and was certainly not supported) when the API was initially designed. Hence.

Sure, but it’s supposed to be backwards compatible — ES5 code should still parse per ES6.

we can bump defaults in a major update.

👍

@marijnh
Copy link
Member Author

@marijnh marijnh commented Jan 13, 2016

Sure, but it’s supposed to be backwards compatible

Not entirely, as was pointed out to me recently (see for example let[1]), and also you'll get nodes in your AST that you may not have been expecting, so this is definitely not a backwards-compatible change.

@marijnh
Copy link
Member Author

@marijnh marijnh commented Jan 13, 2016

Discussion moved to #374

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants