Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

don't try and minify props that contain unicode #416

Closed
wants to merge 1 commit into from
Closed

don't try and minify props that contain unicode #416

wants to merge 1 commit into from

Conversation

danprince
Copy link

@danprince danprince commented Feb 10, 2017

Firefox 50.1.0 (one behind latest stable) and Safari 10.0.2 (latest stable) are unable to parse object literal properties that contain unicode characters which are not valid ES5 identifiers.

The minifier saves bytes by converting objects such as { '\u2118':'wp' } into {℘:'wp'}, which only works when the key is a valid identifier.

The fix involves bypassing any properties with string representation containing characters outside of the a-zA-Z0-9_$ range.

Closes #415.

@mathiasbynens
Copy link
Contributor

mathiasbynens commented Feb 11, 2017

Not all browsers can parse unicode in raw property names

This description is kind of vague. Which browsers?

Anyway, here’s the underlying cause: https://mothereff.in/js-variables#%5Cu2118

That’s a valid identifier according to ECMAScript 6 / Unicode 8.0.0.

However, it is not a valid identifier as per ES5.

For example, probably works in the browsers you were referring to, since it was already a valid identifier in ES5.

See https://mathiasbynens.be/notes/javascript-identifiers and https://mathiasbynens.be/notes/javascript-identifiers-es6 for background.

Can we amend the description accordingly?

We could narrow down the containsUnicode regex from /[^a-z0-9$_]/i to a regex that matches post-ES5 identifier chars only (when ES5 output is desired).

@mathiasbynens
Copy link
Contributor

mathiasbynens commented Feb 11, 2017

We could narrow down the containsUnicode regex from /[^a-z0-9$_]/i to a regex that matches post-ES5 identifier chars only (when ES5 output is desired).

This script generates a regex that matches ES2015 IdentifierPart except for ES5 IdentifierPart:

// Note: run `npm install regenerate unicode-9.0.0` first!
// Based on https://gist.github.com/mathiasbynens/6334847.

const fs = require('fs');
const regenerate = require('regenerate');

// Which Unicode version should be used?
const version = '9.0.0';

// Set up a shorthand function to import Unicode data.
const get = function(what) {
	return require(`unicode-${ version }/${ what }/code-points.js`);
};

// Get the Unicode categories needed to construct the ES5 regex.
const Lu = get('General_Category/Uppercase_Letter');
const Ll = get('General_Category/Lowercase_Letter');
const Lt = get('General_Category/Titlecase_Letter');
const Lm = get('General_Category/Modifier_Letter');
const Lo = get('General_Category/Other_Letter');
const Nl = get('General_Category/Letter_Number');
const Mn = get('General_Category/Nonspacing_Mark');
const Mc = get('General_Category/Spacing_Mark');
const Nd = get('General_Category/Decimal_Number');
const Pc = get('General_Category/Connector_Punctuation');

// Get the Unicode properties needed to construct the ES6 regex.
const ID_Start = get('Binary_Property/ID_Start');
const ID_Continue = get('Binary_Property/ID_Continue');
const Other_ID_Start = get('Binary_Property/Other_ID_Start');

// https://mathiasbynens.be/notes/javascript-identifiers#valid-identifier-names
const es5IdentifierStart = regenerate('$', '_')
	.add(Lu, Ll, Lt, Lm, Lo, Nl)
	.removeRange(0x010000, 0x10FFFF); // Remove astral symbols.
const es5IdentifierPart = es5IdentifierStart.clone()
	.add('\u200C', '\u200D', Mn, Mc, Nd, Pc)
	.removeRange(0x010000, 0x10FFFF); // Remove astral symbols.

// http://ecma-international.org/ecma-262/6.0/#sec-identifier-names-static-semantics-early-errors
// http://unicode.org/reports/tr31/#Default_Identifier_Syntax
// https://bugs.ecmascript.org/show_bug.cgi?id=2717#c0
const es6IdentifierStart = regenerate(ID_Start)
	// Note: this already includes `Other_ID_Start`. http://git.io/wRCAfQ
	.add(
		'$',
		'_'
	);
const es6IdentifierPart = regenerate(ID_Continue)
	// Note: `ID_Continue` already includes `Other_ID_Continue`. http://git.io/wRCAfQ
	.add(Other_ID_Start)
	.add(
		'$',
		'_',
		'\u200C',
		'\u200D'
	);

const useUnicodeFlag = false; // set to `true` to get simpler but (in this case) larger output (see below)
const diff = es6IdentifierPart.clone()
	.remove(es5IdentifierPart);
const pattern = diff.toString({
	'hasUnicodeFlag': useUnicodeFlag
});

console.log(`/${ pattern }/${ useUnicodeFlag ? 'u' : ''}`);

The output is 3,320 bytes:

/[\xB7\u0387\u1369-\u1371\u19DA\u2118\u212E\u309B\u309C]|\uD800[\uDC00-\uDC0B\uDC0D-\uDC26\uDC28-\uDC3A\uDC3C\uDC3D\uDC3F-\uDC4D\uDC50-\uDC5D\uDC80-\uDCFA\uDD40-\uDD74\uDDFD\uDE80-\uDE9C\uDEA0-\uDED0\uDEE0\uDF00-\uDF1F\uDF30-\uDF4A\uDF50-\uDF7A\uDF80-\uDF9D\uDFA0-\uDFC3\uDFC8-\uDFCF\uDFD1-\uDFD5]|\uD801[\uDC00-\uDC9D\uDCA0-\uDCA9\uDCB0-\uDCD3\uDCD8-\uDCFB\uDD00-\uDD27\uDD30-\uDD63\uDE00-\uDF36\uDF40-\uDF55\uDF60-\uDF67]|\uD802[\uDC00-\uDC05\uDC08\uDC0A-\uDC35\uDC37\uDC38\uDC3C\uDC3F-\uDC55\uDC60-\uDC76\uDC80-\uDC9E\uDCE0-\uDCF2\uDCF4\uDCF5\uDD00-\uDD15\uDD20-\uDD39\uDD80-\uDDB7\uDDBE\uDDBF\uDE00-\uDE03\uDE05\uDE06\uDE0C-\uDE13\uDE15-\uDE17\uDE19-\uDE33\uDE38-\uDE3A\uDE3F\uDE60-\uDE7C\uDE80-\uDE9C\uDEC0-\uDEC7\uDEC9-\uDEE6\uDF00-\uDF35\uDF40-\uDF55\uDF60-\uDF72\uDF80-\uDF91]|\uD803[\uDC00-\uDC48\uDC80-\uDCB2\uDCC0-\uDCF2]|\uD804[\uDC00-\uDC46\uDC66-\uDC6F\uDC7F-\uDCBA\uDCD0-\uDCE8\uDCF0-\uDCF9\uDD00-\uDD34\uDD36-\uDD3F\uDD50-\uDD73\uDD76\uDD80-\uDDC4\uDDCA-\uDDCC\uDDD0-\uDDDA\uDDDC\uDE00-\uDE11\uDE13-\uDE37\uDE3E\uDE80-\uDE86\uDE88\uDE8A-\uDE8D\uDE8F-\uDE9D\uDE9F-\uDEA8\uDEB0-\uDEEA\uDEF0-\uDEF9\uDF00-\uDF03\uDF05-\uDF0C\uDF0F\uDF10\uDF13-\uDF28\uDF2A-\uDF30\uDF32\uDF33\uDF35-\uDF39\uDF3C-\uDF44\uDF47\uDF48\uDF4B-\uDF4D\uDF50\uDF57\uDF5D-\uDF63\uDF66-\uDF6C\uDF70-\uDF74]|\uD805[\uDC00-\uDC4A\uDC50-\uDC59\uDC80-\uDCC5\uDCC7\uDCD0-\uDCD9\uDD80-\uDDB5\uDDB8-\uDDC0\uDDD8-\uDDDD\uDE00-\uDE40\uDE44\uDE50-\uDE59\uDE80-\uDEB7\uDEC0-\uDEC9\uDF00-\uDF19\uDF1D-\uDF2B\uDF30-\uDF39]|\uD806[\uDCA0-\uDCE9\uDCFF\uDEC0-\uDEF8]|\uD807[\uDC00-\uDC08\uDC0A-\uDC36\uDC38-\uDC40\uDC50-\uDC59\uDC72-\uDC8F\uDC92-\uDCA7\uDCA9-\uDCB6]|\uD808[\uDC00-\uDF99]|\uD809[\uDC00-\uDC6E\uDC80-\uDD43]|[\uD80C\uD81C-\uD820\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872][\uDC00-\uDFFF]|\uD80D[\uDC00-\uDC2E]|\uD811[\uDC00-\uDE46]|\uD81A[\uDC00-\uDE38\uDE40-\uDE5E\uDE60-\uDE69\uDED0-\uDEED\uDEF0-\uDEF4\uDF00-\uDF36\uDF40-\uDF43\uDF50-\uDF59\uDF63-\uDF77\uDF7D-\uDF8F]|\uD81B[\uDF00-\uDF44\uDF50-\uDF7E\uDF8F-\uDF9F\uDFE0]|\uD821[\uDC00-\uDFEC]|\uD822[\uDC00-\uDEF2]|\uD82C[\uDC00\uDC01]|\uD82F[\uDC00-\uDC6A\uDC70-\uDC7C\uDC80-\uDC88\uDC90-\uDC99\uDC9D\uDC9E]|\uD834[\uDD65-\uDD69\uDD6D-\uDD72\uDD7B-\uDD82\uDD85-\uDD8B\uDDAA-\uDDAD\uDE42-\uDE44]|\uD835[\uDC00-\uDC54\uDC56-\uDC9C\uDC9E\uDC9F\uDCA2\uDCA5\uDCA6\uDCA9-\uDCAC\uDCAE-\uDCB9\uDCBB\uDCBD-\uDCC3\uDCC5-\uDD05\uDD07-\uDD0A\uDD0D-\uDD14\uDD16-\uDD1C\uDD1E-\uDD39\uDD3B-\uDD3E\uDD40-\uDD44\uDD46\uDD4A-\uDD50\uDD52-\uDEA5\uDEA8-\uDEC0\uDEC2-\uDEDA\uDEDC-\uDEFA\uDEFC-\uDF14\uDF16-\uDF34\uDF36-\uDF4E\uDF50-\uDF6E\uDF70-\uDF88\uDF8A-\uDFA8\uDFAA-\uDFC2\uDFC4-\uDFCB\uDFCE-\uDFFF]|\uD836[\uDE00-\uDE36\uDE3B-\uDE6C\uDE75\uDE84\uDE9B-\uDE9F\uDEA1-\uDEAF]|\uD838[\uDC00-\uDC06\uDC08-\uDC18\uDC1B-\uDC21\uDC23\uDC24\uDC26-\uDC2A]|\uD83A[\uDC00-\uDCC4\uDCD0-\uDCD6\uDD00-\uDD4A\uDD50-\uDD59]|\uD83B[\uDE00-\uDE03\uDE05-\uDE1F\uDE21\uDE22\uDE24\uDE27\uDE29-\uDE32\uDE34-\uDE37\uDE39\uDE3B\uDE42\uDE47\uDE49\uDE4B\uDE4D-\uDE4F\uDE51\uDE52\uDE54\uDE57\uDE59\uDE5B\uDE5D\uDE5F\uDE61\uDE62\uDE64\uDE67-\uDE6A\uDE6C-\uDE72\uDE74-\uDE77\uDE79-\uDE7C\uDE7E\uDE80-\uDE89\uDE8B-\uDE9B\uDEA1-\uDEA3\uDEA5-\uDEA9\uDEAB-\uDEBB]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1]|\uD87E[\uDC00-\uDE1D]|\uDB40[\uDD00-\uDDEF]/

With useUnicodeFlag enabled, the output is simpler, but larger (4,280 bytes):

/[\xB7\u0387\u1369-\u1371\u19DA\u2118\u212E\u309B\u309C\u{10000}-\u{1000B}\u{1000D}-\u{10026}\u{10028}-\u{1003A}\u{1003C}\u{1003D}\u{1003F}-\u{1004D}\u{10050}-\u{1005D}\u{10080}-\u{100FA}\u{10140}-\u{10174}\u{101FD}\u{10280}-\u{1029C}\u{102A0}-\u{102D0}\u{102E0}\u{10300}-\u{1031F}\u{10330}-\u{1034A}\u{10350}-\u{1037A}\u{10380}-\u{1039D}\u{103A0}-\u{103C3}\u{103C8}-\u{103CF}\u{103D1}-\u{103D5}\u{10400}-\u{1049D}\u{104A0}-\u{104A9}\u{104B0}-\u{104D3}\u{104D8}-\u{104FB}\u{10500}-\u{10527}\u{10530}-\u{10563}\u{10600}-\u{10736}\u{10740}-\u{10755}\u{10760}-\u{10767}\u{10800}-\u{10805}\u{10808}\u{1080A}-\u{10835}\u{10837}\u{10838}\u{1083C}\u{1083F}-\u{10855}\u{10860}-\u{10876}\u{10880}-\u{1089E}\u{108E0}-\u{108F2}\u{108F4}\u{108F5}\u{10900}-\u{10915}\u{10920}-\u{10939}\u{10980}-\u{109B7}\u{109BE}\u{109BF}\u{10A00}-\u{10A03}\u{10A05}\u{10A06}\u{10A0C}-\u{10A13}\u{10A15}-\u{10A17}\u{10A19}-\u{10A33}\u{10A38}-\u{10A3A}\u{10A3F}\u{10A60}-\u{10A7C}\u{10A80}-\u{10A9C}\u{10AC0}-\u{10AC7}\u{10AC9}-\u{10AE6}\u{10B00}-\u{10B35}\u{10B40}-\u{10B55}\u{10B60}-\u{10B72}\u{10B80}-\u{10B91}\u{10C00}-\u{10C48}\u{10C80}-\u{10CB2}\u{10CC0}-\u{10CF2}\u{11000}-\u{11046}\u{11066}-\u{1106F}\u{1107F}-\u{110BA}\u{110D0}-\u{110E8}\u{110F0}-\u{110F9}\u{11100}-\u{11134}\u{11136}-\u{1113F}\u{11150}-\u{11173}\u{11176}\u{11180}-\u{111C4}\u{111CA}-\u{111CC}\u{111D0}-\u{111DA}\u{111DC}\u{11200}-\u{11211}\u{11213}-\u{11237}\u{1123E}\u{11280}-\u{11286}\u{11288}\u{1128A}-\u{1128D}\u{1128F}-\u{1129D}\u{1129F}-\u{112A8}\u{112B0}-\u{112EA}\u{112F0}-\u{112F9}\u{11300}-\u{11303}\u{11305}-\u{1130C}\u{1130F}\u{11310}\u{11313}-\u{11328}\u{1132A}-\u{11330}\u{11332}\u{11333}\u{11335}-\u{11339}\u{1133C}-\u{11344}\u{11347}\u{11348}\u{1134B}-\u{1134D}\u{11350}\u{11357}\u{1135D}-\u{11363}\u{11366}-\u{1136C}\u{11370}-\u{11374}\u{11400}-\u{1144A}\u{11450}-\u{11459}\u{11480}-\u{114C5}\u{114C7}\u{114D0}-\u{114D9}\u{11580}-\u{115B5}\u{115B8}-\u{115C0}\u{115D8}-\u{115DD}\u{11600}-\u{11640}\u{11644}\u{11650}-\u{11659}\u{11680}-\u{116B7}\u{116C0}-\u{116C9}\u{11700}-\u{11719}\u{1171D}-\u{1172B}\u{11730}-\u{11739}\u{118A0}-\u{118E9}\u{118FF}\u{11AC0}-\u{11AF8}\u{11C00}-\u{11C08}\u{11C0A}-\u{11C36}\u{11C38}-\u{11C40}\u{11C50}-\u{11C59}\u{11C72}-\u{11C8F}\u{11C92}-\u{11CA7}\u{11CA9}-\u{11CB6}\u{12000}-\u{12399}\u{12400}-\u{1246E}\u{12480}-\u{12543}\u{13000}-\u{1342E}\u{14400}-\u{14646}\u{16800}-\u{16A38}\u{16A40}-\u{16A5E}\u{16A60}-\u{16A69}\u{16AD0}-\u{16AED}\u{16AF0}-\u{16AF4}\u{16B00}-\u{16B36}\u{16B40}-\u{16B43}\u{16B50}-\u{16B59}\u{16B63}-\u{16B77}\u{16B7D}-\u{16B8F}\u{16F00}-\u{16F44}\u{16F50}-\u{16F7E}\u{16F8F}-\u{16F9F}\u{16FE0}\u{17000}-\u{187EC}\u{18800}-\u{18AF2}\u{1B000}\u{1B001}\u{1BC00}-\u{1BC6A}\u{1BC70}-\u{1BC7C}\u{1BC80}-\u{1BC88}\u{1BC90}-\u{1BC99}\u{1BC9D}\u{1BC9E}\u{1D165}-\u{1D169}\u{1D16D}-\u{1D172}\u{1D17B}-\u{1D182}\u{1D185}-\u{1D18B}\u{1D1AA}-\u{1D1AD}\u{1D242}-\u{1D244}\u{1D400}-\u{1D454}\u{1D456}-\u{1D49C}\u{1D49E}\u{1D49F}\u{1D4A2}\u{1D4A5}\u{1D4A6}\u{1D4A9}-\u{1D4AC}\u{1D4AE}-\u{1D4B9}\u{1D4BB}\u{1D4BD}-\u{1D4C3}\u{1D4C5}-\u{1D505}\u{1D507}-\u{1D50A}\u{1D50D}-\u{1D514}\u{1D516}-\u{1D51C}\u{1D51E}-\u{1D539}\u{1D53B}-\u{1D53E}\u{1D540}-\u{1D544}\u{1D546}\u{1D54A}-\u{1D550}\u{1D552}-\u{1D6A5}\u{1D6A8}-\u{1D6C0}\u{1D6C2}-\u{1D6DA}\u{1D6DC}-\u{1D6FA}\u{1D6FC}-\u{1D714}\u{1D716}-\u{1D734}\u{1D736}-\u{1D74E}\u{1D750}-\u{1D76E}\u{1D770}-\u{1D788}\u{1D78A}-\u{1D7A8}\u{1D7AA}-\u{1D7C2}\u{1D7C4}-\u{1D7CB}\u{1D7CE}-\u{1D7FF}\u{1DA00}-\u{1DA36}\u{1DA3B}-\u{1DA6C}\u{1DA75}\u{1DA84}\u{1DA9B}-\u{1DA9F}\u{1DAA1}-\u{1DAAF}\u{1E000}-\u{1E006}\u{1E008}-\u{1E018}\u{1E01B}-\u{1E021}\u{1E023}\u{1E024}\u{1E026}-\u{1E02A}\u{1E800}-\u{1E8C4}\u{1E8D0}-\u{1E8D6}\u{1E900}-\u{1E94A}\u{1E950}-\u{1E959}\u{1EE00}-\u{1EE03}\u{1EE05}-\u{1EE1F}\u{1EE21}\u{1EE22}\u{1EE24}\u{1EE27}\u{1EE29}-\u{1EE32}\u{1EE34}-\u{1EE37}\u{1EE39}\u{1EE3B}\u{1EE42}\u{1EE47}\u{1EE49}\u{1EE4B}\u{1EE4D}-\u{1EE4F}\u{1EE51}\u{1EE52}\u{1EE54}\u{1EE57}\u{1EE59}\u{1EE5B}\u{1EE5D}\u{1EE5F}\u{1EE61}\u{1EE62}\u{1EE64}\u{1EE67}-\u{1EE6A}\u{1EE6C}-\u{1EE72}\u{1EE74}-\u{1EE77}\u{1EE79}-\u{1EE7C}\u{1EE7E}\u{1EE80}-\u{1EE89}\u{1EE8B}-\u{1EE9B}\u{1EEA1}-\u{1EEA3}\u{1EEA5}-\u{1EEA9}\u{1EEAB}-\u{1EEBB}\u{20000}-\u{2A6D6}\u{2A700}-\u{2B734}\u{2B740}-\u{2B81D}\u{2B820}-\u{2CEA1}\u{2F800}-\u{2FA1D}\u{E0100}-\u{E01EF}]/u

@danprince
Copy link
Author

This description is kind of vague. Which browsers?

Updated the description.

Do you think the edge case of non-ES6 identifier compliant unicode is big enough to warrant adding another that extra complexity to the package? Seeing as this obviously hasn't broken for anyone else yet, I'd guess that there just aren't many people trying to use unicode in their property names.

@mathiasbynens
Copy link
Contributor

Do you think the edge case […] is big enough to warrant adding […]

That edge case is exactly what this issue is about. The current patch works, but sometimes results in larger output than necessary. The patch + the regular expression above would special-case only the identifiers that need special-casing, resulting in optimal output size, which might be more desirable for a minifier. It’s not up to me to make that call though :)

@mathiasbynens
Copy link
Contributor

Updated the description.

Sorry — I meant the description in the commit message

@@ -12,6 +12,11 @@ module.exports = function({ types: t }) {
return;
}

const containsUnicode = key.value.match(/[^a-z0-9$_]/i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we don’t care about the actual matches, let’s use test instead of match.

Firefox 50.1.0 (one behind latest stable) and Safari 10.0.2 (latest stable)
are unable to parse object literal properties that contain unicode characters
which are not valid ES5 identifiers.

The minifier saves bytes by converting objects such as { '\u2118':'wp' } into
{℘:'wp'}, which only works when the key is a valid identifier.

The fix involves bypassing any properties with string representation
containing characters outside of the a-zA-Z0-9_$ range.
@danprince
Copy link
Author

That edge case is exactly what this issue is about. The current patch works, but sometimes results in larger output than necessary.

Sure, but the only times it ever results in larger output is when an object property contains a unicode character that isn't a valid ES6 identifier. Doesn't seem like that's going to happen often enough to be worth the computational overhead of having to check every object property name through that crazy regular expression.

Sorry — I meant the description in the commit message

Got it.

@mathiasbynens
Copy link
Contributor

Doesn't seem like that's going to happen often enough to be worth the computational overhead of having to check every object property name through that crazy regular expression.

Regular expressions are heavily optimized in all JS engines — I doubt the computation overhead is significant, but haven’t benchmarked it yet. For a build tool such as a minifier IMHO it’s worth it to spend a couple of extra milliseconds to produce optimal output. You only build once, but the output is consumed (presumably) more than that.

@patrick-steele-idem
Copy link

I also ran into the problem when using Babili to minify he. My vote would be to go with with the solution that @danprince provided for now since it solves the problem and unicode property names are sufficiently rare. Maybe a follow up PR for the more precise fix if someone thought that was needed? In the meantime, I'll have to use a fork of this plugin which is unfortunate, but it would be great to see this PR or another fix merged in and a new version published. Thanks.

patrick-steele-idem added a commit to patrick-steele-idem/babili that referenced this pull request Feb 28, 2017
@patrick-steele-idem
Copy link

I did some additional investigation and found that the babel-plugin-minify-constant-folding plugin will also introduce Identifier nodes for object keys even if the key name contains special unicode characters because t.isValidateIdentifier() doesn't restrict to safe identifiers (safe for currently the latest Safari v10.0.3, that is). The babel-plugin-transform-property-literals plugin only transforms string literal keys (not identifier keys). Therefore, even with this patch, he does not get minified in a way that can be loaded in the latest Safari.

I tried adding additional plugins/presets to run after babili to convert the unsafe object keys back to string literals, but for some reason babili plugins always ran last. Any ideas?

I decided to create another PR that only allows safe object keys to be used as identifiers: #445

Please consider merging that PR because the current babili does not always produce JS code that will load in all browsers.

@timsuchanek
Copy link

Yes, would be great if the PR gets merged as this stops us from using babili (which we need in order to have webpack treeshaking with TypeScript)
Is there anything blocking the merge where help is needed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Syntax errors after minifying unicode
4 participants