New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: wrong unicode regexp #6931

Open
vitaliylag opened this Issue Nov 29, 2017 · 24 comments

Comments

Projects
None yet
6 participants
@vitaliylag

vitaliylag commented Nov 29, 2017

Here is my code:

var a = 'a',                        //usual character
    s = String.fromCharCode(55296), //start of surrogate pair
    e = String.fromCharCode(56320); //end of surrogate pair

console.log((a + e).match(/./ug).length); //must print "2" but prints "1" after babel
console.log((e + e).match(/./ug).length); //must print "2" but prints "1" after babel

Babel transforms it to this code:

var a = 'a',
    s = String.fromCharCode(55296),
    e = String.fromCharCode(56320);

console.log((a + e).match(/(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])/g).length); //must print "2" but prints "1" after babel
console.log((e + e).match(/(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])/g).length); //must print "2" but prints "1" after babel

As you can see "." character is processed totally wrong.

Currently babel converts it to:

/(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])/

but must convert to:

/(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]?)/

This not only much shorter but also does not contain such easy bugs.

I also wrote a test:

'use strict';



if (typeof describe === 'undefined') {
	//Use "mocha" command to run this test (npm install -g mocha and move the file to dir with name "test")
	global.describe = (name, clb) => clb();
	global.it = (name, clb) => clb();
}



const assert = require('assert');



const a = 'a',                        //usual character
      s = String.fromCharCode(55296), //start of surrogate pair
      e = String.fromCharCode(56320); //end of surrogate pair

testAll('native',      /./ug);
testAll('my',          /(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]?)/g);
testAll('my-shorted',  /(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\0-\t\x0B\f\x0E-\u2027\u202A-\uFFFF])/g);                //It is shorter but performance may be worse
testAll('babel-fixed', /(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\uDC00-\uDFFF])/g);
testAll('babel',       /(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])/g);



function testAll(name, rexp) {
	describe(name, function() {
		testOne(rexp, a + a, [a, a]);
		testOne(rexp, a + s, [a, s]);
		testOne(rexp, a + e, [a, e]);
		testOne(rexp, s + a, [s, a]);
		testOne(rexp, s + s, [s, s]);
		testOne(rexp, s + e, [s + e]);
		testOne(rexp, e + a, [e, a]);
		testOne(rexp, e + s, [e, s]);
		testOne(rexp, e + e, [e, e]);
		
		testOne(rexp, a, [a]);
		testOne(rexp, s, [s]);
		testOne(rexp, e, [e]);
		
		testOne(rexp, a + a + a, [a, a, a]);
		testOne(rexp, a + a + s, [a, a, s]);
		testOne(rexp, a + a + e, [a, a, e]);
		testOne(rexp, a + s + a, [a, s, a]);
		testOne(rexp, a + s + s, [a, s, s]);
		testOne(rexp, a + s + e, [a, s + e]);
		testOne(rexp, a + e + a, [a, e, a]);
		testOne(rexp, a + e + s, [a, e, s]);
		testOne(rexp, a + e + e, [a, e, e]);
		
		testOne(rexp, s + a + a, [s, a, a]);
		testOne(rexp, s + a + s, [s, a, s]);
		testOne(rexp, s + a + e, [s, a, e]);
		testOne(rexp, s + s + a, [s, s, a]);
		testOne(rexp, s + s + s, [s, s, s]);
		testOne(rexp, s + s + e, [s, s + e]);
		testOne(rexp, s + e + a, [s + e, a]);
		testOne(rexp, s + e + s, [s + e, s]);
		testOne(rexp, s + e + e, [s + e, e]);
		
		testOne(rexp, e + a + a, [e, a, a]);
		testOne(rexp, e + a + s, [e, a, s]);
		testOne(rexp, e + a + e, [e, a, e]);
		testOne(rexp, e + s + a, [e, s, a]);
		testOne(rexp, e + s + s, [e, s, s]);
		testOne(rexp, e + s + e, [e, s + e]);
		testOne(rexp, e + e + a, [e, e, a]);
		testOne(rexp, e + e + s, [e, e, s]);
		testOne(rexp, e + e + e, [e, e, e]);
	});
}



function testOne(rexp, s, expectedResult) {
	it('oneTest', function() {
		const a = s.match(rexp);
		assert.deepStrictEqual(a, expectedResult);
		assert.strictEqual(a.length, expectedResult.length);
	});
}

Native implementation and my implementation pass the test but babel implementation does not.


Versions:
babel-core: 6.26.0
babel-preset-env: 1.6.1
babel-loader: 7.1.2
(I am using Babel with webpack)

The test field in the main page https://babeljs.io/ has the same bug.

@babel-bot

This comment has been minimized.

Show comment
Hide comment
@babel-bot

babel-bot Nov 29, 2017

Collaborator

Hey @vitaliylag! We really appreciate you taking the time to report an issue. The collaborators
on this project attempt to help as many people as possible, but we're a limited number of volunteers,
so it's possible this won't be addressed swiftly.

If you need any help, or just have general Babel or JavaScript questions, we have a vibrant Slack
community that typically always has someone willing to help. You can sign-up here
for an invite.

Collaborator

babel-bot commented Nov 29, 2017

Hey @vitaliylag! We really appreciate you taking the time to report an issue. The collaborators
on this project attempt to help as many people as possible, but we're a limited number of volunteers,
so it's possible this won't be addressed swiftly.

If you need any help, or just have general Babel or JavaScript questions, we have a vibrant Slack
community that typically always has someone willing to help. You can sign-up here
for an invite.

@nicolo-ribaudo

This comment has been minimized.

Show comment
Hide comment
@nicolo-ribaudo

nicolo-ribaudo Nov 29, 2017

Member

@mathiasbynens This might be a bug with regexpu-core?

Member

nicolo-ribaudo commented Nov 29, 2017

@mathiasbynens This might be a bug with regexpu-core?

@mathiasbynens

This comment has been minimized.

Show comment
Hide comment
@mathiasbynens

mathiasbynens Nov 29, 2017

Contributor

From https://github.com/mathiasbynens/regenerate#regenerateprototypetostringoptions:

Note that lone low surrogates cannot be matched accurately using regular expressions in JavaScript. Regenerate’s output makes a best-effort approach but there can be false negatives in this regard.

Lookbehind assertions make this possible but they’re not widely shipped yet so we just cannot rely on them for now.

Contributor

mathiasbynens commented Nov 29, 2017

From https://github.com/mathiasbynens/regenerate#regenerateprototypetostringoptions:

Note that lone low surrogates cannot be matched accurately using regular expressions in JavaScript. Regenerate’s output makes a best-effort approach but there can be false negatives in this regard.

Lookbehind assertions make this possible but they’re not widely shipped yet so we just cannot rely on them for now.

@mathiasbynens

This comment has been minimized.

Show comment
Hide comment
@mathiasbynens

mathiasbynens Nov 29, 2017

Contributor

@vitaliylag Do you want to submit a PR to Regenerate?

Contributor

mathiasbynens commented Nov 29, 2017

@vitaliylag Do you want to submit a PR to Regenerate?

@vitaliylag

This comment has been minimized.

Show comment
Hide comment
@vitaliylag

vitaliylag Nov 30, 2017

I don't know what is Regenerate. I just don't understand why the Babel process "." character such strange way. Right regular expression I wrote above is so easy and really makes sense.

vitaliylag commented Nov 30, 2017

I don't know what is Regenerate. I just don't understand why the Babel process "." character such strange way. Right regular expression I wrote above is so easy and really makes sense.

@mathiasbynens

This comment has been minimized.

Show comment
Hide comment
@mathiasbynens

mathiasbynens Nov 30, 2017

Contributor

Babel uses regexpu-core which uses Regenerate, which is the source of this bug. Would you be open to submitting a patch based on your suggested RegExp?

Your test uses U+DC00 which is a low surrogate or trail surrogate.

Contributor

mathiasbynens commented Nov 30, 2017

Babel uses regexpu-core which uses Regenerate, which is the source of this bug. Would you be open to submitting a patch based on your suggested RegExp?

Your test uses U+DC00 which is a low surrogate or trail surrogate.

@vitaliylag

This comment has been minimized.

Show comment
Hide comment
@vitaliylag

vitaliylag Nov 30, 2017

Your test uses U+DC00 which is a low surrogate or trail surrogate.

Yes, I confused them :)

Would you be open to submitting a patch based on your suggested RegExp?

I do not know how does Regenerate work so I do not know what can I change to fix this bug. Searching in the code does not match bad regexp so I guess I cannot create a PR.

vitaliylag commented Nov 30, 2017

Your test uses U+DC00 which is a low surrogate or trail surrogate.

Yes, I confused them :)

Would you be open to submitting a patch based on your suggested RegExp?

I do not know how does Regenerate work so I do not know what can I change to fix this bug. Searching in the code does not match bad regexp so I guess I cannot create a PR.

@mathiasbynens

This comment has been minimized.

Show comment
Hide comment
@vitaliylag

This comment has been minimized.

Show comment
Hide comment
@vitaliylag

vitaliylag Nov 30, 2017

No, it looks too difficult to understand it all.

vitaliylag commented Nov 30, 2017

No, it looks too difficult to understand it all.

@mathiasbynens

This comment has been minimized.

Show comment
Hide comment
@mathiasbynens

mathiasbynens Nov 30, 2017

Contributor

@vitaliylag Ok, here’s the underlying problem:

// The goal is to match lone surrogates only.
const regex = /^a(?:[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b$/;

// Lone high surrogate U+D800:
console.assert(regex.test('a\uD800b') === true);
// Lone low surrogate U+DC00:
console.assert(regex.test('a\uDC00b') === true);
// Surrogate pair (i.e. no lone surrogates):
console.assert(regex.test('a\u{1D306}b') === false);

console.log('You did it!');

Can you think of a regular expression that passes all the assertions without using lookbehind assertions? If so, please let me know.

Update: This test case is misleading as the problem is more complex. #6931 (comment) is hopefully more accurate.

Contributor

mathiasbynens commented Nov 30, 2017

@vitaliylag Ok, here’s the underlying problem:

// The goal is to match lone surrogates only.
const regex = /^a(?:[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b$/;

// Lone high surrogate U+D800:
console.assert(regex.test('a\uD800b') === true);
// Lone low surrogate U+DC00:
console.assert(regex.test('a\uDC00b') === true);
// Surrogate pair (i.e. no lone surrogates):
console.assert(regex.test('a\u{1D306}b') === false);

console.log('You did it!');

Can you think of a regular expression that passes all the assertions without using lookbehind assertions? If so, please let me know.

Update: This test case is misleading as the problem is more complex. #6931 (comment) is hopefully more accurate.

@jridgewell

This comment has been minimized.

Show comment
Hide comment
@jridgewell

jridgewell Nov 30, 2017

Member

/^a(?:[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\uDC00-\uDFFF])b$/?

Member

jridgewell commented Nov 30, 2017

/^a(?:[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\uDC00-\uDFFF])b$/?

@vitaliylag

This comment has been minimized.

Show comment
Hide comment
@vitaliylag

vitaliylag Nov 30, 2017

What for do you need to match lone surrogates? What is the original regexp that we need to convert to ES5? If it was /^a.b$/u then third call should return true not false. Regexp I suggested already returns true so everything is good.

vitaliylag commented Nov 30, 2017

What for do you need to match lone surrogates? What is the original regexp that we need to convert to ES5? If it was /^a.b$/u then third call should return true not false. Regexp I suggested already returns true so everything is good.

@vitaliylag

This comment has been minimized.

Show comment
Hide comment
@vitaliylag

vitaliylag Nov 30, 2017

I guess I know what the problem you want to solve. When I call something like this var set = regenerate().add('\uD800') you should find only lone surrogates. That is not related with dot but interesting problem.

vitaliylag commented Nov 30, 2017

I guess I know what the problem you want to solve. When I call something like this var set = regenerate().add('\uD800') you should find only lone surrogates. That is not related with dot but interesting problem.

@vitaliylag

This comment has been minimized.

Show comment
Hide comment
@vitaliylag

vitaliylag Nov 30, 2017

But should you even use Regenerate to process dots? I guess on every dot you add all ranges (0-9, 11-12, 14-... and so on) but I don't know why you should add ranges because dot is special case and Regenerate is not needed for it.

vitaliylag commented Nov 30, 2017

But should you even use Regenerate to process dots? I guess on every dot you add all ranges (0-9, 11-12, 14-... and so on) but I don't know why you should add ranges because dot is special case and Regenerate is not needed for it.

@mathiasbynens

This comment has been minimized.

Show comment
Hide comment
@mathiasbynens

mathiasbynens Nov 30, 2017

Contributor

@jridgewell Right, I should have done a better job of creating a test case. Let me try again:

{
	// The goal is to match lone surrogates only.
	const regex = /([\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])/;

	// Lone high surrogate U+D800:
	console.assert('\uD800'.match(regex)[1] === '\uD800');
	// Lone low surrogate U+DC00:
	console.assert('\uDC00'.match(regex)[1] === '\uDC00');
	// Surrogate pair (i.e. no lone surrogates):
	console.assert('\uD834\uDF06'.match(regex) === null);
}

{
	// This must be the same regular expression as above, with only `^a` added at
	// the beginning and `b$` added at the end.
	const regex = /^a([\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b$/;

	// Lone high surrogate U+D800:
	console.assert('a\uD800b'.match(regex)[1] === '\uD800');
	// Lone low surrogate U+DC00:
	console.assert('a\uDC00b'.match(regex)[1] === '\uDC00');
	// Surrogate pair (i.e. no lone surrogates):
	console.assert('a\uD834\uDF06b'.match(regex) === null);
}

console.log('You did it!');

@vitaliylag

[…] find only lone surrogates. That is not related with dot but interesting problem.

It is related because /./u matches lone surrogates.

But should you even use regenerate to process dots? I guess you add all ranges (0-9, 11-12, 14-... and so on) but I don't know why you should add ranges because dot is special case and Regenerate is not needed for it.

/./u is just one example of a regexp that matches lone surrogates. There are infinitely more, and we cannot special-case them all. That’s why a generic solution is needed.

Contributor

mathiasbynens commented Nov 30, 2017

@jridgewell Right, I should have done a better job of creating a test case. Let me try again:

{
	// The goal is to match lone surrogates only.
	const regex = /([\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])/;

	// Lone high surrogate U+D800:
	console.assert('\uD800'.match(regex)[1] === '\uD800');
	// Lone low surrogate U+DC00:
	console.assert('\uDC00'.match(regex)[1] === '\uDC00');
	// Surrogate pair (i.e. no lone surrogates):
	console.assert('\uD834\uDF06'.match(regex) === null);
}

{
	// This must be the same regular expression as above, with only `^a` added at
	// the beginning and `b$` added at the end.
	const regex = /^a([\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b$/;

	// Lone high surrogate U+D800:
	console.assert('a\uD800b'.match(regex)[1] === '\uD800');
	// Lone low surrogate U+DC00:
	console.assert('a\uDC00b'.match(regex)[1] === '\uDC00');
	// Surrogate pair (i.e. no lone surrogates):
	console.assert('a\uD834\uDF06b'.match(regex) === null);
}

console.log('You did it!');

@vitaliylag

[…] find only lone surrogates. That is not related with dot but interesting problem.

It is related because /./u matches lone surrogates.

But should you even use regenerate to process dots? I guess you add all ranges (0-9, 11-12, 14-... and so on) but I don't know why you should add ranges because dot is special case and Regenerate is not needed for it.

/./u is just one example of a regexp that matches lone surrogates. There are infinitely more, and we cannot special-case them all. That’s why a generic solution is needed.

@vitaliylag

This comment has been minimized.

Show comment
Hide comment
@vitaliylag

vitaliylag Nov 30, 2017

Well. There is two cases:

  1. You need match high surrogate. This is easy because of look ahead: "(?!)".
  2. You need match low surrogate. Actually it is easy too. This is because of current position cannot be at the end of surrogate pair. So you just need check [\uDC00-\uDFFF]. That's all.

Of course you can ask why current position cannot be at end of surrogate pair.

  • If you was searching range of code points >= 65536 then you wrote something this:
    /( \uD877[\uDF55-\uDFFF]   |   [\uD878-\uD87A][\uDC00-\uDFFF]   |   \uD87B[\uDC00-\uDC01] )/
    
    //(but without spaces of course)
    It can match only full surrogate pairs. This example matches code points from 188245 to 191489. I have chosen this range randomly. Even if you will use *?+ you will match full surrogates anyway.
  • If you was searching range of code points <= 55295 you cannot go to the position of some surrogate pair.
  • If you was searching lone high surrogate everything will be OK because of look ahead using "(?!)"
  • If you was searching lone low surrogate you cannot be at the end of surrogate pair because end is already passed.

So as result you don't need look behind but you DO need look ahead.

vitaliylag commented Nov 30, 2017

Well. There is two cases:

  1. You need match high surrogate. This is easy because of look ahead: "(?!)".
  2. You need match low surrogate. Actually it is easy too. This is because of current position cannot be at the end of surrogate pair. So you just need check [\uDC00-\uDFFF]. That's all.

Of course you can ask why current position cannot be at end of surrogate pair.

  • If you was searching range of code points >= 65536 then you wrote something this:
    /( \uD877[\uDF55-\uDFFF]   |   [\uD878-\uD87A][\uDC00-\uDFFF]   |   \uD87B[\uDC00-\uDC01] )/
    
    //(but without spaces of course)
    It can match only full surrogate pairs. This example matches code points from 188245 to 191489. I have chosen this range randomly. Even if you will use *?+ you will match full surrogates anyway.
  • If you was searching range of code points <= 55295 you cannot go to the position of some surrogate pair.
  • If you was searching lone high surrogate everything will be OK because of look ahead using "(?!)"
  • If you was searching lone low surrogate you cannot be at the end of surrogate pair because end is already passed.

So as result you don't need look behind but you DO need look ahead.

@mathiasbynens

This comment has been minimized.

Show comment
Hide comment
@mathiasbynens

mathiasbynens Nov 30, 2017

Contributor

@vitaliylag Not sure I follow. Which regex would you use to solve the latest example I posted?

Contributor

mathiasbynens commented Nov 30, 2017

@vitaliylag Not sure I follow. Which regex would you use to solve the latest example I posted?

@vitaliylag

This comment has been minimized.

Show comment
Hide comment
@vitaliylag

vitaliylag Nov 30, 2017

I forgot third case:

  1. You need match low surrogate but NOTHING is before low surrogate in current regexp (beginning of regexp). Example
    /\uDC00 blablabla/ug
    I know no solutions for this case. But this is very and very rare case so I guess you just should throw exception that this regexp cannot be converted to ES5. You don't need throw exception in Regenerator but in Babel itself.

 

Which regex would you use to solve the latest example I posted?

This is easy:


Example 1 (high surrogate):

Original regexp:

/abc \uD800 fgh/ug

Converted regexp:

/abc \uD800(?![\uDC00-\uDFFF]) fgh/g

Example 2 (low surrogate):

Original regexp:

/abc \uDC00 fgh/ug

Converted regexp:

/abc \uDC00 fgh/g   //it works right in all cases because when we search \uDC00 current
                    //position cannot be equal position of some low surrogate that is
                    //end of surrogate pair.

Example 3 (low surrogate at the start):

/\uDC00 fgh/ug

throw new BabelError('Syntax error');   //You cannot convert it to ES5

Example 4 (low surrogate that looks like start but not):

/^\uDC00 fgh/ug

well be converted to:

/^\uDC00 fgh/g

Example 5 (low surrogate that does not look like start but is a start):

/(a|b*)\uDC00 fgh/ug

throw new BabelError('Syntax error');   //You cannot convert it to ES5

Example 6 (high surrogate):

This example shows why you need look ahead check

Original regexp:

/abc \uD800./ug

Converted regexp:

/abc \uD800(?![\uDC00-\uDFFF])(HERE IS DOT EXPRESSION)/g

Example 7 (high surrogate):

Interesting example

Original regexp:

/abc \uD800(\uDC00)/ug

Converted regexp:

/abc \uD800(?![\uDC00-\uDFFF])(\uDC00)/g   //match nothing but that is right!

Example 8 (high surrogate):

Interesting example 2

Original regexp:

/abc \uD800(\uDC00|d)/ug

Converted regexp:

/abc \uD800(?![\uDC00-\uDFFF])(\uDC00|d)/g   //match only "abc \uD800d" and that is right

vitaliylag commented Nov 30, 2017

I forgot third case:

  1. You need match low surrogate but NOTHING is before low surrogate in current regexp (beginning of regexp). Example
    /\uDC00 blablabla/ug
    I know no solutions for this case. But this is very and very rare case so I guess you just should throw exception that this regexp cannot be converted to ES5. You don't need throw exception in Regenerator but in Babel itself.

 

Which regex would you use to solve the latest example I posted?

This is easy:


Example 1 (high surrogate):

Original regexp:

/abc \uD800 fgh/ug

Converted regexp:

/abc \uD800(?![\uDC00-\uDFFF]) fgh/g

Example 2 (low surrogate):

Original regexp:

/abc \uDC00 fgh/ug

Converted regexp:

/abc \uDC00 fgh/g   //it works right in all cases because when we search \uDC00 current
                    //position cannot be equal position of some low surrogate that is
                    //end of surrogate pair.

Example 3 (low surrogate at the start):

/\uDC00 fgh/ug

throw new BabelError('Syntax error');   //You cannot convert it to ES5

Example 4 (low surrogate that looks like start but not):

/^\uDC00 fgh/ug

well be converted to:

/^\uDC00 fgh/g

Example 5 (low surrogate that does not look like start but is a start):

/(a|b*)\uDC00 fgh/ug

throw new BabelError('Syntax error');   //You cannot convert it to ES5

Example 6 (high surrogate):

This example shows why you need look ahead check

Original regexp:

/abc \uD800./ug

Converted regexp:

/abc \uD800(?![\uDC00-\uDFFF])(HERE IS DOT EXPRESSION)/g

Example 7 (high surrogate):

Interesting example

Original regexp:

/abc \uD800(\uDC00)/ug

Converted regexp:

/abc \uD800(?![\uDC00-\uDFFF])(\uDC00)/g   //match nothing but that is right!

Example 8 (high surrogate):

Interesting example 2

Original regexp:

/abc \uD800(\uDC00|d)/ug

Converted regexp:

/abc \uD800(?![\uDC00-\uDFFF])(\uDC00|d)/g   //match only "abc \uD800d" and that is right
@vitaliylag

This comment has been minimized.

Show comment
Hide comment
@vitaliylag

vitaliylag Nov 30, 2017

Check lone high surrogate:

/[\uD800-\uDBFF](?![\uDC00-\uDFFF])/

Check lone low surrogate:

/[\uDC00-\uDFFF]/   //does not work at the start of regexp

Did you understand my solution and why you don't need lookbehind?

vitaliylag commented Nov 30, 2017

Check lone high surrogate:

/[\uD800-\uDBFF](?![\uDC00-\uDFFF])/

Check lone low surrogate:

/[\uDC00-\uDFFF]/   //does not work at the start of regexp

Did you understand my solution and why you don't need lookbehind?

@vitaliylag

This comment has been minimized.

Show comment
Hide comment
@vitaliylag

vitaliylag Nov 30, 2017

I added fixed version of Babel to tests in first comment (it is called "babel-fixed").

But it'll be great if you optimize dot to my version because of performance and file size. Now dot is sum of:

  1. usual chars
  2. astral chars
  3. lone high surrogates
  4. lone low surrogates

but using my regexp you can optimize it to sum of:

  1. usual chars + lone low surrogates
  2. lone high surrogates + astral chars

I also added "my-shorted" version. It is shorter but performance may be worse.

vitaliylag commented Nov 30, 2017

I added fixed version of Babel to tests in first comment (it is called "babel-fixed").

But it'll be great if you optimize dot to my version because of performance and file size. Now dot is sum of:

  1. usual chars
  2. astral chars
  3. lone high surrogates
  4. lone low surrogates

but using my regexp you can optimize it to sum of:

  1. usual chars + lone low surrogates
  2. lone high surrogates + astral chars

I also added "my-shorted" version. It is shorter but performance may be worse.

@mathiasbynens

This comment has been minimized.

Show comment
Hide comment
@mathiasbynens

mathiasbynens Nov 30, 2017

Contributor

You need match low surrogate but NOTHING is before low surrogate in current regexp (beginning of regexp). Example

That’s an easy case as we can just use ^. This is what the current Regenerate output does.

I’m still a little confused — could you please post the full regular expression that solves #6931 (comment)?

Contributor

mathiasbynens commented Nov 30, 2017

You need match low surrogate but NOTHING is before low surrogate in current regexp (beginning of regexp). Example

That’s an easy case as we can just use ^. This is what the current Regenerate output does.

I’m still a little confused — could you please post the full regular expression that solves #6931 (comment)?

@vitaliylag

This comment has been minimized.

Show comment
Hide comment
@vitaliylag

vitaliylag Nov 30, 2017

That’s an easy case as we can just use ^. This is what the current Regenerate output does.

No, you cannnot do this.

Example:

/\uDC00/u.test('aaa \uDC00');       //returns true
/\uDC00/u.test('aaa \uD800\uDC00'); //returns false

But if you try to convert it then:

/^\uDC00/.test('aaa \uDC00');       //returns false (wrong result)
/^\uDC00/.test('aaa \uD800\uDC00'); //returns false

This is what the current Regenerate output does.

No. It does this:

(?:[^\uD800-\uDBFF]|^)

And it fails the tests because this version captures character it should not capture.

It fails not only in tests with "match" but even in tests with calling "test" function:

/a \uDC00/u.test('a \uDC00');        //return true
/a (?:[^\uD800-\uDBFF]|^)\uDC00/.test('a \uDC00'); //but after Babel returns false. This is bug

My solution does not have such bugs (and any bugs at all). But yeah, my solution cannot work with lone high surrogate at the start of regexp but this is very rare case and I don't even think somebody needs it.

I’m still a little confused — could you please post the full regular expression that solves #6931 (comment)?

To match all lone surrogates you just need match lone high and lone low surrogates:

/[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\uDC00-\uDFFF]/

so the answer is:

{
	// The goal is to match lone surrogates only.
	const regex = / ([\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\uDC00-\uDFFF])/; //I added space because Babel should throw SyntaxError when matching lone low surrogate in the start of regexp
	
	// Lone high surrogate U+D800:
	console.assert(' \uD800'.match(regex)[1] === '\uD800');
	// Lone low surrogate U+DC00:
	console.assert(' \uDC00'.match(regex)[1] === '\uDC00');
	// Surrogate pair (i.e. no lone surrogates):
	console.assert(' \uD834\uDF06'.match(regex) === null);
}

{
	// This must be the same regular expression as above, with only `^a` added at
	// the beginning and `b$` added at the end.
	const regex = /^a([\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\uDC00-\uDFFF])b$/;

	// Lone high surrogate U+D800:
	console.assert('a\uD800b'.match(regex)[1] === '\uD800');
	// Lone low surrogate U+DC00:
	console.assert('a\uDC00b'.match(regex)[1] === '\uDC00');
	// Surrogate pair (i.e. no lone surrogates):
	console.assert('a\uD834\uDF06b'.match(regex) === null);
}

Of course it passes all tests.

vitaliylag commented Nov 30, 2017

That’s an easy case as we can just use ^. This is what the current Regenerate output does.

No, you cannnot do this.

Example:

/\uDC00/u.test('aaa \uDC00');       //returns true
/\uDC00/u.test('aaa \uD800\uDC00'); //returns false

But if you try to convert it then:

/^\uDC00/.test('aaa \uDC00');       //returns false (wrong result)
/^\uDC00/.test('aaa \uD800\uDC00'); //returns false

This is what the current Regenerate output does.

No. It does this:

(?:[^\uD800-\uDBFF]|^)

And it fails the tests because this version captures character it should not capture.

It fails not only in tests with "match" but even in tests with calling "test" function:

/a \uDC00/u.test('a \uDC00');        //return true
/a (?:[^\uD800-\uDBFF]|^)\uDC00/.test('a \uDC00'); //but after Babel returns false. This is bug

My solution does not have such bugs (and any bugs at all). But yeah, my solution cannot work with lone high surrogate at the start of regexp but this is very rare case and I don't even think somebody needs it.

I’m still a little confused — could you please post the full regular expression that solves #6931 (comment)?

To match all lone surrogates you just need match lone high and lone low surrogates:

/[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\uDC00-\uDFFF]/

so the answer is:

{
	// The goal is to match lone surrogates only.
	const regex = / ([\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\uDC00-\uDFFF])/; //I added space because Babel should throw SyntaxError when matching lone low surrogate in the start of regexp
	
	// Lone high surrogate U+D800:
	console.assert(' \uD800'.match(regex)[1] === '\uD800');
	// Lone low surrogate U+DC00:
	console.assert(' \uDC00'.match(regex)[1] === '\uDC00');
	// Surrogate pair (i.e. no lone surrogates):
	console.assert(' \uD834\uDF06'.match(regex) === null);
}

{
	// This must be the same regular expression as above, with only `^a` added at
	// the beginning and `b$` added at the end.
	const regex = /^a([\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\uDC00-\uDFFF])b$/;

	// Lone high surrogate U+D800:
	console.assert('a\uD800b'.match(regex)[1] === '\uD800');
	// Lone low surrogate U+DC00:
	console.assert('a\uDC00b'.match(regex)[1] === '\uDC00');
	// Surrogate pair (i.e. no lone surrogates):
	console.assert('a\uD834\uDF06b'.match(regex) === null);
}

Of course it passes all tests.

@vitaliylag

This comment has been minimized.

Show comment
Hide comment
@vitaliylag

vitaliylag Nov 30, 2017

You try to make "lookbehind" in your solution. But you don't need it: current char cannot be end of surrogate pair. Current char is always lone low surrogate (if it match [\uDC00-\uDFFF]). We already know it is lone so we don't need to check it one more time.

So you just need to delete this "lookbehind" and everything will start working right.


@jridgewell

/^a(?:[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\uDC00-\uDFFF])b$/

Your solution is totally right. It is the same solution as I proposed but without explaining why it works correctly in all cases and without algorithm for Regenerator.


@mathiasbynens, to combile unicode regexps with not unicode regexps you could also add this check:

/(?![\uD800-\uDBFF]|^)/

instead of

/(?:[^\uD800-\uDBFF]|^)/

But that does not make sense because I cannot imagine how somebody can start building regexp that partially has "u" flag and partially does not have "u" flag. Also this hack does not work in the beginning of regexp too because requires at least one matched character in the left. So it does not make sense for fully unicode regexps (with flag "u") too because it will change nothing.

vitaliylag commented Nov 30, 2017

You try to make "lookbehind" in your solution. But you don't need it: current char cannot be end of surrogate pair. Current char is always lone low surrogate (if it match [\uDC00-\uDFFF]). We already know it is lone so we don't need to check it one more time.

So you just need to delete this "lookbehind" and everything will start working right.


@jridgewell

/^a(?:[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\uDC00-\uDFFF])b$/

Your solution is totally right. It is the same solution as I proposed but without explaining why it works correctly in all cases and without algorithm for Regenerator.


@mathiasbynens, to combile unicode regexps with not unicode regexps you could also add this check:

/(?![\uD800-\uDBFF]|^)/

instead of

/(?:[^\uD800-\uDBFF]|^)/

But that does not make sense because I cannot imagine how somebody can start building regexp that partially has "u" flag and partially does not have "u" flag. Also this hack does not work in the beginning of regexp too because requires at least one matched character in the left. So it does not make sense for fully unicode regexps (with flag "u") too because it will change nothing.

@jridgewell

This comment has been minimized.

Show comment
Hide comment
@jridgewell

jridgewell Nov 30, 2017

Member

@mathiasbynens: yours isn't solvable without look-behind. However, I'm not sure it has to be solved:

var a = 'a',                        //usual character
    s = String.fromCharCode(55296), //start of surrogate pair
    e = String.fromCharCode(56320); //end of surrogate pair
r = /(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\uDC00-\uDFFF])/g;

console.log((a).match(r));
console.log((s).match(r));
console.log((e).match(r));
console.log((a + s).match(r));
console.log((a + e).match(r));
console.log((s + e).match(r));
console.log((s + s).match(r));
console.log((e + e).match(r));
Member

jridgewell commented Nov 30, 2017

@mathiasbynens: yours isn't solvable without look-behind. However, I'm not sure it has to be solved:

var a = 'a',                        //usual character
    s = String.fromCharCode(55296), //start of surrogate pair
    e = String.fromCharCode(56320); //end of surrogate pair
r = /(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\uDC00-\uDFFF])/g;

console.log((a).match(r));
console.log((s).match(r));
console.log((e).match(r));
console.log((a + s).match(r));
console.log((a + e).match(r));
console.log((s + e).match(r));
console.log((s + s).match(r));
console.log((e + e).match(r));
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment