Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bring Dart's RegExp support in line with JavaScript: lookbehinds, property escapes, and named groups. #34935

Closed
gspencergoog opened this issue Oct 25, 2018 · 12 comments

Comments

Projects
None yet
7 participants
@gspencergoog
Copy link

commented Oct 25, 2018

So, it appears that JavaScript now does support positive lookbehinds (e.g. "(?<=foo)"), Unicode property escapes (i.e. "\\p{Greek}"), and named capture groups (e.g. "(?<foo>.*)"), at least according to the next release of the ECMA-262 specification. They are already supported in Chrome 64 and later.

https://github.com/tc39/proposal-regexp-named-groups
https://github.com/tc39/proposal-regexp-lookbehind
https://github.com/tc39/proposal-regexp-unicode-property-escapes

Unfortunately, the Dart VM/AOT compiler version of RegExp (i.e. when not compiled to JS) doesn't seem to support them, so the statement "Dart regular expressions have the same syntax and semantics as JavaScript regular expressions" isn't really true anymore. Some regular expressions work on Dartpad, but not in Flutter, for instance.

Dart should implement these too: they're also super useful.

@lrhn

This comment has been minimized.

Copy link
Member

commented Dec 11, 2018

We have a discrepancy now. Dart compiled for the web does support these features, native Dart (including Flutter) does not. It's possible for people to write libraries, e.g., using DDC, that are intended to be portable, but which fail when used on Flutter.

There have been stack-overflow answers suggesting look-behind, and they even tested it in dartpad first. I can't fault them, but it's not a valid answer for a Flutter question.

We should address this in some way. Either:

  • Say that it's deliberate, and do nothing. Web users get to use the full power of ECMAScript RegExps.
  • Make a plan to update the VM RegExp engine.
  • Do something to catch VM-invalid RegExps in browser code (e.g., a debug-mode only syntax check).

Since most RegExps are string literals, the analyzer can already now warn if a RegExp is not a valid ES5.1 RegExp.

@gspencergoog

This comment has been minimized.

Copy link
Author

commented Dec 11, 2018

Also, with the announcement of Hummingbird, the story gets even more muddled: someone could implement a full working Flutter example and even share a link to it on SO, and be confused as to why it fails on Flutter when compiled for mobile.

Yes, there could be an analyzer check that warns users building for mobile that they can't use features meant for the web, but that feels like more of an excuse than a real solution.

I think this fractures Flutter's "write once, run anywhere" magic that Dart powers so well for virtually everything else.

@mit-mit

This comment has been minimized.

Copy link
Member

commented Dec 11, 2018

Just to clarify: We have announced Hummingbird as an experimental project, not as an officially complete & supported product.

But, yes, we'll need to work through this problem, and resolve it to preserve good compatibility across our various host targets.

@mit-mit

This comment has been minimized.

Copy link
Member

commented Dec 11, 2018

moving to @kevmoo to track Hummingbird requirements. Maybe we need a label for these.

@mit-mit mit-mit assigned kevmoo and unassigned mraleph, lrhn and mit-mit Dec 11, 2018

@gspencergoog

This comment has been minimized.

Copy link
Author

commented Dec 11, 2018

Yes, of course, it's just a tech preview. But Hummingbird isn't the only reason to want this: if someone shares business logic between AngularDart and mobile, they'll have similar issues.

@kevmoo

This comment has been minimized.

Copy link
Member

commented Dec 11, 2018

@mraleph mraleph assigned sstrickl and unassigned kevmoo Dec 27, 2018

@mraleph

This comment has been minimized.

Copy link
Contributor

commented Dec 27, 2018

I think this is great midsize project for somebody getting familiar with the VM. Assigning to @sstrickl

@lrhn

This comment has been minimized.

Copy link
Member

commented Feb 26, 2019

We might also want to support the dotAll flag (/s) which makes . match any character, not just non-newline characters. It's not strictly necessary. Since it requires a flag that we don't pass to the JS RegExp, the current Dart RegExp doesn't support this feature on the web.

The same can be said for the Unicode flag (/u), but it's a useful feature, so we want to provide access to it. It's not urgent, but it is important.

The sticky (/y) flag is not something we need directly, we already support that with our matchAsPrefix operation on Pattern. The JS compiled version of that function could use the sticky flag if it's available (can we feature detect that?).

Features enabled entirely by RegExp syntax is the big problem here, because those features may already be available when running on the web, and they then fail when run on Flutter or the VM.

@sstrickl

This comment has been minimized.

Copy link
Contributor

commented Feb 26, 2019

I think adding Unicode property group support (which I've been working on) definitely requires appropriate handling of non-BMP characters (and thus the Unicode flag) if you want to ensure that dart2js and the Dart VM match here. For one, Javascript only allows them in /u mode anyway. In addition, certain Unicode property groups that seem like they'd see fairly common use, like Decimal_Number, contain non-BMP characters. Without that support, a regexp that uses \p{Decimal_Number} would have different matching semantics in dart2js and Dart VM if we tried to allow it in BMP mode (i.e., restricting the property groups to only their BMP characters and throw out others).

Wait, since \p (and \P) is only allowed in Unicode mode, does dart2js just always use the Unicode flag with regular expressions, or does it attempt to detect regexps that use /u-only features and turn it on if they're found? If the former, that's an even bigger issue than I thought, as there are a fair number of changes in semantics of RegExp parsing/matching when that's enabled.

@rakudrama

This comment has been minimized.

Copy link
Member

commented Feb 27, 2019

dart2js RegExp has not changed substantially since it was first written, well before ES2015. It completely ignores the possibility of /u. I'd call it a big issue.
Another issue is browser compatibility. We are trying to deprecate IE11, but we have not done so yet.

@lrhn

This comment has been minimized.

Copy link
Member

commented Feb 27, 2019

There are two issues here:

  • Current incompatibility between VM and dart2js,
  • Lack of features compared to JS.

The former is the more urgent issue. Users can write code that acts differently when compiled for the VM or JavaScript. That issue is restricted to non-flag based changes to RegExp grammar: look-behinds and named capture groups. Those features work in JS RegExps with no new flags, so Dart compiled to JS can use them, and Dart compiled for the VM cannot.

There are other new features in JavaScript regexps that require flags to enable. Since the Dart JS RegExp code does not pass those flags, the features are not available.
This includes unicode regexps (/u and the syntax changes enabled by that flag) and dot-all (/s).
We will want to catch up with those features as well, but that requires changing the RegExp object (in order to add isUnicode and isDotAll getters like we have for the other flags). We could make UnicodeRegExp a new class, but that still only handles one of the flags, and there is no design reason to do it.

Arguably, we should make named capture groups available from the Match object by name, but we can choose to not do that yet (they are also available as positional groups). We can even stealth-introduce it by letting our RegExp implementation return a RegExpMatch with more features, without changing the RegExp interface. Then only users aware of this would be able to down-cast the Match to a RegExpMatch and access the new features. We can then combine that with the other changes above for when we need to change the RegExp interface, and only break things once (if anyone implements RegExp, which they probably shouldn't).

dart-bot pushed a commit that referenced this issue Mar 14, 2019

[VM] Adding regexp lookbehind assertion support.
See https://github.com/tc39/proposal-regexp-lookbehind
for a high-level description of the feature and examples.  This is one of the
features requested in #34935.

This work takes the feature as present in the v8 engine and appropriately
merges it into our irregexp fork. Notable changes to the irregexp codebase to
introduce this feature:

-----

We can no longer assume that all matching proceeds forwards, since lookbehind
matching proceeds backwards. Similarly, we cannot assume that we can only be
at the start of a string if we started matching from that point. The direction
of matching must also be taken into consideration when doing bounds checking,
which previously assumed the engine would never attempt to look before the
start of a string.

-----

We may now parse backreferences to captures before the capture they
reference, since we parse regular expressions left to right, but lookbehinds
perform captures as they evaluate the string from right to left.  Since
RegExpBackReference objects contain a pointer to their corresponding capture,
this means that we may need to create RegExpCapture objects prior to the
parsing of the corresponding captured subexpression.

Thus, RegExpCapture objects are now only initialized with their index, and the
body is set later when the subexpression is encountered and parsed. This means
any method that operates on the body of a RegExpCapture can no longer be const,
which also affects the rest of the RegExpTree class hierarchy. This also means
that we don't have a valid max_match length for backreferences based off the
capture body, and must assume they can end up being any length.

-----


Change-Id: Iffe0e71b17b1a0c6fea77235e8aee5c093005811
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/94540
Commit-Queue: Stevie Strickland <sstrickl@google.com>
Reviewed-by: Martin Kustermann <kustermann@google.com>

dart-bot pushed a commit that referenced this issue Mar 19, 2019

[VM] Partial support for named regexp captures.
See https://github.com/tc39/proposal-regexp-named-groups
for a high-level description of the feature and examples.  This is one of the
features requested in #34935.

This is a partial implementation because while there is a way to retrieve
groups via Dart by name, it requires casting the returned Match to the
new RegExpMatch interface to avoid changing the RegExp interface.
Changing the RegExp interface will happen in a future update, since there
are other planned changes to the RegExp interface coming soon and that way
we only change it once. See #36171
for more details on the planned changes.

Also, since only BMP regular expressions are supported, not full
Unicode ones (i.e., those with the /u flag in ECMAscript), \k<NAME>
will only be parsed as a named back reference if there are named
captures in the string. Otherwise, the \k will be parsed as the identity
escape for backwards compatibility. The new tests illustrate this
difference.

Change-Id: Ieeb0374813db78924c9aa8ac3e652dfb6d4a5934
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/95461
Commit-Queue: Stevie Strickland <sstrickl@google.com>
Reviewed-by: Lasse R.H. Nielsen <lrn@google.com>
Reviewed-by: Martin Kustermann <kustermann@google.com>
Reviewed-by: Jenny Messerly <jmesserly@google.com>
Reviewed-by: Johnni Winther <johnniwinther@google.com>

@dart-bot dart-bot closed this in 5ebb640 Apr 24, 2019

dart-bot pushed a commit that referenced this issue Apr 25, 2019

Revert "[vm] Finish adding support for ECMAScript 2018 features."
This reverts commit 5ebb640.

Reason for revert: <INSERT REASONING HERE>

Original change's description:
> [vm] Finish adding support for ECMAScript 2018 features.
> 
> This work pulls in v8 support for these features with
> appropriate changes for Dart and closes
> #34935.
> 
> This adds support for the following features:
> 
> * Interpreting patterns as Unicode patterns instead of
>   BMP patterns
> * the dotAll flag (`/s`) for changing the behavior
>   of '.' to also match line terminators
> * Escapes for character classes described by Unicode
>   property groups (e.g., \p{Greek} to match all Greek
>   characters, or \P{Greek} for all non-Greek characters).
> 
> The following TC39 proposals describe some of the added features:
> 
> * https://github.com/tc39/proposal-regexp-dotall-flag
> * https://github.com/tc39/proposal-regexp-unicode-property-escapes
> 
> These additional changes are included:
> 
> * Extends named capture group names to include the full
>   range of identifier characters supported by ECMAScript,
>   not just ASCII.
> * Changing the RegExp interface to return RegExpMatch
>   objects, not Match objects, so that downcasting is
>   not necessary to use named capture groups from Dart
> 
> **Note**: The changes to the RegExp interface are a
> breaking change for implementers of the RegExp interface.
> Current users of the RegExp interface (i.e., code using Dart
> RegExp objects) will not be affected.
> 
> Change-Id: I0709ed0a8d5db36680e32bbad585594857b9ace4
> Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/95651
> Commit-Queue: Stevie Strickland <sstrickl@google.com>
> Reviewed-by: Johnni Winther <johnniwinther@google.com>
> Reviewed-by: Lasse R.H. Nielsen <lrn@google.com>
> Reviewed-by: Martin Kustermann <kustermann@google.com>

TBR=lrn@google.com,kustermann@google.com,jmesserly@google.com,johnniwinther@google.com,sstrickl@google.com

# Not skipping CQ checks because original CL landed > 1 day ago.

Change-Id: I1eda0fee4fd9e94df095944049833a67b07277e2
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/100560
Reviewed-by: Keerti Parthasarathy <keertip@google.com>
Reviewed-by: Martin Kustermann <kustermann@google.com>
Commit-Queue: Keerti Parthasarathy <keertip@google.com>
@sstrickl

This comment has been minimized.

Copy link
Contributor

commented Apr 26, 2019

Reopening due to revert in 9238e25, checking into issue that caused revert now.

@sstrickl sstrickl reopened this Apr 26, 2019

@dart-bot dart-bot closed this in 4028fec Apr 29, 2019

dart-bot pushed a commit that referenced this issue Jul 8, 2019

Update doc link to point at ECMAScript 2018.
After the work done for #34935 and other
related issues, we should handle all of the new RegExp featuers in ECMAScript
2018. However, we still link to the ECMAScript 2011 specification. This
corrects the link.

Closes #23047.

Change-Id: If7dd8411d2f425caf1f03a49807121a0d61d96af
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/108272
Reviewed-by: Lasse R.H. Nielsen <lrn@google.com>
Commit-Queue: Teagan Strickland <sstrickl@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.