Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement RegExp v flag with set notation + properties of strings #10984

Conversation

msaboff
Copy link
Contributor

@msaboff msaboff commented Mar 3, 2023

270c824

Implement RegExp `v` flag with set notation + properties of strings
https://bugs.webkit.org/show_bug.cgi?id=241593
rdar://100337109

Reviewed by Yusuke Suzuki.

This change implements the TC39 stage 3 proposal RegExp v flag with set notation + properties of strings,
https://github.com/tc39/proposal-regexp-v-flag.  It adds a new "unicodeSets" compile mode for the Yarr engine.
Like the prior Unicode Yarr features, this change is driven by Unicode Database Files (UCD).
This change includes two such new files, JavaScriptCore/ucd/{emoji-sequences.txt & emoji-zwj-sequences.txt}.

The newly added properties include lists of strings.  These strings are processed via the character class syntax
through.  When it comes to matching however, there is some desuguraing that turns such a property of strings into
a list of alternations.  For example, say a property has strings str1...strN plus a traditional character class,
single-character-class, we create the pattern equivalent of:
     (?:str1|str2|...|strN|[single-character-class])
Per the spec, longer strings appear earlier in the alternation, and before the traditional character class.
This allows for searching for longer properties in a property list where substrings of other strings are included
in that list.

There are new set of combining operators allowed in the class set character classes.  Two character class elements
that appear adjacent to each other implicitly have the Union combining operations.  There is also an Intersection
operation with the && operator and a Subtraction operation with the || operator.

There is new ClassSet parsing that follows new "cleaner" rules that traditional character classes.
The prior ccharacter class constructor and delegates are mostly unchanged, except for the compile mode
now being switched on an enum instead of a bool.

Added check that both 'u' and 'v' flags don't appear in the same RegExp.

Added unicodeSets getter watchpoint to the m_regExpPrimordialPropertiesWatchpointSet.

* JSTests/es6/Proxy_internal_get_calls_RegExp.prototype.flags.js:
* JSTests/stress/regexp-vflag-property-of-strings.js: Added.
(arrayToString):
(objectToString):
(dumpValue):
(compareArray):
(compareGroups):
(testRegExp):
(testRegExpSyntaxError):
* JSTests/stress/static-getter-in-names.js:
* JSTests/test262/config.yaml:
* LayoutTests/js/Object-getOwnPropertyNames-expected.txt:
* LayoutTests/js/script-tests/Object-getOwnPropertyNames.js:
* Source/JavaScriptCore/JavaScriptCore.xcodeproj/project.pbxproj:
* Source/JavaScriptCore/builtins/BuiltinNames.h:
* Source/JavaScriptCore/builtins/RegExpPrototype.js:
(linkTimeConstant.hasObservableSideEffectsForRegExpMatch):
(linkTimeConstant.hasObservableSideEffectsForRegExpSplit):
(overriddenName.string_appeared_here.split):
* Source/JavaScriptCore/builtins/StringPrototype.js:
(linkTimeConstant.hasObservableSideEffectsForStringReplace):
* Source/JavaScriptCore/bytecode/LinkTimeConstant.h:
* Source/JavaScriptCore/dfg/DFGAbstractInterpreterInlines.h:
(JSC::DFG::AbstractInterpreter<AbstractStateType>::executeEffects):
* Source/JavaScriptCore/dfg/DFGFixupPhase.cpp:
(JSC::DFG::FixupPhase::addStringReplacePrimordialChecks):
* Source/JavaScriptCore/dfg/DFGOperations.cpp:
(JSC::DFG::JSC_DEFINE_JIT_OPERATION):
* Source/JavaScriptCore/dfg/DFGStrengthReductionPhase.cpp:
(JSC::DFG::StrengthReductionPhase::handleNode):
* Source/JavaScriptCore/runtime/CachedTypes.cpp:
* Source/JavaScriptCore/runtime/CommonIdentifiers.h:
* Source/JavaScriptCore/runtime/JSGlobalObject.cpp:
(JSC::JSGlobalObject::init):
* Source/JavaScriptCore/runtime/JSGlobalObject.h:
* Source/JavaScriptCore/runtime/JSGlobalObjectInlines.h:
(JSC::JSGlobalObject::regExpProtoUnicodeSetsGetter const):
* Source/JavaScriptCore/runtime/RegExp.h:
* Source/JavaScriptCore/runtime/RegExpCache.h:
* Source/JavaScriptCore/runtime/RegExpObject.cpp:
(JSC::RegExpObject::matchGlobal):
* Source/JavaScriptCore/runtime/RegExpPrototype.cpp:
(JSC::RegExpPrototype::finishCreation):
(JSC::JSC_DEFINE_HOST_FUNCTION):
* Source/JavaScriptCore/ucd/emoji-sequences.txt: Added.
* Source/JavaScriptCore/ucd/emoji-zwj-sequences.txt: Added.
* Source/JavaScriptCore/yarr/Yarr.h:
* Source/JavaScriptCore/yarr/YarrErrorCode.cpp:
(JSC::Yarr::errorMessage):
(JSC::Yarr::errorToThrow):
* Source/JavaScriptCore/yarr/YarrErrorCode.h:
* Source/JavaScriptCore/yarr/YarrFlags.h:
* Source/JavaScriptCore/yarr/YarrInterpreter.cpp:
(JSC::Yarr::ByteTermDumper::ByteTermDumper):
(JSC::Yarr::ByteTermDumper::unicode const):
(JSC::Yarr::ByteTermDumper::unicodeSets const):
(JSC::Yarr::ByteTermDumper::eitherUnicode const):
(JSC::Yarr::Interpreter::tryConsumeBackReference):
(JSC::Yarr::Interpreter::matchCharacterClass):
(JSC::Yarr::Interpreter::backtrackCharacterClass):
(JSC::Yarr::Interpreter::matchDisjunction):
(JSC::Yarr::Interpreter::Interpreter):
(JSC::Yarr::Interpreter::isLegacyCompilation const):
(JSC::Yarr::Interpreter::isUnicodeCompilation const):
(JSC::Yarr::Interpreter::isUnicodeSetsCompilation const):
(JSC::Yarr::Interpreter::isEitherUnicodeCompilation const):
(JSC::Yarr::ByteTermDumper::dumpTerm):
(JSC::Yarr::ByteTermDumper::unicode): Deleted.
* Source/JavaScriptCore/yarr/YarrInterpreter.h:
(JSC::Yarr::BytecodePattern::BytecodePattern):
(JSC::Yarr::BytecodePattern::compileMode const):
(JSC::Yarr::BytecodePattern::unicodeSets const):
(JSC::Yarr::BytecodePattern::eitherUnicode const):
* Source/JavaScriptCore/yarr/YarrJIT.cpp:
* Source/JavaScriptCore/yarr/YarrParser.h:
(JSC::Yarr::Parser::CharacterClassParserDelegate::CharacterClassParserDelegate):
(JSC::Yarr::Parser::ClassSetParserDelegate::ClassSetParserDelegate):
(JSC::Yarr::Parser::ClassSetParserDelegate::begin):
(JSC::Yarr::Parser::ClassSetParserDelegate::nestedClassBegin):
(JSC::Yarr::Parser::ClassSetParserDelegate::doneAfterCharacterClassEnd):
(JSC::Yarr::Parser::ClassSetParserDelegate::setUnionOp):
(JSC::Yarr::Parser::ClassSetParserDelegate::setSubtractOp):
(JSC::Yarr::Parser::ClassSetParserDelegate::setIntersectionOp):
(JSC::Yarr::Parser::ClassSetParserDelegate::flushCachedCharacterIfNeeded):
(JSC::Yarr::Parser::ClassSetParserDelegate::atomPatternCharacter):
(JSC::Yarr::Parser::ClassSetParserDelegate::atomBuiltInCharacterClass):
(JSC::Yarr::Parser::ClassSetParserDelegate::end):
(JSC::Yarr::Parser::ClassSetParserDelegate::error):
(JSC::Yarr::Parser::ClassSetParserDelegate::assertionWordBoundary):
(JSC::Yarr::Parser::ClassSetParserDelegate::atomBackReference):
(JSC::Yarr::Parser::ClassSetParserDelegate::atomNamedBackReference):
(JSC::Yarr::Parser::ClassSetParserDelegate::atomNamedForwardReference):
(JSC::Yarr::Parser::ClassStringDisjunctionParserDelegate::ClassStringDisjunctionParserDelegate):
(JSC::Yarr::Parser::ClassStringDisjunctionParserDelegate::atomPatternCharacter):
(JSC::Yarr::Parser::ClassStringDisjunctionParserDelegate::newAlternative):
(JSC::Yarr::Parser::ClassStringDisjunctionParserDelegate::end):
(JSC::Yarr::Parser::ClassStringDisjunctionParserDelegate::assertionWordBoundary):
(JSC::Yarr::Parser::ClassStringDisjunctionParserDelegate::atomBackReference):
(JSC::Yarr::Parser::ClassStringDisjunctionParserDelegate::atomNamedBackReference):
(JSC::Yarr::Parser::ClassStringDisjunctionParserDelegate::atomNamedForwardReference):
(JSC::Yarr::Parser::ClassStringDisjunctionParserDelegate::atomBuiltInCharacterClass):
(JSC::Yarr::Parser::Parser):
(JSC::Yarr::Parser::isIdentityEscapeAnError):
(JSC::Yarr::Parser::parseEscape):
(JSC::Yarr::Parser::consumePossibleSurrogatePair):
(JSC::Yarr::Parser::parseAtomEscape):
(JSC::Yarr::Parser::parseCharacterClassEscape):
(JSC::Yarr::Parser::parseClassSetEscape):
(JSC::Yarr::Parser::parseClassStringDisjunctionEscape):
(JSC::Yarr::Parser::parseCharacterClass):
(JSC::Yarr::Parser::parseClassSet):
(JSC::Yarr::Parser::parseClassStringDisjunction):
(JSC::Yarr::Parser::parseParenthesesEnd):
(JSC::Yarr::Parser::parseTokens):
(JSC::Yarr::Parser::handleIllegalReferences):
(JSC::Yarr::Parser::tryConsumeUnicodeEscape):
(JSC::Yarr::Parser::tryConsumeUnicodePropertyExpression):
(JSC::Yarr::Parser::isLegacyCompilation const):
(JSC::Yarr::Parser::isUnicodeCompilation const):
(JSC::Yarr::Parser::isUnicodeSetsCompilation const):
(JSC::Yarr::Parser::isEitherUnicodeCompilation const):
(JSC::Yarr::compileMode):
(JSC::Yarr::parse):
* Source/JavaScriptCore/yarr/YarrPattern.cpp:
(JSC::Yarr::CharacterClassConstructor::CharacterClassConstructor):
(JSC::Yarr::CharacterClassConstructor::reset):
(JSC::Yarr::CharacterClassConstructor::combiningSetOp):
(JSC::Yarr::CharacterClassConstructor::append):
(JSC::Yarr::CharacterClassConstructor::appendInverted):
(JSC::Yarr::CharacterClassConstructor::putRange):
(JSC::Yarr::CharacterClassConstructor::atomClassStringDisjunction):
(JSC::Yarr::CharacterClassConstructor::performSetOpWith):
(JSC::Yarr::CharacterClassConstructor::performSetOpWithStrings):
(JSC::Yarr::CharacterClassConstructor::performSetOpWithMatches):
(JSC::Yarr::CharacterClassConstructor::hasInverteStrings):
(JSC::Yarr::CharacterClassConstructor::compareUTF32Strings):
(JSC::Yarr::CharacterClassConstructor::sort):
(JSC::Yarr::CharacterClassConstructor::charClass):
(JSC::Yarr::CharacterClassConstructor::mergeRangesFrom):
(JSC::Yarr::CharacterClassConstructor::unionStrings):
(JSC::Yarr::CharacterClassConstructor::intersectionStrings):
(JSC::Yarr::CharacterClassConstructor::subtractionStrings):
(JSC::Yarr::CharacterClassConstructor::asciiOpSorted):
(JSC::Yarr::CharacterClassConstructor::unicodeOpSorted):
(JSC::Yarr::YarrPatternConstructor::YarrPatternConstructor):
(JSC::Yarr::YarrPatternConstructor::resetForReparsing):
(JSC::Yarr::YarrPatternConstructor::atomPatternCharacter):
(JSC::Yarr::YarrPatternConstructor::atomBuiltInCharacterClass):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassAtom):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassRange):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassBuiltIn):
(JSC::Yarr::YarrPatternConstructor::atomClassStringDisjunction):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassSetOp):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassPushNested):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassPopNested):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassEnd):
(JSC::Yarr::YarrPattern::compile):
(JSC::Yarr::PatternTerm::dump):
(JSC::Yarr::YarrPattern::dumpPatternString):
(JSC::Yarr::YarrPattern::dumpPattern):
* Source/JavaScriptCore/yarr/YarrPattern.h:
(JSC::Yarr::CharacterClass::CharacterClass):
(JSC::Yarr::CharacterClass::hasNonBMPCharacters const):
(JSC::Yarr::CharacterClass::hasOneCharacterSize const):
(JSC::Yarr::CharacterClass::hasOnlyNonBMPCharacters const):
(JSC::Yarr::CharacterClass::hasStrings const):
(JSC::Yarr::CharacterClass::hasSingleCharacters const):
(JSC::Yarr::ClassSet::ClassSet):
(JSC::Yarr::YarrPattern::unicodeSets const):
(JSC::Yarr::YarrPattern::eitherUnicode const):
(JSC::Yarr::YarrPattern::compileMode const):
(JSC::Yarr::CharacterClass::hasNonBMPCharacters): Deleted.
(JSC::Yarr::CharacterClass::hasOneCharacterSize): Deleted.
(JSC::Yarr::CharacterClass::hasOnlyNonBMPCharacters): Deleted.
* Source/JavaScriptCore/yarr/YarrSyntaxChecker.cpp:
(JSC::Yarr::SyntaxChecker::atomClassStringDisjunction):
(JSC::Yarr::SyntaxChecker::atomCharacterClassSetOp):
(JSC::Yarr::SyntaxChecker::atomCharacterClassPushNested):
(JSC::Yarr::SyntaxChecker::atomCharacterClassPopNested):
(JSC::Yarr::checkSyntax):
* Source/JavaScriptCore/yarr/YarrUnicodeProperties.cpp:
(JSC::Yarr::unicodeMatchProperty):
(JSC::Yarr::createUnicodeCharacterClassFor):
(JSC::Yarr::characterClassMayContainStrings):
* Source/JavaScriptCore/yarr/YarrUnicodeProperties.h:
* Source/JavaScriptCore/yarr/generateYarrUnicodePropertyTables.py:
(PropertyData.__init__):
(PropertyData.addMatchString):
(PropertyData.stringsCompare):
(PropertyData):
(PropertyData.sortStrings):
(PropertyData.dumpMatchData):
(PropertyData.convertStringToCppFormat):
(PropertyData.dump):
(PropertyData.dumpAll):
(PropertyData.dumpMayContainStringFunc):
(BinaryProperty.dump):
(SequenceProperty):
(SequenceProperty.__init__):
(SequenceProperty.parsePropertyFile):
(SequenceProperty.dump):
* Source/WebCore/contentextensions/URLFilterParser.cpp:
(WebCore::ContentExtensions::PatternParser::atomClassStringDisjunction):
(WebCore::ContentExtensions::PatternParser::atomCharacterClassSetOp):
(WebCore::ContentExtensions::PatternParser::atomCharacterClassPushNested):
(WebCore::ContentExtensions::PatternParser::atomCharacterClassPopNested):
(WebCore::ContentExtensions::URLFilterParser::addPattern):

Canonical link: https://commits.webkit.org/261188@main

033aaa0

Misc iOS, tvOS & watchOS macOS Linux Windows
  πŸ§ͺ style   πŸ›  ios   πŸ›  mac   πŸ›  wpe   πŸ›  wincairo
  πŸ§ͺ bindings   πŸ›  ios-sim   πŸ›  mac-AS-debug   πŸ§ͺ wpe-wk2
βœ… πŸ§ͺ webkitperl   πŸ§ͺ ios-wk2   πŸ§ͺ api-mac   πŸ›  gtk
  πŸ§ͺ api-ios   πŸ§ͺ mac-wk1   πŸ§ͺ gtk-wk2
  πŸ›  πŸ§ͺ jsc   πŸ›  tv   πŸ§ͺ mac-wk2   πŸ§ͺ api-gtk
βœ… πŸ›  πŸ§ͺ jsc-arm64   πŸ›  tv-sim   πŸ§ͺ mac-AS-debug-wk2 βœ… πŸ›  jsc-armv7
  πŸ›  watch   πŸ§ͺ mac-wk2-stress   πŸ§ͺ jsc-armv7-tests
  πŸ›  watch-sim βœ… πŸ›  jsc-mips
❌ πŸ›  πŸ§ͺ unsafe-merge   πŸ§ͺ jsc-mips-tests

@msaboff msaboff requested a review from a team as a code owner March 3, 2023 06:12
@msaboff msaboff self-assigned this Mar 3, 2023
@msaboff msaboff added the JavaScriptCore For bugs in JavaScriptCore, the JS engine used by WebKit, other than kxmlcore issues. label Mar 3, 2023
Comment on lines 237 to 240
var regexpUnicode = @tryGetById(regexp, "unicodeSets");
if (regexpUnicode !== @regExpProtoUnicodeGetter)
return true;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to add optimization for unicodeSets. See JSGlobalObject.cpp etc. This is comparing with @regExpProtoUnicodeGetter, which is unicode getter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also need to add optimization done for unicode getter etc. Like, DFGAbstractInterpreterInlines.h etc. for example.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See DFGFixupPhase.cpp and DFGAbstractInterpreterInlines.h too. Let's grep with propertyNames->global for example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed these.

Comment on lines 467 to 469
var regexpUnicode = @tryGetById(regexp, "unicodeSets");
if (regexpUnicode !== @regExpProtoUnicodeGetter)
return true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Comment on lines 98 to 100
var regexpUnicode = @tryGetById(regexp, "unicodeSets");
if (regexpUnicode !== @regExpProtoUnicodeGetter)
return true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto. This is unicode getter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Member

@rkirsling rkirsling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please be sure to update the test262 results in expectations.yaml; in addition to the 7(x2) outstanding failures introduced by #10355 (displayed here), this patch appears to introduce an additional 2(x2) failures:

test/built-ins/RegExp/prototype/unicodeSets/uv-flags-constructor.js:
  default: 'Test262Error: Expected a SyntaxError to be thrown but no exception was thrown at all'
  strict mode: 'Test262Error: Expected a SyntaxError to be thrown but no exception was thrown at all'
test/built-ins/RegExp/prototype/unicodeSets/uv-flags.js:
  default: 'Test262: This statement should not be evaluated.'
  strict mode: 'Test262: This statement should not be evaluated.'

@msaboff msaboff force-pushed the eng/Implement-RegExp-v-flag-with-set-notation--properties-of-strings branch from 64ed139 to b632354 Compare March 3, 2023 19:33
@msaboff
Copy link
Contributor Author

msaboff commented Mar 3, 2023

Please be sure to update the test262 results in expectations.yaml; in addition to the 7(x2) outstanding failures introduced by #10355 (displayed here), this patch appears to introduce an additional 2(x2) failures:

test/built-ins/RegExp/prototype/unicodeSets/uv-flags-constructor.js:
  default: 'Test262Error: Expected a SyntaxError to be thrown but no exception was thrown at all'
  strict mode: 'Test262Error: Expected a SyntaxError to be thrown but no exception was thrown at all'
test/built-ins/RegExp/prototype/unicodeSets/uv-flags.js:
  default: 'Test262: This statement should not be evaluated.'
  strict mode: 'Test262: This statement should not be evaluated.'

@rkirsling I'll update test262 results in a follow on PR.

@@ -3555,6 +3555,9 @@ bool AbstractInterpreter<AbstractStateType>::executeEffects(unsigned clobberLimi
if (attemptToFold(m_vm.propertyNames->unicode.impl(), globalObject->regExpProtoUnicodeGetter()))
break;

if (attemptToFold(m_vm.propertyNames->unicode.impl(), globalObject->regExpProtoUnicodeSetsGetter()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not unicode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy paste error - Fixed.

@msaboff msaboff force-pushed the eng/Implement-RegExp-v-flag-with-set-notation--properties-of-strings branch from b632354 to 0045ca6 Compare March 3, 2023 22:24
@msaboff msaboff requested a review from rkirsling March 3, 2023 22:27
@msaboff
Copy link
Contributor Author

msaboff commented Mar 3, 2023

Please be sure to update the test262 results in expectations.yaml; in addition to the 7(x2) outstanding failures introduced by #10355 (displayed here), this patch appears to introduce an additional 2(x2) failures:

test/built-ins/RegExp/prototype/unicodeSets/uv-flags-constructor.js:
  default: 'Test262Error: Expected a SyntaxError to be thrown but no exception was thrown at all'
  strict mode: 'Test262Error: Expected a SyntaxError to be thrown but no exception was thrown at all'
test/built-ins/RegExp/prototype/unicodeSets/uv-flags.js:
  default: 'Test262: This statement should not be evaluated.'
  strict mode: 'Test262: This statement should not be evaluated.'

Fixed these failures in the latest version.

@@ -1416,6 +1416,9 @@ capitalName ## Constructor* lowerName ## Constructor = featureFlag ? capitalName
GetterSetter* regExpProtoUnicodeGetter = getGetterById(this, m_regExpPrototype.get(), vm.propertyNames->unicode);
catchScope.assertNoException();
m_linkTimeConstants[static_cast<unsigned>(LinkTimeConstant::regExpProtoUnicodeGetter)].set(vm, this, regExpProtoUnicodeGetter);
GetterSetter* regExpProtoUnicodeSetsGetter = getGetterById(this, m_regExpPrototype.get(), vm.propertyNames->unicodeSets);
catchScope.assertNoException();
m_linkTimeConstants[static_cast<unsigned>(LinkTimeConstant::regExpProtoUnicodeSetsGetter)].set(vm, this, regExpProtoUnicodeSetsGetter);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think RegExpPrototype watchpoint optimization is probably missing. Please grep all use of existing getters carefully and align the new one with this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I grep'ed the other uses of the RegExp getters and only find uses in DFGAbstractInterpreterInlines.h and DFGFixupPhase. The unicodeSets getter is already added to those sections.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Member

@Constellation Constellation left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rs=me with watchpoint fix.

@msaboff msaboff force-pushed the eng/Implement-RegExp-v-flag-with-set-notation--properties-of-strings branch from 0045ca6 to 033aaa0 Compare March 4, 2023 00:35
@msaboff msaboff added the unsafe-merge-queue Applied to send a pull request to merge-queue, but skip building and testing label Mar 4, 2023
@webkit-early-warning-system webkit-early-warning-system force-pushed the eng/Implement-RegExp-v-flag-with-set-notation--properties-of-strings branch from 033aaa0 to b309ad8 Compare March 4, 2023 01:01
https://bugs.webkit.org/show_bug.cgi?id=241593
rdar://100337109

Reviewed by Yusuke Suzuki.

This change implements the TC39 stage 3 proposal RegExp v flag with set notation + properties of strings,
https://github.com/tc39/proposal-regexp-v-flag.  It adds a new "unicodeSets" compile mode for the Yarr engine.
Like the prior Unicode Yarr features, this change is driven by Unicode Database Files (UCD).
This change includes two such new files, JavaScriptCore/ucd/{emoji-sequences.txt & emoji-zwj-sequences.txt}.

The newly added properties include lists of strings.  These strings are processed via the character class syntax
through.  When it comes to matching however, there is some desuguraing that turns such a property of strings into
a list of alternations.  For example, say a property has strings str1...strN plus a traditional character class,
single-character-class, we create the pattern equivalent of:
     (?:str1|str2|...|strN|[single-character-class])
Per the spec, longer strings appear earlier in the alternation, and before the traditional character class.
This allows for searching for longer properties in a property list where substrings of other strings are included
in that list.

There are new set of combining operators allowed in the class set character classes.  Two character class elements
that appear adjacent to each other implicitly have the Union combining operations.  There is also an Intersection
operation with the && operator and a Subtraction operation with the || operator.

There is new ClassSet parsing that follows new "cleaner" rules that traditional character classes.
The prior ccharacter class constructor and delegates are mostly unchanged, except for the compile mode
now being switched on an enum instead of a bool.

Added check that both 'u' and 'v' flags don't appear in the same RegExp.

Added unicodeSets getter watchpoint to the m_regExpPrimordialPropertiesWatchpointSet.

* JSTests/es6/Proxy_internal_get_calls_RegExp.prototype.flags.js:
* JSTests/stress/regexp-vflag-property-of-strings.js: Added.
(arrayToString):
(objectToString):
(dumpValue):
(compareArray):
(compareGroups):
(testRegExp):
(testRegExpSyntaxError):
* JSTests/stress/static-getter-in-names.js:
* JSTests/test262/config.yaml:
* LayoutTests/js/Object-getOwnPropertyNames-expected.txt:
* LayoutTests/js/script-tests/Object-getOwnPropertyNames.js:
* Source/JavaScriptCore/JavaScriptCore.xcodeproj/project.pbxproj:
* Source/JavaScriptCore/builtins/BuiltinNames.h:
* Source/JavaScriptCore/builtins/RegExpPrototype.js:
(linkTimeConstant.hasObservableSideEffectsForRegExpMatch):
(linkTimeConstant.hasObservableSideEffectsForRegExpSplit):
(overriddenName.string_appeared_here.split):
* Source/JavaScriptCore/builtins/StringPrototype.js:
(linkTimeConstant.hasObservableSideEffectsForStringReplace):
* Source/JavaScriptCore/bytecode/LinkTimeConstant.h:
* Source/JavaScriptCore/dfg/DFGAbstractInterpreterInlines.h:
(JSC::DFG::AbstractInterpreter<AbstractStateType>::executeEffects):
* Source/JavaScriptCore/dfg/DFGFixupPhase.cpp:
(JSC::DFG::FixupPhase::addStringReplacePrimordialChecks):
* Source/JavaScriptCore/dfg/DFGOperations.cpp:
(JSC::DFG::JSC_DEFINE_JIT_OPERATION):
* Source/JavaScriptCore/dfg/DFGStrengthReductionPhase.cpp:
(JSC::DFG::StrengthReductionPhase::handleNode):
* Source/JavaScriptCore/runtime/CachedTypes.cpp:
* Source/JavaScriptCore/runtime/CommonIdentifiers.h:
* Source/JavaScriptCore/runtime/JSGlobalObject.cpp:
(JSC::JSGlobalObject::init):
* Source/JavaScriptCore/runtime/JSGlobalObject.h:
* Source/JavaScriptCore/runtime/JSGlobalObjectInlines.h:
(JSC::JSGlobalObject::regExpProtoUnicodeSetsGetter const):
* Source/JavaScriptCore/runtime/RegExp.h:
* Source/JavaScriptCore/runtime/RegExpCache.h:
* Source/JavaScriptCore/runtime/RegExpObject.cpp:
(JSC::RegExpObject::matchGlobal):
* Source/JavaScriptCore/runtime/RegExpPrototype.cpp:
(JSC::RegExpPrototype::finishCreation):
(JSC::JSC_DEFINE_HOST_FUNCTION):
* Source/JavaScriptCore/ucd/emoji-sequences.txt: Added.
* Source/JavaScriptCore/ucd/emoji-zwj-sequences.txt: Added.
* Source/JavaScriptCore/yarr/Yarr.h:
* Source/JavaScriptCore/yarr/YarrErrorCode.cpp:
(JSC::Yarr::errorMessage):
(JSC::Yarr::errorToThrow):
* Source/JavaScriptCore/yarr/YarrErrorCode.h:
* Source/JavaScriptCore/yarr/YarrFlags.h:
* Source/JavaScriptCore/yarr/YarrInterpreter.cpp:
(JSC::Yarr::ByteTermDumper::ByteTermDumper):
(JSC::Yarr::ByteTermDumper::unicode const):
(JSC::Yarr::ByteTermDumper::unicodeSets const):
(JSC::Yarr::ByteTermDumper::eitherUnicode const):
(JSC::Yarr::Interpreter::tryConsumeBackReference):
(JSC::Yarr::Interpreter::matchCharacterClass):
(JSC::Yarr::Interpreter::backtrackCharacterClass):
(JSC::Yarr::Interpreter::matchDisjunction):
(JSC::Yarr::Interpreter::Interpreter):
(JSC::Yarr::Interpreter::isLegacyCompilation const):
(JSC::Yarr::Interpreter::isUnicodeCompilation const):
(JSC::Yarr::Interpreter::isUnicodeSetsCompilation const):
(JSC::Yarr::Interpreter::isEitherUnicodeCompilation const):
(JSC::Yarr::ByteTermDumper::dumpTerm):
(JSC::Yarr::ByteTermDumper::unicode): Deleted.
* Source/JavaScriptCore/yarr/YarrInterpreter.h:
(JSC::Yarr::BytecodePattern::BytecodePattern):
(JSC::Yarr::BytecodePattern::compileMode const):
(JSC::Yarr::BytecodePattern::unicodeSets const):
(JSC::Yarr::BytecodePattern::eitherUnicode const):
* Source/JavaScriptCore/yarr/YarrJIT.cpp:
* Source/JavaScriptCore/yarr/YarrParser.h:
(JSC::Yarr::Parser::CharacterClassParserDelegate::CharacterClassParserDelegate):
(JSC::Yarr::Parser::ClassSetParserDelegate::ClassSetParserDelegate):
(JSC::Yarr::Parser::ClassSetParserDelegate::begin):
(JSC::Yarr::Parser::ClassSetParserDelegate::nestedClassBegin):
(JSC::Yarr::Parser::ClassSetParserDelegate::doneAfterCharacterClassEnd):
(JSC::Yarr::Parser::ClassSetParserDelegate::setUnionOp):
(JSC::Yarr::Parser::ClassSetParserDelegate::setSubtractOp):
(JSC::Yarr::Parser::ClassSetParserDelegate::setIntersectionOp):
(JSC::Yarr::Parser::ClassSetParserDelegate::flushCachedCharacterIfNeeded):
(JSC::Yarr::Parser::ClassSetParserDelegate::atomPatternCharacter):
(JSC::Yarr::Parser::ClassSetParserDelegate::atomBuiltInCharacterClass):
(JSC::Yarr::Parser::ClassSetParserDelegate::end):
(JSC::Yarr::Parser::ClassSetParserDelegate::error):
(JSC::Yarr::Parser::ClassSetParserDelegate::assertionWordBoundary):
(JSC::Yarr::Parser::ClassSetParserDelegate::atomBackReference):
(JSC::Yarr::Parser::ClassSetParserDelegate::atomNamedBackReference):
(JSC::Yarr::Parser::ClassSetParserDelegate::atomNamedForwardReference):
(JSC::Yarr::Parser::ClassStringDisjunctionParserDelegate::ClassStringDisjunctionParserDelegate):
(JSC::Yarr::Parser::ClassStringDisjunctionParserDelegate::atomPatternCharacter):
(JSC::Yarr::Parser::ClassStringDisjunctionParserDelegate::newAlternative):
(JSC::Yarr::Parser::ClassStringDisjunctionParserDelegate::end):
(JSC::Yarr::Parser::ClassStringDisjunctionParserDelegate::assertionWordBoundary):
(JSC::Yarr::Parser::ClassStringDisjunctionParserDelegate::atomBackReference):
(JSC::Yarr::Parser::ClassStringDisjunctionParserDelegate::atomNamedBackReference):
(JSC::Yarr::Parser::ClassStringDisjunctionParserDelegate::atomNamedForwardReference):
(JSC::Yarr::Parser::ClassStringDisjunctionParserDelegate::atomBuiltInCharacterClass):
(JSC::Yarr::Parser::Parser):
(JSC::Yarr::Parser::isIdentityEscapeAnError):
(JSC::Yarr::Parser::parseEscape):
(JSC::Yarr::Parser::consumePossibleSurrogatePair):
(JSC::Yarr::Parser::parseAtomEscape):
(JSC::Yarr::Parser::parseCharacterClassEscape):
(JSC::Yarr::Parser::parseClassSetEscape):
(JSC::Yarr::Parser::parseClassStringDisjunctionEscape):
(JSC::Yarr::Parser::parseCharacterClass):
(JSC::Yarr::Parser::parseClassSet):
(JSC::Yarr::Parser::parseClassStringDisjunction):
(JSC::Yarr::Parser::parseParenthesesEnd):
(JSC::Yarr::Parser::parseTokens):
(JSC::Yarr::Parser::handleIllegalReferences):
(JSC::Yarr::Parser::tryConsumeUnicodeEscape):
(JSC::Yarr::Parser::tryConsumeUnicodePropertyExpression):
(JSC::Yarr::Parser::isLegacyCompilation const):
(JSC::Yarr::Parser::isUnicodeCompilation const):
(JSC::Yarr::Parser::isUnicodeSetsCompilation const):
(JSC::Yarr::Parser::isEitherUnicodeCompilation const):
(JSC::Yarr::compileMode):
(JSC::Yarr::parse):
* Source/JavaScriptCore/yarr/YarrPattern.cpp:
(JSC::Yarr::CharacterClassConstructor::CharacterClassConstructor):
(JSC::Yarr::CharacterClassConstructor::reset):
(JSC::Yarr::CharacterClassConstructor::combiningSetOp):
(JSC::Yarr::CharacterClassConstructor::append):
(JSC::Yarr::CharacterClassConstructor::appendInverted):
(JSC::Yarr::CharacterClassConstructor::putRange):
(JSC::Yarr::CharacterClassConstructor::atomClassStringDisjunction):
(JSC::Yarr::CharacterClassConstructor::performSetOpWith):
(JSC::Yarr::CharacterClassConstructor::performSetOpWithStrings):
(JSC::Yarr::CharacterClassConstructor::performSetOpWithMatches):
(JSC::Yarr::CharacterClassConstructor::hasInverteStrings):
(JSC::Yarr::CharacterClassConstructor::compareUTF32Strings):
(JSC::Yarr::CharacterClassConstructor::sort):
(JSC::Yarr::CharacterClassConstructor::charClass):
(JSC::Yarr::CharacterClassConstructor::mergeRangesFrom):
(JSC::Yarr::CharacterClassConstructor::unionStrings):
(JSC::Yarr::CharacterClassConstructor::intersectionStrings):
(JSC::Yarr::CharacterClassConstructor::subtractionStrings):
(JSC::Yarr::CharacterClassConstructor::asciiOpSorted):
(JSC::Yarr::CharacterClassConstructor::unicodeOpSorted):
(JSC::Yarr::YarrPatternConstructor::YarrPatternConstructor):
(JSC::Yarr::YarrPatternConstructor::resetForReparsing):
(JSC::Yarr::YarrPatternConstructor::atomPatternCharacter):
(JSC::Yarr::YarrPatternConstructor::atomBuiltInCharacterClass):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassAtom):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassRange):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassBuiltIn):
(JSC::Yarr::YarrPatternConstructor::atomClassStringDisjunction):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassSetOp):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassPushNested):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassPopNested):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassEnd):
(JSC::Yarr::YarrPattern::compile):
(JSC::Yarr::PatternTerm::dump):
(JSC::Yarr::YarrPattern::dumpPatternString):
(JSC::Yarr::YarrPattern::dumpPattern):
* Source/JavaScriptCore/yarr/YarrPattern.h:
(JSC::Yarr::CharacterClass::CharacterClass):
(JSC::Yarr::CharacterClass::hasNonBMPCharacters const):
(JSC::Yarr::CharacterClass::hasOneCharacterSize const):
(JSC::Yarr::CharacterClass::hasOnlyNonBMPCharacters const):
(JSC::Yarr::CharacterClass::hasStrings const):
(JSC::Yarr::CharacterClass::hasSingleCharacters const):
(JSC::Yarr::ClassSet::ClassSet):
(JSC::Yarr::YarrPattern::unicodeSets const):
(JSC::Yarr::YarrPattern::eitherUnicode const):
(JSC::Yarr::YarrPattern::compileMode const):
(JSC::Yarr::CharacterClass::hasNonBMPCharacters): Deleted.
(JSC::Yarr::CharacterClass::hasOneCharacterSize): Deleted.
(JSC::Yarr::CharacterClass::hasOnlyNonBMPCharacters): Deleted.
* Source/JavaScriptCore/yarr/YarrSyntaxChecker.cpp:
(JSC::Yarr::SyntaxChecker::atomClassStringDisjunction):
(JSC::Yarr::SyntaxChecker::atomCharacterClassSetOp):
(JSC::Yarr::SyntaxChecker::atomCharacterClassPushNested):
(JSC::Yarr::SyntaxChecker::atomCharacterClassPopNested):
(JSC::Yarr::checkSyntax):
* Source/JavaScriptCore/yarr/YarrUnicodeProperties.cpp:
(JSC::Yarr::unicodeMatchProperty):
(JSC::Yarr::createUnicodeCharacterClassFor):
(JSC::Yarr::characterClassMayContainStrings):
* Source/JavaScriptCore/yarr/YarrUnicodeProperties.h:
* Source/JavaScriptCore/yarr/generateYarrUnicodePropertyTables.py:
(PropertyData.__init__):
(PropertyData.addMatchString):
(PropertyData.stringsCompare):
(PropertyData):
(PropertyData.sortStrings):
(PropertyData.dumpMatchData):
(PropertyData.convertStringToCppFormat):
(PropertyData.dump):
(PropertyData.dumpAll):
(PropertyData.dumpMayContainStringFunc):
(BinaryProperty.dump):
(SequenceProperty):
(SequenceProperty.__init__):
(SequenceProperty.parsePropertyFile):
(SequenceProperty.dump):
* Source/WebCore/contentextensions/URLFilterParser.cpp:
(WebCore::ContentExtensions::PatternParser::atomClassStringDisjunction):
(WebCore::ContentExtensions::PatternParser::atomCharacterClassSetOp):
(WebCore::ContentExtensions::PatternParser::atomCharacterClassPushNested):
(WebCore::ContentExtensions::PatternParser::atomCharacterClassPopNested):
(WebCore::ContentExtensions::URLFilterParser::addPattern):

Canonical link: https://commits.webkit.org/261188@main
@webkit-early-warning-system webkit-early-warning-system force-pushed the eng/Implement-RegExp-v-flag-with-set-notation--properties-of-strings branch from b309ad8 to 270c824 Compare March 4, 2023 01:04
@webkit-early-warning-system webkit-early-warning-system merged commit 270c824 into WebKit:main Mar 4, 2023
@webkit-commit-queue
Copy link
Collaborator

Committed 261188@main (270c824): https://commits.webkit.org/261188@main

Reviewed commits have been landed. Closing PR #10984 and removing active labels.

@webkit-commit-queue webkit-commit-queue removed the unsafe-merge-queue Applied to send a pull request to merge-queue, but skip building and testing label Mar 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
JavaScriptCore For bugs in JavaScriptCore, the JS engine used by WebKit, other than kxmlcore issues.
Projects
None yet
5 participants