Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix scalar matching in grapheme semantic mode #565

Closed
wants to merge 2 commits into from

Conversation

hamishknight
Copy link
Contributor

@hamishknight hamishknight commented Jul 8, 2022

Previously we would allow consuming a single scalar of a grapheme in grapheme semantic mode when matching a DSL .scalar. Change this behavior such that it is treated the same as a character with a single scalar. That is:

  • In grapheme mode it must match an entire grapheme.
  • In scalar semantic mode, it may match a single scalar.

Additionally, start converting AST .scalar atoms into .scalar DSL atoms. This is both for consistency with the DSL, and allows us to preserve the \u{...} syntax in more cases when a DSL conversion is performed.

Resolves #563
Resolves #564
rdar://96662651

Previously we would allow consuming a single
scalar in grapheme semantic mode for a DSL `.scalar`.
Change this behavior such that it is treated the
same as a character with a single scalar. That is:

- In grapheme mode it must match an entire grapheme.
- In scalar semantic mode, it may match a single scalar.
Convert AST scalars to DSL scalars such that we
can preserve the `\u{...}` syntax where the user
chooses to use it. This requires fixing up some
escaping logic such that we don't escape `\u{...}`
sequences.
@hamishknight
Copy link
Contributor Author

@swift-ci please test

Comment on lines +1148 to +1157
let r3 = Regex {
"👨" as UnicodeScalar
"\u{200D}" as UnicodeScalar
"👨" as UnicodeScalar
"\u{200D}" as UnicodeScalar
"👧" as UnicodeScalar
"\u{200D}" as UnicodeScalar
"👦" as UnicodeScalar
}
XCTAssertNil(try r3.firstMatch(in: "👨‍👨‍👧‍👦"))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, actually I'm not sure whether this is correct, as we allow matching for e.g:

print("👨‍👨‍👧‍👦".firstMatch(of: /👨\u{200D}👨\u{200D}👧\u{200D}👦/)

should we be combining the scalars in a DSL concatenation the same way we combine them in a literal? And if so, would it also apply to regex scalar children e.g /\u{...}/?

Or is a DSL concatenation semantically more like the following?

print("👨‍👨‍👧‍👦".firstMatch(of: /(?:👨)(?:\u{200D})(?:👨)(?:\u{200D})(?:👧)(?:\u{200D})(?:👦)/)) // nil

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scalars should get globbed together into a stream of scalars, over which grapheme breaking can occur. This is similar to scalar escaped inside string literals. For the DSL, I'd treat them like a stream of scalars ala /\u{...}\u{...}\u{...}.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah makes sense, though we'll need to define exactly when the combining happens, e.g do we combine the following?

Regex {
  "a" as UnicodeScalar
  /\u{301}/
}

and would that still happen if there's further contents in the regex? (because this is now a concat within a concat) e.g:

Regex {
  "a" as UnicodeScalar
  /\u{301}b/
}

Should Regex {...} calls serve as a boundary between combining scalars? e.g:

Regex {
  "a" as UnicodeScalar
  Regex {
    "\u{301}" as UnicodeScalar
    "b"
  }
}

@rctcwyvrn
Copy link
Contributor

I'm still not convinced that we should have this scalar api while in grapheme semantics at all, converting it to a matching a character under the hood seems unintuitive from a user's perspective. What does a scalar RegexComponent even mean then? If it's given a single scalar character it'll match it as a character and if it's given something that isn't a character (like \u{301}) it'll just never match anything? Why even have this option in grapheme semantic mode then? Seems like strange behaviour to me

Would it be possible for us to instead throw an error if we have a .scalar while in grapheme semantic mode?

@hamishknight
Copy link
Contributor Author

converting it to a matching a character under the hood seems unintuitive from a user's perspective

I don't disagree with this, though I should note that it's equivalent to using a /\u{...}/ regex literal in place of the unicode scalar. IMO the current behavior of potentially splitting graphemes is also somewhat surprising (especially in a character class where some elements could match only a scalar, others could match a character), and isn't something I'd expect in grapheme mode.

Why even have this option in grapheme semantic mode then? Seems like strange behaviour to me

Yeah, I'm starting to think we should maybe consider retracting these scalar APIs on RegexBuilder until we work out these details (there's only 2 of them AFAIK, the RegexComponent conformance + CharacterClass.anyOf overload). I think we'd still want this change to the matching engine tho, as /\u{...}/ should still match a grapheme IMO.

Would it be possible for us to instead throw an error if we have a .scalar while in grapheme semantic mode?

Yeah, that would be a possible option, though it would surface as a runtime crash on the first call to a matching function.

Copy link
Collaborator

@milseman milseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I think we can clean up the code gen path a little.

// A scalar always matches the same as a single scalar character. This
// means it must match a whole grapheme in grapheme semantic mode, but
// can match a single scalar in scalar semantic mode.
try emitCharacter(Character(s))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be conditional on whether we're in grapheme semantics or scalar semantics?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emitCharacter will call emitConsumeScalar for scalar mode

// A scalar always matches the same as a single scalar character. This
// means it must match a whole grapheme in grapheme semantic mode, but
// can match a single scalar in scalar semantic mode.
return try Character(s).generateConsumer(opts)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems pretty convoluted. Why do we not just emit a scalar consuming instruction in scalar semantic mode and set the grapheme boundary bit? CC @rctcwyvrn to see if we need to get some of her instructions in first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do that within a consumer matching function? I know we eventually want to migrate off consumer functions, but for now this will generate a consumeScalar consumer in scalar semantic mode

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This path should only be reached for non-ascii custom character classes so I don't think we can emit an instruction instead of a consume fn in that case. Maybe we could if we made custom character classes emit in the same way we do alternations?

Comment on lines +1148 to +1157
let r3 = Regex {
"👨" as UnicodeScalar
"\u{200D}" as UnicodeScalar
"👨" as UnicodeScalar
"\u{200D}" as UnicodeScalar
"👧" as UnicodeScalar
"\u{200D}" as UnicodeScalar
"👦" as UnicodeScalar
}
XCTAssertNil(try r3.firstMatch(in: "👨‍👨‍👧‍👦"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scalars should get globbed together into a stream of scalars, over which grapheme breaking can occur. This is similar to scalar escaped inside string literals. For the DSL, I'd treat them like a stream of scalars ala /\u{...}\u{...}\u{...}.

rctcwyvrn added a commit to rctcwyvrn/swift-experimental-string-processing that referenced this pull request Jul 11, 2022
@rctcwyvrn
Copy link
Contributor

rctcwyvrn commented Jul 11, 2022

Edge case I ran into while updating #525

Should a single scalar composed value match the decomposed version while in grapheme mode? Ie "é" as UnicodeScalar

In the scalar optimizations branch I tried to make it match this PR's scalar matching by emitting .scalar as matchScalar + boundary check, but that would fail to match the decomposed version. This PR would match the decomposed version because it would convert it to a character and match by unicode equivalence.

I'm leaning towards that it shouldn't be able to match the decomposed version, as we asked it to match a scalar and it would end up matching a completely different scalar sequence

    let r4 = Regex { "é" as UnicodeScalar }
    XCTAssertNotNil( // nil or not nil?
      try r4.firstMatch(in: "e\u{301}")
    )
    XCTAssertNotNil(
      try r4.firstMatch(in: "é")
    )

@hamishknight
Copy link
Contributor Author

hamishknight commented Jul 11, 2022

I think I'm leaning towards saying it actually should match the decomposed version. IMO the UnicodeScalar conformance should be semantically equivalent to /\u{...}/, and in grapheme semantic mode IMO it seems reasonable to expect /\u{E9}/ to be the same regex as /e\u{301}/. This would also match with the expected behavior of combining all adjacent scalars such that they effectively become a .quotedLiteral rather than distinct semantic elements.

I do agree that it could potentially be unexpected tho.

@@ -216,7 +216,7 @@ extension AST.Atom {

switch self.kind {
case let .char(c): return .char(c)
case let .scalar(s): return .char(Character(s.value))
case let .scalar(s): return .scalar(s.value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want the behaviour of regex to be different depending on if we input a unicode value as its scalar code or as a literal character? ie "ợ" vs "\u{1ee3}"? With this one would be interpreted as a .char and the other as a .scalar in the DSLTree. I think the intent for the original code here was to make those two inputs equivalent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they should be semantically equivalent (i.e they go through the exact same code paths in byte code emission), but this distinction allows us to have more accurate printing of the DSL tree when performing a literal -> DSL conversion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(if we didn't need the distinction for printing, IMO it would make sense to outright remove .scalar from the DSL tree)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they behave the same then why would we need to print them differently? Printing them differently would imply that they behave differently I think.

Also yea, seeing as we're moving towards making a .scalar behave exactly the same as .char, maybe it would make sense to remove it outright

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they behave the same then why would we need to print them differently? Printing them differently would imply that they behave differently I think.

For example, if the user wrote /e\u{301}/, ideally the transformed DSL tree would be:

Regex {
  "e\u{301}"
}

instead of:

Regex {
  "é"
}

Semantically they're the same, but IMO it's much nicer to preserve the original syntax the user wrote.

@milseman
Copy link
Collaborator

Should a single scalar composed value match the decomposed version while in grapheme mode?

Yes, in grapheme cluster semantics we do Unicode Canonical Equivalence.

@rctcwyvrn
Copy link
Contributor

Ok so #525, is ready to merge and up to date with this branch. Should we go ahead and merge that one?

@hamishknight were you planning on doing additional work on combining scalars together?

@hamishknight
Copy link
Contributor Author

@rctcwyvrn I'm happy with landing both, I can work on the scalar combination logic in a follow-up

@milseman
Copy link
Collaborator

Let's merge this whenever you're ready @rctcwyvrn, and we can coordinate how to land @hamishknight's work on top and cherry-pick things together

@hamishknight
Copy link
Contributor Author

@swift-ci please test

rctcwyvrn added a commit that referenced this pull request Jul 12, 2022
- Adds new instructions for matching characters and scalars case insensitively
- Compiles ascii character matches into the faster scalar match instructions even in grapheme semantic mode
- Optimizes out unnecessary runtime grapheme boundary checks for all ascii strings
- Also includes fixes to scalar matching in grapheme semantic mode (#565)
@rctcwyvrn
Copy link
Contributor

Changes merged in 7752047

@rctcwyvrn rctcwyvrn closed this Jul 12, 2022
@hamishknight hamishknight deleted the not-to-scale branch July 12, 2022 18:44
rctcwyvrn added a commit to rctcwyvrn/swift-experimental-string-processing that referenced this pull request Jul 12, 2022
- Adds new instructions for matching characters and scalars case insensitively
- Compiles ascii character matches into the faster scalar match instructions even in grapheme semantic mode
- Optimizes out unnecessary runtime grapheme boundary checks for all ascii strings
- Also includes fixes to scalar matching in grapheme semantic mode (apple#565)
hamishknight pushed a commit to hamishknight/swift-experimental-string-processing that referenced this pull request Jul 19, 2022
- Adds new instructions for matching characters and scalars case insensitively
- Compiles ascii character matches into the faster scalar match instructions even in grapheme semantic mode
- Optimizes out unnecessary runtime grapheme boundary checks for all ascii strings
- Also includes fixes to scalar matching in grapheme semantic mode (apple#565)
hamishknight pushed a commit to hamishknight/swift-experimental-string-processing that referenced this pull request Jul 21, 2022
- Adds new instructions for matching characters and scalars case insensitively
- Compiles ascii character matches into the faster scalar match instructions even in grapheme semantic mode
- Optimizes out unnecessary runtime grapheme boundary checks for all ascii strings
- Also includes fixes to scalar matching in grapheme semantic mode (apple#565)
hamishknight pushed a commit to rctcwyvrn/swift-experimental-string-processing that referenced this pull request Jul 21, 2022
- Adds new instructions for matching characters and scalars case insensitively
- Compiles ascii character matches into the faster scalar match instructions even in grapheme semantic mode
- Optimizes out unnecessary runtime grapheme boundary checks for all ascii strings
- Also includes fixes to scalar matching in grapheme semantic mode (apple#565)
hamishknight pushed a commit to hamishknight/swift-experimental-string-processing that referenced this pull request Jul 21, 2022
- Adds new instructions for matching characters and scalars case insensitively
- Compiles ascii character matches into the faster scalar match instructions even in grapheme semantic mode
- Optimizes out unnecessary runtime grapheme boundary checks for all ascii strings
- Also includes fixes to scalar matching in grapheme semantic mode (apple#565)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants