Digit matching behaving as intended? #401

Closed
stephentyrone opened this issue May 11, 2022 · 5 comments · Fixed by #570

@stephentyrone
Member

stephentyrone commented May 11, 2022

Reposted from the Swift forums: https://forums.swift.org/t/bad-digit-matching-bugreport-regarding-se-0354-regex-literals/57262/1

Problem: Some digit character groups match number-like grapheme clusters.

// this matches:
try /[1-2]/.wholeMatch(in: "1️⃣")

// still matches:
try /[1-2]/.asciiOnlyDigits().wholeMatch(in: "1️⃣")

// does not match:
try /[12]/.wholeMatch(in: "1️⃣")

The behavior described above seems inconsistent and difficult to predict. Shouldn't [1-2] and [12] be identical? Should they match anything outside of ASCII?

Note: 1️⃣ is U+0031 (ascii digit 1) U+FE0F (VARIATION SELECTOR-16) U+20E3 (COMBINING ENCLOSING KEYCAP)

Same is true for 1︎⃣: U+0031 (ascii digit 1) U+FE0E (VARIATION SELECTOR-15) U+20E3 (COMBINING ENCLOSING KEYCAP)
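
To make the scalar breakdown above concrete, here is a minimal sketch using only the standard library. The variable names are illustrative, and the emoji is written with explicit scalar escapes so the code does not depend on how the page renders it.

```swift
// The keycap emoji spelled out scalar by scalar:
// U+0031, U+FE0F (VARIATION SELECTOR-16), U+20E3 (COMBINING ENCLOSING KEYCAP).
let keycap = "1\u{FE0F}\u{20E3}"

// One grapheme cluster (Character), but three Unicode scalars.
let characterCount = keycap.count
let scalarValues = keycap.unicodeScalars.map {
    String($0.value, radix: 16, uppercase: true)
}

print(characterCount) // 1
print(scalarValues)   // ["31", "FE0F", "20E3"]
```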

rdar://96898279

@NikolaiRuhe

If there are questions about my report: I'm following this issue.

@milseman
Collaborator

Note that the emoji compares between "1" and "2" as a Character. @natecook1000
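
The comparison mentioned here can be checked directly with the standard library. This is a small sketch with illustrative names: because `Character` ordering compares the underlying scalars in sequence, the keycap (which starts with U+0031) sorts after "1" and before "2".

```swift
let one: Character = "1"
let two: Character = "2"
// Keycap emoji: U+0031 U+FE0F U+20E3, a single grapheme cluster.
let keycap = Character("1\u{FE0F}\u{20E3}")

// "1" is a strict prefix of the keycap's scalars, so it sorts first;
// the keycap's first scalar (U+0031) is below U+0032, so it sorts before "2".
let isBetween = one < keycap && keycap < two
let inClosedRange = (one...two).contains(keycap)

print(isBetween, inClosedRange) // true true
```

This is why a range built on `Character` comparison can admit the emoji, while an explicit list of members such as [12] does not.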

@NikolaiRuhe

> Note that the emoji compares between "1" and "2" as a Character. @natecook1000

This explains a lot of the behavior above, at least technically. On the other hand, it scares me away from using regular expressions when writing code that should sanitize arbitrary input.

Maybe an ASCII-only mode would help? For now, my code checks the input manually:

input.utf8.allSatisfy { $0 <= 127 }
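
As a rough illustration of that manual check, the sketch below wraps it in two helper functions. Both function names are hypothetical and not part of any library. The second helper is an assumption about what a sanitizing check for digits might look like, not something proposed in this thread.

```swift
// True when every UTF-8 code unit is in the ASCII range (0...127).
func isASCIIOnly(_ input: String) -> Bool {
    input.utf8.allSatisfy { $0 <= 127 }
}

// True when the input is non-empty and consists only of ASCII digits "0"..."9".
func isASCIIDigits(_ input: String) -> Bool {
    !input.isEmpty && input.utf8.allSatisfy { (0x30...0x39).contains($0) }
}

// The keycap emoji contains non-ASCII scalars, so both checks reject it.
let keycap = "1\u{FE0F}\u{20E3}"
print(isASCIIOnly("12"), isASCIIOnly(keycap))     // true false
print(isASCIIDigits("12"), isASCIIDigits(keycap)) // true false
```

Working on the UTF-8 view sidesteps grapheme-cluster comparison entirely, since each code unit is compared as a plain integer.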

@natecook1000
Member

Thanks for the example, @NikolaiRuhe! It looks like custom character class ranges will need something a bit different than Character.<, at least by default.
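
One way to picture "something a bit different than Character.<" is a membership test that only accepts single-scalar characters and compares by scalar value. The helper below is a hypothetical sketch for illustration. It is not necessarily what the eventual fix does.

```swift
// Hypothetical helper: a character is in the range only if it is made of
// exactly one Unicode scalar and that scalar lies within the bounds.
func isInScalarRange(_ character: Character,
                     _ bounds: ClosedRange<Unicode.Scalar>) -> Bool {
    let scalars = character.unicodeScalars
    guard scalars.count == 1, let scalar = scalars.first else { return false }
    return bounds.contains(scalar)
}

let oneToTwo: ClosedRange<Unicode.Scalar> = "1"..."2"
let keycap = Character("1\u{FE0F}\u{20E3}")

print(isInScalarRange("1", oneToTwo))    // true
print(isInScalarRange(keycap, oneToTwo)) // false
```

Under this rule [1-2] and [12] would agree on the keycap input, at the cost of rejecting any multi-scalar character inside a range.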

@milseman
Collaborator

Note that under Unicode scalar semantics you should get your desired behavior. Extending A-Z to the Character range "A"..."Z" is a reasonable interpretation, but probably undesirable and counterintuitive. The Unicode pitch argued for using NFD for ranges, but that wouldn't affect this example.
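
For readers who want to try scalar semantics, the sketch below opts a regex into it with `matchingSemantics(.unicodeScalar)`. It assumes Swift 5.7 or later with the standard library's `Regex` type available. The helper name is hypothetical, and the pattern is parsed at run time with `Regex(_:)` so that no regex-literal compiler flag is needed.

```swift
// Hypothetical helper: does `pattern` match all of `input` when the regex
// is evaluated scalar by scalar instead of by grapheme cluster?
func wholeMatchesUnderScalarSemantics(_ pattern: String, _ input: String) -> Bool {
    guard let regex = try? Regex(pattern) else { return false }
    let match = try? regex
        .matchingSemantics(.unicodeScalar)
        .wholeMatch(in: input)
    return match != nil
}

let plainDigit = wholeMatchesUnderScalarSemantics("[1-2]", "1")

// Per the comment above, the keycap emoji is expected not to match here,
// because the range consumes only U+0031 and two scalars remain.
// This sketch prints the result rather than asserting it.
let keycapResult = wholeMatchesUnderScalarSemantics("[1-2]", "1\u{FE0F}\u{20E3}")

print(plainDigit, keycapResult)
```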
