Stricter grammar for Unicode identifier names #11216

Open
HertzDevil opened this issue Sep 15, 2021 · 3 comments · Fixed by #11508

@HertzDevil
Contributor

HertzDevil commented Sep 15, 2021

Any character whose Unicode codepoint is 0xA0 or above is currently allowed in identifier names (see #10978 (comment)). This means the following is valid code:

= 1
puts ​ # => 1

The above, in turn, is produced by the following Crystal code:

puts "\u200B= 1"
puts "puts \u200B"

That is, the variable's name is just the zero-width space character. To exclude those edge cases I suggest that we implement the Default Identifiers from UAX #31 (revision 35) with the following profile:

<Start> := XID_Start + U+005F
<Continue> := <Start> + XID_Continue
<Medial> :=

The above sets, when restricted to ASCII, are consistent with Crystal's existing lexer rules. Type and constant names must still begin with an ASCII uppercase letter.
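
To make the difference concrete, here is a minimal sketch comparing the two rules. It is written in Python rather than Crystal only because Python's built-in `str.isidentifier()` already checks XID_Start (plus `_`) and XID_Continue, which lines up with the profile above. `loose_is_identifier` is a hypothetical model of the current "0xA0 or above" behaviour, not Crystal's actual lexer, and both functions ignore the trailing `?` / `!` and the uppercase rule for type names.

```python
import string

# Hypothetical model of the current rule: ASCII letters, digits and "_"
# behave as usual, and every codepoint >= 0xA0 is accepted anywhere.
ASCII_START = set(string.ascii_letters + "_")
ASCII_CONTINUE = ASCII_START | set(string.digits)


def loose_is_identifier(name: str) -> bool:
    if not name:
        return False
    start_ok = name[0] in ASCII_START or ord(name[0]) >= 0xA0
    rest_ok = all(ch in ASCII_CONTINUE or ord(ch) >= 0xA0 for ch in name[1:])
    return start_ok and rest_ok


def xid_is_identifier(name: str) -> bool:
    # str.isidentifier() checks XID_Start (plus "_") for the first character
    # and XID_Continue for the rest, i.e. the proposed <Start> / <Continue>.
    return name.isidentifier()


# "\u200b" is the zero-width space; "\u5909\u6570" is a CJK name.
for name in ["foo_1", "1foo", "\u200b", "\u5909\u6570"]:
    print(repr(name), loose_is_identifier(name), xid_is_identifier(name))
```

Under these assumptions the zero-width space passes the loose rule but fails the XID-based one, while ordinary ASCII and CJK names are treated the same by both. The exact XID tables depend on the Unicode version bundled with the Python interpreter.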

For reference, this is also how C++23 does it (with the additional requirement that all identifiers must be in NFC).
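
As an illustration of what the NFC requirement adds on top of the XID sets (it is not part of the profile proposed above), here is a small Python sketch. The names `composed`, `decomposed` and `is_nfc` are made up for this example.

```python
import unicodedata

composed = "caf\u00e9"     # "é" as the single codepoint U+00E9
decomposed = "cafe\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT


def is_nfc(name: str) -> bool:
    # A string is in NFC if normalizing it to NFC leaves it unchanged.
    return unicodedata.normalize("NFC", name) == name


# Both spellings satisfy the XID-based rule, yet they are different strings.
print(composed.isidentifier(), decomposed.isidentifier())
print(composed == decomposed)
print(is_nfc(composed), is_nfc(decomposed))
```

Both spellings render identically and both pass the XID check, but only the composed one is in NFC, so an NFC requirement would leave a single accepted spelling for the name.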

@straight-shoota
Member

#11508 (comment) suggests removing the Number Letter category from <Start>. I think that's fine.
We can always reconsider and loosen this restriction if there are some legitimate use cases for identifiers starting with number letters.

@straight-shoota
Member

#11508 has been reverted in #11687 because it was too restrictive: it resulted in a breaking change, which is unacceptable for a minor release.

Unicode automation moved this from Done to In progress Jan 3, 2022
@HertzDevil
Contributor Author

HertzDevil commented Dec 2, 2022

Do we really want "most" emojis as identifiers? Ones that contain ZWJs won't be allowed anyway, unless the lexer operates on grapheme clusters and the compiler / standard library supports those sequences. I wouldn't miss being able to name my variables like 😂 and 🫱🏿‍🫲🏾.
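
For illustration, a short Python sketch (Python only because `unicodedata` ships with it; this is not Crystal's lexer) that splits the handshake emoji above into codepoints. The escape sequence is a transcription of that emoji and should be treated as an assumption.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

# Assumed transcription of the handshake emoji:
# hand + skin tone + ZWJ + hand + skin tone
handshake = "\U0001FAF1\U0001F3FF\u200D\U0001FAF2\U0001F3FE"

# A lexer that works codepoint by codepoint sees five separate
# characters here, not one grapheme cluster.
for ch in handshake:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch))

print(len(handshake), ZWJ in handshake)
print(handshake.isidentifier(), "\U0001F602".isidentifier())
```

The joiner is a format character (general category Cf), and neither the sequence nor the single emoji passes the XID-based check, so a per-codepoint lexer would have to special-case such sequences to accept them.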

We could simply bring #11508 back but use the new parser warning facilities instead.
