Stricter grammar for Unicode identifier names #11216

HertzDevil · 2021-09-15T12:31:06Z

Any character whose Unicode codepoint is 0xA0 or above is allowed in identifier names (see #10978 (comment)). This means the following is valid code:

= 1
puts  # => 1

The above, in turn, is produced by the following Crystal code:

puts "\u200B= 1"
puts "puts \u200B"

That is, the variable's name is just the zero-width space character. To exclude those edge cases I suggest that we implement the Default Identifiers from UAX #31 (revision 35) with the following profile:

<Start> := XID_Start + U+005F
<Continue> := <Start> + XID_Continue
<Medial> :=

The above sets, when restricted to ASCII, are consistent with Crystal's existing lexer rules. Type and constant names must still begin with an ASCII uppercase letter.
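To give a rough feel for what this profile would accept, here is a sketch in Python rather than Crystal. It assumes that Python's `str.isidentifier()` is a fair stand-in, since it takes XID_Start or U+005F for the first character and XID_Continue for the rest. The helper name `fits_profile` is hypothetical, and the results follow the Unicode data bundled with the Python build, not UAX #31 revision 35 specifically.

```python
def fits_profile(name: str) -> bool:
    # Assumption: str.isidentifier() mirrors the proposed <Start>/<Continue>
    # sets (XID_Start plus "_", then XID_Continue). It does not reject keywords.
    return name.isidentifier()


# A few candidate names, including the zero-width space from the example above.
for name in ["foo", "_bar", "café", "変数", "\u200b", "\u00a0", "1abc"]:
    print(repr(name), fits_profile(name))
```

Under this approximation the zero-width space (U+200B) and the no-break space (U+00A0) are both rejected, while letters outside ASCII are still accepted.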

For reference, this is also how C++23 does it (with the additional requirement that all identifiers must be in NFC).

straight-shoota · 2021-12-13T18:53:47Z

#11508 (comment) suggests removing the Number Letter category from <Start>. I think that's fine.
We can always reconsider and loosen this restriction if there are legitimate use cases for identifiers starting with number letters.
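For context on what dropping Number Letter means in practice, here is a small Python sketch. It is not Crystal's lexer: `is_start` and its `allow_number_letters` flag are hypothetical names, and `str.isidentifier()` on a single character is assumed to approximate XID_Start plus U+005F.

```python
import unicodedata


def is_start(ch: str, allow_number_letters: bool = True) -> bool:
    # Assumption: a one-character isidentifier() check approximates <Start>.
    if len(ch) != 1 or not ch.isidentifier():
        return False
    # General category "Nl" is Number Letter, e.g. Roman numeral characters.
    return allow_number_letters or unicodedata.category(ch) != "Nl"


roman_eight = "\u2167"  # ROMAN NUMERAL EIGHT, general category Nl
print(unicodedata.category(roman_eight))
print(is_start(roman_eight), is_start(roman_eight, allow_number_letters=False))
print(is_start("x", allow_number_letters=False))
```

With the flag off, a name could no longer begin with a character such as U+2167, while ordinary letters are unaffected.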

straight-shoota · 2022-01-03T20:42:19Z

#11508 has been reverted in #11687 because it was too restrictive: it resulted in a breaking change, which is unacceptable for a minor release.

HertzDevil · 2022-12-02T21:50:55Z

Do we really want "most" emojis as identifiers? Ones that contain ZWJs won't be allowed anyway, unless the lexer operates on grapheme clusters and the compiler / standard library supports those sequences. I wouldn't miss being able to name my variables like 😂 and 🫱🏿‍🫲🏾.
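To make the ZWJ point concrete, the sketch below spells out the handshake sequence as code points (as I read it: U+1FAF1 U+1F3FF U+200D U+1FAF2 U+1F3FE) and checks each one in Python. This is an illustration only: `str.isidentifier()` on a single character is assumed to approximate XID_Start plus U+005F, and the reported categories depend on the Unicode version of the Python build.

```python
import unicodedata

# The handshake sequence from the comment above, written as escapes.
handshake = "\U0001FAF1\U0001F3FF\u200D\U0001FAF2\U0001F3FE"

for ch in ["\U0001F602", *handshake]:
    # category() reports "Cn" for code points newer than this Python's Unicode data.
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), ch.isidentifier())

# Appending ZWJ to an otherwise valid name is also rejected by this check.
print("a\u200d".isidentifier())
```

None of these code points should pass the start check. The ZWJ (U+200D, category Cf) is rejected even in continue position, so a per-code-point lexer following the default profile would not accept the sequence.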

We could simply bring #11508 back but use the new parser warning facilities instead.

HertzDevil added status:discussion topic:compiler:parser labels Sep 15, 2021

This was referenced Nov 1, 2021

Parser vulnerable to Trojan Source attack #11392

Open

Fix disallow Unicode bi-directional control characters #11393

Closed

straight-shoota added this to To do in Unicode Nov 10, 2021

straight-shoota mentioned this issue Nov 29, 2021

Restrict identifier grammar #11508

Merged

straight-shoota mentioned this issue Dec 7, 2021

OptionParser flag parsing is very lax #11547

Open

straight-shoota closed this as completed in #11508 Dec 18, 2021

Unicode automation moved this from To do to Done Dec 18, 2021

straight-shoota added breaking-change and removed breaking-change labels Jan 3, 2022

straight-shoota mentioned this issue Jan 3, 2022

Revert "Restrict identifier grammar" #11687

Merged

straight-shoota reopened this Jan 3, 2022

Unicode automation moved this from Done to In progress Jan 3, 2022
