Author: Richard A. O'Keefe <ok(at)cs(dot)otago(dot)ac(dot)nz> Status: Draft Type: Standards Track Erlang-version: R15B02 Created: 19-Oct-2012 Post-History: 19-Oct-2012
EEP 40: A proposal for Unicode variable and atom names in Erlang
This EEP proposes how to extend variable and atom names in Erlang to contain Unicode characters in a backwards compatible way.
[Note]: <> (Underscores in regular text below are backslash escaped) [Note]: <> (due to a weird Markdown rule for emphasis within words.) [Note]: <> (So e.g. where you find XID_Start it means XID_Start.)
- Support for Unicode continues to increase, with minimal source code support about to arrive.
- Unicode variable names and unquoted atoms are not here yet, so now is the time to settle on a design.
- They will need to come. There may be legal or institutional reasons why unicode-capable languages are required. Some people just want to use their own language and script. Erlang's strength in network applications means that being able to represent Internationalized Domain Names as unquoted atoms would be just as much of a convenience as being able to represent ASCII domain names like www.example.com (which needs no quotes in Erlang) is.
- Existing Erlang identifiers should remain valid, including ones containing "@" and ".".
- Existing Erlang support features, such as ignoring variables that start with underscore when reporting singleton variables, should not be broken.
- We should not "steal" any characters to use as "magic markers" for variables because they might be needed for other purposes. A good (bad) example of this is "?", which could be used for several things if it were not used for macros.
- Character sequences in Latin-1 that are not legal variable or atom names now should not be made into such by this specification.
Lu = upper case letters Lt = title case letters Ll = lower case letters Lo = non-case letters (Arabic, Chinese, and so on) Pc = connector punctuators, including the low line (_) and a number of other characters like undertie (‿). Other_Id_Start = script capital p, estimated symbol, katakana-hiragana voiced sound mark, and katakana-hiragana semi-voiced sound mark.
variable ::= var_start var_continue* var_start ::= (XID_Start ∩ (Lu ∪ Lt ∪ Other_Id_Start)) ∪ Pc var_continue ::= XID_Continue ∪ "@" \ "ªº"
The choice of XID here follows Python. It ensures that the normalisation of a variable is still a variable. In fact Unicode variables should be normalised. Unicode has enough look-alike characters that we cannot hope for "look the same <=> are the same" to be true, but we should go some way in that direction.
Variables in scripts that do not distinguish letter case have to begin with some special character to ensure that they are not mistaken for unquoted atoms. There are 10 Pc characters in the Basic Multilingual Plane. The Erlang parser treats a variable beginning with an underscore specially: there will be no complaint if it is a singleton. One approach would be to say that this special treatment does not apply to the other 9 Pc characters. Using that approach, ‿ would not be a wild-card, _隠者 should be a singleton, and ‿隠者 should not.
Of course, someone might be using fonts that do include say Arabic letters but not say the undertie. We can deal with that by revising the underscore rule, which I recommend:
Variable does not begin with a Pc character => should not be a singleton. Variable is just a Pc character and nothing else => is a wild card. Variable begins with a Pc character followed by an Lu or Lt or Pc character => may be a singleton. Variable begins with a Pc character followed by a legal character other than an Lu or Lt or Pc character => should not be a singleton.
Thus ‿ is a wild-card, 隠者 is an atom, _隠者 should not be a singleton, but __隠者 may be a singleton. This rule is a consistent generalisation of the existing rule.
unquoted_atom ::= "."? atom_start atom_continue* atom_start ::= XID_Start \ (Lu ∪ Lt ∪ "ªº") atom_continue ::= XID_Continue ∪ "@" \ "ªº" | "." atom_start
Again the choice of XID follows Python, and ensures that the normalisation of an unquoted atom is still an unquoted atom. Unquoted atoms should be normalised.
The details of Erlang unquoted atoms are somewhat subtle; I have checked my understanding experimentally. An initial dot is allowed, but is always discarded. That's odd, but it's the way it is now.
Keywords have the form of unquoted atoms. No new keywords are introduced.
Any Python identifier or keyword is an Erlang variable or unquoted atom or keyword unless it contains "ª" or "º".
@ signs may occur freely in variables and unquoted atoms except as the first character, as now.
Although they are in the Ll set, and so are technically lower case letters, "ª" and "º" are not allowed in variable names or unquoted atoms in this proposal because they are not allowed in Erlang now.
dots may not be followed by capital letters, digits, or underscores, as now.
I am not sure whether modifier letters should be allowed after a dot.
I am not sure what to do with the Other_ID_Start characters. Script capital p looks like a capital p and even has "capital" in its name. All other "* SCRIPT CAPITAL *" characters are upper case letters. Surely it should be allowed to start a variable. The estimated sign looks like an enlarged lower case e; other symbols that look like letters are classified as letters. You'd expect this to begin an atom. As for the Katakana-Hiragana voicing marks, I have no intuition whatever. Assigning the whole group to atoms seems safest.
All existing variable names and unquoted atoms remain legal, and no new variable or atom forms using only Latin-1 characters have been introduced.
While Erlang files meant to be shared with a wide audience should still be written in English, if people are working in a group fluent in some language on requirements also written in that language, it is desirable that they should be able to stay close to the terminology of the requirements lest they introduce translation errors.
The whole design flows in the direction "if someone wants to use their own script in an Erlang file, they should be able to do so comfortably in a way that is generally consistent with other programming languages."
This does mean that there will be Erlang source files that a skilled Erlang programmer is unable to decipher because of the unfamiliarity of the script. With over 110,000 characters in Unicode 6, this is just going to happen no matter what we do. Once Unicode strings are available, can quoted Unicode atoms be far behind? And once they are possible, refusing unquoted Unicode atoms does not salvage universal readability. All it would accomplish is to annoy people by requiring single quotation marks to be used liberally. Old Algol programmers will recall only too clearly how much of an impairment to readability a hailstorm of single quotation marks was. And if you can use γαμμα as an atom, does it make any sense to refuse Γαμμα?
One of the goals for this EEP is that if an Erlang text contains only Latin-1 characters, then it should be legal under the new rules if and only if it is legal under the old rules, and should have the same meaning in either context. During the transition period, there will be people writing Erlang code for systems following the new rules, and giving it to people using Latin-1 or at any rate old-rules systems. They should not accidentally introduce incompatibilities. This is why we have to ban "ª" and "º" for now. Later we may lift that ban.
There are three ways we have to customize the UAX 31 definition.
- We have to continue to support "@" in variables and "@" and "." in unquoted atoms for backwards compatibility. - We have to continue to forbid unquoted atoms containing the Latin-1 masculine and feminine ordinal indicators. - We have to distinguish between variables and unquoted atoms.
There is a fourth way we might customize it. Ken Whistler of Unicode advises that he "doesn't see much point" in allowing Pc characters other than LOW LINE and FULLWIDTH LOW LINE, unless there are legacy reasons why something else has to be supported. It seems like a good idea that if s is a legal ASCII identifier, the full width version of s should also be a legal identifier, so FULLWIDTH LOW LINE definitely ought to be allowed. I find using UNDERTIE cool, but it's an editor's mark really. If we reject the other Pc characters now, we can always allow them later if we find a need; if we allow them now, it will be hard to reject them later. Making this change clearly in the definitions will take a little thought, so that's for the next revision.
Dmitry Belyaev has raised the issue of localising keywords. That is outside the scope of this EEP, which is concerned with which character sequences are variables and which are keywords-or-unquoted-atoms. This has to be got right first before we can consider localised keywords.
The leading underscore rule was revised on the 5th of November on the advice of Ulrich Neumerkel to avoid the problem that _Œuvre would not have been accepted as a singleton. Now it will. This was ironic, as Māori variables like _Āporo would have been misclassified.
It is highly desirable that a legal Erlang text should remain legal even as Unicode is revised. UAX#31 and Stability very nearly give us what we need. The one problem that seems to be technically possible is that an upper or lower case letter without an opposite case counterpart might change its General Category (while being given the Other_ID_Start property if it ceased to be a letter at all), so an identifier beginning with such a cased orphan might switch from variable to unquoted atom or vice versa. Some cased orphans do exist, like LATIN LETTER SMALL CAPITAL M, but what would a capital capital M be?
One possibility is to raise the issue with the Unicode consortium and leave this unresolved until they reply. The issue has been raised, and the tentative reply "you may not be able to rely on any given standard property for special purposes. Especially if that property is not formally stable." given. The next step may well be to seek a revision to UAX#31, because Erlang is not alone in wanting a case distinction.
Another possibility would be to say that an Lu character may only begin a variable if it has a lower-case counterpart, and an Ll character may only begin an unquoted atom if it has an upper-case counterpart. Since "ß" and "ÿ" have upper-case counterparts in Unicode, Latin-1 unquoted atoms would not be affected by such a rule. The great mass of Lo characters would also be unaffected.
[References]: <> "==========="
This document has been placed in the public domain.
[EmacsVar]: <> "Local Variables:" [EmacsVar]: <> "mode: indented-text" [EmacsVar]: <> "indent-tabs-mode: nil" [EmacsVar]: <> "sentence-end-double-space: t" [EmacsVar]: <> "fill-column: 70" [EmacsVar]: <> "coding: utf-8" [EmacsVar]: <> "End:" [VimVar]: <> " vim: set fileencoding=utf-8 expandtab shiftwidth=4 softtabstop=4: "