
Update EEP 40 from Richard O'Keefe dated Thu, 1 Nov 2012 18:27:10 +1300

RaimoNiskanen committed Nov 2, 2012 (commit b84a353fc1c2cd7f29c450abd5ba2ddcec5e974d, 1 parent 140117c)
Showing with 73 additions and 8 deletions.
  1. +73 −8 eeps/
@@ -20,7 +20,7 @@ to contain Unicode characters in a backwards compatible way.
[Note]: <> (Underscores in regular text below are backslash escaped)
[Note]: <> (due to a weird Markdown rule for emphasis within words.)
-[Note]: <> (So e.g. where it stands XID\_Start it means XID_Start. )
+[Note]: <> (So e.g. where you find XID\_Start it means XID_Start.)
@@ -52,7 +52,9 @@ Forces
7. We should not "steal" any characters to use as "magic
markers" for variables because they might be needed for
other purposes. A good (bad) example of this is "?", which
- could be used for several things if it were not used for macros.
+ could be used for several things if it were not used for macros.
+8. Character sequences in Latin-1 that are not legal variable or
+ atom names now should not be made into such by this specification.
@@ -80,9 +82,9 @@ Variables
variable ::= var_start var_continue*
- var_start ::= XID_Start ∩ (Lu ∪ Lt ∪ Pc ∪ Other_Id_Start)
+ var_start ::= (XID_Start ∩ (Lu ∪ Lt ∪ Other_Id_Start)) ∪ Pc
- var_continue ::= XID_Continue U "@"
+ var_continue ::= XID_Continue ∪ "@"
The choice of XID here follows Python. It ensures that the normalisation
of a variable is still a variable. In fact Unicode variables should be
@@ -121,12 +123,12 @@ consistent generalisation of the existing rule.
Unquoted atoms
- unquoted_atom ::= atom_start atom_continue
+ unquoted_atom ::= atom_start atom_continue*
- atom_start ::= XID_Start \ (Lu ∪ Lt ∪ Lo ∪ Pc)
+ atom_start ::= XID_Start \ (Lu ∪ Lt ∪ "ªº")
| "." (Ll ∪ Lo)
- atom_continue ::= XID_Continue U "@"
+ atom_continue ::= XID_Continue | "@"
| "." (Ll ∪ Lo)
Again the choice of XID follows Python, and ensures that the
@@ -145,7 +147,8 @@ introduced.
### Specifics ###
- Any Python identifier or keyword is
- an Erlang variable or unquoted atom or keyword.
+ an Erlang variable or unquoted atom or keyword
+ unless it begins with "ª" or "º".
- @ signs may occur freely in variables and unquoted atoms except as the
first character, as now.
@@ -169,7 +172,67 @@ introduced.
new variable or atom forms using only Latin-1 characters have been
+While Erlang files meant to be shared with a wide audience should
+still be written in English, if people are working in a group fluent
+in some language on requirements also written in that language, it
+is desirable that they should be able to stay close to the terminology
+of the requirements lest they introduce translation errors.
+The whole design flows in the direction "if someone wants to use their
+own script in an Erlang file, they should be able to do so comfortably
+in a way that is generally consistent with other programming
+languages."
+This _does_ mean that there will be Erlang source files that a skilled
+Erlang programmer is unable to decipher because of the unfamiliarity
+of the script. With over 110,000 characters in Unicode 6, this is
+just going to happen no matter what we do. Once Unicode strings are
+available, can quoted Unicode atoms be far behind? And once they are
+possible, refusing unquoted Unicode atoms does not salvage universal
+readability. All it would accomplish is to annoy people by requiring
+single quotation marks to be used liberally. Old Algol programmers
+will recall only too clearly how much of an impairment to readability
+a hailstorm of single quotation marks was. And if you can use
+γαμμα as an atom, does it make any sense to refuse Γαμμα?
+There are three ways we have to customize the UAX 31 definition.
+ - We have to continue to support "@" in variables and
+ "@" and "." in unquoted atoms for backwards compatibility.
+ - We have to continue to forbid unquoted atoms beginning
+ with the Latin-1 masculine and feminine ordinal indicators.
+ - We have to distinguish between variables and unquoted atoms.
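The three customizations above can be sketched as a small classifier. This is only an illustration in Python, not part of the EEP or of any Erlang tool. It assumes that Python's `str.isidentifier()` is an acceptable stand-in for XID\_Start and XID\_Continue, that Other\_ID\_Start is the small hard-coded set shown (the real set depends on the Unicode version), and that keywords are ignored. The names `classify`, `is_var_start` and `is_atom_start` are hypothetical.

```python
import unicodedata

# Assumption: Other_ID_Start as a fixed set; the real property varies by Unicode version.
OTHER_ID_START = {"\u2118", "\u212e", "\u309b", "\u309c"}

def is_xid_start(ch):
    # Python identifiers start with XID_Start or "_"; exclude "_" to approximate XID_Start.
    return ch != "_" and ch.isidentifier()

def is_xid_continue(ch):
    # A character that may follow a valid start character approximates XID_Continue.
    return ("a" + ch).isidentifier()

def is_var_start(ch):
    cat = unicodedata.category(ch)
    if cat == "Pc":  # connector punctuation such as "_"
        return True
    return is_xid_start(ch) and (cat in ("Lu", "Lt") or ch in OTHER_ID_START)

def is_atom_start(ch):
    cat = unicodedata.category(ch)
    # The Latin-1 ordinal indicators stay forbidden at the start of an unquoted atom.
    return is_xid_start(ch) and cat not in ("Lu", "Lt") and ch not in "ªº"

def classify(token):
    """Return "variable", "atom", or None for a candidate token (keywords ignored)."""
    if not token:
        return None
    # This sketch tries the variable rule first.
    if is_var_start(token[0]):
        rest_ok = all(ch == "@" or is_xid_continue(ch) for ch in token[1:])
        return "variable" if rest_ok else None
    i, first = 0, True
    while i < len(token):
        ch = token[i]
        if ch == ".":
            # A "." is only allowed when followed by an Ll or Lo character.
            if i + 1 >= len(token):
                return None
            if unicodedata.category(token[i + 1]) not in ("Ll", "Lo"):
                return None
            i += 2
        elif first:
            if not is_atom_start(ch):
                return None
            i += 1
        else:
            if not (ch == "@" or is_xid_continue(ch)):
                return None
            i += 1
        first = False
    return "atom"
```

Under these assumptions `classify("Γαμμα")` yields a variable and `classify("γαμμα")` an atom, while a token beginning with "ª" or "@" is rejected.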
Trouble spot
+It is highly desirable that a legal Erlang text should remain legal
+even as Unicode is revised. [UAX#31][] and [Stability][] very nearly
+give us what we need. The one problem that seems to be technically
+possible is that an upper or lower case letter without an opposite
+case counterpart might change its General Category (while being given
+the Other\_ID\_Start property if it ceased to be a letter at all),
+so an identifier beginning with such a cased orphan might switch from
+variable to unquoted atom or vice versa. Some cased orphans do exist,
+like LATIN LETTER SMALL CAPITAL M, but what would a capital capital M
+look like?
+One possibility is to raise the issue with the Unicode consortium and
+leave this unresolved until they reply.
+Another possibility would be to say that an Lu character may only
+begin a variable if it has a lower-case counterpart, and an Ll
+character may only begin an unquoted atom if it has an upper-case
+counterpart. Since "ß" and "ÿ" have upper-case counterparts in
+Unicode, Latin-1 unquoted atoms would not be affected by such a rule.
+The great mass of Lo characters would also be unaffected.
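The second possibility can be checked mechanically. The following is a rough Python sketch, not something the EEP specifies. It assumes that "has a counterpart" can be approximated by Python's `str.upper()` and `str.lower()`, which apply full case mappings (so "ß" maps to "SS"), and that the answer depends on the Unicode version bundled with the interpreter. The function name `has_opposite_case` is hypothetical.

```python
import unicodedata

def has_opposite_case(ch):
    """Approximate the proposed rule for a single leading character.

    Lu must have a lower-case counterpart and Ll an upper-case one;
    characters of other categories (such as Lo) are not constrained.
    """
    cat = unicodedata.category(ch)
    if cat == "Lu":
        return ch.lower() != ch
    if cat == "Ll":
        return ch.upper() != ch
    return True

# List the Latin-1 lower-case letters that would be rejected by the rule.
latin1_orphans = [
    chr(cp) for cp in range(256)
    if unicodedata.category(chr(cp)) == "Ll" and not has_opposite_case(chr(cp))
]
```

Printing `latin1_orphans` shows which Latin-1 letters, if any, the interpreter's Unicode data treats as cased orphans.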
@@ -188,6 +251,8 @@ Trouble spot
[PEP 3131]:
"Python Enhancement Proposal 3131"
+ "Unicode Character Encoding Stability Policy"
