Permalink
Browse files

Add EEP 40: A proposal for Unicode variable and atom names in Erlang

  • Loading branch information...
1 parent e349b08 commit 140117c84424ffe508ed2714ac3750cb07fe949c @RaimoNiskanen RaimoNiskanen committed Oct 19, 2012
Showing with 206 additions and 0 deletions.
  1. +206 −0 eeps/eep-0040.md
View
@@ -0,0 +1,206 @@
+ Author: Richard A. O'Keefe <ok(at)cs(dot)otago(dot)ac(dot)nz>
+ Status: Draft
+ Type: Standards Track
+ Erlang-version: R15B02
+ Created: 19-Oct-2012
+ Post-History: 19-Oct-2012
+****
+EEP 40: A proposal for Unicode variable and atom names in Erlang
+----
+
+
+
+Abstract
+========
+
+This EEP proposes how to extend variable and atom names in Erlang
+to contain Unicode characters in a backwards compatible way.
+
+
+
+[Note]: <> (Underscores in regular text below are backslash escaped)
+[Note]: <> (due to a weird Markdown rule for emphasis within words.)
+[Note]: <> (So e.g. where it stands XID\_Start it means XID_Start. )
+
+
+
+Forces
+======
+
+1. Support for Unicode continues to increase, with
+ minimal source code support about to arrive.
+2. Unicode variable names and unquoted atoms are not
+ here yet, so now is the time to settle on a design.
+3. They will need to come. There may be legal or
+ institutional reasons why unicode-capable languages
+ are required. Some people just want to use their
+ own language and script. Erlang's strength in
+ network applications means that being able to
+ represent Internationalized Domain Names as unquoted
+ atoms would be just as much of a convenience as
+ being able to represent ASCII domain names like
+ www.example.com (which needs no quotes in Erlang) is.
+4. There is a framework for Unicode identifiers in [UAX#31][],
+ used by several programming languages, including Ada, Java,
+ C++, C, C#, Javascript, and Python (section 2.3 of [Python Lexical][],
+ and see also [PEP 3131][]).
+5. Existing Erlang identifiers should remain valid,
+ including ones containing "@" and ".".
+6. Existing Erlang support features, such as ignoring
+ variables that start with underscore when reporting
+ singleton variables, should not be broken.
+7. We should not "steal" any characters to use as "magic
+ markers" for variables because they might be needed for
+ other purposes. A good (bad) example of this is "?", which
+ could be used for several things if it were not used for macros.
+
+
+
+Reference
+=========
+
+Names of sets of characters, XID\_Start, XID\_Continue, Lu, Lt, Lo, Pc,
+Other\_Id\_Start, are drawn from [Unicode][] and [UAX#31][].
+
+ Lu = upper case letters
+ Lt = title case letters
+ Pc = connector punctuators, including the low line (_) and
+ a number of other characters like undertie (‿).
+ Other_Id_Start = script capital p, estimated symbol,
+ katakana-hiragana voiced sound mark, and
+ katakana-hiragana semi-voiced sound mark.
+
+
+
+Specification
+=============
+
+Variables
+---------
+
+ variable ::= var_start var_continue*
+
+ var_start ::= XID_Start ∩ (Lu ∪ Lt ∪ Pc ∪ Other_Id_Start)
+
+ var_continue ::= XID_Continue U "@"
+
+The choice of XID here follows Python. It ensures that the normalisation
+of a variable is still a variable. In fact Unicode variables should be
+normalised. Unicode has enough look-alike characters that we cannot hope
+for "look the same <=> are the same" to be true, but we should go _some_
+way in that direction.
+
+Variables in scripts that do not distinguish letter case have to
+begin with _some_ special character to ensure that they are not
+mistaken for unquoted atoms. There are 10 Pc characters in the Basic
+Multilingual Plane. The Erlang parser treats a variable beginning
+with an underscore specially: there will be no complaint if it is a
+singleton. There are 9 other Pc characters for which this special
+treatment is not applied. Of course, someone might be using fonts
+that do include say Arabic letters but not say the undertie. We can
+deal with that by revising the underscore rule.
+
+ Variable does not begin with a Pc character =>
+ should not be a singleton.
+
+ Variable is just a Pc character and nothing else =>
+ is a wild card.
+
+ Variable begins with a Pc character followed by a
+ Latin-1 character =>
+ may be a singleton.
+
+ Variable begins with a Pc character following by
+ a character outside the Latin-1 range =>
+ should not be a singleton.
+
+Thus ‿ is a wild-card, 隠者 is an atom, \_隠者 should not be
+a singleton, but \_\_隠者 _may_ be a singleton. This rule is a
+consistent generalisation of the existing rule.
+
+Unquoted atoms
+--------------
+
+ unquoted_atom ::= atom_start atom_continue
+
+ atom_start ::= XID_Start \ (Lu ∪ Lt ∪ Lo ∪ Pc)
+ | "." (Ll ∪ Lo)
+
+ atom_continue ::= XID_Continue U "@"
+ | "." (Ll ∪ Lo)
+
+Again the choice of XID follows Python, and ensures that the
+normalisation of an unquoted atom is still an unquoted atom.
+Unquoted atoms should be normalised.
+
+The details of Erlang unquoted atoms are somewhat subtle; I have
+checked my understanding experimentally.
+
+Keywords
+--------
+
+Keywords have the form of unquoted atoms. No new keywords are
+introduced.
+
+### Specifics ###
+
+- Any Python identifier or keyword is
+ an Erlang variable or unquoted atom or keyword.
+
+- @ signs may occur freely in variables and unquoted atoms except as the
+ first character, as now.
+
+- dots may not be followed by capital letters, digits, or underscores,
+ as now.
+
+- I am not sure whether modifier letters should be allowed after a dot.
+
+- I am not sure what to do with the Other\_ID\_Start characters.
+ Script capital p _looks_ like a capital p and even has "capital" in
+ its name. All other "* SCRIPT CAPITAL *" characters are upper case
+ letters. Surely it should be allowed to start a variable.
+ The estimated sign looks like an enlarged lower case e; other symbols
+ that look like letters are classified as letters. You'd expect this
+ to begin an atom. As for the Katakana-Hiragana voicing marks, I have
+ no intuition whatever. Assigning the whole group to atoms seems
+ safest.
+
+- All existing variable names and unquoted atoms remain legal, and no
+ new variable or atom forms using only Latin-1 characters have been
+ introduced.
+
+Trouble spot
+
+
+
+[References]: <>
+"==========="
+
+[Unicode]: http://www.unicode.org/versions/Unicode6.2.0/
+ "The Unicode Standard version 6.2.0"
+
+[UAX#31]: http://www.unicode.org/reports/tr31/
+ "Unicode Standard Annex 31"
+
+[Python Lexical]: http://docs.python.org/release/3.1.5/reference/lexical_analysis.html
+ "Python Lexical Analysis"
+
+[PEP 3131]: http://www.python.org/dev/peps/pep-3131/
+ "Python Enhancement Proposal 3131"
+
+
+
+Copyright
+=========
+This document has been placed in the public domain.
+
+
+
+[EmacsVar]: <> "Local Variables:"
+[EmacsVar]: <> "mode: indented-text"
+[EmacsVar]: <> "indent-tabs-mode: nil"
+[EmacsVar]: <> "sentence-end-double-space: t"
+[EmacsVar]: <> "fill-column: 70"
+[EmacsVar]: <> "coding: utf-8"
+[EmacsVar]: <> "End:"
+[VimVar]: <> " vim: set fileencoding=utf-8 "

0 comments on commit 140117c

Please sign in to comment.