diff --git a/eeps/eep-0066.md b/eeps/eep-0066.md
index e29bfe2..bcf8285 100644
--- a/eeps/eep-0066.md
+++ b/eeps/eep-0066.md
@@ -45,8 +45,8 @@ Design Decisions
 ----------------
 
 In the following text double angle quotation marks are used to
-mark source code characters in a paragraph. For example:
-«`.`» means the dot character (full stop).
+mark source code characters to improve clarity.
+For example: the dot character (full stop): «`.`».
 
 ### Erlang Language Structure (Tokenizer and Parser)
 
@@ -76,9 +76,9 @@ much state and looks just a few fixed number of characters
 ahead in the input.
 
 For example; from the start state, if the tokenizer sees
-a «`'`» character, it switches state to scanning a quoted atom.
-While doing so it translates escape sequences such as «`\n`»
-(into ASCII 10) and when it sees a «`'`» character it produces
+a `'` character, it switches state to scanning a quoted atom.
+While doing so it translates escape sequences such as `\n`
+(into ASCII 10) and when it sees a `'` character it produces
 an atom token and goes back to the start state.
 
 ### Problems with simple prefixes
@@ -92,7 +92,7 @@ The tokenizer would have to know of all combinations of prefix
 characters and emit distinct tokens for every combination.
 
 Today, the character sequence «`b`», «`f`», «`"`» is scanned as a token
-for the atom «`bf`» followed by the string start token «`"`».
+for the atom `bf` followed by the string start token `"`.
 That combination fails in the parser so it is syntactically invalid
 today, which is what makes simple prefixes a possible language
 extension.
@@ -107,30 +107,30 @@ Furthermore, it is likely that we want the feature of choosing re(^"+.*/.*$)
-Among the desired delimiters are «`/`» and «`<`»+«`>`». The currently
-valid code «`b`. The currently +valid code «`b`», and the following characters
-are start delimiters that have themselves as end delimiters:
-«`/`», «`|`», «`'`», and «`"`».
+The allowed start-end delimiter character pairs are:
+`() [] {} <>`.
 
-This EEP proposes to so far only implement the regular string
-start and end delimiter «`"`» as single character demiliter.
-It is the established string delimiter in Erlang and will
-create no confusion. The other can be added later
-and by not allowing them yet it will still be possible for them
-to have different semantics, if we find some good use for that.
+The following characters are start delimiters that have themselves
+as end delimiters: `` / | ' " ` # ``.
 
 Triple-quote delimiters are also allowed, that is; a sequence of
-3 or more double quote «`"`» characters as described in [EEP 64][].
+3 or more double quote `"` characters as described in [EEP 64][].
 
 For a given [Sigil Type][] except the [Vanilla Sigil][], which
 String Delimiters that are used does not affect how
@@ -304,6 +321,31 @@ For a triple-quoted string, though, conceptually
 the end delimiter doesn't occur in the string's content, so interpreting
 the string content does not interfere with finding the end delimiter.
 
+The proposed set of delimiters is the same as in [Elixir][1],
+plus `` ` `` and `#`. They are the characters in [ASCII][]
+that are normally used for bracketing or text quoting,
+and those that feel like full height vertical lines,
+except: `\` is too often used for character escaping,
+plus `#` which is too useful to *not* include since
+in many contexts (shell scripts, Perl regular expressions)
+it is a comment character that is easy to avoid
+in the [String Content][].
+
+Even though [Latin-1][] is the character set that Erlang
+is defined in, it is still [ASCII][] that is the common denominator
+for programming languages. Only western European keyboards
+and code pages have the possibility to produce [Latin-1][]
+characters above 127.
+
+[Latin-1][] characters above 127 are allowed in variable names
+and unquoted atoms, but the programmer that uses such should
+be aware that the code will not read correctly for
+non-[Latin-1][] users.
+On the other hand it would be bad to lure
+a programmer into using e.g. a quote character that happens to exist
+on a [Latin-1][] keyboard but will be something completely different
+for other programmers. Therefore characters like `« »`
+should *not* be used for a general syntactical element.
+
 ### String Content
 
 Between the start and end [String Delimiters][], all characters
@@ -314,7 +356,7 @@ of indentation and leading and trailing newline is done as usual
 as described in [EEP 64][].
 
 In a string with single character [String Delimiters][],
-normal Erlang escape sequences prefixed with «`\`» are honoured,
+normal Erlang escape sequences prefixed with `\` are honoured,
 as usual for regular Erlang strings and quoted atoms
 
 A specific [Sigil Type][] can have it's own character escaping rules,
@@ -330,10 +372,10 @@ of name characters.
 
 The Sigil Suffix may indicate how to interpret the String Content,
 for a specific [Sigil Type][].
-For example; for the «`~r`» [Sigil Prefix][] (regular expression),
+For example; for the `~R` [Sigil Prefix][] (regular expression),
 the Sigil Suffix is interpreted as short form compile options
 such as «`i`» that makes the regular expression character
-case insensitive.
+case insensitive. For example «`~R/^from: /i`».
 
 Things that may have to be performed by the tokenizer, such as
 how to handle escape character rules, should not be affected
@@ -346,7 +388,7 @@ or the parser.
 
 ### Regular Expressions
 
-A regular expression sigil «`~r"expression"flags`» should
+A regular expression sigil «`~R"expression"flags`» should
 be translated to something useful for tools/libraries. There are
 at least two ways; [uncompiled regular expressions][], or
 [compiled regular expressions][].
@@ -411,41 +453,47 @@ should represent an *uncompiled* regular expression with compile flags.
 
 ### Comparison with Elixir
 
-The [Vanilla Sigil][] (empty [Sigil Type][]) is not allowed in Elixir.
+There is no [Vanilla Sigil][] (empty [Sigil Type][]) in Elixir.
 
-This EEP proposes to only implement the «`"`» [String Delimiters][],
-for starters. Elixir has got a much wider set.
+This EEP proposes to add the following [String Delimiters][]
+to the set that Elixir has: `` # ` ``.
 
 The string and binary [Sigil Type][]s are named differently
 between the languages, to keep the names consistent within
-the language (Erlang): «`~s`» in Elixir is «`~b`» in Erlang,
-and «`~c`» in Elixir is «`~s`» in Erlang, so «`~s`» means
+the language (Erlang): `~s` in Elixir is `~b` in Erlang,
+and `~c` in Elixir is `~s` in Erlang, so `~s` means
 different things, because strings are different things.
 
 When Elixir allows escape sequences in the [String Content][]
 it also allows string interpolation. This EEP proposes to *not* implement
 string interpolation in the suggested [Sigil Type][]s.
 
+When Elixir doesn't allow escape sequences in the [String Content][],
+it still allows escaping the end delimiter. This EEP proposes
+that such strings should be truly verbatim with no possibility
+to escape the end delimiter.
+
 There are small differences in which escape sequences that are implemented
 in the languages; Elixir allows escaping of newlines, and has
-an escape sequence «`\a`», that Erlang does not have.
+an escape sequence `\a`, that Erlang does not have.
 
 There are also small differences in how newlines are handled
-between «`~S`» heredocs in Elixir and triple-quoted strings in Erlang.
+between `~S` heredocs in Elixir and triple-quoted strings in Erlang.
 See [EEP 64][].
 
-Details about regular expression sigils, «`~r`», in particular
+Details about regular expression sigils, `~R`, in particular
 their [Sigil Suffix][]es remains to be decided in Erlang.
+Also, there still is a question about escaping the end delimiter or not.
 
 It has not been decided how or even *if* string interpolation
-in will be implemented in Erlang, but a [Sigil Suffix][] or
+will be implemented in Erlang, but a [Sigil Suffix][] or
 new [Sigil Type][]s would most probably be used.
 
 Reference Implementation
 ------------------------
 
-[PR-7684][] Implements the «`s`», «`S`», «`b`», «`B`»
-and the «``» (vanilla) Sigil, according to this EEP.
+[PR-7684][] Implements the `~s`, `~S`, `~b`, `~B`
+and the `~` (vanilla) Sigil, according to this EEP.
 The tokenizer produces a `sigil_prefix` token before the string
 literal, and a `sigil_suffix` token after. The parser merges
 and transforms them
@@ -478,6 +526,9 @@ more tokenizer rewriting.
 
 [Latin-1]: https://en.wikipedia.org/wiki/ISO/IEC_8859-1
     "Wikipedia: ISO-IEC 8859-1"
 
+[ASCII]: https://en.wikipedia.org/wiki/Basic_Latin_(Unicode_block)
+    "Unicode Basic Latin"
+
 [PR-7684]: https://github.com/erlang/otp/pull/7684
     "Sigils on String Literals"
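
As a rough illustration of the token split described in this patch, the following Python sketch scans one sigil string literal with single-character delimiters into a `sigil_prefix` token, a string token and a `sigil_suffix` token, using the delimiter sets listed in the changed text. It is not the code in PR-7684. The function name `scan_sigil`, the token tuple shapes, the ASCII-only name character test, the lack of delimiter nesting, and the rule that an upper-case Sigil Type means a verbatim string are all assumptions made for the example. Escape sequences are only stepped over, not translated, and triple-quoted strings are left out.

```python
# Hypothetical sketch only; the real tokenizer lives in OTP and is not shown here.

# Start delimiters with a distinct end delimiter, as listed in the patch.
PAIRED = {"(": ")", "[": "]", "{": "}", "<": ">"}
# Start delimiters that have themselves as end delimiters.
SELF_DELIMITED = set("/|'\"`#")


def is_name_char(ch):
    # Assumption: ASCII letters, digits, "_" and "@" stand in for
    # Erlang name characters (Latin-1 letters are ignored here).
    return ch.isascii() and (ch.isalnum() or ch in "_@")


def scan_sigil(src, pos=0):
    """Scan one sigil literal starting at src[pos].

    Returns (tokens, end_pos), where tokens is a list of
    (kind, text) tuples in source order.
    """
    if pos >= len(src) or src[pos] != "~":
        raise ValueError("expected '~' at start of sigil")
    i = pos + 1

    # Sigil Type: a possibly empty run of name characters (empty = vanilla).
    start = i
    while i < len(src) and is_name_char(src[i]):
        i += 1
    sigil_type = src[start:i]

    if i >= len(src):
        raise ValueError("missing start delimiter")
    open_ch = src[i]
    if open_ch in PAIRED:
        close_ch = PAIRED[open_ch]
    elif open_ch in SELF_DELIMITED:
        close_ch = open_ch
    else:
        raise ValueError("illegal start delimiter: %r" % open_ch)

    # Assumption: upper-case Sigil Types are verbatim, so a backslash
    # cannot escape the end delimiter; all others honour "\" escapes.
    verbatim = sigil_type[:1].isupper()

    i += 1
    start = i
    while i < len(src) and src[i] != close_ch:
        if src[i] == "\\" and not verbatim:
            i += 1  # step over the escaped character as well
        i += 1
    if i >= len(src):
        raise ValueError("unterminated string")
    content = src[start:i]
    i += 1  # consume the end delimiter

    # Sigil Suffix: a possibly empty run of name characters.
    start = i
    while i < len(src) and is_name_char(src[i]):
        i += 1
    suffix = src[start:i]

    tokens = [
        ("sigil_prefix", sigil_type),
        ("string", content),
        ("sigil_suffix", suffix),
    ]
    return tokens, i
```

With these assumptions, `scan_sigil("~R/^from: /i")` yields the prefix `R`, the content `^from: ` and the suffix `i`. For `~s"a\"b"` the escaped quote stays inside the content, whereas for `~S"a\"b"` the string ends at the first `"` and the following `b` is scanned as the suffix.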