From 735ac21ec74352ef923fa1530b4865c06c4c494c Mon Sep 17 00:00:00 2001 From: Richard Smith Date: Mon, 18 May 2020 23:00:03 -0700 Subject: [PATCH 01/11] P0016 Lexical conventions, first draft. --- docs/proposals/p0016.md | 543 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 543 insertions(+) create mode 100644 docs/proposals/p0016.md diff --git a/docs/proposals/p0016.md b/docs/proposals/p0016.md new file mode 100644 index 000000000000..d67d96946cfd --- /dev/null +++ b/docs/proposals/p0016.md @@ -0,0 +1,543 @@ + + +# Carbon: Lexical conventions + +- **Authors:** Richard Smith +- **[Tracking issue](https://github.com/carbon-language/carbon-lang/issues/16)** +- **Status:** RFC +- **Created:** 2020-05-18 + +**_PLEASE_ DO NOT SHARE OUTSIDE CARBON FORUMS** + +## Problem + +This document proposes a set of rules for the initial phase of processing a +Carbon source file: interpreting the contents of the file and forming +[tokens](#tokens). + +## Proposal + +Carbon source files are [UTF-8](#file-contents-and-encoding) text files whose +contents are divided into [whitespace](#whitespace), [comments](#comments), +[literals](#literals), [identifiers](#identifiers), [keywords](#keywords), +[designators](#designators), [operators](#operators), and +[brackets](#brackets), as described below. + +## Details + +### File contents and encoding + +Carbon source files are Unicode text files encoded in UTF-8. An initial UTF-8 +BOM is permitted and ignored. All contents outside of [comments](#comments) and +[literals](#literals) shall be in Normalization Form C. +[[why?]](#encoding-rationale) + +Carbon is currently based on Unicode 13.0, and will adopt new Unicode versions +as they are published. + +### Whitespace + +Characters are identified as whitespace if they have the Unicode +`Pattern_White_Space` property. 
This includes the C++ whitespace characters: + + * Space and horizontal tab + * Carriage return and line feed (which C++ conflates as "new line") + * Vertical tab and form feed + +As of Unicode version 13, 5 additional characters are included: + + * U+0085 NEXT LINE + * U+200E LEFT-TO-RIGHT MARK + * U+200F RIGHT-TO-LEFT MARK + * U+2028 LINE SEPARATOR + * U+2029 PARAGRAPH SEPARATOR + +Space, horizontal tab, and the LTR and RTL mark are *horizontal whitespace* +characters. All other whitespace characters are *vertical whitespace* +characters. + +Characters with the Unicode property `White_Space` but not +`Pattern_White_Space` are invalid outside comments and literals. Code +formatters are encouraged to convert them into recognized horizontal whitespace +characters. Implementations are encouraged to recover from the error as if +those characters were treated as horizontal whitespace. + +### Comments + +A *comment* in Carbon is either: + + * A *line comment*, beginning with `//` and running to the end of the line, or + * A *block comment*, beginning with `/*` and running to the matching `*/`. + +Carbon has no mechanism for physical line continuation, so a `//` comment +always ends at the next vertical whitespace character. +[[why?]](#line-continuation-rationale) + +If the character after the `/*` introducing a block comment is `{`, the comment +is a *code comment*. In a code comment, the following text is tokenized until a +matching `}*/` token is formed; such a token terminates the comment. (In +particular, such a token is not recognized if it is nested within another +comment or a literal.) Otherwise, the comment ends at the first matching `*/` +character sequence. +[[why?]](#nested-comments-rationale) + +Example: + +```carbon +// This is a comment. +// The characters /* introduce a block comment. +This is not a comment. +/*{ + // The characters */ end a block comment. +}*/ +This is not a comment. 
+``` + +If the character after the comment introducer is an exclamation mark, the +comment is a documentation comment. Documentation comments are tokens, and are +recognized by the language grammar only in specific locations, which determine +the entity to which they attach. +[[why?]](#nested-comments-rationale) + +Non-documentation comments are treated equivalently to whitespace. + +In addition to the cases above, a block comment introducer may be followed by +additional `*` characters. If the character after the comment introducer is not +one of those mentioned above, it shall be a whitespace character. +[[why?]](#comment-introducers-rationale) + +### Literals + +Carbon provides literal syntax for numbers, and for character and string data. +(Additional constants, such as `True` and `Nullptr`, are exposed as keywords or +predeclared identifiers.) + +A *literal* is a numeric literal, a character literal, or a string literal, as +defined below. + +A literal shall not be immediately followed by a character with property +`XID_Start`. Carbon has no literal suffixes, but the corresponding lexical +space is reserved for future extensions. + +#### Numbers + +Decimal integers are written as a non-zero decimal digit followed by zero or +more additional decimal digits. + +Integers in other bases are written as a `0` followed by a base specifier +character, followed by a sequence of digits in the corresponding base. The +available base specifiers and corresponding bases are: + +| Base specifier | Base | Digits | +| -------------- | ---- | ------------------------------------- | +| `b` or `B` | 2 | `0` and `1` | +| `o` | 8 | `0` ... `7` | +| `x` or `X` | 16 | `0` ... `9`, `a` ... `f`, `A` ... `F` | + +[TODO: This doesn't belong here.] There are no size suffixes. Each literal has +a unique type that can be converted to any sufficiently-large integer type, but +operations on it are always exact. 
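The integer rules above can be sketched as a small recognizer. This is only an illustrative transcription, not part of the proposal: the names `INTEGER_LITERAL` and `is_integer_literal` are hypothetical, "a sequence of digits" is assumed to mean one or more digits, and a lone `0` is accepted as a later patch in this series specifies.

```python
import re

# Assumed transcription of the integer rules; names are illustrative only.
INTEGER_LITERAL = re.compile(
    r"""
    0[bB][01]+            # base 2: specifier b or B
  | 0o[0-7]+              # base 8: only lowercase o is listed
  | 0[xX][0-9a-fA-F]+     # base 16: specifier x or X
  | [1-9][0-9]*           # decimal: non-zero digit, then more digits
  | 0                     # a single zero
    """,
    re.VERBOSE,
)


def is_integer_literal(text: str) -> bool:
    """Return True if the whole of `text` is an integer literal."""
    return INTEGER_LITERAL.fullmatch(text) is not None
```

Under these assumptions `0o17` is accepted, while the C-style octal `017` and the uppercase `0O17` are both rejected.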
+ +Real numbers are written as a sequence of one or more decimal digits followed +by a decimal point followed by a sequence of one or more decimal digits. + +A real number can be followed by an `e`, an optional `+` or `-` (defaulting to +`+`), and a decimal integer *N*; the effect is to multiply the given value by +10*N*. + +A *numeric literal* is an integer or real number expressed as described above. + +#### Characters + +A *character literal* is formed of any single character other than a backslash +(`\\`) or single quotation mark, enclosed in a pair of single quotation marks +(`'`), or an escape sequence enclosed in a pair of single quotation marks. + +An escape sequence is replaced by the corresponding character sequence or +encoding, which shall fit in a single character. + +TODO: Table of escape sequences. + +#### Strings + +A *simple string literal* is formed of a sequence of + + * characters other than backslashes, double quotation marks, and vertical + whitespace + * escape sequences + +enclosed in double quotation marks (`"`). Each escape sequence is replaced with +the corresponding character sequence or encoding. + +A *raw string literal* starts with an `r` followed by *N* `#` characters +followed by a double quotation mark, and ends with the first following +occurrence of a double quotation mark followed by *N* `#` characters. The text +in between is not interpreted in any way. + +A *block string literal* starts with three double quotation marks followed by +an optional sequence of non-whitespace characters, followed by a newline. Each +following line within the literal shall start with the same initial sequence of +zero or more horizontal whitespace characters and optionally one `|` character +as were present on the first such line. The literal ends at the first instance +of three double quotation mark characters (where the first such character is +not escaped). 
The common initial horizontal whitespace is removed from each +line, as is the terminating newline character. Escape sequences are expanded +as in a simple string literal. The initial sequence of characters before the +newline is ignored, but can be used to indicate the formatting rules for a code +formatter to use for the literal contents. +[[why?]](#block-strings-rationale) + +A *raw block string literal* is expressed analogously to a raw string literal, +but for a block string literal. Escape sequences are ignored. + +For example: + +```carbon +fn f() { + var String: x = r#""" + This is the content of the string. The 'T' is the first character + of the string. + """ <-- This is not the end of the string. But this is --> """#; + var String: y = r"Hello\"; // OK, final character is \ + var String: z = r##"Raw strings r#"nesting"#"##; + + var String: starts_with_whitespace = """ + | int x = 1; + | int y = 2;"""; + var String: starts_with_pipe = r#""" + || is a pipe. + |\ is not a pipe."""#; + + var String: code = """c++ + const char *str = R"foo(hello)foo"; + """; + + var String: error = """ +This is invalid (insufficiently indented)."""; +} +``` + +### Identifiers + +An *identifier* is a maximal sequence of characters beginning with a character +with Unicode property `XID_Start`, followed by zero or more characters with +property `XID_Continue`. + +Notably, `XID_Start` does not include the underscore character. Tokens +beginning with an identifier are reserved, but are expected to be used as +pattern matching placeholders. + +Additionally, a *raw identifier* can be specified by prefixing an identifier +with `r#`, such as `r#requires`. Raw identifiers can be used to introduce and +use names that are lexically identical to keywords. + +All identifier tokens in all contexts are looked up using the same lexical +scoping rule. + +An identifier shall not be immediately followed by a `"` or `'`. + +### Keywords + +A *keyword* is an identifier with predefined meaning. 
Carbon has a predefined +set of keywords, that will be specified separately as part of the syntax rules. + +An identifier that is a keyword may also be declared explicitly in a source +file. The same identifier shall not be used as both a keyword and as a non-raw +non-keyword identifier in a single source file. +[[why?]](#keywords-rationale) + +Example: + +```carbon +var Int: fn = 3; // OK, variable named 'fn' +fn f() {} // error, 'fn' is not a keyword in this source file +interface var {} // error, already used 'var' as a keyword in this source file +``` + +### Designators + +A *designator* is a token formed by prefixing an identifier with a period +character, such as `.member`. The identifier after the period is the *member +name*, and is looked up in a context-dependent manner. +[[why?]](#designators-rationale) + +### Operators + +An *operator* is a maximal sequence of characters with Unicode property +`Pattern_Syntax`, excluding `"` and `'` and those characters with class `Ps` or +`Pe` (for which, see [brackets](#brackets)), which we will refer to as +*operator characters*. +[[why?]](#operators-rationale) +We do not intend to define any operators containing non-ASCII characters. The +ASCII operator characters are: + +``` +! # $ % & * + - . / : ; < = > ? @ \ ^ ` | ~ +``` + +Of these, we intend to not use \` due to its common use to escape +code, nor `$` due to its absence from many non-US keyboards. This leaves 20 +operator characters, 400 digraphs, and so on. + +Bracket operators, described below, are also operators. + +### Brackets + +A *simple open bracket* is a character with Unicode property `Pattern_Syntax` +and character class `Ps`, such as `(` or `[`. +A *simple close bracket* is a character +with Unicode property `Pattern_Syntax` and character class `Pe`, such as `}`. +A *bracket terminator character* is one of `|` or `:`. +A *bracket continuation character* is an operator character that is not a +bracket terminator character. 
+ +A *compound open bracket* is a simple open bracket followed by zero or more +bracket continuation characters followed by a bracket terminator character, +such as `[:`. +A *compound close bracket* is a bracket terminator character followed by zero +or more bracket continuation characters followed by a simple close bracket, +such as `|=)`. +[[why?]](#compound-brackets-rationale) + +A *close bracket* is either a simple open bracket or a compound open bracket. +A *close bracket* is either a simple close bracket or a compound close bracket. + +The close bracket matching an open bracket is formed by reversing the character +sequence in the open bracket and replacing each character with class `Ps` with +the corresponding character with class `Pe`. Every open bracket is required to +have a matching close bracket such that the bracketed regions form a tree +structure. + +There are 3 single-character brackets (`()`, `{}`, `[]`), 6 bracket digraphs +(`(| |)`, `{| |}`, `[| |]`, `(: :)`, `{: :}`, `[: :]`), 108 bracket trigraphs, +and so on. + +For example: + +```carbon +(this is within brackets {and this [this too]}) +(|this is a different kind of bracket {: and another :} + (**|lots of kinds of brackets can be built [=: this way :=]|**) + |) +``` + +TODO: Indentation restrictions for bracket matching? + +In addition, Carbon recognizes *bracket operators*, formed by a simple open +bracket followed by one or more operator characters followed by a matching +simple close bracket, such as `[~>]` or `(*)`. Bracket operators are operators, +not brackets. + +### Tokens + +A *token* is a documentation comment, a literal, an identifier, a keyword, a +designator, an operator, or a bracket. Tokens are formed by a single +left-to-right scan of the source file, using a "max munch" rule: the longest +possible next token is formed at each step.
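The "max munch" rule for operators can be illustrated with a deliberately simplified scanner. This is a sketch under stated assumptions rather than the proposed lexer: `OPERATOR_CHARS` and `split_operator_runs` are hypothetical names, only the ASCII operator characters listed earlier are considered, and literals, comments, brackets, and designators are ignored (any other run of non-whitespace text is lumped into one "word").

```python
# The 22 ASCII operator characters listed under "Operators".
OPERATOR_CHARS = set("!#$%&*+-./:;<=>?@\\^`|~")


def split_operator_runs(source: str) -> list[str]:
    """Split `source` into maximal operator runs and other words."""
    tokens = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():
            i += 1
            continue
        # Max munch: extend the run while the character kind stays the same.
        is_operator = ch in OPERATOR_CHARS
        j = i
        while (
            j < len(source)
            and not source[j].isspace()
            and (source[j] in OPERATOR_CHARS) == is_operator
        ):
            j += 1
        tokens.append(source[i:j])
        i = j
    return tokens
```

With this sketch, `-*p` yields the single operator `-*` followed by `p`, which is why the proposal expects parentheses (or whitespace, as in `- *p`) when two operators are meant.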
+ +## Rationale + +### Encoding rationale + +We intend to follow the Unicode Consortium's recommendations for identifiers in +programming languages as described in Unicode 13.0.0 +[UAX#31](https://unicode.org/reports/tr31/) Revision 31. We do not see a reason +to be inventive in this regard, and delegating the complex considerations over +how Unicode characters should be used to a group with greater expertise in that +area seems appropriate. + +We observe UAX#31's requirements as follows: + + * UAX31-R1: requirement met. Identifiers are of the form `XID_Start` + `XID_Continue`\*. + * UAX31-R1a: requirement not met. Format characters are not permitted in + identifiers. + * UAX31-R1b: requirement not met. We intend for Carbon evolution to follow + Unicode evolution, including removing identifier characters as appropriate + over time. + * UAX31-R2: requirement not met. We intend for Carbon evolution to follow + Unicode evolution, including adding identifier characters as appropriate + over time. + * UAX31-R3: requirement met. Carbon treats characters as whitespace if and + only if they are `Pattern_White_Space` characters, and all operator tokens + are formed exclusively from `Pattern_Syntax` characters. + * UAX31-R4: requirement met. Carbon identifiers are required to be in NFC, + so identifiers that are the same in NFC are equivalent. + * UAX31-R5: requirement not met. Carbon identifiers are case-sensitive. + * UAX31-R6: requirement met. No characters are excluded from normalization. + * UAX31-R7: requirement not met. Carbon identifiers are case-sensitive. + +### Line continuation rationale + +Line continuation in C++ is sometimes necessary in order to combine the needs +of line-oriented parsing with the desire to meet a specific column limit or +format code nicely.
For example: + +```c++ +#define SOME_MACRO \ + very long macro body \ + split over multiple lines +#define OTHER_MACRO \ + if (pretty_code) \ + do { wrap_lines() } while (false) +``` + +We do not have a commensurate need for line continuation in Carbon. We intend +to include no line-oriented syntax. In reasonable cases where an individual +token is longer than a natural column limit (such as for a long string +literal), we will provide a mechanism to wrap the token without line +continuation. + +Line continuation for comments in particular is a known source of gotchas in C +and C++ programs. + +### Nested comments rationale + +There is a known need for nested comment syntax: it is important to be able to +comment out a block of code confident in the knowledge that all text between +the comment markers (and exactly that text) was in fact commented out. + +The only facility for this in C and C++ is `#if 0` ... `#endif`, and that is +what is used in practice. As we do not want Carbon to have a textual +preprocessor, enabling nesting of some other comment scheme seems natural. + +Conversely, it is valuable to have a comment syntax for human-readable +commentary that can be used within a line and can span multiple lines, and we +see no reason to invent something other than `/* ... */` for this purpose. We +cannot use the same notation as nested comment syntax without compromising its +usage for human-readable commentary. For example: + +```carbon +f(1.0 / x /* can't be negative */, 2); +``` + +... would not be valid as a `/*{ ... }*/` comment, because it would contain an +unterminated character literal. + +### Comment introducers rationale + +We anticipate the possibility of adding additional kinds of comment in the +future. Reserving syntactic space in comment syntax, in a way that is easy for +programs to avoid, allows us to add such additional kinds of comment as a +non-breaking change.
+ +### Block strings rationale + +Block literals are a useful way of expressing multiline string content in a +program. It's useful to treat block literals and raw literals as distinct +concepts: even within multiline literals, explicit encoding of tab characters, +character escapes, and so forth can be useful or undesirable. + +Further, separating the concepts permits us to disallow newlines in non-block +raw string literals, which prevents one class of runaway lexing problem: the +inability to find the end of a raw string literal can lead to scanning and +consuming the entirety of the source file. + +Removing the initial indentation from block string literals serves two primary +purposes: it allows the lexer to intelligently abort and backtrack sooner if it +reaches the end of the indented region without seeing the end of string marker, +and it improves the readability of the code. + +### Keywords rationale + +One of Carbon's most important goals is to support program and language +evolution. We know that the set of keywords in Carbon will grow over time, +and the easiest kind of language change from an evolutionary perspective is one +that is known to break no programs, that lets programs migrate incrementally to +the new language rule, and that either has no migration cost or only imposes +automatable migration cost on the code that intends to use the new feature. + +The proposed approach to keywords intends to support such a migration story. +Adding new keywords to Carbon is a non-breaking change. Because every +identifier is locally declared using obvious syntax before it is used, it is +straightforward to detect, using simple rules, whether a particular identifier +is a keyword or not in a particular source file. + +Using a new keyword in an existing source file requires first replacing all +existing uses of that identifier with raw identifiers throughout the source +file, which is a mechanical, automatable change. 
+ +For identifiers whose scopes are constrained to a single source file, raw +identifiers are not necessary to permit such a transition. However, for +identifiers that are declared in one source file and consumed in another, we +still need a mechanism to continue declaring a name as an identifier after it +has been claimed as a keyword. + +Note that while this means that adding a new keyword is cheap in terms of +migration cost, we should still think of adding a keyword as being a +significant undertaking, as each keyword will occupy space in the mind of the +Carbon programmer. However, we should not feel any pressure to reuse the same +keyword for distinct purposes. + +This approach brings one important restriction: in any syntax that introduces +an identifier, there should never be an optional keyword preceding the +identifier, and nor should the identifier be optional if it can be followed by +a keyword. + +### Designators rationale + +We wish to have uniform scoping and name lookup rules throughout Carbon. +However, we also wish to parse expressions such as `x.y`, where `x` is looked +up as an identifier, and `y` has some other lookup rule. We also wish to use +`.y` as a designator when initializing fields. + +It is reasonable to conclude that an identifier preceded by a period is a +fundamentally different kind of token from a regular identifier: it has +different name lookup rules (if it's looked up at all) and cannot simply be +immediately resolved by lookup in the environment. + +Treating such tokens as special at the lexical level has other beneficial effects: + + * It avoids a special case in the rules for binary operators. This would be + the only binary operator that affects how name lookup is performed on its + right-hand side, and would be the only binary operator for which we do not + expect (or perhaps require) whitespace on both sides. + * It prohibits whitespace between the period and the identifier. 
This enforces + an intended stylistic convention. As a postfix unary operator, we will also + enforce an absence of whitespace before designators in member access syntax. + * It frees up `.` for use as a binary operator, should we so desire. + * It allows extremely fast lexical name lookup for all identifiers via string + interning: the same lookup that checks whether an identifier is a keyword + can also perform the complete lexical lookup if it's not a keyword. + * It permits uniform typo correction, including correction to keywords, for + all identifiers in all contexts, because we can identify typos from the + lexing stage before we even reach the parser. + +### Operators rationale + +We use a strict "max munch" rule for operators, without regard for Carbon's +current operator set. This requires parentheses in code that would apply +multiple prefix or postfix operators in a row, such as `-*p`, but gives us the +advantage that adding new operators is always a non-breaking change for all +existing Carbon code. + +### Compound brackets rationale + +We intend for each notation in Carbon code to have exactly one meaning, and for +the language to be evolvable in new directions. However, there are only three +easily-typable sets of brackets for most Carbon programmers -- four if you +include `<>`, which introduces a host of problems. We already know of more than +this many different kinds of bracketed region we wish to support, and have +started trying to play syntactic games to treat them as the same thing in order +to get back down to only three. + +Including compound brackets allows us to solve these problems: we have an +unbounded set of potential bracket pairs, without substantially increasing the +complexity of parsing Carbon code.
The bracket terminator characters are chosen +such that a bracket followed by an operator can be easily visually separated by +a reader of Carbon code: even in a tricky case such as `[*|*p|*]`, the +expression `*p` within the `[*|` ... `|*]` brackets is reasonably readable. And +we will not need to resort to three-character brackets unless we exhaust our 9 +kinds of two-character brackets. + +Support for compound brackets requires that we make a concession: the bracket +terminator characters cannot be used within prefix or postfix operators. For +the two chosen symbols, this is unlikely to present a problem. + +## Alternatives considered + +TODO: Consider alternatives From 511a9729edd69991d8123d0ce802a9b3c22927d4 Mon Sep 17 00:00:00 2001 From: Richard Smith Date: Tue, 19 May 2020 18:33:38 -0700 Subject: [PATCH 02/11] Rephrase introduction of list of ASCII whitespace characters. Co-authored-by: Dmitri Gribenko --- docs/proposals/p0016.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/proposals/p0016.md b/docs/proposals/p0016.md index d67d96946cfd..7420fa921695 100644 --- a/docs/proposals/p0016.md +++ b/docs/proposals/p0016.md @@ -42,7 +42,7 @@ as they are published. ### Whitespace Characters are identified as whitespace if they have the Unicode -`Pattern_White_Space` property. This includes the C++ whitespace characters: +`Pattern_White_Space` property. 
These include the ASCII whitespace characters (recognized in C++): * Space and horizontal tab * Carriage return and line feed (which C++ conflates as "new line") From 22ed0b8e8bc0598f5e6a996bee9cf8c0aa66526a Mon Sep 17 00:00:00 2001 From: Richard Smith Date: Wed, 20 May 2020 15:19:24 -0700 Subject: [PATCH 03/11] Apply suggestions from code review Co-authored-by: Dmitri Gribenko Co-authored-by: josh11b --- docs/proposals/p0016.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/proposals/p0016.md b/docs/proposals/p0016.md index 7420fa921695..629a81f21cda 100644 --- a/docs/proposals/p0016.md +++ b/docs/proposals/p0016.md @@ -56,7 +56,7 @@ As of Unicode version 13, 5 additional characters are included: * U+2028 LINE SEPARATOR * U+2029 PARAGRAPH SEPARATOR -Space, horizontal tab, and the LTR and RTL mark are *horizontal whitespace* +Space, horizontal tab, and the LTR and RTL marks are *horizontal whitespace* characters. All other whitespace characters are *vertical whitespace* characters. @@ -228,7 +228,7 @@ with Unicode property `XID_Start`, followed by zero or more characters with property `XID_Continue`. Notably, `XID_Start` does not include the underscore character. Tokens -beginning with an identifier are reserved, but are expected to be used as +beginning with an underscore are reserved, but are expected to be used as pattern matching placeholders. Additionally, a *raw identifier* can be specified by prefixing an identifier @@ -303,7 +303,7 @@ or more bracket continuation characters followed by a simple close bracket, such as `|=)`. [[why?]](#compound-brackets-rationale) -A *close bracket* is either a simple open bracket or a compound open bracket. +An *open bracket* is either a simple open bracket or a compound open bracket. A *close bracket* is either a simple close bracket or a compound close bracket. 
The close bracket matching an open bracket is formed by reversing the character From cc207d19d662ff8236c5f687fcfa240fe37292dc Mon Sep 17 00:00:00 2001 From: Richard Smith Date: Tue, 19 May 2020 19:05:29 -0700 Subject: [PATCH 04/11] Recognize `0` as an integer literal. --- docs/proposals/p0016.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/proposals/p0016.md b/docs/proposals/p0016.md index 629a81f21cda..5a293fdda0d5 100644 --- a/docs/proposals/p0016.md +++ b/docs/proposals/p0016.md @@ -126,7 +126,7 @@ space is reserved for future extensions. #### Numbers Decimal integers are written as a non-zero decimal digit followed by zero or -more additional decimal digits. +more additional decimal digits, or as a single `0`. Integers in other bases are written as a `0` followed by a base specifier character, followed by a sequence of digits in the corresponding base. The From bce8eb461854fe141b3791a029e256fea9fa6daf Mon Sep 17 00:00:00 2001 From: Richard Smith Date: Wed, 20 May 2020 15:19:34 -0700 Subject: [PATCH 05/11] Updates based on review feedback. --- docs/proposals/p0016.md | 302 ++++++++++++++++++++++++++++++---------- 1 file changed, 231 insertions(+), 71 deletions(-) diff --git a/docs/proposals/p0016.md b/docs/proposals/p0016.md index 5a293fdda0d5..337e94dd7707 100644 --- a/docs/proposals/p0016.md +++ b/docs/proposals/p0016.md @@ -23,9 +23,9 @@ Carbon source file: interpreting the contents of the file and forming Carbon source files are [UTF-8](#file-contents-and-encoding) text files whose contents are divided into [whitespace](#whitespace), [comments](#comments), -[literals](#literals), [identifiers](#identifiers), [keywords](#keywords), -[designators](#designators), [operators](#operators), and -[brackets](#brackets), as described below. +[literals](#literals), [identifiers](#identifiers) (including +[keywords](#keywords)), [designators](#designators), [operators](#operators), +and [brackets](#brackets), as described below. 
## Details @@ -66,6 +66,17 @@ formatters are encouraged to convert them into recognized horizontal whitespace characters. Implementations are encouraged to recover from the error as if those characters were treated as horizontal whitespace. +A *line* is a possibly-empty sequence of characters preceded and followed by +either vertical whitespace or the beginning or end of the file. As a special +case, an empty sequence of characters preceded by a carriage return and +followed by a line feed is not treated as a line. For example, "foo\r\nbar" +contains two lines, and "foo\n\rbar" contains three, of which the middle line +is empty. + +The *indentation* of a line is the sequence of horizontal whitespace characters +at the start of the line. A line *A* has more indentation than a line *B* if the +indentation of *B* is a proper prefix of the indentation of *A*. + ### Comments A *comment* in Carbon is either: @@ -119,9 +130,9 @@ predeclared identifiers.) A *literal* is a numeric literal, a character literal, or a string literal, as defined below. -A literal shall not be immediately followed by a character with property -`XID_Start`. Carbon has no literal suffixes, but the corresponding lexical -space is reserved for future extensions. +A literal shall not be immediately followed by an [identifier continuation +character](#identifiers). Carbon has no literal suffixes, but the corresponding +lexical space is reserved for future extensions. #### Numbers @@ -147,23 +158,14 @@ by a decimal point followed by a sequence of one or more decimal digits. A real number can be followed by an `e`, an optional `+` or `-` (defaulting to `+`), and a decimal integer *N*; the effect is to multiply the given value by -10*N*. +10<sup>±*N*</sup>. A *numeric literal* is an integer or real number expressed as described above.
-#### Characters - -A *character literal* is formed of any single character other than a backslash -(`\\`) or single quotation mark, enclosed in a pair of single quotation marks -(`'`), or an escape sequence enclosed in a pair of single quotation marks. - -An escape sequence is replaced by the corresponding character sequence or -encoding, which shall fit in a single character. - -TODO: Table of escape sequences. - #### Strings +[[alternatives]](#string-alternatives) + A *simple string literal* is formed of a sequence of * characters other than backslashes, double quotation marks, and vertical @@ -173,81 +175,108 @@ A *simple string literal* is formed of a sequence of enclosed in double quotation marks (`"`). Each escape sequence is replaced with the corresponding character sequence or encoding. +TODO: Table of escape sequences. + A *raw string literal* starts with an `r` followed by *N* `#` characters followed by a double quotation mark, and ends with the first following -occurrence of a double quotation mark followed by *N* `#` characters. The text -in between is not interpreted in any way. - -A *block string literal* starts with three double quotation marks followed by -an optional sequence of non-whitespace characters, followed by a newline. Each -following line within the literal shall start with the same initial sequence of -zero or more horizontal whitespace characters and optionally one `|` character -as were present on the first such line. The literal ends at the first instance -of three double quotation mark characters (where the first such character is -not escaped). The common initial horizontal whitespace is removed from each -line, as is the terminating newline character. Escape sequences are expanded -as in a simple string literal. The initial sequence of characters before the -newline is ignored, but can be used to indicate the formatting rules for a code -formatter to use for the literal contents. 
+occurrence of a double quotation mark followed by *N* `#` characters on the +same line. The text in between is not interpreted in any way. + +A *block string literal* starts with three double quotation marks, followed by +an optional file type indicator, followed by a newline, and ends at the next +instance of three double quotation marks. The closing `"""` shall be the first +non-whitespace characters on that line. The lines between the opening line and +the closing line are *content lines*. Each non-empty content line shall be [at +least as indented](#whitespace) as the line containing the closing `"""`. The +closing line shall be at least as indented as the opening line, and shall be +more indented if the opening `"""` are not the first non-whitespace characters +on the opening line. The content of the literal is formed by removing the +indentation of the closing line from each (non-empty) content line, and +concatenating the results with a line feed character added between each pair of +lines. [[why?]](#block-strings-rationale) +A *file type indicator* is a sequence of characters that are either [identifier +continuation characters](#identifiers) or [operator characters](#operators). + A *raw block string literal* is expressed analogously to a raw string literal, -but for a block string literal. Escape sequences are ignored. +but for a block string literal. Escape sequences are ignored, but indentation +is removed and each vertical whitespace character is replaced by a line feed +as in a non-raw block string literal. For example: ```carbon fn f() { + """ + This is a string literal. Its first character is 'T' and its last character + is '.'. It contains one embedded newline, between 'character' and 'is'. + """ var String: x = r#""" This is the content of the string. The 'T' is the first character of the string. - """ <-- This is not the end of the string. But this is --> """#; + """ <-- This is not the end of the string. + """#; // <-- But this is.
var String: y = r"Hello\"; // OK, final character is \ var String: z = r##"Raw strings r#"nesting"#"##; - var String: starts_with_whitespace = """ - | int x = 1; - | int y = 2;"""; - var String: starts_with_pipe = r#""" - || is a pipe. - |\ is not a pipe."""#; - - var String: code = """c++ - const char *str = R"foo(hello)foo"; - """; - - var String: error = """ -This is invalid (insufficiently indented)."""; + // This string starts and ends with two "s. + var String: ambig1 = r#"""This is a raw string literal starting with """#; + var String: ambig2 = r#"""This + is a block string literal with file type 'This' and first character 'i'. + """#; + + var String: starts_with_whitespace = """c++ + int x = 1; // This line starts with two spaces. + int y = 2; // This line starts with two spaces. + """; + + var String: invalid1 = """ +error: insufficiently indented. +"""; + var String: invalid2 = r#""" + error: closing """ is not on its own line."""#; } ``` +#### Characters + +A *character literal* is lexically identical to a simple string literal, except +that it is enclosed in single quotation marks (`'`) instead of double quotation +marks (`"`). + ### Identifiers An *identifier* is a maximal sequence of characters beginning with a character -with Unicode property `XID_Start`, followed by zero or more characters with -property `XID_Continue`. +with Unicode property `XID_Start`, followed by zero or more *identifier +continuation characters*, which are characters that either have property +`XID_Continue` or are underscores (`_`). Notably, `XID_Start` does not include the underscore character. Tokens -beginning with an underscore are reserved, but are expected to be used as -pattern matching placeholders. +beginning with an underscore are [reserved](#reserved-tokens). +[[why?]](#underscores-rationale) Additionally, a *raw identifier* can be specified by prefixing an identifier with `r#`, such as `r#requires`. 
Raw identifiers can be used to introduce and use names that are lexically identical to keywords. +[[why?]](#keywords-rationale) All identifier tokens in all contexts are looked up using the same lexical scoping rule. An identifier shall not be immediately followed by a `"` or `'`. -### Keywords +#### Keywords A *keyword* is an identifier with predefined meaning. Carbon has a predefined set of keywords, that will be specified separately as part of the syntax rules. An identifier that is a keyword may also be declared explicitly in a source file. The same identifier shall not be used as both a keyword and as a non-raw -non-keyword identifier in a single source file. +non-keyword identifier in a single source file. As a consequence of these +rules, from a lexical standpoint there is no notion of keywords -- whether a +given identifier is a keyword depends on the syntactic structure of the source +file. [[why?]](#keywords-rationale) Example: @@ -312,9 +341,10 @@ the corresponding character with class `Pe`. Every open bracket is required to have a matching close bracket such that the bracketed regions form a tree structure. -There are 3 single-character brackets (`()`, `{}`, `[]`), 6 bracket digraphs -(`(| |)`, `{| |}`, `[| |]`, `(: :)`, `{: :}`, `[: :]`), 108 bracket trigraphs, -and so on. +We do not intend to include any non-ASCII characters as part of Carbon's +syntax. This leaves 3 single-character brackets (`()`, `{}`, `[]`), 6 bracket +digraphs (`(| |)`, `{| |}`, `[| |]`, `(: :)`, `{: :}`, `[: :]`), 108 bracket +trigraphs, and so on. For example: @@ -325,7 +355,10 @@ For example: |) ``` -TODO: Indentation restrictions for bracket matching? +Each non-empty line from the line containing an opening bracket to the line +containing the matching closing bracket (inclusive) shall have at least as much +indentation as the line containing the opening bracket. 
+[[why?]](#bracket-indentation-rationale) In addition, Carbon recognizes *bracket operators*, formed by a simple open bracket followed by one or more operator characters followed by a matching @@ -334,10 +367,23 @@ not brackets. ### Tokens -A *token* is a documentation comment, a literal, an identifier, a keyword, a -designator, an operator, or a bracket. Tokens are formed by a single -left-to-right scan of the source file, using a "max munch" rule: the longest -possible next token is formed at each step. +A *token* is a documentation comment, a literal, an identifier (which might be +a keyword), a designator, an operator, or a bracket. Tokens are formed by a +single left-to-right scan of the source file, using a "max munch" rule: the +longest possible next token is formed at each step, after skipping whitespace +and comments. + +#### Reserved tokens + +It is an error if token formation (after skipping whitespace and comments) is +attempted in the following circumstances: + + * When the first character does not have property `XID_Continue` or + `Pattern_Syntax`. + * As a special case of the prior bullet, when the first character is an + underscore. [[why?]](#underscores-rationale) + * When the first two characters are `r#` and neither a raw identifier nor a + raw string literal would be formed. ## Rationale @@ -345,15 +391,22 @@ possible next token is formed at each step. We intend to follow the Unicode Consortium's recommendations for identifiers in programming languages as described in Unicode 13.0.0 -[UAX#31](https://unicode.org/reports/tr31/) Revision 31. We do not see a reason +[UAX#31](https://unicode.org/reports/tr31/) Revision 33. We do not see a reason to be inventive in this regard, and delegating the complex considerations over how Unicode characters should be used to a group with greater expertise in that area seems appropriate. +As an exception, Carbon permits underscore as a continuation character in +identifiers.
Usage of this character is sufficiently common in C++ identifiers +that excluding it conflicts with our interoperability goal. However, leading +underscores are not permitted in Carbon identifiers. +[[why?]](#underscores-rationale) + We observe UAX#31's requirements as follows: * UAX31-R1: requirement met. Identifiers are of the form `XID_Start` - `XID_Continue`\*. + `Continue`\*, using a profile in which `Continue` is `XID_Continue` plus + U+005F LOW LINE (`_`). * UAX31-R1a: requirement not met. Format characters are not permitted in identifiers. * UAX31-R1b: requirement not met. We intend for Carbon evolution to follow @@ -367,9 +420,13 @@ We observe UAX#31's requirements as follows: are formed exclusively from `Pattern_Syntax` characters. * UAX31-R4: requirement met. Carbon identifiers are required to be in NFC, so identifiers that are the same in NFC are equivalent. - * UAX31-R5: requirement not met. Carbon identifiers are case-sensitive. + * UAX31-R5: requirement not met. Carbon identifiers are case-sensitive, so + this requirement is inapplicable. * UAX31-R6: requirement met. No characters are excluded from normalization. - * UAX31-R7: requirement not met. Carbon identifiers are case-sensitive. + * UAX31-R7: requirement not met. Carbon identifiers are case-sensitive, so + this requirement is inapplicable. + * UAX31-R8: requirement not applicable. Carbon does not have hashtag + identifiers. ### Line continuation rationale @@ -430,17 +487,60 @@ non-breaking change. Block literals are a useful way of expressing multiline string content in a program. It's useful to treat block literals and raw literals as distinct concepts: even within multiline literals, explicit encoding of tab characters, -character escapes, and so forth can be useful or undesirable. +character escapes, and so forth can be useful or undesirable. For example: + +```carbon +var String: raw_code = r"""carbon + var String: example = "hello\n\tworld"; // Contains two backslashes. 
+ """; +var String: expanded_code = """carbon + \tvar Int: n = 123; // Starts with tab and ends with newline.\n + """; +``` Further, separating the concepts permits us to disallow newlines in non-block raw string literals, which prevents one class of runaway lexing problem: the inability to find the end of a raw string literal can lead to scanning and -consuming the entirety of the source file. +consuming the entirety of the source file. It is better to ask the user to +explicitly express their intent than to assume that we're in the rare case +where a missing closing double-quote indicates a multi-line string. Removing the initial indentation from block string literals serves two primary purposes: it allows the lexer to intelligently abort and backtrack sooner if it -reaches the end of the indented region without seeing the end of string marker, -and it improves the readability of the code. +reaches the end of the indented region without seeing the end of string marker +(essentially enabling detection of runaway block string literals), and it +improves the readability of the code. + +### Underscores rationale + +Tokens beginning with an underscore are currently reserved, but are expected to +be used to represent wildcards in pattern matching. The precise rules are not +yet determined, so we are not committing to any lexical conventions for such +tokens yet. + +We could permit such tokens as identifiers for now, and claim them back as +keywords later if needed. There are some reasons not to follow that approach: + + * The semantic interpretation of such tokens may be quite different from that + of regular identifiers: they may not be subject to regular name lookup and + not required to be declared before being used, but also not part of a + predefined finite set of keywords with unique meaning. 
+ * We may want to use different lexical rules for underscores, such as treating + `_` as a standalone token (even when followed by an identifier) or as an + operator character. The evolutionary path for such a change would be + challenging. + +Underscores in identifiers are common in C++ identifiers, which motivates +permitting them in Carbon to support our C++ interoperability goal. However, +leading underscores are rare in publicly-visible C++ identifiers, and result in +reserved identifiers in many contexts, so we do not have similar motivation to +permit those. + +Leading underscores are used in some C++ code to distinguish member names from +non-member names. In Carbon, we anticipate all identifiers being declared +locally and found by a simple lexical lookup rule, so use of leading +underscores to avoid name collisions should generally be unnecessary. As such, +we lack a strong motivation to permit such identifiers. ### Keywords rationale @@ -499,13 +599,13 @@ Treating such tokens as special at the lexical level has other beneficial effect * It prohibits whitespace between the period and the identifier. This enforces an intended stylistic convention. As a postfix unary operator, we will also enforce an absence of whitespace before designators in member access syntax. - * It frees up `.` for use as a binary operator, should we so desire. * It allows extremely fast lexical name lookup for all identifiers via string interning: the same lookup that checks whether an identifier is a keyword can also perform the complete lexical lookup if it's not an identifier. * It permits uniform typo correction, including correction to keywords, for all identifiers in all contexts, because we can identify typos from the lexing stage before we even reach the parser. + * It frees up `.` for use as a binary operator, should we so desire. 
### Operators rationale @@ -538,6 +638,66 @@ Support for compound brackets requires that we make a concession: the bracket terminator characters cannot be used within prefix or postfix operators. For the two chosen symbols, this is unlikely to present a problem. +### Bracket indentation rationale + +We wish for Carbon to provide good error diagnosis and recovery, to support +tooling and analysis of incomplete source files, and to reject cases where we +can be confident that the intent of the programmer is not captured by the code. +To support these goals, we require that the indentation of code reflects the +logical structure of the code, and one of the ways we achieve this is by +requiring that the contents of a bracketed region are at least as indented as +the opening line of that bracketed region. For example, given: + +```carbon +fn f() { + if (cond) { + // ... +} + +fn g() { +``` + +we can be confident that the intention was for the closing brace on line 4 to +match the opening brace on line 1, not the opening brace on line 2. By +recognizing this early, we can produce improved diagnostics indicating that a +closing brace was missing, and tools that perform semantic analysis of +potentially-incomplete source files can recover by imagining that an additional +closing brace appeared before line 4. + +This indentation rule is insufficient to fully ensure that our interpretation +of the program matches the programmer's intent and the reader's expectations. +Generally, we wish for all continuation lines of any grammatical construct to +be at least as indented as the first line in that construct. + ## Alternatives considered -TODO: Consider alternatives +### String alternatives + +Block string literals could use explicit characters in the body to indicate the +amount of leading whitespace to be removed: + +```carbon +var String: x = """ + | starts with two spaces. 
+ """; +``` + +This would allow the correct indentation to be determined as soon as the first +line after the opening `"""` is seen. However, this adds lexical complexity, +and most of the same benefit can be derived by simply requiring the indentation +of the string literal to be greater than that of its contents. + +We could choose to include the newline before the terminating `"""` as part of +the literal contents. However, expectations for whether it should be included +vary, and appear to be somewhat evenly split between the two options. If the +terminating newline is included in the string, it would be natural to permit +the terminating `"""` to not be preceded by a newline, which removes the most +natural vehicle by which we can determine the proper indentation to remove from +each line. + +### Operators alternatives + +We could use a "max munch" rule for operators that is restricted to only +recognize a known set of Carbon operators. This would permit constructs such as +`-*p` or `**p` without additional brackets or whitespace. This would improve +the language ergonomics, but would make language evolution more difficult. From 7f1c242a5bf6a3580b1365be6aaf8ecaa957141e Mon Sep 17 00:00:00 2001 From: Richard Smith Date: Thu, 21 May 2020 16:41:28 -0700 Subject: [PATCH 06/11] New rules for block comments based on review feedback. --- docs/proposals/p0016.md | 170 ++++++++++++++++++++++++++++------------ 1 file changed, 121 insertions(+), 49 deletions(-) diff --git a/docs/proposals/p0016.md b/docs/proposals/p0016.md index 337e94dd7707..4d8763d1c231 100644 --- a/docs/proposals/p0016.md +++ b/docs/proposals/p0016.md @@ -79,46 +79,113 @@ indentation of *B* is a proper prefix of the indentation of *A*. ### Comments -A *comment* in Carbon is either: - - * A *line comment*, beginning with `//` and running to the end of the line, or - * A *block comment*, beginning with `/*` and running to the matching `*/`. 
- +A *comment* begins with the characters `//` and runs to the end of the line. Carbon has no mechanism for physical line continuation, so a `//` comment -always ends at the next vertical whitespace character. +always ends at the next vertical whitespace character. There shall be no text +other than horizontal whitespace before the `//` characters introducing a +comment. (Either all of a line is a comment, or none of it.) [[why?]](#line-continuation-rationale) -If the character after the `/*` introducing a block comment is `{`, the comment -is a *code comment*. In a code comment, the following text is tokenized until a -matching `}*/` token is formed; such a token terminates the comment. (In -particular, such a token is not recognized if it is nested within another -comment or a literal.) Otherwise, the comment ends at the first matching `*/` -character sequence. -[[why?]](#nested-comments-rationale) +#### Text comments + +If the character after the comment introducer is whitespace (or the end of the +file), the comment is a *text comment*. Text comments are treated equivalently +to whitespace. Example: ```carbon -// This is a comment. -// The characters /* introduce a block comment. -This is not a comment. -/*{ - // The characters */ end a block comment. -}*/ +// This is a comment and is ignored. \ This is not a comment. + +var Int: x; // error, trailing comments not allowed +``` + +#### Documentation comments + +If the character after the comment introducer is an exclamation mark or another +`/`, in either case followed by whitespace, the comment is a documentation +comment. Documentation comments are tokens, and are recognized by the language +grammar only in specific locations, which determine the entity to which they +attach. +[[why?]](#documentation-comments-rationale) + +Example: + +```carbon +//! This is a documentation comment. +/// So is this. +fn DocumentedFunction() {} + +var Int + //! This is an error; a documentation comment cannot appear here. 
+ : x; ``` -If the character after the comment introducer is an exclamation mark, the -comment is a documentation comment. Documentation comments are tokens, and are -recognized by the language grammar only in specific locations, which determine -the entity to which they attach. -[[why?]](#nested-comments-rationale) +**Open question:** Should we accept only one or the other kind of documentation +comment? + +#### Block comments + +If the character after the comment introducer is an open brace, the comment is +a block comment. Subsequent lines are tokenized until a comment introducer +followed by a close brace is found. The tokenization rules are adjusted as +follows: -Non-documentation comments are treated equivalently to whitespace. + * There is no requirement that brackets match. + * There is no requirement that code within brackets or text within block + string literals is properly indented. + * There is no restriction on forming [reserved tokens](#reserved-tokens); + instead, if no token can be formed, a placeholder token is formed from the + next character of the input line. + * There is no restriction on the appearance of [reserved + comments](#reserved-comments). Such comments have no effect. + * All tokens produced are discarded. -In addition to the cases above, a block comment introducer may be followed by -additional `*` characters. If the character after the comment introducer is not -one of those mentioned above, it shall be a whitespace character. +These rules apply recursively during the search for the closing comment marker; +block comments nest. +[[why?]](#block-comments-rationale) + +The opening `//{` shall be followed by whitespace. If any characters appear +between the closing `//}` and the end of its line, that sequence of characters +shall be the same as the characters between the opening `//{` and the end of +its line. 
+ +Example: + +```carbon +//{ temp +fn CommentedOutFunction() { + // It's OK to include a //} here; it's not a comment introducer so doesn't + // end the block comment. + + //{ + Nested comment. + //} + + The single quote in this line doesn't match any of our token production + rules. A placeholder token is produced for that character. The token after + the placeholder is the identifier 't'. + + // This doesn't end the comment. + var String: close_comment_marker = """ + //} + """; +} +//} + +//{ mismatched + +// This is an error due to mismatched closing text. +//} temp + +//{ +error here (not at start of line): //} +``` + +#### Reserved comments + +Comment introducers that do not have one of the above forms are invalid. [[why?]](#comment-introducers-rationale) ### Literals @@ -245,6 +312,10 @@ A *character literal* is lexically identical to a simple string literal, except that it is enclosed in single quotation marks (`'`) instead of double quotation marks (`"`). +**Open question:** Do we need both character literals and string literals? +Unicode character literals in general require more than one code unit to +represent, so are somewhat more string-like than character-like. + ### Identifiers An *identifier* is a maximal sequence of characters beginning with a character @@ -452,28 +523,12 @@ continuation. Line continuation for comments in particular is a known source of gotchas in C and C++ programs. -## Nested comments rationale - -There is a known need for nested comment syntax: it is important to be able to -comment out a block of code confident in the knowledge that all text between -the comment markers (and exactly that text) was in fact commented out. +### Documentation comments rationale -The only facility for this in C and C++ is `#if 0` ... `#endif`, and that is -what is used in practice. As we do not want Carbon to have a textual -preprocessor, enabling nesting of some other comment scheme seems natural. 
- -Conversely, it is valuable to have a comment syntax for human-readable -commentary that can be used within a line and can span multiple lines, and we -see no reason to invent something other than `/* ... */` for this purpose. We -cannot use the same notation as nested comment syntax without compromising its -usage for human-readable commentary. For example: - -```carbon -f(1.0 / x /* can't be negative */, 2); -``` - -... would not be valid as a `/*{ ... }*/` comment, because it would contain an -unterminated character literal. +Carbon code is expected to include documentation comments. Specifying such +comments as part of the language definition allows a consistent interpretation +of the comments, and a consistent attachment of the comments to entities, to be +provided for Carbon programs. ### Comment introducers rationale @@ -482,6 +537,23 @@ future. Reserving syntactic space in comment syntax, in a way that is easy for programs to avoid, allows us to add such additional kinds of comment as a non-breaking change. +### Block comments rationale + +It is important to be able to comment out a block of code confident in the +knowledge that all text between the comment markers (and exactly that text) was +in fact commented out. This leads to the following requirements: + + * Block comments must nest. + * Closing comment markers in string literals and in any kind of nested comment + do not close the outer comment. + * Opening comment markers in string literals and in line comments do not + introduce an additional unintended level of commenting. + +These requirements force us to apply our lexical rules to the contents of block +comments. However, we would like to accept malformed and partially-formed code +within a block comment, so we relax the restrictions that we can reasonably +relax when handling them.
+ ### Block strings rationale Block literals are a useful way of expressing multiline string content in a From 29c5e4acf2e338c5523fad5297f2923499efaef8 Mon Sep 17 00:00:00 2001 From: Richard Smith Date: Thu, 21 May 2020 18:05:16 -0700 Subject: [PATCH 07/11] Add open question from review feedback. --- docs/proposals/p0016.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/docs/proposals/p0016.md b/docs/proposals/p0016.md index 4d8763d1c231..308542efb712 100644 --- a/docs/proposals/p0016.md +++ b/docs/proposals/p0016.md @@ -183,6 +183,9 @@ fn CommentedOutFunction() { error here (not at start of line): //} ``` +**Open question:** Can we remove block comments, and require potential users of +them to add `//` to all affected lines instead? + #### Reserved comments Comment introducers that do not have one of the above forms are invalid. From b94320e82c44c36a0b6702f75ea0843e9f518ea2 Mon Sep 17 00:00:00 2001 From: Richard Smith Date: Thu, 21 May 2020 18:07:16 -0700 Subject: [PATCH 08/11] Apply suggestions from code review Co-authored-by: austern Co-authored-by: Jon Meow <46229924+jonmeow@users.noreply.github.com> --- docs/proposals/p0016.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/docs/proposals/p0016.md b/docs/proposals/p0016.md index 308542efb712..b160f0ca5521 100644 --- a/docs/proposals/p0016.md +++ b/docs/proposals/p0016.md @@ -74,7 +74,7 @@ contains two lines, and "foo\n\rbar" contains three, of which the middle line is empty. The *indentation* of a line is the sequence of horizontal whitespace characters -at the start of the line. A line *A* has more indentation than a line *B* the +at the start of the line. A line *A* has more indentation than a line *B* if the indentation of *B* is a proper prefix of the indentation of *A*. ### Comments @@ -278,10 +278,10 @@ For example: ```carbon fn f() { - """ + var String: w = """ This is a string literal. Its first character is 'T' and its last character is '.'. 
It contains one embedded newline, between 'character' and 'is'. - """ + """; var String: x = r#""" This is the content of the string. The 'T' is the first character of the string. @@ -296,6 +296,7 @@ fn f() { is a block string literal with file type 'This' and first character 'i'. """#; + // This is a block string literal. Its first two characters are spaces, and its last character is '.' It has a file type of 'c++'. var String: starts_with_whitespace = """c++ int x = 1; // This line starts with two spaces. int y = 2; // This line starts with two spaces. From cd7335ffac737e513b0f694ebc00fa7f41d44100 Mon Sep 17 00:00:00 2001 From: Richard Smith Date: Thu, 11 Jun 2020 23:53:07 -0700 Subject: [PATCH 09/11] Updates from review feedback. --- docs/proposals/p0016.md | 507 +++++++++++++++++++++++++++------------- 1 file changed, 344 insertions(+), 163 deletions(-) diff --git a/docs/proposals/p0016.md b/docs/proposals/p0016.md index b160f0ca5521..8a4e6198cc28 100644 --- a/docs/proposals/p0016.md +++ b/docs/proposals/p0016.md @@ -22,26 +22,44 @@ Carbon source file: interpreting the contents of the file and forming ## Proposal Carbon source files are [UTF-8](#file-contents-and-encoding) text files whose -contents are divided into [whitespace](#whitespace), [comments](#comments), -[literals](#literals), [identifiers](#identifiers) (including -[keywords](#keywords)), [designators](#designators), [operators](#operators), -and [brackets](#brackets), as described below. +contents are divided into *lexical elements*: [whitespace](#whitespace), +[comments](#comments), [literals](#literals), [words](#words) +(including [identifiers](#identifiers) and [keywords](#keywords)), +[designators](#designators), [operators](#operators), and +[brackets](#brackets), as described in the [lexing](#lexing) section below. 
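As an illustration only (not part of the patch), the indentation ordering that the Whitespace section of this proposal defines can be sketched in Python. The ordering is a prefix relation on the leading horizontal whitespace, so it is partial: two lines that mix tabs and spaces differently are simply incomparable. The helper names below are hypothetical, and reading "at least as indented" as "the other indentation is a (not necessarily proper) prefix" is an assumption about the intended meaning.

```python
# Horizontal whitespace per the proposal: space, horizontal tab,
# LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK.
HORIZONTAL_WHITESPACE = frozenset(" \t\u200e\u200f")


def indentation(line: str) -> str:
    """Return the run of horizontal whitespace at the start of `line`."""
    end = 0
    while end < len(line) and line[end] in HORIZONTAL_WHITESPACE:
        end += 1
    return line[:end]


def more_indented(a: str, b: str) -> bool:
    """True if B's indentation is a proper prefix of A's indentation."""
    ind_a, ind_b = indentation(a), indentation(b)
    return len(ind_b) < len(ind_a) and ind_a.startswith(ind_b)


def at_least_as_indented(a: str, b: str) -> bool:
    """Assumed reading: B's indentation is a (possibly equal) prefix of A's."""
    return indentation(a).startswith(indentation(b))
```

With this reading, a line indented with two tabs and a line indented with a tab followed by a space are each rejected by `more_indented` in both directions, which is why the block string and bracket rules can reject inconsistently indented input rather than guess at a tab width.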
-## Details +Lexical elements are formed by a single left-to-right scan of the source file, +using a "max munch" rule: the longest possible next lexical element is formed +at each step. + +After division into these components, whitespace and text and block comments +are discarded, and words are classified as either identifiers or keywords; the +remaining lexical elements are [tokens](#tokens), and the result is the +*tokenized* form of the source file, which is the input to the parsing step. + +## Lexing ### File contents and encoding Carbon source files are Unicode text files encoded in UTF-8. An initial UTF-8 -BOM is permitted and ignored. All contents outside of [comments](#comments) and -[literals](#literals) shall be in Normalization Form C. +BOM is permitted and ignored. [[why?]](#encoding-rationale) +For implementation simplicity, all contents outside of [comments](#comments) +and [literals](#literals) shall be in Normalization Form C. The Carbon formatter +tool will convert source text to NFC as necessary to satisfy this constraint. +[[why?]](#formatting-rationale) + Carbon is currently based on Unicode 13.0, and will adopt new Unicode versions as they are published. +**Open question:** Should we require source text to be in NFC, as C++ plans to +do (and GCC currently enforces with a warning), or should we normalize non-NFC +identifiers ourselves? + ### Whitespace -Characters are identified as whitespace if they have the Unicode +Characters are identified as whitespace if and only if they have the Unicode `Pattern_White_Space` property. These include the ASCII whitespace characters (recognized in C++): * Space and horizontal tab @@ -56,9 +74,9 @@ As of Unicode version 13, 5 additional characters are included: * U+2028 LINE SEPARATOR * U+2029 PARAGRAPH SEPARATOR -Space, horizontal tab, and the LTR and RTL marks are *horizontal whitespace* -characters. All other whitespace characters are *vertical whitespace* -characters.
+Space, horizontal tab, and the LTR and RTL marks (see +[directionality](#directionality)) are *horizontal whitespace* characters. All +other whitespace characters are *vertical whitespace* characters. Characters with the Unicode property `White_Space` but not `Pattern_White_Space` are invalid outside comments and literals. Code @@ -75,22 +93,41 @@ is empty. The *indentation* of a line is the sequence of horizontal whitespace characters at the start of the line. A line *A* has more indentation than a line *B* if the -indentation of *B* is a proper prefix of the indentation of *A*. +indentation of *B* is a proper prefix of the indentation of *A*. (Note that +neither of "\t\t" and "\t " is considered more indented than the other.) + +**Open question:** Should we be more opinionated on whitespace? We could +potentially disallow everything other than space and newline (and, depending +on what we decide for [directionality](#directionality), perhaps LTR marks), +which would lead to a substantially simpler indentation rule. ### Comments A *comment* begins with the characters `//` and runs to the end of the line. -Carbon has no mechanism for physical line continuation, so a `//` comment -always ends at the next vertical whitespace character. There shall be no text -other than horizontal whitespace before the `//` characters introducing a -comment. (Either all of a line is a comment, or none of it.) +Carbon has no mechanism for physical line continuation, so a trailing `\` does +not extend a comment to subsequent lines. There shall be no text other than +horizontal whitespace before the `//` characters introducing a comment. (Either +all of a line is a comment, or none of it.) 
[[why?]](#line-continuation-rationale) +The *kind* of a comment is determined by the character(s) after the `//` +characters as follows: + + * whitespace: the comment is a [text comment](#text-comments) + * `/` or `!` followed by whitespace: the comment is a [documentation + comment](#documentation-comments) + * `\{` or `\}`: the comment is an opening or closing [block + comment](#block-comments) delimiter, respectively + * anything else: the input is invalid + +For the purpose of the above rule, the end of the file is considered to be +whitespace. The `//` characters followed by the above additional characters +form the *comment introducer*. + #### Text comments -If the character after the comment introducer is whitespace (or the end of the -file), the comment is a *text comment*. Text comments are treated equivalently -to whitespace. +A *text comment* is a comment introduced by `//` followed by whitespace. Text +comments do not result in tokens. Example: @@ -103,11 +140,10 @@ var Int: x; // error, trailing comments not allowed #### Documentation comments -If the character after the comment introducer is an exclamation mark or another -`/`, in either case followed by whitespace, the comment is a documentation -comment. Documentation comments are tokens, and are recognized by the language -grammar only in specific locations, which determine the entity to which they -attach. +A *documentation comment* is a comment introduced by `//` followed by either an +exclamation mark or another `/`, in either case followed by whitespace. +Documentation comments are tokens, and are recognized by the language grammar +only in specific locations, which determine the entity to which they attach. [[why?]](#documentation-comments-rationale) Example: @@ -125,67 +161,61 @@ var Int **Open question:** Should we accept only one or the other kind of documentation comment? 
+ - In favor of `///`: it is easier to type, and likely to be more comfortable to + read, especially for larger comment blocks. + - In favor of `//!`: it is less likely to be confused with `//`; writing `//` + where `///` is intended is a common error in C++ code using Doxygen. + #### Block comments -If the character after the comment introducer is an open brace, the comment is -a block comment. Subsequent lines are tokenized until a comment introducer -followed by a close brace is found. The tokenization rules are adjusted as -follows: - - * There is no requirement that brackets match. - * There is no requirement that code within brackets or text within block - string literals is properly indented. - * There is no restriction on forming [reserved tokens](#reserved-tokens); - instead, if no token can be formed, a placeholder token is formed from the - next character of the input line. - * There is no restriction on the appearance of [reserved - comments](#reserved-comments). Such comments have no effect. - * All tokens produced are discarded. - -These rules apply recursively during the search for the closing comment marker; -block comments nest. +An *opening block comment line* is a line starting with `//\{`, with no +indentation. A *closing block comment line* is a line starting with `//\}`, +with no indentation. A comment starting with `//\{` or `//\}` shall form an +opening or closing block comment line. Opening and closing block comment lines +can only appear as part of block comments. (In particular, these lines cannot +appear within block string literals.) + +A *block comment* is a comment that starts with an opening block comment line and +ends with a closing block comment line. Block comments nest: every line for +which the total number of preceding opening block comment lines is greater than +the total number of preceding closing block comment lines is part of a block +comment.
[[why?]](#block-comments-rationale) +[[alternatives]](#block-comment-alternatives) -The opening `//{` shall be followed by whitespace. If any characters appear -between the closing `//}` and the end of its line, that sequence of characters -shall be the same as the characters between the opening `//{` and the end of -its line. +If any characters appear between the closing `//\}` and the end of its line, +that sequence of characters shall be the same as the characters between the +opening `//\{` and the end of its line. Example: ```carbon -//{ temp +//\{ temp fn CommentedOutFunction() { - // It's OK to include a //} here; it's not a comment introducer so doesn't - // end the block comment. + // It's OK to include a //\} in the middle of this comment; it's not a + // comment introducer so doesn't end the block comment. - //{ - Nested comment. - //} + //\} is not a closing block comment line, so doesn't end the comment. - The single quote in this line doesn't match any of our token production - rules. A placeholder token is produced for that character. The token after - the placeholder is the identifier 't'. +//\{ + Nested comment. +//\} - // This doesn't end the comment. - var String: close_comment_marker = """ - //} + var String: closing_comment_marker = r""" + //\} """; } -//} +//\} -//{ mismatched +//\{ mismatched // This is an error due to mismatched closing text. -//} temp +//\} temp -//{ -error here (not at start of line): //} +// The next line is an error because the //\{ is not at the start of the line. + //\{ ``` -**Open question:** Can we remove block comments, and require potential users of -them to add `//` to all affected lines instead? - #### Reserved comments Comment introducers that do not have one of the above forms are invalid. @@ -200,8 +230,8 @@ predeclared identifiers.) A *literal* is a numeric literal, a character literal, or a string literal, as defined below. 
-A literal shall not be immediately followed an [identifier continuation -character](#identifiers). Carbon has no literal suffixes, but the corresponding +A literal shall not be immediately followed by a [word continuation +character](#words). Carbon has no literal suffixes, but the corresponding lexical space is reserved for future extensions. #### Numbers @@ -213,28 +243,31 @@ Integers in other bases are written as a `0` followed by a base specifier character, followed by a sequence of digits in the corresponding base. The available base specifiers and corresponding bases are: -| Base specifier | Base | Digits | -| -------------- | ---- | ------------------------------------- | -| `b` or `B` | 2 | `0` and `1` | -| `o` | 8 | `0` ... `7` | -| `x` or `X` | 16 | `0` ... `9`, `a` ... `f`, `A` ... `F` | +| Base specifier | Base | Digits | +| -------------- | ---- | ------------------------ | +| `b` | 2 | `0` and `1` | +| `o` | 8 | `0` ... `7` | +| `x` | 16 | `0` ... `9`, `A` ... `F` | -[TODO: This doesn't belong here.] There are no size suffixes. Each literal has -a unique type that can be converted to any sufficiently-large integer type, but -operations on it are always exact. +Note that the above table is case-sensitive. `0O123` is invalid, as is `0Xa`. +[[why?]](#integers-rationale) Real numbers are written as a sequence of one or more decimal digits followed by a decimal point followed by a sequence of one or more decimal digits. +[[why?]](#real-numbers-rationale) A real number can be followed by an `e`, an optional `+` or `-` (defaulting to -`+`), and a decimal integer *N*; the effect is to multiply the given value by +`+`), and a character sequence matching the grammar of a decimal integer with +some value *N*; the effect is to multiply the given value by 10<sup>±*N*</sup>. A *numeric literal* is an integer or real number expressed as described above. -#### Strings +**Open question:** Should we allow digit separators? With what lexical syntax?
+Should we require them to be evenly spaced? Spaced "naturally" (groups of 3 for +decimal, some power of 2 for octal and binary)? -[[alternatives]](#string-alternatives) +#### Strings A *simple string literal* is formed of a sequence of @@ -245,7 +278,8 @@ A *simple string literal* is formed of a sequence of enclosed in double quotation marks (`"`). Each escape sequence is replaced with the corresponding character sequence or encoding. -TODO: Table of escape sequences. +TODO: Add a table of escape sequences. `\{` and `\}` are invalid, in order to +avoid ambiguities between (non-raw) block string literals and block comments. A *raw string literal* starts with an `r` followed by *N* `#` characters followed by a double quotation mark, and ends with the first following @@ -265,9 +299,10 @@ indentation of the closing line from each (non-empty) content line, and concatenating the results with a line feed character added between each pair of lines. [[why?]](#block-strings-rationale) +[[alternatives]](#string-alternatives) -A *file type indicator* is a sequence of characters that are either [identifier -continuation characters](#identifiers) or [operator characters](#operators). +A *file type indicator* is a sequence of characters that are either [word +continuation characters](#words) or [operator characters](#operators). A *raw block string literal* is expressed analogously to a raw string literal, but for a block string literal. Escape sequences are ignored, but indentation @@ -290,8 +325,10 @@ fn f() { var String: y = r"Hello\"; // OK, final character is \ var String: z = r##"Raw strings r#"nesting"#"##; - // This string starts and ends with two "s. + // The contents of this string start and end with exactly two "s. var String: ambig1 = r#"""This is a raw string literal starting with """#; + // This string is a block raw string literal with file-type 'This', + // whose contents start with "is a ". 
var String: ambig2 = r#"""This is a block string literal with file type 'This' and first character 'i'. """#; @@ -310,6 +347,15 @@ error: insufficiently indented. } ``` +A raw block string literal is required to have non-empty indentation to avoid +ambiguity with block comments. +[[why?]](#block-comments-rationale) + +**Open question:** Should we only require raw string literals containing an +opening or closing block comment line to be indented? (An equivalent but +perhaps simpler formulation of the alternative rule: opening and closing block +comment lines are disallowed in block string.) + #### Characters A *character literal* is lexically identical to a simple string literal, except @@ -320,38 +366,22 @@ marks (`"`). Unicode character literals in general require more than one code unit to represent, so are somewhat more string-like than character-like. -### Identifiers +### Words -An *identifier* is a maximal sequence of characters beginning with a character -with Unicode property `XID_Start`, followed by zero or more *identifier -continuation characters*, which are characters that either have property -`XID_Continue` or are underscores (`_`). +A *word* is a maximal sequence of characters beginning with a character +with Unicode property `XID_Start`, followed by zero or more *word continuation +characters*, which are characters that either have property `XID_Continue` or +are underscores (`_`). A [raw identifier](#raw-identifiers), described below, +is also lexically a word. Notably, `XID_Start` does not include the underscore character. Tokens beginning with an underscore are [reserved](#reserved-tokens). [[why?]](#underscores-rationale) -Additionally, a *raw identifier* can be specified by prefixing an identifier -with `r#`, such as `r#requires`. Raw identifiers can be used to introduce and -use names that are lexically identical to keywords. 
-[[why?]](#keywords-rationale) - -All identifier tokens in all contexts are looked up using the same lexical -scoping rule. - -An identifier shall not be immediately followed by a `"` or `'`. - -#### Keywords - -A *keyword* is an identifier with predefined meaning. Carbon has a predefined -set of keywords, that will be specified separately as part of the syntax rules. - -An identifier that is a keyword may also be declared explicitly in a source -file. The same identifier shall not be used as both a keyword and as a non-raw -non-keyword identifier in a single source file. As a consequence of these -rules, from a lexical standpoint there is no notion of keywords -- whether a -given identifier is a keyword depends on the syntactic structure of the source -file. +A word is interpreted as either a keyword or an identifier. If a word is ever +declared within a source file, then it is interpreted as an identifier +throughout that source file; otherwise, it is interpreted as a keyword +throughout that source file. [[why?]](#keywords-rationale) Example: @@ -362,11 +392,40 @@ fn f() {} // error, 'fn' is not a keyword in this source file interface var {} // error, already used 'var' as a keyword in this source file ``` +A word shall not be immediately followed by a `"` or `'`. + +#### Identifiers + +Identifier tokens can appear in two different contexts: they either declare the +identifier, binding it to an entity, or they reference an entity that has +already been declared. Carbon's grammatical rules will make it straightforward +to locally distinguish between these two cases. + +All identifier tokens in all contexts that are referencing a prior declaration +are looked up using the same lexical scoping rule. + +#### Raw identifiers + +A *raw identifier* can be specified by prefixing a word with `r#`, such as +`r#requires`. Raw identifiers can be used to introduce and use names that are +lexically identical to keywords. 
The declaration of a raw identifier does not +prevent the base word from being interpreted as a keyword; otherwise, they +behave identically to the word formed by removing the `r#` prefix. +[[why?]](#keywords-rationale) + +#### Keywords + +A *keyword* is a word with predefined meaning. Carbon has a predefined set of +keywords, that will be specified separately as part of the syntax rules, and +that is expected to grow over time. + +We intend to restrict keywords to the characters `a` ... `z` and `_`. + ### Designators -A *designator* is a token formed by prefixing an identifier with a period -character, such as `.member`. The identifier after the period is the *member -name*, and is looked up in a context-dependent manner. +A *designator* is a token formed by prefixing a word with a period character, +such as `.member`. The identifier after the period is the *member name*, and is +looked up in a context-dependent manner. [[why?]](#designators-rationale) ### Operators @@ -376,6 +435,8 @@ An *operator* is a maximal sequence of characters with Unicode property `Pe` (for which, see [brackets](#brackets)), which we will refer to as *operator characters*. [[why?]](#operators-rationale) +[[alternatives]](#operators-alternatives) + We do not intend to define any operators containing non-ASCII characters. The ASCII operator characters are: @@ -389,35 +450,44 @@ operator characters, 400 digraphs, and so on. Bracket operators, described below, are also operators. +**Open question:** Instead of the "max munch" rule described here, should we +only lex operators that actually exist? For example, this would mean that `**p` +is lexed as three tokens (`*`, `*`, `p`) rather than two (nonexistent `**`, +`p`). + ### Brackets A *simple open bracket* is a character with Unicode property `Pattern_Syntax` -and character class `Ps`, such as `(` or `[`. +and character class `Ps`. We intend to restrict Carbon syntax to ASCII, leaving three such characters: `(`, `[`, and `{`. 
A *simple close bracket* is a character -with Unicode property `Pattern_Syntax` and character class `Pe`, such as `}`. -A *bracket terminator character* is one of `|` or `:`. +with Unicode property `Pattern_Syntax` and character class `Pe`. There are +three such characters in ASCII: `)`, `]`, and `}`. +A *bracket terminator character* is one of `|` or `:`. A *bracket continuation character* is an operator character that is not a -bracket terminator character. +bracket terminator character. Restricted to ASCII, that is one of: +``` +! # % & * + - . / ; < = > ? @ \ ^ ~ +``` A *compound open bracket* is a simple open bracket followed by zero or more bracket continuation characters followed by a bracket terminator character, -such as `[:`. +such as `[:`. A *compound close bracket* is a bracket terminator character followed by zero or more bracket continuation characters followed by a simple close bracket, -such as `|=)`. +such as `|=)`. [[why?]](#compound-brackets-rationale) An *open bracket* is either a simple open bracket or a compound open bracket. A *close bracket* is either a simple close bracket or a compound close bracket. The close bracket matching an open bracket is formed by reversing the character -sequence in the open bracket and replacing each caracter with class `Ps` with +sequence in the open bracket and replacing each character with class `Ps` with the corresponding character with class `Pe`. Every open bracket is required to have a matching close bracket such that the bracketed regions form a tree structure. -We do not intend to include any non-ASCII characters as part of Carbon's -syntax. This leaves 3 single-character brackets (`()`, `{}`, `[]`), 6 bracket +Because we do not intend to include any non-ASCII characters as part of Carbon's +syntax, there are 3 single-character brackets (`()`, `{}`, `[]`), 6 bracket digraphs (`(| |)`, `{| |}`, `[| |]`, `(: :)`, `{: :}`, `[: :]`), 108 bracket trigraphs, and so on. 
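The bracket rules above can be sketched in a few lines of Python. This is only an illustration restricted to ASCII, not part of the proposal: the helper names `is_open_bracket`, `matching_close_bracket`, and `count_open_brackets` are hypothetical. It forms a close bracket by reversing the open bracket and swapping each `Ps` character for its `Pe` counterpart, and recomputes the bracket counts quoted above.

```python
# Illustrative sketch of the ASCII-only bracket rules; all names are hypothetical.
OPEN_TO_CLOSE = {"(": ")", "[": "]", "{": "}"}
TERMINATORS = "|:"
# ASCII bracket continuation characters, as listed in the text above.
CONTINUATIONS = "!#%&*+-./;<=>?@\\^~"


def is_open_bracket(text):
    """True for a simple open bracket or a compound open bracket."""
    if not text or text[0] not in OPEN_TO_CLOSE:
        return False
    if len(text) == 1:
        return True
    # Compound: open character, continuation characters, then a terminator.
    return text[-1] in TERMINATORS and all(c in CONTINUATIONS for c in text[1:-1])


def matching_close_bracket(open_bracket):
    """Reverse the characters and swap each open character for its close."""
    if not is_open_bracket(open_bracket):
        raise ValueError(f"not an open bracket: {open_bracket!r}")
    return "".join(OPEN_TO_CLOSE.get(c, c) for c in reversed(open_bracket))


def count_open_brackets(length):
    """Number of distinct ASCII open brackets with the given length."""
    if length == 1:
        return len(OPEN_TO_CLOSE)
    return len(OPEN_TO_CLOSE) * len(CONTINUATIONS) ** (length - 2) * len(TERMINATORS)
```

For instance, `matching_close_bracket("[:")` gives `":]"`, and `count_open_brackets(3)` is 3 × 18 × 2 = 108, which agrees with the trigraph count stated in the text.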
@@ -440,18 +510,10 @@ bracket followed by one or more operator characters followed by a matching simple close bracket, such as `[~>]` or `(*)`. Bracket operators are operators, not brackets. -### Tokens - -A *token* is a documentation comment, a literal, an identifier (which might be -a keyword), a designator, an operator, or a bracket. Tokens are formed by a -single left-to-right scan of the source file, using a "max munch" rule: the -longest possible next token is formed at each step, after skipping whitespace -and comments. +### Reserved lexical elements -#### Reserved tokens - -It is an error if token formation (after skipping whitespace and comments) is -attempted in the following circumstances: +It is an error if an attempt is made to form a lexical element in the following +circumstances: * When the first character does not have property `XID_Continue` or `Pattern_Syntax`. @@ -460,21 +522,26 @@ attempted in the following circumstances: * When the first two characters are `r#` and neither a raw identifier nor a raw string literal would be formed. +### Tokens + +A *token* is a documentation comment, a literal, a keyword, an identifier, a +designator, an operator, or a bracket. + ## Rationale ### Encoding rationale -We intend to follow the Unicode Consortium's recommentations for identifiers in -programming languages as described in Unicode 13.0.0 -[UAX#31](https://unicode.org/reports/tr31/) Revision 33. We do not see a reason -to be inventive in this regard, and delegating the complex considerations over -how Unicode characters should be used to a group with greater expertise in that -area seems appropriate. +We intend for words in Carbon to follow the Unicode Consortium's +recommendations for identifiers in programming languages as described in +Unicode 13.0.0 [UAX#31](https://unicode.org/reports/tr31/) Revision 33.
We do +not see a reason to be inventive in this regard, and delegating the complex +considerations over how Unicode characters should be used to a group with +greater expertise in that area seems appropriate. As an exception, Carbon permits underscore as a continuation character in -identifiers. Usage of this character is sufficiently common in C++ identifiers -that excluding it conflicts with our interoperability goal. However, leading -underscores are not permitted in Carbon identifiers. +words. Usage of this character is sufficiently common in C++ identifiers that +excluding it conflicts with our interoperability goal. However, leading +underscores are not permitted in Carbon words. [[why?]](#underscores-rationale) We observe UAX#31's requirements as follows: @@ -503,6 +570,24 @@ We observe UAX#31's requirements as follows: * UAX31-R8: requirement not applicable. Carbon does not have hashtag identifiers. +### Formatting rationale + +We expect Carbon to ship with a code auto-formatter that is used routinely as +part of all Carbon code development. There is a very low burden on the +programmer from requiring that source code be in compliance with formatting +decisions made by the formatter: at worst, we'd expect them to see a diagnostic +instructing them to run `carbon-format`, but in most cases this should happen +before the code gets to the compiler (perhaps as an on-save hook in their +editor, and/or bound to a keyboard shortcut used while editing code). + +We can realize useful benefits by relying on code being properly-formatted, if +"formatting" is interpreted suitably generally. For example, we can ensure that +the code's appearance matches its meaning in many cases (avoiding both +deliberate and accidental problems) by ensuring that Unicode left-to-right marks +are used where necessary, that identifiers are properly normalized, and so on, +and we can simplify our implementation somewhat by only permitting input in a +single Unicode normalization form. 
+ ### Line continuation rationale Line continuation in C++ is sometimes necessary in order to combine the needs @@ -541,22 +626,93 @@ future. Reserving syntactic space in comment syntax, in a way that is easy for programs to avoid, allows us to add such additional kinds of comment as a non-breaking change. -## Block comments rationale +### Block comments rationale -It is important to be able to comment out a block of code confident in the +It is important to be able to comment out a block of Carbon code confident in the knowledge that all text between the comment markers (and exactly that text) was in fact commented out. This leads to the following requirements: * Block comments must nest. - * Closing comment markers in string literals and in any kind of nested comment - do not close the outer comment. + * Closing comment markers in string literals and in any kind of nested + comment do not close the outer comment. * Opening comment markers in string literals and in line comments do not introduce an additional unintended level of commenting. -These requirements force us to apply our lexical rules to the contents of block -comments. However, we would like to accept malformed and partially-formed code -within a block comment, so we relax the restrictions that we can reasonably -relax when handling them. +In addition, block comment syntax should not require lexing the contents of the +comment. Therefore we need to disallow the block comment closing syntax from +appearing in other tokens (in particular, in block string literals). There are +at least two reasonable ways to do this: + + * Require block comment opening and closing lines to be unindented and require + string literals to be indented. + * Pick a syntax for block comment opening and closing lines that cannot appear + in string literals. 
+ +Our chosen approach combines these alternatives: the `\{` and `\}` in a block +comment are both invalid in non-raw string literals (they can be expressed as +`\\{` or `\\}` if desired). However, we cannot disallow these comment markers +in raw string literals without harming their ability to represent arbitrary +text. Therefore we require all raw string literals to be indented at least one +space. + +### Integers rationale + +Carbon requires a base specifier for octal numbers. A common error in C +and C++ code is using a leading 0 to align numbers horizontally: + +``` +int vals[] = { + 1234, + 0567, + 8912 +}; +``` + +But this leads to the `0567` being interpreted as an octal number. We +could treat numbers with a leading `0` as decimal always, but this also +risks confusing programmers who are familiar with the C and C++ rules. +Therefore we require a base specifier for octal, and reject any number that +starts with a leading `0` but no base specifier (other than `0` itself) +to keep the rule simple. + +The base specifier (if present) is required to be written in lowercase. +For binary, this avoids a visual confusion between `08` and `0B`. For +octal, it avoids a visual confusion between `0O` and `00`. And in +general, being opinionated here has very little cost and removes a +possible style argument. We require the hexadecimal digits in an +`0x`-prefixed number to be written in uppercase to keep them visually +distinct from the prefix, in the case where `A` ... `F` would follow +the prefix (it is easier to visually separate the digits from the base +specifier in `0xAB23` than in `0xab23`). + +### Real numbers rationale + +Real numbers in Carbon always require a decimal point, and require at +least one digit on each side of the decimal point. + +In C and C++, the decimal point may be omitted in a number with an +exponent, such as `1e6`, but a common source of errors is imagining +that this syntax produces an integer literal rather than a +floating-point literal.
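Taken together with the integer rules, the literal forms discussed here can be summarised as a pair of regular expressions. The following Python sketch is illustrative only: the names `DECIMAL`, `INTEGER`, `REAL`, and `classify_numeric_literal` are assumptions, the decimal-integer grammar is inferred from the leading-`0` rule in the integers rationale, and the sketch ignores the restriction on what may follow a literal.

```python
import re

# Assumed decimal-integer grammar: `0` itself, or digits with no leading `0`.
DECIMAL = r"(?:0|[1-9][0-9]*)"
# Lowercase base specifiers only; hexadecimal digits must be uppercase.
INTEGER = re.compile(rf"{DECIMAL}|0b[01]+|0o[0-7]+|0x[0-9A-F]+")
# At least one digit on each side of the point; optional lowercase exponent.
REAL = re.compile(rf"[0-9]+\.[0-9]+(?:e[+-]?{DECIMAL})?")


def classify_numeric_literal(text):
    """Return "real", "integer", or None if `text` is not a numeric literal."""
    if REAL.fullmatch(text):
        return "real"
    if INTEGER.fullmatch(text):
        return "integer"
    return None
```

Under these assumptions, `0567`, `0Xa`, `0xab23`, and `1e6` are all rejected, while `0xAB23` is an integer and `1.5e-3` is a real number.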
+ +Requiring a digit on both sides of the decimal point improves +readability and avoids style arguments. In addition, disallowing a +literal from beginning with a period followed by a digit frees up `.0` +for future use as a designator for tuple indexing, and similarly +`4.ToString()` unambiguously lexes as an integer followed by a +designator followed by two parentheses. + +This rationale assumes that we will permit the initialization of a +floating-point variable with an integer literal. If we choose to +disallow that, concerns have been raised that permitting `1.` instead +of `1.0` may be desirable for ergonomic reasons. + +See also the section on [floating-point +literals](https://google.github.io/styleguide/cppguide.html#Floating_Literals) +in the Google style guide, which argues for the same rule. + +As with base specifiers for integers, the `e` introducing an exponent +is required to be lowercase to improve readability. ### Block strings rationale @@ -606,11 +762,11 @@ keywords later if needed. There are some reasons not to follow that approach: operator character. The evolutionary path for such a change would be challenging. -Underscores in identifiers are common in C++ identifiers, which motivates -permitting them in Carbon to support our C++ interoperability goal. However, -leading underscores are rare in publicly-visible C++ identifiers, and result in -reserved identifiers in many contexts, so we do not have similar motivation to -permit those. +Underscores are common in C++ identifiers, which motivates permitting them in +Carbon to support our C++ interoperability goal. However, leading underscores +are rare in publicly-visible C++ identifiers, and result in reserved +identifiers in many contexts, so we do not have similar motivation to permit +those. Leading underscores are used in some C++ code to distinguish member names from non-member names.
In Carbon, we anticipate all identifiers being declared @@ -630,18 +786,20 @@ automatable migration cost on the code that intends to use the new feature. The proposed approach to keywords intends to support such a migration story. Adding new keywords to Carbon is a non-breaking change. Because every identifier is locally declared using obvious syntax before it is used, it is -straightforward to detect, using simple rules, whether a particular identifier -is a keyword or not in a particular source file. +straightforward to detect, using simple rules, whether a particular word is a +keyword or not in a particular source file. Using a new keyword in an existing source file requires first replacing all -existing uses of that identifier with raw identifiers throughout the source -file, which is a mechanical, automatable change. +existing uses of that word with raw identifiers throughout the source file, +which is a mechanical, automatable change. For identifiers whose scopes are constrained to a single source file, raw identifiers are not necessary to permit such a transition. However, for -identifiers that are declared in one source file and consumed in another, we +identifiers that are declared in one source file and redeclared in another, we still need a mechanism to continue declaring a name as an identifier after it -has been claimed as a keyword. +has been claimed as a keyword. (Use of an identifier from a different source +file, or at the very least from a different package, is expected to typically +require use of a designator rather than a word.) Note that while this means that adding a new keyword is cheap in terms of migration cost, we should still think of adding a keyword as being a @@ -649,10 +807,18 @@ significant undertaking, as each keyword will occupy space in the mind of the Carbon programmer. However, we should not feel any pressure to reuse the same keyword for distinct purposes. 
-This approach brings one important restriction: in any syntax that introduces -an identifier, there should never be an optional keyword preceding the -identifier, and nor should the identifier be optional if it can be followed by -a keyword. +This approach brings one important restriction: in any syntax that declares +an identifier, it should always be straightforward to determine the identifier +that is being introduced, even if it is lexically identical to a keyword. In +particular, there should never be an optional keyword preceding the identifier, +and nor should the identifier be optional if it can be followed by a keyword. + +It should be noted that this approach also introduces a novel risk of +underhanded code that appears to mean one thing but means a different thing, by +shadowing a keyword with an identifier. This risk is discussed in [Initial +Analysis of Underhanded Source Code (Wheeler +2020)](https://www.ida.org/-/media/feature/publications/i/in/initial-analysis-of-underhanded-source-code/d-13166.ashx) +(page 4-2). ### Designators rationale @@ -747,6 +913,21 @@ be at least as indented as the first line in that construct. ## Alternatives considered +### Block comment alternatives + +We considered various different options for block comments. Our primary goal +was to permit commenting out a large body of Carbon code, which may or may not +be well-formed. Alternatives considered included: + + * Fully line-oriented block comments, which would remove lines without regard + for whether they are nested within a string literal, with the novel feature + of allowing some of the contents of a block string literal to be commented + out. + * Fully lexed block comments, in which a token sequence between the opening + and closing comment marker is produced and discarded, with the lexing rules + relaxed somewhat to avoid rejecting ill-formed code. This would be analogous + to C and C++'s `#if 0` ... `#endif`.
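For comparison with these alternatives, the rule this proposal adopts (unindented `//\{` and `//\}` lines that are counted rather than lexed) can be sketched as below. This is a hedged Python illustration, not the proposal's implementation: the function name `block_comment_lines` is hypothetical, lines are split on `\n` only, and neither an unterminated block comment nor an indented delimiter outside a comment is diagnosed.

```python
def block_comment_lines(source):
    """Return the 0-based indices of lines that belong to a block comment."""
    commented = set()
    open_suffixes = []  # text following each still-unmatched `//\{`
    for index, line in enumerate(source.split("\n")):
        if line.startswith("//\\{"):
            # Unindented opening line: start (or nest) a block comment.
            open_suffixes.append(line[4:])
            commented.add(index)
        elif line.startswith("//\\}"):
            # Unindented closing line: must close the innermost open comment.
            if not open_suffixes:
                raise ValueError(f"line {index + 1}: unmatched closing line")
            expected = open_suffixes.pop()
            suffix = line[4:]
            # Trailing text, if any, must repeat the opening line's text.
            if suffix and suffix != expected:
                raise ValueError(f"line {index + 1}: mismatched closing text")
            commented.add(index)
        elif open_suffixes:
            # Any other line inside an open comment, including an indented
            # `//\}`, is commented out without being lexed.
            commented.add(index)
    return commented
```

Because only the first four characters of each line are inspected, the contents of a commented-out region never need to be tokenized, which is the property the block comments rationale asks for.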
+ ### String alternatives Block string literals could use explicit characters in the body to indicate the From 09117de4764c07f27ec7d54ed20350aa2ff79736 Mon Sep 17 00:00:00 2001 From: Richard Smith Date: Fri, 12 Jun 2020 00:09:48 -0700 Subject: [PATCH 10/11] Add details on directionality based on discussion in the context of issue#19. --- docs/proposals/p0016.md | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/docs/proposals/p0016.md b/docs/proposals/p0016.md index 8a4e6198cc28..e71fd9df6a33 100644 --- a/docs/proposals/p0016.md +++ b/docs/proposals/p0016.md @@ -527,6 +527,21 @@ circumstances: A *token* is a documentation comment, a literal, a keyword, an identifier, a designator, an operator, or a bracket. +### Directionality + +After tokens are formed, a final check is performed to ensure that the +appearance of the code matches its meaning, as follows. The Unicode +Bidirectional Algorithm, as described in Unicode 13.0.0 +[UAX#9](https://unicode.org/reports/tr9/), is applied to the source text. It is +an error if any part of a token would be displayed after any part of a later +token, or if any operator or bracket or the delimiters of a string literal or +comment does not have a resolved directionality of L (left-to-right). + +Such issues can be resolved by the insertion of explicit left-to-right marks. +The Carbon formatter tool will insert such marks as necessary to satisfy these +constraints. +[[why?]](#formatting-rationale) + ## Rationale ### Encoding rationale From fffd2fb415e6a65860691e0b9fdf7a953904107f Mon Sep 17 00:00:00 2001 From: Richard Smith Date: Fri, 12 Jun 2020 12:44:21 -0700 Subject: [PATCH 11/11] Expand directionality discussion, relocate to be adjacent to encoding discussion, and convert the suggestion that we enforce directionality to an open question. 
--- docs/proposals/p0016.md | 74 ++++++++++++++++++++++++++++++++--------- 1 file changed, 59 insertions(+), 15 deletions(-) diff --git a/docs/proposals/p0016.md b/docs/proposals/p0016.md index e71fd9df6a33..c4ec504af385 100644 --- a/docs/proposals/p0016.md +++ b/docs/proposals/p0016.md @@ -101,6 +101,28 @@ potentially disallow everything other than space and newline (and, depending on what we decide for [directionality](#directionality), perhaps LTR marks), which would lead to a substantially simpler indentation rule. +#### Directionality + +Explicit left-to-right marks are permitted in order to allow the user to ensure +that the visual appearance of the code matches the actual parse order of the +tokens. +[[why?]](#directionality-rationale) + +The Carbon formatter tool will insert such marks as necessary in order +to guarantee this property. For example, left-to-right marks may be inserted +around identifiers containing right-to-left text to avoid adjacent operators +being reversed, and left-to-right marks may be inserted around string literals +in order to ensure that the delimiters are displayed at the beginning and end +of the literal. +[[why?]](#formatting-rationale) + +**Open question:** Should we require that in a well-formed Carbon program, the +appearance of the source code (as determined by the Unicode Bidirectional +Algorithm) matches the token order as interpreted by the Carbon implementation? +This would introduce implementation and compilation-time cost, but would allow +us to provide stronger guarantees that code does what a reader believes it to +do. + ### Comments A *comment* begins with the characters `//` and runs to the end of the line. @@ -527,21 +549,6 @@ circumstances: A *token* is a documentation comment, a literal, a keyword, an identifier, a designator, an operator, or a bracket. -### Directionality - -After tokens are formed, a final check is performed to ensure that the -appearance of the code matches its meaning, as follows. 
The Unicode -Bidirectional Algorithm, as described in Unicode 13.0.0 -[UAX#9](https://unicode.org/reports/tr9/), is applied to the source text. It is -an error if any part of a token would be displayed after any part of a later -token, or if any operator or bracket or the delimiters of a string literal or -comment does not have a resolved directionality of L (left-to-right). - -Such issues can be resolved by the insertion of explicit left-to-right marks. -The Carbon formatter tool will insert such marks as necessary to satisfy these -constraints. -[[why?]](#formatting-rationale) - ## Rationale ### Encoding rationale @@ -603,6 +610,43 @@ are used where necessary, that identifiers are properly normalized, and so on, and we can simplify our implementation somewhat by only permitting input in a single Unicode normalization form. +### Directionality rationale + +Source code containing right-to-left string literals and right-to-left +identifiers will often display in a way that differs from its interpretation as +code. For example: + +``` +// The left operand of the + is "مرحبا", the right operand is "بالعالم". +var String: x = "مرحبا" + "بالعالم"; +``` + +Here, when displaying the code, the Unicode Bidirectional Algorithm identifies +all the text from the first `"` to the last `"` as being a single right-to-left +context, and so reverses that entire substring (including the `" + "` in the +middle). + +Inserting a left-to-right mark after each string literal containing +right-to-left text fixes the problem in this case: + +``` +// Same example as before, but now a left-to-right mark has been inserted after +// each string literal. +var String: x = "مرحبا"‎ + "بالعالم"‎; +``` + +Similar things can happen with right-to-left identifiers. For example, in + +``` +var مرحبا: بالعالم; +``` + +the declared identifier and type are likely to be displayed in the opposite +order from how they would be interpreted by a Carbon implementation. 
+
+If we allow explicit left-to-right marks in the source code and treat them as
+whitespace, such issues can be fixed by the Carbon formatting tool.
+
 ### Line continuation rationale
 
 Line continuation in C++ is sometimes necessary in order to combine the needs