From 735ac21ec74352ef923fa1530b4865c06c4c494c Mon Sep 17 00:00:00 2001
From: Richard Smith <richard@metafoo.co.uk>
Date: Mon, 18 May 2020 23:00:03 -0700
Subject: [PATCH 01/11] P0016 Lexical conventions, first draft.

---
 docs/proposals/p0016.md | 543 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 543 insertions(+)
 create mode 100644 docs/proposals/p0016.md
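
Reviewer note (not part of the patch): the raw string literal rule in the
"Strings" section can be sketched as below. This is a minimal illustration in
Python, assuming only the single-line `r#..."..."#...` form described in this
draft; `scan_raw_string` is a hypothetical helper name, not part of any Carbon
tooling, and block forms are not handled.

```python
def scan_raw_string(src, start=0):
    """Return (content, end_index) for a raw string at `start`, or None.

    Sketch of the draft rule: `r`, then N `#` characters, then `"`; the
    literal ends at the first following `"` that is followed by N `#`.
    """
    i = start
    if i >= len(src) or src[i] != "r":
        return None
    i += 1
    hashes = 0
    while i < len(src) and src[i] == "#":
        hashes += 1
        i += 1
    if i >= len(src) or src[i] != '"':
        return None
    i += 1
    closer = '"' + "#" * hashes
    end = src.find(closer, i)
    if end < 0:
        return None  # unterminated literal
    # The text in between is not interpreted in any way.
    return src[i:end], end + len(closer)
```

Both single-line raw strings from the example in the document scan as the
comments there describe: a trailing backslash is ordinary content, and an
inner `r#"..."#` does not end an `r##"..."##` literal.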

diff --git a/docs/proposals/p0016.md b/docs/proposals/p0016.md
new file mode 100644
index 000000000000..d67d96946cfd
--- /dev/null
+++ b/docs/proposals/p0016.md
@@ -0,0 +1,543 @@
+<!--
+Part of the Carbon Language, under the Apache License v2.0 with LLVM Exceptions.
+See /LICENSE for license information.
+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+-->
+
+# Carbon: Lexical conventions
+
+- **Authors:** Richard Smith
+- **[Tracking issue](https://github.com/carbon-language/carbon-lang/issues/16)**
+- **Status:** RFC
+- **Created:** 2020-05-18
+
+**_PLEASE_ DO NOT SHARE OUTSIDE CARBON FORUMS**
+
+## Problem
+
+This document proposes a set of rules for the initial phase of processing a
+Carbon source file: interpreting the contents of the file and forming
+[tokens](#tokens).
+
+## Proposal
+
+Carbon source files are [UTF-8](#file-contents-and-encoding) text files whose
+contents are divided into [whitespace](#whitespace), [comments](#comments),
+[literals](#literals), [identifiers](#identifiers), [keywords](#keywords),
+[designators](#designators), [operators](#operators), and
+[brackets](#brackets), as described below.
+
+## Details
+
+### File contents and encoding
+
+Carbon source files are Unicode text files encoded in UTF-8. An initial UTF-8
+BOM is permitted and ignored. All contents outside of [comments](#comments) and
+[literals](#literals) shall be in Normalization Form C.
+[[why?]](#encoding-rationale)
+
+Carbon is currently based on Unicode 13.0, and will adopt new Unicode versions
+as they are published.
+
+### Whitespace
+
+Characters are identified as whitespace if they have the Unicode
+`Pattern_White_Space` property. This includes the C++ whitespace characters:
+
+ * Space and horizontal tab
+ * Carriage return and line feed (which C++ conflates as "new line")
+ * Vertical tab and form feed
+
+As of Unicode version 13, 5 additional characters are included:
+
+ * U+0085 NEXT LINE
+ * U+200E LEFT-TO-RIGHT MARK
+ * U+200F RIGHT-TO-LEFT MARK
+ * U+2028 LINE SEPARATOR
+ * U+2029 PARAGRAPH SEPARATOR
+
+Space, horizontal tab, and the LTR and RTL mark are *horizontal whitespace*
+characters. All other whitespace characters are *vertical whitespace*
+characters.
+
+Characters with the Unicode property `White_Space` but not
+`Pattern_White_Space` are invalid outside comments and literals. Code
+formatters are encouraged to convert them into recognized horizontal whitespace
+characters. Implementations are encouraged to recover from the error as if
+those characters were treated as horizontal whitespace.
+
+### Comments
+
+A *comment* in Carbon is either:
+
+ * A *line comment*, beginning with `//` and running to the end of the line, or
+ * A *block comment*, beginning with `/*` and running to the matching `*/`.
+
+Carbon has no mechanism for physical line continuation, so a `//` comment
+always ends at the next vertical whitespace character.
+[[why?]](#line-continuation-rationale)
+
+If the character after the `/*` introducing a block comment is `{`, the comment
+is a *code comment*. In a code comment, the following text is tokenized until a
+matching `}*/` token is formed; such a token terminates the comment. (In
+particular, such a token is not recognized if it is nested within another
+comment or a literal.) Otherwise, the comment ends at the first matching `*/`
+character sequence.
+[[why?]](#nested-comments-rationale)
+
+Example:
+
+```carbon
+// This is a comment.
+// The characters /* introduce a block comment.
+This is not a comment.
+/*{
+  // The characters */ end a block comment.
+}*/
+This is not a comment.
+```
+
+If the character after the comment introducer is an exclamation mark, the
+comment is a documentation comment. Documentation comments are tokens, and are
+recognized by the language grammar only in specific locations, which determine
+the entity to which they attach.
+[[why?]](#nested-comments-rationale)
+
+Non-documentation comments are treated equivalently to whitespace.
+
+In addition to the cases above, a block comment introducer may be followed by
+additional `*` characters. If the character after the comment introducer is not
+one of those mentioned above, it shall be a whitespace character.
+[[why?]](#comment-introducers-rationale)
+
+### Literals
+
+Carbon provides literal syntax for numbers, and for character and string data.
+(Additional constants, such as `True` and `Nullptr`, are exposed as keywords or
+predeclared identifiers.)
+
+A *literal* is a numeric literal, a character literal, or a string literal, as
+defined below.
+
+A literal shall not be immediately followed by a character with property
+`XID_Start`. Carbon has no literal suffixes, but the corresponding lexical
+space is reserved for future extensions.
+
+#### Numbers
+
+Decimal integers are written as a non-zero decimal digit followed by zero or
+more additional decimal digits.
+
+Integers in other bases are written as a `0` followed by a base specifier
+character, followed by a sequence of digits in the corresponding base. The
+available base specifiers and corresponding bases are:
+
+| Base specifier | Base | Digits                                |
+| -------------- | ---- | ------------------------------------- |
+| `b` or `B`     | 2    | `0` and `1`                           |
+| `o`            | 8    | `0` ... `7`                           |
+| `x` or `X`     | 16   | `0` ... `9`, `a` ... `f`, `A` ... `F` |
+
+[TODO: This doesn't belong here.] There are no size suffixes. Each literal has
+a unique type that can be converted to any sufficiently-large integer type, but
+operations on it are always exact.
+
+Real numbers are written as a sequence of one or more decimal digits followed
+by a decimal point followed by a sequence of one or more decimal digits.
+
+A real number can be followed by an `e`, an optional `+` or `-` (defaulting to
+`+`), and a decimal integer *N*; the effect is to multiply the given value by
+10<sup>*N*</sup>.
+
+A *numeric literal* is an integer or real number expressed as described above.
+
+#### Characters
+
+A *character literal* is formed of any single character other than a backslash
+(`\\`) or single quotation mark, enclosed in a pair of single quotation marks
+(`'`), or an escape sequence enclosed in a pair of single quotation marks.
+
+An escape sequence is replaced by the corresponding character sequence or
+encoding, which shall fit in a single character.
+
+TODO: Table of escape sequences.
+
+#### Strings
+
+A *simple string literal* is formed of a sequence of
+
+ * characters other than backslashes, double quotation marks, and vertical
+   whitespace
+ * escape sequences
+
+enclosed in double quotation marks (`"`). Each escape sequence is replaced with
+the corresponding character sequence or encoding.
+
+A *raw string literal* starts with an `r` followed by *N* `#` characters
+followed by a double quotation mark, and ends with the first following
+occurrence of a double quotation mark followed by *N* `#` characters. The text
+in between is not interpreted in any way.
+
+A *block string literal* starts with three double quotation marks followed by
+an optional sequence of non-whitespace characters, followed by a newline. Each
+following line within the literal shall start with the same initial sequence of
+zero or more horizontal whitespace characters and optionally one `|` character
+as were present on the first such line. The literal ends at the first instance
+of three double quotation mark characters (where the first such character is
+not escaped). The common initial horizontal whitespace is removed from each
+line, as is the terminating newline character.  Escape sequences are expanded
+as in a simple string literal. The initial sequence of characters before the
+newline is ignored, but can be used to indicate the formatting rules for a code
+formatter to use for the literal contents.
+[[why?]](#block-strings-rationale)
+
+A *raw block string literal* is expressed analogously to a raw string literal,
+but for a block string literal. Escape sequences are ignored.
+
+For example:
+
+```carbon
+fn f() {
+  var String: x = r#"""
+    This is the content of the string. The 'T' is the first character
+    of the string.
+    """ <-- This is not the end of the string. But this is --> """#;
+  var String: y = r"Hello\"; // OK, final character is \
+  var String: z = r##"Raw strings r#"nesting"#"##;
+
+  var String: starts_with_whitespace = """
+    |  int x = 1;
+    |  int y = 2;""";
+  var String: starts_with_pipe = r#"""
+    || is a pipe.
+    |\ is not a pipe."""#;
+
+  var String: code = """c++
+  const char *str = R"foo(hello)foo";
+  """;
+
+  var String: error = """
+This is invalid (insufficiently indented).""";
+}
+```
+
+### Identifiers
+
+An *identifier* is a maximal sequence of characters beginning with a character
+with Unicode property `XID_Start`, followed by zero or more characters with
+property `XID_Continue`.
+
+Notably, `XID_Start` does not include the underscore character. Tokens
+beginning with an identifier are reserved, but are expected to be used as
+pattern matching placeholders.
+
+Additionally, a *raw identifier* can be specified by prefixing an identifier
+with `r#`, such as `r#requires`. Raw identifiers can be used to introduce and
+use names that are lexically identical to keywords.
+
+All identifier tokens in all contexts are looked up using the same lexical
+scoping rule.
+
+An identifier shall not be immediately followed by a `"` or `'`.
+
+### Keywords
+
+A *keyword* is an identifier with predefined meaning. Carbon has a predefined
+set of keywords, that will be specified separately as part of the syntax rules.
+
+An identifier that is a keyword may also be declared explicitly in a source
+file. The same identifier shall not be used as both a keyword and as a non-raw
+non-keyword identifier in a single source file.
+[[why?]](#keywords-rationale)
+
+Example:
+
+```carbon
+var Int: fn = 3; // OK, variable named 'fn'
+fn f() {}        // error, 'fn' is not a keyword in this source file
+interface var {} // error, already used 'var' as a keyword in this source file
+```
+
+### Designators
+
+A *designator* is a token formed by prefixing an identifier with a period
+character, such as `.member`. The identifier after the period is the *member
+name*, and is looked up in a context-dependent manner.
+[[why?]](#designators-rationale)
+
+### Operators
+
+An *operator* is a maximal sequence of characters with Unicode property
+`Pattern_Syntax`, excluding `"` and `'` and those characters with class `Ps` or
+`Pe` (for which, see [brackets](#brackets)), which we will refer to as
+*operator characters*.
+[[why?]](#operators-rationale)
+We do not intend to define any operators containing non-ASCII characters. The
+ASCII operator characters are:
+
+```
+!  #  $  %  &  *  +  -  .  /  :  ;  <  =  >  ?  @  \  ^  `  |  ~
+```
+
+Of these, we intend to not use <code>\`</code> due to its common use to escape
+code, nor `$` due to its absence from many non-US keyboards. This leaves 20
+operator characters, 400 digraphs, and so on.
+
+Bracket operators, described below, are also operators.
+
+### Brackets
+
+A *simple open bracket* is a character with Unicode property `Pattern_Syntax`
+and character class `Ps`, such as `(` or `[`.
+A *simple close bracket* is a character
+with Unicode property `Pattern_Syntax` and character class `Pe`, such as `}`.
+A *bracket terminator character* is one of `|` or `:`.
+A *bracket continuation character* is an operator character that is not a
+bracket terminator character.
+
+A *compound open bracket* is a simple open bracket followed by zero or more
+bracket continuation characters followed by a bracket terminator character,
+such as `[:`.
+A *compound close bracket* is a bracket terminator character followed by zero
+or more bracket continuation characters followed by a simple close bracket,
+such as `|=)`.
+[[why?]](#compound-brackets-rationale)
+
+A *close bracket* is either a simple open bracket or a compound open bracket.
+A *close bracket* is either a simple close bracket or a compound close bracket.
+
+The close bracket matching an open bracket is formed by reversing the character
+sequence in the open bracket and replacing each character with class `Ps` with
+the corresponding character with class `Pe`. Every open bracket is required to
+have a matching close bracket such that the bracketed regions form a tree
+structure.
+
+There are 3 single-character brackets (`()`, `{}`, `[]`), 6 bracket digraphs
+(`(| |)`, `{| |}`, `[| |]`, `(: :)`, `{: :}`, `[: :]`), 108 bracket trigraphs,
+and so on.
+
+For example:
+
+```carbon
+(this is within brackets {and this [this too]})
+(|this is a different kind of bracket {: and another :}
+   (**|lots of kinds of brackets can be built [=: this way :=]|**)
+ |)
+```
+
+TODO: Indentation restrictions for bracket matching?
+
+In addition, Carbon recognizes *bracket operators*, formed by a simple open
+bracket followed by one or more operator characters followed by a matching
+simple close bracket, such as `[~>]` or `(*)`. Bracket operators are operators,
+not brackets.
+
+### Tokens
+
+A *token* is a documentation comment, a literal, an identifier, a keyword, a
+designator, an operator, or a bracket. Tokens are formed by a single
+left-to-right scan of the source file, using a "max munch" rule: the longest
+possible next token is formed at each step.
+
+## Rationale
+
+### Encoding rationale
+
+We intend to follow the Unicode Consortium's recommendations for identifiers in
+programming languages as described in Unicode 13.0.0
+[UAX#31](https://unicode.org/reports/tr31/) Revision 31. We do not see a reason
+to be inventive in this regard, and delegating the complex considerations over
+how Unicode characters should be used to a group with greater expertise in that
+area seems appropriate.
+
+We observe UAX#31's requirements as follows:
+
+ * UAX31-R1: requirement met. Identifiers are of the form `XID_Start`
+   `XID_Continue`\*.
+ * UAX31-R1a: requirement not met. Format characters are not permitted in
+   identifiers.
+ * UAX31-R1b: requirement not met. We intend for Carbon evolution to follow
+   Unicode evolution, including removing identifier characters as appropriate
+   over time.
+ * UAX31-R2: requirement not met. We intend for Carbon evolution to follow
+   Unicode evolution, including adding identifier characters as appropriate
+   over time.
+ * UAX31-R3: requirement met. Carbon treats characters as whitespace if and
+   only if they are `Pattern_White_Space` characters, and all operator tokens
+   are formed exclusively from `Pattern_Syntax` characters.
+ * UAX31-R4: requirement met. Carbon identifiers are required to be in NFC,
+   so identifiers that are the same in NFC are equivalent.
+ * UAX31-R5: requirement not met. Carbon identifiers are case-sensitive.
+ * UAX31-R6: requirement met. No characters are excluded from normalization.
+ * UAX31-R7: requirement not met. Carbon identifiers are case-sensitive.
+
+### Line continuation rationale
+
+Line continuation in C++ is sometimes necessary in order to combine the needs
+of line-oriented parsing with the desire to meet a specific column limit or
+format code nicely. For example:
+
+```c++
+#define SOME_MACRO \
+  very long macro body \
+  split over multiple lines
+#define OTHER_MACRO \
+  if (pretty_code) \
+    do { wrap_lines(); } while (false)
+```
+
+We do not have a commensurate need for line continuation in Carbon. We intend
+to include no line-oriented syntax. In reasonable cases where an individual
+token is longer than a natural column limit (such as for a long string
+literal), we will provide a mechanism to wrap the token without line
+continuation.
+
+Line continuation for comments in particular is a known source of gotchas in C
+and C++ programs.
+
+### Nested comments rationale
+
+There is a known need for nested comment syntax: it is important to be able to
+comment out a block of code confident in the knowledge that all text between
+the comment markers (and exactly that text) was in fact commented out.
+
+The only facility for this in C and C++ is `#if 0` ... `#endif`, and that is
+what is used in practice. As we do not want Carbon to have a textual
+preprocessor, enabling nesting of some other comment scheme seems natural.
+
+Conversely, it is valuable to have a comment syntax for human-readable
+commentary that can be used within a line and can span multiple lines, and we
+see no reason to invent something other than `/* ... */` for this purpose. We
+cannot use the same notation as nested comment syntax without compromising its
+usage for human-readable commentary. For example:
+
+```carbon
+f(1.0 / x /* can't be negative */, 2);
+```
+
+... would not be valid as a `/*{ ... }*/` comment, because it would contain an
+unterminated character literal.
+
+### Comment introducers rationale
+
+We anticipate the possibility of adding additional kinds of comment in the
+future. Reserving syntactic space in comment syntax, in a way that is easy for
+programs to avoid, allows us to add such additional kinds of comment as a
+non-breaking change.
+
+### Block strings rationale
+
+Block literals are a useful way of expressing multiline string content in a
+program. It's useful to treat block literals and raw literals as distinct
+concepts: even within multiline literals, explicit encoding of tab characters,
+character escapes, and so forth can be useful or undesirable.
+
+Further, separating the concepts permits us to disallow newlines in non-block
+raw string literals, which prevents one class of runaway lexing problem: the
+inability to find the end of a raw string literal can lead to scanning and
+consuming the entirety of the source file.
+
+Removing the initial indentation from block string literals serves two primary
+purposes: it allows the lexer to intelligently abort and backtrack sooner if it
+reaches the end of the indented region without seeing the end of string marker,
+and it improves the readability of the code.
+
+### Keywords rationale
+
+One of Carbon's most important goals is to support program and language
+evolution. We know that the set of keywords in Carbon will grow over time,
+and the easiest kind of language change from an evolutionary perspective is one
+that is known to break no programs, that lets programs migrate incrementally to
+the new language rule, and that either has no migration cost or only imposes
+automatable migration cost on the code that intends to use the new feature.
+
+The proposed approach to keywords intends to support such a migration story.
+Adding new keywords to Carbon is a non-breaking change. Because every
+identifier is locally declared using obvious syntax before it is used, it is
+straightforward to detect, using simple rules, whether a particular identifier
+is a keyword or not in a particular source file.
+
+Using a new keyword in an existing source file requires first replacing all
+existing uses of that identifier with raw identifiers throughout the source
+file, which is a mechanical, automatable change.
+
+For identifiers whose scopes are constrained to a single source file, raw
+identifiers are not necessary to permit such a transition. However, for
+identifiers that are declared in one source file and consumed in another, we
+still need a mechanism to continue declaring a name as an identifier after it
+has been claimed as a keyword.
+
+Note that while this means that adding a new keyword is cheap in terms of
+migration cost, we should still think of adding a keyword as being a
+significant undertaking, as each keyword will occupy space in the mind of the
+Carbon programmer. However, we should not feel any pressure to reuse the same
+keyword for distinct purposes.
+
+This approach brings one important restriction: in any syntax that introduces
+an identifier, there should never be an optional keyword preceding the
+identifier, and nor should the identifier be optional if it can be followed by
+a keyword.
+
+### Designators rationale
+
+We wish to have uniform scoping and name lookup rules throughout Carbon.
+However, we also wish to parse expressions such as `x.y`, where `x` is looked
+up as an identifier, and `y` has some other lookup rule. We also wish to use
+`.y` as a designator when initializing fields.
+
+It is reasonable to conclude that an identifier preceded by a period is a
+fundamentally different kind of token from a regular identifier: it has
+different name lookup rules (if it's looked up at all) and cannot simply be
+immediately resolved by lookup in the environment.
+
+Treating such tokens as special at the lexical level has other beneficial effects:
+
+ * It avoids a special case in the rules for binary operators. This would be
+   the only binary operator that affects how name lookup is performed on its
+   right-hand side, and would be the only binary operator for which we do not
+   expect (or perhaps require) whitespace on both sides.
+ * It prohibits whitespace between the period and the identifier. This enforces
+   an intended stylistic convention. As a postfix unary operator, we will also
+   enforce an absence of whitespace before designators in member access syntax.
+ * It frees up `.` for use as a binary operator, should we so desire.
+ * It allows extremely fast lexical name lookup for all identifiers via string
+   interning: the same lookup that checks whether an identifier is a keyword
+   can also perform the complete lexical lookup if it's not a keyword.
+ * It permits uniform typo correction, including correction to keywords, for
+   all identifiers in all contexts, because we can identify typos from the
+   lexing stage before we even reach the parser.
+
+### Operators rationale
+
+We use a strict "max munch" rule for operators, without regard for Carbon's
+current operator set. This requires parentheses in code that would apply
+multiple prefix or postfix operators in a row, such as `-*p`, but gives us the
+advantage that adding new operators is always a non-breaking change for all
+existing Carbon code.
+
+### Compound brackets rationale
+
+We intend for each notation in Carbon code to have exactly one meaning, and for
+the language to be evolvable in new directions. However, there are only three
+easily-typable sets of brackets for most Carbon programmers -- four if you
+include `<>`, which introduces a host of problems. We already know of more than
+this many different kinds of bracketed region we wish to support, and have
+started trying to play syntactic games to treat them as the same thing in order
+to get back down to only three.
+
+Including compound brackets allows us to solve these problems: we have an
+unbounded set of potential bracket pairs, without substantially increasing the
+complexity of parsing Carbon code. The bracket terminator characters are chosen
+such that a bracket followed by an operator can be easily visually separated by
+a reader of Carbon code: even in a tricky case such as `[*|*p|*]`, the
+expression `*p` within the `[*|` ... `|*]` brackets is reasonably readable. And
+we will not need to resort to three-character brackets unless we exhaust our 9
+kinds of one- and two-character brackets.
+
+Support for compound brackets requires that we make a concession: the bracket
+terminator characters cannot be used within prefix or postfix operators. For
+the two chosen symbols, this is unlikely to present a problem.
+
+## Alternatives considered
+
+TODO: Consider alternatives

From 511a9729edd69991d8123d0ce802a9b3c22927d4 Mon Sep 17 00:00:00 2001
From: Richard Smith <richard@metafoo.co.uk>
Date: Tue, 19 May 2020 18:33:38 -0700
Subject: [PATCH 02/11] Rephrase introduction of list of ASCII whitespace
 characters.

Co-authored-by: Dmitri Gribenko <gribozavr@gmail.com>
---
 docs/proposals/p0016.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
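
Reviewer note (not part of the patch): the whitespace classification this
section describes can be written down directly. The sketch below is
illustrative Python; the code point sets are transcribed from the lists in the
"Whitespace" section, and the names `HORIZONTAL`, `VERTICAL`, and `classify`
are hypothetical.

```python
# Horizontal whitespace per the proposal: space, horizontal tab,
# LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK.
HORIZONTAL = {0x0020, 0x0009, 0x200E, 0x200F}

# All other Pattern_White_Space characters are vertical whitespace:
# LF, VT, FF, CR, NEXT LINE, LINE SEPARATOR, PARAGRAPH SEPARATOR.
VERTICAL = {0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029}


def classify(ch):
    """Return "horizontal", "vertical", or None for a single character."""
    cp = ord(ch)
    if cp in HORIZONTAL:
        return "horizontal"
    if cp in VERTICAL:
        return "vertical"
    return None  # not whitespace under this rule
```

A character such as U+00A0 NO-BREAK SPACE has `White_Space` but not
`Pattern_White_Space`, so `classify` returns `None` for it; the proposal makes
such characters invalid outside comments and literals.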

diff --git a/docs/proposals/p0016.md b/docs/proposals/p0016.md
index d67d96946cfd..7420fa921695 100644
--- a/docs/proposals/p0016.md
+++ b/docs/proposals/p0016.md
@@ -42,7 +42,7 @@ as they are published.
 ### Whitespace
 
 Characters are identified as whitespace if they have the Unicode
-`Pattern_White_Space` property. This includes the C++ whitespace characters:
+`Pattern_White_Space` property. These include the ASCII whitespace characters (recognized in C++):
 
  * Space and horizontal tab
  * Carriage return and line feed (which C++ conflates as "new line")

From 22ed0b8e8bc0598f5e6a996bee9cf8c0aa66526a Mon Sep 17 00:00:00 2001
From: Richard Smith <richard@metafoo.co.uk>
Date: Wed, 20 May 2020 15:19:24 -0700
Subject: [PATCH 03/11] Apply suggestions from code review

Co-authored-by: Dmitri Gribenko <gribozavr@gmail.com>
Co-authored-by: josh11b <josh11b@users.noreply.github.com>
---
 docs/proposals/p0016.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
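
Reviewer note (not part of the patch): with the open/close bracket definitions
corrected here, the matching rule ("reverse the open bracket and replace each
`Ps` character with the corresponding `Pe` character") can be sketched as
follows. This is illustrative Python restricted to the three ASCII bracket
pairs; `PAIRS` and `matching_close` are hypothetical names.

```python
# ASCII-only assumption: the proposal allows any Ps/Pe pair, but only
# these three are considered in this sketch.
PAIRS = {"(": ")", "[": "]", "{": "}"}


def matching_close(open_bracket):
    """Reverse the open bracket, swapping each Ps character for its Pe."""
    return "".join(PAIRS.get(ch, ch) for ch in reversed(open_bracket))
```

Applied to the open brackets in the document's example, this gives `(|` to
`|)`, `[=:` to `:=]`, and `(**|` to `|**)`.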

diff --git a/docs/proposals/p0016.md b/docs/proposals/p0016.md
index 7420fa921695..629a81f21cda 100644
--- a/docs/proposals/p0016.md
+++ b/docs/proposals/p0016.md
@@ -56,7 +56,7 @@ As of Unicode version 13, 5 additional characters are included:
  * U+2028 LINE SEPARATOR
  * U+2029 PARAGRAPH SEPARATOR
 
-Space, horizontal tab, and the LTR and RTL mark are *horizontal whitespace*
+Space, horizontal tab, and the LTR and RTL marks are *horizontal whitespace*
 characters. All other whitespace characters are *vertical whitespace*
 characters.
 
@@ -228,7 +228,7 @@ with Unicode property `XID_Start`, followed by zero or more characters with
 property `XID_Continue`.
 
 Notably, `XID_Start` does not include the underscore character. Tokens
-beginning with an identifier are reserved, but are expected to be used as
+beginning with an underscore are reserved, but are expected to be used as
 pattern matching placeholders.
 
 Additionally, a *raw identifier* can be specified by prefixing an identifier
@@ -303,7 +303,7 @@ or more bracket continuation characters followed by a simple close bracket,
 such as `|=)`.
 [[why?]](#compound-brackets-rationale)
 
-A *close bracket* is either a simple open bracket or a compound open bracket.
+An *open bracket* is either a simple open bracket or a compound open bracket.
 A *close bracket* is either a simple close bracket or a compound close bracket.
 
 The close bracket matching an open bracket is formed by reversing the character

From cc207d19d662ff8236c5f687fcfa240fe37292dc Mon Sep 17 00:00:00 2001
From: Richard Smith <richard@metafoo.co.uk>
Date: Tue, 19 May 2020 19:05:29 -0700
Subject: [PATCH 04/11] Recognize `0` as an integer literal.

---
 docs/proposals/p0016.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
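
Reviewer note (not part of the patch): after this change the integer literal
rule can be approximated by one regular expression. The Python sketch below
assumes that "a sequence of digits" after a base specifier means one or more
digits, and it does not check the separate restriction on what may follow a
literal; `INTEGER` and `is_integer_literal` are hypothetical names.

```python
import re

# Alternatives, in order: binary, octal, hexadecimal, decimal with a
# non-zero leading digit, and the single `0` added by this patch.
INTEGER = re.compile(r"0[bB][01]+|0o[0-7]+|0[xX][0-9a-fA-F]+|[1-9][0-9]*|0")


def is_integer_literal(text):
    """Return True if `text` as a whole matches the integer literal rule."""
    return INTEGER.fullmatch(text) is not None
```

Under this reading `0`, `42`, and `0x1F` are accepted, while `007` and `0O7`
are rejected: leading zeros are not allowed in decimal integers, and only a
lowercase `o` is listed as the octal base specifier.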

diff --git a/docs/proposals/p0016.md b/docs/proposals/p0016.md
index 629a81f21cda..5a293fdda0d5 100644
--- a/docs/proposals/p0016.md
+++ b/docs/proposals/p0016.md
@@ -126,7 +126,7 @@ space is reserved for future extensions.
 #### Numbers
 
 Decimal integers are written as a non-zero decimal digit followed by zero or
-more additional decimal digits.
+more additional decimal digits, or as a single `0`.
 
 Integers in other bases are written as a `0` followed by a base specifier
 character, followed by a sequence of digits in the corresponding base. The

From bce8eb461854fe141b3791a029e256fea9fa6daf Mon Sep 17 00:00:00 2001
From: Richard Smith <richard@metafoo.co.uk>
Date: Wed, 20 May 2020 15:19:34 -0700
Subject: [PATCH 05/11] Updates based on review feedback.

---
 docs/proposals/p0016.md | 302 ++++++++++++++++++++++++++++++----------
 1 file changed, 231 insertions(+), 71 deletions(-)
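
Reviewer note (not part of the patch): the new definitions of *line* and
*indentation* can be checked against the examples in the text with a short
sketch. This is illustrative Python; the character sets are taken from the
"Whitespace" section, and `split_lines`, `indentation`, and `more_indented`
are hypothetical names.

```python
import re

HORIZONTAL = " \t\u200e\u200f"
VERTICAL = "\n\x0b\x0c\r\x85\u2028\u2029"

# CR LF is tried first so that the empty sequence between a carriage
# return and a line feed is not treated as a line.
SEPARATOR = re.compile("\r\n|[" + VERTICAL + "]")


def split_lines(text):
    """Split text into lines using the draft definition of a line."""
    return SEPARATOR.split(text)


def indentation(line):
    """Return the leading run of horizontal whitespace characters."""
    n = 0
    while n < len(line) and line[n] in HORIZONTAL:
        n += 1
    return line[:n]


def more_indented(a, b):
    """True if b's indentation is a proper prefix of a's indentation."""
    ia, ib = indentation(a), indentation(b)
    return ia != ib and ia.startswith(ib)
```

With these definitions `"foo\r\nbar"` splits into two lines and `"foo\n\rbar"`
into three, the middle one empty. A tab-indented line is neither more nor less
indented than a space-indented one, because neither indentation is a prefix of
the other.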

diff --git a/docs/proposals/p0016.md b/docs/proposals/p0016.md
index 5a293fdda0d5..337e94dd7707 100644
--- a/docs/proposals/p0016.md
+++ b/docs/proposals/p0016.md
@@ -23,9 +23,9 @@ Carbon source file: interpreting the contents of the file and forming
 
 Carbon source files are [UTF-8](#file-contents-and-encoding) text files whose
 contents are divided into [whitespace](#whitespace), [comments](#comments),
-[literals](#literals), [identifiers](#identifiers), [keywords](#keywords),
-[designators](#designators), [operators](#operators), and
-[brackets](#brackets), as described below.
+[literals](#literals), [identifiers](#identifiers) (including
+[keywords](#keywords)), [designators](#designators), [operators](#operators),
+and [brackets](#brackets), as described below.
 
 ## Details
 
@@ -66,6 +66,17 @@ formatters are encouraged to convert them into recognized horizontal whitespace
 characters. Implementations are encouraged to recover from the error as if
 those characters were treated as horizontal whitespace.
 
+A *line* is a possibly-empty sequence of characters preceded and followed by
+either vertical whitespace or the beginning or end of the file. As a special
+case, an empty sequence of characters preceded by a carriage return and
+followed by a line feed is not treated as a line. For example, "foo\r\nbar"
+contains two lines, and "foo\n\rbar" contains three, of which the middle line
+is empty.
+
+The *indentation* of a line is the sequence of horizontal whitespace characters
+at the start of the line. A line *A* has more indentation than a line *B* if
+the indentation of *B* is a proper prefix of the indentation of *A*.
+
 ### Comments
 
 A *comment* in Carbon is either:
@@ -119,9 +130,9 @@ predeclared identifiers.)
 A *literal* is a numeric literal, a character literal, or a string literal, as
 defined below.
 
-A literal shall not be immediately followed by a character with property
-`XID_Start`. Carbon has no literal suffixes, but the corresponding lexical
-space is reserved for future extensions.
+A literal shall not be immediately followed by an [identifier continuation
+character](#identifiers). Carbon has no literal suffixes, but the corresponding
+lexical space is reserved for future extensions.
 
 #### Numbers
 
@@ -147,23 +158,14 @@ by a decimal point followed by a sequence of one or more decimal digits.
 
 A real number can be followed by an `e`, an optional `+` or `-` (defaulting to
 `+`), and a decimal integer *N*; the effect is to multiply the given value by
-10<sup>*N*</sup>.
+10<sup>&plusmn;*N*</sup>.
 
 A *numeric literal* is an integer or real number expressed as described above.
 
-#### Characters
-
-A *character literal* is formed of any single character other than a backslash
-(`\\`) or single quotation mark, enclosed in a pair of single quotation marks
-(`'`), or an escape sequence enclosed in a pair of single quotation marks.
-
-An escape sequence is replaced by the corresponding character sequence or
-encoding, which shall fit in a single character.
-
-TODO: Table of escape sequences.
-
 #### Strings
 
+[[alternatives]](#string-alternatives)
+
 A *simple string literal* is formed of a sequence of
 
  * characters other than backslashes, double quotation marks, and vertical
@@ -173,81 +175,108 @@ A *simple string literal* is formed of a sequence of
 enclosed in double quotation marks (`"`). Each escape sequence is replaced with
 the corresponding character sequence or encoding.
 
+TODO: Table of escape sequences.
+
 A *raw string literal* starts with an `r` followed by *N* `#` characters
 followed by a double quotation mark, and ends with the first following
-occurrence of a double quotation mark followed by *N* `#` characters. The text
-in between is not interpreted in any way.
-
-A *block string literal* starts with three double quotation marks followed by
-an optional sequence of non-whitespace characters, followed by a newline. Each
-following line within the literal shall start with the same initial sequence of
-zero or more horizontal whitespace characters and optionally one `|` character
-as were present on the first such line. The literal ends at the first instance
-of three double quotation mark characters (where the first such character is
-not escaped). The common initial horizontal whitespace is removed from each
-line, as is the terminating newline character.  Escape sequences are expanded
-as in a simple string literal. The initial sequence of characters before the
-newline is ignored, but can be used to indicate the formatting rules for a code
-formatter to use for the literal contents.
+occurrence of a double quotation mark followed by *N* `#` characters on the
+same line. The text in between is not interpreted in any way.
+
+A *block string literal* starts with three double quotation marks, followed by
+an optional file type indicator, followed by a newline, and ends at the next
+instance of three double quotation marks. The closing `"""` shall be the first
+non-whitespace characters on that line. The lines between the opening line and
+the closing line are *content lines*. Each non-empty content line shall be [at
+least as indented](#whitespace) as the line containing the closing `"""`. The
+closing line shall be at least as indented as the opening line, and shall be
+more indented if the opening `"""` are not the first non-whitespace characters
+on the opening line. The content of the literal is formed by removing the
+indentation of the closing line from each (non-empty) content line, and
+concatenating the results with a line feed character added between each pair of
+lines.
 [[why?]](#block-strings-rationale)
 
+A *file type indicator* is a sequence of characters that are either [identifier
+continuation characters](#identifiers) or [operator characters](#operators).
+
 A *raw block string literal* is expressed analogously to a raw string literal,
-but for a block string literal. Escape sequences are ignored.
+but for a block string literal. Escape sequences are ignored, but indentation
+is removed and each vertical whitespace character is replaced by a line feed
+as in a non-raw block string literal.
 
 For example:
 
 ```carbon
 fn f() {
+  """
+  This is a string literal. Its first character is 'T' and its last character
+  is '.'. It contains one embedded newline, between 'character' and 'is'.
+  """
   var String: x = r#"""
     This is the content of the string. The 'T' is the first character
     of the string.
-    """ <-- This is not the end of the string. But this is --> """#;
+    """ <-- This is not the end of the string.
+    """#; // <-- But this is.
   var String: y = r"Hello\"; // OK, final character is \
   var String: z = r##"Raw strings r#"nesting"#"##;
 
-  var String: starts_with_whitespace = """
-    |  int x = 1;
-    |  int y = 2;""";
-  var String: starts_with_pipe = r#"""
-    || is a pipe.
-    |\ is not a pipe."""#;
-
-  var String: code = """c++
-  const char *str = R"foo(hello)foo";
-  """;
-
-  var String: error = """
-This is invalid (insufficiently indented).""";
+  // This string starts and ends with two "s.
+  var String: ambig1 = r#"""This is a raw string literal starting with """#;
+  var String: ambig2 = r#"""This
+    is a block string literal with file type 'This' and first character 'i'.
+    """#;
+
+  var String: starts_with_whitespace = """c++
+      int x = 1; // This line starts with two spaces.
+      int y = 2; // This line starts with two spaces.
+    """;
+
+  var String: invalid1 = """
+error: insufficiently indented.
+""";
+  var String: invalid2 = r#"""
+    error: closing """ is not on its own line."""#;
 }
 ```
 
+#### Characters
+
+A *character literal* is lexically identical to a simple string literal, except
+that it is enclosed in single quotation marks (`'`) instead of double quotation
+marks (`"`).
+
 ### Identifiers
 
 An *identifier* is a maximal sequence of characters beginning with a character
-with Unicode property `XID_Start`, followed by zero or more characters with
-property `XID_Continue`.
+with Unicode property `XID_Start`, followed by zero or more *identifier
+continuation characters*, which are characters that either have property
+`XID_Continue` or are underscores (`_`).
 
 Notably, `XID_Start` does not include the underscore character. Tokens
-beginning with an underscore are reserved, but are expected to be used as
-pattern matching placeholders.
+beginning with an underscore are [reserved](#reserved-tokens).
+[[why?]](#underscores-rationale)
 
 Additionally, a *raw identifier* can be specified by prefixing an identifier
 with `r#`, such as `r#requires`. Raw identifiers can be used to introduce and
 use names that are lexically identical to keywords.
+[[why?]](#keywords-rationale)
 
 All identifier tokens in all contexts are looked up using the same lexical
 scoping rule.
 
 An identifier shall not be immediately followed by a `"` or `'`.
 
-### Keywords
+#### Keywords
 
 A *keyword* is an identifier with predefined meaning. Carbon has a predefined
 set of keywords, that will be specified separately as part of the syntax rules.
 
 An identifier that is a keyword may also be declared explicitly in a source
 file. The same identifier shall not be used as both a keyword and as a non-raw
-non-keyword identifier in a single source file.
+non-keyword identifier in a single source file. As a consequence of these
+rules, from a lexical standpoint there is no notion of keywords -- whether a
+given identifier is a keyword depends on the syntactic structure of the source
+file.
 [[why?]](#keywords-rationale)
 
 Example:
@@ -312,9 +341,10 @@ the corresponding character with class `Pe`. Every open bracket is required to
 have a matching close bracket such that the bracketed regions form a tree
 structure.
 
-There are 3 single-character brackets (`()`, `{}`, `[]`), 6 bracket digraphs
-(`(| |)`, `{| |}`, `[| |]`, `(: :)`, `{: :}`, `[: :]`), 108 bracket trigraphs,
-and so on.
+We do not intend to include any non-ASCII characters as part of Carbon's
+syntax. This leaves 3 single-character brackets (`()`, `{}`, `[]`), 6 bracket
+digraphs (`(| |)`, `{| |}`, `[| |]`, `(: :)`, `{: :}`, `[: :]`), 108 bracket
+trigraphs, and so on.
 
 For example:
 
@@ -325,7 +355,10 @@ For example:
  |)
 ```
 
-TODO: Indentation restrictions for bracket matching?
+Each non-empty line from the line containing an opening bracket to the line
+containing the matching closing bracket (inclusive) shall have at least as much
+indentation as the line containing the opening bracket.
+[[why?]](#bracket-indentation-rationale)
 
 In addition, Carbon recognizes *bracket operators*, formed by a simple open
 bracket followed by one or more operator characters followed by a matching
@@ -334,10 +367,23 @@ not brackets.
 
 ### Tokens
 
-A *token* is a documentation comment, a literal, an identifier, a keyword, a
-designator, an operator, or a bracket. Tokens are formed by a single
-left-to-right scan of the source file, using a "max munch" rule: the longest
-possible next token is formed at each step.
+A *token* is a documentation comment, a literal, an identifier (which might be
+a keyword), a designator, an operator, or a bracket. Tokens are formed by a
+single left-to-right scan of the source file, using a "max munch" rule: the
+longest possible next token is formed at each step, after skipping whitespace
+and comments.
+
+#### Reserved tokens
+
+It is an error if token formation (after skipping whitespace and comments) is
+attempted in the following circumstances:
+
+ * When the first character does not have property `XID_Continue` or
+   `Pattern_Syntax`.
+ * As a special case of the prior bullet, when the first character is an
+   underscore. [[why?]](#underscores-rationale)
+ * When the first two characters are `r#` and neither a raw identifier nor a
+   raw string literal would be formed.
 
 ## Rationale
 
@@ -345,15 +391,22 @@ possible next token is formed at each step.
 
 We intend to follow the Unicode Consortium's recommentations for identifiers in
 programming languages as described in Unicode 13.0.0
-[UAX#31](https://unicode.org/reports/tr31/) Revision 31. We do not see a reason
+[UAX#31](https://unicode.org/reports/tr31/) Revision 33. We do not see a reason
 to be inventive in this regard, and delegating the complex considerations over
 how Unicode characters should be used to a group with greater expertise in that
 area seems appropriate.
 
+As an exception, Carbon permits underscore as a continuation character in
+identifiers. Usage of this character is sufficiently common in C++ identifiers
+that excluding it conflicts with our interoperability goal. However, leading
+underscores are not permitted in Carbon identifiers.
+[[why?]](#underscores-rationale)
+
 We observe UAX#31's requirements as follows:
 
  * UAX31-R1: requirement met. Identifiers are of the form `XID_Start`
-   `XID_Continue`\*.
+   `Continue`\*, using a profile in which `Continue` is `XID_Continue` plus
+   U+005F LOW LINE (`_`).
  * UAX31-R1a: requirement not met. Format characters are not permitted in
    identifiers.
  * UAX31-R1b: requirement not met. We intend for Carbon evolution to follow
@@ -367,9 +420,13 @@ We observe UAX#31's requirements as follows:
    are formed exclusively from `Pattern_Syntax` characters.
  * UAX31-R4: requirement met. Carbon identifiers are required to be in NFC,
    so identifiers that are the same in NFC are equivalent.
- * UAX31-R5: requirement not met. Carbon identifiers are case-sensitive.
+ * UAX31-R5: requirement not met. Carbon identifiers are case-sensitive, so
+   this requirement is inapplicable.
  * UAX31-R6: requirement met. No characters are excluded from normalization.
- * UAX31-R7: requirement not met. Carbon identifiers are case-sensitive.
+ * UAX31-R7: requirement not met. Carbon identifiers are case-sensitive, so
+   this requirement is inapplicable.
+ * UAX31-R8: requirement not applicable. Carbon does not have hashtag
+   identifiers.
 
 ### Line continuation rationale
 
@@ -430,17 +487,60 @@ non-breaking change.
 Block literals are a useful way of expressing multiline string content in a
 program. It's useful to treat block literals and raw literals as distinct
 concepts: even within multiline literals, explicit encoding of tab characters,
-character escapes, and so forth can be useful or undesirable.
+character escapes, and so forth can be useful or undesirable. For example:
+
+```carbon
+var String: raw_code = r"""carbon
+  var String: example = "hello\n\tworld"; // Contains two backslashes.
+  """;
+var String: expanded_code = """carbon
+  \tvar Int: n = 123; // Starts with tab and ends with newline.\n
+  """;
+```
 
 Further, separating the concepts permits us to disallow newlines in non-block
 raw string literals, which prevents one class of runaway lexing problem: the
 inability to find the end of a raw string literal can lead to scanning and
-consuming the entirety of the source file.
+consuming the entirety of the source file. It is better to ask the user to
+explicitly express their intent than to assume that we're in the rare case
+where a missing closing double-quote indicates a multi-line string.
 
 Removing the initial indentation from block string literals serves two primary
 purposes: it allows the lexer to intelligently abort and backtrack sooner if it
-reaches the end of the indented region without seeing the end of string marker,
-and it improves the readability of the code.
+reaches the end of the indented region without seeing the end of string marker
+(essentially enabling detection of runaway block string literals), and it
+improves the readability of the code.
+
+### Underscores rationale
+
+Tokens beginning with an underscore are currently reserved, but are expected to
+be used to represent wildcards in pattern matching. The precise rules are not
+yet determined, so we are not committing to any lexical conventions for such
+tokens yet.
+
+We could permit such tokens as identifiers for now, and claim them back as
+keywords later if needed. There are some reasons not to follow that approach:
+
+ * The semantic interpretation of such tokens may be quite different from that
+   of regular identifiers: they may not be subject to regular name lookup and
+   not required to be declared before being used, but also not part of a
+   predefined finite set of keywords with unique meaning.
+ * We may want to use different lexical rules for underscores, such as treating
+   `_` as a standalone token (even when followed by an identifier) or as an
+   operator character. The evolutionary path for such a change would be
+   challenging.
+
+Underscores in identifiers are common in C++ identifiers, which motivates
+permitting them in Carbon to support our C++ interoperability goal. However,
+leading underscores are rare in publicly-visible C++ identifiers, and result in
+reserved identifiers in many contexts, so we do not have similar motivation to
+permit those.
+
+Leading underscores are used in some C++ code to distinguish member names from
+non-member names. In Carbon, we anticipate all identifiers being declared
+locally and found by a simple lexical lookup rule, so use of leading
+underscores to avoid name collisions should generally be unnecessary. As such,
+we lack a strong motivation to permit such identifiers.
 
 ### Keywords rationale
 
@@ -499,13 +599,13 @@ Treating such tokens as special at the lexical level has other beneficial effect
  * It prohibits whitespace between the period and the identifier. This enforces
    an intended stylistic convention. As a postfix unary operator, we will also
    enforce an absence of whitespace before designators in member access syntax.
- * It frees up `.` for use as a binary operator, should we so desire.
  * It allows extremely fast lexical name lookup for all identifiers via string
    interning: the same lookup that checks whether an identifier is a keyword
    can also perform the complete lexical lookup if it's not an identifier.
  * It permits uniform typo correction, including correction to keywords, for
    all identifiers in all contexts, because we can identify typos from the
    lexing stage before we even reach the parser.
+ * It frees up `.` for use as a binary operator, should we so desire.
 
 ### Operators rationale
 
@@ -538,6 +638,66 @@ Support for compound brackets requires that we make a concession: the bracket
 terminator characters cannot be used within prefix or postfix operators. For
 the two chosen symbols, this is unlikely to present a problem.
 
+### Bracket indentation rationale
+
+We wish for Carbon to provide good error diagnosis and recovery, to support
+tooling and analysis of incomplete source files, and to reject cases where we
+can be confident that the intent of the programmer is not captured by the code.
+To support these goals, we require that the indentation of code reflects the
+logical structure of the code, and one of the ways we achieve this is by
+requiring that the contents of a bracketed region are at least as indented as
+the opening line of that bracketed region. For example, given:
+
+```carbon
+fn f() {
+  if (cond) {
+    // ...
+}
+
+fn g() {
+```
+
+we can be confident that the intention was for the closing brace on line 4 to
+match the opening brace on line 1, not the opening brace on line 2. By
+recognizing this early, we can produce improved diagnostics indicating that a
+closing brace was missing, and tools that perform semantic analysis of
+potentially-incomplete source files can recover by imagining that an additional
+closing brace appeared before line 4.
+
+This indentation rule is insufficient to fully ensure that our interpretation
+of the program matches the programmer's intent and the reader's expectations.
+Generally, we wish for all continuation lines of any grammatical construct to
+be at least as indented as the first line in that construct.
+
 ## Alternatives considered
 
-TODO: Consider alternatives
+### String alternatives
+
+Block string literals could use explicit characters in the body to indicate the
+amount of leading whitespace to be removed:
+
+```carbon
+var String: x = """
+  |  starts with two spaces.
+  """;
+```
+
+This would allow the correct indentation to be determined as soon as the first
+line after the opening `"""` is seen. However, this adds lexical complexity,
+and most of the same benefit can be derived by simply requiring the contents of
+the string literal to be at least as indented as its closing line.
+
+We could choose to include the newline before the terminating `"""` as part of
+the literal contents. However, expectations for whether it should be included
+vary, and appear to be somewhat evenly split between the two options. If the
+terminating newline is included in the string, it would be natural to permit
+the terminating `"""` to not be preceded by a newline, which removes the most
+natural vehicle by which we can determine the proper indentation to remove from
+each line.
+
+### Operators alternatives
+
+We could use a "max munch" rule for operators that is restricted to only
+recognize a known set of Carbon operators. This would permit constructs such as
+`-*p` or `**p` without additional brackets or whitespace. This would improve
+the language ergonomics, but would make language evolution more difficult.

From 7f1c242a5bf6a3580b1365be6aaf8ecaa957141e Mon Sep 17 00:00:00 2001
From: Richard Smith <richard@metafoo.co.uk>
Date: Thu, 21 May 2020 16:41:28 -0700
Subject: [PATCH 06/11] New rules for block comments based on review feedback.

---
 docs/proposals/p0016.md | 170 ++++++++++++++++++++++++++++------------
 1 file changed, 121 insertions(+), 49 deletions(-)
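
Note for readers of this patch: the comment rules introduced below can be
sketched as a small classifier. This is an illustrative Python sketch only,
not part of the proposal. The function name `classify_comment`, the returned
kind strings, and the choice to treat the end of a line like whitespace are
assumptions made for the example.

```python
# Horizontal whitespace per the proposal: space, tab, and the LTR/RTL marks.
HORIZONTAL_WS = " \t\u200e\u200f"
# All characters with the Unicode Pattern_White_Space property.
WHITESPACE = HORIZONTAL_WS + "\n\v\f\r\u0085\u2028\u2029"


def classify_comment(line):
    """Return the comment kind for one source line, or None if the line
    does not begin (after horizontal whitespace) with `//`.

    A `//` that follows other text on the line is an error under the
    proposed rules; this sketch simply reports None for such lines.
    """
    body = line.lstrip(HORIZONTAL_WS)
    if not body.startswith("//"):
        return None
    rest = body[2:]

    def whitespace_at(i):
        # Assumption: running off the end of the line counts as whitespace.
        return i >= len(rest) or rest[i] in WHITESPACE

    if whitespace_at(0):
        return "text"
    if rest[0] in "/!" and whitespace_at(1):
        return "documentation"
    if rest[0] == "{" and whitespace_at(1):
        return "block-open"
    if rest[0] == "}":
        return "block-close"
    return "reserved"
```

The sketch does not check that the text after a closing `//}` matches the
text after its opening `//{`, and it does not tokenize block comment
contents; both would be needed in a real lexer.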

diff --git a/docs/proposals/p0016.md b/docs/proposals/p0016.md
index 337e94dd7707..4d8763d1c231 100644
--- a/docs/proposals/p0016.md
+++ b/docs/proposals/p0016.md
@@ -79,46 +79,113 @@ indentation of *B* is a proper prefix of the indentation of *A*.
 
 ### Comments
 
-A *comment* in Carbon is either:
-
- * A *line comment*, beginning with `//` and running to the end of the line, or
- * A *block comment*, beginning with `/*` and running to the matching `*/`.
-
+A *comment* begins with the characters `//` and runs to the end of the line.
 Carbon has no mechanism for physical line continuation, so a `//` comment
-always ends at the next vertical whitespace character.
+always ends at the next vertical whitespace character. There shall be no text
+other than horizontal whitespace before the `//` characters introducing a
+comment. (Either all of a line is a comment, or none of it.)
 [[why?]](#line-continuation-rationale)
 
-If the character after the `/*` introducing a block comment is `{`, the comment
-is a *code comment*. In a code comment, the following text is tokenized until a
-matching `}*/` token is formed; such a token terminates the comment. (In
-particular, such a token is not recognized if it is nested within another
-comment or a literal.) Otherwise, the comment ends at the first matching `*/`
-character sequence.
-[[why?]](#nested-comments-rationale)
+#### Text comments
+
+If the character after the comment introducer is whitespace (or the end of the
+file), the comment is a *text comment*. Text comments are treated equivalently
+to whitespace.
 
 Example:
 
 ```carbon
-// This is a comment.
-// The characters /* introduce a block comment.
-This is not a comment.
-/*{
-  // The characters */ end a block comment.
-}*/
+// This is a comment and is ignored. \
 This is not a comment.
+
+var Int: x; // error, trailing comments not allowed
+```
+
+#### Documentation comments
+
+If the character after the comment introducer is an exclamation mark or another
+`/`, in either case followed by whitespace, the comment is a documentation
+comment. Documentation comments are tokens, and are recognized by the language
+grammar only in specific locations, which determine the entity to which they
+attach.
+[[why?]](#documentation-comments-rationale)
+
+Example:
+
+```carbon
+//! This is a documentation comment.
+/// So is this.
+fn DocumentedFunction() {}
+
+var Int
+  //! This is an error; a documentation comment cannot appear here.
+  : x;
 ```
 
-If the character after the comment introducer is an exclamation mark, the
-comment is a documentation comment. Documentation comments are tokens, and are
-recognized by the language grammar only in specific locations, which determine
-the entity to which they attach.
-[[why?]](#nested-comments-rationale)
+**Open question:** Should we accept only one or the other kind of documentation
+comment?
+
+#### Block comments
+
+If the character after the comment introducer is an open brace, the comment is
+a block comment. Subsequent lines are tokenized until a comment introducer
+followed by a close brace is found. The tokenization rules are adjusted as
+follows:
 
-Non-documentation comments are treated equivalently to whitespace.
+ * There is no requirement that brackets match.
+ * There is no requirement that code within brackets or text within block
+   string literals is properly indented.
+ * There is no restriction on forming [reserved tokens](#reserved-tokens);
+   instead, if no token can be formed, a placeholder token is formed from the
+   next character of the input line.
+ * There is no restriction on the appearance of [reserved
+   comments](#reserved-comments). Such comments have no effect.
+ * All tokens produced are discarded.
 
-In addition to the cases above, a block comment introducer may be followed by
-additional `*` characters. If the character after the comment introducer is not
-one of those mentioned above, it shall be a whitespace character.
+These rules apply recursively during the search for the closing comment marker;
+block comments nest.
+[[why?]](#block-comments-rationale)
+
+The opening `//{` shall be followed by whitespace. If any characters appear
+between the closing `//}` and the end of its line, that sequence of characters
+shall be the same as the characters between the opening `//{` and the end of
+its line.
+
+Example:
+
+```carbon
+//{ temp
+fn CommentedOutFunction() {
+  // It's OK to include a //} here; it's not a comment introducer so doesn't
+  // end the block comment.
+
+  //{
+    Nested comment.
+  //}
+
+  The single quote in this line doesn't match any of our token production
+  rules. A placeholder token is produced for that character. The token after
+  the placeholder is the identifier 't'.
+
+  // This doesn't end the comment.
+  var String: close_comment_marker = """
+  //}
+  """;
+}
+//}
+
+//{ mismatched
+
+// This is an error due to mismatched closing text.
+//} temp
+
+//{
+error here (not at start of line): //}
+```
+
+#### Reserved comments
+
+Comment introducers that do not have one of the above forms are invalid.
 [[why?]](#comment-introducers-rationale)
 
 ### Literals
@@ -245,6 +312,10 @@ A *character literal* is lexically identical to a simple string literal, except
 that it is enclosed in single quotation marks (`'`) instead of double quotation
 marks (`"`).
 
+**Open question:** Do we need both character literals and string literals?
+Unicode character literals in general require more than one code unit to
+represent, so are somewhat more string-like than character-like.
+
 ### Identifiers
 
 An *identifier* is a maximal sequence of characters beginning with a character
@@ -452,28 +523,12 @@ continuation.
 Line continuation for comments in particular is a known source of gotchas in C
 and C++ programs.
 
-## Nested comments rationale
-
-There is a known need for nested comment syntax: it is important to be able to
-comment out a block of code confident in the knowledge that all text between
-the comment markers (and exactly that text) was in fact commented out.
+### Documentation comments rationale
 
-The only facility for this in C and C++ is `#if 0` ... `#endif`, and that is
-what is used in practice. As we do not want Carbon to have a textual
-preprocessor, enabling nesting of some other comment scheme seems natural.
-
-Conversely, it is valuable to have a comment syntax for human-readable
-commentary that can be used within a line and can span multiple lines, and we
-see no reason to invent something other than `/* ... */` for this purpose. We
-cannot use the same notation as nested comment syntax without compromising its
-usage for human-readable commentary. For example:
-
-```carbon
-f(1.0 / x /* can't be negative */, 2);
-```
-
-... would not be valid as a `/*{ ... }*/` comment, because it would contain an
-unterminated character literal.
+Carbon code is expected to include documentation comments. Specifying such
+comments as part of the language definition allows a consistent interpretation
+of the comments, and a consistent attachment of the comments to entities, to be
+provided for Carbon programs.
 
 ### Comment introducers rationale
 
@@ -482,6 +537,23 @@ future. Reserving syntactic space in comment syntax, in a way that is easy for
 programs to avoid, allows us to add such additional kinds of comment as a
 non-breaking change.
 
+### Block comments rationale
+
+It is important to be able to comment out a block of code confident in the
+knowledge that all text between the comment markers (and exactly that text) was
+in fact commented out. This leads to the following requirements:
+
+ * Block comments must nest.
+ * Closing comment markers in string literals and in any kind of nested comment
+   do not close the outer comment.
+ * Opening comment markers in string literals and in line comments do not
+   introduce an additional unintended level of commenting.
+
+These requirements force us to apply our lexical rules to the contents of block
+comments. However, we would like to accept malformed and partially-formed code
+within a block comment, so we relax the restrictions that we can reasonably
+relax when handling them.
+
 ### Block strings rationale
 
 Block literals are a useful way of expressing multiline string content in a

From 29c5e4acf2e338c5523fad5297f2923499efaef8 Mon Sep 17 00:00:00 2001
From: Richard Smith <richard@metafoo.co.uk>
Date: Thu, 21 May 2020 18:05:16 -0700
Subject: [PATCH 07/11] Add open question from review feedback.

---
 docs/proposals/p0016.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/docs/proposals/p0016.md b/docs/proposals/p0016.md
index 4d8763d1c231..308542efb712 100644
--- a/docs/proposals/p0016.md
+++ b/docs/proposals/p0016.md
@@ -183,6 +183,9 @@ fn CommentedOutFunction() {
 error here (not at start of line): //}
 ```
 
+**Open question:** Can we remove block comments, and require potential users of
+them to add `//` to all affected lines instead?
+
 #### Reserved comments
 
 Comment introducers that do not have one of the above forms are invalid.

From b94320e82c44c36a0b6702f75ea0843e9f518ea2 Mon Sep 17 00:00:00 2001
From: Richard Smith <richard@metafoo.co.uk>
Date: Thu, 21 May 2020 18:07:16 -0700
Subject: [PATCH 08/11] Apply suggestions from code review

Co-authored-by: austern <austern@google.com>
Co-authored-by: Jon Meow <46229924+jonmeow@users.noreply.github.com>
---
 docs/proposals/p0016.md | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)
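
Note for readers of this patch: the block string examples touched below rely
on the rule that the closing line's indentation is removed from each
non-empty content line. A minimal Python sketch of that rule follows. It is
illustrative only: `block_string_content` is a hypothetical helper, an
"empty" line is assumed to be a zero-length string, and escape sequences are
left unexpanded.

```python
# Horizontal whitespace per the proposal: space, tab, and the LTR/RTL marks.
HORIZONTAL_WS = " \t\u200e\u200f"


def block_string_content(content_lines, closing_line):
    """Form the content of a block string literal.

    content_lines: the lines between the opening and closing lines,
        without line terminators.
    closing_line: the line containing the closing triple quote.
    """
    stripped = closing_line.lstrip(HORIZONTAL_WS)
    closing_indent = closing_line[: len(closing_line) - len(stripped)]
    result = []
    for line in content_lines:
        if line == "":
            # Empty content lines are exempt from the indentation rule.
            result.append("")
        elif line.startswith(closing_indent):
            result.append(line[len(closing_indent):])
        else:
            raise ValueError("content line is less indented than closing line")
    # A line feed is added between each pair of lines, not after the last.
    return "\n".join(result)
```

Applied to the `starts_with_whitespace` example, the content lines are
indented by six spaces and the closing line by four, so each line of the
result begins with two spaces.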

diff --git a/docs/proposals/p0016.md b/docs/proposals/p0016.md
index 308542efb712..b160f0ca5521 100644
--- a/docs/proposals/p0016.md
+++ b/docs/proposals/p0016.md
@@ -74,7 +74,7 @@ contains two lines, and "foo\n\rbar" contains three, of which the middle line
 is empty.
 
 The *indentation* of a line is the sequence of horizontal whitespace characters
-at the start of the line. A line *A* has more indentation than a line *B* the
+at the start of the line. A line *A* has more indentation than a line *B* if the
 indentation of *B* is a proper prefix of the indentation of *A*.
 
 ### Comments
@@ -278,10 +278,10 @@ For example:
 
 ```carbon
 fn f() {
-  """
+  var String: w = """
   This is a string literal. Its first character is 'T' and its last character
   is '.'. It contains one embedded newline, between 'character' and 'is'.
-  """
+  """;
   var String: x = r#"""
     This is the content of the string. The 'T' is the first character
     of the string.
@@ -296,6 +296,7 @@ fn f() {
     is a block string literal with file type 'This' and first character 'i'.
     """#;
 
+  // This is a block string literal. Its first two characters are spaces, and its last character is '.'. It has a file type of 'c++'.
   var String: starts_with_whitespace = """c++
       int x = 1; // This line starts with two spaces.
       int y = 2; // This line starts with two spaces.

From cd7335ffac737e513b0f694ebc00fa7f41d44100 Mon Sep 17 00:00:00 2001
From: Richard Smith <richard@metafoo.co.uk>
Date: Thu, 11 Jun 2020 23:53:07 -0700
Subject: [PATCH 09/11] Updates from review feedback.

---
 docs/proposals/p0016.md | 507 +++++++++++++++++++++++++++-------------
 1 file changed, 344 insertions(+), 163 deletions(-)
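
Note for readers of this patch: the indentation comparison clarified below
is a prefix relation, not a width comparison. A minimal Python sketch,
illustrative only, follows. The helper names are hypothetical, and
`at_least_as_indented` assumes "at least as indented" means the same prefix
relation without the requirement that the prefix be proper.

```python
# Horizontal whitespace per the proposal: space, tab, and the LTR/RTL marks.
HORIZONTAL_WS = " \t\u200e\u200f"


def indentation(line):
    """Return the leading run of horizontal whitespace of a line."""
    return line[: len(line) - len(line.lstrip(HORIZONTAL_WS))]


def more_indented(a, b):
    """True if the indentation of b is a proper prefix of that of a."""
    ia, ib = indentation(a), indentation(b)
    return len(ia) > len(ib) and ia.startswith(ib)


def at_least_as_indented(a, b):
    """True if the indentation of b is a (possibly equal) prefix of that of a."""
    return indentation(a).startswith(indentation(b))
```

With these definitions, a line indented by two tabs and a line indented by a
tab and a space are incomparable: neither is more indented than the other,
which matches the note added to the whitespace section.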

diff --git a/docs/proposals/p0016.md b/docs/proposals/p0016.md
index b160f0ca5521..8a4e6198cc28 100644
--- a/docs/proposals/p0016.md
+++ b/docs/proposals/p0016.md
@@ -22,26 +22,44 @@ Carbon source file: interpreting the contents of the file and forming
 ## Proposal
 
 Carbon source files are [UTF-8](#file-contents-and-encoding) text files whose
-contents are divided into [whitespace](#whitespace), [comments](#comments),
-[literals](#literals), [identifiers](#identifiers) (including
-[keywords](#keywords)), [designators](#designators), [operators](#operators),
-and [brackets](#brackets), as described below.
+contents are divided into *lexical elements*: [whitespace](#whitespace),
+[comments](#comments), [literals](#literals), [words](#words)
+(including [identifiers](#identifiers) and [keywords](#keywords)),
+[designators](#designators), [operators](#operators), and
+[brackets](#brackets), as described in the [lexing](#lexing) section below.
 
-## Details
+Lexical elements are formed by a single left-to-right scan of the source file,
+using a "max munch" rule: the longest possible next lexical element is formed
+at each step.
+
+After division into these components, whitespace and text and block comments
+are discarded, and words are classified as either identifiers or keywords; the
+remaining lexical elements are [tokens](#tokens), and the result is the
+*tokenized* form of the source file, which is the input to the parsing step.
+
+## Lexing
 
 ### File contents and encoding
 
 Carbon source files are Unicode text files encoded in UTF-8. An initial UTF-8
-BOM is permitted and ignored. All contents outside of [comments](#comments) and
-[literals](#literals) shall be in Normalization Form C.
+BOM is permitted and ignored.
 [[why?]](#encoding-rationale)
 
+For implementation simplicity, all contents outside of [comments](#comments)
+and [literals](#literals) shall be in Normalization Form C. The Carbon formatter
+tool will convert source text to NFC as necessary to satisfy this constraint.
+[[why?]](#formatting-rationale)
+
 Carbon is currently based on Unicode 13.0, and will adopt new Unicode versions
 as they are published.
 
+**Open question:** Should we require source text to be in NFC, as C++ plans to
+do (and GCC currently enforces with a warning), or should we normalize non-NFC
+identifiers ourselves?
+
 ### Whitespace
 
-Characters are identified as whitespace if they have the Unicode
+Characters are identified as whitespace if and only if they have the Unicode
 `Pattern_White_Space` property. These include the ASCII whitespace characters (recognized in C++):
 
  * Space and horizontal tab
@@ -56,9 +74,9 @@ As of Unicode version 13, 5 additional characters are included:
  * U+2028 LINE SEPARATOR
  * U+2029 PARAGRAPH SEPARATOR
 
-Space, horizontal tab, and the LTR and RTL marks are *horizontal whitespace*
-characters. All other whitespace characters are *vertical whitespace*
-characters.
+Space, horizontal tab, and the LTR and RTL marks (see
+[directionality](#directionality)) are *horizontal whitespace* characters. All
+other whitespace characters are *vertical whitespace* characters.
 
 Characters with the Unicode property `White_Space` but not
 `Pattern_White_Space` are invalid outside comments and literals. Code
@@ -75,22 +93,41 @@ is empty.
 
 The *indentation* of a line is the sequence of horizontal whitespace characters
 at the start of the line. A line *A* has more indentation than a line *B* if the
-indentation of *B* is a proper prefix of the indentation of *A*.
+indentation of *B* is a proper prefix of the indentation of *A*. (Note that
+neither of "\t\t" and "\t " is considered more indented than the other.)
+
+**Open question:** Should we be more opinionated on whitespace? We could
+potentially disallow everything other than space and newline (and, depending
+on what we decide for [directionality](#directionality), perhaps LTR marks),
+which would lead to a substantially simpler indentation rule.
 
 ### Comments
 
 A *comment* begins with the characters `//` and runs to the end of the line.
-Carbon has no mechanism for physical line continuation, so a `//` comment
-always ends at the next vertical whitespace character. There shall be no text
-other than horizontal whitespace before the `//` characters introducing a
-comment. (Either all of a line is a comment, or none of it.)
+Carbon has no mechanism for physical line continuation, so a trailing `\` does
+not extend a comment to subsequent lines. There shall be no text other than
+horizontal whitespace before the `//` characters introducing a comment. (Either
+all of a line is a comment, or none of it.)
 [[why?]](#line-continuation-rationale)
 
+The *kind* of a comment is determined by the character(s) after the `//`
+characters as follows:
+
+ * whitespace: the comment is a [text comment](#text-comments)
+ * `/` or `!` followed by whitespace: the comment is a [documentation
+   comment](#documentation-comments)
+ * `\{` or `\}`: the comment is an opening or closing [block
+   comment](#block-comments) delimiter, respectively
+ * anything else: the input is invalid
+
+For the purpose of the above rule, the end of the file is considered to be
+whitespace. The `//` characters followed by the above additional characters
+form the *comment introducer*.
+
 #### Text comments
 
-If the character after the comment introducer is whitespace (or the end of the
-file), the comment is a *text comment*. Text comments are treated equivalently
-to whitespace.
+A *text comment* is a comment introduced by `//` followed by whitespace. Text
+comments do not result in tokens.
 
 Example:
 
@@ -103,11 +140,10 @@ var Int: x; // error, trailing comments not allowed
 
 #### Documentation comments
 
-If the character after the comment introducer is an exclamation mark or another
-`/`, in either case followed by whitespace, the comment is a documentation
-comment. Documentation comments are tokens, and are recognized by the language
-grammar only in specific locations, which determine the entity to which they
-attach.
+A *documentation comment* is a comment introduced by `//` followed by either an
+exclamation mark or another `/`, in either case followed by whitespace.
+Documentation comments are tokens, and are recognized by the language grammar
+only in specific locations, which determine the entity to which they attach.
 [[why?]](#documentation-comments-rationale)
 
 Example:
@@ -125,67 +161,61 @@ var Int
 **Open question:** Should we accept only one or the other kind of documentation
 comment?
 
+ - In favor of `///`: it is easier to type, and likely to be more comfortable to
+   read, especially for larger comment blocks.
+ - In favor of `//!`: it is less likely to be confused with `//`; writing `//`
+   where `///` is intended is a common error in C++ code using Doxygen.
+
 #### Block comments
 
-If the character after the comment introducer is an open brace, the comment is
-a block comment. Subsequent lines are tokenized until a comment introducer
-followed by a close brace is found. The tokenization rules are adjusted as
-follows:
-
- * There is no requirement that brackets match.
- * There is no requirement that code within brackets or text within block
-   string literals is properly indented.
- * There is no restriction on forming [reserved tokens](#reserved-tokens);
-   instead, if no token can be formed, a placeholder token is formed from the
-   next character of the input line.
- * There is no restriction on the appearance of [reserved
-   comments](#reserved-comments). Such comments have no effect.
- * All tokens produced are discarded.
-
-These rules apply recursively during the search for the closing comment marker;
-block comments nest.
+An *opening block comment line* is a line starting with `//\{`, with no
+indentation. A *closing block comment line* is a line starting with `//\}`,
+with no indentation. A comment starting with `//\{` or `//\}` shall form an
+opening or closing block comment line. Opening and closing block comment lines
+can only appear as part of block comments. (In particular, these lines cannot
+appear within block string literals.)
+
+A *block comment* is a comment that starts with an opening block comment line and
+ends with a closing block comment line. Block comments nest: every line for
+which the total number of preceding opening block comment lines is greater than
+the total number of preceding closing block comment lines is part of a block
+comment.
 [[why?]](#block-comments-rationale)
+[[alternatives]](#block-comment-alternatives)
 
-The opening `//{` shall be followed by whitespace. If any characters appear
-between the closing `//}` and the end of its line, that sequence of characters
-shall be the same as the characters between the opening `//{` and the end of
-its line.
+If any characters appear between the closing `//\}` and the end of its line,
+that sequence of characters shall be the same as the characters between the
+opening `//\{` and the end of its line.
 
 Example:
 
 ```carbon
-//{ temp
+//\{ temp
 fn CommentedOutFunction() {
-  // It's OK to include a //} here; it's not a comment introducer so doesn't
-  // end the block comment.
+  // It's OK to include a //\} in the middle of this comment; it's not a
+  // comment introducer so doesn't end the block comment.
 
-  //{
-    Nested comment.
-  //}
+  //\} is not a closing block comment line, so doesn't end the comment.
 
-  The single quote in this line doesn't match any of our token production
-  rules. A placeholder token is produced for that character. The token after
-  the placeholder is the identifier 't'.
+//\{
+    Nested comment.
+//\}
 
-  // This doesn't end the comment.
-  var String: close_comment_marker = """
-  //}
+  var String: closing_comment_marker = r"""
+  //\}
   """;
 }
-//}
+//\}
 
-//{ mismatched
+//\{ mismatched
 
 // This is an error due to mismatched closing text.
-//} temp
+//\} temp
 
-//{
-error here (not at start of line): //}
+// The next line is an error because the //\{ is not at the start of the line.
+  //\{
 ```
 
-**Open question:** Can we remove block comments, and require potential users of
-them to add `//` to all affected lines instead?
-
 #### Reserved comments
 
 Comment introducers that do not have one of the above forms are invalid.
@@ -200,8 +230,8 @@ predeclared identifiers.)
 A *literal* is a numeric literal, a character literal, or a string literal, as
 defined below.
 
-A literal shall not be immediately followed an [identifier continuation
-character](#identifiers). Carbon has no literal suffixes, but the corresponding
+A literal shall not be immediately followed by a [word continuation
+character](#words). Carbon has no literal suffixes, but the corresponding
 lexical space is reserved for future extensions.
 
 #### Numbers
@@ -213,28 +243,31 @@ Integers in other bases are written as a `0` followed by a base specifier
 character, followed by a sequence of digits in the corresponding base. The
 available base specifiers and corresponding bases are:
 
-| Base specifier | Base | Digits                                |
-| -------------- | ---- | ------------------------------------- |
-| `b` or `B`     | 2    | `0` and `1`                           |
-| `o`            | 8    | `0` ... `7`                           |
-| `x` or `X`     | 16   | `0` ... `9`, `a` ... `f`, `A` ... `F` |
+| Base specifier | Base | Digits                   |
+| -------------- | ---- | ------------------------ |
+| `b`            | 2    | `0` and `1`              |
+| `o`            | 8    | `0` ... `7`              |
+| `x`            | 16   | `0` ... `9`, `A` ... `F` |
 
-[TODO: This doesn't belong here.] There are no size suffixes. Each literal has
-a unique type that can be converted to any sufficiently-large integer type, but
-operations on it are always exact.
+Note that the above table is case-sensitive. `0O123` is invalid, as is `0Xa`.
+[[why?]](#integers-rationale)
 
 Real numbers are written as a sequence of one or more decimal digits followed
 by a decimal point followed by a sequence of one or more decimal digits.
+[[why?]](#real-numbers-rationale)
 
 A real number can be followed by an `e`, an optional `+` or `-` (defaulting to
-`+`), and a decimal integer *N*; the effect is to multiply the given value by
+`+`), and a character sequence matching the grammar of a decimal integer with
+some value *N*; the effect is to multiply the given value by
 10<sup>&plusmn;*N*</sup>.
 
 A *numeric literal* is an integer or real number expressed as described above.
 
-#### Strings
+**Open question:** Should we allow digit separators? With what lexical syntax?
+Should we require them to be evenly spaced? Spaced "naturally" (groups of 3 for
+decimal, some power of 2 for octal and binary)?
 
-[[alternatives]](#string-alternatives)
+#### Strings
 
 A *simple string literal* is formed of a sequence of
 
@@ -245,7 +278,8 @@ A *simple string literal* is formed of a sequence of
 enclosed in double quotation marks (`"`). Each escape sequence is replaced with
 the corresponding character sequence or encoding.
 
-TODO: Table of escape sequences.
+TODO: Add a table of escape sequences. `\{` and `\}` are invalid, in order to
+avoid ambiguities between (non-raw) block string literals and block comments.
 
 A *raw string literal* starts with an `r` followed by *N* `#` characters
 followed by a double quotation mark, and ends with the first following
@@ -265,9 +299,10 @@ indentation of the closing line from each (non-empty) content line, and
 concatenating the results with a line feed character added between each pair of
 lines.
 [[why?]](#block-strings-rationale)
+[[alternatives]](#string-alternatives)
 
-A *file type indicator* is a sequence of characters that are either [identifier
-continuation characters](#identifiers) or [operator characters](#operators).
+A *file type indicator* is a sequence of characters that are either [word
+continuation characters](#words) or [operator characters](#operators).
 
 A *raw block string literal* is expressed analogously to a raw string literal,
 but for a block string literal. Escape sequences are ignored, but indentation
@@ -290,8 +325,10 @@ fn f() {
   var String: y = r"Hello\"; // OK, final character is \
   var String: z = r##"Raw strings r#"nesting"#"##;
 
-  // This string starts and ends with two "s.
+  // The contents of this string start and end with exactly two "s.
   var String: ambig1 = r#"""This is a raw string literal starting with """#;
+  // This string is a raw block string literal with file type 'This',
+  // whose contents start with "is a ".
   var String: ambig2 = r#"""This
     is a block string literal with file type 'This' and first character 'i'.
     """#;
@@ -310,6 +347,15 @@ error: insufficiently indented.
 }
 ```
 
+A raw block string literal is required to have non-empty indentation to avoid
+ambiguity with block comments.
+[[why?]](#block-comments-rationale)
+
+**Open question:** Should we only require raw string literals containing an
+opening or closing block comment line to be indented? (An equivalent but
+perhaps simpler formulation of the alternative rule: opening and closing block
+comment lines are disallowed in block string literals.)
+
 #### Characters
 
 A *character literal* is lexically identical to a simple string literal, except
@@ -320,38 +366,22 @@ marks (`"`).
 Unicode character literals in general require more than one code unit to
 represent, so are somewhat more string-like than character-like.
 
-### Identifiers
+### Words
 
-An *identifier* is a maximal sequence of characters beginning with a character
-with Unicode property `XID_Start`, followed by zero or more *identifier
-continuation characters*, which are characters that either have property
-`XID_Continue` or are underscores (`_`).
+A *word* is a maximal sequence of characters beginning with a character
+with Unicode property `XID_Start`, followed by zero or more *word continuation
+characters*, which are characters that either have property `XID_Continue` or
+are underscores (`_`). A [raw identifier](#raw-identifiers), described below,
+is also lexically a word.
 
 Notably, `XID_Start` does not include the underscore character. Tokens
 beginning with an underscore are [reserved](#reserved-tokens).
 [[why?]](#underscores-rationale)
 
-Additionally, a *raw identifier* can be specified by prefixing an identifier
-with `r#`, such as `r#requires`. Raw identifiers can be used to introduce and
-use names that are lexically identical to keywords.
-[[why?]](#keywords-rationale)
-
-All identifier tokens in all contexts are looked up using the same lexical
-scoping rule.
-
-An identifier shall not be immediately followed by a `"` or `'`.
-
-#### Keywords
-
-A *keyword* is an identifier with predefined meaning. Carbon has a predefined
-set of keywords, that will be specified separately as part of the syntax rules.
-
-An identifier that is a keyword may also be declared explicitly in a source
-file. The same identifier shall not be used as both a keyword and as a non-raw
-non-keyword identifier in a single source file. As a consequence of these
-rules, from a lexical standpoint there is no notion of keywords -- whether a
-given identifier is a keyword depends on the syntactic structure of the source
-file.
+A word is interpreted as either a keyword or an identifier. If a word is ever
+declared within a source file, then it is interpreted as an identifier
+throughout that source file; otherwise, it is interpreted as a keyword
+throughout that source file.
 [[why?]](#keywords-rationale)
 
 Example:
@@ -362,11 +392,40 @@ fn f() {}        // error, 'fn' is not a keyword in this source file
 interface var {} // error, already used 'var' as a keyword in this source file
 ```
 
+A word shall not be immediately followed by a `"` or `'`.
+
+#### Identifiers
+
+Identifier tokens can appear in two different contexts: they either declare the
+identifier, binding it to an entity, or they reference an entity that has
+already been declared. Carbon's grammatical rules will make it straightforward
+to locally distinguish between these two cases.
+
+All identifier tokens in all contexts that are referencing a prior declaration
+are looked up using the same lexical scoping rule.
+
+#### Raw identifiers
+
+A *raw identifier* can be specified by prefixing a word with `r#`, such as
+`r#requires`. Raw identifiers can be used to introduce and use names that are
+lexically identical to keywords. The declaration of a raw identifier does not
+prevent the base word from being interpreted as a keyword; otherwise, a raw
+identifier behaves identically to the word formed by removing the `r#` prefix.
+[[why?]](#keywords-rationale)
+
+#### Keywords
+
+A *keyword* is a word with predefined meaning. Carbon has a predefined set of
+keywords, that will be specified separately as part of the syntax rules, and
+that is expected to grow over time.
+
+We intend to restrict keywords to the characters `a` ... `z` and `_`.
+
 ### Designators
 
-A *designator* is a token formed by prefixing an identifier with a period
-character, such as `.member`. The identifier after the period is the *member
-name*, and is looked up in a context-dependent manner.
+A *designator* is a token formed by prefixing a word with a period character,
+such as `.member`. The word after the period is the *member name*, and is
+looked up in a context-dependent manner.
 [[why?]](#designators-rationale)
 
 ### Operators
@@ -376,6 +435,8 @@ An *operator* is a maximal sequence of characters with Unicode property
 `Pe` (for which, see [brackets](#brackets)), which we will refer to as
 *operator characters*.
 [[why?]](#operators-rationale)
+[[alternatives]](#operators-alternatives)
+
 We do not intend to define any operators containing non-ASCII characters. The
 ASCII operator characters are:
 
@@ -389,35 +450,44 @@ operator characters, 400 digraphs, and so on.
 
 Bracket operators, described below, are also operators.
 
+**Open question:** Instead of the "max munch" rule described here, should we
+only lex operators that actually exist? For example, this would mean that `**p`
+is lexed as three tokens (`*`, `*`, `p`) rather than two (nonexistent `**`,
+`p`).
+
 ### Brackets
 
 A *simple open bracket* is a character with Unicode property `Pattern_Syntax`
-and character class `Ps`, such as `(` or `[`.
+and character class `Ps`. We intend to restrict Carbon syntax to ASCII, leaving three such characters: `(`, `[`, and `{`.  
 A *simple close bracket* is a character
-with Unicode property `Pattern_Syntax` and character class `Pe`, such as `}`.
-A *bracket terminator character* is one of `|` or `:`.
+with Unicode property `Pattern_Syntax` and character class `Pe`. There are
+three such characters in ASCII: `)`, `]`, and `}`.  
+A *bracket terminator character* is one of `|` or `:`.  
 A *bracket continuation character* is an operator character that is not a
-bracket terminator character.
+bracket terminator character. Restricted to ASCII, that is one of:
+```
+!  #  %  &  *  +  -  .  /  ;  <  =  >  ?  @  \  ^  ~
+```
 
 A *compound open bracket* is a simple open bracket followed by zero or more
 bracket continuation characters followed by a bracket terminator character,
-such as `[:`.
+such as `[:`.  
 A *compound close bracket* is a bracket terminator character followed by zero
 or more bracket continuation characters followed by a simple close bracket,
-such as `|=)`.
+such as `|=)`.  
 [[why?]](#compound-brackets-rationale)
 
 An *open bracket* is either a simple open bracket or a compound open bracket.
 A *close bracket* is either a simple close bracket or a compound close bracket.
 
 The close bracket matching an open bracket is formed by reversing the character
-sequence in the open bracket and replacing each caracter with class `Ps` with
+sequence in the open bracket and replacing each character with class `Ps` with
 the corresponding character with class `Pe`. Every open bracket is required to
 have a matching close bracket such that the bracketed regions form a tree
 structure.
 
-We do not intend to include any non-ASCII characters as part of Carbon's
-syntax. This leaves 3 single-character brackets (`()`, `{}`, `[]`), 6 bracket
+Because we do not intend to include any non-ASCII characters as part of Carbon's
+syntax, there are 3 single-character brackets (`()`, `{}`, `[]`), 6 bracket
 digraphs (`(| |)`, `{| |}`, `[| |]`, `(: :)`, `{: :}`, `[: :]`), 108 bracket
 trigraphs, and so on.
 
@@ -440,18 +510,10 @@ bracket followed by one or more operator characters followed by a matching
 simple close bracket, such as `[~>]` or `(*)`. Bracket operators are operators,
 not brackets.
 
-### Tokens
-
-A *token* is a documentation comment, a literal, an identifier (which might be
-a keyword), a designator, an operator, or a bracket. Tokens are formed by a
-single left-to-right scan of the source file, using a "max munch" rule: the
-longest possible next token is formed at each step, after skipping whitespace
-and comments.
+### Reserved lexical elements
 
-#### Reserved tokens
-
-It is an error if token formation (after skipping whitespace and comments) is
-attempted in the following circumstances:
+It is an error if an attempt is made to form a lexical element in the following
+circumstances:
 
  * When the first character does not have property `XID_Continue` or
    `Pattern_Syntax`.
@@ -460,21 +522,26 @@ attempted in the following circumstances:
  * When the first two characters are `r#` and neither a raw identifier nor a
    raw string literal would be formed.
 
+### Tokens
+
+A *token* is a documentation comment, a literal, a keyword, an identifier, a
+designator, an operator, or a bracket.
+
 ## Rationale
 
 ### Encoding rationale
 
-We intend to follow the Unicode Consortium's recommentations for identifiers in
-programming languages as described in Unicode 13.0.0
-[UAX#31](https://unicode.org/reports/tr31/) Revision 33. We do not see a reason
-to be inventive in this regard, and delegating the complex considerations over
-how Unicode characters should be used to a group with greater expertise in that
-area seems appropriate.
+We intend for words in Carbon to follow the Unicode Consortium's
+recommendations for identifiers in programming languages as described in
+Unicode 13.0.0 [UAX#31](https://unicode.org/reports/tr31/) Revision 33. We do
+not see a reason to be inventive in this regard, and delegating the complex
+considerations over how Unicode characters should be used to a group with
+greater expertise in that area seems appropriate.
 
 As an exception, Carbon permits underscore as a continuation character in
-identifiers. Usage of this character is sufficiently common in C++ identifiers
-that excluding it conflicts with our interoperability goal. However, leading
-underscores are not permitted in Carbon identifiers.
+words. Usage of this character is sufficiently common in C++ identifiers that
+excluding it conflicts with our interoperability goal. However, leading
+underscores are not permitted in Carbon words.
 [[why?]](#underscores-rationale)
 
 We observe UAX#31's requirements as follows:
@@ -503,6 +570,24 @@ We observe UAX#31's requirements as follows:
  * UAX31-R8: requirement not applicable. Carbon does not have hashtag
    identifiers.
 
+### Formatting rationale
+
+We expect Carbon to ship with a code auto-formatter that is used routinely as
+part of all Carbon code development. There is a very low burden on the
+programmer from requiring that source code be in compliance with formatting
+decisions made by the formatter: at worst, we'd expect them to see a diagnostic
+instructing them to run `carbon-format`, but in most cases this should happen
+before the code gets to the compiler (perhaps as an on-save hook in their
+editor, and/or bound to a keyboard shortcut used while editing code).
+
+We can realize useful benefits by relying on code being properly-formatted, if
+"formatting" is interpreted suitably generally. For example, we can ensure that
+the code's appearance matches its meaning in many cases (avoiding both
+deliberate and accidental problems) by ensuring that Unicode left-to-right marks
+are used where necessary, that identifiers are properly normalized, and so on,
+and we can simplify our implementation somewhat by only permitting input in a
+single Unicode normalization form.
+
 ### Line continuation rationale
 
 Line continuation in C++ is sometimes necessary in order to combine the needs
@@ -541,22 +626,93 @@ future. Reserving syntactic space in comment syntax, in a way that is easy for
 programs to avoid, allows us to add such additional kinds of comment as a
 non-breaking change.
 
-## Block comments rationale
+### Block comments rationale
 
-It is important to be able to comment out a block of code confident in the
+It is important to be able to comment out a block of Carbon code confident in the
 knowledge that all text between the comment markers (and exactly that text) was
 in fact commented out. This leads to the following requirements:
 
  * Block comments must nest.
- * Closing comment markers in string literals and in any kind of nested comment
-   do not close the outer comment.
+ * Closing comment markers in string literals and in any kind of nested
+   comment do not close the outer comment.
  * Opening comment markers in string literals and in line comments do not
    introduce an additional unintended level of commenting.
 
-These requirements force us to apply our lexical rules to the contents of block
-comments. However, we would like to accept malformed and partially-formed code
-within a block comment, so we relax the restrictions that we can reasonably
-relax when handling them.
+In addition, block comment syntax should not require lexing the contents of the
+comment. Therefore we need to disallow the block comment closing syntax from
+appearing in other tokens (in particular, in block string literals). There are
+at least two reasonable ways to do this:
+
+ * Require block comment opening and closing lines to be unindented and require
+   string literals to be indented.
+ * Pick a syntax for block comment opening and closing lines that cannot appear
+   in string literals.
+
+Our chosen approach combines these alternatives: the `\{` and `\}` in a block
+comment are both invalid in non-raw string literals (they can be expressed as
+`\\{` or `\\}` if desired). However, we cannot disallow these comment markers
+in raw string literals without harming their ability to represent arbitrary
+text. Therefore we require all raw block string literals to be indented at least
+one space.
+
+### Integers rationale
+
+Carbon requires a base specifier for octal numbers. A common error in C
+and C++ code is using a leading 0 to align numbers horizontally:
+
+```cpp
+int vals[] = {
+  1234,
+  0567,
+  8912
+};
+```
+
+But this leads to the `0567` being interpreted as an octal number. We
+could treat numbers with a leading `0` as decimal always, but this also
+risks confusing programmers who are familiar with the C and C++ rules.
+Therefore we require a base specifier for octal, and reject any number that
+starts with a leading `0` but no base specifier (other than `0` itself),
+to keep the rule simple.
+
+The base specifier (if present) is required to be written in lowercase.
+For binary, this avoids a visual confusion between `08` and `0B`. For
+octal, it avoids a visual confusion between `0O` and `00`. And in
+general, being opinionated here has very little cost and removes a
+possible style argument. We require the hexadecimal digits in an
+`0x`-prefixed number to be written in uppercase to keep them visually
+distinct from the prefix, in the case where `A` ... `F` would follow
+the prefix (it is easier to visually separate the digits from the base
+specifier in `0xAB23` than in `0xab23`).
+
+### Real numbers rationale
+
+Real numbers in Carbon always require a decimal point, and require at
+least one digit on each side of the decimal point.
+
+In C and C++, the decimal point may be omitted in a number with an
+exponent, such as `1e6`, but a common source of errors is imagining
+that this syntax produces an integer literal rather than a
+floating-point literal.
+
+Requiring a digit on both sides of the decimal point improves
+readability and avoids style arguments. In addition, disallowing a
+literal from beginning with a period followed by a digit frees up `.0`
+for future use as a designator for tuple indexing, and similarly
+`4.ToString()` unambiguously lexes as an integer followed by a
+designator followed by two parentheses.
+
+This rationale assumes that we will permit the initialization of a
+floating-point variable with an integer literal. If we choose to
+disallow that, concerns have been raised that permitting `1.` instead
+of `1.0` may be desirable for ergonomic reasons.
+
+See also the section on [floating-point
+literals](https://google.github.io/styleguide/cppguide.html#Floating_Literals)
+in the Google style guide, which argues for the same rule.
+
+As with base specifiers for integers, the `e` introducing an exponent
+is required to be lowercase to improve readability.
 
 ### Block strings rationale
 
@@ -606,11 +762,11 @@ keywords later if needed. There are some reasons not to follow that approach:
    operator character. The evolutionary path for such a change would be
    challenging.
 
-Underscores in identifiers are common in C++ identifiers, which motivates
-permitting them in Carbon to support our C++ interoperability goal. However,
-leading underscores are rare in publicly-visible C++ identifiers, and result in
-reserved identifiers in many contexts, so we do not have similar motivation to
-permit those.
+Underscores are common in C++ identifiers, which motivates permitting them in
+Carbon to support our C++ interoperability goal. However, leading underscores
+are rare in publicly-visible C++ identifiers, and result in reserved
+identifiers in many contexts, so we do not have similar motivation to permit
+those.
 
 Leading underscores are used in some C++ code to distinguish member names from
 non-member names. In Carbon, we anticipate all identifiers being declared
@@ -630,18 +786,20 @@ automatable migration cost on the code that intends to use the new feature.
 The proposed approach to keywords intends to support such a migration story.
 Adding new keywords to Carbon is a non-breaking change. Because every
 identifier is locally declared using obvious syntax before it is used, it is
-straightforward to detect, using simple rules, whether a particular identifier
-is a keyword or not in a particular source file.
+straightforward to detect, using simple rules, whether a particular word is a
+keyword or not in a particular source file.
 
 Using a new keyword in an existing source file requires first replacing all
-existing uses of that identifier with raw identifiers throughout the source
-file, which is a mechanical, automatable change.
+existing uses of that word with raw identifiers throughout the source file,
+which is a mechanical, automatable change.
 
 For identifiers whose scopes are constrained to a single source file, raw
 identifiers are not necessary to permit such a transition. However, for
-identifiers that are declared in one source file and consumed in another, we
+identifiers that are declared in one source file and redeclared in another, we
 still need a mechanism to continue declaring a name as an identifier after it
-has been claimed as a keyword.
+has been claimed as a keyword. (Use of an identifier from a different source
+file, or at the very least from a different package, is expected to typically
+require use of a designator rather than a word.)
 
 Note that while this means that adding a new keyword is cheap in terms of
 migration cost, we should still think of adding a keyword as being a
@@ -649,10 +807,18 @@ significant undertaking, as each keyword will occupy space in the mind of the
 Carbon programmer. However, we should not feel any pressure to reuse the same
 keyword for distinct purposes.
 
-This approach brings one important restriction: in any syntax that introduces
-an identifier, there should never be an optional keyword preceding the
-identifier, and nor should the identifier be optional if it can be followed by
-a keyword.
+This approach brings one important restriction: in any syntax that declares
+an identifier, it should always be straightforward to determine the identifier
+that is being introduced, even if it is lexically identical to a keyword. In
+particular, there should never be an optional keyword preceding the identifier,
+and nor should the identifier be optional if it can be followed by a keyword.
+
+It should be noted that this approach also introduces a novel risk of
+underhanded code that appears to mean one thing but means a different thing, by
+shadowing a keyword with an identifier. This risk is discussed in [Initial
+Analysis of Underhanded Source Code (Wheeler
+2020)](https://www.ida.org/-/media/feature/publications/i/in/initial-analysis-of-underhanded-source-code/d-13166.ashx)
+(page 4-2).
 
 ### Designators rationale
 
@@ -747,6 +913,21 @@ be at least as indented as the first line in that construct.
 
 ## Alternatives considered
 
+### Block comment alternatives
+
+We considered several options for block comments. Our primary goal
+was to permit commenting out a large body of Carbon code, which may or may not
+be well-formed. Alternatives considered included:
+
+ * Fully line-oriented block comments, which would remove lines without regard
+   for whether they are nested within a string literal, with the novel feature
+   of allowing some of the contents of a block string literal to be commented
+   out.
+ * Fully lexed block comments, in which a token sequence between the opening
+   and closing comment marker is produced and discarded, with the lexing rules
+   relaxed somewhat to avoid rejecting ill-formed code. This would be analogous
+   to C and C++'s `#if 0` ... `#endif`.
+
 ### String alternatives
 
 Block string literals could use explicit characters in the body to indicate the

From 09117de4764c07f27ec7d54ed20350aa2ff79736 Mon Sep 17 00:00:00 2001
From: Richard Smith <richard@metafoo.co.uk>
Date: Fri, 12 Jun 2020 00:09:48 -0700
Subject: [PATCH 10/11] Add details on directionality based on discussion in
 the context of issue#19.

---
 docs/proposals/p0016.md | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/docs/proposals/p0016.md b/docs/proposals/p0016.md
index 8a4e6198cc28..e71fd9df6a33 100644
--- a/docs/proposals/p0016.md
+++ b/docs/proposals/p0016.md
@@ -527,6 +527,21 @@ circumstances:
 A *token* is a documentation comment, a literal, a keyword, an identifier, a
 designator, an operator, or a bracket.
 
+### Directionality
+
+After tokens are formed, a final check is performed to ensure that the
+appearance of the code matches its meaning, as follows. The Unicode
+Bidirectional Algorithm, as described in Unicode 13.0.0
+[UAX#9](https://unicode.org/reports/tr9/), is applied to the source text. It is
+an error if any part of a token would be displayed after any part of a later
+token, or if any operator or bracket or the delimiters of a string literal or
+comment does not have a resolved directionality of L (left-to-right).
+
+Such issues can be resolved by the insertion of explicit left-to-right marks.
+The Carbon formatter tool will insert such marks as necessary to satisfy these
+constraints.
+[[why?]](#formatting-rationale)
+
 ## Rationale
 
 ### Encoding rationale

From fffd2fb415e6a65860691e0b9fdf7a953904107f Mon Sep 17 00:00:00 2001
From: Richard Smith <richard@metafoo.co.uk>
Date: Fri, 12 Jun 2020 12:44:21 -0700
Subject: [PATCH 11/11] Expand directionality discussion, relocate to be
 adjacent to encoding discussion, and convert the suggestion that we enforce
 directionality to an open question.

---
 docs/proposals/p0016.md | 74 ++++++++++++++++++++++++++++++++---------
 1 file changed, 59 insertions(+), 15 deletions(-)

diff --git a/docs/proposals/p0016.md b/docs/proposals/p0016.md
index e71fd9df6a33..c4ec504af385 100644
--- a/docs/proposals/p0016.md
+++ b/docs/proposals/p0016.md
@@ -101,6 +101,28 @@ potentially disallow everything other than space and newline (and, depending
 on what we decide for [directionality](#directionality), perhaps LTR marks),
 which would lead to a substantially simpler indentation rule.
 
+#### Directionality
+
+Explicit left-to-right marks are permitted in order to allow the user to ensure
+that the visual appearance of the code matches the actual parse order of the
+tokens.
+[[why?]](#directionality-rationale)
+
+The Carbon formatter tool will insert such marks as necessary in order
+to guarantee this property. For example, left-to-right marks may be inserted
+around identifiers containing right-to-left text to avoid adjacent operators
+being reversed, and left-to-right marks may be inserted around string literals
+in order to ensure that the delimiters are displayed at the beginning and end
+of the literal.
+[[why?]](#formatting-rationale)
+
+**Open question:** Should we require that in a well-formed Carbon program, the
+appearance of the source code (as determined by the Unicode Bidirectional
+Algorithm) matches the token order as interpreted by the Carbon implementation?
+This would introduce implementation and compilation-time cost, but would allow
+us to provide stronger guarantees that code does what a reader believes it to
+do.
+
 ### Comments
 
 A *comment* begins with the characters `//` and runs to the end of the line.
@@ -527,21 +549,6 @@ circumstances:
 A *token* is a documentation comment, a literal, a keyword, an identifier, a
 designator, an operator, or a bracket.
 
-### Directionality
-
-After tokens are formed, a final check is performed to ensure that the
-appearance of the code matches its meaning, as follows. The Unicode
-Bidirectional Algorithm, as described in Unicode 13.0.0
-[UAX#9](https://unicode.org/reports/tr9/), is applied to the source text. It is
-an error if any part of a token would be displayed after any part of a later
-token, or if any operator or bracket or the delimiters of a string literal or
-comment does not have a resolved directionality of L (left-to-right).
-
-Such issues can be resolved by the insertion of explicit left-to-right marks.
-The Carbon formatter tool will insert such marks as necessary to satisfy these
-constraints.
-[[why?]](#formatting-rationale)
-
 ## Rationale
 
 ### Encoding rationale
@@ -603,6 +610,43 @@ are used where necessary, that identifiers are properly normalized, and so on,
 and we can simplify our implementation somewhat by only permitting input in a
 single Unicode normalization form.
 
+### Directionality rationale
+
+Source code containing right-to-left string literals and right-to-left
+identifiers will often display in a way that differs from its interpretation as
+code. For example:
+
+```
+// The left operand of the + is "مرحبا", the right operand is "بالعالم".
+var String: x = "مرحبا" + "بالعالم";
+```
+
+Here, when displaying the code, the Unicode Bidirectional Algorithm identifies
+all the text from the first `"` to the last `"` as being a single right-to-left
+context, and so reverses that entire substring (including the `" + "` in the
+middle).
+
+Inserting a left-to-right mark after each string literal containing
+right-to-left text fixes the problem in this case:
+
+```
+// Same example as before, but now a left-to-right mark has been inserted after
+// each string literal.
+var String: x = "مرحبا"‎ + "بالعالم"‎;
+```
+
+Similar things can happen with right-to-left identifiers. For example, in
+
+```
+var مرحبا: بالعالم;
+```
+
+the declared identifier and type are likely to be displayed in the opposite
+order from how they would be interpreted by a Carbon implementation.
+
+If we allow explicit left-to-right marks in the source code and treat them as
+whitespace, such issues can be fixed by the Carbon formatting tool.
+
 ### Line continuation rationale
 
 Line continuation in C++ is sometimes necessary in order to combine the needs