
Carbon: Lexical conventions #17

Closed · wants to merge 11 commits

Conversation

@zygoloid (Contributor), May 19, 2020:

Possible set of lexical conventions for Carbon. Early draft circulated for initial feedback.

The primary principles leading to this approach are to make language evolution (adding keywords, operators, brackets, new kinds of comments) as easy as possible, and to make lexing and parsing as straightforward and efficient as we reasonably can.

RFC: https://forums.carbon-lang.dev/t/rfc-lexical-conventions/67

Fixes #16.

(Several resolved review threads on docs/proposals/p0016.md omitted.)
This is not a comment.
```

If the character after the comment introducer is an exclamation mark, the
@gribozavr (Contributor), May 19, 2020:

In other languages, the choice of /// for documentation comments is vastly more popular than //!.

Regarding /*! vs. /** I'm not sure there is a specific majority, but /** is just ever so slightly easier to type than /*!.

@zygoloid (Author):

/** is frequently used to start a /*********-style banner comment. As such, I don't think we should use it for documentation -- we could treat / followed by exactly 2 *s as a special case, but that feels ugly to me. Obviously there's no concern here if we choose to not have block comments.

I prefer //! over ///, because /// is easy to mistake for // and vice versa. I've seen a lot of code use one when they meant the other, and the mistake went unnoticed by the reviewer. I think this is especially important as the difference would affect program validity, and could even theoretically change how we parse things (though I'd want our grammar to avoid that).

That said, I do find /// more aesthetically appealing for long documentation comments than //!. We could also pick something new -- we don't need to follow Doxygen convention here.

Contributor:

we could treat / followed by exactly 2 *s as a special case, but that feels ugly to me.

Specifically, /** followed by a newline would be my suggestion for a doc comment.

I've seen a lot of code use one when they meant the other, and the mistake went unnoticed by the reviewer.

Me too, but I don't think a different comment marker would help. I think this type of issue is best addressed by a linter: warn on a regular comment that appears in a doc-comment position, and if the user really meant a plain comment, suggest adding an extra newline after it, or a "nondoc" marker within the comment.

@jonmeow (Contributor), May 21, 2020:

I'd prefer /// over //! for ergonomics. Odd comment, I know, but //! is an awkward series of characters: it requires //<shift> (with the right pinky) on both QWERTY and DVORAK, and people would type it repeatedly. /// is better on this axis because you're just tapping the same key repeatedly.

Although, as a different choice, could we dictate the opposite? That is, any regular line comment (//) in a place where a doc comment is allowed is a doc comment. Require something like //! or // end-doc to end the doc comments, if people want to write comments there that should be treated as whitespace (which I'd expect to be relatively rare).

@zygoloid (Author):

I've listed /// versus //! as an open question. I've not yet formed an opinion on @jonmeow's alternative choice. It would be challenging from a parsing perspective to have // mean both doc and non-doc comments (Clang can do that, but it's a pain), so if we go that way, I'd prefer we find another introducer for non-doc comments.

Contributor:

Maybe //# if you want a clear difference? # is a traditional comment char, and feels slightly easier to type... (the current # thread made me think of this)

docs/proposals/p0016.md Outdated Show resolved Hide resolved
docs/proposals/p0016.md Outdated Show resolved Hide resolved
docs/proposals/p0016.md Outdated Show resolved Hide resolved

A real number can be followed by an `e`, an optional `+` or `-` (defaulting to
`+`), and a decimal integer *N*; the effect is to multiply the given value by
10<sup>*N*</sup>.
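The quoted exponent rule can be sketched as a small parser. This is purely illustrative: the regex below assumes a real number of the form `digits.digits`, which is not quoted here, and the function name is invented.

```python
import re

# Sketch of the rule above: a real number optionally followed by `e`, an
# optional sign (defaulting to `+`), and a decimal integer N, meaning
# "multiply the given value by 10**N". The surrounding real-number
# grammar (digits '.' digits) is an assumption, not quoted spec.
REAL = re.compile(r'^(\d+)\.(\d+)(?:e([+-]?)(\d+))?$')

def parse_real(text):
    m = REAL.match(text)
    if not m:
        raise ValueError(f"not a real literal: {text!r}")
    whole, frac, sign, exp = m.groups()
    value = float(f"{whole}.{frac}")
    if exp is not None:
        n = int(exp) * (-1 if sign == '-' else 1)  # sign defaults to '+'
        value *= 10.0 ** n
    return value
```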
Contributor:

We'd also need hex floats at some point.

Contributor:

I'd welcome that!

@zygoloid (Author):

Do we like the C and C++ notation for hex floats, or would we prefer something else?

@tkoeppe (Contributor), May 22, 2020:

It's reasonably widespread across other languages. Are there languages that have hexfloats but use a different syntax? So I wouldn't get too creative here. (Consider also formatted I/O and interop: it would be a bit odd to output one format but have source code in a different one.)

@gribozavr (Contributor), May 26, 2020:

I don't think I have an opinion on hexfloat specifics. But given that they are a niche, expert feature (used rarely, by people who know what they are doing and are likely familiar with the concept from other languages), and given the likelihood of hexfloats being copied from code in other languages into Carbon code, I think we would need really good reasons to deviate from the de-facto standard here.
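For reference, the C/C++ hexfloat notation under discussion is the same one Python exposes via `float.fromhex`/`float.hex`: a hexadecimal mantissa, then `p` and a decimal exponent that scales by a power of two.

```python
# C/C++-style hexadecimal float notation: 0x<hex mantissa>p<decimal exp>,
# where the exponent after `p` is a power of two.
assert float.fromhex('0x1.8p3') == 12.0   # (1 + 8/16) * 2**3
assert float.fromhex('0x1p-2') == 0.25    # 2**-2
assert (12.0).hex() == '0x1.8000000000000p+3'
```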

more additional decimal digits.

Integers in other bases are written as a `0` followed by a base specifier
character, followed by a sequence of digits in the corresponding base. The
Contributor:

Any thoughts on digit separators?

@zygoloid (Author):

Two thoughts:

  1. If we allow them, what symbol do we use? , seems problematic (perhaps lexable, but undesirable). I personally prefer C++'s ' over the _ used by other languages.
  2. Where do we permit them to appear? I would be inclined to require them to appear at "natural" positions within the number -- evenly spaced, groups of 3 for decimal and groups of 4 or 8 for binary and hexadecimal -- but that's a very Anglocentric perspective. (I think that could actually be OK, though: our keywords, use of . rather than , as a decimal point, use of " as quotation marks, and so on are also very Anglocentric.)

Contributor:

re: placement, I've personally used digit separators in binary numbers in Swift to mark bitfield boundaries: https://github.com/apple/swift/blob/39397860a57bf64c45d49be08ba401b94d07be5e/stdlib/public/core/UTF16.swift#L339

@zygoloid (Author):

Interesting, and doubled digit separators at that. (That would not be valid in C++, where digit separators are required to actually separate digits.) In most cases, I'd think it's a language design issue if literals are being written for bit-fields, but UTF-8 encoding (and things of its ilk) are perhaps a different case.

On balance, I think the benefit of requiring the digit separators to be properly placed (ie, rejecting mistakes like 0xffff'00000'0000) is worth disallowing the more nuanced cases such as your example. But I don't feel strongly about it.

@gribozavr (Contributor), May 26, 2020:

I think requiring digit separators to be regularly placed in decimal and hexadecimal numbers is the right choice. But binary numbers are different, because serialization/deserialization code that has to deal with bitfields is common.

Comment:

The C++ situation has always struck me as odd. We have prefix-based notations built in to the language, and we have suffix-based notations for libraries. So the language gives me 19, 0xFA, 026, and 0b1010110, but if I want to define a trinary literal it's 20021_t. That inconsistency has always bothered me. The document mentions above that we could potentially allow something like user-defined literals, which would bring back this inconsistency.

Comment:

My initial feeling is that I'm weakly against enforcing a particular separation style in integers. It's fairly easy to change the rule in either direction, though, since it's easy to write a migration tool.
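The "regularly placed" requirement discussed in this thread could look like the checker below. This is purely illustrative: the separator character (`_` here, versus C++'s `'`) and the group sizes (3 for decimal, 4 for hex) are open questions in the thread, not decided syntax.

```python
import re

# Sketch of the "regularly placed" digit-separator rule under discussion:
# separators must fall at even group boundaries -- groups of exactly 3
# digits for decimal and 4 for hexadecimal. `_` and the group sizes are
# assumptions for illustration only.
DECIMAL = re.compile(r'^\d{1,3}(_\d{3})*$')
HEX = re.compile(r'^0x[0-9A-F]{1,4}(_[0-9A-F]{4})*$')

def well_separated(text):
    return bool(DECIMAL.match(text) or HEX.match(text))
```

Under this rule the mistake cited above, `0xffff'00000'0000` (written here as `0xFFFF_00000_0000`), is rejected because one group has five digits.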


#### Characters

A *character literal* is formed of any single character other than a backslash
@gribozavr (Contributor), May 19, 2020:

Need a more precise definition of what a "character" is -- a Unicode scalar, an extended grapheme cluster etc.

@zygoloid (Author):

We do, but I'm not sure this is the right place to consider that. I imagine we'll have a later proposal on character and string types. Perhaps the best thing to do in the context of this document is to model character literals the same as simple string literals, and let the proposal that deals with character and string types worry about what happens if the value is unrepresentable as a "single character" for whatever data model it's using.

Contributor:

I don't think it's possible to say what a character literal is without saying what a single character is. We can't defer to the Unicode standard on that question, because, if I understand correctly, "character" just isn't a concept in Unicode. There are code points and extended grapheme clusters and such. Thinking of code points as characters doesn't quite work: I'd probably want to think of both 'õ' and 'x̤' as single characters, but one of them is a single code point and the other is two code points.

@zygoloid (Author):

Unicode has a notion of "character" -- those are the things to which code points are assigned. (See https://www.unicode.org/versions/Unicode13.0.0/ch01.pdf, which uses the word "character" extensively.) That's generally what I mean whenever I say "character" in this document (except perhaps -- ironically -- for the term "character literal", which is certainly underspecified).

I think that this is the wrong document to be specifying how character and string types work and are represented, and the semantics of literals of those types. I only want to cover the syntax here. From that perspective, I think the answer is relatively easy: if we allow character literals at all (which I now list as an open question), then they have the same morphology as simple string literals, other than having different delimiters. It's then up to the semantic interpretation of them to determine whether a character literal is valid.

I could imagine we might want multiple different kinds of character type, representing ASCII characters, Unicode characters, Unicode grapheme clusters, and a number of other things, and each of them might want to interpret and validate the contents of a character literal in a different way. So I think it would not be appropriate to specify anything here other than a lexical convention. Maybe we will choose to not have character literals at all. That'd be nice; we could free up ' for other purposes, as @gribozavr pointed out elsewhere. But for now I'd like to reserve it for character literals -- to give them "first dibs" as it were.

@gribozavr (Contributor), May 26, 2020:

Unicode has a notion of "character" -- those are the things to which code points are assigned.

I just scanned it, and I think chapter 1 uses the term "character" informally -- it is an introductory chapter after all. I could not find a definition of it. You can find a definition of, for example, an "encoded character" -- which is a term of art, as indicated by the italics in the text.

I think it is fair to punt the specifics to future proposals, as long as we're explicit about not making any particular commitment in this proposal.


A *character literal* is formed of any single character other than a backslash
(`\\`) or single quotation mark, enclosed in a pair of single quotation marks
(`'`), or an escape sequence enclosed in a pair of single quotation marks.
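As a rough illustration of the quoted morphology, the shape can be captured with a regular expression. The escape set below (`\n`, `\t`, `\0`, `\\`, `\'`) is an assumption; the proposal defers escape sequences to the string-literal rules, and "character" here means a single code point.

```python
import re

# Sketch of the character-literal shape quoted above: one character other
# than backslash or single quote, or an escape sequence, enclosed in
# single quotes. The concrete escape set is an assumption.
CHAR_LITERAL = re.compile(r"^'(?:[^\\']|\\[nt0\\'])'$")

def is_char_literal(text):
    return bool(CHAR_LITERAL.match(text))
```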
@gribozavr (Contributor), May 19, 2020:

Do we need a separate character literal syntax? In Swift, for example, string literals that use the double quotation mark are polymorphic: when a string literal contains only one Unicode character, the string literal is also a character literal.

https://github.com/apple/swift/blob/0968d16a99f812d46782eee84b6c7899ca88e185/test/stdlib/Character.swift#L71-L120

(This is a test for Character which is an extended grapheme cluster, UnicodeScalar is a single Unicode scalar and it works similarly.)

Not using a single quote for character literals frees up one special character so that we could use it for something else.

@zygoloid (Author):

Hm. I'm not a huge fan of this kind of punning between different types: a character and a string are ontologically different things, and I think the programmer should be expressing which one they want. (Python's full-scale unifying of characters and strings is not a good thing. The Swift approach seems a lot better, but I'm still concerned.) I'm not completely opposed, mind you -- freeing up ' for other uses might be nice -- but I'm certainly not sold on this idea.

Comment:

Is it that different from integer literals getting their own unique type that then converts to sized integer types?

@mconst (Contributor), May 20, 2020:

Personally, I think the Swift approach is pretty reasonable. An ASCII string is an array of characters, but a UTF-8 string is not -- UTF-8 characters are variable-length, so to handle them efficiently, you need to treat them like substrings rather than like elements of an array. (And that's doubly true if you're talking about extended grapheme clusters rather than individual code points.) Efficient code that iterates through the characters of a Unicode string looks very similar to code that iterates through the words of an ASCII string.

As a result, Unicode characters feel like a special case of strings to me, rather than a fundamentally different data type.

@zygoloid (Author):

I've added this as an open question. I think this depends on the design of things we will consider later (string and character types), but for now I'd like to at least reserve ' for character literals, so we don't use them for anything else until / unless we decide we don't want a separate character literal syntax.

Comment on lines 477 to 478
an identifier, there should never be an optional keyword preceding the
identifier, and nor should the identifier be optional if it can be followed by
Contributor:

there should never be an optional keyword preceding the identifier,

Why not? As long as that optional keyword is included in the language version beyond the "compatibility horizon", it should be feasible to recognize it.

@zygoloid (Author):

I would like to use the same consistent set of rules for all keywords throughout the entire language, rather than giving different behavior to recently introduced keywords compared to older keywords. I think we should aim for our intended evolutionary path to not introduce scar tissue -- places where you can see that something used to be different and changed. And I think that means that all keywords should behave as if they're new keywords.

@zygoloid (Author):

I suppose another perspective on this is: while we could have such an optional keyword now, we could never add another one as a point change. We would need to first add it, then wait for the compatibility horizon to expire (which could potentially be years), and then start using it. That would put a lot of pressure on us to reuse an existing keyword, which we explicitly do not want. If we disallow such changes forever, then the pressure to reuse keywords is gone, because reusing a keyword doesn't help solve the problem.

Comment on lines +63 to +82
Characters with the Unicode property `White_Space` but not
`Pattern_White_Space` are invalid outside comments and literals. Code
Contributor:

For those of us who aren't as familiar with these Unicode properties, which of the characters above are we talking about here (as of Unicode version 13)?

@zygoloid (Author):

They're these ones:

```
00A0          ; White_Space # Zs       NO-BREAK SPACE
1680          ; White_Space # Zs       OGHAM SPACE MARK
2000..200A    ; White_Space # Zs  [11] EN QUAD..HAIR SPACE
202F          ; White_Space # Zs       NARROW NO-BREAK SPACE
205F          ; White_Space # Zs       MEDIUM MATHEMATICAL SPACE
3000          ; White_Space # Zs       IDEOGRAPHIC SPACE
```

For U+2000..U+200A, see https://www.compart.com/en/unicode/block/U+2000
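As a quick cross-check: Python's `unicodedata` doesn't expose Pattern_White_Space directly (that is the small stable set of tab, LF, VT, FF, CR, space, NEL, U+200E/200F, U+2028, U+2029), but it can confirm that every code point listed above is a Zs space separator.

```python
import unicodedata

# The White_Space characters listed above that are NOT in
# Pattern_White_Space; all of them have general category Zs.
EXCLUDED = [0x00A0, 0x1680, *range(0x2000, 0x200B), 0x202F, 0x205F, 0x3000]

for cp in EXCLUDED:
    assert unicodedata.category(chr(cp)) == 'Zs'
```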

Contributor:

I, too, had a hard time understanding the significance of this rule. Is it possible to explain it in a way that makes the underlying intent clearer?

A *comment* in Carbon is either:

* A *line comment*, beginning with `//` and running to the end of the line, or
* A *block comment*, beginning with `/*` and running to the matching `*/`.
Contributor:

I wonder if we might get away with just line comments? Block comments open up questions about nesting, and can allow some underhanded code -- where someone tries to fool you that code is safe by tricking you into thinking something is commented out, but there is actually something after the comment start on the same line that is live code. This attack is specifically discussed in https://www.ida.org/-/media/feature/publications/i/in/initial-analysis-of-underhanded-source-code/d-13166.ashx

@zygoloid (Author):

I'm going to leave this thread unresolved for a bit in the hope of attracting more opinions, but I'm tentatively in favor of adopting this suggestion.

@zygoloid (Author):

From @gribozavr:

Regardless of specific syntax, I'm not sure that we need multiline comments that are not code comments.

@jonmeow (Contributor), May 21, 2020:

Josh, the comment-related attack I see in the paper is using // in Python (which uses # for comments, and // is an operator). I don't see block comments or /* mentioned for attacks. Am I missing the attack you're referring to?

I think block comments have a lot of utility for commenting out large sections of code, e.g. while debugging. We could make them more readable by allowing a block comment only when it's the only thing on a line. That kind of approach would still leave us wanting an alternative for:

Bang(foo, /*bar=*/baz);

(also, good syntax highlighting, based on easy parsing, should help mitigate attacks)

@zygoloid (Author):

For now I've removed /* ... */ comments. Will bring them back if we establish consensus that we want them.

I'm optimistic that we can address the need for Bang(foo, /*bar=*/baz) comments a different way (eg, with designators for parameters: Bang(foo, .bar=baz)).

Contributor:

@jonmeow Comment-related attacks is a whole category, described as:
"Many samples worked by confusing humans about comments (e.g.,
misleading humans about where the comments started or having active code
embedded in a comment)."

He elsewhere talks about "active code is hiding within a comment" or "non-comments hidden in comments" which is more descriptive of what you can do with /* ... */ than that specific Python exploit.

It is much easier to confuse humans about where a comment ends with /* ... */ than something that terminates with a newline (in fact one mitigation in the doc is to reformat so comments are on their own lines). As an example, what is /*/*/*/*/*/ equivalent to? It looks a bit like a fancy separator comment, but it gets parsed as *.
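A tiny scanner makes the point concrete. With C-style, non-nesting block comments, each `/*` swallows everything up to the first `*/`, so `/*/*/*/*/*/` reduces to a single `*`:

```python
# A minimal sketch of C-style (non-nesting) block-comment removal, to show
# why /*/*/*/*/*/ lexes down to a single `*`, as noted above.
def strip_block_comments(text):
    out, i = [], 0
    while i < len(text):
        if text.startswith('/*', i):
            end = text.find('*/', i + 2)   # first terminator wins; no nesting
            i = len(text) if end < 0 else end + 2
        else:
            out.append(text[i])
            i += 1
    return ''.join(out)
```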

Contributor:

I found it, page B-10: "Misformatted comment (early termination due to an embedded */)...."

Contributor:

The example /*/*/*/*/*/ would be addressed by requiring block comments as the only thing on the line - while it's an issue in C++, I don't see why it should be a barrier for Carbon.

@zygoloid If you're removing block comments, can you please explicitly address it in an "Alternative considered" or otherwise clarify the disposition?

Contributor:

Ping -- this proposal is now in RFC, I would've expected this to be addressed. I don't see /* mentioned at all in the proposal, even though it's an obvious alternative.

left-to-right scan of the source file, using a "max munch" rule: the longest
possible next token is formed at each step.
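The max-munch rule quoted above can be sketched over a toy token set: at each position the longest matching token wins. The token set below is invented for illustration and is not Carbon's actual operator set.

```python
# Sketch of "max munch": at each step, form the longest possible next
# token. TOKENS is a toy set, not Carbon's operators.
TOKENS = ['<<=', '<<', '<=', '<', '=', '+']

def max_munch(text):
    out, i = [], 0
    while i < len(text):
        for tok in sorted(TOKENS, key=len, reverse=True):  # longest first
            if text.startswith(tok, i):
                out.append(tok)
                i += len(tok)
                break
        else:
            raise ValueError(f"no token at position {i}")
    return out
```

So `<<<` lexes as `<<` then `<`, never as three `<` tokens.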

## Rationale
Contributor:

Moving all the rationales to a separate section was not a win for my reading of this document. I would have preferred them as subsections next to the thing they are the rationale for. In the one case of the "encoding rationale", I probably would have been happier to read it later instead of when I found the [why?] link to it. However, I didn't know that when I saw the [why?] link, so I ended up jumping back and forth anyway.

Perhaps the ideal would have been a collapsible section, but I don't have any idea if those are available in the markdown variants we are using.

@zygoloid (Author):

I'm attempting to follow https://github.com/carbon-language/carbon-lang/blob/master/docs/project/evolution.md#make-a-proposal -- the BLUF / Inverted Pyramid style seems to encourage putting all of the "what" before any of the "why". I was also taking the perspective of imagining that the non-rationale section (somewhat formalized) would eventually become part of the specification, with the rationale separated out into a distinct document.

I'm not sure this is the right balance. I'm much happier with this structure than that of the "Operators" document (which ended up pretty mangled because it started as an exploration of what we could do about precedence and then got hit by major scope creep, and needs some fundamental restructuring as a result).

We can introduce collapsible sections in GitHub-flavoured markdown via inline HTML -- see for example the "Digression" section in https://github.com/zygoloid/carbon-proposals/blob/operators/operators/operators.md#unary-negation -- and I could try switching to those if people would generally prefer that.

Contributor:

FWIW, I'm in agreement with Josh. My thought is that there should be a brief overview for the "what". All the details here are really the "why".

Collapsible or not seems minor -- this is in a proposal doc, and so will typically be seen regardless.

Contributor:

The document structure you have is:

```
CompleteSpec1
CompleteSpec2
...
Rationale1
Rationale2
...
```

An alternative that I think would serve the "BLUF" goal better would be:

```
Summary1
Summary2
...
CompleteSpec1
Rationale1
CompleteSpec2
Rationale2
...
```

The summary would be very brief: one or two sentences of explanation and a short code snippet example, and maybe have a link to the full spec and rationale section below. E.g.:

Literals

  • Integer literals are written like: 42, 0x3A (hex), 0o777 (octal), 0b01011010 (binary); not 01, 0x3a, 0X3A (upper vs. lower case matters).
  • Real literals are written like: 1.2, 1.0e5, 2.0e-3, 3.1e+1; not 1. (at least one digit after the .), 1.0e05 (exponent can't start with 0), 1.0E5 (e must be lowercase).
  • String literals are written like: "abc", "xyz\n123" (\ introduces escapes), and may contain utf-8.

...
See the complete literals spec below.

I think this is the approach used by the language overview, and I think would be enough for someone to read the summary and be able to parse example code. The detail level of the spec is not needed for understanding generally the intended syntax, and is better paired with the explanations that justify those details.
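The integer rules in the summary bullets above are precise enough to express as a regular expression. This sketch encodes only the examples given in the comment (no leading zero, upper-case hex digits, lower-case base prefixes), not a finalized grammar.

```python
import re

# Sketch of the summarized integer-literal rules: decimal with no leading
# zero, or 0x/0o/0b with the digit cases the comment's examples imply.
# Illustrative only; digit separators etc. are ignored here.
INT = re.compile(r'^(0|[1-9][0-9]*|0x[0-9A-F]+|0o[0-7]+|0b[01]+)$')

def is_int_literal(text):
    return bool(INT.match(text))
```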

Contributor:

I strongly concur with Josh's suggestion that the proposed rules need to be closer to the associated rationale, but I think it needs to involve more than just moving blocks of text around. The main issue I'm having with this doc is that, although the proposed rules are clearly stated, in at least some cases I can't usefully evaluate the rules without more background information, and/or more explicit discussion of their consequences (I've pointed out one instance in more detail above, namely the discussion of directionality and indentation). Rather than structuring this as a series of rules, followed by rationales for those rules, I'd recommend thinking of it more like a series of questions/problems, followed by the proposed answers/solutions.

@zygoloid (Author):

@geoffromer I tried something like that with the operators document, and it didn't work well. In particular, the big problem with the operators doc was that it wasn't clear exactly what was being proposed, precisely because the background and rationale and exploration of alternatives was included in the same running text as the proposal itself.

I'll try Josh's approach, and we can see how that works out. That approach doesn't address the problem that rationale and specification are not in 1<->1 correspondence, but I think that can be handled on a case-by-case basis.

Comment on lines 80 to 86
If the character after the `/*` introducing a block comment is `{`, the comment
is a *code comment*. In a code comment, the following text is tokenized until a
matching `}*/` token is formed; such a token terminates the comment. (In
particular, such a token is not recognized if it is nested within another
comment or a literal.) Otherwise, the comment ends at the first matching `*/`
character sequence.
[[why?]](#nested-comments-rationale)
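A simplified sketch of the draft rule quoted above: after `/*{`, the text is scanned token-ishly until a `}*/` that is not nested inside a literal or another comment. Real tokenization is much richer; this sketch only skips double-quoted string literals and `//` line comments, and the function name is invented.

```python
# Simplified sketch of the code-comment rule above: find the `}*/` that
# terminates a `/*{` comment, ignoring occurrences inside double-quoted
# strings and line comments. Full tokenization is assumed away.
def find_code_comment_end(text, start):
    assert text.startswith('/*{', start)
    i = start + 3
    while i < len(text):
        if text.startswith('}*/', i):
            return i + 3
        if text[i] == '"':                      # skip a string literal
            i += 1
            while i < len(text) and text[i] != '"':
                i += 2 if text[i] == '\\' else 1
            i += 1
        elif text.startswith('//', i):          # skip a line comment
            nl = text.find('\n', i)
            i = len(text) if nl < 0 else nl + 1
        else:
            i += 1
    raise ValueError("unterminated code comment")
```

A `}*/` inside a string or line comment within the commented-out code does not terminate the comment, which is the point of the rule.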
Contributor:

I did not find this compelling. I wonder if we might start with a simple story for comments (only //), and rely on IDEs to support adding them to a whole block of code?

Contributor Author

I use #if 0 to temporarily comment out code quite a lot. I'd be a little unhappy if we didn't have a comparable mechanism. Having a different way to say "this is commented out code" versus "this is random text" is, I think, useful, because you can still syntax-highlight and format inside such commented-out regions. (Imagine you indent a region containing commented-out text and need to reflow it.)

But I don't think it's an absolute must-have. I would welcome more input on this, so I know whether I need to reinforce the rationale or change the proposal :)


+1 I use #if 0 heavily during debugging

Contributor Author

Rearranged block comment support based on discord discussion. I think the question of whether we could get away with only // is one we should explicitly consider, though. I'll add that as an open question.

docs/proposals/p0016.md (resolved)
docs/proposals/p0016.md (outdated, resolved)
docs/proposals/p0016.md (outdated, resolved)
docs/proposals/p0016.md (outdated, resolved)
Comment on lines +513 to +915
current operator set. This requires parentheses in code that would apply
multiple prefix or postfix operators in a row, such as `-*p`, but gives us the
Contributor

This is a little concerning, but I don't know how common such things are in existing C++ code. Would a space here be allowed instead of parens?

Contributor Author

@zygoloid zygoloid May 20, 2020

My inclination is to say no, on the basis that a - with a space on the right should be a postfix or infix operator. Swift follows this same rule, and doesn't permit a space to be used to split the token in two.

@gribozavr Do you have any data on whether and to what extent this is a problem in Swift? Any user feedback you can point us at? (And if this is fine in Swift, do you think the reduced emphasis on pointers contributes to that?)
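The Swift-style convention being discussed (whitespace on one side determines fix-ity) can be sketched as follows; the function name and the treatment of opening brackets as a left boundary are illustrative assumptions, not the proposal's actual rule:

```python
def fixity(src: str, i: int) -> str:
    """Classify the operator character at src[i] by surrounding
    whitespace, roughly following Swift's convention: whitespace only on
    the left means prefix, only on the right means postfix, and both or
    neither means infix.  Sketch only.
    """
    # Opening brackets count as a left boundary, as in Swift.
    left_ws = i == 0 or src[i - 1].isspace() or src[i - 1] in "([{,"
    right_ws = i + 1 >= len(src) or src[i + 1].isspace()
    if left_ws and not right_ws:
        return "prefix"
    if right_ws and not left_ws:
        return "postfix"
    return "infix"

assert fixity("a - b", 2) == "infix"
assert fixity("-x", 0) == "prefix"
assert fixity("x- y", 1) == "postfix"
```

Under such a rule, writing `- x` cannot recover a prefix minus, which is why the space-splitting question above matters.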

Contributor

@gribozavr gribozavr May 26, 2020

Swift has only a few prefix operators: Policy.swift

Swift does not have the issue that you're describing in practice (having to disambiguate by adding parentheses) because I don't think there is any way to string these prefix operators together in an expression that is useful in practice. Sure you can theoretically combine prefix negation and prefix bitwise complement, but is that realistic? In C++, on the other hand, we have prefix increment and pointer dereference operators that can compose in real world programs with other prefix operators.

Furthermore, if someone does fall into this trap, the compiler provides a custom error message: test/decl/func/operator.swift (implementation)

Comment on lines 128 to 129
Decimal integers are written as a non-zero decimal digit followed by zero or
more additional decimal digits.
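The quoted rule, taken literally, corresponds to this regular expression; note that under it a leading `+` or `-` would have to be a separate prefix-operator token, and `0` itself would need a separate rule:

```python
import re

# Decimal integer per the stated rule: a non-zero decimal digit followed
# by zero or more additional decimal digits.
DECIMAL_INT = re.compile(r"[1-9][0-9]*")

assert DECIMAL_INT.fullmatch("42")
assert DECIMAL_INT.fullmatch("7")
assert not DECIMAL_INT.fullmatch("042")   # leading zeros are excluded
assert not DECIMAL_INT.fullmatch("-3")    # the sign is not part of the literal
```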

leading +/- are a question I have too


An *identifier* is a maximal sequence of characters beginning with a character
with Unicode property `XID_Start`, followed by zero or more characters with
property `XID_Continue`.
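Python's identifier rule is based on roughly the same Unicode properties (`XID_Start` then `XID_Continue`, with `_` additionally allowed as a start character), so `str.isidentifier()` gives a quick way to experiment with what this rule admits; the helper name here is illustrative:

```python
# Approximating the XID_Start / XID_Continue rule via Python's own
# identifier check.  Python also permits a leading '_', which XID_Start
# alone does not include, so this is a sketch rather than an exact match.
def looks_like_identifier(s: str) -> bool:
    return s.isidentifier()

assert looks_like_identifier("café")       # non-ASCII letters are allowed
assert looks_like_identifier("変数")        # so are non-Latin scripts
assert not looks_like_identifier("1abc")   # digits cannot start an identifier
```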

Do we want to forbid non-ASCII identifiers?

Contributor Author

@zygoloid zygoloid May 20, 2020

I think it's important to allow non-English identifiers.


I worry quite a bit about adversarial code in cases like this.

Contributor Author

Let's discuss this on #19

docs/proposals/p0016.md (outdated, resolved)
docs/proposals/p0016.md (outdated, resolved)
@googlebot

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

1 similar comment

@googlebot

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

* U+2028 LINE SEPARATOR
* U+2029 PARAGRAPH SEPARATOR

Space, horizontal tab, and the LTR and RTL mark are *horizontal whitespace*
Contributor

Based on #17 (comment) , should we forbid RTL marks outside of strings, or at least have some restrictions?

docs/proposals/p0016.md (outdated, resolved)
docs/proposals/p0016.md (outdated, resolved)
docs/proposals/p0016.md (outdated, resolved)
is empty.

The *indentation* of a line is the sequence of horizontal whitespace characters
at the start of the line. A line *A* has more indentation than a line *B* the
Contributor

(I marvel at the realization that "indentation" is a partial ordering!)
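One plausible formalization of the truncated rule above, assuming "more indentation" means the other line's indentation is a proper prefix; the function name is illustrative, and for simplicity only spaces and tabs are treated as indentation here (the proposal's horizontal whitespace also includes the LTR and RTL marks):

```python
def more_indented(a: str, b: str) -> bool:
    """Line A has more indentation than line B when B's indentation is a
    proper prefix of A's.  This is only a partial order: a tab-indented
    line and a space-indented line are incomparable.
    """
    ind_a = a[: len(a) - len(a.lstrip(" \t"))]
    ind_b = b[: len(b) - len(b.lstrip(" \t"))]
    return ind_a.startswith(ind_b) and ind_a != ind_b

assert more_indented("    x", "  x")      # four spaces > two spaces
assert not more_indented("\tx", "  x")    # tab vs. spaces: incomparable
assert not more_indented("  x", "\tx")    # ...in both directions
```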

@jonmeow
Contributor

jonmeow commented Jul 30, 2020

Small note: it'd probably be good to break this PR up into child proposals for readability/compactness, maybe corresponding to separate files (possibly in the same directory?) in the design dir.

Also, given the unicorns I keep seeing trying to load this PR, splitting may be kinder to GitHub. ;)

@zygoloid
Contributor Author

> Small note: it'd probably be good to break this PR up into child proposals for readability/compactness, maybe corresponding to separate files (possibly in the same directory?) in the design dir.
>
> Also, given the unicorns I keep seeing trying to load this PR, splitting may be kinder to GitHub. ;)

Agreed; I've split a couple of pieces out and will continue to do so. Closing this on the basis that I have no intention of ever taking the "big picture" proposal to a decision.

@zygoloid zygoloid closed this Oct 22, 2020
Contributor Author

@zygoloid zygoloid left a comment

(Looks like I forgot to submit these old comments.)

using a "max munch" rule: the longest possible next lexical element is formed
at each step.

After division into these components, whitespace and text and block comments
Contributor Author

True, it would be good to avoid referencing terms I've not yet defined. However, the suggested change doesn't match the intent: documentation comments are not discarded at this stage.
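The "max munch" rule quoted above can be illustrated over a toy symbol set (not the proposal's real token set): at each position the longest symbol that matches wins, so `<<=` lexes as one token rather than `<` `<=` or `<<` `=`:

```python
# Toy max-munch lexer: symbols sorted longest-first so the first match
# at each position is the maximal munch.
SYMBOLS = sorted(["<", "<=", "<<", "<<=", "=", "-", "*"], key=len, reverse=True)

def lex_symbols(src: str) -> list:
    tokens, i = [], 0
    while i < len(src):
        if src[i].isspace():
            i += 1
            continue
        # Longest-first ordering makes the first match the longest match.
        munch = next(s for s in SYMBOLS if src.startswith(s, i))
        tokens.append(munch)
        i += len(munch)
    return tokens

assert lex_symbols("<<= < <=") == ["<<=", "<", "<="]
```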


Carbon source files are Unicode text files encoded in UTF-8. An initial UTF-8
BOM is permitted and ignored.
[[why?]](#encoding-rationale)
Contributor Author

I have been intentionally keeping the [why?] links on their own line to improve the readability and maintainability of the Markdown source. Do you think the source would be improved by moving this onto the previous line? (I read the instruction that Prettier "should" be used as permitting me to use a different style if there is justification.)

Carbon is currently based on Unicode 13.0, and will adopt new Unicode versions
as they are published.

**Open question:** Should we require source text to be in NFC, as C++ plans to
Contributor Author

The document says to use NFC, but there have been questions raised as to whether that's the right thing, so I want an explicit discussion and decision on this question. I'm expecting that the outcome from the review will be that revisions are necessary -- in particular, if we choose to normalize identifiers ourselves, this has ripple effects throughout the document that will require revision in various places.

If you'd like this presented in a different way, let me know, but I'm reluctant to remove the wording describing the consequences from this decision unless there's some indication that we want a different outcome than the one I suggest. I'm not really sure yet how the proposal process should work when we have open questions that will need to be answered before the proposal can be considered complete.

Perhaps instead of describing this as an open question, I could describe it as a known point of potential dissent from the proposal?
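For concreteness, here is what an NFC requirement amounts to in practice (Python 3.8+ sketch, assuming the implementation rejects non-NFC source rather than silently normalizing it):

```python
import unicodedata

composed = "caf\u00e9"        # 'é' as a single code point: already NFC
decomposed = "cafe\u0301"     # 'e' + U+0301 COMBINING ACUTE ACCENT: not NFC

# The two strings render identically but compare unequal, which is the
# confusion an NFC requirement (or implementation-side normalization)
# is meant to eliminate.
assert composed != decomposed
assert unicodedata.is_normalized("NFC", composed)
assert not unicodedata.is_normalized("NFC", decomposed)
assert unicodedata.normalize("NFC", decomposed) == composed
```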

Comment on lines +120 to +121
appearance of the source code (as determined by the Unicode Bidirectional
Algorithm) matches the token order as interpreted by the Carbon implementation?
Contributor Author

That changes the binding of "as determined [...]" from "appearance of the source code" to "Carbon implementation". Would replacing the parentheses with commas help?

on what we decide for [directionality](#directionality), perhaps LTR marks),
which would lead to a substantially simpler indentation rule.

#### Directionality
Contributor Author

A previous revision went into more depth here, and required Carbon source to have proper directionality in order to be valid, but I ended up deciding that the details are far too involved and messy for it to be reasonable to define them here.

@mconst previously suggested a stricter rule: the directionality for all characters outside identifiers and the contents of string literals and comments is required to be left-to-right. I think that'd be somewhat easier to specify and something that we could make mandatory.
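That stricter rule could be checked mechanically by examining Unicode bidirectional classes; the function name is illustrative, and the set of rejected classes ('R', 'AL', 'AN') is an assumption about what "left-to-right" should exclude:

```python
import unicodedata

# Sketch of the stricter rule: outside identifiers, string literals, and
# comments, reject any character whose Unicode bidirectional class is
# right-to-left ('R' or 'AL') or an Arabic-context number ('AN').
def contains_rtl(text: str) -> bool:
    return any(
        unicodedata.bidirectional(ch) in ("R", "AL", "AN") for ch in text
    )

assert not contains_rtl("var x = 1;")
assert contains_rtl("\u05d0")    # HEBREW LETTER ALEF, bidi class 'R'
```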



## Rationale

### Encoding rationale
Contributor Author

You're right. Do you have a better word than "encoding" in mind? "character sets" maybe?

order from how they would be interpreted by a Carbon implementation.

If we allow explicit left-to-right marks in the source code and treat them as
whitespace, such issues can be fixed by the Carbon formatting tool.
Contributor Author

That's a good point. I don't think we can simply remove them before tokenization, because we want to retain them in string literals. I suppose we could either remove them before tokenization and then put them back within string literals, or we could reject programs where two tokens are separated only by zero-width characters and would be tokenized differently if those characters were removed.

(This choice seems to be a little at odds with UAX31-R3 "To meet this requirement, an implementation shall use Pattern_White_Space characters as all and only those characters interpreted as whitespace in parsing". That rule alternatively allows an implementation to use a profile to determine a set of whitespace characters, but doesn't seem to have a provision for requiring at least one non-zero-width character in any run of whitespace.)
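The second option above (rejecting tokens separated only by zero-width characters) can be approximated with a simple scan; the pattern below considers only the LTR and RTL marks and is an illustrative sketch, not a full tokenization-difference check:

```python
import re

# U+200E/U+200F are format characters, not White_Space, so `\S` would
# match them; the lookarounds therefore use an explicit class that
# excludes both whitespace and the marks themselves.
MARKS = "\u200e\u200f"
ONLY_MARKS_BETWEEN = re.compile(
    rf"(?<=[^\s{MARKS}])[{MARKS}]+(?=[^\s{MARKS}])"
)

def suspicious_zero_width(src: str) -> bool:
    """Flag a run of LTR/RTL marks that is the only separation between
    two visible characters, since deleting it could merge two tokens."""
    return bool(ONLY_MARKS_BETWEEN.search(src))

assert suspicious_zero_width("a\u200eb")        # only a mark between tokens
assert not suspicious_zero_width("a \u200eb")   # a real space is also present
```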


### Comment introducers rationale

We anticipate the possibility of adding additional kinds of comment in the
Contributor Author

Interesting. To me, either phrasing seems correct, but the current phrasing reads better. I'm thinking of this in the context of: "We have three kinds of comment right now, and might introduce additional kinds of comment in the future." For me, the use of the singular "comment" rather than the plural "comments" brings to mind the abstract notion of comments, rather than a particular set of extant comments being divided into kinds.

https://ell.stackexchange.com/a/1276/7958 has a similar take, but is citing "The Cambridge Guide to English Usage". Maybe this is a regional difference? Do you consider the current form to be wrong, or just unusual?

### Block comment alternatives

We considered various different options for block comments. Our primary goal
was to permit commenting out a large body of Carbon code, which may or may not
Contributor Author

Do you have a concrete alternative in mind?

@jonmeow jonmeow mentioned this pull request Nov 24, 2020
chandlerc added a commit that referenced this pull request Dec 8, 2020
The only change here is to update the fuzzer build extension path.

The main original commit message:

> Add an initial lexer. (#17)
>
> The specific logic here hasn't been updated to track the latest
> discussed changes, much less implement many aspects of things like
> Unicode support.
>
> However, this should lay out a reasonable framework and set of APIs.
> It gives an idea of the overall lexer architecture being proposed. The
> actual lexing algorithm is a relatively boring and naive hand written
> loop. It may make sense to replace this with something generated or
> other more advanced approach in the future, getting the implementation
> right was not the primary goal here. Instead, the focus was entirely
> on the architecture, encapsulation, APIs, and the testing
> infrastructure.
>
> The architecture of the lexer differs from "classical" high
> performance lexers in compilers. A high level summary:
>
> -   It is eager rather than lazy, lexing an entire file.
> -   Tokens intrinsically know their source location.
> -   Grouping lexical symbols are tracked within the lexer.
> -   Indentation is tracked within the lexer.
>
> Tracking of grouping and indentation is intended to simplify the
> strategies used for recovery of mismatched grouping tokens, and
> eventually use indentation.
>
> Folding source location into the token itself simplifies the data
> structures significantly, and doesn't lose any fidelity due to the
> absence of a preprocessor with token pasting.
>
> The fact that this is an eager lexer instead of a lazy lexer is
> designed to simplify the implementation and testing of the lexer (and
> subsequent components). There is no reason to expect Carbon to lex so
> many tokens that there are significant locality advantages of lazy
> lexing. Moreover, if we want comparable performance benefits, I think
> pipelining is a much more promising architecture than laziness. For
> now, the simplicity is a huge win.
>
> Being eager also makes it easy for us to use extremely dense memory
> encodings for the information about lexed tokens. Everything is
> created in a dense array, and small indices are used to identify each
> token within the array.
>
> There is a fuzzer included here that we have run extensively over the
> code, but currently toolchain bugs and Bazel limitations prevent it
> from easily building. I'm hoping myself or someone else can push on
> this soon and enable the fuzzer to at least build if not run fuzz
> tests automatically. We have a significant fuzzing corpus that I'll
> add in a subsequent commit as well.

This also includes the fuzzer whose commit message was:

> Add fuzz testing infrastructure and the lexer's fuzzer. (#21)
>
> This adds a fairly simple `cc_fuzz_test` macro that is specialized for
> working with LLVM's LibFuzzer. In addition to building the fuzzer
> binary with the toolchain's `fuzzer` feature, it also sets up the test
> execution to pass the corpus as file arguments which is a simple
> mechanism to enable regression testing against the fuzz corpus.
>
> I've included an initial fuzzer corpus as well. To run the fuzzer in
> an open ended fashion, and build up a larger corpus:
> ```shell
> mkdir /tmp/new_corpus
> cp lexer/fuzzer_corpus/* /tmp/new_corpus
> ./bazel-bin/lexer/tokenized_buffer_fuzzer /tmp/new_corpus
> ```
>
> You can parallelize the fuzzer by adding `-jobs=N` for N threads. For
> more details about running fuzzers, see the documentation:
> http://llvm.org/docs/LibFuzzer.html
>
> To minimize and merge any interesting new inputs:
> ```shell
> ./bazel-bin/lexer/tokenized_buffer_fuzzer -merge=1 \
>     lexer/fuzzer_corpus /tmp/new_corpus
> ```

Co-authored-by: Jon Meow <46229924+jonmeow@users.noreply.github.com>
@zygoloid zygoloid mentioned this pull request Jan 29, 2021
chandlerc added a commit that referenced this pull request Jun 28, 2022
@jonmeow jonmeow mentioned this pull request Apr 1, 2024
github-merge-queue bot pushed a commit that referenced this pull request May 1, 2024
We want to support legacy identifiers that overlap with new keywords
(for example, `base`). This is being called "raw identifier syntax"
using `r#<identifier>`, and is based on
[Rust](https://doc.rust-lang.org/reference/identifiers.html).

Note this proposal is derived from [Proposal #17: Lexical
conventions](#17).

Co-authored-by: zygoloid <richard@metafoo.co.uk>

---------

Co-authored-by: Carbon Infra Bot <carbon-external-infra@google.com>
chandlerc pushed a commit to chandlerc/carbon-lang that referenced this pull request May 2, 2024
CJ-Johnson pushed a commit to CJ-Johnson/carbon-lang that referenced this pull request May 23, 2024
Labels: proposal rfc (Proposal with request-for-comment sent out), proposal (A proposal)

Successfully merging this pull request may close these issues: [proposal] Lexical conventions