Skip to content

Latest commit

 

History

History
308 lines (237 loc) · 14.1 KB

p0555.md

File metadata and controls

308 lines (237 loc) · 14.1 KB

Operator precedence

Pull request

Table of contents

Problem

Most expression-oriented languages use a strict hierarchy of precedence levels. That approach is error-prone, as it assigns meaning to programs that developers may either not understand or may misunderstand.

Background

Given an expression, we need to be able to infer its structure: what are the operands of each of the operators? This may be ambiguous in the absence of rules that determine which operator is preferred, such as in the expression a $ b ^ c: is this (a $ b) ^ c or a $ (b ^ c)?

Starting with a sequence of operators and non-operator terms, we can completely determine the structure of an expression by determining which operator in our sequence will be the root of the parse tree, splitting the expression at that point, and recursively determining the structure of each subexpression. The operator that forms the root of the parse tree is said to have the lowest precedence in the expression.

Traditionally, this is accomplished by assigning a precedence level to each operator and devising a total ordering over precedence levels. For example, we could assign a higher precedence level to an infix * operator than an infix + operator. With that choice of precedence levels, an infix * operator would bind tighter than an infix + operator, regardless of the order in which they appear.

This approach is well-understood, but is problematic. For example, in C++, expressions such as a & b << c * 3 are valid, but the meaning of such an expression is unlikely to be readily apparent to many developers. Worse, for cases such as a & 3 == 3, there is a clear intended meaning, namely (a & 3) == 3, but the actual meaning is something else -- in this case, a & (3 == 3).

Because the precedence rules are not widely known and are sometimes quite surprising, parentheses are used as a matter of course for certain kinds of C++ expressions. However, the absence of such parentheses is not diagnosed in all cases, even by many linting tools, and forgetting those parentheses can lead to subtle bugs.

Proposal

Do not have a total ordering of precedence levels. Instead, define a partial ordering of precedence levels. Expressions using operators that lack relative orderings must be disambiguated by the developer, for example by adding parentheses; when a program's meaning depends on an undefined relative ordering of two operators, it will be rejected due to ambiguity.

The default behavior for any new operator is for it to be unordered with respect to all other operators, thereby requiring parentheses when combining that operator with any other operator. Precedence rules should be added only if it is reasonable to expect most or all developers who regularly use Carbon to reliably remember the precedence rule.

Details

Notational convention

For pedagogical purposes, our documentation will use Hasse diagrams to represent operator precedence partial orders, where operators with lower precedence are considered less than (and therefore depicted lower than and connected to) operators with higher precedence. In our diagrams, an enclosing arrow will be used to show associativity within precedence groups, if there is any, with a left-to-right arrow meaning a left-associative operator.

For example:

Example operator precedence diagram

... would depict a higher-precedence * operator and a lower-precedence + operator, both of which are left-associative, and a non-associative << operator. The == operator is lower precedence than all of those operators, and parentheses are higher precedence than all of those operators.

With those precedence rules:

  • a + b * c would be parsed as a + (b * c), because + has lower precedence than *.
  • a + b << c would be an error, requiring parentheses, because the precedence levels of + and << are unordered.

A python script to generate these diagrams is included with this proposal.

When to add precedence edges

Given a program whose meaning is ambiguous to a reader, it is preferable to reject the program rather than to arbitrarily pick a meaning. For Carbon's operators, we should only add an ordering between two operators if there is a logical reason for that ordering, not merely to provide some answer. Goal: for every combination of operators, either it should be reasonable to expect most or all developers who regularly use Carbon to reliably remember the precedence, or there should not be a precedence rule.

As an example, consider the expression a * b ^ c, where * is assumed to be a multiplication operator and ^ is assumed to be a bitwise XOR operation. We should reject this expression because there is no logical reason to perform either operator first and it would be unreasonable to expect Carbon developers to remember an arbitrary tie-breaker between the two options.

This still leaves open the question of how high a bar of knowledge we put on our developers (what is reasonable for us to expect?). We can use experience from C++ to direct this decision: just as many developers who regularly use C++ do not remember the relative precedence of && vs ||, and & vs |, and & vs <<, and so on, we shouldn't expect them to remember similar precedence rules in Carbon. If we are in doubt, omitting a precedence rule and waiting for real-world experience should be preferred.

Parsing with a partial precedence order

A traditional, totally-ordered precedence scheme can be implemented by an operator precedence parser:

  • Keep track of the current left-hand-side operand and an ambient precedence level. The ambient precedence level is the precedence of the operator whose operand is being parsed, or a placeholder "lowest" precedence level when parsing an expression that is not the operand of an operator.
  • When a new operator is encountered, its precedence is compared to the ambient precedence level:
    • If its precedence is higher than the ambient precedence level, then recurse ("shift") with that as the new ambient precedence level to form the right-hand side of the new operator. After forming the right-hand side, build an operator expression from the current left-hand side operand and the right-hand side operand; that is the new current left-hand side.
    • If its precedence is equal to the ambient precedence level, then use the associativity of that precedence level to determine what to do:
      • If the operator is left-associative, build an operator expression.
      • If the operator is right-associative, recurse.
      • If the operator is non-associative, produce an error.
    • If its precedence is lower than the ambient precedence level, return the expression formed so far; it's the complete operand to an earlier operator.

This is, for example, the strategy currently used in Clang.

The above algorithm is only suited to parsing in the case where precedence levels are totally ordered, because it does not say what to do if the new precedence is not comparable with the ambient precedence. However, the algorithm can easily be adapted to also parse with a partial precedence order by adding one more case:

  • If the precedence level of the new operator is not comparable with the ambient precedence level, produce an ambiguity error.

The key observation here is that, if we ever see ... a * b ^ c ..., where * and ^ have incomparable precedence, no later tokens can ever resolve the ambiguity, so we can diagnose it immediately. Sketch proof: If there were a valid parse tree for this expression, one of * and ^ must end up as an ancestor of the other. But in a valid parse tree, along the path from one operator to the other, precedences monotonically increase, so by transitivity of the precedence partial ordering, the ancestor operator has lower precedence than the descendent operator.

An operator precedence parser with a partial ordering of precedence levels has been implemented as a proof-of-concept in the Carbon toolchain.

Operator precedence partial ordering can also be implemented in yacc / bison parser generators by using a variant of the precedence climbing method. For example, here is a yacc grammar for the Hasse diagram shown above:

expression: compare_expression | compare_operand;

compare_expression: compare_lhs EQEQ compare_operand { $$ = ($1 == $3); };
compare_lhs: compare_expression | compare_operand;
compare_operand: add_expression | multiply_expression | shift_expression | primary_expression;

add_expression: add_lhs '+' add_operand { $$ = ($1 + $3); };
add_lhs: add_expression | add_operand;
add_operand: multiply_expression | multiply_operand;

multiply_expression: multiply_lhs '*' multiply_operand { $$ = ($1 * $3); };
multiply_lhs: multiply_expression | multiply_operand;
multiply_operand: primary_expression;

shift_expression: shift_lhs LSH shift_operand { $$ = ($1 << $3); };
shift_lhs: shift_expression | shift_operand;
shift_operand: primary_expression;

primary_expression: INT | '(' expression ')' { $$ = $2; };

Note that some care must be taken to avoid grammar ambiguities. Under the precedence climbing method, a primary_expression would be a shift_expression, a multiply_expression, and an add_expression, and therefore interpreting a primary_expression as an expression would be ambiguous: we could take either the shift_expression path or the multiply_expression path through the grammar. The above formulation avoids this ambiguity by excluding primary_expression from add_expression and shift_expression, and instead listing it as a distinct production for compare_operand. A yacc grammar such as the above can be produced systematically for any precedence partial ordering.

A complete example of a yacc parser with operator precedence partial ordering is available alongside this proposal.

Rationale based on Carbon's goals

  • Software and language evolution

    • The advice to not supply an operator precedence relationship if in doubt is based on the idea that it's easier to add a precedence rule as an evolutionary step than to remove one.
  • Code that is easy to read, understand, and write

    • This proposal aims to support this goal by ensuring that the operator expressions that are used in programs are readily understood by practitioners, by making unreadable constructs invalid.

Alternatives considered

Total order

We could provide a total order over operator precedence. This proposal is not strictly in conflict with doing so, if every ordering relationship is justified, but in practice we expect there to be pairs of operators for which there is no obvious precedence relationship.

For:

  • This is established practice across most languages.

Against:

  • This practice is a common source of bugs in the case where an arbitrary or bad choice is made.

Different precedence for different operands

We could provide different precedence relationships for the left and right sides of infix operators. For example, we could allow multiplication on the left of a << operator but not on the right. This is precedented in C++: the ? in a ?: allows a comma operator on its right but not on its left.

For:

  • This may allow some additional cases that would be clear and unsurprising.

Against:

  • The resulting rules would be more challenging to learn, and it seems likely that they would fail the test that most developers who regularly use Carbon know the rules.

This proposal is not incompatible with adopting such a direction in future if we find motivation to do so.

Require less than a partial order

We could require something weaker than a partial ordering of precedence levels. This proposal assumes the following two points are useful for human comprehension of operator precedence:

  • The lowest-precedence operator does not depend on the relative order of operators in the expression (except as a tie-breaker when there are multiple operators with the same precedence, where the associativity of the operator is considered).
  • If an ^ expression can appear indirectly (but unparenthesized) within an $ expression, then an ^ expression can appear directly within an $ expression.
  • If the lowest-precedence operator in a $ b ^ c is $, and the lowest-precedence operator in b ^ c # d is ^, then the lowest-precedence operator in a $ b ^ c # d is $.

These assumptions lead to the conclusion that operator precedence should form a partial order over equivalence classes of operators. However, these assumptions could be wrong.

If we find motivation to select rules that violate the above assumptions, we should reconsider the approach of using a partial precedence ordering, but no motivating case is currently known.