what syntax should we use for pointer types? #523

zygoloid · 2021-05-05T20:53:50Z

C and C++ use prefix * for dereference, infix * for multiplication, and use syntax that superficially resembles using postfix * for forming pointer types (although it's actually a prefix * applied to a declarator).

From a comment by @josh11b in #520 (comment):

In general it seems like we are pushing towards single meanings for operator symbols and keywords, so wanting to support multiple meanings for * is mostly to be consistent with C/C++. [...] I think we should be asking "given that we want to use binary * for multiplication, how should we designate pointer types and dereference?" I think there are a lot of options to consider there:

Use prefix * for dereference, postfix * for forming pointer types, and use whitespace to disambiguate whether operators are prefix, postfix, or infix.

Use prefix * for dereference, postfix * for forming pointer types, and use precedence rules to disambiguate.

Use prefix * for dereference, postfix * for forming pointer types, but assume we are not going to have any binary operators on pointer types, and use that to disambiguate (not sure this works, but maybe?).

Use prefix * for dereference, prefix & for forming pointer types (analogy to "type of the tuple is the tuple of the types" here is "type of a pointer is a pointer to the type").

Use prefix * for dereference, Ptr(T) for forming pointer types.

Use ^ for some pointer operations, stop using it for binary xor.

(Probably a lot of other choices.)

The text was updated successfully, but these errors were encountered:

zygoloid · 2021-05-05T21:17:43Z

I think use of postfix * for forming pointer types may be important in order to maximize familiarity to people coming to Carbon from C++; I think we should be open to making sacrifices in other dimensions to improve the ergonomics of pointer types. However, we don't yet have a good idea of what kinds of pointer types we'll have. If we expect to be distinguishing pointer types along various different axes (uniqueness / ownership, nullability, indexability, mutability, and so on) then we should be careful that we don't give one specific set of options a privileged * syntax while giving other options poorer syntax. All else being equal, we want the default tool that people reach for to be the one with the nicest syntax, and if there isn't a default choice, then perhaps we shouldn't privilege the syntax for any particular option. If "raw pointers" are expected to be rare, then Ptr(T) seems fine to me, but I think we'll have the same question for whatever our most common location-of-an-object type is.

Regarding the options with a postfix *:

Use prefix * for dereference, postfix * for forming pointer types, and use whitespace to disambiguate whether operators are prefix, postfix, or infix.

Swift uses this approach for operators in general. I like this model in principle, assuming we apply it generally across the language, but some costs of doing so are discussed in #520.

Use prefix * for dereference, postfix * for forming pointer types, and use precedence rules to disambiguate.

This needs more exploration; it's not clear to me what constraints we need to apply to the precedence rules in order for this to be meaningful and parseable in linear time. For example, do we need postfix * and infix * to have the same precedence?

Use prefix * for dereference, postfix * for forming pointer types, but assume we are not going to have any binary operators on pointer types, and use that to disambiguate (not sure this works, but maybe?).

I think this is problematic:

fn F[Type$$ T](T x) {
  x * (0); // is this a function call where the callee is a pointer type? or a multiplication?
}

Use prefix * for dereference, prefix & for forming pointer types (analogy to "type of the tuple is the tuple of the types" here is "type of a pointer is a pointer to the type").

I think this is problematic:

fn F() {
  var Type x = Int;
  var auto y = &x; // is this the address of x (a Type*) or pointer-to-Int (a Type)?

chandlerc · 2021-05-09T00:22:13Z

To add a touch of nuance to my stance here...

I think we should, for now, assume that we can have one of the first two options (either through a very cautious application of #520 or through more specialized precedence rules as suggested in the second point). And we should work at implementing #520 to understand how well that works, and if problems emerge, try implementing the next thing.

But to the other question, I'd not anchor on which specific pointer-like type (or semantic set) this syntax ends up referring to. As we refine and delineate the different semantics we want, we should point the familiar syntax of Int* towards whichever will best match the desired ergonomic priority between them.

What do folks think about this path forward? Any major concerns?

mconst · 2021-05-15T08:08:22Z

As an alternative, have we considered using prefix * to form pointer types? That's what Go and Rust do, and C++ programmers don't seem to have trouble picking it up (at least as far as I've seen).

It makes the parsing nice and simple, without requiring whitespace-sensitivity, and it matches existing languages.

Sorry if this has already been considered and rejected -- I just wanted to check, since I didn't see it here.

jonmeow · 2021-05-17T16:27:32Z

Pointers and arrays should probably be considered together if you're proposing different fixity; var int *ptrs[SIZE]; might be either an array of pointers or a pointer to an array. Operator precedence may determine the outcome, but developers may still have trouble reading it. (I might note, this is probably true of the & suggestion as well)

In Go, they do var ptrs [SIZE]*int;
In Rust, if I understand array syntax, it'd be let mut ptrs: [*i32; SIZE];

Note Go keeps the same call syntax (*ptrs[index]), implying that doesn't need to change -- the compiler can sort out some confusion as long as the variable declaration is easy to understand. My main point is that a prefix * for pointer types strongly discourages a postfix [] for declaring array types.

mconst · 2021-05-17T20:57:47Z

Yes, I agree that we should consider pointer and array types together.

Personally, I like Go's syntax here: it exactly matches how you read these types, and also has the advantage that multidimensional arrays keep their sizes in the same order as C++. (That is, [2][3]int in Go is an array of 2 arrays of 3 ints, which matches C++'s int[2][3]. If our array and pointer types are postfix, then we end up like D, where that array is int[3][2] instead!) But I think other options are fine too.

jonmeow · 2021-05-17T21:09:22Z

The irony here is Go's syntax makes me more sympathetic to post-fix * for dereferencing pointers.

That is to say, var list [2]int; being accessed through list[0] uses a prefix [] on definition, and postfix [0] for access. This would correspond to var ptr *int; being accessed through ptr*; i.e., prefix * on definition, postfix * on access. This may be confusing to developers, though, and I don't know whether it'd be hard to parse with binary *, but it may be more consistent with -> too. (but switching fixity of both * uses may confuse developers, too)

OTOH I would definitely avoid D's syntax. At a glance it's a syntax which is similar to C++, and superficially compatible, but just different enough to cause programming errors by C++ developers who don't realize the order swapped.

geoffromer · 2021-05-17T22:40:54Z

... I don't know whether it'd be hard to parse with binary *, but it may be more consistent with -> too

If * for dereferencing were a postfix operator, we wouldn't need -> in the first place: foo->bar could instead be written as foo*.bar. However, I think that example also illustrates that this would indeed be hard to parse with binary *.

josh11b · 2021-05-19T04:20:57Z

OTOH I would definitely avoid D's syntax. At a glance it's a syntax which is similar to C++, and superficially compatible, but just different enough to cause programming errors by C++ developers who don't realize the order swapped.

So we conclude from this that we don't want a postfix operator for declaring array types (assuming we want to specify array bounds as part of the syntax). Prefix array modifiers work better with a prefix pointer modifier to avoid needing parenthesis to disambiguate the order. Or we could do other array syntaxes.

josh11b · 2021-05-19T20:56:48Z

I've tried to capture the various options for different pointer syntaxes in this doc.
[Note: link updated 2022-07-22]

chandlerc · 2021-05-21T07:26:17Z

One thing that hasn't been discussed yet is avoiding [N][M] for multidimensional arrays, and instead using [N, M]. I've not spent lots of time thinking about it, but it seemed interesting to at least factor into considerations. Especially given a comment in one fo the discussions which pointed out that multidimensional arrays are significantly less common than one dimensional arrays and pointers.

chandlerc · 2021-05-21T07:27:08Z

One thing that hasn't been discussed yet is avoiding [N][M] for multidimensional arrays, and instead using [N, M].

Should have said -- or any other syntax that is fundamentally different in its tradeoffs by not trying to use basic nesting to establish the dimensionality.

chandlerc · 2021-05-25T04:17:00Z

After the open discussion today, both Richard and I were pretty happy with one specific outcome here, curious if anyone sees a problem here.

High level result:

We use postfix * for pointer types to look similar to C/C++ (if not quite the same).
We use prefix * for dereference to look similar to C/C++ (and x-> for (*x).).
Insist on some disambiguation rules to handle this and infix * for multiply. I'll make a concrete suggestion on the whitespace fixity issue, but the goal is to pick disambiguation rules that are as simple as we can get away with while supporting the these use cases.

Some related things:

What about arrays? Defer having a concrete solution today -- best done with insight into how we want slices to work. Until then, use Array(T, N).
What about multidimensional arrays? Same as for arrays. Can use Array(T, X, Y, Z) for a 3 dimensional array of Ts for now.
- When looking at them, I feel like we should at least consider making an multidimensional array use syntax like Int*[X, Y, Z] where the bounds X, Y, and Z govern indices x, y, and z in arr[x, y, z]. I think this would be intuitive and easy for C++ programmers.
- We can also consider making an array-of-arrays work in the principled way for postfix and significantly reduce confusion by insisting on parentheses. I feel like (Int*[X])[Y] looks like an array-of-arrays and I'd be unsurprised at how to index it even coming from C/C++. This might give a good way to avoid people falling into the trap of Int*[X][Y], and nudge them either to intuitive multidimensional syntax or towards understandable array-of-array syntax.
- But again, we don't need an answer here today or for this issue. This is something to revisit along side slices.

josh11b · 2021-05-25T04:59:00Z

I think we wanted Array(N, T) instead so that it nests properly.

I don't see how the parenthesis in (Int*[X])[Y] change anything? With or without the parenthesis, the type is "an array of Y arrays of X pointers to integers". Without the pointer you get:

var a : (Int[5])[3] = ...;
// Error: array index out of bounds, 4 >= 3.
Print(a[4][2]);

This just seems super dangerous and hard to diagnose.

chandlerc · 2021-05-25T05:35:24Z

I think we wanted Array(N, T) instead so that it nests properly.

I would suggest not nesting for multidimensional arrays....

I don't see how the parenthesis in (Int*[X])[Y] change anything? With or without the parenthesis, the type is "an array of Y arrays of X pointers to integers". Without the pointer you get:
var a : (Int[5])[3] = ...;
// Error: array index out of bounds, 4 >= 3.
Print(a[4][2]);

I find this much less surprising than without paretheses.

This just seems super dangerous and hard to diagnose.

Maybe. Talking with @mconst it seemed significantly less surprising than w/o the parentheses.

But really, all of this is just thoughts to factor into future discussions, not something we should try to sort out now.

zygoloid · 2021-05-26T21:13:33Z

Given that we've switched from type name to name: type for bindings, we've removed the major case where there would be genuine ambiguity between postfix * and infix *. So perhaps we could allow both, and select the (unique) valid parse in each case?

In an attempt to demonstrate that this is unambiguous, I think the following local syntactic rule does the right thing in all cases (at least given our current set of grammar productions):

When an expression is followed by a sequence of N * tokens, the * tokens are all postfix operators forming an N-level pointer type if the next token is ,, ;, {, or any closing bracket, and otherwise the first is a binary * and the rest are prefix * operators.

This does rely on there being no binary operators that can syntactically accept T* as their left-hand operand, but in the framework proposed by #168, that seems feasible. It's also worth noting that the "sequence of N * tokens" part means the resulting grammar is not LR(k) for any k.

geoffromer · 2021-05-27T20:03:35Z

It's also worth noting that the "sequence of N * tokens" part means the resulting grammar is not LR(k) for any k.

AFAICT that would mean Yacc can't parse it at all, and Bison can parse it only in its GLR mode, which sounds liable to add a good deal of unwelcome complexity:

Semantic value types are required to be POD, which means we're back to unions and manual memory management.
We have to carefully monitor the set of conflicts, but Bison provides very limited tools for doing so (especially in the old version of Bison our toolchain is currently stuck on)
Execution of semantic actions can be deferred, which means some features (such as YYERROR and certain interactions with lookahead tokens) may not behave as expected.
For some grammars and some inputs, parsing can take exponential time and space. I'm not sure how to determine how much of a problem that will be for our grammar.

chandlerc · 2021-06-02T10:00:04Z

I think in #520 we're coming up with reasonably robust ideas for disambiguating this that won't hit the parsing challenges raised here by @geoffromer and @zygoloid so for now I'd like to suggest we resolve this for now with my proposed guidance above:

We use postfix * for pointer types to look similar to C/C++ (if not quite the same).

We use prefix * for dereference to look similar to C/C++ (and x-> for (*x).).

Insist on some disambiguation rules to handle this and infix * for multiply. The goal is to pick disambiguation rules that are as simple as we can get away with while supporting the these use cases.

For arrays, I would suggest continuing to write Array(...) in some form until we have enough motivation to figure this out (likely needing to figure out slices at roughly the same time). And I would suggest using some syntax that is super unambiguous for multi-dimensional arrays (rather than nesting things) until that is resolved as well.

Marking as closed as I think this is a workable state that at least Richard and I were comfortable with in discussion and we've not heard serious concerns with (other than making sure the parsing strategy doesn't go down the expensive paths of GLR and other things -- we're definitely going to invest in avoiding that!).

That said, if there is outstanding confusion here or anyone really disagrees with my read on things, please re-open!

#582) The presence or absence of whitespace is used to determine which operator is in use, following the rules described in #520. Support for prefix * dereference operator follows #523. Co-authored-by: Geoff Romer <gromer@google.com>

nigeltao · 2022-07-20T23:59:49Z

I've tried to capture the various options for different pointer syntaxes in this doc.

FWIW, this doc is not publicly accessible.

geoffromer · 2022-07-27T22:15:08Z

I've tried to capture the various options for different pointer syntaxes in this doc.

FWIW, this doc is not publicly accessible.

Here's a publically accessible mirror

zygoloid mentioned this issue May 5, 2021

should we use whitespace-sensitive operator fixity? #520

Closed

zygoloid added this to Questions in Issues for leads via automation May 5, 2021

carbon-language deleted a comment from chandlerc May 11, 2021

carbon-language deleted a comment from zygoloid May 11, 2021

jonmeow mentioned this issue May 17, 2021

Generics overview #524

Merged

josh11b mentioned this issue May 21, 2021

Multicharacter bitwise operators? #545

Closed

chandlerc closed this as completed Jun 2, 2021

Issues for leads automation moved this from Questions to Resolved Jun 2, 2021

zygoloid mentioned this issue Jun 30, 2021

Operator tokens #601

Merged

josh11b mentioned this issue Sep 10, 2021

Values, variables, pointers, and references #821

Closed

nigeltao mentioned this issue Jul 20, 2022

Reconsider pointer syntax #1454

Closed

jwtowner mentioned this issue Jul 26, 2022

Please consider changing -> to . for pointer access #1725

Open

jonmeow added the leads question A question for the leads team label Aug 10, 2022

chandlerc mentioned this issue Apr 20, 2023

Values, variables, pointers, and references #2006

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

what syntax should we use for pointer types? #523

what syntax should we use for pointer types? #523

zygoloid commented May 5, 2021

zygoloid commented May 5, 2021 •

edited

Loading

chandlerc commented May 9, 2021 •

edited

Loading

mconst commented May 15, 2021 •

edited

Loading

jonmeow commented May 17, 2021

mconst commented May 17, 2021

jonmeow commented May 17, 2021 •

edited

Loading

geoffromer commented May 17, 2021

josh11b commented May 19, 2021 •

edited

Loading

josh11b commented May 19, 2021 •

edited

Loading

chandlerc commented May 21, 2021

chandlerc commented May 21, 2021

chandlerc commented May 25, 2021

josh11b commented May 25, 2021 •

edited

Loading

chandlerc commented May 25, 2021

zygoloid commented May 26, 2021 •

edited

Loading

geoffromer commented May 27, 2021

chandlerc commented Jun 2, 2021

nigeltao commented Jul 20, 2022

geoffromer commented Jul 27, 2022

what syntax should we use for pointer types? #523

what syntax should we use for pointer types? #523

Comments

zygoloid commented May 5, 2021

zygoloid commented May 5, 2021 • edited Loading

chandlerc commented May 9, 2021 • edited Loading

mconst commented May 15, 2021 • edited Loading

jonmeow commented May 17, 2021

mconst commented May 17, 2021

jonmeow commented May 17, 2021 • edited Loading

geoffromer commented May 17, 2021

josh11b commented May 19, 2021 • edited Loading

josh11b commented May 19, 2021 • edited Loading

chandlerc commented May 21, 2021

chandlerc commented May 21, 2021

chandlerc commented May 25, 2021

josh11b commented May 25, 2021 • edited Loading

chandlerc commented May 25, 2021

zygoloid commented May 26, 2021 • edited Loading

geoffromer commented May 27, 2021

chandlerc commented Jun 2, 2021

nigeltao commented Jul 20, 2022

geoffromer commented Jul 27, 2022

zygoloid commented May 5, 2021 •

edited

Loading

chandlerc commented May 9, 2021 •

edited

Loading

mconst commented May 15, 2021 •

edited

Loading

jonmeow commented May 17, 2021 •

edited

Loading

josh11b commented May 19, 2021 •

edited

Loading

josh11b commented May 19, 2021 •

edited

Loading

josh11b commented May 25, 2021 •

edited

Loading

zygoloid commented May 26, 2021 •

edited

Loading