Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

what syntax should we use for pointer types? #523

Closed
zygoloid opened this issue May 5, 2021 · 19 comments
Closed

what syntax should we use for pointer types? #523

zygoloid opened this issue May 5, 2021 · 19 comments
Labels
leads question A question for the leads team

Comments

@zygoloid
Copy link
Contributor

zygoloid commented May 5, 2021

C and C++ use prefix * for dereference, infix * for multiplication, and use syntax that superficially resembles using postfix * for forming pointer types (although it's actually a prefix * applied to a declarator).

From a comment by @josh11b in #520 (comment):

In general it seems like we are pushing towards single meanings for operator symbols and keywords, so wanting to support multiple meanings for * is mostly to be consistent with C/C++. [...] I think we should be asking "given that we want to use binary * for multiplication, how should we designate pointer types and dereference?" I think there are a lot of options to consider there:

  • Use prefix * for dereference, postfix * for forming pointer types, and use whitespace to disambiguate whether operators are prefix, postfix, or infix.
  • Use prefix * for dereference, postfix * for forming pointer types, and use precedence rules to disambiguate.
  • Use prefix * for dereference, postfix * for forming pointer types, but assume we are not going to have any binary operators on pointer types, and use that to disambiguate (not sure this works, but maybe?).
  • Use prefix * for dereference, prefix & for forming pointer types (analogy to "type of the tuple is the tuple of the types" here is "type of a pointer is a pointer to the type").
  • Use prefix * for dereference, Ptr(T) for forming pointer types.
  • Use ^ for some pointer operations, stop using it for binary xor.
  • (Probably a lot of other choices.)
@zygoloid
Copy link
Contributor Author

zygoloid commented May 5, 2021

I think use of postfix * for forming pointer types may be important in order to maximize familiarity to people coming to Carbon from C++; I think we should be open to making sacrifices in other dimensions to improve the ergonomics of pointer types. However, we don't yet have a good idea of what kinds of pointer types we'll have. If we expect to be distinguishing pointer types along various different axes (uniqueness / ownership, nullability, indexability, mutability, and so on) then we should be careful that we don't give one specific set of options a privileged * syntax while giving other options poorer syntax. All else being equal, we want the default tool that people reach for to be the one with the nicest syntax, and if there isn't a default choice, then perhaps we shouldn't privilege the syntax for any particular option. If "raw pointers" are expected to be rare, then Ptr(T) seems fine to me, but I think we'll have the same question for whatever our most common location-of-an-object type is.

Regarding the options with a postfix *:

  • Use prefix * for dereference, postfix * for forming pointer types, and use whitespace to disambiguate whether operators are prefix, postfix, or infix.

Swift uses this approach for operators in general. I like this model in principle, assuming we apply it generally across the language, but some costs of doing so are discussed in #520.

  • Use prefix * for dereference, postfix * for forming pointer types, and use precedence rules to disambiguate.

This needs more exploration; it's not clear to me what constraints we need to apply to the precedence rules in order for this to be meaningful and parseable in linear time. For example, do we need postfix * and infix * to have the same precedence?

  • Use prefix * for dereference, postfix * for forming pointer types, but assume we are not going to have any binary operators on pointer types, and use that to disambiguate (not sure this works, but maybe?).

I think this is problematic:

fn F[Type$$ T](T x) {
  x * (0); // is this a function call where the callee is a pointer type? or a multiplication?
}
  • Use prefix * for dereference, prefix & for forming pointer types (analogy to "type of the tuple is the tuple of the types" here is "type of a pointer is a pointer to the type").

I think this is problematic:

fn F() {
  var Type x = Int;
  var auto y = &x; // is this the address of x (a Type*) or pointer-to-Int (a Type)?

@zygoloid zygoloid added this to Questions in Issues for leads via automation May 5, 2021
@chandlerc
Copy link
Contributor

chandlerc commented May 9, 2021

To add a touch of nuance to my stance here...

I think we should, for now, assume that we can have one of the first two options (either through a very cautious application of #520 or through more specialized precedence rules as suggested in the second point). And we should work at implementing #520 to understand how well that works, and if problems emerge, try implementing the next thing.

But to the other question, I'd not anchor on which specific pointer-like type (or semantic set) this syntax ends up referring to. As we refine and delineate the different semantics we want, we should point the familiar syntax of Int* towards whichever will best match the desired ergonomic priority between them.

What do folks think about this path forward? Any major concerns?

@carbon-language carbon-language deleted a comment from chandlerc May 11, 2021
@carbon-language carbon-language deleted a comment from zygoloid May 11, 2021
@mconst
Copy link
Contributor

mconst commented May 15, 2021

As an alternative, have we considered using prefix * to form pointer types? That's what Go and Rust do, and C++ programmers don't seem to have trouble picking it up (at least as far as I've seen).

It makes the parsing nice and simple, without requiring whitespace-sensitivity, and it matches existing languages.

Sorry if this has already been considered and rejected -- I just wanted to check, since I didn't see it here.

@jonmeow
Copy link
Contributor

jonmeow commented May 17, 2021

Pointers and arrays should probably be considered together if you're proposing different fixity; var int *ptrs[SIZE]; might be either an array of pointers or a pointer to an array. Operator precedence may determine the outcome, but developers may still have trouble reading it. (I might note, this is probably true of the & suggestion as well)

  • In Go, they do var ptrs [SIZE]*int;
  • In Rust, if I understand array syntax, it'd be let mut ptrs: [*i32; SIZE];

Note Go keeps the same call syntax (*ptrs[index]), implying that doesn't need to change -- the compiler can sort out some confusion as long as the variable declaration is easy to understand. My main point is that a prefix * for pointer types strongly discourages a postfix [] for declaring array types.

@mconst
Copy link
Contributor

mconst commented May 17, 2021

Yes, I agree that we should consider pointer and array types together.

Personally, I like Go's syntax here: it exactly matches how you read these types, and also has the advantage that multidimensional arrays keep their sizes in the same order as C++. (That is, [2][3]int in Go is an array of 2 arrays of 3 ints, which matches C++'s int[2][3]. If our array and pointer types are postfix, then we end up like D, where that array is int[3][2] instead!) But I think other options are fine too.

@jonmeow
Copy link
Contributor

jonmeow commented May 17, 2021

The irony here is Go's syntax makes me more sympathetic to post-fix * for dereferencing pointers.

That is to say, var list [2]int; being accessed through list[0] uses a prefix [] on definition, and postfix [0] for access. This would correspond to var ptr *int; being accessed through ptr*; i.e., prefix * on definition, postfix * on access. This may be confusing to developers, though, and I don't know whether it'd be hard to parse with binary *, but it may be more consistent with -> too. (but switching fixity of both * uses may confuse developers, too)

OTOH I would definitely avoid D's syntax. At a glance it's a syntax which is similar to C++, and superficially compatible, but just different enough to cause programming errors by C++ developers who don't realize the order swapped.

@geoffromer
Copy link
Contributor

... I don't know whether it'd be hard to parse with binary *, but it may be more consistent with -> too

If * for dereferencing were a postfix operator, we wouldn't need -> in the first place: foo->bar could instead be written as foo*.bar. However, I think that example also illustrates that this would indeed be hard to parse with binary *.

@josh11b
Copy link
Contributor

josh11b commented May 19, 2021

OTOH I would definitely avoid D's syntax. At a glance it's a syntax which is similar to C++, and superficially compatible, but just different enough to cause programming errors by C++ developers who don't realize the order swapped.

So we conclude from this that we don't want a postfix operator for declaring array types (assuming we want to specify array bounds as part of the syntax). Prefix array modifiers work better with a prefix pointer modifier to avoid needing parenthesis to disambiguate the order. Or we could do other array syntaxes.

@josh11b
Copy link
Contributor

josh11b commented May 19, 2021

I've tried to capture the various options for different pointer syntaxes in this doc.
[Note: link updated 2022-07-22]

@chandlerc
Copy link
Contributor

One thing that hasn't been discussed yet is avoiding [N][M] for multidimensional arrays, and instead using [N, M]. I've not spent lots of time thinking about it, but it seemed interesting to at least factor into considerations. Especially given a comment in one fo the discussions which pointed out that multidimensional arrays are significantly less common than one dimensional arrays and pointers.

@chandlerc
Copy link
Contributor

One thing that hasn't been discussed yet is avoiding [N][M] for multidimensional arrays, and instead using [N, M].

Should have said -- or any other syntax that is fundamentally different in its tradeoffs by not trying to use basic nesting to establish the dimensionality.

@chandlerc
Copy link
Contributor

After the open discussion today, both Richard and I were pretty happy with one specific outcome here, curious if anyone sees a problem here.

High level result:

  • We use postfix * for pointer types to look similar to C/C++ (if not quite the same).
  • We use prefix * for dereference to look similar to C/C++ (and x-> for (*x).).
  • Insist on some disambiguation rules to handle this and infix * for multiply. I'll make a concrete suggestion on the whitespace fixity issue, but the goal is to pick disambiguation rules that are as simple as we can get away with while supporting the these use cases.

Some related things:

  • What about arrays? Defer having a concrete solution today -- best done with insight into how we want slices to work. Until then, use Array(T, N).
  • What about multidimensional arrays? Same as for arrays. Can use Array(T, X, Y, Z) for a 3 dimensional array of Ts for now.
    • When looking at them, I feel like we should at least consider making an multidimensional array use syntax like Int*[X, Y, Z] where the bounds X, Y, and Z govern indices x, y, and z in arr[x, y, z]. I think this would be intuitive and easy for C++ programmers.
    • We can also consider making an array-of-arrays work in the principled way for postfix and significantly reduce confusion by insisting on parentheses. I feel like (Int*[X])[Y] looks like an array-of-arrays and I'd be unsurprised at how to index it even coming from C/C++. This might give a good way to avoid people falling into the trap of Int*[X][Y], and nudge them either to intuitive multidimensional syntax or towards understandable array-of-array syntax.
    • But again, we don't need an answer here today or for this issue. This is something to revisit along side slices.

@josh11b
Copy link
Contributor

josh11b commented May 25, 2021

I think we wanted Array(N, T) instead so that it nests properly.

I don't see how the parenthesis in (Int*[X])[Y] change anything? With or without the parenthesis, the type is "an array of Y arrays of X pointers to integers". Without the pointer you get:

var a : (Int[5])[3] = ...;
// Error: array index out of bounds, 4 >= 3.
Print(a[4][2]);

This just seems super dangerous and hard to diagnose.

@chandlerc
Copy link
Contributor

I think we wanted Array(N, T) instead so that it nests properly.

I would suggest not nesting for multidimensional arrays....

I don't see how the parenthesis in (Int*[X])[Y] change anything? With or without the parenthesis, the type is "an array of Y arrays of X pointers to integers". Without the pointer you get:

var a : (Int[5])[3] = ...;
// Error: array index out of bounds, 4 >= 3.
Print(a[4][2]);

I find this much less surprising than without paretheses.

This just seems super dangerous and hard to diagnose.

Maybe. Talking with @mconst it seemed significantly less surprising than w/o the parentheses.

But really, all of this is just thoughts to factor into future discussions, not something we should try to sort out now.

@zygoloid
Copy link
Contributor Author

zygoloid commented May 26, 2021

Given that we've switched from type name to name: type for bindings, we've removed the major case where there would be genuine ambiguity between postfix * and infix *. So perhaps we could allow both, and select the (unique) valid parse in each case?

In an attempt to demonstrate that this is unambiguous, I think the following local syntactic rule does the right thing in all cases (at least given our current set of grammar productions):

  • When an expression is followed by a sequence of N * tokens, the * tokens are all postfix operators forming an N-level pointer type if the next token is ,, ;, {, or any closing bracket, and otherwise the first is a binary * and the rest are prefix * operators.

This does rely on there being no binary operators that can syntactically accept T* as their left-hand operand, but in the framework proposed by #168, that seems feasible. It's also worth noting that the "sequence of N * tokens" part means the resulting grammar is not LR(k) for any k.

@geoffromer
Copy link
Contributor

It's also worth noting that the "sequence of N * tokens" part means the resulting grammar is not LR(k) for any k.

AFAICT that would mean Yacc can't parse it at all, and Bison can parse it only in its GLR mode, which sounds liable to add a good deal of unwelcome complexity:

  • Semantic value types are required to be POD, which means we're back to unions and manual memory management.
  • We have to carefully monitor the set of conflicts, but Bison provides very limited tools for doing so (especially in the old version of Bison our toolchain is currently stuck on)
  • Execution of semantic actions can be deferred, which means some features (such as YYERROR and certain interactions with lookahead tokens) may not behave as expected.
  • For some grammars and some inputs, parsing can take exponential time and space. I'm not sure how to determine how much of a problem that will be for our grammar.

@chandlerc
Copy link
Contributor

I think in #520 we're coming up with reasonably robust ideas for disambiguating this that won't hit the parsing challenges raised here by @geoffromer and @zygoloid so for now I'd like to suggest we resolve this for now with my proposed guidance above:

  • We use postfix * for pointer types to look similar to C/C++ (if not quite the same).
  • We use prefix * for dereference to look similar to C/C++ (and x-> for (*x).).
  • Insist on some disambiguation rules to handle this and infix * for multiply. The goal is to pick disambiguation rules that are as simple as we can get away with while supporting the these use cases.

For arrays, I would suggest continuing to write Array(...) in some form until we have enough motivation to figure this out (likely needing to figure out slices at roughly the same time). And I would suggest using some syntax that is super unambiguous for multi-dimensional arrays (rather than nesting things) until that is resolved as well.

Marking as closed as I think this is a workable state that at least Richard and I were comfortable with in discussion and we've not heard serious concerns with (other than making sure the parsing strategy doesn't go down the expensive paths of GLR and other things -- we're definitely going to invest in avoiding that!).

That said, if there is outstanding confusion here or anyone really disagrees with my read on things, please re-open!

Issues for leads automation moved this from Questions to Resolved Jun 2, 2021
zygoloid added a commit that referenced this issue Jun 22, 2021
#582)

The presence or absence of whitespace is used to determine which
operator is in use, following the rules described in #520.

Support for prefix * dereference operator follows #523.

Co-authored-by: Geoff Romer <gromer@google.com>
chandlerc pushed a commit that referenced this issue Jun 28, 2022
#582)

The presence or absence of whitespace is used to determine which
operator is in use, following the rules described in #520.

Support for prefix * dereference operator follows #523.

Co-authored-by: Geoff Romer <gromer@google.com>
@nigeltao
Copy link

I've tried to capture the various options for different pointer syntaxes in this doc.

FWIW, this doc is not publicly accessible.

@geoffromer
Copy link
Contributor

I've tried to capture the various options for different pointer syntaxes in this doc.

FWIW, this doc is not publicly accessible.

Here's a publically accessible mirror

@jonmeow jonmeow added the leads question A question for the leads team label Aug 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
leads question A question for the leads team
Projects
No open projects
Development

No branches or pull requests

7 participants