Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spec: clarify comprehension scope issues #84

Closed
alandonovan opened this issue Oct 4, 2019 · 14 comments
Closed

spec: clarify comprehension scope issues #84

alandonovan opened this issue Oct 4, 2019 · 14 comments

Comments

@alandonovan
Copy link
Contributor

alandonovan commented Oct 4, 2019

The Go implementation of Starlark follows the Python3 semantics---specifically, the outermost for loop's sequence expression is evaluated outside the new lexical block---and its spec describes this in some detail (https://github.com/bazelbuild/starlark/blob/master/spec.md#comprehensions). By contrast, the Java implementation evaluates the for-operand sequence in the outer lexical block, which resembles Python2.

So:

$ cat a.star
x = [1]
dummy = [(1,)]
print([x for x in x])
print([x for x in x for _ in dummy])

$ python2 a.star
[1]
Traceback (most recent call last):
  File "a.star", line 4, in <module>
    print([x for x in x for _ in dummy])
TypeError: 'int' object is not iterable

$ python3 a.star
[1]
[1]

$ starlark a.star # Go
[1]
[1]

$ blaze-bin/third_party/bazel/src/main/java/com/google/devtools/starlark/Starlark a.star
[1]
Traceback (most recent call last):
        File "", line 4
                print([x for x in x for _ in dummy])
        File "", line 4, in print
                x
local variable 'x' is referenced before assignment.

We should decide on the semantics and make the implementations agree. I propose we aim for the Python3-compatible behavior, for "least surprise". (The difference is entirely in the name resolver.)

@alandonovan
Copy link
Contributor Author

@brandjon

@brandjon
Copy link
Member

brandjon commented Oct 4, 2019

I dislike the asymmetry of Python's semantics here. But this probably doesn't cause problems for implementors, or much surprise for users. Unlike other deviations for Python, e.g. not supporting implicit string concatenation, there's little tangible benefit to deviating here -- it'd just be "improving" the language in an abstract way.

So I'm fine with codifying the Python 3 behavior in Starlark if Alan is.

@laurentlb
Copy link
Contributor

  • The only user-visible difference is when the loop variable conflicts with the sequence expression?
  • This is a safe change (meaning, it cannot break any user)?

If so, I'm okay with it.

@brandjon
Copy link
Member

brandjon commented Oct 9, 2019

This is not 100% safe to existing code.

$ cat ~/b.star
x = [1]
dummy = [(1,)]
print([x for x in x for _ in dummy])
print([x for _ in dummy for x in x])

$ ./blaze-bin/third_party/bazel/src/main/java/com/google/devtools/starlark/Starlark ~/b.star
[1]
[1]

$ python3 ~/b.star
[1]
Traceback (most recent call last):
  File "/usr/local/google/home/brandjon/b.star", line 4, in <module>
    print([x for _ in dummy for x in x])
  File "/usr/local/google/home/brandjon/b.star", line 4, in <listcomp>
    print([x for _ in dummy for x in x])
UnboundLocalError: local variable 'x' referenced before assignment

I think the change for consistency's sake may be worth it anyway.

@alandonovan
Copy link
Contributor Author

alandonovan commented Jul 31, 2020

In addition, we need to specify the behavior of [() for x in () for y in y]. All four implementations (Python2,3, Starlark/{Go,Java}) agree that the answer is [], that is, for y in y is not a static error, but the behavior cannot be derived from the Starlark spec, and AFAICT Python doesn't even attempt to specify it.

Recall that the first for clause (for x in ()) is special, and its operand is resolved in the enclosing scope and thus cannot reference the new x. But for all subsequent for clauses, the vars are bound before the body. I suspect the reason for this behavior is that it resembles the desugaring of comprehensions to nested loop statements, in which the scope of all bindings is the entire function, including the lines before the binding.

Similarly, the operand of the second for clause may refer to vars bound by the third:

>>> [() for x in () for y in () for z in y]
[]

Again, all four implementations agree but the spec says nothing.

@stepancheg

@brandjon
Copy link
Member

Yes, in Python, the first for-clause's iterable expression is evaluated in the outer scope, while everything else occurs inside the closure created by the comprehension. This can be confirmed by looking at the generated bytecode.

I suspect the reason for this behavior is that it resembles the desugaring of comprehensions to nested loop statements

It's actually an intentional design decision to benefit generator expressions, whose implementation leaks over to other kinds of comprehensions where there's little or no benefit. In generator expressions, this causes the RHS of the first for clause to be evaluated eagerly and therefore catch errors more quickly. There's nothing more that can be eagerly evaluated without forcing iteration over some of the generator expression. (Presumably they're not worried about supporting generator expressions where the first RHS is undefined until a later statement executed between the generator's construction and its first evaluation.)

Where do we stand on this github issue? My memory is that we decided (and implemented) that comprehensions define variables in their own local scopes. I furthermore thought that we did away with Python 3's special casing of the first RHS, so that [() for x in x] is an error even when x is properly defined outside the comprehension. If so, I'm happy with that state of affairs.

adonovan added a commit to google/starlark-go that referenced this issue Jul 31, 2020
- first loop operand is resolved in enclosing environment
- all loop vars are bound before any expression is resolved

See bazelbuild/starlark#84

Change-Id: I2c4283719cf78cea27dccba1ffb0932857254909
@brandjon
Copy link
Member

I see from the PR discussion there's also a point about whether each for clause of a comprehension forms its own nested scope or whether they're all part of one flat scope? The difference being whether [None for _ in y for y in [...]] (where y is not otherwise defined) is a static or a dynamic error.

I'm not sure I have strong opinions about this. The two possible design philosophies are

  1. Match what Python does (preferring Python 3 over Python 2 where they differ), to allow for least surprise.

  2. "Improve" upon Python by simplifying away its warts, or by prohibiting features that lead to common errors (like banning implicit string concatenation and unparenthesized singleton tuples). The precedent here is more on the side of disallowing things than changing the semantics of things.

@alandonovan
Copy link
Contributor Author

Jon, we follow Python3: a comprehension creates a single lexical block. The first loop's operand is resolved in the enclosing environment. Then all vars are bound; their scope is the entire block. Then all expressions are resolved.

(BTW, terms: a binding has a scope, which is the region of text in which its name refers to it. The environment is a tree of blocks, which map names to bindings.)

In the interests of sanity I think I would prefer to stick with the Python behavior. If the inner loop body is ever executed, you'll discover fwd-ref mistakes quickly. BTW it is possible to exploit the Python behavior by having the second loop conditionally execute the reference to the var of the third loop only on its second iteration [... for y in ([1] if firsttime() else z) for z in ...] but no-one should do that.

@brandjon
Copy link
Member

If the resolution is to do exactly the same scoping behavior as Python 3, that's fine be me, and it's just a matter of spec wording and implementation.

@stepancheg
Copy link
Contributor

I greatly appreciate increasing the strictness of the spec.

Replicating Python 3 behaviour is good.

Though I think Python 3 developers made a mistake by defining only one scope for the whole comprehension, we could do better (create scope for each for clause: that is probably what users intuitively expect).

(I guess Python did not bother with static error reporting due to its very dynamic nature, like locals() function or exec statement or for historical reasons, but Starlark is much stricter and can report errors better).

If the inner loop body is ever executed, you'll discover fwd-ref mistakes quickly.

This can be said about any static error. But a lot of definitions are very infrequently executed code.

As some evidence, when I enabled strict scope resolution in our repositories (each variable must be resolved to global or local), I have discovered a huge amount of broken code. Like this:

def create_target(name):
  if name.startswith("_"):
    fail("target name must not start with underscore: {}".format(target_name))
  ...

Can you spot an error here? But Starlark can, and that's great.

But either way is fine with me, it is not a life and death issue.

@ndmitchell
Copy link
Contributor

I asked @stepancheg for an example of how you can observe the scope difference - he replied with:

def foo():
    z = []
    print([x for x in [1] for y in z for z in []])
foo()

The z is in scope, but not yet bound. I find the Python3 behaviour here quite confusing - for in a list comp does different things each time. I'd be a fan of defining the scope the way it would be in a statically-typed language. And I agree with @stepancheg that detecting this statically (which Starlark would permit) is a great thing.

@brandjon
Copy link
Member

brandjon commented Aug 3, 2020

See my above comment on simplifying things versus matching Python behavior. Fixing this means yet another difference, and I'm dubious that this kind of error really occurs in practice. It's easier to deviate from Python when we're outright banning a feature that commonly leads to errors, rather than changing semantics of something for the sake of purity. That said, I don't feel super strongly.

alandonovan pushed a commit to google/starlark-go that referenced this issue Aug 4, 2020
- first loop operand is resolved in enclosing environment
- all loop vars are bound before any expression is resolved

See bazelbuild/starlark#84
@alandonovan
Copy link
Contributor Author

The remaining task is to copy the changes from google/starlark-go#299 into this repo.

@alandonovan
Copy link
Contributor Author

The remaining task is to copy the changes from google/starlark-go#299 into this repo.

Done by #115.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants