Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Notes from Prototyping #104 - Extend `For Each` statement with query comprehensions #380

Open
AnthonyDGreen opened this issue Feb 20, 2019 · 6 comments

Comments

Projects
None yet
3 participants
@AnthonyDGreen
Copy link
Contributor

commented Feb 20, 2019

I've prototyped the feature originally suggested in issue #104. See that issue for context/motivation.

A video summary demonstrating the prototype can be seen here:

Prototype - For Each Query Extensions

The source for the prototype can be found in branch prototypes/enhanced-for-each of my fork.

This prototype extends the For Each construct with:

  1. Support for all query operators, e.g.
For Each x In y
   Where P(x)

Next
  1. Support for declaring multiple variables sourced from multiple collections in a comma delimited list (#48), e.g.
For Each x In A, 
         y In B

Next

ADG: I like this syntax in general, but it's especially important to avoid requiring the awkward syntax:
For Each x In y From y In B which means exactly the same thing but... well, just look at it!

Note: (1) and (2) can be used in combination.

  1. Support for implicit line-continuation after the Each keyword, e.g.
For Each
    x In A
Join
    y In B
On
    x.Id Equals y.Id


Next

Language Design

Syntax

For Each ControlVariable In Expression [ , AdditionalRangeVariables ] [ QueryClauses ]
[ Statements ]
Next

  1. If AdditionalRangeVariables or QueryClauses is specified, ControlVariable must take the form of:

identifier [ As Type ]

This restriction is imposed by query expressions, which always declare new range variables and never reference existing l-values as range variables.

Note: It is illegal to use the nullable modifier ? without specifying an As clause today in both For Each control variables and range variable declarations.

  1. AdditionalRangeVariables is a comma-separated list of collection range variables of the form:

identifier2 [ As Type2 ] In expression2 [, identifier3 [ As Type3 ] In expression3 [ , ... ] ]

  1. QueryClauses is a list of query clauses (without restriction) as specified elsewhere (in the spec).

  2. NEW An optional newline may now follow the Each keyword.

    • This is consistent with what is permitted after From.
    • It is already legal for a newline to follow the In keyword.
  3. An optional newline may follow the comma before AdditionalRangeVariables and any comma between range variables.

  4. All other rules regarding line continuation (newlines) within and between collection range variables and query clauses is as specified in the spec.

Semantics

Given a For Each statement of the form:

For Each I [ As T ] In E [, AdditionalRangeVariables ] [ QueryClauses ]

Where AdditionalRangeVariables and/or QueryClauses are present:

  1. The identifier I and optional As clause specifying type T after the Each keyword, and the expression after the In keyword are first used to synthesize a query expression.
    a. If the optional As clause is specified, the query will be synthesized as if written From I As T In E
    b. If the optional As clause is omitted, the query will be synthesized as if written From I In E
    • Note: Option Infer Off does not cause an error to be reported in this case (as it would not were the query explicitly written).

c. E must therefore satisfy the Queryable pattern (have a Select method)
d. If AdditionalRangeVariables is specified the list of collection range variable declarators are appended to the From operator as though it had been written From I As T In E, I2 As T2 In T2, ...

  1. Any subsequent query clauses are then applied.
    a. If 0 range variables are in scope after completely constructing the query, an error is reported.
    b. If any range variables are unnamed, an error is reported.

  2. The result of synthesized query expression is then bound as the source collection
    a. The type of the synthesized query expression must therefore satisfy the Collection pattern (have a GetEnumerator method).

  3. If exactly one range variable, with name v, remains in scope after applying all query operators (i.e. the query resulted in a sequence of scalars), the loop is bound as if written:

Dim $temp = $QueryExpression
For Each v In QueryExpression
    ...
Next

Note: The Select operator may remove the original range variables from the result and/or introduce new range variables with different names. The name of the $current variable is the name in scope at the end, not at the start!

Note: In the common case only a single range variable is ever introduced and therefore has the same name as the identifier following the Each keyword (though this is not required). Normally, the collection expression of a For Each is bound after the declaration of the control variable and as such can refer to the control variable. In the case that the collection expression is a query the it is illegal to use the same name as the control variable for any range variable in the query expression because it is prohibited for range variables to shadow local variables. But as prototyped, the synthesized query expression is bound before the rest of the For Each statement so the declaration of the For Each control variable x does not cause collisions with the range variable x that appears in the query even if they both share the same name.

Note: IDE will have to do cascading rename/find-references to account for this behavior.

  1. If more than one range variable remains in scope after applying all query operators (i.e. the query resulted in a sequence of anonymous type objects), a new variable is declared within the body of the loop (i.e. captured per iteration) for each range variable and initialized with the value of the anonymous type property corresponding to that variable:
Dim $temp = $QueryExpression
For Each $current In QueryExpression
    Dim v1 = $current.v1,
        v2 = $current.v2, ...
    ...
Next

ADG: As implemented, I cache the result of $enumerator.Current in $current to avoid multiple calls to .Current and whatever side-effects they might have.

ADG: There's an optimization opportunity here, I think. If a range variabe is never written to we could elide it entirely and rewrite all references to it as references to the anonymous type property instead.

Other Issues

1. Can we optimize this to generate imperative code and not invoke method/allocate lambdas/enumerables/etc.?

In the case that the source collection is IQueryable, lowering should definitely bind it using LINQ. The performance implications of executing the query remotely on the server rather than in-memory using imperative code cannot be overstated.

In the case that the query expression is NOT using IQueryable and using well-known methods (e.g. System.Linq.Enumerable methods), we can generate all of the iteration code imperatively. I was not originally particular about this point but upon considering the number of use cases this would open up I'm not a strong proponent. The Roslyn compilers avoid LINQ for fear of allocations. Because we know that the intermediate IEnumerable objects don't "escape" we could elite them entirely and generate all the query operators imperatively. This is particularly seductive in the case of For Each x In A, y In B where no query operators are even used. Though as implemented this invokes SelectMany it's very understandable that a user would expect this to be equivalent to nested For Each loops.

That said, we should make this pattern-based by recognizing some set of attributes on the query methods to allow arbitrary collection types to opt into this optimized code-gen as the compiler data structures don't always implement IEnumerable(Of T). Another similar optimization we should open up to 3rd-party collections is the one where For Each over arrays and strings today use an indexer rather than the enumerator. Today code in Roslyn is sometimes forced to use For rather than For Each to get perf benefits that would be free were the code using plain arrays.

2. Multiple control variables

It's understandable that given that this works today:

For Each M(p).X In s1
    For Each N(p).Y In s2

    Next
Next

One might expect this to work:

For Each M(p).X In s1,
         N(p).Y In s2

Next

I can be done. It's not particularly hard but is utterly out of the question unless we specify that this code must transform into nested For Each loops. As implemented For Each x In A, y In B lowers to a call to SelectMany and as such neither declarator is permitted to reference an existing l-value. Doing this moves changing the code-gen strategy from the realm of optimization to requirement. And again, in the IQueryable case generating the code imperatively would be disastrous.

ADG: I'm very inclined to call this scenario unsupported.

3. Multiple next variables

All of these cases are logically consistent within the language as it works today. That is to say that they make sense. They might even be inexpensive.

Should this work?

For Each x In A,
         y In B
   Where x = y
  Select z = 1

    ...

Next z

Should this work?

For Each x In A,
         y In B
   Where x = y

    ...

Next y, x

Should this work?

For Each x In A,
         y In B
   Where x = y

    For Each z In C

        ...   
    
Next z, y, x

ADG: I hear Chuck Stoner screaming. I don't like NOT doing logically consistent things out of sheer laziness. But one issue I have with letting you refer to each range variable as though it were a separate variable in a separate loop is that it's not sustainable. There are other proposals #186 and #207 to support explicit Continue For x and Exit For y statements. Those statements do not make sense in combination with these variables at all.

4. If the query operators return a type that would otherwise be optimized during rewrite (vector arrays or strings), should we bail on optimized rewriting?

e.g. if the query results in an array should we still optimize the For Each into a For?

ADG: I suppose this optimization is in conflict with issue (1).

Syntax API Design

Shape

Design 1 (Currently implemented)

The current design aims to be minimally disruptive by extending the existing ForEachStatementSyntax with new optional children:

  • ForKeyword As SyntaxToken
  • EachKeyword As SyntaxToken
  • ControlVariable As SyntaxNode
  • InKeyword As SyntaxToken
  • Expression As ExpressionSyntax
  • Optional CommaToken As SyntaxToken
  • Optional AdditionalRangeVariables As SeparatedSyntaxList(Of CollectionRangeVariableSyntax)
  • Optional QueryClauses As SyntaxList(Of QueryClauseSyntax)

Note: All previous SyntaxFactory and Update methods had to be manually recreated for back-compat.

Pros/Cons

  • Pro: This means we don't need to duplicate a lot of code that today works with For Each blocks in order to work with a new special EnhancedForEachBlockSyntax. Of course, that code might make faulty assumptions when the new syntax changes semantics a little, e.g. assuming only 1 control variable is in scope or that it's declarator is the declarator of the ForEachStatementSyntax, etc.

  • Pro: The comma and the comma-separated list are separate to preserve the convention (invariant?) that comma separated syntax lists never begin with a comma (that commas are always at odd indices). There could be perf implications of changing this.

  • Con: This violates the convention that when two adjacent children (the comma and the additional variables) must either both be present or both be absent that they are represented as a single node. Doing so would require the creation of a CommaAddionalVariablesSyntax node.

  • Pro: Not adding such an intermediary also has the benefit that the Parent of the additional CollectionRangeVariableSyntax nodes is the ForEachStatementSyntax itself much like the control variable declarator, for whatever that's worth.

Design 2 - Query Expression on the Side

A For Each statement consists of the tokens For Each and either all the stuff that's there today or a query expression where the first query operator is a new query operator that's just like a FromQueryClause but without the From keyword.

For Each ( control [As Clause] In expression | QueryExpression )

ImplicitFromClause ::= CollectionRangeVariable [, ...]

Pros/Cons

  • Pro: Just adds a new query operator
  • Con: New single-use "implicit From" query clause.
  • Pro: Less disruptive to query code as range variables are always parented by query clauses (never ForEachStatementSyntax) and query clauses are always parented by query expressions.
  • Con: A little more disruptive to For Each statement because it retroactively makes the last 4 children of a ForEachStatementSyntax optional.

Design 3 - Query Clauses on the Side, hold the From keyword

A hybrid of (1) and (2). The ForEachStatementSyntax is still extended with query clauses, but the whole additional range variables thing is stuffed into a new special-case query clause just for this.

For Each control [As Clause] In expression [ QueryClause+ ]

ImplicitFromClause ::= , CollectionRangeVariable [, ...]

Pros/Cons

  • Pro: Collection range variables always parented by "query clause", but query clause sometimes parented by For Each statements.
  • Pro: None of the existing children become optional.
  • Con: New single-use "comma implicit From" query clause.

Other issues

While I'm in there...

Modifying the syntax API to support to support this is doable but disruptive. Not to the author of the feature, that work is easy enough. But it is a bit disruptive to consumers. A second more disruptive change would be supporting tuple deconstruction in queries and/or For Each. If we're going to be doing reconstructive surgery on ForEachStatementSyntax and/or CollectionRangeVariable, it could be nice to fix it all at once. e.g.

What if we unified all declarators across the VB Syntax API to "fit" the union of possibilities: (Expression | ModifiedIdentifier AsClause | Tuple), while still enforcing the current restrictions on what concrete derivatives can actually be used within any given construct?

Collisions with other potential upcoming features

I'm proposing we extend the existing For Each statement (and block) nodes to add queries. But with async streams in the wild it's foreseeable that we might either add an Await Each (or whatever) construct to the language. At which point is that another modification to the existing For Each or a new block? If there are now two statements, ForEachStatementSyntax and AwaitEachStatementSyntax does all this work have to be duplicated to support query clauses over async streams? And what about async query expressions (as opposed to synchronous query expressions over async sequences)? If there's an Async From expression or Async Where clause how does it combine with this feature across both synchronous and asynchronous For Each!?!?!?!? Do we have 2x2x2=8 new blocks?

ADG:* Of course not! But it's good to keep this in mind.*

@franzalex

This comment has been minimized.

Copy link

commented Feb 20, 2019

This is like having Python's list comprehension on VB.NET and that's great!
I also love how you've taken so much effort to flesh out this much detail.

I hope to read it in detail tomorrow and give my inputs (it's almost midnight here and I'm quite sleepy).

Once again, great work @AnthonyDGreen 😁👍🏾👍🏾

@Berrysoft

This comment has been minimized.

Copy link

commented Feb 21, 2019

@AnthonyDGreen I'm a little confused. Does the code in new syntax

For Each x In A, 
         y In B

Next

should be equal to

For Each x In A
    For Each y in B

    Next
Next

or

Dim e1 = A.GetEnumerator()
Dim e2 = B.GetEnumerator()
Do While e1.MoveNext() AndAlso e2.MoveNext()
    'Do with x = e1.Current and y = e2.Current
Loop

or some else?

@AnthonyDGreen

This comment has been minimized.

Copy link
Contributor Author

commented Feb 21, 2019

@Berrysoft ,

If I understand you right, you're asking if it's equivalent to Enumerable.SelectMany or Enumerable.Zip. It's the former. So the effect is like your first example though technically it's equal to:

Dim query = (From x In A, y In B)
For Each item In query
    Dim x = item.x,
        y = item.y

Next

However, in certain specific cases (the common cases) we could optimize it into a nested For Each.

@Berrysoft

This comment has been minimized.

Copy link

commented Feb 21, 2019

@AnthonyDGreen Oh, thanks! I think it's a good approach!

@ghost

This comment has been minimized.

Copy link

commented Feb 25, 2019

@AnthonyDGreen
I find merging foreach and fro confusing. Instead, it will be better to execute what we want in From directly. something like this

From x In {1, 2, 3, 4}
     Where x > 2
     Sub
        Console.WriteLine(x)
     End Sub

This From Sub expression doesn't return any result. It executes the sub directly on each result of the query as soon as it gets it..

Dim R = From x In {1, 2, 3, 4}
        Where x > 2

For each x in R
    Console.WriteLine(x)
Next
@ghost

This comment has been minimized.

Copy link

commented Feb 25, 2019

Noting that we can write thi today:

        For Each y In (
                       From x In {1, 2, 3, 4}
                       Where x > 2)
            Console.WriteLine(y)
        Next
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.