-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add lambda syntax to grammar #6868
feat: add lambda syntax to grammar #6868
Conversation
f7b417c
to
cf79b2c
Compare
5f6c212
to
f6ab6e0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly I'm quite under-qualified yet to review this PR, since I'm still learning the AST code.. so I just left a few questions mainly for my own learning.
public void shouldSanitizeLambdaArguments() { | ||
// Given: | ||
final Statement stmt = givenQuery( | ||
"SELECT TRANSFORM_ARRAY(X => X + 5) FROM TEST2;"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a valid statement in practice? I.e. without a column reference of the entity TEST
, how should we know which column to apply this lambda expression?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This particular one wouldn't be valid.
We don't have to have a column reference though to apply the invocation function, we could have Select Transform_Array(ARRAY[1, 2], x => x + 5) FROM TEST2 emit changes;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does ARRAY[1, 2]
mean here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://docs.ksqldb.io/en/latest/developer-guide/syntax-reference/#array
It's how we create an inline array in the language. In this case we're not using any of the rows from the source and instead just using a constant array for use in the function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. So in this case it is irrelevant to TEST2
's records, and would always return a single record as {key: null, value: [6,7]} right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup
...tion/src/main/java/io/confluent/ksql/execution/expression/tree/LambdaFunctionExpression.java
Outdated
Show resolved
Hide resolved
@Test | ||
public void shouldBuildLambdaFunction() { | ||
// Given: | ||
final SingleStatementContext stmt = givenQuery("SELECT TRANSFORM_ARRAY(Col4, X => X + 5) FROM TEST1;"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my own education: it seems that if Col is not specified, the sanitization phase would add a default column COL_0
to it, is that right? If yes, when is the sanitization phase triggered? And how would COL_0
be leverage in execution?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The santizer runs in the engine after parsing. The parser (which builds our internal AST) should be doing a direct translation from the antlr ast to our internal AST representation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! But I'm wondering, if Col4
is not specified, i.e.
SELECT TRANSFORM_ARRAY(X => X + 5) FROM TEST1
Then the sanitizer would add a COL_0
to the result AST, does that mean we default to the first column (rowkey?) if none is specified? I thought if it is not specified then the statement should fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the KSQL_COL_0
refers to the name of the column in the output of the query. We need to have a column name for each column and if the user didn't specify what the column name should be, we should generate a unique name for that output column
ksql> select 5,3 as TEST,6,time from KSQL_PROCESSING_LOG emit changes;
+----------------------------------+----------------------------------+----------------------------------+----------------------------------+
|KSQL_COL_0 |TEST |KSQL_COL_1 |TIME |
+----------------------------------+----------------------------------+----------------------------------+----------------------------------+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @stevenpyzhang! Left a few comments/questions inline.
@@ -1173,10 +1185,12 @@ public Node visitWhenClause(final SqlBaseParser.WhenClauseContext context) { | |||
|
|||
@Override | |||
public Node visitFunctionCall(final SqlBaseParser.FunctionCallContext context) { | |||
final List<Expression> expressionList = visit(context.expression(), Expression.class); | |||
expressionList.addAll(visit(context.lambdaFunction(), Expression.class)); | |||
return new FunctionCall( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aren't we losing the ordering of the lambda w/ respect to the other expressions? I guess maybe it doesn't matter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't care about ordering the lambdas with respect to the expressions we should probably put them in their own list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to update the g4 so that all the lambda functions have to be after all the other function arguments
| identifier'(' (expression (',' expression)* (',' lambdaFunction)*)? ')' #functionCall
So then this assumption can be made soundly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me.
@Test | ||
public void shouldBuildLambdaFunction() { | ||
// Given: | ||
final SingleStatementContext stmt = givenQuery("SELECT TRANSFORM_ARRAY(Col4, X => X + 5) FROM TEST1;"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The santizer runs in the engine after parsing. The parser (which builds our internal AST) should be doing a direct translation from the antlr ast to our internal AST representation.
import java.util.Optional; | ||
|
||
@Immutable | ||
public class LambdaLiteral extends Literal { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question rather than suggestion - why do we need to use another type for these? An alternative would be to just use UnqualifiedColumnReferenceExp
(and maybe rename it to UnqualifiedReferenceExp
). Do you think it makes it easier to implement lambdas if they have different types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think when implementing the type reference, it makes the code cleaner. I'd imagine when doing type checking in ExpressionTypeManager
, instead of having to pass down in the context (similar to what's happening in AstSanitizer) that we're in a lambda function expression, we can just have a specific visitor for a LambdaLiteral.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After working through some of the type stuff, I do think it's somewhat necessary to have the additional type. Right now, the array/map input is passed in as an UnqualifiedReferenceExp
and it's nice to be able to distinguish that from the lambda inputs
return result.get(); | ||
} | ||
|
||
final LambdaContext lambdaContext = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LambdaContext
doesn't belong here - the rewriter shouldn't make any interpretation of the context. From this class's POV context is an opaque cookie to pass around to the rewriter so it can keep some internal state.
What's the intent of this LambdaContext stuff? is it just to screen for conflicts in variable names w/ enclosing variable names or column names? The better way to structure this would be to keep track of these in AstSanitizer
. So from there you could do:
class AstSanitizer {
...
private static final class ExpressionRewriterPlugin extends
VisitParentExpressionVisitor<Optional<Expression>, Context<SanitizerContext>> {
@Override
public Optional<Expression> visitUnqualifiedColumnReference(
final UnqualifiedColumnReferenceExp expression,
final SanitizerContext ctx
) {
final ColumnName columnName = expression.getColumnName();
if (ctx.lambdaArgs().contains(columnName.text())) {
return Optional.of(new LambdaLiteral(columnName.text()));
}
...
}
@Override
public Optional<Expression> visitLambdaExpression(
final LambdaFunctionExpression expression,
final SanitizerContext ctx
) {
dataSourceExtractor.getAllSources().forEach(aliasedDataSource -> {
for (String argument : expression.getArguments()) {
if (aliasedDataSource.getDataSource().getSchema().columns().stream()
.map(column -> column.name().text()).collect(Collectors.toList())
.contains(argument)) {
throw new KsqlException("Lambda argument can't be a column name.");
}
if (ctx.lambdaArgs().contains(argument)) {
throw new KsqlException...
}
ctx.addLambdaArg(argument);
}
});
return Optional.empty(); // just let the rewriter reconstruct the lambda - this function just validates and tracks args
}
}
private static class SanitizerContext {
Set<String> lambdaArgs = new HashSet<>();
private void addLambdaArg(final String name) {
reservedNames.add(name);
}
private Set<String> lambdaArgs() {
return ImmutableSet.copyOf(lambdaArgs);
}
}
}
Then, this method should just become a dumb rewrite like everything else, and we can get rid of LambdaContext
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then in https://github.com/confluentinc/ksql/pull/6868/files#diff-e5f9abbf3a2878a1dce09039136805764ecc31404a7eb27f2d1b49b9e038030dR81 we would just pass in a new SanitizerContext()
instead of a null
.
I'm wondering, if we can just reuse LambdaContext
instead of creating a new SanitizerContext
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose we could, though tracking the lambda args seems specific to AstSanitizer
at the moment.
...tion/src/main/java/io/confluent/ksql/execution/expression/tree/LambdaFunctionExpression.java
Outdated
Show resolved
Hide resolved
0a5ad24
to
a8fc46a
Compare
a8fc46a
to
1ee3a57
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the new commit. I just have a few minor comments left, otherwise LGTM.
ksqldb-parser/src/main/antlr4/io/confluent/ksql/parser/SqlBase.g4
Outdated
Show resolved
Hide resolved
@@ -1173,10 +1185,12 @@ public Node visitWhenClause(final SqlBaseParser.WhenClauseContext context) { | |||
|
|||
@Override | |||
public Node visitFunctionCall(final SqlBaseParser.FunctionCallContext context) { | |||
final List<Expression> expressionList = visit(context.expression(), Expression.class); | |||
expressionList.addAll(visit(context.lambdaFunction(), Expression.class)); | |||
return new FunctionCall( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me.
@Test | ||
public void shouldBuildLambdaFunction() { | ||
// Given: | ||
final SingleStatementContext stmt = givenQuery("SELECT TRANSFORM_ARRAY(Col4, X => X + 5) FROM TEST1;"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it.
), | ||
Optional.empty()) | ||
)))); | ||
} | ||
|
||
@Test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since in g4 we allow multiple expressions followed by multiple lambda functions, could we also add a test for the error case where number of expressions is not equal to the number of lambda functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can make this restriction, not all invocation functions would fit this requirement. From KLIP-30
reduce_map(map, s, (k, v, s) => s) - Reduces the input Map down to a single value. s is the initial state and is passed into the scope of the lambda function. Each invocation returns a new value for s, which the next invocation will receive. reduce_map will return the final value of s.
^ this one wouldn't fit this restriction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Description
Adds syntax for lambda functions to the KSQL grammar
LambdaFunctionExpression for representing a lambda function node
LambdaLiteral for representing a lambda variable
During the AstBuilder phase, the lambda variables are treated as UnqualifiedColumnReferences since I couldn't pass in a proper context, so I decided to convert them with the AstSanitizer.
Currently, lambdas won't work properly because the corresponding SqlToJavaVisitor methods haven't been implemented yet.
Testing done
Unit test
Manual test
Reviewer checklist