Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add lambda syntax to grammar #6868

Merged

Conversation

stevenpyzhang
Copy link
Member

Description

Adds syntax for lambda functions to the KSQL grammar

LambdaFunctionExpression for representing a lambda function node
LambdaLiteral for representing a lambda variable

During the AstBuilder phase, the lambda variables are treated as UnqualifiedColumnReferences since I couldn't pass in a proper context, so I decided to convert them with the AstSanitizer.

Currently, lambdas won't work properly because the corresponding SqlToJavaVisitor methods haven't been implemented yet.

Testing done

Unit test
Manual test

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@stevenpyzhang stevenpyzhang force-pushed the lambda-function-syntax branch 4 times, most recently from f7b417c to cf79b2c Compare January 19, 2021 04:53
@stevenpyzhang stevenpyzhang force-pushed the lambda-function-syntax branch 3 times, most recently from 5f6c212 to f6ab6e0 Compare January 20, 2021 06:21
@stevenpyzhang stevenpyzhang marked this pull request as ready for review January 20, 2021 18:01
@stevenpyzhang stevenpyzhang requested a review from a team as a code owner January 20, 2021 18:01
Copy link
Contributor

@guozhangwang guozhangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I'm quite under-qualified yet to review this PR, since I'm still learning the AST code.. so I just left a few questions mainly for my own learning.

public void shouldSanitizeLambdaArguments() {
// Given:
final Statement stmt = givenQuery(
"SELECT TRANSFORM_ARRAY(X => X + 5) FROM TEST2;");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a valid statement in practice? I.e. without a column reference of the entity TEST, how should we know which column to apply this lambda expression?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This particular one wouldn't be valid.

We don't have to have a column reference though to apply the invocation function, we could have Select Transform_Array(ARRAY[1, 2], x => x + 5) FROM TEST2 emit changes;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does ARRAY[1, 2] mean here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://docs.ksqldb.io/en/latest/developer-guide/syntax-reference/#array

It's how we create an inline array in the language. In this case we're not using any of the rows from the source and instead just using a constant array for use in the function

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. So in this case it is irrelevant to TEST2's records, and would always return a single record as {key: null, value: [6,7]} right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup

@Test
public void shouldBuildLambdaFunction() {
// Given:
final SingleStatementContext stmt = givenQuery("SELECT TRANSFORM_ARRAY(Col4, X => X + 5) FROM TEST1;");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my own education: it seems that if Col is not specified, the sanitization phase would add a default column COL_0 to it, is that right? If yes, when is the sanitization phase triggered? And how would COL_0 be leverage in execution?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The santizer runs in the engine after parsing. The parser (which builds our internal AST) should be doing a direct translation from the antlr ast to our internal AST representation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! But I'm wondering, if Col4 is not specified, i.e.

SELECT TRANSFORM_ARRAY(X => X + 5) FROM TEST1

Then the sanitizer would add a COL_0 to the result AST, does that mean we default to the first column (rowkey?) if none is specified? I thought if it is not specified then the statement should fail.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the KSQL_COL_0 refers to the name of the column in the output of the query. We need to have a column name for each column and if the user didn't specify what the column name should be, we should generate a unique name for that output column

ksql> select 5,3 as TEST,6,time from KSQL_PROCESSING_LOG emit changes;

+----------------------------------+----------------------------------+----------------------------------+----------------------------------+
|KSQL_COL_0                        |TEST                              |KSQL_COL_1                        |TIME                              |
+----------------------------------+----------------------------------+----------------------------------+----------------------------------+

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

Copy link
Contributor

@rodesai rodesai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @stevenpyzhang! Left a few comments/questions inline.

@@ -1173,10 +1185,12 @@ public Node visitWhenClause(final SqlBaseParser.WhenClauseContext context) {

@Override
public Node visitFunctionCall(final SqlBaseParser.FunctionCallContext context) {
final List<Expression> expressionList = visit(context.expression(), Expression.class);
expressionList.addAll(visit(context.lambdaFunction(), Expression.class));
return new FunctionCall(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aren't we losing the ordering of the lambda w/ respect to the other expressions? I guess maybe it doesn't matter?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't care about ordering the lambdas with respect to the expressions we should probably put them in their own list.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to update the g4 so that all the lambda functions have to be after all the other function arguments

    | identifier'(' (expression (',' expression)* (',' lambdaFunction)*)? ')'             #functionCall

So then this assumption can be made soundly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me.

@Test
public void shouldBuildLambdaFunction() {
// Given:
final SingleStatementContext stmt = givenQuery("SELECT TRANSFORM_ARRAY(Col4, X => X + 5) FROM TEST1;");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The santizer runs in the engine after parsing. The parser (which builds our internal AST) should be doing a direct translation from the antlr ast to our internal AST representation.

import java.util.Optional;

@Immutable
public class LambdaLiteral extends Literal {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question rather than suggestion - why do we need to use another type for these? An alternative would be to just use UnqualifiedColumnReferenceExp (and maybe rename it to UnqualifiedReferenceExp). Do you think it makes it easier to implement lambdas if they have different types?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think when implementing the type reference, it makes the code cleaner. I'd imagine when doing type checking in ExpressionTypeManager, instead of having to pass down in the context (similar to what's happening in AstSanitizer) that we're in a lambda function expression, we can just have a specific visitor for a LambdaLiteral.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After working through some of the type stuff, I do think it's somewhat necessary to have the additional type. Right now, the array/map input is passed in as an UnqualifiedReferenceExp and it's nice to be able to distinguish that from the lambda inputs

return result.get();
}

final LambdaContext lambdaContext =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LambdaContext doesn't belong here - the rewriter shouldn't make any interpretation of the context. From this class's POV context is an opaque cookie to pass around to the rewriter so it can keep some internal state.

What's the intent of this LambdaContext stuff? is it just to screen for conflicts in variable names w/ enclosing variable names or column names? The better way to structure this would be to keep track of these in AstSanitizer. So from there you could do:

class AstSanitizer {
...
    private static final class ExpressionRewriterPlugin extends
      VisitParentExpressionVisitor<Optional<Expression>, Context<SanitizerContext>> {

          @Override
          public Optional<Expression> visitUnqualifiedColumnReference(
              final UnqualifiedColumnReferenceExp expression,
              final SanitizerContext ctx
          ) {
              final ColumnName columnName = expression.getColumnName();
              if (ctx.lambdaArgs().contains(columnName.text())) {
                  return Optional.of(new LambdaLiteral(columnName.text()));
              }
              ...
          }

          @Override
          public Optional<Expression> visitLambdaExpression(
              final LambdaFunctionExpression expression,
              final SanitizerContext ctx
           ) {
              dataSourceExtractor.getAllSources().forEach(aliasedDataSource -> {
              for (String argument : expression.getArguments()) {
                  if (aliasedDataSource.getDataSource().getSchema().columns().stream()
                      .map(column -> column.name().text()).collect(Collectors.toList())
                      .contains(argument)) {
                       throw new KsqlException("Lambda argument can't be a column name.");
                  }
                  if (ctx.lambdaArgs().contains(argument)) {
                       throw new KsqlException...
                  }
                  ctx.addLambdaArg(argument);
              }
              });
              return Optional.empty();  // just let the rewriter reconstruct the lambda - this function just validates and tracks args
          }
    }

    private static class SanitizerContext {
        Set<String> lambdaArgs = new HashSet<>();

        private void addLambdaArg(final String name) {
            reservedNames.add(name);
        }

        private Set<String> lambdaArgs() {
            return ImmutableSet.copyOf(lambdaArgs);
        }
    }
}

Then, this method should just become a dumb rewrite like everything else, and we can get rid of LambdaContext

Copy link
Contributor

@guozhangwang guozhangwang Jan 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then in https://github.com/confluentinc/ksql/pull/6868/files#diff-e5f9abbf3a2878a1dce09039136805764ecc31404a7eb27f2d1b49b9e038030dR81 we would just pass in a new SanitizerContext() instead of a null.

I'm wondering, if we can just reuse LambdaContext instead of creating a new SanitizerContext here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we could, though tracking the lambda args seems specific to AstSanitizer at the moment.

@stevenpyzhang stevenpyzhang force-pushed the lambda-function-syntax branch 4 times, most recently from 0a5ad24 to a8fc46a Compare January 27, 2021 07:03
Copy link
Contributor

@guozhangwang guozhangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the new commit. I just have a few minor comments left, otherwise LGTM.

@@ -1173,10 +1185,12 @@ public Node visitWhenClause(final SqlBaseParser.WhenClauseContext context) {

@Override
public Node visitFunctionCall(final SqlBaseParser.FunctionCallContext context) {
final List<Expression> expressionList = visit(context.expression(), Expression.class);
expressionList.addAll(visit(context.lambdaFunction(), Expression.class));
return new FunctionCall(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me.

@Test
public void shouldBuildLambdaFunction() {
// Given:
final SingleStatementContext stmt = givenQuery("SELECT TRANSFORM_ARRAY(Col4, X => X + 5) FROM TEST1;");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

),
Optional.empty())
))));
}

@Test
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since in g4 we allow multiple expressions followed by multiple lambda functions, could we also add a test for the error case where number of expressions is not equal to the number of lambda functions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can make this restriction, not all invocation functions would fit this requirement. From KLIP-30

reduce_map(map, s, (k, v, s) => s) - Reduces the input Map down to a single value. s is the initial state and is passed into the scope of the lambda function. Each invocation returns a new value for s, which the next invocation will receive. reduce_map will return the final value of s.

^ this one wouldn't fit this restriction

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks.

Copy link
Contributor

@rodesai rodesai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@stevenpyzhang stevenpyzhang merged commit dd3f365 into confluentinc:master Jan 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants