[CALCITE-3679] Allow lambda expressions in SQL queries #3502

Merged
merged 1 commit into from Dec 19, 2023

Contributor

@macroguo-ghy macroguo-ghy commented Nov 4, 2023

This PR is about three things:

  1. Implementing parsing, validating, and converting SqlNode to RelNode, and RelNode to SqlNode for lambda expressions.
  2. Implementing the conversion of lambda expression RexNode to Linq4j Expression to execute higher order functions.
  3. Adding the EXISTS function, which is enabled in the Spark library.

Parse

A lambda expression is accepted only as an argument of a function call.

select func(1, (x, y) -> x + y); // accepted
select (x,y) -> x + y; // illegal for now

Validate

I added a new type, LAMBDA_EXPRESSION, to describe the type of a lambda expression.
For example:

  public static final SqlFunction EXISTS =
      SqlBasicFunction.create("EXISTS",
          ReturnTypes.BOOLEAN_NULLABLE,
          OperandTypes.sequence("EXISTS(<ARRAY>, <FUNCTION(ANY)->BOOLEAN>)",
              OperandTypes.ARRAY, OperandTypes.lambda(SqlTypeFamily.BOOLEAN, SqlTypeFamily.ANY)));

The lambda in the EXISTS function has a function type whose input parameter is of type ANY and whose return type is BOOLEAN.
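To make the shape of such a higher-order function concrete, here is a minimal plain-Java sketch of what an EXISTS(array, predicate) call computes, assuming the usual meaning that the result is true when the predicate holds for at least one element. The class and method names are hypothetical, this is not Calcite code, and SQL NULL handling is deliberately left out.

```java
import java.util.List;
import java.util.function.Predicate;

public class ExistsSketch {
  /**
   * Returns true if the predicate holds for at least one element.
   * Assumption: NULL elements and a NULL array are not modelled here.
   */
  public static <T> boolean exists(List<T> array, Predicate<? super T> predicate) {
    for (T element : array) {
      if (predicate.test(element)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(exists(List.of(1, 2, 3), x -> x > 2)); // prints true
    System.out.println(exists(List.of(1, 2, 3), x -> x > 5)); // prints false
  }
}
```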
The lambda expression is validated twice. The first time, Calcite validates the lambda expression by itself, which means that all of its parameters are treated as type ANY. For example, in (x, y) -> x + y, both x and y are of type ANY, so we can deduce that x + y is also ANY, and the type of (x, y) -> x + y is FUNCTION(ANY, ANY) -> ANY. Then we need to know whether the type of the lambda is legal in the enclosing function; this is what LambdaExpressionOperandTypeChecker does. It resets the types of the parameters, infers the type of the lambda expression again, and checks whether the derived type satisfies the checker. After the check, we can infer the real type of the lambda expression.

RelNode

In this part, I created a class, RexLambdaRef, to reference a parameter of a lambda expression. From another perspective, we can view lambda expressions as "constants" rather than tables, so we cannot directly use RexInputRef or RexLocalRef.
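One way to read the point about parameter references: a reference inside a lambda body resolves against the arguments passed to the lambda, not against the fields of an input row. The toy expression tree below sketches that idea in plain Java. All class names are hypothetical stand-ins and none of this is the actual Rex API.

```java
import java.util.List;

public class LambdaRefSketch {
  /** Toy expression, evaluated against the lambda's own arguments. */
  interface Expr {
    int eval(int[] lambdaArgs);
  }

  /** Refers to a lambda parameter by ordinal; the name is kept for display only. */
  static final class ParamRef implements Expr {
    final int index;
    final String name;

    ParamRef(int index, String name) {
      this.index = index;
      this.name = name;
    }

    @Override public int eval(int[] lambdaArgs) {
      return lambdaArgs[index];
    }
  }

  /** Toy addition node. */
  static final class Plus implements Expr {
    final Expr left;
    final Expr right;

    Plus(Expr left, Expr right) {
      this.left = left;
      this.right = right;
    }

    @Override public int eval(int[] lambdaArgs) {
      return left.eval(lambdaArgs) + right.eval(lambdaArgs);
    }
  }

  /** A parameter list plus a body that refers to those parameters. */
  static final class Lambda {
    final List<ParamRef> parameters;
    final Expr body;

    Lambda(List<ParamRef> parameters, Expr body) {
      this.parameters = parameters;
      this.body = body;
    }

    int apply(int... args) {
      if (args.length != parameters.size()) {
        throw new IllegalArgumentException("expected " + parameters.size() + " arguments");
      }
      return body.eval(args);
    }
  }

  /** Builds the toy equivalent of (x, y) -> x + y. */
  public static Lambda plusLambda() {
    ParamRef x = new ParamRef(0, "x");
    ParamRef y = new ParamRef(1, "y");
    return new Lambda(List.of(x, y), new Plus(x, y));
  }

  public static void main(String[] args) {
    System.out.println(plusLambda().apply(2, 3)); // prints 5
  }
}
```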


I referred to #1733, so I listed Ritesh as a co-author.

@mihaibudiu
Contributor

Is there a design document for this feature?
This looks extremely useful, but before I review it I would like to have a big picture understanding of how it's supposed to work.
For example, is there type inference for lambdas?

Moreover, if EXISTS is independent of the lambdas, perhaps it should be in a separate PR.

@mihaibudiu
Contributor

Leaving aside the design document, I don't see any changes to the documentation files either.

@mihaibudiu
Contributor

mihaibudiu commented Nov 7, 2023

The real test of whether the design works properly is to allow a lambda expression wherever a function name is allowed.
So you should be allowed to say SELECT (x -> x + 1)(age) FROM Person or
SELECT ":" || ((x -> x || x)(SUBSTRING(name, 3, 3))) FROM Person WHERE (x -> x > 5)(age).
I think there is limited value to having a lambda expression that only works in some very restricted contexts.

@macroguo-ghy
Contributor Author

Hi @mihaibudiu, thanks for your review. I have made some commits to address your feedback:

  • Add docs about lambda expressions and higher-order functions.
  • Add some negative tests in SqlParserTest.
  • Log a new jira case CALCITE-6116 to track the EXISTS function.

Contributor

@julianhyde julianhyde left a comment

This looks very good.

I've suggested quite a few changes to names and scope (e.g. removing public) to make this easier to maintain.

Can you add a quidem test? I would like to see this working end-to-end.

*/
public class RexLambdaRef extends RexInputRef {

private final String paramName;
Contributor

could this be an ordinal?

Contributor Author

After making RexLambdaRef extend RexSlot, we can now use the name, so there is no need for the paramName anymore.

import org.apache.calcite.sql.SqlKind;

/**
* Variable which references a field of a lambda expression.
Contributor

s/which/that/

Contributor Author

Fixed

}

@Override public boolean equals(@Nullable Object o) {
if (this == o) {
Contributor

Can you use the compact form 'this == o || o instanceof RexLambdaExpression && ...'?

No need for Objects.equals; use expression.equals, because expression is never null.
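As an illustration of the compact form suggested here, the sketch below applies it to a small stand-in class. The class name and its String-typed fields are hypothetical simplifications of the real node class. Because && binds tighter than ||, the expression needs no extra parentheses, and the short-circuiting && keeps the cast from running when the instanceof test fails.

```java
import java.util.List;
import java.util.Objects;

public final class CompactEqualsSketch {
  private final List<String> parameters; // never null
  private final String expression;       // never null

  public CompactEqualsSketch(List<String> parameters, String expression) {
    this.parameters = List.copyOf(parameters);
    this.expression = Objects.requireNonNull(expression, "expression");
  }

  @Override public boolean equals(Object o) {
    // Fields are non-null, so they can be compared with equals() directly.
    return this == o
        || o instanceof CompactEqualsSketch
        && expression.equals(((CompactEqualsSketch) o).expression)
        && parameters.equals(((CompactEqualsSketch) o).parameters);
  }

  @Override public int hashCode() {
    return Objects.hash(expression, parameters);
  }

  public static void main(String[] args) {
    CompactEqualsSketch a = new CompactEqualsSketch(List.of("x"), "x + 1");
    CompactEqualsSketch b = new CompactEqualsSketch(List.of("x"), "x + 1");
    System.out.println(a.equals(b));       // prints true
    System.out.println(a.equals("x + 1")); // prints false
  }
}
```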

Contributor Author

Fixed

//~ Constructors -----------------------------------------------------------

RexLambdaExpression(List<RexLambdaRef> parameters, RexNode expression) {
this.expression = expression;
Contributor

ImmutableList.copyOf(parameters)
requireNonNull(expression)

Contributor Author

Fixed

/**
* Represents a lambda expression.
*/
public class RexLambdaExpression extends RexNode {
Contributor

rename RexLambdaExpression to RexLambda. give example.

@@ -1060,6 +1064,9 @@ void AddArg(List<SqlNode> list, ExprContext exprContext) :
)
(
e = Default()
|
LOOKAHEAD((SimpleIdentifierOrList() | <LPAREN> <RPAREN>) <LAMBDA_OPERATOR>)
e = LambdaExpression()
Contributor

How expensive is this LOOKAHEAD?

LOOKAHEAD(2)
<LPAREN> <RPAREN> { parameters = SqlNodeList.EMPTY; }
|
parameters = SimpleIdentifierOrList()
Contributor

should there be a rule that allows (), x and (x [, y]*)?

Contributor Author

I extract this rule to SimpleIdentifierOrListOrEmpty.

@@ -8787,6 +8817,7 @@ void NonReservedKeyWord2of3() :
| < NE2: "!=" >
| < PLUS: "+" >
| < MINUS: "-" >
| < LAMBDA_OPERATOR: "->" >
Contributor

just LAMBDA

Contributor Author

Fixed.

}
final SqlNode expression = toSql(program, lambda.getExpression());
return new SqlLambdaExpression(POS, parameters, expression);
case LAMBDA_REF:
Contributor

add blank line before

Contributor Author

Fixed.

.fails("Cannot apply '(?s).*HIGHER_ORDER_FUNCTION' to arguments of type "
+ "'HIGHER_ORDER_FUNCTION\\(<INTEGER>, <FUNCTION\\(ANY, ANY, ANY\\) -> ANY>\\)'.*");
}

Contributor

need to see more tests for EXISTS. including negative tests. wrong type of arguments, wrong number of arguments, arguments of wrong name, arguments of wrong case.

check type derivation (if you didn't do it in SqlOperatorTest)

@@ -134,6 +134,18 @@ RelDataType createMapType(
RelDataType keyType,
RelDataType valueType);

/**
* Create a lambda expression type. Lambda expressions are functions that
Contributor

s/Create/Creates/

Contributor Author

Fixed.

return Objects.hash(expression, parameters);
}

@Override public String toString() {
Contributor

why override toString rather than computeDigest?

I think it would be clearer and more concise if you use a for loop rather than functional style.

Contributor

Best thing is to do what RexMRAggCall does - make the class final and initialize digest in the constructor
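A minimal sketch of that suggestion, using a hypothetical stand-in class rather than the real node class: the class is final, and the digest string is built once in the constructor with a plain for loop and then returned from toString. The digest format shown, "(x, y) -> x + y", is an assumption made for illustration and is not taken from the PR.

```java
import java.util.List;
import java.util.Objects;

public final class LambdaDigestSketch {
  private final List<String> parameters;
  private final String expression;
  private final String digest;

  public LambdaDigestSketch(List<String> parameters, String expression) {
    this.parameters = List.copyOf(parameters);
    this.expression = Objects.requireNonNull(expression, "expression");

    // Compute the digest once, with a simple loop instead of a stream pipeline.
    StringBuilder sb = new StringBuilder("(");
    for (int i = 0; i < this.parameters.size(); i++) {
      if (i > 0) {
        sb.append(", ");
      }
      sb.append(this.parameters.get(i));
    }
    sb.append(") -> ").append(this.expression);
    this.digest = sb.toString();
  }

  @Override public String toString() {
    return digest;
  }

  public static void main(String[] args) {
    System.out.println(new LambdaDigestSketch(List.of("x", "y"), "x + y")); // prints (x, y) -> x + y
  }
}
```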

Contributor Author

Fixed; I now use a for loop instead.
RexLambdaExpression extends RexNode, whereas computeDigest is defined in RexCall.

Contributor Author

@macroguo-ghy macroguo-ghy left a comment

Thanks for your review and encouragement, @julianhyde. I have made some changes based on your comments, and I will add an end-to-end test later.

@@ -1033,6 +1034,9 @@ void AddArg0(List<SqlNode> list, ExprContext exprContext) :
)
(
e = Default()
|
LOOKAHEAD((SimpleIdentifierOrList() | <LPAREN> <RPAREN>) <LAMBDA_OPERATOR>)
e = LambdaExpression()
Contributor Author

The most important token is ->, so the cost depends on how many tokens come before ->.
For example:

() -> true  // equivalent to LOOKAHEAD(3)
(a, b) -> a + b // equivalent to LOOKAHEAD(6)
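The token counts in these two examples can be reproduced with a small plain-Java sketch. The tokenizer below is a deliberately simplified, hypothetical stand-in for the real SQL lexer: it knows only identifiers, the -> token, and single-character punctuation. It counts the tokens scanned up to and including ->.

```java
import java.util.ArrayList;
import java.util.List;

public class LookaheadCostSketch {
  /** Splits text into toy tokens: "->", identifier runs, and single punctuation characters. */
  static List<String> tokenize(String s) {
    List<String> tokens = new ArrayList<>();
    int i = 0;
    while (i < s.length()) {
      char c = s.charAt(i);
      if (Character.isWhitespace(c)) {
        i++;
      } else if (s.startsWith("->", i)) {
        tokens.add("->");
        i += 2;
      } else if (Character.isLetterOrDigit(c) || c == '_') {
        int start = i;
        while (i < s.length()
            && (Character.isLetterOrDigit(s.charAt(i)) || s.charAt(i) == '_')) {
          i++;
        }
        tokens.add(s.substring(start, i));
      } else {
        tokens.add(String.valueOf(c));
        i++;
      }
    }
    return tokens;
  }

  /** Number of tokens scanned up to and including "->", or -1 if there is none. */
  public static int tokensThroughArrow(String s) {
    int index = tokenize(s).indexOf("->");
    return index < 0 ? -1 : index + 1;
  }

  public static void main(String[] args) {
    System.out.println(tokensThroughArrow("() -> true"));      // prints 3
    System.out.println(tokensThroughArrow("(a, b) -> a + b")); // prints 6
  }
}
```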

@@ -77,6 +77,7 @@ public enum SqlTypeFamily implements RelDataTypeFamily {
CURSOR,
COLUMN_LIST,
GEO,
LAMBDA_EXPRESSION,
Contributor Author

Fixed.

super(parent);
this.lambdaExpr = lambdaExpr;

// default parameter type is ANY
Contributor Author

In fact, the lambda expression is validated twice.

Given only a simple lambda expression such as a -> a, we cannot infer its type; this is why I say "default parameter type is ANY".

But when the lambda expression appears in a function call, we validate it a second time in LambdaExpressionOperandTypeChecker.

For example, given a function test_func1 whose type checker is

OperandTypes.sequence("TEST_FUNC1(INTEGER, FUNCTION(STRING) -> NUMERIC)",
        OperandTypes.family(SqlTypeFamily.INTEGER),
        OperandTypes.function(SqlTypeFamily.NUMERIC, SqlTypeFamily.STRING))

Assuming there is an SQL statement,

select test_func1(1, x -> x / 2);

The first validation:
We assume the type of x is ANY, so x -> x / 2 is legal.
The second validation:
Based on the type checker, the type of x is STRING, so x / 2 is not a valid expression.
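The two passes for x -> x / 2 can be mimicked with a toy type checker in plain Java. Everything here is a hypothetical simplification: the Type enum, the rule for '/', and the parameter-type map (loosely modelled on the idea of registering parameter types in the lambda scope) are assumptions for illustration, not the validator's real logic.

```java
import java.util.Map;

public class TwoPassValidationSketch {
  enum Type { ANY, INTEGER, STRING }

  /** Toy typing rule for left / right: strings are rejected, ANY propagates. */
  static Type divide(Type left, Type right) {
    if (left == Type.STRING || right == Type.STRING) {
      throw new IllegalArgumentException(
          "Cannot apply '/' to arguments of type " + left + " and " + right);
    }
    if (left == Type.ANY || right == Type.ANY) {
      return Type.ANY;
    }
    return Type.INTEGER;
  }

  /** Derives the type of the body x / 2 from the currently registered type of x. */
  static Type deriveBodyType(Map<String, Type> parameterTypes) {
    return divide(parameterTypes.get("x"), Type.INTEGER);
  }

  /** Returns whether the body type-checks under the given parameter types. */
  public static boolean isValid(Map<String, Type> parameterTypes) {
    try {
      deriveBodyType(parameterTypes);
      return true;
    } catch (IllegalArgumentException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // First pass: the parameter type defaults to ANY, so the body is accepted.
    System.out.println(isValid(Map.of("x", Type.ANY)));    // prints true
    // Second pass: the operand type checker says x is a STRING, so x / 2 is rejected.
    System.out.println(isValid(Map.of("x", Type.STRING))); // prints false
  }
}
```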

*/
public class SqlLambdaExpressionScope extends ListScope {
private final SqlLambdaExpression lambdaExpr;
private final Map<String, RelDataType> parameterTypes;
Contributor Author

We need to register the parameter types in SqlLambdaExpressionScope when validating the lambda expression the second time, so I use the map as a cache.


/**
* SQL lambda expression type.
*/
Contributor Author

I have renamed this class to FunctionSqlType. However, all the other RelDataType classes are public, so I think FunctionSqlType should be consistent with them.

@@ -623,4 +625,24 @@ public CompositeFunction() {
return typeFactory.createSqlType(SqlTypeName.BIGINT);
}
}

private static final SqlFunction HIGHER_ORDER_FUNCTION =
new SqlFunction("HIGHER_ORDER_FUNCTION",
Contributor Author

Fixed

@mihaibudiu
Contributor

Do the lambdas allow capturing values?
SELECT EXISTS(T.a, x -> x < T.VALUE) FROM T
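For readers unfamiliar with the term, "capturing" here means that the lambda body refers to a value from the enclosing row (T.VALUE) in addition to its own parameter x. The plain-Java analogue below shows that shape only. The Row class and method name are hypothetical, and the sketch says nothing about whether the PR supports such captures.

```java
import java.util.List;

public class CaptureSketch {
  /** Hypothetical row of table T with an array column a and a scalar column value. */
  static final class Row {
    final List<Integer> a;
    final int value;

    Row(List<Integer> a, int value) {
      this.a = a;
      this.value = value;
    }
  }

  /** Analogue of EXISTS(T.a, x -> x < T.VALUE) for one row; the lambda captures row.value. */
  static boolean existsBelowValue(Row row) {
    return row.a.stream().anyMatch(x -> x < row.value);
  }

  public static void main(String[] args) {
    System.out.println(existsBelowValue(new Row(List.of(1, 5), 3))); // prints true
    System.out.println(existsBelowValue(new Row(List.of(4, 5), 3))); // prints false
  }
}
```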

@macroguo-ghy
Contributor Author

Thank you for your review. I apologize for the delayed response; I have been quite busy over the past few weeks.

Regarding this PR, I have addressed almost all of the review comments. Next, I will write more tests. Once that is done, I will split this PR into two parts: one for lambdas and another for the EXISTS function.

Co-authored-by: Ritesh Kapoor <riteshkapoor.opensource@gmail.com>

sonarcloud bot commented Dec 19, 2023

Quality Gate passed

The SonarCloud Quality Gate passed, but some issues were introduced.

12 New issues
0 Security Hotspots
72.1% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

@macroguo-ghy macroguo-ghy merged commit 6f64865 into apache:main Dec 19, 2023
17 checks passed