[BEAM-10925] Enable user-defined Java scalar functions in ZetaSQL. #13891

ibzib · 2021-02-03T22:45:56Z

This completes the proof of concept. Note however that when UDFs and
built-in ZetaSQL operators are mixed, the program will crash without
warning.

R: @amaliujia @apilloud


amaliujia · 2021-02-03T23:28:12Z

Created a JIRA to track the mixed Java UDF and built-in operator case: https://issues.apache.org/jira/browse/BEAM-11747

amaliujia · 2021-02-03T23:33:59Z

...sql/zetasql/src/main/java/org/apache/beam/sdk/extensions/sql/zetasql/ZetaSQLPlannerImpl.java


    ResolvedStatement statement;
    ParseResumeLocation parseResumeLocation = new ParseResumeLocation(sql);
    do {
      statement = analyzer.analyzeNextStatement(parseResumeLocation, options, catalog);
      if (statement.nodeKind() == RESOLVED_CREATE_FUNCTION_STMT) {
        ResolvedCreateFunctionStmt createFunctionStmt = (ResolvedCreateFunctionStmt) statement;
-        udfBuilder.put(createFunctionStmt.getNamePath(), createFunctionStmt);
+        String functionGroup = SqlAnalyzer.getFunctionGroup(createFunctionStmt);
+        if (SqlAnalyzer.USER_DEFINED_FUNCTIONS.equals(functionGroup)) {


Can you remind me whether USER_DEFINED_FUNCTIONS here refers to SQL-native UDFs? If so, can you rename USER_DEFINED_FUNCTIONS to USER_DEFINED_SQL_FUNCTIONS?

Can you remind me whether USER_DEFINED_FUNCTIONS here refers to SQL-native UDFs?

yes

If so, can you rename USER_DEFINED_FUNCTIONS to USER_DEFINED_SQL_FUNCTIONS?

done

amaliujia · 2021-02-03T23:38:24Z

...sql/zetasql/src/test/java/org/apache/beam/sdk/extensions/sql/zetasql/ZetaSqlJavaUdfTest.java

+    BeamRelNode beamRelNode = zetaSQLQueryPlanner.convertToBeamRel(sql);
+    BeamSqlRelUtils.toPCollection(pipeline, beamRelNode);
+    thrown.expect(RuntimeException.class);
+    thrown.expectMessage("CalcFn failed to evaluate");


Question: do you know where the exception is thrown?

    public static class IncrementFn extends ScalarFn {
      @ApplyMethod
      public Long increment(Long i) {
        return i + 1;
      }
    }

The increment method does not seem to handle NULL at all.

increment throws a NullPointerException which is caught by the script evaluator and then by BeamCalcRel.

beam/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamCalcRel.java

Line 291 in 989c317

"CalcFn failed to evaluate: " + processElementBlock, e.getCause());

I added assertions to make this test more strict and clear.
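To illustrate the failure path described above, here is a minimal sketch that does not depend on Beam. The class `IncrementNullDemo`, the `incrementOrNull` variant, and the simplified `evaluate` wrapper are hypothetical stand-ins; only the `increment` body and the "CalcFn failed to evaluate" message prefix come from this thread. The real wrapping in BeamCalcRel passes `e.getCause()`, which this sketch simplifies.

```java
/** Hypothetical stand-in for the UDF under test; it does not depend on Beam. */
final class IncrementNullDemo {

  /** Same body as IncrementFn.increment: `i + 1` unboxes i, so a null input throws. */
  static Long increment(Long i) {
    return i + 1;
  }

  /** Assumed alternative (not part of this PR): propagate SQL NULL instead of throwing. */
  static Long incrementOrNull(Long i) {
    if (i == null) {
      return null;
    }
    return i + 1;
  }

  /** Simplified wrapper that rethrows with the message prefix the test expects. */
  static Long evaluate(Long input) {
    try {
      return increment(input);
    } catch (RuntimeException e) {
      // Placeholder text stands in for the generated code appended by the real message.
      throw new RuntimeException("CalcFn failed to evaluate: <generated code>", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(incrementOrNull(null));
    try {
      evaluate(null);
    } catch (RuntimeException e) {
      System.out.println(e.getMessage());
      System.out.println(e.getCause().getClass().getSimpleName());
    }
  }
}
```

Whether a UDF should return NULL or throw on a NULL input is a design choice for the UDF author; the test in this PR only pins down what happens when the UDF throws.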


amaliujia · 2021-02-03T23:39:54Z

...sql/zetasql/src/test/java/org/apache/beam/sdk/extensions/sql/zetasql/ZetaSqlJavaUdfTest.java

+    ZetaSQLQueryPlanner zetaSQLQueryPlanner = new ZetaSQLQueryPlanner(config);
+    BeamRelNode beamRelNode = zetaSQLQueryPlanner.convertToBeamRel(sql);
+    thrown.expect(UnsupportedOperationException.class);
+    thrown.expectMessage("Could not compile CalcFn");


Sorry, I'm having a hard time understanding why this test case fails.

I added some assertions and comments that should explain it.

amaliujia · 2021-02-03T23:41:15Z

...sql/zetasql/src/test/java/org/apache/beam/sdk/extensions/sql/zetasql/ZetaSqlJavaUdfTest.java

+  }
+
+  @Test
+  public void testBinaryJavaUdf() {


Can you link https://issues.apache.org/jira/browse/BEAM-11747 here?

We will need to figure out how to better handle the mixed case. To me, the better approach is to reject such cases until we implement Calc splitting.

I added a TODO.

apilloud

Make sure you fix your call to getValue to check for null before merging, otherwise LGTM.

I left some comments here on bypassing layers. I'll let you decide if any of those need to be fixed now, in a followup, or if they can wait. (My thoughts: Supporting SqlTransform UDFs on ZetaSQL is something that is expected to come out of this work, and those currently go through SchemaPlus. Building out a TableProvider interface and supporting this new UDF format in Calcite is a larger project that can probably wait until we add UDFs to DataCatalog.)

apilloud · 2021-02-03T22:57:30Z

...nsions/sql/zetasql/src/main/java/org/apache/beam/sdk/extensions/sql/zetasql/SqlAnalyzer.java

@@ -127,6 +144,53 @@ static boolean isEndOfInput(ParseResumeLocation parseResumeLocation) {
    return tables.build();
  }

+  /** Returns the fully qualified name of the function defined in the statement. */
+  static String getFunctionQualifiedName(ResolvedCreateFunctionStmt createFunctionStmt) {


This looks unused.

I removed it.

apilloud · 2021-02-03T23:19:06Z

...nsions/sql/zetasql/src/main/java/org/apache/beam/sdk/extensions/sql/zetasql/SqlAnalyzer.java

+    }
+  }
+
+  private Function createFunction(ResolvedCreateFunctionStmt createFunctionStmt) {


It seems like you bypassed a few abstraction layers here. ResolvedCreateFunctionStmt should probably add a UDF to the TableProvider (or an equivalent for UDFs). For an example, see CREATE EXTERNAL TABLE in the Calcite dialect:

beam/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/parser/SqlCreateExternalTable.java

Line 136 in 3bb232f

schema.getTableProvider().createTable(toTable());

apilloud · 2021-02-04T00:04:40Z

...nsions/sql/zetasql/src/main/java/org/apache/beam/sdk/extensions/sql/zetasql/SqlAnalyzer.java

@@ -115,6 +116,22 @@ static boolean isEndOfInput(ParseResumeLocation parseResumeLocation) {
        >= parseResumeLocation.getInput().getBytes(UTF_8).length;
  }

+  static String getOptionStringValue(


nit: This method appears to be used exactly once in another file. It should go right next to the method that calls it.

apilloud · 2021-02-04T00:11:27Z

...sql/zetasql/src/main/java/org/apache/beam/sdk/extensions/sql/zetasql/ZetaSQLPlannerImpl.java

+        } else if (SqlAnalyzer.USER_DEFINED_JAVA_SCALAR_FUNCTIONS.equals(functionGroup)) {
+          String jarPath = getJarPath(createFunctionStmt);
+          ScalarFn scalarFn =
+              javaUdfLoader.loadScalarFunction(createFunctionStmt.getNamePath(), jarPath);


Again on bypassing layers: it seems like all of this should live in a TableProvider-like interface and be built through a buildBeamSqlUDF method called from BeamCalciteSchema (see just above the line in this link for the table example):

beam/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/BeamCalciteSchema.java

Line 122 in 68d6c8e

public Collection<Function> getFunctions(String name) {

apilloud · 2021-02-04T00:17:12Z

...nsions/sql/zetasql/src/main/java/org/apache/beam/sdk/extensions/sql/zetasql/SqlAnalyzer.java

+  static String getOptionStringValue(
+      ResolvedCreateFunctionStmt createFunctionStmt, String optionName) {
+    for (ResolvedNodes.ResolvedOption option : createFunctionStmt.getOptionList()) {
+      if (option.getName().equals(optionName)) {


How about optionName.equals(option.getName())? That will avoid potential crashes if getName returns null.
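A small sketch of why the order of the `equals` call matters. The class and method names here are hypothetical, and plain strings stand in for `option.getName()`; the point is only that calling `equals` on a possibly-null receiver throws, while passing null as the argument returns false.

```java
import java.util.Objects;

/** Hypothetical demo; plain strings stand in for option.getName(). */
final class EqualsOrderDemo {

  /** Original order: throws NullPointerException when name is null. */
  static boolean receiverIsName(String name, String optionName) {
    return name.equals(optionName);
  }

  /** Suggested order: safe as long as optionName itself is non-null. */
  static boolean receiverIsOptionName(String name, String optionName) {
    return optionName.equals(name);
  }

  /** Alternative that tolerates null on either side. */
  static boolean eitherMayBeNull(String name, String optionName) {
    return Objects.equals(name, optionName);
  }

  public static void main(String[] args) {
    System.out.println(receiverIsOptionName(null, "path"));
    System.out.println(eitherMayBeNull(null, "path"));
    try {
      receiverIsName(null, "path");
    } catch (NullPointerException e) {
      System.out.println("receiverIsName threw NullPointerException");
    }
  }
}
```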

apilloud · 2021-02-04T00:23:00Z

...nsions/sql/zetasql/src/main/java/org/apache/beam/sdk/extensions/sql/zetasql/SqlAnalyzer.java

+      ResolvedCreateFunctionStmt createFunctionStmt, String optionName) {
+    for (ResolvedNodes.ResolvedOption option : createFunctionStmt.getOptionList()) {
+      if (option.getName().equals(optionName)) {
+        if (option.getValue().getType().getKind() != TypeKind.TYPE_STRING) {


getValue can return null here. I didn't check the other two, but you can find a copy of the generated ResolvedNodes.java in internal code search. https://github.com/google/zetasql/blob/862a192a6da487757e860166a9666120b16773f5/java/com/google/zetasql/resolvedast/ResolvedNodes.java.template#L295

done (hopefully at some point we can enable the nullness checker on this file)
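As a rough sketch of the null check discussed above, the lookup below uses a hypothetical `Option` class in place of `ResolvedNodes.ResolvedOption`, with an `Object` value in place of a resolved expression. Returning `Optional<String>` for a missing option and throwing `IllegalArgumentException` for a bad value are assumptions made for illustration, not the behavior of the merged code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical demo of a null-tolerant option lookup. */
final class OptionLookupDemo {

  /** Stand-in for ResolvedOption; both fields may be null. */
  static final class Option {
    final String name;
    final Object value;

    Option(String name, Object value) {
      this.name = name;
      this.value = value;
    }
  }

  /** Returns the string value of the named option, or empty if the option is absent. */
  static Optional<String> getOptionStringValue(List<Option> options, String optionName) {
    for (Option option : options) {
      if (optionName.equals(option.name)) {
        // Check for null before inspecting the value's type.
        if (option.value == null) {
          throw new IllegalArgumentException("Option " + optionName + " has no value");
        }
        if (!(option.value instanceof String)) {
          throw new IllegalArgumentException("Option " + optionName + " must be a string");
        }
        return Optional.of((String) option.value);
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    List<Option> options = Arrays.asList(new Option("path", "/tmp/udf.jar"));
    System.out.println(getOptionStringValue(options, "path"));
    System.out.println(getOptionStringValue(options, "missing"));
  }
}
```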

apilloud · 2021-02-04T00:24:10Z

...nsions/sql/zetasql/src/main/java/org/apache/beam/sdk/extensions/sql/zetasql/SqlAnalyzer.java

+              "Native SQL aggregate functions are not supported (BEAM-9954).");
+        }
+        return USER_DEFINED_FUNCTIONS;
+      case "PY":


I'm curious as to where these came from. Is there another engine that supports these constants?

BigQuery supports JS. I made up the rest of them.

apilloud · 2021-02-04T00:27:53Z

...sql/zetasql/src/main/java/org/apache/beam/sdk/extensions/sql/zetasql/ZetaSQLPlannerImpl.java


    ResolvedStatement statement;
    ParseResumeLocation parseResumeLocation = new ParseResumeLocation(sql);
    do {
      statement = analyzer.analyzeNextStatement(parseResumeLocation, options, catalog);
      if (statement.nodeKind() == RESOLVED_CREATE_FUNCTION_STMT) {
        ResolvedCreateFunctionStmt createFunctionStmt = (ResolvedCreateFunctionStmt) statement;
-        udfBuilder.put(createFunctionStmt.getNamePath(), createFunctionStmt);
+        String functionGroup = SqlAnalyzer.getFunctionGroup(createFunctionStmt);
+        if (SqlAnalyzer.USER_DEFINED_FUNCTIONS.equals(functionGroup)) {


nit: switch/case is better than an if/else chain for this pattern, as long as your string isn't null.
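The null caveat can be shown with a short sketch: in Java, a `switch` on a String throws `NullPointerException` when the selector is null, whereas `CONSTANT.equals(value)` simply returns false. The class name, constant values, and return strings below are hypothetical and are not taken from SqlAnalyzer.

```java
/** Hypothetical demo; the group names are made up for illustration. */
final class FunctionGroupDispatchDemo {
  // Case labels must be compile-time constants, so these are initialized with literals.
  static final String SQL_GROUP = "sql_group";
  static final String JAVA_SCALAR_GROUP = "java_scalar_group";

  /** Dispatches with switch; throws NullPointerException if functionGroup is null. */
  static String dispatchWithSwitch(String functionGroup) {
    switch (functionGroup) {
      case SQL_GROUP:
        return "register SQL UDF";
      case JAVA_SCALAR_GROUP:
        return "load Java scalar UDF";
      default:
        return "unsupported";
    }
  }

  /** Dispatches with constant-first equals; a null group falls through to the last branch. */
  static String dispatchWithIf(String functionGroup) {
    if (SQL_GROUP.equals(functionGroup)) {
      return "register SQL UDF";
    } else if (JAVA_SCALAR_GROUP.equals(functionGroup)) {
      return "load Java scalar UDF";
    }
    return "unsupported";
  }

  public static void main(String[] args) {
    System.out.println(dispatchWithSwitch(JAVA_SCALAR_GROUP));
    System.out.println(dispatchWithIf(null));
  }
}
```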

ibzib · 2021-02-04T19:35:20Z

I left some comments here on bypassing layers. I'll let you decide if any of those need to be fixed now, in a followup, or if they can wait. (My thoughts: Supporting SqlTransform UDFs on ZetaSQL is something that is expected to come out of this work, and those currently go through SchemaPlus. Building out a TableProvider interface and supporting this new UDF format in Calcite is a larger project that can probably wait until we add UDFs to DataCatalog.)

I'll leave those for a follow-up (tracked by BEAM-10943).

ibzib · 2021-02-04T19:52:53Z

Run Java_Examples_Dataflow_Java11 PreCommit

amaliujia · 2021-02-04T20:53:24Z

LGTM

[BEAM-10925] Enable user-defined Java scalar functions in ZetaSQL.

ef5aeef


amaliujia reviewed Feb 3, 2021


apilloud approved these changes Feb 4, 2021


address review comments

feee800

ibzib merged commit 9bbd5bd into apache:master Feb 4, 2021
