[SPARK-39139][SQL] DS V2 supports push down DS V2 UDF #36593
Conversation
ping @huaxingao cc @cloud-fan
case ApplyFunctionExpression(function, children) =>
  val childrenExpressions = children.flatMap(generateExpression(_))
  if (childrenExpressions.length == children.length) {
    Some(new GeneralScalarExpression(function.name().toUpperCase(Locale.ROOT),
UDF is kind of special and I think it's better to have a new v2 API for it, instead of reusing GeneralScalarExpression. E.g., we can add a UserDefinedScalarFunction, which defines the function name, canonical name and inputs.
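A minimal sketch of what such a dedicated v2 class could look like (illustrative only: the class and field names follow the comment's suggestion, and the `Expression` interface here is a stand-in for Spark's real v2 expression type, not the actual API):

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Stand-in for Spark's v2 Expression interface.
interface Expression {}

// A dedicated v2 representation for scalar UDFs: it carries the function
// name, a canonical name (so data sources can identify the function), and
// the input expressions.
class UserDefinedScalarFunc implements Expression {
  private final String name;
  private final String canonicalName;
  private final Expression[] children;

  UserDefinedScalarFunc(String name, String canonicalName, Expression[] children) {
    this.name = name;
    this.canonicalName = canonicalName;
    this.children = children;
  }

  String name() { return name; }
  String canonicalName() { return canonicalName; }
  Expression[] children() { return children; }

  @Override
  public String toString() {
    return name + "(" + Arrays.stream(children)
        .map(Object::toString).collect(Collectors.joining(", ")) + ")";
  }
}
```

The key difference from `GeneralScalarExpression` is the canonical name, which gives a data source a stable identifier to decide whether it can compile the UDF.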
case aggregate.V2Aggregator(aggrFunc, children, _, _) =>
  val translatedExprs = children.flatMap(PushableExpression.unapply(_))
  if (translatedExprs.length == children.length) {
    Some(new GeneralAggregateFunc(aggrFunc.name().toUpperCase(Locale.ROOT), agg.isDistinct,
ditto, let's create a dedicated v2 api, such as UserDefinedAggregateFunction
private var catalogName: String = null
private var options: JDBCOptions = _
private var dialect: JdbcDialect = _

private val functions: util.Map[Identifier, UnboundFunction] =
  new ConcurrentHashMap[Identifier, UnboundFunction]()
We should clearly define how this can be used. I thought each JDBC dialect should have APIs to register its own UDFs.
OK
@@ -73,7 +73,7 @@ class OracleIntegrationSuite extends DockerJDBCIntegrationV2Suite with V2JDBCTes
   }

   override def sparkConf: SparkConf = super.sparkConf
-    .set("spark.sql.catalog.oracle", classOf[JDBCTableCatalog].getName)
+    .set("spark.sql.catalog.oracle", classOf[JDBCCatalog].getName)
Isn't this a breaking change? Existing queries may fail because they need to change the conf value now.
maybe we should introduce a short name for the JDBC catalog, like spark.sql.catalog.oracle=jdbc
 * @param fn The user-defined function.
 * @return The user-defined function.
 */
def registerFunction(
This looks like a weird API design. Why not ask the dialect to return an Array[(Identifier, UnboundFunction)]?
Then we don't need to expose JDBCCatalog as a public API.
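A rough sketch of this alternative design, using simplified stand-in types rather than the real Spark/JDBC classes (all names here are illustrative): the dialect exposes its UDFs as (name, function) pairs, and the catalog pulls them in, so the catalog class itself never needs to be public.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for Spark's UnboundFunction; a functional interface so a
// lambda can serve as a function stub below.
interface UnboundFunction { String name(); }

abstract class Dialect {
  // Each dialect returns the UDFs it wants registered, instead of
  // pushing them into the catalog through a registerFunction API.
  List<Map.Entry<String, UnboundFunction>> functions() {
    return List.of(); // no dialect-specific functions by default
  }
}

class H2LikeDialect extends Dialect {
  @Override
  List<Map.Entry<String, UnboundFunction>> functions() {
    UnboundFunction charLength = () -> "CHAR_LENGTH";
    return List.of(new SimpleImmutableEntry<>("char_length", charLength));
  }
}

class Catalog {
  private final Map<String, UnboundFunction> functions = new ConcurrentHashMap<>();

  // The catalog consumes the dialect's function list; registration is
  // driven entirely by the dialect, keeping the catalog internal.
  Catalog(Dialect dialect) {
    dialect.functions().forEach(e -> functions.put(e.getKey(), e.getValue()));
  }

  UnboundFunction load(String name) { return functions.get(name); }
}
```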
BTW, I think it's better to have one PR that supports registering dialect-specific functions, and then add pushdown in the next PR.
OK
Force-pushed from 7672f35 to cfb472e
@@ -235,8 +235,8 @@ public String toString() {
   try {
     return builder.build(this);
   } catch (Throwable e) {
     return name + "(" +
       Arrays.stream(children).map(child -> child.toString()).reduce((a, b) -> a + "," + b) + ")";
what's wrong with the previous code?
The previous code made the toString display as Optional(...), because `Stream.reduce` returns an `Optional` that was concatenated directly into the string.
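To illustrate the bug: the `Optional`-returning overload of `Stream.reduce` wraps its result, so concatenating it straight into a string leaks the wrapper's own `toString`. A self-contained demo (the `FUNC` name is made up, not Spark code):

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class ToStringDemo {
  // Buggy version: Stream.reduce returns an Optional, and concatenating
  // it into the string produces "Optional[...]" in the output.
  static String buggy(String name, String[] children) {
    return name + "(" +
        Arrays.stream(children).reduce((a, b) -> a + "," + b) + ")";
  }

  // Fixed version: join the child strings directly.
  static String fixed(String name, String[] children) {
    return name + "(" +
        Arrays.stream(children).collect(Collectors.joining(",")) + ")";
  }

  public static void main(String[] args) {
    String[] children = {"a", "b"};
    System.out.println(buggy("FUNC", children)); // FUNC(Optional[a,b])
    System.out.println(fixed("FUNC", children)); // FUNC(a,b)
  }
}
```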
/**
 * The general representation of user defined aggregate function, which implements
 * {@link AggregateFunc}, contains the upper-cased function name, the `isDistinct` flag and
let's mention canonical name as well
OK
/**
 * The general representation of user defined aggregate function, which implements
 * {@link AggregateFunc}, contains the upper-cased function name, the `isDistinct` flag and
 * all the inputs. Note that Spark cannot push down partial aggregate with this function to
Suggested change:
- * all the inputs. Note that Spark cannot push down partial aggregate with this function to
+ * all the inputs. Note that Spark cannot push down aggregate with this function partially to
@Override
public String toString() {
  String inputsString = Arrays.stream(children)
Can we use V2ExpressionSQLBuilder?
@@ -241,6 +246,11 @@ protected String visitSQLFunction(String funcName, String[] inputs) {
     return funcName + "(" + Arrays.stream(inputs).collect(Collectors.joining(", ")) + ")";
   }

   protected String visitUserDefinedFunction(
       String funcName, String canonicalName, String[] inputs) {
     return funcName + "(" + Arrays.stream(inputs).collect(Collectors.joining(", ")) + ")";
I think it's safer to fail by default here. We must ask the data sources to carefully check the canonicalName to make sure they support the UDF.
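A sketch of the fail-by-default idea with simplified stand-in classes (not the actual Spark API; the dialect and canonical names are hypothetical): the base builder rejects every UDF, and a dialect's builder compiles only the canonical names it explicitly recognizes.

```java
class ExprSQLBuilder {
  // Fail by default: an unrecognized UDF must never be silently
  // translated into SQL the data source may not understand.
  protected String visitUserDefinedFunction(
      String funcName, String canonicalName, String[] inputs) {
    throw new UnsupportedOperationException(
        "Cannot push down UDF: " + canonicalName);
  }
}

class H2LikeBuilder extends ExprSQLBuilder {
  @Override
  protected String visitUserDefinedFunction(
      String funcName, String canonicalName, String[] inputs) {
    // Compile only the UDFs this dialect explicitly recognizes,
    // keyed by canonical name rather than display name.
    if ("h2.char_length".equals(canonicalName)) {
      return "CHAR_LENGTH(" + String.join(", ", inputs) + ")";
    }
    return super.visitUserDefinedFunction(funcName, canonicalName, inputs);
  }
}
```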
@@ -304,6 +310,11 @@ abstract class JdbcDialect extends Serializable with Logging {
   */
  @Since("3.3.0")
  def compileAggregate(aggFunction: AggregateFunc): Option[String] = {
I don't think we should add new APIs to the JDBC dialect. Can we leverage V2ExpressionBuilder to make it easier for dialects to compile aggregates?
@beliefer, glad to see interest/progress in cross-platform SQL/UDF pushdown. Have you considered doing this by leveraging frameworks such as Transport [1, 2] for UDFs and Coral [1, 2] for SQL? With Transport, one can implement a function once and have it executable in Spark as well as other data sources; all function variants (automatically generated) natively access the in-memory records of the corresponding engine/data source. With Coral, one can apply transformations/rewrites to built-in functions/SQL expressions so they translate to the same semantics in an underlying engine/data source. For example, it can push down complex functions/SQL expressions from Spark to Trino despite the different syntax. This PR might not be the best place to discuss this in detail, but happy to file a JIRA ticket to carry this forward. cc: @xkrogen.
Spark DS V2 has its own UDF API. Users can implement UDFs with that API, and the DS V2 push-down framework can support them. If we used Transport and Coral, it seems we would introduce more components.
Force-pushed from 3c3482f to add8f18
ToStringSQLBuilder builder = new ToStringSQLBuilder();
try {
  return builder.build(this);
} catch (Throwable e) {
which error do we expect to hit?
same question to GeneralScalarExpression.toString
ToStringSQLBuilder builder = new ToStringSQLBuilder();
try {
  return builder.build(this);
} catch (Throwable e) {
ditto
/**
 * The builder to generate `toString` information of V2 expressions.
 */
public class ToStringSQLBuilder extends V2ExpressionSQLBuilder {
Let's put it in org/apache/spark/sql/internal/connector; we should not expose it to end users.
and we can even write it in Scala
OK
if (childrenExpressions.length == children.length) {
  Some(new UserDefinedScalarFunc(
    function.name().toUpperCase(Locale.ROOT),
    function.canonicalName().toUpperCase(Locale.ROOT),
Let's keep the names as they are, and not upper-case them.
OK
case aggregate.V2Aggregator(aggrFunc, children, _, _) =>
  val translatedExprs = children.flatMap(PushableExpression.unapply(_))
  if (translatedExprs.length == children.length) {
    Some(new UserDefinedAggregateFunc(aggrFunc.name().toUpperCase(Locale.ROOT),
ditto
override def visitUserDefinedScalarFunction(
    funcName: String, canonicalName: String, inputs: Array[String]): String = {
  canonicalName match {
    case "H2.CHAR_LENGTH" =>
I don't think this makes sense. In production, the H2 dialect does not expose the H2.CHAR_LENGTH UDF. How can this work in a non-test environment?
@@ -249,7 +249,7 @@ public int hashCode() {

   @Override
   public String toString() {
-    V2ExpressionSQLBuilder builder = new V2ExpressionSQLBuilder();
+    ToStringSQLBuilder builder = new ToStringSQLBuilder();
     try {
We can remove the try-catch here as well.
checkPushedInfo(df1, "PushedFilters: [char_length(NAME) > 2],")
checkAnswer(df1, Seq(Row("fred", 1), Row("mary", 2)))

val df2 = sql("SELECT * FROM h2.test.people where h2.my_strlen(name) > 4")
nit: I don't think it's worth a new test case just for changing h2.my_strlen(name) > 2 to h2.my_strlen(name) > 4. Can we remove this df2 test?
"""
  |SELECT *
  |FROM h2.test.people
  |WHERE h2.my_strlen(CASE WHEN NAME = 'fred' THEN NAME ELSE "abc" END) > 3
ditto, let's remove one test
LGTM except for a few minor comments
case object CharLength extends ScalarFunction[Int] {
  override def inputTypes(): Array[DataType] = Array(StringType)
  override def resultType(): DataType = IntegerType
  override def name(): String = "char_length"
nit: to be more SQL-ish, we can upper-case the function name.
thanks, merging to master!
@cloud-fan Thank you very much!
### What changes were proposed in this pull request?

Currently, the Spark DS V2 push-down framework supports pushing down SQL to data sources, but only for Spark's built-in functions. Each database has many useful functions that Spark does not support. If we can push these functions down to the data source, we reduce disk I/O and network I/O and improve query performance.

### Why are the changes needed?

1. Spark can leverage the functions supported by databases.
2. Improve query performance.

### Does this PR introduce _any_ user-facing change?

No. New feature.

### How was this patch tested?

New tests.

Closes apache#36593 from beliefer/SPARK-39139.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…LIMIT (#505)

* [SPARK-39139][SQL] DS V2 supports push down DS V2 UDF

Currently, the Spark DS V2 push-down framework supports pushing down SQL to data sources, but only for built-in functions. Each database has many useful functions that Spark does not support; pushing them down to the data source reduces disk I/O and network I/O and improves query performance. New tests. Closes apache#36593 from beliefer/SPARK-39139.

* [SPARK-39453][SQL][TESTS][FOLLOWUP] Make `RAND` in filter more meaningful

apache#36830 made DS V2 support push-down of misc non-aggregate functions (non-ANSI), but the `Rand` in the test case was not meaningful. Test-only change. Closes apache#37033 from beliefer/SPARK-39453_followup.

* [SPARK-37527][SQL][FOLLOWUP] Cannot compile COVAR_POP, COVAR_SAMP and CORR in `H2Dialect` when used with `DISTINCT`

apache#35145 compiled COVAR_POP, COVAR_SAMP and CORR in H2Dialect, but H2 does not support these aggregates with DISTINCT, so apache#35145 introduced a bug when compiling them with DISTINCT. This fixes that bug, with new test cases. Closes apache#37090 from beliefer/SPARK-37527_followup2.

* [SPARK-39627][SQL] DS V2 pushdown should unify the compile API

`JdbcDialect` has two APIs, `compileAggregate` and `compileExpression`; this unifies them to improve ease of use. `compileAggregate` now calls `compileExpression`, with no user-facing change. Closes apache#37047 from beliefer/SPARK-39627.

* [SPARK-39384][SQL] Compile built-in linear regression aggregate functions for JDBC dialect

The DS V2 pushdown framework translates many standard linear regression aggregate functions, but only H2Dialect compiled them. This compiles them for the other built-in JDBC dialects, with new test cases. Closes apache#37188 from beliefer/SPARK-39384.

* [SPARK-39148][SQL] DS V2 aggregate push down can work with OFFSET or LIMIT

This refactors the v2 aggregate pushdown code. The main change is that the `Scan` is no longer built immediately when pushing aggregates. It was previously built early so the data schema with aggregates pushed was known and casts could be added when rewriting the query plan, but building the `Scan` that early prevented pushing down any more operators, while LIMIT/OFFSET after an aggregate is common. Instead, an expected schema (the data types of Spark's aggregate functions) defines the output of `ScanBuilderHolder` for the plan rewrite; later, when the `Scan` is built and `ScanBuilderHolder` is replaced with `DataSourceV2ScanRelation`, the actual data schema is checked and a `Project` is added to cast types if necessary. Updated tests. Closes apache#37195 from cloud-fan/agg.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…LIMIT (Kyligence#505) * [SPARK-39139][SQL] DS V2 supports push down DS V2 UDF ### What changes were proposed in this pull request? Currently, Spark DS V2 push-down framework supports push down SQL to data sources. But the DS V2 push-down framework only support push down the built-in functions to data sources. Each database have a lot very useful functions which not supported by Spark. If we can push down these functions into data source, it will reduce disk I/O and network I/O and improve the performance when query databases. ### Why are the changes needed? 1. Spark can leverage the functions supported by databases 2. Improve the query performance. ### Does this PR introduce _any_ user-facing change? 'No'. New feature. ### How was this patch tested? New tests. Closes apache#36593 from beliefer/SPARK-39139. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-39453][SQL][TESTS][FOLLOWUP] Let `RAND` in filter is more meaningful ### What changes were proposed in this pull request? apache#36830 makes DS V2 supports push down misc non-aggregate functions(non ANSI). But he `Rand` in test case looks no meaningful. ### Why are the changes needed? Let `Rand` in filter is more meaningful. ### Does this PR introduce _any_ user-facing change? 'No'. Just update test case. ### How was this patch tested? Just update test case. Closes apache#37033 from beliefer/SPARK-39453_followup. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-37527][SQL][FOLLOWUP] Cannot compile COVAR_POP, COVAR_SAMP and CORR in `H2Dialect` if them with `DISTINCT` ### What changes were proposed in this pull request? apache#35145 compile COVAR_POP, COVAR_SAMP and CORR in H2Dialect. Because H2 does't support COVAR_POP, COVAR_SAMP and CORR works with DISTINCT. So apache#35145 introduces a bug that compile COVAR_POP, COVAR_SAMP and CORR if these aggregate functions with DISTINCT. 
### Why are the changes needed? Fix bug that compile COVAR_POP, COVAR_SAMP and CORR if these aggregate functions with DISTINCT. ### Does this PR introduce _any_ user-facing change? 'Yes'. Bug will be fix. ### How was this patch tested? New test cases. Closes apache#37090 from beliefer/SPARK-37527_followup2. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-39627][SQL] DS V2 pushdown should unify the compile API ### What changes were proposed in this pull request? Currently, `JdbcDialect` have two API `compileAggregate` and `compileExpression`, we can unify them. ### Why are the changes needed? Improve ease of use. ### Does this PR introduce _any_ user-facing change? 'No'. The two API `compileAggregate` call `compileExpression` not changed. ### How was this patch tested? N/A Closes apache#37047 from beliefer/SPARK-39627. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-39384][SQL] Compile built-in linear regression aggregate functions for JDBC dialect ### What changes were proposed in this pull request? Recently, Spark DS V2 pushdown framework translate a lot of standard linear regression aggregate functions. Currently, only H2Dialect compile these standard linear regression aggregate functions. This PR compile these standard linear regression aggregate functions for other build-in JDBC dialect. ### Why are the changes needed? Make build-in JDBC dialect support compile linear regression aggregate push-down. ### Does this PR introduce _any_ user-facing change? 'No'. New feature. ### How was this patch tested? New test cases. Closes apache#37188 from beliefer/SPARK-39384. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Sean Owen <srowen@gmail.com> * [SPARK-39148][SQL] DS V2 aggregate push down can work with OFFSET or LIMIT ### What changes were proposed in this pull request? This PR refactors the v2 agg pushdown code. 
The main change is, now we don't build the `Scan` immediately when pushing agg. We did it so before because we want to know the data schema with agg pushed, then we can add cast when rewriting the query plan after pushdown. But the problem is, we build `Scan` too early and can't push down any more operators, while it's common to see LIMIT/OFFSET after agg. The idea of the refactor is, we don't need to know the data schema with agg pushed. We just give an expectation (the data type should be the same of Spark agg functions), use it to define the output of `ScanBuilderHolder`, and then rewrite the query plan. Later on, when we build the `Scan` and replace `ScanBuilderHolder` with `DataSourceV2ScanRelation`, we check the actual data schema and add a `Project` to do type cast if necessary. ### Why are the changes needed? support pushing down LIMIT/OFFSET after agg. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? updated tests Closes apache#37195 from cloud-fan/agg. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Jiaan Geng <beliefer@163.com> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
…LIMIT (#505) * [SPARK-39139][SQL] DS V2 supports push down DS V2 UDF ### What changes were proposed in this pull request? Currently, Spark DS V2 push-down framework supports push down SQL to data sources. But the DS V2 push-down framework only support push down the built-in functions to data sources. Each database have a lot very useful functions which not supported by Spark. If we can push down these functions into data source, it will reduce disk I/O and network I/O and improve the performance when query databases. ### Why are the changes needed? 1. Spark can leverage the functions supported by databases 2. Improve the query performance. ### Does this PR introduce _any_ user-facing change? 'No'. New feature. ### How was this patch tested? New tests. Closes apache#36593 from beliefer/SPARK-39139. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-39453][SQL][TESTS][FOLLOWUP] Let `RAND` in filter is more meaningful ### What changes were proposed in this pull request? apache#36830 makes DS V2 supports push down misc non-aggregate functions(non ANSI). But he `Rand` in test case looks no meaningful. ### Why are the changes needed? Let `Rand` in filter is more meaningful. ### Does this PR introduce _any_ user-facing change? 'No'. Just update test case. ### How was this patch tested? Just update test case. Closes apache#37033 from beliefer/SPARK-39453_followup. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-37527][SQL][FOLLOWUP] Cannot compile COVAR_POP, COVAR_SAMP and CORR in `H2Dialect` if them with `DISTINCT` ### What changes were proposed in this pull request? apache#35145 compile COVAR_POP, COVAR_SAMP and CORR in H2Dialect. Because H2 does't support COVAR_POP, COVAR_SAMP and CORR works with DISTINCT. So apache#35145 introduces a bug that compile COVAR_POP, COVAR_SAMP and CORR if these aggregate functions with DISTINCT. ### Why are the changes needed? 
Fix bug that compile COVAR_POP, COVAR_SAMP and CORR if these aggregate functions with DISTINCT. ### Does this PR introduce _any_ user-facing change? 'Yes'. Bug will be fix. ### How was this patch tested? New test cases. Closes apache#37090 from beliefer/SPARK-37527_followup2. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-39627][SQL] DS V2 pushdown should unify the compile API ### What changes were proposed in this pull request? Currently, `JdbcDialect` have two API `compileAggregate` and `compileExpression`, we can unify them. ### Why are the changes needed? Improve ease of use. ### Does this PR introduce _any_ user-facing change? 'No'. The two API `compileAggregate` call `compileExpression` not changed. ### How was this patch tested? N/A Closes apache#37047 from beliefer/SPARK-39627. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * [SPARK-39384][SQL] Compile built-in linear regression aggregate functions for JDBC dialect ### What changes were proposed in this pull request? Recently, Spark DS V2 pushdown framework translate a lot of standard linear regression aggregate functions. Currently, only H2Dialect compile these standard linear regression aggregate functions. This PR compile these standard linear regression aggregate functions for other build-in JDBC dialect. ### Why are the changes needed? Make build-in JDBC dialect support compile linear regression aggregate push-down. ### Does this PR introduce _any_ user-facing change? 'No'. New feature. ### How was this patch tested? New test cases. Closes apache#37188 from beliefer/SPARK-39384. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Sean Owen <srowen@gmail.com> * [SPARK-39148][SQL] DS V2 aggregate push down can work with OFFSET or LIMIT ### What changes were proposed in this pull request? This PR refactors the v2 agg pushdown code. 
The main change is, now we don't build the `Scan` immediately when pushing agg. We did it so before because we want to know the data schema with agg pushed, then we can add cast when rewriting the query plan after pushdown. But the problem is, we build `Scan` too early and can't push down any more operators, while it's common to see LIMIT/OFFSET after agg. The idea of the refactor is, we don't need to know the data schema with agg pushed. We just give an expectation (the data type should be the same of Spark agg functions), use it to define the output of `ScanBuilderHolder`, and then rewrite the query plan. Later on, when we build the `Scan` and replace `ScanBuilderHolder` with `DataSourceV2ScanRelation`, we check the actual data schema and add a `Project` to do type cast if necessary. ### Why are the changes needed? support pushing down LIMIT/OFFSET after agg. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? updated tests Closes apache#37195 from cloud-fan/agg. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Jiaan Geng <beliefer@163.com> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
…LIMIT (#505)

* [SPARK-39139][SQL] DS V2 supports push down DS V2 UDF

### What changes were proposed in this pull request?
Currently, the Spark DS V2 push-down framework supports pushing down SQL to data sources, but only for built-in functions. Each database has many useful functions that are not supported by Spark. If we can push these functions down to the data source, it will reduce disk and network I/O and improve performance when querying databases.

### Why are the changes needed?
1. Spark can leverage the functions supported by databases.
2. Improve the query performance.

### Does this PR introduce _any_ user-facing change?
'No'. New feature.

### How was this patch tested?
New tests.

Closes apache#36593 from beliefer/SPARK-39139.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39453][SQL][TESTS][FOLLOWUP] Make `RAND` in filter more meaningful

### What changes were proposed in this pull request?
apache#36830 makes DS V2 support pushing down misc non-aggregate functions (non ANSI), but the `Rand` in the test case is not meaningful.

### Why are the changes needed?
Make `Rand` in the filter more meaningful.

### Does this PR introduce _any_ user-facing change?
'No'. Just updates a test case.

### How was this patch tested?
Updated test case.

Closes apache#37033 from beliefer/SPARK-39453_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37527][SQL][FOLLOWUP] Cannot compile COVAR_POP, COVAR_SAMP and CORR in `H2Dialect` if used with `DISTINCT`

### What changes were proposed in this pull request?
apache#35145 compiles COVAR_POP, COVAR_SAMP and CORR in H2Dialect. Because H2 does not support COVAR_POP, COVAR_SAMP and CORR with DISTINCT, apache#35145 introduced a bug: these aggregate functions are compiled even when used with DISTINCT.

### Why are the changes needed?
Fix the bug that COVAR_POP, COVAR_SAMP and CORR are compiled when used with DISTINCT.

### Does this PR introduce _any_ user-facing change?
'Yes'. The bug will be fixed.

### How was this patch tested?
New test cases.

Closes apache#37090 from beliefer/SPARK-37527_followup2.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39627][SQL] DS V2 pushdown should unify the compile API

### What changes were proposed in this pull request?
Currently, `JdbcDialect` has two APIs, `compileAggregate` and `compileExpression`; we can unify them.

### Why are the changes needed?
Improve ease of use.

### Does this PR introduce _any_ user-facing change?
'No'. `compileAggregate` now calls `compileExpression`, with no behavior change.

### How was this patch tested?
N/A

Closes apache#37047 from beliefer/SPARK-39627.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39384][SQL] Compile built-in linear regression aggregate functions for JDBC dialect

### What changes were proposed in this pull request?
Recently, the Spark DS V2 pushdown framework translates many standard linear regression aggregate functions, but only H2Dialect compiles them. This PR compiles these standard linear regression aggregate functions for the other built-in JDBC dialects.

### Why are the changes needed?
Make the built-in JDBC dialects support compiling linear regression aggregate push-down.

### Does this PR introduce _any_ user-facing change?
'No'. New feature.

### How was this patch tested?
New test cases.

Closes apache#37188 from beliefer/SPARK-39384.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

* [SPARK-39148][SQL] DS V2 aggregate push down can work with OFFSET or LIMIT

### What changes were proposed in this pull request?
This PR refactors the v2 agg pushdown code. The main change is that we no longer build the `Scan` immediately when pushing the aggregate. We did so before because we wanted to know the data schema with the aggregate pushed, so that we could add casts when rewriting the query plan after pushdown. The problem is that building the `Scan` too early means we cannot push down any more operators, while it is common to see LIMIT/OFFSET after an aggregate.

The idea of the refactor is that we do not need to know the data schema with the aggregate pushed. We just set an expectation (the data types should match those of the Spark agg functions), use it to define the output of `ScanBuilderHolder`, and then rewrite the query plan. Later, when we build the `Scan` and replace `ScanBuilderHolder` with `DataSourceV2ScanRelation`, we check the actual data schema and add a `Project` to cast types if necessary.

### Why are the changes needed?
Support pushing down LIMIT/OFFSET after an aggregate.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Updated tests.

Closes apache#37195 from cloud-fan/agg.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
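The schema-reconciliation step described above can be sketched in isolation: plan against an *expected* post-aggregate schema, then compare it with the actual schema reported by the built scan and cast only the columns that differ. The class and method names below are illustrative only, not Spark's real internals:

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch (not Spark's actual classes) of the SPARK-39148 idea:
// define the holder's output from an expected schema, then, once the real
// scan is built, add a projection that casts only mismatched columns.
public class SchemaReconcile {

  // Returns, per column, either "col" or "CAST(col AS <expectedType>)".
  static List<String> projectWithCasts(List<String> names,
                                       List<String> expectedTypes,
                                       List<String> actualTypes) {
    String[] out = new String[names.size()];
    for (int i = 0; i < names.size(); i++) {
      if (expectedTypes.get(i).equals(actualTypes.get(i))) {
        out[i] = names.get(i); // types agree: pass the column through
      } else {
        // types differ: add a cast, mirroring the extra Project Spark inserts
        out[i] = "CAST(" + names.get(i) + " AS " + expectedTypes.get(i) + ")";
      }
    }
    return Arrays.asList(out);
  }

  public static void main(String[] args) {
    // e.g. Spark expects SUM to be BIGINT, but the source reports DECIMAL
    List<String> cols = Arrays.asList("MAX(salary)", "SUM(bonus)");
    List<String> expected = Arrays.asList("INT", "BIGINT");
    List<String> actual = Arrays.asList("INT", "DECIMAL(20,0)");
    System.out.println(projectWithCasts(cols, expected, actual));
    // → [MAX(salary), CAST(SUM(bonus) AS BIGINT)]
  }
}
```

When the actual schema matches the expectation, the projection is the identity and no extra `Project` node is needed.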
@Override
public int hashCode() {
  return Objects.hash(name, canonicalName, children);
}

@beliefer, should it be `Arrays.hashCode(children)`?
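The review question above points at a real pitfall: passing an array directly to `Objects.hash(...)` hashes the array *reference*, so two logically equal expressions with distinct `children` arrays can get different hash codes. A minimal standalone demonstration (names here are illustrative, not the PR's actual class):

```java
import java.util.Arrays;
import java.util.Objects;

// Contrasts identity-based and content-based hashing of an array field.
public class HashChildren {

  // As written in the reviewed code: the array's identity hashCode is used.
  static int identityStyle(String name, String canonicalName, Object[] children) {
    return Objects.hash(name, canonicalName, children);
  }

  // As the reviewer suggests: Arrays.hashCode hashes the array's contents.
  static int contentStyle(String name, String canonicalName, Object[] children) {
    return Objects.hash(name, canonicalName, Arrays.hashCode(children));
  }

  public static void main(String[] args) {
    Object[] a = {"col1", "col2"};
    Object[] b = {"col1", "col2"}; // equal contents, distinct array objects
    // Content-based hashing is stable across equal inputs:
    System.out.println(contentStyle("f", "db.f", a) == contentStyle("f", "db.f", b)); // true
    // Identity-based hashing typically differs for distinct arrays
    // (not guaranteed, since identity hash codes could collide):
    System.out.println(identityStyle("f", "db.f", a) == identityStyle("f", "db.f", b));
  }
}
```

This matters whenever the object is used as a hash-map key or deduplicated, since `equals` and `hashCode` must agree on content equality.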
What changes were proposed in this pull request?
Currently, the Spark DS V2 push-down framework supports pushing down SQL to data sources.
However, it only supports pushing down built-in functions.
Each database has many useful functions that are not supported by Spark.
If we can push these functions down to the data source, it will reduce disk and network I/O and improve performance when querying databases.
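At the SQL level, "pushing down a UDF" means the generated query embeds the database's own function call instead of fetching rows and evaluating the function in Spark. A hypothetical sketch of what a dialect's compilation step produces; the class, method, and function names below are illustrative, not Spark's real API:

```java
import java.util.Locale;

// Hypothetical sketch of rendering a user-defined scalar function call into
// the SQL text sent to the database, so the database evaluates it.
public class UdfPushdownSketch {

  // Render a scalar function call as SQL, e.g. MY_STRLEN(name).
  static String compileScalarFunction(String funcName, String... inputs) {
    return funcName.toUpperCase(Locale.ROOT)
        + "(" + String.join(", ", inputs) + ")";
  }

  public static void main(String[] args) {
    // Without pushdown, Spark would scan t and evaluate the function itself.
    // With pushdown, the generated query carries the database's UDF:
    String pushed = "SELECT " + compileScalarFunction("my_strlen", "name")
        + " FROM t WHERE " + compileScalarFunction("my_strlen", "name") + " > 2";
    System.out.println(pushed);
    // → SELECT MY_STRLEN(name) FROM t WHERE MY_STRLEN(name) > 2
  }
}
```

Because the filter now runs inside the database, only qualifying rows cross the network, which is where the I/O savings come from.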
Why are the changes needed?
1. Spark can leverage the functions supported by databases.
2. Improve the query performance.
Does this PR introduce any user-facing change?
'No'.
New feature.
How was this patch tested?
New tests.