[SPARK-46442][SQL] DS V2 supports push down PERCENTILE_CONT and PERCENTILE_DISC #44397

Closed

Conversation


@beliefer beliefer commented Dec 18, 2023

What changes were proposed in this pull request?

This PR translates the aggregate functions PERCENTILE_CONT and PERCENTILE_DISC for pushdown (see the usage sketch after the list below).

  • This PR adds Expression[] orderingWithinGroups into GeneralAggregateFunc, so that the DS V2 pushdown framework can compile the WITHIN GROUP (ORDER BY ...) clause easily.

  • This PR also splits visitInverseDistributionFunction out of visitAggregateFunction, so that the DS V2 pushdown framework can generate the WITHIN GROUP (ORDER BY ...) syntax easily.

  • This PR also fixes a bug where JdbcUtils could not handle the precision and scale of decimals returned from JDBC.
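
For illustration, a minimal usage sketch of the pushed-down aggregate (assumptions: a SparkSession `spark` with an H2 JDBC catalog registered as `h2` and a `test.employee` table, borrowed from the reproduction SQL quoted later in this thread):

```scala
// Sketch only: catalog "h2" and table "test"."employee" are assumptions.
val df = spark.sql(
  """SELECT dept, percentile_cont(0.5) WITHIN GROUP (ORDER BY salary)
    |FROM h2.test.employee
    |GROUP BY dept""".stripMargin)
// With this PR, the aggregate can be compiled to
// PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "SALARY" ASC NULLS FIRST)
// and executed by H2 instead of Spark; the plan should list it among the
// scan's pushed aggregates.
df.explain()
```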

Why are the changes needed?

This enables DS V2 to push down PERCENTILE_CONT and PERCENTILE_DISC to data sources.

Does this PR introduce any user-facing change?

'No'.
New feature.

How was this patch tested?

New test cases.

Was this patch authored or co-authored using generative AI tooling?

'No'.

@github-actions github-actions bot added the SQL label Dec 18, 2023
// Getter for DecimalType columns: read the JDBC decimal and convert it
// using the precision and scale carried by the Catalyst DecimalType.
case DecimalType.Fixed(p, s) =>
  (rs: ResultSet, row: InternalRow, pos: Int) =>
    val decimal =
      nullSafeConvert[java.math.BigDecimal](rs.getBigDecimal(pos + 1), d => Decimal(d, p, s))
Contributor Author:
The original code throws an exception.

Caused by: org.apache.spark.SparkArithmeticException: [DECIMAL_PRECISION_EXCEEDS_MAX_PRECISION] Decimal precision 42 exceeds max precision 38. SQLSTATE: 22003
	at org.apache.spark.sql.errors.DataTypeErrors$.decimalPrecisionExceedsMaxPrecisionError(DataTypeErrors.scala:48)
	at org.apache.spark.sql.types.Decimal.set(Decimal.scala:124)
	at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:577)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$4(JdbcUtils.scala:408)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.nullSafeConvert(JdbcUtils.scala:552)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$3(JdbcUtils.scala:408)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$3$adapted(JdbcUtils.scala:406)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:358)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:339)

Contributor Author:
Based on DecimalType, the inferred precision is 38 and the scale is 38.
In fact, the decimal returned from JDBC is BigDecimal(7, 3).
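
To spell out the arithmetic (an illustrative sketch; the literal value below is made up, only its precision and scale match the comment above): rescaling a precision-7/scale-3 value to scale 38 needs 7 - 3 + 38 = 42 digits in total, which is exactly the precision-42 error in the stack trace.

```scala
import org.apache.spark.sql.types.Decimal

// BigDecimal(7, 3): four integer digits plus three fraction digits.
val d = new java.math.BigDecimal("1234.567")
// Forcing it into DecimalType(38, 38) rescales to 38 fraction digits,
// so the total precision becomes 4 + 38 = 42 > MAX_PRECISION (38).
Decimal(d, 38, 38) // throws DECIMAL_PRECISION_EXCEEDS_MAX_PRECISION
```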

@beliefer force-pushed the SPARK-46442 branch 2 times, most recently from da47b85 to 1c985b3 on December 20, 2023 12:06
cloud-fan pushed a commit that referenced this pull request Dec 22, 2023
…ialect

### What changes were proposed in this pull request?
This PR fixes a bug by making the JDBC dialect decide the decimal precision and scale.

**How to reproduce the bug?**
#44397 proposed DS V2 pushdown of `PERCENTILE_CONT` and `PERCENTILE_DISC`.
The bug fires when pushing the SQL below down to the H2 JDBC data source.
`SELECT "DEPT",PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "SALARY" ASC NULLS FIRST) FROM "test"."employee" WHERE 1=0 GROUP BY "DEPT"`

**The root cause**
`getQueryOutputSchema` is used to get the output schema of a query by calling `JdbcUtils.getSchema`.
The query sent to the H2 database is shown below.
`SELECT "DEPT",PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "SALARY" ASC NULLS FIRST) FROM "test"."employee" WHERE 1=0 GROUP BY "DEPT"`
We get the following five values from `ResultSetMetaData`:
```
columnName = "PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY SALARY NULLS FIRST)"
dataType = 2
typeName = "NUMERIC"
fieldSize = 100000
fieldScale = 50000
```
Then we derive the Catalyst schema with `JdbcUtils.getCatalystType`, which actually calls `DecimalType.bounded(precision, scale)`.
`DecimalType.bounded(100000, 50000)` returns `DecimalType(38, 38)`.
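For reference, a sketch mirroring that clamping behavior (`DecimalType.bounded` itself is `private[sql]`; this stand-in is an assumption based on the result quoted above):

```scala
import org.apache.spark.sql.types.DecimalType

// Clamp precision and scale to the 38-digit maximum, as bounded does.
def boundedSketch(precision: Int, scale: Int): DecimalType =
  DecimalType(
    math.min(precision, DecimalType.MAX_PRECISION),
    math.min(scale, DecimalType.MAX_SCALE))

boundedSketch(100000, 50000) // DecimalType(38, 38): no integer digits left
```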
Finally, `makeGetter` throws an exception.
```
Caused by: org.apache.spark.SparkArithmeticException: [DECIMAL_PRECISION_EXCEEDS_MAX_PRECISION] Decimal precision 42 exceeds max precision 38. SQLSTATE: 22003
	at org.apache.spark.sql.errors.DataTypeErrors$.decimalPrecisionExceedsMaxPrecisionError(DataTypeErrors.scala:48)
	at org.apache.spark.sql.types.Decimal.set(Decimal.scala:124)
	at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:577)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$4(JdbcUtils.scala:408)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.nullSafeConvert(JdbcUtils.scala:552)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$3(JdbcUtils.scala:408)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$3$adapted(JdbcUtils.scala:406)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:358)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:339)
```

### Why are the changes needed?
This PR fixes the bug that `JdbcUtils` cannot get the correct decimal type.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Fix a bug.

### How was this patch tested?
Manual tests in #44397

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #44398 from beliefer/SPARK-46443.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Dec 22, 2023

(Same commit message as above; cherry picked from commit a921da8.)
Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@beliefer
ping @cloud-fan cc @huaxingao

@beliefer commented Jan 4, 2024

ping @cloud-fan


public GeneralAggregateFunc(String name, boolean isDistinct, Expression[] children) {
  this.name = name;
  this.isDistinct = isDistinct;
  this.children = children;
  this.orderingWithinGroups = null;
}
Contributor:
empty array is a better default
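
A sketch of the suggested default in Scala terms (illustrative only; the real class is Java, and the class name here is an assumption):

```scala
import org.apache.spark.sql.connector.expressions.Expression

// Defaulting to an empty array means callers of orderingWithinGroups()
// never have to null-check ordinary aggregates.
class GeneralAggregateFuncSketch(
    val name: String,
    val isDistinct: Boolean,
    val children: Array[Expression],
    val orderingWithinGroups: Array[Expression] = Array.empty)
```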

    String funcName, boolean isDistinct, String[] inputs, String[] orderingWithinGroups) {
  assert(isDistinct == false);
  String withinGroup =
    joinArrayToString(orderingWithinGroups, ", ", "WITHIN GROUP (ORDER BY ", ")");
Contributor:
how do we translate ASC/DESC?

Contributor Author:

Please refer to visitSortOrder.
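
For context, a hedged sketch of the sort-item rendering (the helper name and signature here are illustrative, not the exact visitSortOrder API):

```scala
// Direction and null ordering are rendered as SQL keywords after the
// sort key, producing items like "SALARY" ASC NULLS FIRST inside
// WITHIN GROUP (ORDER BY ...).
def renderSortItem(sortKey: String, direction: String, nullOrdering: String): String =
  s"$sortKey $direction $nullOrdering"

renderSortItem("\"SALARY\"", "ASC", "NULLS FIRST")
// => "SALARY" ASC NULLS FIRST
```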

@beliefer commented Jan 9, 2024

The GA failure is unrelated.

@@ -51,11 +53,21 @@ public final class GeneralAggregateFunc extends ExpressionWithToString implement
  private final String name;
  private final boolean isDistinct;
  private final Expression[] children;
  private final Expression[] orderingWithinGroups;
Contributor:
shall we use SortOrder[]?

@@ -64,6 +76,8 @@ public GeneralAggregateFunc(String name, boolean isDistinct, Expression[] childr
  @Override
  public Expression[] children() { return children; }

  public Expression[] orderingWithinGroups() { return orderingWithinGroups; }
Contributor:
ditto

@@ -42,7 +42,7 @@ private[sql] object H2Dialect extends JdbcDialect {

   private val distinctUnsupportedAggregateFunctions =
     Set("COVAR_POP", "COVAR_SAMP", "CORR", "REGR_INTERCEPT", "REGR_R2", "REGR_SLOPE", "REGR_SXY",
-      "MODE")
+      "MODE", "PERCENTILE_CONT", "PERCENTILE_DISC")
Contributor:
why do we need this change? H2 dialect deals with these two functions in visitInverseDistributionFunction

Contributor Author:
Because the H2 dialect overrides visitInverseDistributionFunction and checks the function name with isSupportedFunction:

     override def visitInverseDistributionFunction(
         funcName: String,
         isDistinct: Boolean,
         inputs: Array[String],
         orderingWithinGroups: Array[String]): String = {
       if (isSupportedFunction(funcName)) {
         super.visitInverseDistributionFunction(
           dialectFunctionName(funcName), isDistinct, inputs, orderingWithinGroups)
       } else {
         throw new UnsupportedOperationException(
           s"${this.getClass.getSimpleName} does not support " +
             s"inverse distribution function: $funcName")
       }
     }

@cloud-fan
thanks, merging to master!

@cloud-fan cloud-fan closed this in 85b504d Jan 10, 2024
@beliefer
@cloud-fan Thank you!
