
[SPARK-38997][SQL] DS V2 aggregate push-down supports group by expressions #36325

Closed
wants to merge 9 commits

Conversation

Contributor

@beliefer beliefer commented Apr 22, 2022

What changes were proposed in this pull request?

Currently, Spark DS V2 aggregate push-down only supports grouping by columns,
but SQL like the statement below is very useful and common.

```
SELECT
  CASE
    WHEN "SALARY" > 8000.00
      AND "SALARY" < 10000.00
    THEN "SALARY"
    ELSE 0.00
  END AS key,
  SUM("SALARY")
FROM "test"."employee"
GROUP BY key
```

Why are the changes needed?

Let DS V2 aggregate push-down support group by expressions.

Does this PR introduce any user-facing change?

'No'.
New feature.

How was this patch tested?

New tests

@github-actions github-actions bot added the SQL label Apr 22, 2022
@beliefer
Contributor Author

ping @huaxingao cc @cloud-fan

@@ -106,7 +107,9 @@ object AggregatePushDownUtils {
// aggregate push down simple and don't handle this complicate case for now.
return None
}
- aggregation.groupByColumns.foreach { col =>
+ aggregation.groupByExpressions.foreach { expr =>
+   assert(expr.isInstanceOf[FieldReference])
Contributor

How can we make this assertion? Do you mean the v2 pushdown framework still only supports group by attrs?

@@ -206,14 +213,15 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan] with PredicateHelper wit
DataSourceV2ScanRelation(sHolder.relation, wrappedScan, output)
if (r.supportCompletePushDown(pushedAggregates.get)) {
val projectExpressions = finalResultExpressions.map { expr =>
// TODO At present, only push down group by attribute is supported.
// In future, more attribute conversion is extended here. e.g. GetStructField
expr.transform {
Contributor

Let's explicitly use transformDown, as we need to match an expression tree for the group by expression.

Contributor Author

OK

"""
|PushedAggregates: [SUM(SALARY)],
|PushedFilters: [],
|PushedGroupByColumns:
Contributor

Suggested change
|PushedGroupByColumns:
|PushedGroupByExpressions:

@@ -106,7 +107,9 @@ object AggregatePushDownUtils {
// aggregate push down simple and don't handle this complicate case for now.
return None
}
- aggregation.groupByColumns.foreach { col =>
+ aggregation.groupByExpressions.foreach { expr =>
+   if (!expr.isInstanceOf[FieldReference]) return None
Contributor
@cloud-fan cloud-fan Apr 26, 2022

nit:

val colName = expr match {
  case f: FieldReference if f.fieldNames.length == 1 => Some(f.fieldNames.head)
  case _ => None
}
if (colName.isEmpty || !isPartitionCol(colName.get)) return None
finalSchema = ...

- val groupByColNames = aggregation.groupByColumns.map(_.fieldNames.head)
+ val groupByColNames =
+   aggregation.groupByExpressions.map { expr =>
+     assert(expr.isInstanceOf[FieldReference])
Contributor

Ditto; we can even have a common function to extract the column name:

private def extractColName(v2Expr: V2Expression): Option[String] = ...
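
A minimal sketch of such a helper, combining the single-part-name check from the nit above with the signature suggested here (illustrative, assuming the file's existing imports with `Expression` aliased as `V2Expression`; the merged code may differ):

```scala
// Sketch: map a V2 expression back to a plain column name, if possible.
// Only a top-level FieldReference with a single name part qualifies.
private def extractColName(v2Expr: V2Expression): Option[String] = v2Expr match {
  case f: FieldReference if f.fieldNames.length == 1 => Some(f.fieldNames.head)
  case _ => None
}
```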

- case PushableColumnWithoutNestedColumn(name) =>
-   Some(FieldReference.column(name).asInstanceOf[FieldReference])
+ def translateGroupBy(e: Expression): Option[V2Expression] = e match {
+   case PushableExpression(expr) => Some(expr)
Contributor

To confirm: this supports nested columns, right?

Contributor Author

PushableExpression uses PushableColumnWithoutNestedColumn by default.

case (expr, b) =>
if (!groupByExprToOutputOrdinal.contains(expr.canonicalized)) {
groupByExprToOutputOrdinal(expr.canonicalized) = ordinal
ordinal += 1
Contributor

This looks incorrect. We use the ordinal to access groupAttrs, but here we skip incrementing the ordinal if the group by expression is an attribute. I think we should use normalizedGroupingExpressions.zip(newOutput).zipWithIndex.

BTW, do we have a test for mixed group by attr and group by expr?

Contributor Author

Thank you. I forgot that case.
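
One way to realize the suggestion above, deriving the ordinal from the zipped position instead of a hand-maintained counter (a sketch based on the names in the diff and comment, not necessarily the merged code):

```scala
// Every grouping expression gets an ordinal equal to its position in the
// scan output, so attribute-only grouping keys cannot desynchronize the map.
normalizedGroupingExpressions.zip(newOutput).zipWithIndex.foreach {
  case ((expr, _), ordinal) =>
    if (!groupByExprToOutputOrdinal.contains(expr.canonicalized)) {
      groupByExprToOutputOrdinal(expr.canonicalized) = ordinal
    }
}
```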

@@ -106,11 +107,11 @@ object AggregatePushDownUtils {
// aggregate push down simple and don't handle this complicate case for now.
return None
}
- aggregation.groupByColumns.foreach { col =>
+ aggregation.groupByExpressions.flatMap(extractColName).foreach { fieldName =>
Contributor

We should return None earlier if any of the group by expressions is not a column.

Contributor
@cloud-fan cloud-fan left a comment

LGTM except for one issue

@cloud-fan
Contributor
Contributor

thanks, merging to master/3.3! (I believe this is the last API-breaking change for the DS v2 pushdown API)

@cloud-fan cloud-fan closed this in ee6ea3c Apr 27, 2022
cloud-fan pushed a commit that referenced this pull request Apr 27, 2022
[SPARK-38997][SQL] DS V2 aggregate push-down supports group by expressions

### What changes were proposed in this pull request?
Currently, Spark DS V2 aggregate push-down only supports grouping by columns,
but SQL like the statement below is very useful and common.
```
SELECT
  CASE
    WHEN "SALARY" > 8000.00
      AND "SALARY" < 10000.00
    THEN "SALARY"
    ELSE 0.00
  END AS key,
  SUM("SALARY")
FROM "test"."employee"
GROUP BY key
```

### Why are the changes needed?
Let DS V2 aggregate push-down support group by expressions.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New tests

Closes #36325 from beliefer/SPARK-38997.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ee6ea3c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@beliefer
Contributor Author
Contributor Author

@cloud-fan Thank you for the help!

chenzhx pushed a commit to chenzhx/spark that referenced this pull request May 17, 2022
chenzhx pushed a commit to chenzhx/spark that referenced this pull request May 27, 2022
chenzhx added a commit to Kyligence/spark that referenced this pull request May 31, 2022
* [SPARK-38997][SQL] DS V2 aggregate push-down supports group by expressions

### What changes were proposed in this pull request?
Currently, Spark DS V2 aggregate push-down only supports grouping by columns,
but SQL like the statement below is very useful and common.
```
SELECT
  CASE
    WHEN "SALARY" > 8000.00
      AND "SALARY" < 10000.00
    THEN "SALARY"
    ELSE 0.00
  END AS key,
  SUM("SALARY")
FROM "test"."employee"
GROUP BY key
```

### Why are the changes needed?
Let DS V2 aggregate push-down support group by expressions.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New tests

Closes apache#36325 from beliefer/SPARK-38997.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

fix ut

* [SPARK-39135][SQL] DS V2 aggregate partial push-down should supports group by without aggregate functions

### What changes were proposed in this pull request?
Currently, the SQL shown below is not supported by DS V2 aggregate partial push-down:
`select key from tab group by key`

### Why are the changes needed?
Make DS V2 aggregate partial push-down support group by without aggregate functions.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New tests

Closes apache#36492 from beliefer/SPARK-39135.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
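
As a user-visible illustration of this change, a grouped query with no aggregate functions can now be partially pushed to the source (the `h2` catalog and `test.employee` table mirror the test fixtures quoted elsewhere in this thread; the explain output is indicative only):

```scala
// Assuming a session with a JDBC catalog named "h2", as in JDBCV2Suite.
val df = spark.sql("SELECT DEPT FROM h2.test.employee GROUP BY DEPT")
df.explain()
// The scan should now report the grouping push-down, e.g.
// PushedAggregates: [], PushedGroupByExpressions: [DEPT]
```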

* [SPARK-39157][SQL] H2Dialect should override getJDBCType so as make the data type is correct

### What changes were proposed in this pull request?
Currently, `H2Dialect` does not implement `getJDBCType` of `JdbcDialect`, so DS V2 push-down throws the exception shown below:
```
Job aborted due to stage failure: Task 0 in stage 13.0 failed 1 times, most recent failure: Lost task 0.0 in stage 13.0 (TID 13) (jiaan-gengdembp executor driver):
 org.h2.jdbc.JdbcSQLNonTransientException: Unknown data type: "STRING"; SQL statement:
SELECT "DEPT","NAME","SALARY","BONUS","IS_MANAGER" FROM "test"."employee"  WHERE ("BONUS" IS NOT NULL) AND ("DEPT" IS NOT NULL) AND (CAST("BONUS" AS string) LIKE '%30%') AND (CAST("DEPT" AS byte) > 1) AND (CAST("DEPT" AS short) > 1) AND (CAST("BONUS" AS decimal(20,2)) > 1200.00)    [50004-210]
```
H2Dialect should implement `getJDBCType` of `JdbcDialect`.

### Why are the changes needed?
Make the H2 data types correct.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Fix a bug for `H2Dialect`.

### How was this patch tested?
New tests.

Closes apache#36516 from beliefer/SPARK-39157.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
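
A sketch of the kind of override the fix calls for, mapping Catalyst types to explicit H2 type names inside the dialect (the concrete mappings are illustrative assumptions, not necessarily the merged `H2Dialect` code):

```scala
import java.sql.Types
import org.apache.spark.sql.jdbc.JdbcType
import org.apache.spark.sql.types._

// Inside an H2-style JdbcDialect implementation:
override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
  case StringType           => Some(JdbcType("CLOB", Types.CLOB))
  case BooleanType          => Some(JdbcType("BOOLEAN", Types.BOOLEAN))
  case ShortType | ByteType => Some(JdbcType("SMALLINT", Types.SMALLINT))
  case t: DecimalType =>
    Some(JdbcType(s"NUMERIC(${t.precision},${t.scale})", Types.NUMERIC))
  case _ => None // fall back to the common JDBC type mapping
}
```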

* [SPARK-39162][SQL] Jdbc dialect should decide which function could be pushed down

### What changes were proposed in this pull request?
Regardless of whether the functions are ANSI or not, support varies across databases.
So we should add a new API to `JdbcDialect` so that each JDBC dialect decides which functions can be pushed down.

### Why are the changes needed?
Make function push-down more flexible.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
Existing tests.

Closes apache#36521 from beliefer/SPARK-39162.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: huaxingao <huaxin_gao@apple.com>
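
A sketch of how a dialect might opt in under such an API (the `isSupportedFunction` hook and its signature are assumptions based on this description; check the merged patch for the exact name):

```scala
import org.apache.spark.sql.jdbc.JdbcDialect

// Hypothetical dialect that only allows a known-safe set of functions
// to be compiled into its SQL; everything else stays in Spark.
case object MyDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mydb")

  private val supportedFunctions = Set("ABS", "COALESCE", "UPPER", "LOWER")

  // Assumed hook: the push-down framework asks the dialect before
  // translating a function call into dialect SQL.
  override def isSupportedFunction(funcName: String): Boolean =
    supportedFunctions.contains(funcName)
}
```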

* [SPARK-38897][SQL] DS V2 supports push down string functions

### What changes were proposed in this pull request?

Currently, Spark has some string functions from the ANSI standard. Please refer to

https://github.com/apache/spark/blob/2f8613f22c0750c00cf1dcfb2f31c431d8dc1be7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L503

These functions are shown below:
`SUBSTRING,`
`UPPER`,
`LOWER`,
`TRANSLATE`,
`TRIM`,
`OVERLAY`

Mainstream database support for these functions is shown below.
Function | PostgreSQL | ClickHouse | H2 | MySQL | Oracle | Redshift | Presto | Teradata | Snowflake | DB2 | Vertica | Exasol | SqlServer | Yellowbrick | Impala | Mariadb | Druid | Pig | SQLite | Influxdata | Singlestore | ElasticSearch
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
`SUBSTRING` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
`UPPER` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
`LOWER` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
`TRIM` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
`TRANSLATE` | Yes | No | Yes | No | Yes | Yes | No | No | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | No | No | No | No | No | No
`OVERLAY` | Yes | No | No | No | Yes | No | No | No | No | Yes | Yes | No | No | No | No | No | No | No | No | No | No | No

DS V2 should support pushing down these string functions.

### Why are the changes needed?

DS V2 should support pushing down string functions.

### Does this PR introduce _any_ user-facing change?

'No'.
New feature.

### How was this patch tested?

New tests.

Closes apache#36330 from chenzhx/spark-master.

Authored-by: chenzhx <chen@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
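
A hedged usage sketch: with these translations in place, a filter built from one of the listed functions can travel into the JDBC query instead of being evaluated in Spark (the table name mirrors the test fixtures above; the exact `PushedFilters` text is illustrative):

```scala
import org.apache.spark.sql.functions.upper
import spark.implicits._

val df = spark.table("h2.test.employee").filter(upper($"NAME") === "AMY")
df.explain()
// Expect something like: PushedFilters: [NAME IS NOT NULL, UPPER(NAME) = 'AMY']
```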

* [SPARK-28330][SQL] Support ANSI SQL: result offset clause in query expression

### What changes were proposed in this pull request?
This is ANSI SQL; the feature ID is `F861`:
```
<query expression> ::=
[ <with clause> ] <query expression body>
[ <order by clause> ] [ <result offset clause> ] [ <fetch first clause> ]

<result offset clause> ::=
OFFSET <offset row count> { ROW | ROWS }
```
For example:
```
SELECT customer_name, customer_gender FROM customer_dimension
   WHERE occupation='Dancer' AND customer_city = 'San Francisco' ORDER BY customer_name;
    customer_name     | customer_gender
----------------------+-----------------
 Amy X. Lang          | Female
 Anna H. Li           | Female
 Brian O. Weaver      | Male
 Craig O. Pavlov      | Male
 Doug Z. Goldberg     | Male
 Harold S. Jones      | Male
 Jack E. Perkins      | Male
 Joseph W. Overstreet | Male
 Kevin . Campbell     | Male
 Raja Y. Wilson       | Male
 Samantha O. Brown    | Female
 Steve H. Gauthier    | Male
 William . Nielson    | Male
 William Z. Roy       | Male
(14 rows)

SELECT customer_name, customer_gender FROM customer_dimension
   WHERE occupation='Dancer' AND customer_city = 'San Francisco' ORDER BY customer_name OFFSET 8;
   customer_name   | customer_gender
-------------------+-----------------
 Kevin . Campbell  | Male
 Raja Y. Wilson    | Male
 Samantha O. Brown | Female
 Steve H. Gauthier | Male
 William . Nielson | Male
 William Z. Roy    | Male
(6 rows)
```
Several mainstream databases support this syntax.

**Druid**
https://druid.apache.org/docs/latest/querying/sql.html#offset

**Kylin**
http://kylin.apache.org/docs/tutorial/sql_reference.html#QUERYSYNTAX

**Exasol**
https://docs.exasol.com/sql/select.htm

**Greenplum**
http://docs.greenplum.org/6-8/ref_guide/sql_commands/SELECT.html

**MySQL**
https://dev.mysql.com/doc/refman/5.6/en/select.html

**Monetdb**
https://www.monetdb.org/Documentation/SQLreference/SQLSyntaxOverview#SELECT

**PostgreSQL**
https://www.postgresql.org/docs/11/queries-limit.html

**Sqlite**
https://www.sqlite.org/lang_select.html

**Vertica**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Statements/SELECT/OFFSETClause.htm?zoom_highlight=offset

The design is described below:
**1**. Consider `OFFSET` as a special case of `LIMIT`. For example:
`SELECT * FROM a limit 10;` similar to `SELECT * FROM a limit 10 offset 0;`
`SELECT * FROM a offset 10;` similar to `SELECT * FROM a limit -1 offset 10;`
**2**. The current implementation of `LIMIT` has good performance. For example,
`SELECT * FROM a limit 10;` is parsed into the logical plan below:
```
GlobalLimit (limit = 10)
|--LocalLimit (limit = 10)
```
and then the physical plan as below:
```
GlobalLimitExec (limit = 10) // Take the first 10 rows globally
|--LocalLimitExec (limit = 10) // Take the first 10 rows locally
```
This operator avoids massive shuffles and has good performance.
Sometimes the logical plan is transformed into a physical plan such as:
```
CollectLimitExec (limit = 10) // Take the first 10 rows globally
```
If the SQL contains an order by, such as `SELECT * FROM a order by c limit 10;`,
it will be transformed into the physical plan below:
```
TakeOrderedAndProjectExec (limit = 10) // Take the first 10 rows after sort globally
```

Based on this situation, this PR produces the following operations. For example,
`SELECT * FROM a limit 10 offset 10;` is parsed into the logical plan below:
```
GlobalLimit (limit = 10)
|--LocalLimit (limit = 10)
   |--Offset (offset = 10)
```
After optimization, the above logic plan will be transformed to:
```
GlobalLimitAndOffset (limit = 10, offset = 10) // Limit clause accompanied by offset clause
|--LocalLimit (limit = 20)   // 10 + offset = 20
```

and then the physical plan as below:
```
GlobalLimitAndOffsetExec (limit = 10, offset = 10) // Skip the first 10 rows and take the next 10 rows globally
|--LocalLimitExec (limit = 20) // Take the first 20(limit + offset) rows locally
```
Sometimes the logical plan is transformed into a physical plan such as:
```
CollectLimitExec (limit = 10, offset = 10) // Skip the first 10 rows and take the next 10 rows globally
```
If the SQL contains an order by, such as `SELECT * FROM a order by c limit 10 offset 10;`,
it will be transformed into the physical plan below:
```
TakeOrderedAndProjectExec (limit = 10, offset 10) // Skip the first 10 rows and take the next 10 rows after sort globally
```
**3**. In addition to the above, there is a special case with only an offset and no limit. For example,
`SELECT * FROM a offset 10;` is parsed into the logical plan below:
```
Offset (offset = 10) // Only offset clause
```
If the offset is very large, it will generate a lot of overhead. So this PR refuses the offset clause without a limit clause, although we can parse, transform and execute it.

A balanced idea is to add a configuration item `spark.sql.forceUsingOffsetWithoutLimit` to force running the query when the user knows the offset is small enough. The default value of `spark.sql.forceUsingOffsetWithoutLimit` is false. This PR just brings up the idea so that it can be implemented at a better time in the future.

Note: the original PR to support this feature is apache#25416.
Because that PR is too old and has massive conflicts that are hard to resolve, I opened this new PR to support the feature.

### Why are the changes needed?
new feature

### Does this PR introduce any user-facing change?
'No'

### How was this patch tested?
Existing and new UTs

Closes apache#35975 from beliefer/SPARK-28330.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39057][SQL] Offset could work without Limit

### What changes were proposed in this pull request?
Currently, `Offset` must work with `Limit`; the current behavior does not allow using offset alone or adding an offset API to `DataFrame`.

If we use `Offset` alone, there are two situations:
1. If `Offset` is the last operator, collect the result to the driver and then drop/skip the first n (offset value) rows. Users can test or debug `Offset` this way.
2. If `Offset` is an intermediate operator, shuffle all the results to one task, drop/skip the first n (offset value) rows, and pass the result to the downstream operator.

For example, `SELECT * FROM a offset 10;` is parsed into the logical plan below:
```
Offset (offset = 10) // Only offset clause
|--Relation
```

and then the physical plan as below:
```
CollectLimitExec(limit = -1, offset = 10) // Collect the result to the driver and skip the first 10 rows
|--JDBCRelation
```
or
```
GlobalLimitAndOffsetExec(limit = -1, offset = 10) // Collect the result and skip the first 10 rows
|--JDBCRelation
```

After this PR is merged, users can run the SQL shown below:
```
SELECT '' AS ten, unique1, unique2, stringu1
FROM onek
ORDER BY unique1 OFFSET 990;
```

Note: apache#35975 supports the offset clause by creating a logical node named
`GlobalLimitAndOffset`. In fact, we can avoid that node and use `Offset` instead, which keeps the naming unified.

### Why are the changes needed?
Improve the implementation of the offset clause.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
Existing test cases.

Closes apache#36417 from beliefer/SPARK-28330_followup2.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39159][SQL] Add new Dataset API for Offset

### What changes were proposed in this pull request?
Spark recently added the `Offset` operator.
This PR adds an `offset` API to `Dataset`.

### Why are the changes needed?
An `offset` API is very useful and makes constructing test cases easier.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New tests.

Closes apache#36519 from beliefer/SPARK-39159.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
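
A short usage sketch of the new API (assuming the `offset(n: Int)` signature this description implies):

```scala
// Skip the first 10 rows, then take the next 5; offset composes with
// orderBy and limit like any other Dataset operator.
val ds = spark.range(1, 100).orderBy("id")
val page = ds.offset(10).limit(5)
page.show()
```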

* fix ut

* update spark version

Co-authored-by: Jiaan Geng <beliefer@163.com>
cloud-fan pushed a commit that referenced this pull request Jun 9, 2022
[SPARK-38997][SPARK-39037][SQL][FOLLOWUP] `PushableColumnWithoutNestedColumn` needs to be translated to a predicate too

### What changes were proposed in this pull request?
#35768 assumed that the expressions in `And`, `Or` and `Not` must be predicates.
#36370 and #36325 supported pushing down expressions in `GROUP BY` and `ORDER BY`, but the children of `And`, `Or` and `Not` can be `FieldReference.column(name)`,
which is not a predicate, so the assert may fail.

### Why are the changes needed?
This PR fixes the bug for `PushableColumnWithoutNestedColumn`.

### Does this PR introduce _any_ user-facing change?
'Yes'.
It makes the push-down framework more correct.

### How was this patch tested?
New tests

Closes #36776 from beliefer/SPARK-38997_SPARK-39037_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Jun 9, 2022
chenzhx pushed a commit to chenzhx/spark that referenced this pull request Jun 13, 2022
chenzhx pushed a commit to chenzhx/spark that referenced this pull request Jun 15, 2022
chenzhx added a commit to Kyligence/spark that referenced this pull request Jun 15, 2022
[SPARK-39316][SQL] Merge PromotePrecision and CheckOverflow into decimal binary arithmetic (#481)

* [SPARK-39270][SQL] JDBC dialect supports registering dialect specific functions

### What changes were proposed in this pull request?
The built-in functions in Spark are not the same as those of a JDBC database.
We can give users the chance to register dialect-specific functions.

### Why are the changes needed?
JDBC dialect supports registering dialect specific functions

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New tests.

Closes apache#36649 from beliefer/SPARK-39270.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
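
A sketch of what registering a dialect-specific function could look like (every name here, including the `functions` override, is an illustrative assumption based on this description rather than the merged API):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.functions.{BoundFunction, ScalarFunction, UnboundFunction}
import org.apache.spark.sql.types._

// Hypothetical dialect-specific function: character length of a string.
object MyStrLen extends UnboundFunction {
  override def name(): String = "my_strlen"
  override def description(): String = "my_strlen(str) - length of str"
  override def bind(inputType: StructType): BoundFunction =
    new ScalarFunction[Int] {
      override def inputTypes(): Array[DataType] = Array(StringType)
      override def resultType(): DataType = IntegerType
      override def name(): String = "my_strlen"
      override def produceResult(input: InternalRow): Int =
        input.getUTF8String(0).numChars()
    }
}

// Inside a JdbcDialect implementation, expose it so the JDBC catalog can
// resolve it by name (assumed hook):
//   override def functions: Seq[(String, UnboundFunction)] =
//     Seq(MyStrLen.name() -> MyStrLen)
```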

* [SPARK-39413][SQL] Capitalize sql keywords in JDBCV2Suite

### What changes were proposed in this pull request?
`JDBCV2Suite` has some test cases in which SQL keywords are not capitalized.
This PR capitalizes the SQL keywords in `JDBCV2Suite`.

### Why are the changes needed?
Capitalize SQL keywords in `JDBCV2Suite`.

### Does this PR introduce _any_ user-facing change?
'No'.
Just update test cases.

### How was this patch tested?
N/A.

Closes apache#36805 from beliefer/SPARK-39413.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: huaxingao <huaxin_gao@apple.com>

* [SPARK-38997][SPARK-39037][SQL][FOLLOWUP] `PushableColumnWithoutNestedColumn` needs to be translated to a predicate too

### What changes were proposed in this pull request?
apache#35768 assumed that the expressions in `And`, `Or` and `Not` must be predicates.
apache#36370 and apache#36325 supported pushing down expressions in `GROUP BY` and `ORDER BY`, but the children of `And`, `Or` and `Not` can be `FieldReference.column(name)`,
which is not a predicate, so the assert may fail.

### Why are the changes needed?
This PR fixes the bug for `PushableColumnWithoutNestedColumn`.

### Does this PR introduce _any_ user-facing change?
'Yes'.
It makes the push-down framework more correct.

### How was this patch tested?
New tests

Closes apache#36776 from beliefer/SPARK-38997_SPARK-39037_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39316][SQL] Merge PromotePrecision and CheckOverflow into decimal binary arithmetic

### What changes were proposed in this pull request?

The main change:
- Add a new method `resultDecimalType` in `BinaryArithmetic`
- Add a new expression `DecimalAddNoOverflowCheck` for internal decimal adds, e.g. `Sum`/`Average`; the differences from `Add` are:
  - `DecimalAddNoOverflowCheck` does not check overflow
  - `DecimalAddNoOverflowCheck` takes `dataType` as an input parameter
- Merge the decimal precision code of `DecimalPrecision` into each arithmetic data type, so every arithmetic operator reports an accurate decimal type, and we can remove the unused expression `PromotePrecision` and related code
- Merge `CheckOverflow` into the arithmetic eval and code-gen code paths, so every arithmetic operator can handle the overflow case at runtime

Merge `PromotePrecision` into `dataType`, for example, `Add`:
```scala
override def resultDecimalType(p1: Int, s1: Int, p2: Int, s2: Int): DecimalType = {
  val resultScale = max(s1, s2)
  if (allowPrecisionLoss) {
    DecimalType.adjustPrecisionScale(max(p1 - s1, p2 - s2) + resultScale + 1,
      resultScale)
  } else {
    DecimalType.bounded(max(p1 - s1, p2 - s2) + resultScale + 1, resultScale)
  }
}
```

Merge `CheckOverflow`, for example, `Add` eval:
```scala
dataType match {
  case decimalType: DecimalType =>
    val value = numeric.plus(input1, input2)
    checkOverflow(value.asInstanceOf[Decimal], decimalType)
  ...
}
```

Note that `CheckOverflow` is still useful after this PR, e.g. in `RowEncoder`. We can go further in a separate PR.

### Why are the changes needed?

Fix the bug of `TypeCoercion`, for example:
```sql
SELECT CAST(1 AS DECIMAL(28, 2))
UNION ALL
SELECT CAST(1 AS DECIMAL(18, 2)) / CAST(1 AS DECIMAL(18, 2));
```

Relax the decimal precision at runtime, so we do not need a redundant `Cast`.

### Does this PR introduce _any_ user-facing change?

yes, bug fix

### How was this patch tested?

Pass existing tests and add some bug-fix tests in `decimalArithmeticOperations.sql`.

Closes apache#36698 from ulysses-you/decimal.

Lead-authored-by: ulysses-you <ulyssesyou18@gmail.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* fix ut

Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: ulysses-you <ulyssesyou18@gmail.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
yabola pushed a commit to Kyligence/spark that referenced this pull request Jun 21, 2022
leejaywei pushed a commit to Kyligence/spark that referenced this pull request Jul 14, 2022
zheniantoushipashi pushed a commit to Kyligence/spark that referenced this pull request Aug 8, 2022
RolatZhang pushed a commit to Kyligence/spark that referenced this pull request Aug 29, 2023