[SPARK-39213][SQL] Create ANY_VALUE aggregate function #36584

vitaliili-db · 2022-05-17T19:57:41Z

What changes were proposed in this pull request?

Adds an implementation of the ANY_VALUE aggregate function. During the optimization stage it is rewritten to the First aggregate function.
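
To make the rewrite concrete, here is a minimal toy sketch in plain Scala. It does not use Spark's Catalyst classes: `Expr`, `Attr`, and `rewrite` are hypothetical names invented for illustration, and `First` and `AnyValue` here are simplified stand-ins that only mirror the `(child, ignoreNulls)` shape shown in this PR's diff. It is one way such a replacement could look, not the code in this patch.

```scala
// Toy model only: these types are hypothetical stand-ins, not Catalyst classes.
object AnyValueRewriteSketch {
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class First(child: Expr, ignoreNulls: Boolean) extends Expr
  final case class AnyValue(child: Expr, ignoreNulls: Boolean) extends Expr

  // Replace every AnyValue node with an equivalent First node, bottom-up.
  def rewrite(e: Expr): Expr = e match {
    case AnyValue(child, ignoreNulls) => First(rewrite(child), ignoreNulls)
    case First(child, ignoreNulls)    => First(rewrite(child), ignoreNulls)
    case attr: Attr                   => attr
  }

  def main(args: Array[String]): Unit = {
    // The AnyValue node is gone after the rewrite; only First remains.
    println(rewrite(AnyValue(Attr("col"), ignoreNulls = false)))
  }
}
```

Because the rewritten tree contains only `First`, everything downstream of the rewrite (in this toy model) needs to know about a single aggregate, while users still see a separate `ANY_VALUE` name.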

Why are the changes needed?

This provides feature parity with popular databases and data warehouses.

Does this PR introduce any user-facing change?

Yes. It introduces a new aggregate function, ANY_VALUE, and updates the corresponding documentation.

How was this patch tested?

Unit tests

AmplabJenkins · 2022-05-18T07:10:32Z

Can one of the admins verify this patch?

vitaliili-db · 2022-05-18T17:24:07Z

@dtenedor

dtenedor

Generally LGTM, we should just update the function comment.

dtenedor · 2022-05-18T17:44:10Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AnyValue.scala

+import org.apache.spark.sql.types._
+
+/**
+ * Returns some value of `child` for a group of rows. The result will not be deterministic.


Maybe just update this comment to mention that this returns the first value in the group, and that the implementation is the same as the First aggregate function?
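
For readers unfamiliar with "first value in the group" semantics, the following plain-Scala sketch models them on an in-memory sequence. It is an assumption-laden illustration, not Spark code: `firstValue` is a hypothetical helper, and `Option` stands in for a nullable column value (`None` plays the role of SQL NULL).

```scala
object FirstValueSemanticsSketch {
  // Models a nullable column value as Option[A]; None stands in for SQL NULL.
  def firstValue[A](rows: Seq[Option[A]], ignoreNulls: Boolean): Option[A] =
    if (ignoreNulls) rows.flatten.headOption // first non-null value, if any
    else rows.headOption.flatten             // first row's value, even if NULL

  def main(args: Array[String]): Unit = {
    val group = Seq(None, Some(5), Some(20))
    println(firstValue(group, ignoreNulls = false)) // None
    println(firstValue(group, ignoreNulls = true))  // Some(5)
    // The same rows arriving in a different order give a different answer.
    println(firstValue(group.reverse, ignoreNulls = false)) // Some(20)
  }
}
```

The last call shows why the doc comment describes the result as not deterministic: the answer depends on the order in which rows reach the aggregate, and that order is generally not guaranteed within a group.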

MaxGekk · 2022-05-18T18:38:12Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AnyValue.scala

+    The function is non-deterministic.
+  """,
+  group = "agg_funcs",
+  since = "3.3.0")


Since SPARK-39213 is not in the allow list for 3.3 (see https://lists.apache.org/thread/2tl67py05t1620v3fk8ms672mnxt6nol), the changes shouldn't be targeted to 3.3. Please change the version to 3.4.0 here and in other places.

Good catch, thank you! Fixed.

vitaliili-db · 2022-05-19T16:10:30Z

@MaxGekk please review and help me merge this.

MaxGekk · 2022-05-19T16:34:48Z

How about adding the function to other APIs, like first() in

PySpark:

spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala

Line 500 in b63674e

def first(e: Column, ignoreNulls: Boolean): Column = withAggregateFunction {

R:

spark/R/pkg/R/functions.R

Line 1178 in 16d1c68

setMethod("first",

BTW, if the purpose of this new feature is to make migrations to Spark SQL from other systems easier, I would propose to add it to Spark SQL only (and not extend functions.scala).

vitaliili-db · 2022-05-19T16:44:56Z

@MaxGekk Yes, the purpose is ease of migration, removed change to functions.scala to limit scope to Spark SQL only.

MaxGekk · 2022-05-20T08:19:18Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AnyValue.scala

+  """,
+  group = "agg_funcs",
+  since = "3.4.0")
+case class AnyValue(child: Expression, ignoreNulls: Boolean)


Could you please explain why you need a separate expression, and why any_value() is not implemented as an alias of First, like first_value():

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala

Line 469 in 7221ea3

expression[First]("first_value", true),

This is primarily for documentation purposes.

MaxGekk

Waiting for CI.

MaxGekk · 2022-05-20T19:27:32Z

+1, LGTM. Merging to master.
Thank you, @vli-databricks, and @dtenedor for the review.

github-actions bot added DOCS SQL labels May 17, 2022

vitaliili-db force-pushed the SPARK-39213 branch 3 times, most recently from aaaad89 to 2a166f3 on May 18, 2022 00:04

vitaliili-db force-pushed the SPARK-39213 branch from 2a166f3 to d3afed3 on May 18, 2022 17:21

dtenedor approved these changes May 18, 2022

vitaliili-db force-pushed the SPARK-39213 branch from d3afed3 to 59e27e2 on May 18, 2022 18:00

MaxGekk reviewed May 18, 2022

vitaliili-db force-pushed the SPARK-39213 branch from 59e27e2 to 1cee3dd on May 18, 2022 21:01

MaxGekk changed the title ~~[SPARK-39213] Create ANY_VALUE aggregate function~~ [SPARK-39213][SQL] Create ANY_VALUE aggregate function May 19, 2022

vitaliili-db force-pushed the SPARK-39213 branch from 1cee3dd to 5913683 on May 19, 2022 16:43

MaxGekk requested changes May 20, 2022

MaxGekk reviewed May 20, 2022

Moving any_value to expressions.aggregate

634b3d6

vitaliili-db force-pushed the SPARK-39213 branch from 5913683 to 634b3d6 on May 20, 2022 16:48

MaxGekk approved these changes May 20, 2022

MaxGekk closed this in efc1e8a May 20, 2022
