[SPARK-42746][SQL] Implement LISTAGG function #48748
Conversation
},
"FUNCTION_AND_ORDER_EXPRESSION_MISMATCH" : {
  "message" : [
    "The arguments <functionArgs> of the function <functionName> do not match to ordering within group <orderExpr> when use DISTINCT."
how about
Function <funcName> is invoked with DISTINCT. The WITHIN GROUP ordering expressions must be picked from the function inputs, but got <orderingExpr>.
We can make it a sub error condition of INVALID_WITHIN_GROUP_EXPRESSION: MISMATCH_WITH_DISTINCT_INPUT
Good idea. Now, with the common INVALID_WITHIN_GROUP_EXPRESSION prefix, it looks like: "Invalid function <funcName> with WITHIN GROUP. The function is invoked with DISTINCT and WITHIN GROUP but expressions <funcArg> and <orderingExpr> do not match. The WITHIN GROUP ordering expression must be picked from the function inputs."
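For illustration, a hypothetical query that should hit this sub-condition (the table and column names here are invented):

```scala
// Invalid: with DISTINCT, the WITHIN GROUP ordering key must be one of the
// function inputs (col1 here), not a different column (col2).
spark.sql(
  """SELECT listagg(DISTINCT col1, ',') WITHIN GROUP (ORDER BY col2)
    |FROM VALUES ('a', 1), ('b', 2) AS t(col1, col2)""".stripMargin)
// expected to fail with INVALID_WITHIN_GROUP_EXPRESSION.MISMATCH_WITH_DISTINCT_INPUT
```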
/**
 * Sort buffer according orderExpressions.
 * If orderExpressions is empty them returns buffer as is.
Suggested change:
- * If orderExpressions is empty them returns buffer as is.
+ * If orderExpressions is empty then returns buffer as is.
-- !query schema
struct<listagg(c1, NULL):binary>
-- !query output
ޭ��
The golden file test framework should print the hex string of binary values. We can improve it in followup PRs.
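As a rough sketch of that follow-up (not the framework's actual code; toHex is an invented name), hex-encoding a binary value before printing could look like:

```scala
// Render a byte array as an uppercase hex string,
// e.g. toHex(Array(0xDE.toByte, 0xAD.toByte)) == "DEAD".
def toHex(bytes: Array[Byte]): String = bytes.map("%02X".format(_)).mkString
```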
cc @mitkedb for this
  }
}

private[this] def hexToBytes(s: String): Array[Byte] = {
not needed now.
override def inputTypes: Seq[AbstractDataType] =
  TypeCollection(
    StringTypeWithCollation(supportsTrimCollation = true),
how is trim collation supported?
As I understand it, collation only affects comparison, so it matters only for DISTINCT and ORDER BY. DISTINCT is handled by the aggregation framework, and it respects trim collations. ORDER BY is handled by the code with PhysicalDataType.ordering, which respects trim collations too.
Added tests for trim collations
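For reference, a sketch of the kind of query such tests would cover (assuming a Spark 4.0 trim-collation name like UTF8_BINARY_RTRIM; the data is invented):

```scala
// With an RTRIM collation, 'a' and 'a ' compare equal, so DISTINCT keeps
// only one of them before concatenation.
spark.sql(
  """SELECT listagg(DISTINCT col COLLATE UTF8_BINARY_RTRIM, ',')
    |FROM VALUES ('a'), ('a ') AS t(col)""".stripMargin).show()
```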
LGTM except for https://github.com/apache/spark/pull/48748/files#r1852128035
  return result;
}

public static byte[] concatWS(byte[] delimiter, byte[]... inputs) {
can you please add a comment saying what this function is doing?
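For readers of this thread, a Scala reference sketch of what the new ByteArray.concatWS is intended to compute (ignoring null handling, which the real Java implementation must also consider; concatWSRef is an invented name):

```scala
// Concatenate byte arrays, inserting the delimiter between consecutive inputs:
// concatWSRef(";".getBytes, "a".getBytes, "b".getBytes) has the bytes of "a;b".
def concatWSRef(delimiter: Array[Byte], inputs: Array[Byte]*): Array[Byte] =
  inputs.reduceOption((a, b) => a ++ delimiter ++ b).getOrElse(Array.empty[Byte])
```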
    result, Platform.BYTE_ARRAY_OFFSET + offset,
    len);
  offset += len;
  if(i < inputs.length - 1) {
Suggested change:
- if(i < inputs.length - 1) {
+ if (i < inputs.length - 1) {
for (int i = 0; i < inputs.length; i++) {
  byte[] input = inputs[i];
  int len = input.length;
  Platform.copyMemory(
this seems copied from L154 above, please dedup into one place?
I didn't want to accidentally change existing behavior or performance, so I thought a little copy-paste was justified in this isolated code. But I was probably worrying too much :)
Removed
dataType match {
  case BinaryType =>
    val inputs = buffer.filter(_ != null).map(_.asInstanceOf[Array[Byte]])
    ByteArray.concatWS(delimiterValue.asInstanceOf[Array[Byte]], inputs.toSeq: _*)
we repeat the .asInstanceOf[Array[Byte]] two times here, can we use a pattern match to reduce this to one?
added more type strictness
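A hypothetical shape of that refactor (not necessarily the merged code), matching on the delimiter once instead of casting it in each branch:

```scala
(dataType, delimiterValue) match {
  case (BinaryType, delimiter: Array[Byte]) =>
    val inputs = buffer.filter(_ != null).map(_.asInstanceOf[Array[Byte]])
    ByteArray.concatWS(delimiter, inputs.toSeq: _*)
  case (_: StringType, delimiter: UTF8String) =>
    val inputs = buffer.filter(_ != null).map(_.asInstanceOf[UTF8String])
    UTF8String.concatWs(delimiter, inputs.toSeq: _*)
}
```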
    ByteArray.concatWS(delimiterValue.asInstanceOf[Array[Byte]], inputs.toSeq: _*)
  case _: StringType =>
    val inputs = buffer.filter(_ != null).map(_.asInstanceOf[UTF8String])
    UTF8String.concatWs(delimiterValue.asInstanceOf[UTF8String], inputs.toSeq: _*)
These concatenations consume input memory without bound. Do we have some kind of limit to this? If we consume a very large disk-based input table in the aggregation it could crash the executors by running out of memory. We should probably create SQLConfs with max limits for these buffers.
Yes, it's a common problem for all collect_* functions. As I tested, they now fail with OOM if the buffer is too big. And the percentile_disc doc says the same (lines 403 to 407 in ad49fcf):

 * Because the number of elements and their partial order cannot be determined in advance.
 * Therefore we have to store all the elements in memory, and so notice that too many elements can
 * cause GC paused and eventually OutOfMemory Errors.
 */
case class PercentileDisc(

I think it's a common problem and should be handled in follow-ups.
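To make that follow-up idea concrete, a hypothetical guard (the threshold and any SQLConf wiring are invented for illustration, not part of this PR):

```scala
// Fail fast instead of OOM-ing when a single group's buffer grows too large.
val maxEntries = 1000000 // in a real change this would come from a SQLConf
if (buffer.length > maxEntries) {
  throw new IllegalStateException(
    s"listagg buffer exceeded $maxEntries elements for a single group")
}
```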
    input, Platform.BYTE_ARRAY_OFFSET,
    result, Platform.BYTE_ARRAY_OFFSET + offset,
    len);
  input, Platform.BYTE_ARRAY_OFFSET,
nit: 2 spaces indentation
offset += len;
if (delimiter.length > 0 && i < inputs.length - 1) {
  Platform.copyMemory(
    delimiter, Platform.BYTE_ARRAY_OFFSET,
ditto
val sortOrderExpression = orderExpressions.head
val ascendingOrdering = PhysicalDataType.ordering(sortOrderExpression.dataType)
val ordering =
  if (sortOrderExpression.direction == Ascending) ascendingOrdering
SortOrder has a nullOrdering flag, shall we respect it here?
I'm wondering if we should reuse the code in SortExec to do sorting.
listagg filters all null values from the result, and in this case it's sorted by the same value, so null ordering does nothing.
I see, but the Spark native sorter should be more efficient and support spilling.
We can leave it for future optimization
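For context, a sketch completing the quoted snippet above (simplified; assumes a single ordering expression, as in the diff):

```scala
// Natural ordering for the key's physical type, reversed for DESC.
val ascendingOrdering = PhysicalDataType.ordering(sortOrderExpression.dataType)
val ordering =
  if (sortOrderExpression.direction == Ascending) ascendingOrdering
  else ascendingOrdering.reverse
```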
thanks, merging to master!
Follow-up PR #49231 (listagg for PySpark):

What changes were proposed in this pull request?
Added new function `listagg` to pyspark. Follow-up of #48748.

Why are the changes needed?
Allows to use native Python functions to write queries with `listagg`. E.g., `df.select(F.listagg(df.value, ",").alias("r"))`.

Does this PR introduce any user-facing change?
Yes, new functions `listagg` and `listagg_distinct` (with aliases `string_agg` and `string_agg_distinct`) in pyspark.

How was this patch tested?
Unit tests

Was this patch authored or co-authored using generative AI tooling?
Generated-by: GitHub Copilot

Closes #49231 from mikhailnik-db/SPARK-50220-listagg-for-pyspark.
Authored-by: Mikhail Nikoliukin <mikhail.nikoliukin@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
 * @group agg_funcs
 * @since 4.0.0
 */
def listagg(e: Column, delimiter: Column): Column = Column.fn("listagg", e, delimiter)
Declaring the delimiter as String here can improve UX a bit. Since it only allows foldable string literals, we can rely on the compiler instead of runtime errors, WDYT @cloud-fan
SGTM
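A sketch of the suggested String-typed variant (hypothetical signature, shown only for the discussion; lit comes from org.apache.spark.sql.functions):

```scala
import org.apache.spark.sql.functions.lit

// Accepting the delimiter as a String makes a non-literal delimiter a
// compile-time impossibility rather than a runtime error.
def listagg(e: Column, delimiter: String): Column =
  Column.fn("listagg", e, lit(delimiter))
```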
What changes were proposed in this pull request?
Implement the new aggregation function:
listagg([ALL | DISTINCT] expr[, sep]) [WITHIN GROUP (ORDER BY key [ASC | DESC] [,...])]
Why are the changes needed?
Listagg is a popular function implemented by many other vendors. For now, users have to use workarounds like this. This PR will close the gap.
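For example (illustrative data):

```scala
// Comma-separated employee names per department, sorted alphabetically.
spark.sql(
  """SELECT dept, listagg(name, ', ') WITHIN GROUP (ORDER BY name) AS names
    |FROM VALUES ('eng', 'bob'), ('eng', 'alice'), ('hr', 'carol') AS t(dept, name)
    |GROUP BY dept""".stripMargin).show()
// eng -> "alice, bob", hr -> "carol"
```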
Does this PR introduce any user-facing change?
Yes, the new listagg function. BigQuery and PostgreSQL have the same function but with the string_agg name, so I added it as an alias.

How was this patch tested?
With new unit tests
Was this patch authored or co-authored using generative AI tooling?
Generated-by: GitHub Copilot