
[SPARK-56561][DOCS] Document order preservation for array_distinct, array_intersect, array_union, array_except#55549

Open
shrirangmhalgi wants to merge 1 commit into apache:master from shrirangmhalgi:SPARK-56561-doc-array-order

Conversation

@shrirangmhalgi

What changes were proposed in this pull request?

This change documents the order preservation behavior of array_distinct, array_intersect, array_union, and array_except in:

  • SQL function descriptions (@ExpressionDescription)
  • Scala API scaladoc (functions.scala)
  • PySpark docstrings (builtin.py)

Also fixes an incorrect statement in array_except's scaladoc, which said "The order of elements in the result is not determined". The implementation preserves the order of the first array.

Why are the changes needed?

With this change users will not have to read implementation code to know whether these functions preserve element order. This is useful for code reviews and helps AI coding agents understand the behavior.

Does this PR introduce any user-facing change?

No. This PR only updates documentation.

How was this patch tested?

  1. Ran the unit tests with SBT. The relevant tests in CollectionExpressionsSuite and DataFrameFunctionsSuite pass:
  • build/sbt 'catalyst/testOnly *CollectionExpressionsSuite -- -z "Array Distinct" -z "Array Union" -z "Array Except" -z "Array Intersect"'
  • build/sbt 'sql/testOnly *DataFrameFunctionsSuite -- -z "array_distinct" -z "array_intersect" -z "array_union" -z "array_except"'
  2. Runtime verification using spark-shell:
import org.apache.spark.sql.functions._
val df = spark.createDataFrame(Seq((Array(3,1,2,1,3), Array(2,4,3)))).toDF("a","b")
val r1 = df.select(array_distinct(col("a"))).collect()(0).getSeq[Int](0)
println(s"array_distinct([3,1,2,1,3]) = $r1")

Result - array_distinct([3,1,2,1,3]) = ArraySeq(3, 1, 2)

import org.apache.spark.sql.functions._
val df = spark.createDataFrame(Seq((Array(3,1,2,1,3), Array(2,4,3)))).toDF("a","b")
val r2 = df.select(array_union(col("a"), col("b"))).collect()(0).getSeq[Int](0)
println(s"array_union([3,1,2,1,3], [2,4,3]) = $r2")

Result - array_union([3,1,2,1,3], [2,4,3]) = ArraySeq(3, 1, 2, 4)

import org.apache.spark.sql.functions._
val df = spark.createDataFrame(Seq((Array(3,1,2,1,3), Array(2,4,3)))).toDF("a","b")
val r3 = df.select(array_intersect(col("a"), col("b"))).collect()(0).getSeq[Int](0)
println(s"array_intersect([3,1,2,1,3], [2,4,3]) = $r3")

Result - array_intersect([3,1,2,1,3], [2,4,3]) = ArraySeq(3, 2)

import org.apache.spark.sql.functions._
val df = spark.createDataFrame(Seq((Array(3,1,2,1,3), Array(2,4,3)))).toDF("a","b")
val r4 = df.select(array_except(col("a"), col("b"))).collect()(0).getSeq[Int](0)
println(s"array_except([3,1,2,1,3], [2,4,3]) = $r4")

Result - array_except([3,1,2,1,3], [2,4,3]) = ArraySeq(1)
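
The four results above are consistent with a "first occurrence wins" ordering. As a rough illustration only, the same outputs can be reproduced with plain Scala collections. This is a model of the observed behavior, not Spark's implementation; the object and method names are hypothetical, and null handling is ignored.

```scala
// Plain-Scala model of the ordering observed in the spark-shell runs above.
// NOT Spark's implementation; all names here are hypothetical.
object ArrayOrderModel {
  // Keep the first occurrence of each element, in input order.
  def distinctModel(a: Seq[Int]): Seq[Int] = a.distinct

  // Elements of a, then elements of b not already seen, duplicates removed.
  def unionModel(a: Seq[Int], b: Seq[Int]): Seq[Int] = (a ++ b).distinct

  // Distinct elements of a that also occur in b, in a's order.
  def intersectModel(a: Seq[Int], b: Seq[Int]): Seq[Int] = {
    val inB = b.toSet
    a.distinct.filter(inB)
  }

  // Distinct elements of a that do not occur in b, in a's order.
  def exceptModel(a: Seq[Int], b: Seq[Int]): Seq[Int] = {
    val inB = b.toSet
    a.distinct.filterNot(inB)
  }

  def main(args: Array[String]): Unit = {
    val a = Seq(3, 1, 2, 1, 3)
    val b = Seq(2, 4, 3)
    println(distinctModel(a))     // List(3, 1, 2)
    println(unionModel(a, b))     // List(3, 1, 2, 4)
    println(intersectModel(a, b)) // List(3, 2)
    println(exceptModel(a, b))    // List(1)
  }
}
```

The model matches the spark-shell results only for this one input pair; it says nothing about whether the ordering is part of the API contract.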

What changes were proposed in this pull request?

Documentation update.

Was this patch authored or co-authored using generative AI tooling?

No.

@shrirangmhalgi force-pushed the SPARK-56561-doc-array-order branch from 8e03f20 to 534cf1b on April 25, 2026 06:54
@shrirangmhalgi force-pushed the SPARK-56561-doc-array-order branch from 534cf1b to 0b421b0 on April 25, 2026 06:56
@shrirangmhalgi
Author

Could somebody please help with the PR review?

@sarutak
Member

sarutak commented Apr 30, 2026

Hi @shrirangmhalgi, the current implementation does preserve element order, but "the implementation happens to do X" is different from "the API guarantees X." Documenting it as a guarantee constrains future optimizations (sort-based deduplication, parallelization, etc.) and cannot easily be walked back.

When array_union, array_intersect, and array_except were added (SPARK-23913/23914/23915), each PR description explicitly stated "The order of elements in the result is not defined." While the PR description for array_distinct does not mention the order of the result elements, it should be consistent with array_union and the others.
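
If the result order is treated as undefined, as suggested above, callers who need a stable order can impose one explicitly. The following is a minimal sketch, assuming a spark-shell session (where `spark` is already defined) and the same sample data as in the PR description. It wraps array_union in sort_array, so the output no longer depends on how the union is computed internally.

```scala
import org.apache.spark.sql.functions._

// Assumes spark-shell, where `spark` is predefined. Sample data is taken from the PR description.
val df = spark.createDataFrame(Seq((Array(3, 1, 2, 1, 3), Array(2, 4, 3)))).toDF("a", "b")

// sort_array sorts ascending by default, so the result order is explicit
// rather than inherited from the implementation of array_union.
val sorted = df
  .select(sort_array(array_union(col("a"), col("b"))))
  .collect()(0)
  .getSeq[Int](0)
  .toList

println(sorted) // List(1, 2, 3, 4)
```

The same wrapping should work for array_distinct, array_intersect, and array_except. The trade-off is that any meaningful input order is discarded.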
