
[SPARK-33391][SQL] element_at with CreateArray does not respect one-based index. #30296

Closed
wants to merge 5 commits

Conversation

leanken-zz (Contributor) commented Nov 9, 2020

What changes were proposed in this pull request?

element_at with CreateArray does not respect the one-based index.

repro steps:

```
var df = spark.sql("select element_at(array(3, 2, 1), 0)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 1)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 2)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 3)")
df.printSchema()

root
 |-- element_at(array(3, 2, 1), 0): integer (nullable = false)

root
 |-- element_at(array(3, 2, 1), 1): integer (nullable = false)

root
 |-- element_at(array(3, 2, 1), 2): integer (nullable = false)

root
 |-- element_at(array(3, 2, 1), 3): integer (nullable = true)
```

The correct nullability should be:
  • index 0: false, because evaluation throws ArrayIndexOutOfBoundsException("SQL array indices start at 1")
  • index 1: false
  • index 2: false
  • index 3: false
  • indices beyond 3 (e.g. 4, 5, 6, 7): true

For expression evaluation, ElementAt respects the one-based index, but when checking nullability it computes with a zero-based index via computeNullabilityFromArray.
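To make the mismatch concrete, here is a minimal standalone sketch of how element_at nullability should track the SQL one-based index. The names `Child` and `elementAtNullable` are illustrative assumptions, not Spark's actual classes or APIs.

```scala
// Hypothetical model of an array literal's elements: only their nullability matters here.
case class Child(nullable: Boolean)

// Nullability of element_at over a literal array, using SQL's one-based index.
def elementAtNullable(children: Seq[Child], index: Int): Boolean = {
  if (index == 0) {
    // 0 is invalid in SQL (indices start at 1); evaluation throws,
    // so the result can never be NULL.
    false
  } else {
    // Positive indices are one-based from the left; negative indices
    // count from the right.
    val zeroBased = if (index > 0) index - 1 else children.length + index
    if (zeroBased < 0 || zeroBased >= children.length) true // out of bounds => NULL
    else children(zeroBased).nullable
  }
}

val arr = Seq(Child(false), Child(false), Child(false)) // models array(3, 2, 1)
// elementAtNullable(arr, 1) == false, elementAtNullable(arr, 3) == false
// elementAtNullable(arr, 4) == true (out of bounds)
```

This reproduces the expected answers above: non-nullable for valid indices into an array of non-nullable literals, nullable only when the index is out of bounds.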

Why are the changes needed?

Correctness issue.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added UT and existing UT.

Change-Id: Ib0f29cdf1016d53c0ca121fa463c5a2a8fd6b960
github-actions bot added the SQL label Nov 9, 2020
@leanken-zz (Contributor Author):

@cloud-fan FYI.

SparkQA commented Nov 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35385/

```
@@ -1966,7 +1966,7 @@ case class ElementAt(left: Expression, right: Expression)
   }

   override def nullable: Boolean = left.dataType match {
-    case _: ArrayType => computeNullabilityFromArray(left, right)
+    case _: ArrayType => computeNullabilityFromArray(left, right, isOneBasedIndex = true)
```
Contributor:

how about computeNullabilityFromArray(left, Subtract(right, Literal(1)))?

Contributor Author:

What if it's a negative number like -1? -1 means the last element, counting from right to left.

Contributor Author:

```
_FUNC_(array, index) - Returns element of array at given (1-based) index. If index < 0,
      accesses elements from the last to the first. Returns NULL if the index exceeds the length
      of the array.
```

Contributor:

Got it. Then isOneBasedIndex is a misleading name. Maybe the parameter should be normalizeIndex: (Int, Int) => Int, which takes the array length and the index, and returns the normalized zero-based non-negative index.
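A sketch of the suggested parameterization; the names and signatures below are illustrative assumptions, not Spark's actual API. The default passes the index through unchanged (matching a zero-based caller such as GetArrayItem), while ElementAt would supply a normalizer that handles one-based and negative indices.

```scala
// Hypothetical helper: nullability lookup with a pluggable index normalizer.
// normalizeIndex maps (arrayLength, rawIndex) to a zero-based index.
def computeNullability(
    childNullable: Seq[Boolean],
    rawIndex: Int,
    normalizeIndex: (Int, Int) => Int = (_, i) => i): Boolean = {
  val i = normalizeIndex(childNullable.length, rawIndex)
  if (i < 0 || i >= childNullable.length) true // out of bounds: default to nullable
  else childNullable(i)
}

// ElementAt-style normalizer: one-based from the left, negative from the right.
val elementAtNormalizer: (Int, Int) => Int =
  (len, idx) => if (idx < 0) len + idx else idx - 1
```

One attraction of this shape is that each expression owns its indexing convention, so the shared helper never needs a boolean flag whose meaning drifts as new callers appear.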

SparkQA commented Nov 9, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35385/

SparkQA commented Nov 9, 2020

Test build #130776 has finished for PR 30296 at commit 10090a7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Change-Id: I702764bf4eb48e361f138f1c18246495ab99e570

```
def specialNormalizeIndex: (Int, Int) => Int = {
  (arrayLength: Int, index: Int) => {
    if (index < 0) {
      arrayLength + index
```
Contributor:

this can still be negative and fail, right?

Contributor Author:

Calling nullable will not throw an exception or fail; if the index is out of bounds, it just returns a default of true.

Contributor Author:

Yes, if the passed-in index is negative and arrayLength + index is still < 0, it will still fail.

Contributor Author:

Do we need to cover the case where arrayLength + index is still < 0 inside this specialNormalizeIndex?

```
(arrayLength: Int, index: Int) => {
  if (index < 0) {
    arrayLength + index
  } else if (index == 0) {
```
Contributor:

why not

```
if (index <= 0) {
  arrayLength + index
} ...
```

Contributor:

Actually, ElementAt fails at runtime if index == 0, so the nullable doesn't really matter.

Contributor Author:

But if the passed-in index is 0, it will be changed to -1 and the following code will be called. It will throw an exception, but the old behavior was to return a default of true.

```
ar(intOrdinal).nullable
```

Contributor Author:

I am just trying to follow the old behavior.

SparkQA commented Nov 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35393/

SparkQA commented Nov 9, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35393/

SparkQA commented Nov 9, 2020

Test build #130784 has finished for PR 30296 at commit c0bf2f2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Change-Id: Ib4985b7b4b63babcd7f6e2e3fdc02dc084745d72
SparkQA commented Nov 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35426/

SparkQA commented Nov 10, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35426/

```
  Seq(Row(3))
)

// 0 is not a valid index; nullable = false since it throws an exception.
```
Contributor:

can we test 4, -4 as well?

Contributor Author:

OK

```
@@ -1401,6 +1401,40 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSparkSession {
     assert(e3.message.contains(errorMsg3))
   }

   test("SPARK-33391: element_at with CreateArray") {
```
Contributor:

It seems overkill to have an end-to-end test for this. How about we just add more cases to CollectionExpressionsSuite's "correctly handles ElementAt nullability for arrays" to test negative and invalid indices?

Contributor Author:

sure

Contributor Author:

updated

Change-Id: Iba60b51d72c50c048591ac5924593030d354d2c0

```
assert(!ElementAt(array, Subtract(Literal(2), Literal(2))).nullable)
assert(!ElementAt(array, Literal(1)).nullable)
assert(ElementAt(array, Literal(2)).nullable)
assert(!ElementAt(array, Subtract(Literal(2), Literal(1))).nullable)
```
Contributor:

let's test valid negative ordinals.

Contributor Author:

done

SparkQA commented Nov 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35440/

Change-Id: If21a6914cb01bbf4f0253f5d733d8acc3f2d1f00
SparkQA commented Nov 10, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35440/

SparkQA commented Nov 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35445/

SparkQA commented Nov 10, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35445/

SparkQA commented Nov 10, 2020

Test build #130831 has finished for PR 30296 at commit 7d09d8c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor) commented Nov 10, 2020

GA passed, merging to master/3.0!

@cloud-fan cloud-fan closed this in e3a768d Nov 10, 2020
cloud-fan pushed a commit that referenced this pull request Nov 10, 2020
…index

### What changes were proposed in this pull request?

element_at with CreateArray not respect one based index.

repo step:

```
var df = spark.sql("select element_at(array(3, 2, 1), 0)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 1)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 2)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 3)")
df.printSchema()

root
 |-- element_at(array(3, 2, 1), 0): integer (nullable = false)

root
 |-- element_at(array(3, 2, 1), 1): integer (nullable = false)

root
 |-- element_at(array(3, 2, 1), 2): integer (nullable = false)

root
 |-- element_at(array(3, 2, 1), 3): integer (nullable = true)

correct answer should be
0 true which is outOfBounds return default true.
1 false
2 false
3 false

```

For expression eval, it respect the oneBasedIndex, but within checking the nullable, it calculates with zeroBasedIndex using `computeNullabilityFromArray`.

### Why are the changes needed?

Correctness issue.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added UT and existing UT.

Closes #30296 from leanken/leanken-SPARK-33391.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e3a768d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
SparkQA commented Nov 10, 2020

Test build #130817 has finished for PR 30296 at commit 4dac08d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 10, 2020

Test build #130836 has finished for PR 30296 at commit fc84cac.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member):

@leanken If possible, please also mention when we introduced the bug.

This is a regression introduced in https://issues.apache.org/jira/browse/SPARK-26965. Thus, Spark 2.4 is safe.

itholic (Contributor) commented Nov 17, 2020

Hi @leanken!
I'd like to ask: is the PR description for spark.sql("select element_at(array(3, 2, 1), 0)").printSchema() correct?
I ran the reproducer from the PR description and it shows a different result on the master branch.

```
scala> spark.sql("select element_at(array(3, 2, 1), 0)").printSchema()
root
 |-- element_at(array(3, 2, 1), 0): integer (nullable = false)
```

It returned nullable = false, but the PR description says that true is expected.
Could you please correct me if I missed something?
Thanks :)

@leanken-zz (Contributor Author):


Oh, the PR description is outdated, thanks for mentioning it. As for the correct answer, it should return false,
because spark.sql("select element_at(array(3, 2, 1), 0)").collect throws a runtime exception whether ANSI mode is on or off. In that case nullable should be false; if nullable were true, it would have to return null instead.

itholic (Contributor) commented Nov 17, 2020

Oh, I got it. Thanks for the quick response, @leanken !! :D
