
[SPARK-33391][SQL] element_at with CreateArray does not respect one-based index. #30296

Closed
wants to merge 5 commits

Conversation

leanken-zz (Contributor) commented Nov 9, 2020

What changes were proposed in this pull request?

element_at with CreateArray does not respect the one-based index.

repro steps:

```
var df = spark.sql("select element_at(array(3, 2, 1), 0)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 1)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 2)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 3)")
df.printSchema()

root
 |-- element_at(array(3, 2, 1), 0): integer (nullable = false)

root
 |-- element_at(array(3, 2, 1), 1): integer (nullable = false)

root
 |-- element_at(array(3, 2, 1), 2): integer (nullable = false)

root
 |-- element_at(array(3, 2, 1), 3): integer (nullable = true)
```

The correct nullability should be:
  • index 0: false, because evaluation throws ArrayIndexOutOfBoundsException("SQL array indices start at 1")
  • index 1: false
  • index 2: false
  • index 3: false
  • indices beyond 3 (e.g. 4, 5, 6, 7): true

For expression evaluation, ElementAt respects the one-based index, but when checking nullability it computes with a zero-based index via computeNullabilityFromArray.
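To make the mismatch concrete, here is a minimal standalone sketch of how element_at nullability should track the SQL one-based index. The names `Child` and `elementAtNullable` are illustrative assumptions, not Spark's actual classes or APIs.

```scala
// Hypothetical model of an array literal's elements: only their nullability matters here.
case class Child(nullable: Boolean)

// Nullability of element_at over a literal array, using SQL's one-based index.
def elementAtNullable(children: Seq[Child], index: Int): Boolean = {
  if (index == 0) {
    // 0 is invalid in SQL (indices start at 1); evaluation throws,
    // so the result can never be NULL.
    false
  } else {
    // Positive indices are one-based from the left; negative indices
    // count from the right.
    val zeroBased = if (index > 0) index - 1 else children.length + index
    if (zeroBased < 0 || zeroBased >= children.length) true // out of bounds => NULL
    else children(zeroBased).nullable
  }
}

val arr = Seq(Child(false), Child(false), Child(false)) // models array(3, 2, 1)
// elementAtNullable(arr, 1) == false, elementAtNullable(arr, 3) == false
// elementAtNullable(arr, 4) == true (out of bounds)
```

This reproduces the expected answers above: non-nullable for valid indices into an array of non-nullable literals, nullable only when the index is out of bounds.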

Why are the changes needed?

Correctness issue.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added UT and existing UT.

Change-Id: Ib0f29cdf1016d53c0ca121fa463c5a2a8fd6b960
github-actions bot added the SQL label Nov 9, 2020
@leanken-zz (Contributor Author):

@cloud-fan FYI.

SparkQA commented Nov 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35385/

```
@@ -1966,7 +1966,7 @@ case class ElementAt(left: Expression, right: Expression)
   }

   override def nullable: Boolean = left.dataType match {
-    case _: ArrayType => computeNullabilityFromArray(left, right)
+    case _: ArrayType => computeNullabilityFromArray(left, right, isOneBasedIndex = true)
```
Contributor:

how about computeNullabilityFromArray(left, Subtract(right, Literal(1)))?

Contributor Author:

What if it's a negative number like -1? -1 means the last element, counting from right to left.

Contributor Author:

```
_FUNC_(array, index) - Returns element of array at given (1-based) index. If index < 0,
      accesses elements from the last to the first. Returns NULL if the index exceeds the length
      of the array.
```

Contributor:

Got it. Then isOneBasedIndex is a misleading name. Maybe the parameter should be normalizeIndex: (Int, Int) => Int, which takes the array length and the index, and returns the normalized zero-based non-negative index.
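A sketch of the suggested parameterization; the names and signatures below are illustrative assumptions, not Spark's actual API. The default passes the index through unchanged (matching a zero-based caller such as GetArrayItem), while ElementAt would supply a normalizer that handles one-based and negative indices.

```scala
// Hypothetical helper: nullability lookup with a pluggable index normalizer.
// normalizeIndex maps (arrayLength, rawIndex) to a zero-based index.
def computeNullability(
    childNullable: Seq[Boolean],
    rawIndex: Int,
    normalizeIndex: (Int, Int) => Int = (_, i) => i): Boolean = {
  val i = normalizeIndex(childNullable.length, rawIndex)
  if (i < 0 || i >= childNullable.length) true // out of bounds: default to nullable
  else childNullable(i)
}

// ElementAt-style normalizer: one-based from the left, negative from the right.
val elementAtNormalizer: (Int, Int) => Int =
  (len, idx) => if (idx < 0) len + idx else idx - 1
```

One attraction of this shape is that each expression owns its indexing convention, so the shared helper never needs a boolean flag whose meaning drifts as new callers appear.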

SparkQA commented Nov 9, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35385/

SparkQA commented Nov 9, 2020

Test build #130776 has finished for PR 30296 at commit 10090a7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Change-Id: I702764bf4eb48e361f138f1c18246495ab99e570

```
def specialNormalizeIndex: (Int, Int) => Int = {
  (arrayLength: Int, index: Int) => {
    if (index < 0) {
      arrayLength + index
```
Contributor:

this can still be negative and fail, right?

Contributor Author:

Calling nullable will not throw an exception or fail; if the index is out of bounds, it just returns a default of true.

Contributor Author:

Yes, if the passed-in index is negative and arrayLength + index is still < 0, it will still fail.

Contributor Author:

Do we need to cover the case where arrayLength + index is still < 0 inside this specialNormalizeIndex?

```
(arrayLength: Int, index: Int) => {
  if (index < 0) {
    arrayLength + index
  } else if (index == 0) {
```
Contributor:

why not

```
if (index <= 0) {
  arrayLength + index
} ...
```

Contributor:

Actually, ElementAt fails at runtime if index == 0, so the nullable doesn't really matter.

Contributor Author:

But if the passed-in index is 0, it will be changed to -1 and the following code will be called. It will throw an exception, but the old behavior was to return a default of true.

```
ar(intOrdinal).nullable
```

Contributor Author:

I am just trying to follow the old behavior.

SparkQA commented Nov 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35393/

SparkQA commented Nov 9, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35393/

SparkQA commented Nov 9, 2020

Test build #130784 has finished for PR 30296 at commit c0bf2f2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Change-Id: Ib4985b7b4b63babcd7f6e2e3fdc02dc084745d72
SparkQA commented Nov 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35426/

SparkQA commented Nov 10, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35426/

```
  Seq(Row(3))
)

// 0 is not a valid index; nullable = false since it throws an exception.
```
Contributor:

can we test 4, -4 as well?

Contributor Author:

OK

```
@@ -1401,6 +1401,40 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSparkSession {
     assert(e3.message.contains(errorMsg3))
   }

   test("SPARK-33391: element_at with CreateArray") {
```
Contributor:

It seems overkill to have an end-to-end test for this. How about we just add more cases to CollectionExpressionsSuite's "correctly handles ElementAt nullability for arrays" to test negative and invalid indices?

Contributor Author:

sure

Contributor Author:

updated

Change-Id: Iba60b51d72c50c048591ac5924593030d354d2c0

```
assert(!ElementAt(array, Subtract(Literal(2), Literal(2))).nullable)
assert(!ElementAt(array, Literal(1)).nullable)
assert(ElementAt(array, Literal(2)).nullable)
assert(!ElementAt(array, Subtract(Literal(2), Literal(1))).nullable)
```
Contributor:

let's test valid negative ordinals.

Contributor Author:

done

SparkQA commented Nov 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35440/

Change-Id: If21a6914cb01bbf4f0253f5d733d8acc3f2d1f00
SparkQA commented Nov 10, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35440/

SparkQA commented Nov 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35445/

SparkQA commented Nov 10, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35445/

SparkQA commented Nov 10, 2020

Test build #130831 has finished for PR 30296 at commit 7d09d8c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor) commented Nov 10, 2020

GA passed, merging to master/3.0!

@cloud-fan cloud-fan closed this in e3a768d Nov 10, 2020
cloud-fan pushed a commit that referenced this pull request Nov 10, 2020
…index

### What changes were proposed in this pull request?

element_at with CreateArray not respect one based index.

repo step:

```
var df = spark.sql("select element_at(array(3, 2, 1), 0)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 1)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 2)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 3)")
df.printSchema()

root
 |-- element_at(array(3, 2, 1), 0): integer (nullable = false)

root
 |-- element_at(array(3, 2, 1), 1): integer (nullable = false)

root
 |-- element_at(array(3, 2, 1), 2): integer (nullable = false)

root
 |-- element_at(array(3, 2, 1), 3): integer (nullable = true)

correct answer should be
0 true which is outOfBounds return default true.
1 false
2 false
3 false

```

For expression eval, it respect the oneBasedIndex, but within checking the nullable, it calculates with zeroBasedIndex using `computeNullabilityFromArray`.

### Why are the changes needed?

Correctness issue.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added UT and existing UT.

Closes #30296 from leanken/leanken-SPARK-33391.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e3a768d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
SparkQA commented Nov 10, 2020

Test build #130817 has finished for PR 30296 at commit 4dac08d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 10, 2020

Test build #130836 has finished for PR 30296 at commit fc84cac.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member):

@leanken If possible, please also mention when we introduced the bug.

This is a regression introduced in https://issues.apache.org/jira/browse/SPARK-26965. Thus, Spark 2.4 is safe.

itholic (Contributor) commented Nov 17, 2020

Hi @leanken!
I'd like to ask: is the PR description for spark.sql("select element_at(array(3, 2, 1), 0)").printSchema() correct?
I ran the reproducer from the PR description and it shows a different result on the master branch.

```
scala> spark.sql("select element_at(array(3, 2, 1), 0)").printSchema()
root
 |-- element_at(array(3, 2, 1), 0): integer (nullable = false)
```

It returned nullable = false, but the PR description says that true is expected.
Could you please correct me if I missed something?
Thanks :)

@leanken-zz (Contributor Author):


Oh, the PR description is outdated, thanks for mentioning it. As for the correct answer, it should return false,
because spark.sql("select element_at(array(3, 2, 1), 0)").collect throws a runtime exception whether ANSI mode is on or off. In that case nullable should be false; if nullable were true, it would have to return null instead.

itholic (Contributor) commented Nov 17, 2020

Oh, I got it. Thanks for the quick response, @leanken !! :D
