Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-41233][FOLLOWUP] Refactor array_prepend with RuntimeReplaceable #40563

Closed

Conversation

beliefer
Copy link
Contributor

@beliefer beliefer commented Mar 27, 2023

What changes were proposed in this pull request?

Recently, Spark SQL supported array_insert and array_prepend. All implementations are individual.
In fact, array_prepend is special case of array_insert and we can reuse the array_insert by extends RuntimeReplaceable.

Why are the changes needed?

Simplify the implementation of array_prepend.

Does this PR introduce any user-facing change?

'No'.
Just update the inner implementation.

How was this patch tested?

Exists test case.

@github-actions github-actions bot added the SQL label Mar 27, 2023
@beliefer beliefer changed the title [SPARK-41232][SPARK-41233][FOLLOWUP] Refactor array_append and array_prepend with RuntimeReplaceable [SPARK-41232][SPARK-41233][FOLLOWUP] Refactor array_append and array_prepend with RuntimeReplaceable Mar 27, 2023
with ImplicitCastInputTypes
with ComplexTypeMergingExpression
with QueryErrorsBase {
trait ArrayInsertEnd extends RuntimeReplaceable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we rename this trait? ArrayInsertEnd cannot describe ArrayPrepend well

nullSafeEval(value1, value2)
}
override protected def withNewChildrenInternal(
newLeft: Expression, newRight: Expression): Expression = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return type change to ArrayAppend?


override protected def withNewChildrenInternal(
newLeft: Expression, newRight: Expression): Expression = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return type change to ArrayPrepend?

@transient protected lazy val elementType: DataType =
inputTypes.head.asInstanceOf[ArrayType].elementType
override lazy val replacement: Expression =
ArrayInsert(arr, Add(ArraySize(left), Literal(1)), ele)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use both arr and left?

@@ -1855,50 +1855,6 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
checkEvaluation(ArrayRepeat(Literal("hi"), Literal(null, IntegerType)), null)
}

test("SPARK-41233: ArrayPrepend") {

This comment was marked as resolved.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm... please ignore the comments above

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to delete the test cases? Should be still relevant right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to delete the test cases? Should be still relevant right?

Runtimereplace do not support execute and it will be replaced with other expression.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope we have reviewed that we are having all types of test cases deleted here in DataFrameFunctionsSuite. If we are missing, please help add those.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked and added test cases in DataFrameFunctionsSuite

@beliefer
Copy link
Contributor Author

ping @zhengruifeng @HyukjinKwon @dongjoon-hyun cc @infoankitp @navinvishy

@@ -1855,50 +1855,6 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
checkEvaluation(ArrayRepeat(Literal("hi"), Literal(null, IntegerType)), null)
}

test("SPARK-41233: ArrayPrepend") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to delete the test cases? Should be still relevant right?

with QueryErrorsBase {

override def nullable: Boolean = left.nullable
trait InsertArrayOneSide extends RuntimeReplaceable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Should we just call this RuntimeReplaceableBinary or something? Just a thought, since there doesn't seem to be anything specific about arrays in this trait.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InsertArrayOneSide should be better name. Thank you.

@@ -1,2 +1,2 @@
Project [array_append(e#0, 1) AS array_append(e, 1)#0]
Project [array_insert(e#0, (size(e#0, false) + 1), 1) AS array_append(e, 1)#0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can fix array_insert to have getTagValue(FunctionRegistry.FUNC_ALIAS).getOrElse(name), and set the alias in FunctionRegistry by setting setAlias to true

This comment was marked as outdated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon I tested and found this can't given help.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


override def nullable: Boolean = left.nullable
trait InsertArrayOneSide extends RuntimeReplaceable
with ImplicitCastInputTypes with BinaryLike[Expression] with QueryErrorsBase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any diff between BinaryLike[Expression] and BinaryExpression? Also, any specific reason for removing ComplexTypeMergingExpression, I think this can help in assigning values of containsNull and nullable fields ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BinaryExpression could be evaluated. In fact, the implementation of RuntimeReplaceable could not be evaluated.
ArrayInsert already implemented the ComplexTypeMergingExpression.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense ! Thanks so much!

@transient protected lazy val elementType: DataType =
inputTypes.head.asInstanceOf[ArrayType].elementType
override lazy val replacement: Expression =
ArrayInsert(left, Add(ArraySize(left), Literal(1)), right)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inspite of ArraySize(left), I think ArrayInsert supports negative values, how about using Literal(-1), to select the position.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Literal(-1) let insert element before the last one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh ! Thanks for confirming!

@@ -1855,50 +1855,6 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
checkEvaluation(ArrayRepeat(Literal("hi"), Literal(null, IntegerType)), null)
}

test("SPARK-41233: ArrayPrepend") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope we have reviewed that we are having all types of test cases deleted here in DataFrameFunctionsSuite. If we are missing, please help add those.

@zhengruifeng
Copy link
Contributor

ping @cloud-fan

finalData.update(arrayData.numElements(), elementData)
new GenericArrayData(finalData)
}
override lazy val replacement: Expression = ArrayInsert(left, Literal(0), right)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not related to this PR, but isn't the implementation of array_insert wrong? The doc says: Places val into index pos of array x. Array indices start at 1, or start from the end if index is negative.

According to the doc, array_insert(arr, 1, val) should put val as the first element in this array. array_insert(arr, 0, val) should fail as 0 is not a valid index. However, the current implementation of array_insert uses 0-based index. cc @zhengruifeng @HyukjinKwon @dtenedor

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan it is still 1-based, but also treat 0 as the first item. see the discussion #38867 (comment)

+----------------------------------+----------------------------------+----------------------------------+
|array_insert(array(1, 2, 3), 1, 4)|array_insert(array(1, 2, 3), 0, 4)|array_insert(array(1, 2, 3), 2, 4)|
+----------------------------------+----------------------------------+----------------------------------+
|                      [4, 1, 2, 3]|                      [4, 1, 2, 3]|                      [1, 4, 2, 3]|
+----------------------------------+----------------------------------+----------------------------------+

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for missing that discussion. I think failing is more consistent with other functions that use 1-based index.

And here we should use 1 as the index to be more future-proof.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, let me send a fix for it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do want to reserve index 0 for some special behavior in array_insert, I think appending the value to the end of the array seems more convenient?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose to remove support for index 0 ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't have to reserve index 0, make it fails seems good to me

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great, let's fix it soon before the 3.4 release. also cc @xinrong-meng

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

append value to the end of array for index 0 seems hard to understand.

@transient protected lazy val elementType: DataType =
inputTypes.head.asInstanceOf[ArrayType].elementType
override lazy val replacement: Expression =
ArrayInsert(left, Add(ArraySize(left), Literal(1)), right)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A risk is left can be evaluated twice if common subexpression elimination is disabled, or the underlying execution engine (native engine) does not support it.

I'd prefer to keep ArrayAppend as it is, or update ArrayInsert to provide an easy way to do append.

MaxGekk pushed a commit that referenced this pull request Apr 4, 2023
### What changes were proposed in this pull request?
Make `array_insert` fail when input index `pos` is zero.

### Why are the changes needed?
see #40563 (comment)

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
updated UT

Closes #40641 from zhengruifeng/sql_array_insert_fails_zero.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
MaxGekk pushed a commit that referenced this pull request Apr 4, 2023
### What changes were proposed in this pull request?
Make `array_insert` fail when input index `pos` is zero.

### Why are the changes needed?
see #40563 (comment)

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
updated UT

Closes #40641 from zhengruifeng/sql_array_insert_fails_zero.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 3e9574c)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
@beliefer
Copy link
Contributor Author

beliefer commented Apr 4, 2023

cc @MaxGekk

@@ -1,2 +1,2 @@
Project [array_prepend(e#0, 1) AS array_prepend(e, 1)#0]
Project [array_insert(e#0, 0, 1) AS array_prepend(e, 1)#0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we fix this?


override def nullable: Boolean = left.nullable
override lazy val replacement: Expression = ArrayInsert(left, Literal(0), right)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will fail now.

@beliefer beliefer force-pushed the SPARK-41232_SPARK-41233_followup branch from 2196c2e to b04a175 Compare April 6, 2023 04:57
@beliefer beliefer changed the title [SPARK-41232][SPARK-41233][FOLLOWUP] Refactor array_append and array_prepend with RuntimeReplaceable [SPARK-41233][FOLLOWUP] Refactor array_prepend with RuntimeReplaceable Apr 6, 2023
@cloud-fan
Copy link
Contributor

Can we update the PR description? array_append is untouched now.

@cloud-fan
Copy link
Contributor

cloud-fan commented Apr 6, 2023

After a second thought, this makes the performance worse. We should improve ArrayInsert to generate more efficient code if the position argument is constant. For example, if the position argument is constant 1, then we don't need to generate code to check if position is negative or not.

@beliefer
Copy link
Contributor Author

beliefer commented Apr 7, 2023

Can we update the PR description? array_append is untouched now.

Updated.

@beliefer
Copy link
Contributor Author

beliefer commented Apr 7, 2023

After a second thought, this makes the performance worse. We should improve ArrayInsert to generate more efficient code if the position argument is constant. For example, if the position argument is constant 1, then we don't need to generate code to check if position is negative or not.

You means create another PR to simplify the code of ArrayInsert.

@beliefer beliefer force-pushed the SPARK-41232_SPARK-41233_followup branch from b04a175 to c939d6e Compare April 20, 2023 07:14
@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 3e4a32d Apr 24, 2023
@beliefer
Copy link
Contributor Author

beliefer commented May 4, 2023

@cloud-fan Thanks !

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
### What changes were proposed in this pull request?
Make `array_insert` fail when input index `pos` is zero.

### Why are the changes needed?
see apache#40563 (comment)

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
updated UT

Closes apache#40641 from zhengruifeng/sql_array_insert_fails_zero.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 3e9574c)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants