Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-25371][SQL] struct() should allow being called with 0 args #22373

Closed
wants to merge 3 commits into from

Conversation

mgaido91
Copy link
Contributor

@mgaido91 mgaido91 commented Sep 9, 2018

What changes were proposed in this pull request?

SPARK-21281 introduced a check for the inputs of CreateStructLike to be non-empty. This means that struct(), which was previously considered valid, now throws an Exception. This behavior change was introduced in 2.3.0. The change may break users' application on upgrade and it causes VectorAssembler to fail when an empty inputCols is defined.

The PR removes the added check making struct() valid again.

How was this patch tested?

added UT

@mgaido91
Copy link
Contributor Author

mgaido91 commented Sep 9, 2018

cc @dongjoon-hyun @gatorsmile @maropu who worked on SPARK-21281

@SparkQA
Copy link

SparkQA commented Sep 9, 2018

Test build #95847 has finished for PR 22373 at commit 7bc026e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -256,4 +256,9 @@ class VectorAssemblerSuite
assert(runWithMetadata("keep", additional_filter = "id1 > 2").count() == 4)
}

test("SPARK-25371: VectorAssembler with empty inputCols") {
val vectorAssembler = new VectorAssembler().setInputCols(Array()).setOutputCol("a")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is VectorAssembler with zero input column useful?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't sound that useful, but the JIRA suggests this is the behavior in 2.2. It throws a weird error in 2.3. I could imagine just allowing this behavior, or throwing a better exception. Is there a use case for no input? maybe you have some reusable pipeline that is applied to a subset of columns and sometimes it matches nothing. The output is empty but maybe that doesn't matter for whatever purpose it serves... maybe it's assembled with something else afterwards. I could picture a valid use case.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Sep 10, 2018

@mgaido91, BTW are you sure SPARK-21281 introduced that behaviour change?

Before:

scala> import org.apache.spark.sql.functions.struct
import org.apache.spark.sql.functions.struct

scala> spark.range(1).select(struct()).show()
org.apache.spark.sql.AnalysisException: cannot resolve 'named_struct()' due to data type mismatch: input to function named_struct requires at least one argument;;
'Project [unresolvedalias(named_struct(), None)]
+- Range (0, 1, step=1, splits=Some(8))

After:

scala> import org.apache.spark.sql.functions.struct
import org.apache.spark.sql.functions.struct

scala> spark.range(1).select(struct()).show()
java.lang.AssertionError: assertion failed: each serializer expression should contain at least one `BoundReference`
  at scala.Predef$.assert(Predef.scala:170)

it should be good to document that behaviour change if that's allowed before in a separate PR.

@mgaido91
Copy link
Contributor Author

@HyukjinKwon I am sure, since I tried removing the added check and the UT I added here passed.

@mgaido91
Copy link
Contributor Author

cc @cloud-fan @jerryshao despite a very minor one, this can be considered a regression, so may be considered as a blocker for 2.4/2.3.2?

@cloud-fan
Copy link
Contributor

I think we should allow struct function to take empty arguments.

@mgaido91
Copy link
Contributor Author

I think we should allow struct function to take empty arguments.

Yes, I agree @cloud-fan. At least until 3.0 IMHO. But since the change was already released in 3.0, I was not sure whether to revert that additional check or add this logic here. If nobody objects, I'd switch to remove the check, since this would also prevent other users's workflows to potentially break having a regression.

@cloud-fan
Copy link
Contributor

@maropu says it's OK to revert that part, @mgaido91 can you do that? thanks!

@mgaido91
Copy link
Contributor Author

thanks @cloud-fan @maropu , I'll update this accordingly ASAP, thanks

@maropu
Copy link
Member

maropu commented Sep 10, 2018

ya, sorry for bothering you all. thansk, @mgaido91

@mgaido91 mgaido91 changed the title [SPARK-25371][ML] VectorAssembler should not fail with empty inputCols [SPARK-25371][SQL] struct() should allow being called with 0 args Sep 10, 2018
@@ -256,4 +256,9 @@ class VectorAssemblerSuite
assert(runWithMetadata("keep", additional_filter = "id1 > 2").count() == 4)
}

test("SPARK-25371: VectorAssembler with empty inputCols") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need this test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, after this patch this test passes, before it doesn't. I think it is helpful to avoid regressions like this in the future.

@cloud-fan
Copy link
Contributor

Can you also update the PR description? thanks!

@SparkQA
Copy link

SparkQA commented Sep 10, 2018

Test build #95877 has finished for PR 22373 at commit 69ff3cb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Sep 10, 2018

Test build #95895 has finished for PR 22373 at commit 69ff3cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

asfgit pushed a commit that referenced this pull request Sep 11, 2018
## What changes were proposed in this pull request?

SPARK-21281 introduced a check for the inputs of `CreateStructLike` to be non-empty. This means that `struct()`, which was previously considered valid, now throws an Exception.  This behavior change was introduced in 2.3.0. The change may break users' application on upgrade and it causes `VectorAssembler` to fail when an empty `inputCols` is defined.

The PR removes the added check making `struct()` valid again.

## How was this patch tested?

added UT

Closes #22373 from mgaido91/SPARK-25371.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0736e72)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Copy link
Contributor

thanks, merging to master/2.4!

@mgaido91 can you send a new PR to 2.3? it conflicts

@asfgit asfgit closed this in 0736e72 Sep 11, 2018
mgaido91 added a commit to mgaido91/spark that referenced this pull request Sep 11, 2018
SPARK-21281 introduced a check for the inputs of `CreateStructLike` to be non-empty. This means that `struct()`, which was previously considered valid, now throws an Exception.  This behavior change was introduced in 2.3.0. The change may break users' application on upgrade and it causes `VectorAssembler` to fail when an empty `inputCols` is defined.

The PR removes the added check making `struct()` valid again.

added UT

Closes apache#22373 from mgaido91/SPARK-25371.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM too

otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
SPARK-21281 introduced a check for the inputs of `CreateStructLike` to be non-empty. This means that `struct()`, which was previously considered valid, now throws an Exception.  This behavior change was introduced in 2.3.0. The change may break users' application on upgrade and it causes `VectorAssembler` to fail when an empty `inputCols` is defined.

The PR removes the added check making `struct()` valid again.

added UT

Closes apache#22373 from mgaido91/SPARK-25371.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0736e72)

RB=1520856
G=superfriends-reviewers
R=fli,mshen,yezhou,edlu
A=fli
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants