
[SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls #28996

Closed
wants to merge 6 commits

Conversation

@viirya (Member) commented Jul 4, 2020

What changes were proposed in this pull request?

This patch proposes to make unionByName optionally fill missing columns with nulls.

Why are the changes needed?

Currently, unionByName throws an exception when it detects different column names between the two Datasets. This is a strict requirement, and users sometimes want the more flexible behavior of unioning two Datasets that have different subsets of columns, resolved by name.

Does this PR introduce any user-facing change?

Yes. This adds an overload of Dataset.unionByName with a boolean parameter that allows a different set of column names between the two Datasets. Missing columns on each side will be filled with null values.
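
A minimal sketch of the intended usage (the column names and values here are illustrative, not taken from the patch):

```scala
import spark.implicits._  // assumes a SparkSession named `spark` is in scope

val df1 = Seq((1, 2)).toDF("a", "b")
val df2 = Seq((3, 4)).toDF("a", "c")

// df1 lacks "c" and df2 lacks "b"; with allowMissingColumns = true each
// side's missing columns are null-filled instead of raising an AnalysisException.
df1.unionByName(df2, allowMissingColumns = true).show()
// +---+----+----+
// |  a|   b|   c|
// +---+----+----+
// |  1|   2|null|
// |  3|null|   4|
// +---+----+----+
```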

How was this patch tested?

Unit test.

@viirya (Member Author) commented Jul 4, 2020

@maropu (Member) commented Jul 5, 2020

Is this syntactic sugar for df1.unionByName(df2.withColumn("c", lit(null)))? The fused operation does not look like a SQL union, so how about adding a new API if this is useful for users?
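
For reference, a rough sketch of that manual workaround applied on both sides (hypothetical column names; in practice lit(null) may need a cast to match the other side's column type):

```scala
import org.apache.spark.sql.functions.lit

// Null-fill each side's missing column by hand, then union by name.
val unioned = df1.withColumn("c", lit(null))
  .unionByName(df2.withColumn("b", lit(null)))
```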

@viirya (Member Author) commented Jul 5, 2020

unionByName is not a SQL-style union, as the API doc says.

@SparkQA commented Jul 5, 2020

Test build #124924 has started for PR 28996 at commit 6afb8e8.

@maropu (Member) commented Jul 5, 2020

Could you update the API doc, too? If the option is enabled, does the following statement still hold?

To do a SQL-style set union (that does deduplication of elements), use this function followed by a [[distinct]].

@HyukjinKwon (Member) left a comment

Looks good. I don't have a strong feeling about it, given that workarounds are possible rather easily. I will leave it to @marmbrus, given the discussion on the JIRA.

@viirya (Member Author) commented Jul 6, 2020

To do a SQL-style set union (that does deduplication of elements), use this function followed by a [[distinct]].

Read together with the previous sentence, I think the doc means that this API doesn't deduplicate elements. The doc explains that this API resolves columns by name, not by position as union does. This new parameter doesn't change that behavior.
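
For illustration, the by-name vs. by-position distinction the doc is describing (a hypothetical example, not from the patch):

```scala
val a = Seq((1, 2)).toDF("x", "y")
val b = Seq((20, 10)).toDF("y", "x")

a.union(b).collect()        // by position: Row(1, 2), Row(20, 10)
a.unionByName(b).collect()  // by name:     Row(1, 2), Row(10, 20)
```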

@maropu (Member) commented Jul 6, 2020

My bad, my last comment was ambiguous. I meant: how about adding some comments about this new behaviour in the API doc so that users can notice it?

@SparkQA commented Jul 6, 2020

Test build #125022 has finished for PR 28996 at commit 5e4f670.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented Jul 6, 2020

retest this please

@SparkQA commented Jul 6, 2020

Test build #125038 has finished for PR 28996 at commit 5e4f670.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member Author) commented Jul 6, 2020

retest this please

@SparkQA commented Jul 6, 2020

Test build #125098 has finished for PR 28996 at commit 5e4f670.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member Author) commented Jul 6, 2020

retest this please...

@SparkQA commented Jul 6, 2020

Test build #125111 has started for PR 28996 at commit 5e4f670.

@viirya (Member Author) commented Jul 6, 2020

retest this please...

@shaneknapp (Contributor) commented

test this please

@SparkQA commented Jul 7, 2020

Test build #125141 has finished for PR 28996 at commit 5e4f670.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented

retest this please

@SparkQA commented Jul 7, 2020

Test build #125208 has finished for PR 28996 at commit 5e4f670.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member Author) commented Jul 7, 2020

retest this please

@SparkQA commented Jul 7, 2020

Test build #125232 has finished for PR 28996 at commit 5e4f670.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 10, 2020

Test build #125526 has finished for PR 28996 at commit df4e8dc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member Author) commented Jul 10, 2020

retest this please

@SparkQA commented Jul 10, 2020

Test build #125563 has finished for PR 28996 at commit df4e8dc.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member Author) commented Jul 10, 2020

retest this please

@SparkQA commented Jul 10, 2020

Test build #125572 has started for PR 28996 at commit df4e8dc.

* Returns a new Dataset containing union of rows in this Dataset and another Dataset.
*
* This is different from both `UNION ALL` and `UNION DISTINCT` in SQL. To do a SQL-style set
* union (that does deduplication of elements), use this function followed by a [[distinct]].
Contributor commented:

This is not true now.

@viirya (Member Author) replied:

Actually, the original unionByName doc has this section too:

This is different from both UNION ALL and UNION DISTINCT in SQL. To do a SQL-style set
union (that does deduplication of elements), use this function followed by a [[distinct]].

Re-reading this doc, even with the original unionByName behavior, it is a bit confusing to me. Do you think we should remove "To do a SQL-style set union (that does deduplication of elements), use this function followed by a [[distinct]]."?

Contributor commented:

Wait really? When did we change the semantics? What was confusing about that documentation? (it was added because users were confused by the behavior...)

@viirya (Member Author) replied:

Reading "To do a SQL-style set union", it sounds like adding distinct gives you a SQL-style union. But it doesn't behave like a SQL union at all.

Contributor commented:

Seems like we mistakenly copied the doc from union to unionByName.

@cloud-fan (Contributor) commented:

Shall we add the same API to PySpark and R?

@HyukjinKwon (Member) left a comment

The change itself looks good to me. We should add R and Python too, but that could be done separately.

@viirya (Member Author) commented Jul 10, 2020

I'll add Python and R in a follow-up.

@SparkQA commented Jul 10, 2020

Test build #125625 has finished for PR 28996 at commit e2311fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* When the parameter `allowMissingColumns` is true, this function allows different set
* of column names between two Datasets. Missing columns at each side will be filled with
* null values.
*
Member commented:

Could you add an illustrative example like the one at lines 2016 ~ 2029, @viirya?

@viirya (Member Author) replied:

okay.

* resolves columns by name (not by position).
*
* When the parameter `allowMissingColumns` is true, this function allows different set
* of column names between two Datasets. Missing columns at each side will be filled with
Member commented:

It's worth documenting a little more about the ordering sensitivity. Previously it was simple, because the result follows the schema of the original set (= left). With the new option, the missing columns that get appended at the end are determined by the other side (= right).

@viirya (Member Author) replied:

Good advice.

// The result schema follows the left-hand side; the other side's missing
// columns are appended at the end and null-filled.
checkAnswer(df1.unionByName(df2, true),
  Row(1, 2, null) :: Row(3, 5, 4) :: Nil)
checkAnswer(df2.unionByName(df1, true),
  Row(3, 4, 5) :: Row(1, null, 2) :: Nil)
Member commented:

@viirya, can we have both case-sensitive and case-insensitive test coverage?

@viirya (Member Author) replied:

sure.
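
A sketch of what such coverage might look like, assuming the suite's withSQLConf and checkAnswer helpers (column names are illustrative):

```scala
import org.apache.spark.sql.internal.SQLConf

val df1 = Seq((1, 2)).toDF("a", "B")
val df2 = Seq((3, 4)).toDF("A", "b")

// Case-insensitive: "a"/"A" and "B"/"b" match, so nothing is null-filled.
withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") {
  checkAnswer(df1.unionByName(df2, true), Row(1, 2) :: Row(3, 4) :: Nil)
}

// Case-sensitive: all four names are distinct, so each side is null-filled.
withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") {
  checkAnswer(df1.unionByName(df2, true),
    Row(1, 2, null, null) :: Row(null, null, 3, 4) :: Nil)
}
```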

@dongjoon-hyun (Member) left a comment

+1, LGTM (except for two minor comments about test coverage and documentation).
Thanks, @viirya.

@viirya force-pushed the SPARK-29358 branch 2 times, most recently from 23d52e6 to f0bf462, on July 11, 2020 at 17:35.
@dongjoon-hyun (Member) commented:

Merged to master for Apache Spark 3.1.0. Thank you, @viirya and all.
At the last commit, all UTs had already passed.

@SparkQA commented Jul 11, 2020

Test build #125688 has finished for PR 28996 at commit 8734983.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 11, 2020

Test build #125687 has finished for PR 28996 at commit f0bf462.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

throw new AnalysisException(
  s"""Cannot resolve column name "${lattr.name}" among """ +
  s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
if (allowMissingColumns) {
Contributor commented:

Does it work with nested columns?

@viirya (Member Author) replied:

No, currently it doesn't.

@cloud-fan (Contributor) commented Jul 13, 2020:

I think the major problem here is that we put the by-name logic in the API method, not in the Analyzer. Shall we add two boolean parameters (byName and allowMissingCol) to Union, and move the by-name logic to the type coercion rules?
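
A simplified, standalone sketch of that idea (these are stand-ins, not Spark's actual classes; the real refactoring landed later in #29107):

```scala
// Hypothetical plan node carrying the two flags, so that the Analyzer
// (not the Dataset API method) performs the by-name resolution.
sealed trait LogicalPlan
case class Union(
    left: LogicalPlan,
    right: LogicalPlan,
    byName: Boolean = false,
    allowMissingCol: Boolean = false) extends LogicalPlan

// The API method would then only construct the node, e.g.:
//   Union(left = this.logicalPlan, right = other.logicalPlan,
//         byName = true, allowMissingCol = allowMissingColumns)
```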

@viirya (Member Author) replied:

Ok. I will do it in another PR.

Member commented:

@cloud-fan, unionByName (and the by-name logic) has been here since Apache Spark 2.3.0.
Shall we proceed with that refactoring suggestion as a separate JIRA?

Contributor commented:

Yea it's better to have a new JIRA.

Member commented:

Thanks, @cloud-fan.

cloud-fan pushed a commit that referenced this pull request Jul 24, 2020
…API code to analysis phase

### What changes were proposed in this pull request?

Currently the by-name resolution logic of `unionByName` is put in the API code. This patch moves the logic to the analysis phase.
See #28996 (comment).

### Why are the changes needed?

Logically, we should do resolution in the analysis phase. This refactoring cleans up the API method and makes resolution consistent.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests.

Closes #29107 from viirya/move-union-by-name.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
* @group typedrel
* @since 3.1.0
*/
def unionByName(other: Dataset[T], allowMissingColumns: Boolean): Dataset[T] = withSetOperator {
Member commented:

Do we have a JIRA to add the corresponding API for Python?

Member commented:

This is a good beginner task for new contributors.

@viirya (Member Author) replied:

I should create a follow-up PR for Python and R. But it is okay as a beginner task too.

Member commented:

I filed SPARK-32798 and SPARK-32799.

@viirya viirya deleted the SPARK-29358 branch December 27, 2023 18:28