
[SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls #28996

Closed
wants to merge 6 commits

Conversation

@viirya (Member) commented Jul 4, 2020

What changes were proposed in this pull request?

This patch proposes to make unionByName optionally fill missing columns with nulls.

Why are the changes needed?

Currently, unionByName throws an exception when it detects different column names between the two Datasets. This is a strict requirement, and users sometimes want the more flexible behavior of unioning two Datasets that have different subsets of columns, resolved by name.

Does this PR introduce any user-facing change?

Yes. This adds an overload of Dataset.unionByName with a boolean parameter that allows a different set of column names between the two Datasets. Missing columns on each side will be filled with null values.
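
A minimal sketch of the intended usage (the column names and values here are illustrative, not taken from the patch):

```scala
import spark.implicits._  // assumes a SparkSession named `spark` is in scope

val df1 = Seq((1, 2)).toDF("a", "b")
val df2 = Seq((3, 4)).toDF("a", "c")

// df1 lacks "c" and df2 lacks "b"; with allowMissingColumns = true each
// side's missing columns are null-filled instead of raising an AnalysisException.
df1.unionByName(df2, allowMissingColumns = true).show()
// +---+----+----+
// |  a|   b|   c|
// +---+----+----+
// |  1|   2|null|
// |  3|null|   4|
// +---+----+----+
```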

How was this patch tested?

Unit test.

@viirya (Member Author) commented Jul 4, 2020

@maropu (Member) commented Jul 5, 2020

Is this syntactic sugar for df1.unionByName(df2.withColumn("c", lit(null)))? The fused operation does not look like a SQL union, so how about adding a new API if this is useful for users?
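
For reference, a rough sketch of that manual workaround applied on both sides (hypothetical column names; in practice lit(null) may need a cast to match the other side's column type):

```scala
import org.apache.spark.sql.functions.lit

// Null-fill each side's missing column by hand, then union by name.
val unioned = df1.withColumn("c", lit(null))
  .unionByName(df2.withColumn("b", lit(null)))
```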

@viirya (Member Author) commented Jul 5, 2020

unionByName is not a SQL-style union, as the API doc says.

@SparkQA commented Jul 5, 2020

Test build #124924 has started for PR 28996 at commit 6afb8e8.

@maropu (Member) commented Jul 5, 2020

Could you update the API doc, too? If the option is enabled, does the following statement still hold?

To do a SQL-style set union (that does deduplication of elements), use this function followed by a [[distinct]].

@HyukjinKwon (Member) left a comment

Looks good. I don't have a strong feeling about it, given that workarounds are possible rather easily. I will leave it to @marmbrus, given the discussion on the JIRA.

@viirya (Member Author) commented Jul 6, 2020

To do a SQL-style set union (that does deduplication of elements), use this function followed by a [[distinct]].

Read together with the previous sentence, I think the doc means that this API doesn't deduplicate elements. The doc explains that this API resolves columns by name, not by position as union does. This new parameter doesn't change that behavior.
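
For illustration, the by-name vs. by-position distinction the doc is describing (a hypothetical example, not from the patch):

```scala
val a = Seq((1, 2)).toDF("x", "y")
val b = Seq((20, 10)).toDF("y", "x")

a.union(b).collect()        // by position: Row(1, 2), Row(20, 10)
a.unionByName(b).collect()  // by name:     Row(1, 2), Row(10, 20)
```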

@maropu (Member) commented Jul 6, 2020

My bad, my last comment was ambiguous. I meant: how about adding some comments about this new behaviour in the API doc so that users can notice it?

@SparkQA commented Jul 6, 2020

Test build #125022 has finished for PR 28996 at commit 5e4f670.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented Jul 6, 2020

retest this please

@SparkQA commented Jul 6, 2020

Test build #125038 has finished for PR 28996 at commit 5e4f670.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member Author) commented Jul 6, 2020

retest this please

@SparkQA commented Jul 6, 2020

Test build #125098 has finished for PR 28996 at commit 5e4f670.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member Author) commented Jul 6, 2020

retest this please...

@SparkQA commented Jul 6, 2020

Test build #125111 has started for PR 28996 at commit 5e4f670.

@viirya (Member Author) commented Jul 6, 2020

retest this please...

@shaneknapp (Contributor) commented

test this please

@SparkQA commented Jul 7, 2020

Test build #125141 has finished for PR 28996 at commit 5e4f670.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented

retest this please

@SparkQA commented Jul 7, 2020

Test build #125208 has finished for PR 28996 at commit 5e4f670.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member Author) commented Jul 7, 2020

retest this please

@SparkQA commented Jul 7, 2020

Test build #125232 has finished for PR 28996 at commit 5e4f670.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 10, 2020

Test build #125526 has finished for PR 28996 at commit df4e8dc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member Author) commented Jul 10, 2020

retest this please

@SparkQA commented Jul 10, 2020

Test build #125563 has finished for PR 28996 at commit df4e8dc.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member Author) commented Jul 10, 2020

retest this please

@SparkQA commented Jul 10, 2020

Test build #125572 has started for PR 28996 at commit df4e8dc.

* Returns a new Dataset containing union of rows in this Dataset and another Dataset.
*
* This is different from both `UNION ALL` and `UNION DISTINCT` in SQL. To do a SQL-style set
* union (that does deduplication of elements), use this function followed by a [[distinct]].
Contributor commented:

This is not true now.

@viirya (Member Author) replied:

Actually, the original unionByName doc has this section too:

This is different from both UNION ALL and UNION DISTINCT in SQL. To do a SQL-style set
union (that does deduplication of elements), use this function followed by a [[distinct]].

Re-reading this doc, even with the original unionByName behavior, it is a bit confusing to me. Do you think we should remove "To do a SQL-style set union (that does deduplication of elements), use this function followed by a [[distinct]]."?

Contributor commented:

Wait really? When did we change the semantics? What was confusing about that documentation? (it was added because users were confused by the behavior...)

@viirya (Member Author) replied:

Reading "To do a SQL-style set union", it sounds like adding distinct gives you a SQL-style union. But it doesn't behave like a SQL union at all.

Contributor commented:

Seems like we mistakenly copied the doc from union to unionByName.

@cloud-fan (Contributor) commented:

Shall we add the same API to PySpark and R?

@HyukjinKwon (Member) left a comment

The change itself looks good to me. We should add R and Python too, but that could be done separately.

@viirya (Member Author) commented Jul 10, 2020

I'll add Python and R in a follow-up.

@SparkQA commented Jul 10, 2020

Test build #125625 has finished for PR 28996 at commit e2311fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* When the parameter `allowMissingColumns` is true, this function allows different set
* of column names between two Datasets. Missing columns at each side will be filled with
* null values.
*
Member commented:

Could you add an illustrative example like the one at lines 2016 ~ 2029, @viirya?

@viirya (Member Author) replied:

okay.

* resolves columns by name (not by position).
*
* When the parameter `allowMissingColumns` is true, this function allows different set
* of column names between two Datasets. Missing columns at each side will be filled with
Member commented:

It's worth documenting a little more about the ordering sensitivity. Previously it was simple, because the result follows the schema of the original set (= left). With the new option, the missing columns that get appended at the end are determined by the other side (= right).

@viirya (Member Author) replied:

Good advice.

// The result schema follows the left-hand side; the other side's missing
// columns are appended at the end and null-filled.
checkAnswer(df1.unionByName(df2, true),
  Row(1, 2, null) :: Row(3, 5, 4) :: Nil)
checkAnswer(df2.unionByName(df1, true),
  Row(3, 4, 5) :: Row(1, null, 2) :: Nil)
Member commented:

@viirya, can we have both case-sensitive and case-insensitive test coverage?

@viirya (Member Author) replied:

sure.
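
A sketch of what such coverage might look like, assuming the suite's withSQLConf and checkAnswer helpers (column names are illustrative):

```scala
import org.apache.spark.sql.internal.SQLConf

val df1 = Seq((1, 2)).toDF("a", "B")
val df2 = Seq((3, 4)).toDF("A", "b")

// Case-insensitive: "a"/"A" and "B"/"b" match, so nothing is null-filled.
withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") {
  checkAnswer(df1.unionByName(df2, true), Row(1, 2) :: Row(3, 4) :: Nil)
}

// Case-sensitive: all four names are distinct, so each side is null-filled.
withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") {
  checkAnswer(df1.unionByName(df2, true),
    Row(1, 2, null, null) :: Row(null, null, 3, 4) :: Nil)
}
```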

@dongjoon-hyun (Member) left a comment

+1, LGTM (except for two minor comments about test coverage and documentation).
Thanks, @viirya.

@viirya force-pushed the SPARK-29358 branch 2 times, most recently from 23d52e6 to f0bf462, on July 11, 2020 at 17:35.
@dongjoon-hyun (Member) commented:

Merged to master for Apache Spark 3.1.0. Thank you, @viirya and all.
At the last commit, all UTs had already passed.

@SparkQA commented Jul 11, 2020

Test build #125688 has finished for PR 28996 at commit 8734983.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 11, 2020

Test build #125687 has finished for PR 28996 at commit f0bf462.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

throw new AnalysisException(
  s"""Cannot resolve column name "${lattr.name}" among """ +
  s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
if (allowMissingColumns) {
Contributor commented:

Does it work with nested columns?

@viirya (Member Author) replied:

No, currently it doesn't.

@cloud-fan (Contributor) commented Jul 13, 2020:

I think the major problem here is that we put the by-name logic in the API method, not in the Analyzer. Shall we add two boolean parameters (byName and allowMissingCol) to Union, and move the by-name logic to the type coercion rules?
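
A simplified, standalone sketch of that idea (these are stand-ins, not Spark's actual classes; the real refactoring landed later in #29107):

```scala
// Hypothetical plan node carrying the two flags, so that the Analyzer
// (not the Dataset API method) performs the by-name resolution.
sealed trait LogicalPlan
case class Union(
    left: LogicalPlan,
    right: LogicalPlan,
    byName: Boolean = false,
    allowMissingCol: Boolean = false) extends LogicalPlan

// The API method would then only construct the node, e.g.:
//   Union(left = this.logicalPlan, right = other.logicalPlan,
//         byName = true, allowMissingCol = allowMissingColumns)
```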

@viirya (Member Author) replied:

Ok. I will do it in another PR.

Member commented:

@cloud-fan, unionByName (and the by-name logic) has been here since Apache Spark 2.3.0.
Shall we proceed with that refactoring suggestion as a separate JIRA?

Contributor commented:

Yea it's better to have a new JIRA.

Member commented:

Thanks, @cloud-fan.

cloud-fan pushed a commit that referenced this pull request Jul 24, 2020
…API code to analysis phase

### What changes were proposed in this pull request?

Currently the by-name resolution logic of `unionByName` is put in the API code. This patch moves the logic to the analysis phase.
See #28996 (comment).

### Why are the changes needed?

Logically, we should do resolution in the analysis phase. This refactoring cleans up the API method and makes resolution consistent.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests.

Closes #29107 from viirya/move-union-by-name.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
* @group typedrel
* @since 3.1.0
*/
def unionByName(other: Dataset[T], allowMissingColumns: Boolean): Dataset[T] = withSetOperator {
Member commented:

Do we have a JIRA to add the corresponding API for Python?

Member commented:

This is a good beginner task for new contributors.

@viirya (Member Author) replied:

I should create a follow-up PR for Python and R. But it is okay as a beginner task too.

Member commented:

I filed SPARK-32798 and SPARK-32799.

@viirya viirya deleted the SPARK-29358 branch December 27, 2023 18:28