[SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls #28996
Conversation
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
Is this syntax sugar for …?
Test build #124924 has started for PR 28996 at commit
Could you update the API doc, too? If the option is enabled, the following statement no longer holds, does it?
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
Looks good. I didn't have a strong feeling about it given that workarounds are possible rather easily. I will leave it to @marmbrus given the discussion at the JIRA.
Read together with the previous sentence, I think the doc means that this API doesn't deduplicate elements. The doc explains that this API resolves columns by name, not by position like `union`.
My bad, my last comment was ambiguous; I meant: how about adding a note about this new behaviour in the API doc so that users can notice it?
Test build #125022 has finished for PR 28996 at commit
retest this please
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Test build #125038 has finished for PR 28996 at commit
retest this please
Test build #125098 has finished for PR 28996 at commit
retest this please...
Test build #125111 has started for PR 28996 at commit
retest this please...
test this please |
Test build #125141 has finished for PR 28996 at commit
retest this please
Test build #125208 has finished for PR 28996 at commit
retest this please
Test build #125232 has finished for PR 28996 at commit
Test build #125526 has finished for PR 28996 at commit
retest this please
Test build #125563 has finished for PR 28996 at commit
retest this please
Test build #125572 has started for PR 28996 at commit
* Returns a new Dataset containing union of rows in this Dataset and another Dataset.
*
* This is different from both `UNION ALL` and `UNION DISTINCT` in SQL. To do a SQL-style set
* union (that does deduplication of elements), use this function followed by a [[distinct]].
this is not true now.
Actually in the original `unionByName`, its doc has this section too:

> This is different from both `UNION ALL` and `UNION DISTINCT` in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a [[distinct]].

Re-reading this doc, even with the original `unionByName` behavior, it is a bit confusing to me. Do you think we should remove "To do a SQL-style set union (that does deduplication of elements), use this function followed by a [[distinct]]."?
Wait really? When did we change the semantics? What was confusing about that documentation? (it was added because users were confused by the behavior...)
When I read "To do a SQL-style set union", it sounds like if you add `distinct`, you will get a SQL-style union. But it doesn't behave like SQL union at all.
Seems like we mistakenly copied the doc from `union` to `unionByName`.
Shall we add the same API to PySpark and R?
The change itself looks good to me. We should add R and Python too, but that could be done separately.
I'll add Python and R in a follow-up.
Test build #125625 has finished for PR 28996 at commit
* When the parameter `allowMissingColumns` is true, this function allows a different set
* of column names between the two Datasets. Missing columns at each side will be filled
* with null values.
*
Could you add an illustrative example like lines 2016 ~ 2029, @viirya?
okay.
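For readers following the thread, here is a rough plain-Python sketch of the semantics under discussion — this is not Spark code, and the function and variable names are hypothetical stand-ins for the Scala API. With the missing-columns option enabled, a column present on only one side is filled with null (`None` here) on the other side:

```python
def union_by_name(left_rows, right_rows, allow_missing_columns=False):
    """Union two lists of dict 'rows' by column name (a toy model of the API).

    With allow_missing_columns=True, columns present on only one side
    are filled with None (standing in for Spark's null) on the other.
    """
    left_cols = list(left_rows[0]) if left_rows else []
    right_cols = list(right_rows[0]) if right_rows else []
    if not allow_missing_columns and set(left_cols) != set(right_cols):
        # Mirrors the strict behavior: differing column sets are an error.
        raise ValueError("Cannot resolve differing column sets")
    # Result schema: left's columns first, then right-only columns appended.
    schema = left_cols + [c for c in right_cols if c not in left_cols]
    return [{c: row.get(c) for c in schema} for row in left_rows + right_rows]

df1 = [{"a": 1, "b": 2}]   # hypothetical stand-in for a Dataset with columns a, b
df2 = [{"a": 3, "c": 4}]   # hypothetical stand-in for a Dataset with columns a, c
result = union_by_name(df1, df2, allow_missing_columns=True)
# -> [{"a": 1, "b": 2, "c": None}, {"a": 3, "b": None, "c": 4}]
```

Without the flag, the same call raises, matching the pre-existing strict behavior.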
* resolves columns by name (not by position).
*
* When the parameter `allowMissingColumns` is true, this function allows a different set
* of column names between the two Datasets. Missing columns at each side will be filled
It's worth documenting a little more about the order sensitivity. Previously it was simple because the result follows the schema of the original set (= left). With the new option, the missing columns that are added at the end are determined by `other` (= right).
Good advice.
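The order sensitivity discussed above can be sketched in a few lines of plain Python (not Spark; the column names are hypothetical): the result keeps the left Dataset's column order and appends the right side's extra columns at the end, so swapping the operands reorders the result schema.

```python
def result_schema(left_cols, right_cols):
    # Left schema first, then right-only columns appended in their order.
    return left_cols + [c for c in right_cols if c not in left_cols]

df1_cols = ["a", "b"]   # hypothetical: df1 has columns a, b
df2_cols = ["a", "c"]   # hypothetical: df2 has columns a, c

print(result_schema(df1_cols, df2_cols))  # ['a', 'b', 'c']
print(result_schema(df2_cols, df1_cols))  # ['a', 'c', 'b']
```

This is why the two `checkAnswer` cases in the test below expect rows with different column layouts depending on which side is on the left.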
checkAnswer(df1.unionByName(df2, true),
  Row(1, 2, null) :: Row(3, 5, 4) :: Nil)
checkAnswer(df2.unionByName(df1, true),
  Row(3, 4, 5) :: Row(1, null, 2) :: Nil)
@viirya, can we have both case-sensitive and case-insensitive test coverage?
sure.
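To illustrate why both coverages matter, here is a hedged plain-Python sketch (not Spark) of how case sensitivity interacts with name resolution. Under Spark's `spark.sql.caseSensitive=false` default, names like `ab` and `AB` resolve to the same column; with it set to true they are distinct, so the other side's column counts as missing and gets null-filled:

```python
def resolve(name, columns, case_sensitive):
    """Toy name resolver: return the matching column, or None if unresolved."""
    key = name if case_sensitive else name.lower()
    for c in columns:
        if (c if case_sensitive else c.lower()) == key:
            return c
    # Unresolved: with allowMissingColumns this column would be null-filled.
    return None

cols = ["ab", "c"]  # hypothetical column names on one side
print(resolve("AB", cols, case_sensitive=False))  # 'ab' -- matched
print(resolve("AB", cols, case_sensitive=True))   # None -- treated as missing
```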
+1, LGTM (except two minor comments about the test coverage and documentation).
Thanks, @viirya .
Force-pushed from 23d52e6 to f0bf462
Merged to master for Apache Spark 3.1.0. Thank you, @viirya and all.
Test build #125688 has finished for PR 28996 at commit
Test build #125687 has finished for PR 28996 at commit
throw new AnalysisException(
  s"""Cannot resolve column name "${lattr.name}" among """ +
  s"""(${rightOutputAttrs.map(_.name).mkString(", ")})""")
if (allowMissingColumns) {
Does it work with nested columns?
No, currently it doesn't.
I think the major problem here is that we put the by-name logic in the API method, not in the `Analyzer`. Shall we add two boolean parameters (`byName` and `allowMissingCol`) to `Union`, and move the by-name logic to the type coercion rules?
Ok. I will do it in another PR.
@cloud-fan, `unionByName` (and the by-name logic) has been here since Apache Spark 2.3.0. Shall we proceed with that refactoring suggestion as a separate JIRA?
Yea it's better to have a new JIRA.
Thanks, @cloud-fan .
…API code to analysis phase

### What changes were proposed in this pull request?
Currently the by-name resolution logic of `unionByName` is put in API code. This patch moves the logic to the analysis phase. See #28996 (comment).

### Why are the changes needed?
Logically we should do resolution in the analysis phase. This refactoring cleans up the API method and makes resolution consistent.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests.

Closes #29107 from viirya/move-union-by-name.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
* @group typedrel
* @since 3.1.0
*/
def unionByName(other: Dataset[T], allowMissingColumns: Boolean): Dataset[T] = withSetOperator {
Do we have a JIRA to add the corresponding API for Python?
This is a good beginner task for new contributors.
I should create a follow-up PR for Python and R. But it is okay as a beginner task too.
I filed SPARK-32798 and SPARK-32799.
What changes were proposed in this pull request?

This patch proposes to make `unionByName` optionally fill missing columns with nulls.

Why are the changes needed?

Currently, `unionByName` throws an exception if it detects different column names between the two Datasets. This is a strict requirement, and sometimes users need the more flexible behavior where two Datasets with different subsets of columns can be unioned by name resolution.

Does this PR introduce any user-facing change?

Yes. This adds an overload of `Dataset.unionByName` with a boolean parameter that allows a different set of column names between the two Datasets. Missing columns at each side will be filled with null values.

How was this patch tested?

Unit test.