[SPARK-32376][SQL] Make unionByName null-filling behavior work with struct columns #29587
Conversation
Test build #128065 has finished for PR 29587 at commit
GitHub Actions passed. I think it should be ready for review, but I will still add some comments to the code later.
Test build #128069 has finished for PR 29587 at commit
}

private def addFieldsInto(col: Expression, base: String, fields: Seq[StructField]): Expression = {
  var currCol = col
To remove `var` here, could we use `fields.foldLeft(col) { case (currCol, f) =>` instead?
looks good. rewritten. thanks.
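For readers following the review, the suggested rewrite is the standard `var`-to-`foldLeft` refactor. A minimal, self-contained sketch (using a hypothetical `Field` type and string building instead of the actual Spark `Expression`/`StructField` API):

```scala
object FoldLeftSketch {
  case class Field(name: String)

  // Before: mutable accumulation with var.
  def addFieldsIntoVar(col: String, fields: Seq[Field]): String = {
    var currCol = col
    fields.foreach { f => currCol = s"withField($currCol, ${f.name})" }
    currCol
  }

  // After: the same accumulation expressed with foldLeft, no var.
  def addFieldsIntoFold(col: String, fields: Seq[Field]): String =
    fields.foldLeft(col) { case (currCol, f) => s"withField($currCol, ${f.name})" }

  def main(args: Array[String]): Unit = {
    val fields = Seq(Field("a"), Field("b"))
    // Both forms produce the same nested result.
    assert(addFieldsIntoVar("col", fields) == addFieldsIntoFold("col", fields))
    println(addFieldsIntoFold("col", fields)) // withField(withField(col, a), b)
  }
}
```

The fold threads the accumulated expression through each field, which is exactly what the mutable loop did, but without exposing the mutable state.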
 * `col` expression.
 */
private def addFields(col: NamedExpression, target: StructType): Option[Expression] = {
  require(col.dataType.isInstanceOf[StructType], "Only support StructType.")
`assert` instead?
 * Returns a `StructType` that contains missing fields recursively from `source` to `target`.
 * Note that this doesn't support looking into array type and map type recursively.
 */
def findMissingFields(source: StructType, target: StructType, resolver: Resolver): StructType = {
We need to define this method on the `StructType` side instead of as a private method in `ResolveUnion`?
I feel this is a more general method related to `StructType`, so I'm putting it in `StructType`.
Okay, one nit: a return type of `Option[StructType]` is more natural for `findXXX` methods, just like the Scala collections (e.g., `Seq.find`)?
OK, it sounds good.
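The convention being agreed on here is the `Seq.find`-style contract: return `None` when nothing is found rather than an empty result. A simplified, non-Spark sketch of what the `Option[StructType]`-returning `findMissingFields` looks like, modeling schemas as plain field-name lists (a hypothetical stand-in; the real method also recurses into nested structs):

```scala
object FindMissingSketch {
  // Returns Some(missing fields) when `source` lacks fields that `target`
  // has, and None when nothing is missing, mirroring Seq.find's contract.
  def findMissingFields(source: Seq[String], target: Seq[String]): Option[Seq[String]] = {
    val missing = target.filterNot(source.contains)
    if (missing.isEmpty) None else Some(missing)
  }

  def main(args: Array[String]): Unit = {
    assert(findMissingFields(Seq("a"), Seq("a", "b")) == Some(Seq("b")))
    // Identical schemas: nothing missing, so None rather than Some(Seq()).
    assert(findMissingFields(Seq("a", "b"), Seq("a", "b")) == None)
  }
}
```

Callers can then use `map`/`getOrElse` on the result instead of checking for an empty struct.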
/**
 * Returns a `StructType` that contains missing fields recursively from `source` to `target`.
 * Note that this doesn't support looking into array type and map type recursively.
Where does this limitation come from? Do we not need to support this case, or is supporting it technically difficult? Ah, I see. This is an unsupported case, right?
https://github.com/apache/spark/pull/29587/files#diff-4d656d696512d6bcb03a48f7e0af6251R106-R107
I leverage `WithFields` to add missing nested fields into structs. `WithFields` doesn't support array or map types currently.
}.getOrElse(found.get) // Data type doesn't change. We should add fields at other side.
case _ =>
  // Same struct type, or
  // unsupported: different types, array or map types, or
TODO work?
Array and map types aren't supported by `WithFields`. I think it is still possible to add them to `WithFields`. Once `WithFields` supports these types, we can add them here too.
Ah, ok.
@@ -103,4 +104,30 @@ class StructTypeSuite extends SparkFunSuite {
  val interval = "`a` INTERVAL"
  assert(fromDDL(interval).toDDL === interval)
}

test("find missing (nested) fields") {
Could you test the behaviours of unsupported cases (array and map), too?
OK, sure.
@@ -546,7 +547,8 @@ case class StringToMap(text: Expression, pairDelim: Expression, keyValueDelim: E
case class WithFields(
    structExpr: Expression,
    names: Seq[String],
    valExprs: Seq[Expression]) extends Unevaluable {
    valExprs: Seq[Expression],
    sortColumns: Boolean = false) extends Unevaluable {
How about defining `toString` so this value is not displayed in the explain output?
nit: `sortColumns` -> `sortOutputColumns`?
Changed `sortColumns` to `sortOutputColumns`. I'm not sure we want to hide `sortColumns`?
I thought this param is not related to the `withField` operation, so I left the comment. But either is okay (just a suggestion).
scala> val df = sql("SELECT named_struct('a', 1, 'b', 2) struct_col")
scala> df.select($"struct_col".withField("c", lit(3))).explain(true)
== Analyzed Logical Plan ==
with_fields(struct_col, 3): struct<a:int,b:int,c:int>
Project [with_fields(struct_col#0, c, 3, false) AS with_fields(struct_col, 3)#4]
^^^^^
+- Project [named_struct(a, 1, b, 2) AS struct_col#0]
+- OneRowRelation
I'm not certain if we want to show it or not. Let's keep it as is and see what others think.
Test build #128125 has finished for PR 29587 at commit

Test build #128185 has finished for PR 29587 at commit

retest this please

Test build #128211 has finished for PR 29587 at commit
}

/**
 * Adds missing fields recursively into given `col` expression. The missing fields are given
nit: unnecessary spaces found at the beginning of the sentences
ok.
  Project(leftOutputAttrs ++ missingAttrs, left)
// Add missing (nested) fields to left plan.
val (leftProjectList, _) = compareAndAddFields(rightChild, left, allowMissingCol)
if (leftProjectList.map(_.toAttribute) != left.output) {
nit:
if (leftProjectList.length != left.output.length ||
leftProjectList.map(_.toAttribute) != left.output) {
?
Doesn't `leftProjectList.map(_.toAttribute) != left.output` already cover `leftProjectList.length != left.output.length`?
Test build #128233 has finished for PR 29587 at commit

retest this please

Test build #128262 has finished for PR 29587 at commit

@maropu Any more comments? cc @cloud-fan @HyukjinKwon
// Builds a project list for `right` based on `left` output names
val (rightProjectList, aliased) = compareAndAddFields(left, right, allowMissingCol)
nit: remove this unnecessary blank line.
val foundDt = found.get.dataType
(foundDt, lattr.dataType) match {
  case (source: StructType, target: StructType)
      if allowMissingCol && !source.sameType(target) =>
To make the logic simpler, could we filter out all the unsupported cases (e.g., nested struct in `array`) here? Like this:
case (source: StructType, target: StructType)
    if allowMissingCol && canMergeSchemas(source, target) =>
Hmm, I'm not sure where we can simplify the logic. By adding `canMergeSchemas`, doesn't it look more complicated?
I read the comment and at first thought that all the unsupported cases are handled in lines 108-112. But does `addFields` returning `None` also mean an unsupported case? This might be an issue that can be fixed just by improving the comments.
I see. I will add more comments explaining this.
val missing1 = StructType.fromDDL(
  "c2 STRUCT<c3: INT, c4: STRUCT<c5: INT, c6: INT>>")
assert(StructType.findMissingFields(source1, schema, resolver)
  .map(_.sameType(missing1)).getOrElse(false))
nit: could we use `.exists()` instead?
test("find missing (nested) fields") {
  val schema = StructType.fromDDL(
    "c1 INT, c2 STRUCT<c3: INT, c4: STRUCT<c5: INT, c6: INT>>")
  val resolver = SQLConf.get.resolver
Could you add some tests for case-sensitivity?
yes, added.
@@ -536,4 +536,71 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
    assert(union2.schema.fieldNames === Array("a", "B", "C", "c"))
  }
}
Could you add some tests for case-sensitivity here, too?
Added.
@@ -536,4 +536,71 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession {
    assert(union2.schema.fieldNames === Array("a", "B", "C", "c"))
  }
}

test("SPARK-32376: Make unionByName null-filling behavior work with struct columns - 1") {
Could you make the test titles clearer? e.g., `- 1` -> `simple tests`, `- 2` -> `nested cases`
sure.
/**
 * Resolves different children of Union to a common set of columns.
 */
object ResolveUnion extends Rule[LogicalPlan] {
  private def unionTwoSides(
/**
 * This method sorts recursively columns in a struct expression based on column names.
`sorts recursively columns` -> `sorts columns recursively`
def simplifyWithFields(expr: Expression): Expression = {
  expr.transformUp {
    case UpdateFields(UpdateFields(struct, fieldOps1), fieldOps2) =>
Is it important to have this optimization inside this analyzer rule?
Yea. Without optimizing the expressions, we cannot scale well for a deeply nested schema, e.g. the added test `SPARK-32376: Make unionByName null-filling behavior work with struct columns - deep expr` in `DataFrameSetOperationsSuite`.
Actually, I plan to move this optimization out of `ResolveUnion` into a separate analyzer rule in #29812. For a complex, deeply nested schema, it is easy to build an inefficient expression tree that is very slow in the analysis phase. Without this optimization, the test case in this PR cannot be evaluated at all; after adding it, the query evaluates normally.
// We need to sort columns in result, because we might add another column in other side.
// E.g., we want to union two structs "a int, b long" and "a int, c string".
// If we don't sort, we will have "a int, b long, c string" and
// "a int, c string, b long", which are not compatible.
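The compatibility problem described in that comment can be sketched in a few lines of pure Scala, with field-name lists standing in for struct schemas (this is an illustration of the idea, not the actual `ResolveUnion` code):

```scala
object SortColumnsSketch {
  // Each side appends the fields it is missing from the other side; without
  // sorting, the two sides end up with different field orders and the union
  // would be incompatible.
  def mergeFields(left: Seq[String], right: Seq[String]): (Seq[String], Seq[String]) = {
    val leftFilled = left ++ right.filterNot(left.contains)    // a, b, c
    val rightFilled = right ++ left.filterNot(right.contains)  // a, c, b
    (leftFilled.sorted, rightFilled.sorted)
  }

  def main(args: Array[String]): Unit = {
    val (l, r) = mergeFields(Seq("a", "b"), Seq("a", "c"))
    assert(l == r)                  // compatible only after sorting
    assert(l == Seq("a", "b", "c"))
  }
}
```

Sorting both sides gives a single canonical field order, so the two struct types line up for the union.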
is this behavior consistent with top-level columns?
I think this is related to the comment: #29587 (comment)
assert(col.dataType.isInstanceOf[StructType], "Only support StructType.")

val resolver = SQLConf.get.resolver
val missingFields =
The name is a bit misleading; I thought it was a `Seq`. How about `missingFieldsOpt`?
Good. Fixed.
    addFieldsInto(ExtractValue(currCol, Literal(field.name), resolver), st.fields))
}
case dt =>
  UpdateFields(currCol, field.name, Literal(null, dt))
What if `byName` is true but `allowMissingCol` is false?
If `allowMissingCol` is false, we don't compare and add top-level/nested columns. If the two sides have inconsistent schemas, the union doesn't pass analysis.
The top-level columns support `byName` and `allowMissingCol` individually; shall we do it for nested columns as well? Or do we plan to do it in a followup?
Good question. `byName` support actually means we need to adjust columns between the two sides to have a consistent schema; it could be the top-level or nested column case.
So it is actually the same issue as #29587 (comment), a.k.a. adjusting the nested columns to have a more natural schema. As replied in the discussion, I plan to do it in a followup.
@@ -2740,6 +2740,19 @@ object SQLConf {
  .booleanConf
  .createWithDefault(false)

val UNION_BYNAME_STRUCT_SUPPORT_ENABLED =
I'm not sure this is worth a config. It's natural that the `byName` and `allowMissingCol` flags should apply to nested columns, and these two flags were newly added in the master branch, so there are no backward compatibility issues.
Ok, sounds more correct. I will remove this config.
Kubernetes integration test starting
Kubernetes integration test status success
Test build #129770 has finished for PR 29587 at commit
I will wait for a while for further comments. If no more comments from others, I will merge this and work on the followup. Thanks!
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #129903 has finished for PR 29587 at commit

Thanks for reviewing. Merged to master.

Test build #129909 has finished for PR 29587 at commit

Test build #129914 has finished for PR 29587 at commit
What changes were proposed in this pull request?
SPARK-29358 added support for `unionByName` to work when the two datasets don't necessarily have the same schema, but it does not work with nested columns like structs. This patch adds support for struct columns.
The behavior before this PR:
The behavior after this PR:
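The before/after snippets from the original description were lost in extraction. As a rough illustration of the intended semantics, here is a pure-Scala model (a hypothetical helper using maps as stand-ins for struct values, not Spark code): with this change, `unionByName(allowMissingColumns = true)` widens each side's struct column to the union of both field sets and fills the missing fields with null, instead of failing.

```scala
object UnionByNameSketch {
  type Struct = Map[String, Any]

  // Widen every struct to the union of both schemas (sorted field order,
  // as the PR does for struct columns), null-filling missing fields.
  def unionByName(left: Seq[Struct], right: Seq[Struct]): Seq[Struct] = {
    val allFields = (left ++ right).flatMap(_.keys).distinct.sorted
    def fill(s: Struct): Struct = allFields.map(f => f -> s.getOrElse(f, null)).toMap
    (left ++ right).map(fill)
  }

  def main(args: Array[String]): Unit = {
    val df1 = Seq(Map[String, Any]("a" -> 1, "b" -> 2L))   // struct<a,b>
    val df2 = Seq(Map[String, Any]("a" -> 3, "c" -> "x"))  // struct<a,c>
    val out = unionByName(df1, df2)
    assert(out.head == Map("a" -> 1, "b" -> 2L, "c" -> null))
    assert(out(1) == Map("a" -> 3, "b" -> null, "c" -> "x"))
  }
}
```

Before this PR, the analogous Spark query failed analysis because the two struct types did not match; the sketch shows the merged, null-filled schema both sides now share.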
Why are the changes needed?
The `allowMissingColumns` option of `unionByName` allows merging two datasets with different schemas when unioning them. Nested column support makes the feature more general and flexible to use.
Does this PR introduce any user-facing change?
Yes, after this change users can union two datasets whose schemas differ in their struct columns.
How was this patch tested?
Unit tests.