[SPARK-33136][SQL] Fix mistakenly swapped parameter in V2WriteCommand.outputResolved #30033
Conversation
Btw, that was hard to debug and required me to dig into Spark test code, as the error message tells us nothing in the case where all columns are matched. (In other words, the plan is considered unresolved due to type incompatibility on write within the same column.) We can't get any information about why the operator is considered unresolved, even with TRACE logging turned on.
@@ -45,7 +45,7 @@ trait V2WriteCommand extends Command {
       case (inAttr, outAttr) =>
         // names and types must match, nullability must be compatible
         inAttr.name == outAttr.name &&
-          DataType.equalsIgnoreCompatibleNullability(outAttr.dataType, inAttr.dataType) &&
+          DataType.equalsIgnoreCompatibleNullability(inAttr.dataType, outAttr.dataType) &&
Hi, @HeartSaVioR . The original code looks the same as branch-2.4, but the issue is reported against 3.0.0+. Could you confirm whether this is a 3.0.0-only issue or not?
override lazy val resolved: Boolean = {
table.resolved && query.resolved && query.output.size == table.output.size &&
query.output.zip(table.output).forall {
case (inAttr, outAttr) =>
// names and types must match, nullability must be compatible
inAttr.name == outAttr.name &&
DataType.equalsIgnoreCompatibleNullability(outAttr.dataType, inAttr.dataType) &&
(outAttr.nullable || !inAttr.nullable)
}
}
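The separate `(outAttr.nullable || !inAttr.nullable)` clause encodes the top-level write-side nullability rule: a write is allowed as long as the table column is at least as permissive as the incoming column. A minimal Python sketch of just that clause (illustrative only, not Spark API):

```python
def nullability_compatible(in_nullable: bool, out_nullable: bool) -> bool:
    """Writing is allowed unless a nullable input targets a non-nullable column."""
    return out_nullable or not in_nullable

# All four combinations: only nullable -> non-nullable is rejected.
cases = {
    (False, False): True,   # non-nullable -> non-nullable: ok
    (False, True):  True,   # non-nullable -> nullable: ok (widening)
    (True,  True):  True,   # nullable -> nullable: ok
    (True,  False): False,  # nullable -> non-nullable: would break the guarantee
}
for (in_n, out_n), expected in cases.items():
    assert nullability_compatible(in_n, out_n) == expected
```

Note this top-level clause is correct as written; the swap only matters for nullability nested inside complex types, which is handled by `equalsIgnoreCompatibleNullability`.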
Ah, thanks for noticing. Nice finding. I found it in V2WriteCommand, so I thought it was added later. I'll check the code path in branch-2.4 and test it.
The usage of AppendData was reverted in b6e4aca for branch-2.4 and shipped in Spark 2.4.0. So while the code in AppendData for branch-2.4 is broken as well, it's dead code.
We seem to have three options: 1) revert the remaining part of AppendData in branch-2.4, 2) fix the code but leave it dead, 3) leave it as it is. What's our preference?
I prefer (2) because there is a downstream project using it in their fork.
@@ -101,6 +100,86 @@ class DataFrameWriterV2Suite extends QueryTest with SharedSparkSession with Befo
   assert(v2.catalog.exists(_ == catalogPlugin))
 }

+case class FakeV2WriteCommand(table: NamedRelation, query: LogicalPlan) extends V2WriteCommand
+
+test("SPARK-33136 output resolved on complex types for V2 write commands") {
Does this bug only affect complex types?
Yes, according to the implementation of equalsIgnoreCompatibleNullability.
private[sql] def equalsIgnoreCompatibleNullability(from: DataType, to: DataType): Boolean = {
(from, to) match {
case (ArrayType(fromElement, fn), ArrayType(toElement, tn)) =>
(tn || !fn) && equalsIgnoreCompatibleNullability(fromElement, toElement)
case (MapType(fromKey, fromValue, fn), MapType(toKey, toValue, tn)) =>
(tn || !fn) &&
equalsIgnoreCompatibleNullability(fromKey, toKey) &&
equalsIgnoreCompatibleNullability(fromValue, toValue)
case (StructType(fromFields), StructType(toFields)) =>
fromFields.length == toFields.length &&
fromFields.zip(toFields).forall { case (fromField, toField) =>
fromField.name == toField.name &&
(toField.nullable || !fromField.nullable) &&
equalsIgnoreCompatibleNullability(fromField.dataType, toField.dataType)
}
case (fromDataType, toDataType) => fromDataType == toDataType
}
}
For primitive types the order doesn't affect the result. outputResolved itself does the right comparison, except for the swapped parameters.
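To see why only complex types are affected, here is a small Python model of the check (illustrative only — it models array types as `("array", element, contains_null)` tuples and primitive types as plain strings, which is not Spark's actual representation):

```python
def equals_ignore_compatible_nullability(frm, to):
    """Model of the Scala method: nested nullability may only widen from -> to."""
    if isinstance(frm, tuple) and isinstance(to, tuple) and frm[0] == to[0] == "array":
        _, from_elem, fn = frm
        _, to_elem, tn = to
        # Element nullability may widen (tn or not fn), elements compared recursively.
        return (tn or not fn) and equals_ignore_compatible_nullability(from_elem, to_elem)
    # Primitive types: plain equality, which is symmetric in its arguments.
    return frm == to

non_nullable_arr = ("array", "int", False)  # incoming data: elements never null
nullable_arr = ("array", "int", True)       # table column: elements may be null

# Correct order (in -> out): non-nullable elements into a nullable column is fine.
assert equals_ignore_compatible_nullability(non_nullable_arr, nullable_arr)
# Swapped order (the bug): the same write is wrongly reported as incompatible.
assert not equals_ignore_compatible_nullability(nullable_arr, non_nullable_arr)
# Primitive comparison is symmetric, so the swap goes unnoticed there.
assert equals_ignore_compatible_nullability("int", "int")
```

Since only the ArrayType/MapType/StructType branches consult nullability directionally, swapping the arguments only changes the outcome when nested nullability differs inside a complex type.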
This is a good point.
Yeah. It looks a bit tricky to generalize, as any operator can make an arbitrary decision on resolvability. Probably the operator needs some method to provide additional information on why it is unresolved.
+1, LGTM. Thank you, @HeartSaVioR and @cloud-fan .
Merged to master/3.0.
….outputResolved

### What changes were proposed in this pull request?

This PR proposes to fix a bug on calling `DataType.equalsIgnoreCompatibleNullability` with mistakenly swapped parameters in `V2WriteCommand.outputResolved`. The order of parameters for `DataType.equalsIgnoreCompatibleNullability` is `from` and `to`, which means the right order of matching variables is `inAttr` and `outAttr`.

### Why are the changes needed?

Spark throws an AnalysisException due to an unresolved operator in v2 write, while the operator is unresolved due to a bug: the parameters passed to `DataType.equalsIgnoreCompatibleNullability` in `outputResolved` have been swapped.

### Does this PR introduce _any_ user-facing change?

Yes, end users no longer hit an unresolved operator in v2 write when they write a dataframe containing non-nullable complex types to a table declaring the matching complex types as nullable.

### How was this patch tested?

New UT added.

Closes #30033 from HeartSaVioR/SPARK-33136.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 8e5cb1d)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Hi, @HeartSaVioR .
#30043 for branch-2.4
Thank you, @HeartSaVioR !
Forgot to say, thanks all for reviewing and merging!