[SPARK-29890][SQL][2.4] DataFrameNaFunctions.fill should handle duplicate columns #27407
Closed
imback82 wants to merge 1 commit into apache:branch-2.4 from
Conversation
Member
Thank you so much, @imback82.
Member
Also, cc @gatorsmile
Contributor
Author
Do you think we need to backport #26700, which fixes a very similar issue?
Member
I switched the type of SPARK-30065 to
dongjoon-hyun (Member) approved these changes on Jan 30, 2020 and left a comment:
+1, LGTM. (Pending Jenkins.)
I tested the changed part and the new test cases locally.
Thank you, @imback82!
Test build #117587 has finished for PR 27407 at commit
Member
Merged to branch-2.4.
dongjoon-hyun pushed a commit that referenced this pull request on Jan 31, 2020
…cate columns (Backport of #26593)

### What changes were proposed in this pull request?

`DataFrameNaFunctions.fill` doesn't handle duplicate columns even when column names are not specified.

```Scala
val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2")
val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2")
val df = left.join(right, Seq("col1"))
df.printSchema
df.na.fill("hello").show
```

produces

```
root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col2: string (nullable = true)

org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could be: col2, col2.;
  at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:259)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:221)
  at org.apache.spark.sql.Dataset.col(Dataset.scala:1268)
```

The failure occurs because columns are looked up with `Dataset.col()`, which resolves a column by name and fails with an ambiguity error when multiple columns share that name.

This PR updates `DataFrameNaFunctions.fill` so that, if the columns to fill are not specified, it resolves the ambiguity gracefully by applying `fill` to all the eligible columns. (Note that if the user specifies the columns, it will still fail due to ambiguity.)

### Why are the changes needed?

If column names are not specified, `fill` should not fail due to ambiguity, since it can still be applied to the eligible columns.

### Does this PR introduce any user-facing change?

Yes, now the above example displays the following:

```
+----+-----+-----+
|col1| col2| col2|
+----+-----+-----+
|   1|hello|    2|
|   3|    4|hello|
+----+-----+-----+
```

### How was this patch tested?

Added new unit tests.

Closes #27407 from imback82/backport-SPARK-29890.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
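The difference between looking a column up by name and filling by position can be illustrated with a toy model. This is only a sketch using plain Scala collections in place of a DataFrame; `DuplicateColumnFill`, `resolveByName`, and `fillAll` are hypothetical names and are not Spark APIs or the code changed by this PR.

```scala
// Toy model only: plain Scala collections stand in for a DataFrame.
// None of these names are Spark APIs (assumption for illustration).
object DuplicateColumnFill {
  // A row is a sequence of nullable string cells aligned with the column names.
  type Row = Seq[String]

  // Name-based lookup: returns the column index, or an error message when
  // the name matches zero columns or more than one column.
  def resolveByName(columns: Seq[String], name: String): Either[String, Int] =
    columns.zipWithIndex.filter(_._1 == name).map(_._2) match {
      case Seq(i) => Right(i)
      case Seq()  => Left(s"Cannot resolve '$name'")
      case _      => Left(s"Reference '$name' is ambiguous")
    }

  // Position-based fill: replaces every null cell without ever resolving a
  // column by name, so duplicate names cannot cause an ambiguity error.
  def fillAll(rows: Seq[Row], value: String): Seq[Row] =
    rows.map(_.map(cell => if (cell == null) value else cell))
}
```

With columns `Seq("col1", "col2", "col2")`, `resolveByName(columns, "col2")` yields a `Left`, while `fillAll` on the two joined rows replaces both nulls regardless of the duplicated name.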
(Backport of #26593)

### What changes were proposed in this pull request?

`DataFrameNaFunctions.fill` doesn't handle duplicate columns even when column names are not specified. Calling `df.na.fill("hello")` on a join result that contains two columns named `col2` produces an `AnalysisException` reporting that the reference `col2` is ambiguous.

The failure occurs because columns are looked up with `Dataset.col()`, which resolves a column by name and fails with an ambiguity error when multiple columns share that name.

This PR updates `DataFrameNaFunctions.fill` so that, if the columns to fill are not specified, it resolves the ambiguity gracefully by applying `fill` to all the eligible columns. (Note that if the user specifies the columns, it will still fail due to ambiguity.)

### Why are the changes needed?

If column names are not specified, `fill` should not fail due to ambiguity, since it can still be applied to the eligible columns.

### Does this PR introduce any user-facing change?

Yes, the same call now fills the nulls in both `col2` columns instead of throwing.

### How was this patch tested?

Added new unit tests.
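Since `fill` with explicitly named columns still fails on ambiguous names, callers who need to target one of the duplicates may want to give every column a unique name first. The helper below is a hypothetical sketch, not part of this PR: it computes unique names by appending the column position to any name that occurs more than once. It assumes the generated names do not collide with existing ones.

```scala
// Hypothetical helper, not part of Spark or of this PR.
object ColumnNames {
  // Appends "_<position>" to every name that occurs more than once;
  // names that are already unique are returned unchanged.
  // Assumption: the suffixed names do not clash with existing column names.
  def disambiguate(names: Seq[String]): Seq[String] = {
    val duplicated: Set[String] =
      names.groupBy(identity).collect { case (n, occ) if occ.size > 1 => n }.toSet
    names.zipWithIndex.map { case (n, i) =>
      if (duplicated(n)) s"${n}_$i" else n
    }
  }
}
```

The result could then be passed to `Dataset.toDF`, for example `df.toDF(ColumnNames.disambiguate(df.columns): _*)`, after which `na.fill("hello", Seq("col2_1"))` refers to exactly one column.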