[SPARK-29890][SQL][2.4] DataFrameNaFunctions.fill should handle duplicate columns by imback82 · Pull Request #27407 · apache/spark

imback82 · 2020-01-30T21:32:56Z

(Backport of #26593)

What changes were proposed in this pull request?

DataFrameNaFunctions.fill fails on a DataFrame with duplicate column names, even when no column names are passed to fill.

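// Run in spark-shell, where spark.implicits._ (needed for toDF) is already in scope.
val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2")
val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2")
// Joining on col1 keeps col2 from both sides, so the result has two columns named col2.
val df = left.join(right, Seq("col1"))
df.printSchema
df.na.fill("hello").show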

produces

root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col2: string (nullable = true)

org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could be: col2, col2.;
  at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:259)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:221)
  at org.apache.spark.sql.Dataset.col(Dataset.scala:1268)

The reason for the above failure is that columns are looked up with Dataset.col(), which resolves a column by name. If multiple columns share the same name, the lookup fails due to ambiguity.
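
To make the failure mode concrete, here is a minimal, Spark-free sketch that contrasts name-based lookup with position-based traversal. `Attr`, `resolveByName`, and `fillByPosition` are hypothetical names used only for illustration; this is a toy model of the ambiguity, not Spark's actual resolution code.

```scala
// Toy model only: Attr, resolveByName and fillByPosition are illustrative
// names, not Spark APIs.
final case class Attr(name: String, dataType: String)

object AmbiguityDemo {
  // Name-based lookup: fails when more than one column has the same name.
  def resolveByName(schema: Seq[Attr], name: String): Either[String, Int] =
    schema.zipWithIndex.collect { case (a, i) if a.name == name => i } match {
      case Seq(i) => Right(i)
      case Seq()  => Left(s"Cannot resolve '$name'")
      case many   => Left(s"Reference '$name' is ambiguous (${many.size} matches)")
    }

  // Position-based fill: visits every column exactly once, so duplicate
  // names never need to be resolved.
  def fillByPosition(
      schema: Seq[Attr],
      row: Seq[Option[String]],
      value: String): Seq[Option[String]] =
    schema.zip(row).map {
      case (attr, None) if attr.dataType == "string" => Some(value)
      case (_, cell)                                  => cell
    }

  def main(args: Array[String]): Unit = {
    val schema = Seq(Attr("col1", "string"), Attr("col2", "string"), Attr("col2", "string"))
    println(resolveByName(schema, "col2"))   // a Left describing the ambiguity
    println(fillByPosition(schema, Seq(Some("1"), None, Some("2")), "hello"))
  }
}
```

In this model, looking up `col2` by name has no single answer, while walking the columns by position fills each null without ever consulting the name.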

This PR updates DataFrameNaFunctions.fill so that, when the columns to fill are not specified, it handles the ambiguity gracefully by applying fill to all the eligible columns. (Note that if the user specifies the columns, fill will still fail due to ambiguity.)
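
Since fill with explicitly specified columns still fails on duplicate names, a caller who wants to fill only one of the duplicated columns has to disambiguate first. One possible workaround is sketched below, assuming it is acceptable to rename the columns temporarily. It relies on the existing `Dataset.toDF(colNames: String*)`, `DataFrame.columns`, and `na.fill(value: String, cols: Seq[String])` APIs; the helper name `fillByIndex` and the `__c<i>` temporary naming scheme are hypothetical and not part of Spark or of this PR.

```scala
import org.apache.spark.sql.DataFrame

object FillWorkaround {
  // Hypothetical helper (not part of Spark): fills nulls in the string columns
  // at the given positions, even when several columns share a name.
  def fillByIndex(df: DataFrame, value: String, indices: Seq[Int]): DataFrame = {
    val original = df.columns
    // Give every column a unique temporary name derived from its position.
    val temp = original.indices.map(i => s"__c$i")
    df.toDF(temp: _*)
      .na.fill(value, indices.map(temp)) // name lookup is unambiguous here
      .toDF(original: _*)                // restore the original (duplicate) names
  }
}
```

With the `df` from the example above, `FillWorkaround.fillByIndex(df, "hello", Seq(1))` would target only the first `col2` (the one from `left`) and leave the second untouched.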

Why are the changes needed?

If column names are not specified, fill should not fail due to ambiguity since it should still be able to apply fill to the eligible columns.

Does this PR introduce any user-facing change?

Yes, now the above example displays the following:

+----+-----+-----+
|col1| col2| col2|
+----+-----+-----+
|   1|hello|    2|
|   3|    4|hello|
+----+-----+-----+

How was this patch tested?

Added new unit tests.

imback82 · 2020-01-30T21:39:06Z

cc @dongjoon-hyun

dongjoon-hyun · 2020-01-30T21:58:50Z

Thank you so much, @imback82 .
cc @cloud-fan and @PavithraRamachandran

dongjoon-hyun · 2020-01-30T21:59:13Z

Also, cc @gatorsmile

imback82 · 2020-01-30T22:12:01Z

Do you think we need to backport #26700, which fixes a very similar issue?

dongjoon-hyun · 2020-01-30T22:16:30Z

I switched the type of SPARK-30065 to Bug. Yes, please proceed for that, too. Thanks!

dongjoon-hyun

+1, LGTM. (Pending Jenkins.)
I tested the changed part and the new test cases locally.
Thank you, @imback82 !

SparkQA · 2020-01-31T00:51:37Z

Test build #117587 has finished for PR 27407 at commit 776a294.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-01-31T00:51:56Z

Merged to branch-2.4.

…cate columns (Backport of #26593)

Closes #27407 from imback82/backport-SPARK-29890.
Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

initial commit

776a294

dongjoon-hyun added the SQL label Jan 30, 2020

dongjoon-hyun approved these changes Jan 30, 2020

dongjoon-hyun closed this Jan 31, 2020

imback82 wants to merge 1 commit into apache:branch-2.4 from imback82:backport-SPARK-29890
