[SPARK-30065][SQL] DataFrameNaFunctions.drop should handle duplicate columns by imback82 · Pull Request #26700 · apache/spark

imback82 · 2019-11-28T06:07:47Z

What changes were proposed in this pull request?

DataFrameNaFunctions.drop fails on a DataFrame with duplicate column names, even when no column names are passed to it. For example:

val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2")
val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2")
val df = left.join(right, Seq("col1"))
df.printSchema
df.na.drop("any").show

produces

root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col2: string (nullable = true)

org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could be: col2, col2.;
  at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240)

The failure occurs because columns are resolved by name: when several columns share the same name, resolution fails with an ambiguity error.

This PR updates DataFrameNaFunctions.drop so that, when the columns to drop are not specified, it avoids the ambiguity by applying drop to all eligible columns. (Note that if the user does specify columns, a duplicated name still fails with the ambiguity error.)
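The difference between resolving by name and addressing every column directly can be illustrated with a small toy model in plain Scala. This is only a sketch and not Spark's implementation: `Attr`, `resolveByName`, and `dropAny` are hypothetical names invented for the illustration, and rows are modelled as vectors of `Option[String]`.

```scala
// Toy model only: Attr, resolveByName and dropAny are hypothetical names,
// not Spark's internal API.
final case class Attr(id: Int, name: String)

object NaDropSketch {
  type Row = Vector[Option[String]]

  // Name-based resolution: fails when more than one attribute has the name.
  def resolveByName(schema: Seq[Attr], name: String): Either[String, Attr] =
    schema.filter(_.name == name) match {
      case Seq(single) => Right(single)
      case Seq()       => Left(s"Cannot resolve '$name'")
      case many =>
        Left(s"Reference '$name' is ambiguous, could be: ${many.map(_.name).mkString(", ")}.")
    }

  // "any" semantics over every attribute, addressed by position rather than
  // by name, so duplicate names never have to be resolved.
  def dropAny(schema: Seq[Attr], rows: Seq[Row]): Seq[Row] =
    rows.filter(row => schema.indices.forall(i => row(i).isDefined))
}

object NaDropSketchDemo {
  def main(args: Array[String]): Unit = {
    // Same shape as the joined DataFrame in the description: col1, col2, col2.
    val schema = Seq(Attr(0, "col1"), Attr(1, "col2"), Attr(2, "col2"))
    val rows: Seq[NaDropSketch.Row] = Seq(
      Vector(Some("1"), None, Some("2")),
      Vector(Some("3"), Some("4"), None)
    )

    println(NaDropSketch.resolveByName(schema, "col2")) // a Left: the name is ambiguous
    println(NaDropSketch.dropAny(schema, rows).isEmpty) // true: every row contains a null
  }
}
```

In this model the name-based lookup of `col2` is the only step that can fail, which mirrors why the fix applies only when no column names are given.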

Why are the changes needed?

If column names are not specified, drop should not fail due to ambiguity, because it can still be applied to every eligible column.

Does this PR introduce any user-facing change?

Yes. Every row in the example above contains a null, so all rows are now dropped instead of an AnalysisException being thrown:

scala> df.na.drop("any").show
+----+----+----+
|col1|col2|col2|
+----+----+----+
+----+----+----+
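As the description notes, passing explicit column names still fails when a name is duplicated. One possible way to target a single one of the duplicated columns, sketched here as an assumption rather than anything this PR proposes, is to rename the joined columns positionally with `toDF` first. The object name `NaDropExplicitColumns`, the method `dropOnLeftCol2`, and the new column names `left_col2` and `right_col2` are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object NaDropExplicitColumns {
  // Builds the joined DataFrame from the description, gives every column a
  // unique name, and drops rows whose left-hand col2 is null.
  def dropOnLeftCol2(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2")
    val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2")
    val joined = left.join(right, Seq("col1"))

    // toDF renames by position, so the duplicate "col2" names are never
    // looked up by name. The new names are hypothetical.
    val renamed = joined.toDF("col1", "left_col2", "right_col2")

    // With unique names, an explicit column list resolves without ambiguity.
    renamed.na.drop("any", Seq("left_col2"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("na-drop-explicit-columns")
      .getOrCreate()
    try dropOnLeftCol2(spark).show()
    finally spark.stop()
  }
}
```

Only the row with `col1 = "3"` has a non-null `left_col2`, so it is the only row kept.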

How was this patch tested?

Added new unit tests.

imback82 · 2019-11-28T06:08:15Z

cc: @cloud-fan

SparkQA · 2019-11-28T08:05:02Z

Test build #114563 has finished for PR 26700 at commit 0c2dc77.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

imback82 · 2019-11-28T08:28:18Z

retest this please

SparkQA · 2019-11-28T10:49:03Z

Test build #114567 has finished for PR 26700 at commit 0c2dc77.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2019-11-28T14:55:08Z

Retest this please

SparkQA · 2019-11-28T16:01:39Z

Test build #114581 has finished for PR 26700 at commit 0c2dc77.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

imback82 · 2019-11-28T19:09:12Z

retest this please

SparkQA · 2019-11-28T20:45:06Z

Test build #114586 has finished for PR 26700 at commit 0c2dc77.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

imback82 · 2019-11-28T22:21:44Z

Retest this please

SparkQA · 2019-11-29T01:39:12Z

Test build #114588 has finished for PR 26700 at commit 0c2dc77.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-12-01T06:43:51Z

Test build #114675 has finished for PR 26700 at commit bbc8852.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait OperationHelper
class HiveThriftServer2AppStatusStore(
class HiveThriftServer2HistoryServerPlugin extends AppHistoryServerPlugin

cloud-fan · 2019-12-02T04:25:39Z

thanks, merging to master!

…columns

Closes apache#26700 from imback82/na_drop.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…cate columns (Backport of #26700)

Closes #27411 from imback82/backport-SPARK-30065.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

…olumns

### What changes were proposed in this pull request?

#26700 removed the ability to drop a row whose nested column value is null.

For example, for the following `df`:
```
val schema = new StructType()
  .add("c1", new StructType()
    .add("c1-1", StringType)
    .add("c1-2", StringType))
val data = Seq(Row(Row(null, "a2")), Row(Row("b1", "b2")), Row(null))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.show
+--------+
|      c1|
+--------+
|  [, a2]|
|[b1, b2]|
|    null|
+--------+
```
In Spark 2.4.4,
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|[b1, b2]|
+--------+
```
In Spark 2.4.5 or Spark 3.0.0-preview2, if nested columns are specified, they are ignored.
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|  [, a2]|
|[b1, b2]|
|    null|
+--------+
```

### Why are the changes needed?

This seems like a regression.

### Does this PR introduce any user-facing change?

Now, the nested column can be specified:
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|[b1, b2]|
+--------+
```

Also, if `*` is specified as a column, it will throw an `AnalysisException` that `*` cannot be resolved, which was the behavior in 2.4.4. Currently, in master, it has no effect.

### How was this patch tested?

Updated existing tests.

Closes #28266 from imback82/SPARK-31256.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…olumns

Closes #28266 from imback82/SPARK-31256.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d7499ae)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

initial checkin

0c2dc77

Merge branch 'master' into na_drop

bbc8852

cloud-fan closed this in 5a1896a Dec 2, 2019

This was referenced Jan 30, 2020

[SPARK-29890][SQL][2.4] DataFrameNaFunctions.fill should handle duplicate columns #27407

Closed

[SPARK-30065][SQL][2.4] DataFrameNaFunctions.drop should handle duplicate columns #27411

Closed

dongjoon-hyun added the SQL label Feb 5, 2020

imback82 mentioned this pull request Apr 19, 2020

[SPARK-31256][SQL] DataFrameNaFunctions.drop should work for nested columns #28266

Closed

imback82 wants to merge 2 commits into apache:master from imback82:na_drop
