[SPARK-31256][SQL] DataFrameNaFunctions.drop should work for nested columns (#28266)

imback82 wants to merge 1 commit into apache:master from imback82/SPARK-31256
Conversation
```
val exception = intercept[AnalysisException] {
  df.na.drop("any", Seq("*"))
}
assert(exception.getMessage.contains("Cannot resolve column name \"*\""))
```
Note that this was the behavior in Spark 2.4.4. We can handle this more gracefully (e.g., use outputAttributes) if we need to.
On a side note, for fill, * is ignored in Spark 2.4.4.
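To make the restored check concrete, here is a minimal plain-Scala sketch of the idea (not Spark's actual resolver; the object name and the one-column schema are invented for illustration): a lookup that rejects `*` loudly instead of silently ignoring it.

```scala
// Toy name lookup mirroring the restored 2.4.4 behavior for na.drop:
// "*" is not a resolvable column name, so resolution fails with an error
// instead of the column list silently having no effect.
object NaDropResolverSketch {
  private val topLevelColumns = Set("c1") // hypothetical schema

  def resolve(name: String): String = {
    val resolvable = topLevelColumns.exists(c => name == c || name.startsWith(c + "."))
    if (name.contains("*") || !resolvable)
      throw new IllegalArgumentException(s"Cannot resolve column name \"$name\"")
    name
  }
}
```

With this sketch, `resolve("c1.c1-1")` succeeds while `resolve("*")` throws, matching the 2.4.4-era `AnalysisException` described above.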
```
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data), schema)
// Nested columns are ignored for fill().
checkAnswer(df.na.fill("a1", Seq("c1.c1-1")), df)
```
Note that nested columns are ignored for fill in Spark 2.4.4.
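For illustration, the fill behavior can be modeled in plain Scala (a sketch under assumptions, not Spark code; the function and row representation are invented): only top-level column names can ever match, so a nested name like `"c1.c1-1"` matches nothing and the data comes back unchanged.

```scala
// Toy model of fill's 2.4.4 semantics: a row is a map of top-level column
// name -> nullable value. A nested name such as "c1.c1-1" never equals a
// top-level name, so it is silently skipped and nothing is filled.
def fillSketch(row: Map[String, Option[String]],
               replacement: String,
               cols: Seq[String]): Map[String, Option[String]] =
  row.map { case (name, value) =>
    if (cols.contains(name) && value.isEmpty) name -> Some(replacement)
    else name -> value
  }
```

Filling on `Seq("c1.c1-1")` returns the row unchanged, while filling on `Seq("c1")` replaces the null, which is the ignored-vs-applied contrast the test above relies on.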
@cloud-fan Please let me know if this PR (going back to 2.4.4 behavior) makes sense. Thanks!
Nice. Duplicate column names hit the same problem as struct aliases, which do not work in the `toAttributes` method in `DataFrameNaFunctions`.
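One reason per-name attribute lookup is fragile can be shown with a tiny illustration (hypothetical helper, not Spark code): a dotted name is ambiguous between a nested field and a top-level column whose name literally contains a dot (which Spark SQL lets you write with backticks).

```scala
// Illustrative only: "c1.c1-1" could mean field "c1-1" of struct "c1", or a
// flat column literally named "c1.c1-1". A naive split on "." always assumes
// the nested reading, which is exactly where name-based lookup breaks down.
def naiveSplit(name: String): Seq[String] = name.split("\\.").toSeq
```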
Test build #121490 has finished for PR 28266 at commit
### What changes were proposed in this pull request?

#26700 removed the ability to drop a row whose nested column value is null. For example, for the following `df`:
```
val schema = new StructType()
  .add("c1", new StructType()
    .add("c1-1", StringType)
    .add("c1-2", StringType))
val data = Seq(Row(Row(null, "a2")), Row(Row("b1", "b2")), Row(null))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.show
+--------+
|      c1|
+--------+
|  [, a2]|
|[b1, b2]|
|    null|
+--------+
```
In Spark 2.4.4,
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|[b1, b2]|
+--------+
```
In Spark 2.4.5 or Spark 3.0.0-preview2, if nested columns are specified, they are ignored.
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|  [, a2]|
|[b1, b2]|
|    null|
+--------+
```

### Why are the changes needed?

This seems like a regression.

### Does this PR introduce any user-facing change?

Yes. The nested column can now be specified:
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|[b1, b2]|
+--------+
```
Also, if `*` is specified as a column, it will throw an `AnalysisException` that `*` cannot be resolved, which was the behavior in 2.4.4. Currently, in master, it has no effect.

### How was this patch tested?

Updated existing tests.

Closes #28266 from imback82/SPARK-31256.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d7499ae)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
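The restored drop semantics can be modeled in plain Scala (a sketch, not the Spark implementation; the case classes are invented stand-ins for the example schema, with underscores because hyphens are not legal Scala identifiers):

```scala
// Plain-Scala model of df.na.drop("any", Seq("c1.c1-1")) after this fix:
// a row survives only when the value at the nested path is non-null.
// A null struct also drops the row, matching the 2.4.4 output above.
case class C1(c1_1: Option[String], c1_2: Option[String])
case class R(c1: Option[C1])

def dropAnyOnC1_1(rows: Seq[R]): Seq[R] =
  rows.filter(_.c1.exists(_.c1_1.isDefined))

val data = Seq(
  R(Some(C1(None, Some("a2")))),        // [, a2]   -> dropped
  R(Some(C1(Some("b1"), Some("b2")))),  // [b1, b2] -> kept
  R(None))                              // null     -> dropped
```

Running the filter over `data` keeps only the `[b1, b2]` row, which is the 2.4.4 behavior this PR restores.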
```
checkAnswer(df.select("c1.c1-1"),
  Row(null) :: Row("b1") :: Row(null) :: Nil)

test("drop with nested columns") {
```
nit: This looks like a bug, so could you add the prefix: SPARK-31256.
merging to master/3.0/2.4
So, SPARK-31256 is a regression introduced in 2.4.5, and this recovers it?
@dongjoon-hyun yes
Thank you for confirmation~