[SPARK-31256][SQL] DataFrameNaFunctions.drop should work for nested columns (#28266)

imback82 wants to merge 1 commit into apache:master from imback82/SPARK-31256
Conversation
```
val exception = intercept[AnalysisException] {
  df.na.drop("any", Seq("*"))
}
assert(exception.getMessage.contains("Cannot resolve column name \"*\""))
```
Note that this was the behavior in Spark 2.4.4. We can handle this more gracefully (e.g., use outputAttributes) if we need to.
On a side note, for fill, * is ignored in Spark 2.4.4.
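To make the restored check concrete, here is a minimal plain-Scala sketch of the idea (not Spark's actual resolver; the object name and the one-column schema are invented for illustration): a lookup that rejects `*` loudly instead of silently ignoring it.

```scala
// Toy name lookup mirroring the restored 2.4.4 behavior for na.drop:
// "*" is not a resolvable column name, so resolution fails with an error
// instead of the column list silently having no effect.
object NaDropResolverSketch {
  private val topLevelColumns = Set("c1") // hypothetical schema

  def resolve(name: String): String = {
    val resolvable = topLevelColumns.exists(c => name == c || name.startsWith(c + "."))
    if (name.contains("*") || !resolvable)
      throw new IllegalArgumentException(s"Cannot resolve column name \"$name\"")
    name
  }
}
```

With this sketch, `resolve("c1.c1-1")` succeeds while `resolve("*")` throws, matching the 2.4.4-era `AnalysisException` described above.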
```
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data), schema)
// Nested columns are ignored for fill().
checkAnswer(df.na.fill("a1", Seq("c1.c1-1")), df)
```
Note that nested columns are ignored for fill in Spark 2.4.4.
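For illustration, the fill behavior can be modeled in plain Scala (a sketch under assumptions, not Spark code; the function and row representation are invented): only top-level column names can ever match, so a nested name like `"c1.c1-1"` matches nothing and the data comes back unchanged.

```scala
// Toy model of fill's 2.4.4 semantics: a row is a map of top-level column
// name -> nullable value. A nested name such as "c1.c1-1" never equals a
// top-level name, so it is silently skipped and nothing is filled.
def fillSketch(row: Map[String, Option[String]],
               replacement: String,
               cols: Seq[String]): Map[String, Option[String]] =
  row.map { case (name, value) =>
    if (cols.contains(name) && value.isEmpty) name -> Some(replacement)
    else name -> value
  }
```

Filling on `Seq("c1.c1-1")` returns the row unchanged, while filling on `Seq("c1")` replaces the null, which is the ignored-vs-applied contrast the test above relies on.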
@cloud-fan Please let me know if this PR (going back to 2.4.4 behavior) makes sense. Thanks!
Nice. Duplicate column names hit the same problem as struct aliases, which do not work in the `toAttributes` method in `DataFrameNaFunctions`.
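One reason per-name attribute lookup is fragile can be shown with a tiny illustration (hypothetical helper, not Spark code): a dotted name is ambiguous between a nested field and a top-level column whose name literally contains a dot (which Spark SQL lets you write with backticks).

```scala
// Illustrative only: "c1.c1-1" could mean field "c1-1" of struct "c1", or a
// flat column literally named "c1.c1-1". A naive split on "." always assumes
// the nested reading, which is exactly where name-based lookup breaks down.
def naiveSplit(name: String): Seq[String] = name.split("\\.").toSeq
```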
Test build #121490 has finished for PR 28266 at commit
### What changes were proposed in this pull request?

#26700 removed the ability to drop a row whose nested column value is null. For example, for the following `df`:
```
val schema = new StructType()
  .add("c1", new StructType()
    .add("c1-1", StringType)
    .add("c1-2", StringType))
val data = Seq(Row(Row(null, "a2")), Row(Row("b1", "b2")), Row(null))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.show
+--------+
|      c1|
+--------+
|  [, a2]|
|[b1, b2]|
|    null|
+--------+
```
In Spark 2.4.4,
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|[b1, b2]|
+--------+
```
In Spark 2.4.5 or Spark 3.0.0-preview2, if nested columns are specified, they are ignored.
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|  [, a2]|
|[b1, b2]|
|    null|
+--------+
```

### Why are the changes needed?

This seems like a regression.

### Does this PR introduce any user-facing change?

Yes. The nested column can now be specified:
```
df.na.drop("any", Seq("c1.c1-1")).show
+--------+
|      c1|
+--------+
|[b1, b2]|
+--------+
```
Also, if `*` is specified as a column, it will throw an `AnalysisException` that `*` cannot be resolved, which was the behavior in 2.4.4. Currently, in master, it has no effect.

### How was this patch tested?

Updated existing tests.

Closes #28266 from imback82/SPARK-31256.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d7499ae)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
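The restored drop semantics can be modeled in plain Scala (a sketch, not the Spark implementation; the case classes are invented stand-ins for the example schema, with underscores because hyphens are not legal Scala identifiers):

```scala
// Plain-Scala model of df.na.drop("any", Seq("c1.c1-1")) after this fix:
// a row survives only when the value at the nested path is non-null.
// A null struct also drops the row, matching the 2.4.4 output above.
case class C1(c1_1: Option[String], c1_2: Option[String])
case class R(c1: Option[C1])

def dropAnyOnC1_1(rows: Seq[R]): Seq[R] =
  rows.filter(_.c1.exists(_.c1_1.isDefined))

val data = Seq(
  R(Some(C1(None, Some("a2")))),        // [, a2]   -> dropped
  R(Some(C1(Some("b1"), Some("b2")))),  // [b1, b2] -> kept
  R(None))                              // null     -> dropped
```

Running the filter over `data` keeps only the `[b1, b2]` row, which is the 2.4.4 behavior this PR restores.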
```
checkAnswer(df.select("c1.c1-1"),
  Row(null) :: Row("b1") :: Row(null) :: Nil)

test("drop with nested columns") {
```
nit: This looks like a bug, so could you add the prefix: SPARK-31256.
merging to master/3.0/2.4
So, SPARK-31256 is a regression introduced in 2.4.5, and this recovers it?
@dongjoon-hyun yes
Thank you for confirmation~