
[SPARK-32511][SQL] Add dropFields method to Column class #29322

Closed
wants to merge 7 commits into from

Conversation

fqaiser94
Contributor

@fqaiser94 fqaiser94 commented Jul 31, 2020

What changes were proposed in this pull request?

Added a new dropFields method to the Column class.
This method should allow users to drop a StructField in a StructType column (with similar semantics to the drop method on Dataset).
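As a rough sketch of the intended semantics (the SparkSession setup and example data here are illustrative assumptions, not code from this PR):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

// A single struct column `a` with two fields, x and y.
val df = spark.sql("SELECT named_struct('x', 1, 'y', 2) AS a")

// Dataset.drop removes a whole top-level column:
df.drop("a")

// The proposed Column.dropFields removes a field *inside* a struct column,
// leaving the sibling fields in place:
df.withColumn("a", $"a".dropFields("y"))
// resulting schema: a: struct<x: int>
```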

Why are the changes needed?

Spark users often have to work with deeply nested data, e.g. to fix a data-quality issue in an existing StructField. To do this with the existing Spark APIs, users have to rebuild the entire struct column.

For example, let's say you have the following deeply nested data structure which has a data quality issue (5 is missing):

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val data = spark.createDataFrame(sc.parallelize(
      Seq(Row(Row(Row(1, 2, 3), Row(Row(4, null, 6), Row(7, 8, 9), Row(10, 11, 12)), Row(13, 14, 15))))),
      StructType(Seq(
        StructField("a", StructType(Seq(
          StructField("a", StructType(Seq(
            StructField("a", IntegerType),
            StructField("b", IntegerType),
            StructField("c", IntegerType)))),
          StructField("b", StructType(Seq(
            StructField("a", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType)))),
            StructField("b", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType)))), 
            StructField("c", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType))))
          ))), 
          StructField("c", StructType(Seq(
            StructField("a", IntegerType),
            StructField("b", IntegerType),
            StructField("c", IntegerType))))
        )))))).cache

data.show(false)
+-------------------------------------------------------------+
|a                                                            |
+-------------------------------------------------------------+
|[[1, 2, 3], [[4,, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]|
+-------------------------------------------------------------+

Currently, to drop the missing value users would have to do something like this:

val result = data.withColumn("a", 
  struct(
    $"a.a", 
    struct(
      struct(
        $"a.b.a.a", 
        $"a.b.a.c"
      ).as("a"), 
      $"a.b.b", 
      $"a.b.c"
    ).as("b"), 
    $"a.c"
  ))

result.show(false)
+------------------------------------------------------------+
|a                                                           |
+------------------------------------------------------------+
|[[1, 2, 3], [[4, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]|
+------------------------------------------------------------+

As you can see above, with the existing methods users must call the struct function and list all fields, including fields they don't want to change. This is not ideal because:

  • it leads to complex, fragile code that cannot survive schema evolution (see SPARK-16483).
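To make the fragility concrete: suppose the source schema later evolves and a hypothetical new field a.b.a.d appears upstream. Because the hand-written rebuild enumerates every field explicitly, the new field is silently lost:

```scala
// Hypothetical schema evolution: upstream adds a new field a.b.a.d.
// The manual rebuild must be updated by hand, otherwise the new field
// silently disappears from the result:
val rebuilt = data.withColumn("a",
  struct(
    $"a.a",
    struct(
      struct($"a.b.a.a", $"a.b.a.c").as("a"), // a.b.a.d would be dropped here
      $"a.b.b",
      $"a.b.c"
    ).as("b"),
    $"a.c"
  ))
// By contrast, dropFields("b.a.b") removes only the field it names and
// keeps everything else, including fields added after the code was written.
```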

In contrast, with the method added in this PR, a user could simply do something like this to get the same result:

val result = data.withColumn("a", 'a.dropFields("b.a.b"))
result.show(false)
+------------------------------------------------------------+
|a                                                           |
+------------------------------------------------------------+
|[[1, 2, 3], [[4, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]|
+------------------------------------------------------------+
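If the method accepts varargs, i.e. dropFields(fieldNames: String*) (an assumption about the final signature, not stated above), several fields could be dropped in one call:

```scala
// Assumes dropFields(fieldNames: String*); field names use dot notation
// for nesting, relative to the struct column the method is called on.
val trimmed = data.withColumn("a", $"a".dropFields("b.a.b", "c"))
// would drop the nested field a.b.a.b and the top-level struct field a.c
```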

This is the second of perhaps three methods that could be added to the Column class to make it easier to manipulate nested data.
Other methods under discussion in SPARK-22231 include withFieldRenamed.
However, that should be added in a separate PR.

Does this PR introduce any user-facing change?

Only one minor change. If the user submits the following query:

df.withColumn("a", $"a".withField(null, null))

instead of throwing:

java.lang.IllegalArgumentException: requirement failed: fieldName cannot be null

it will now throw:

java.lang.IllegalArgumentException: requirement failed: col cannot be null

I don't believe this change should be an issue because:

  • neither message is incorrect
  • Spark 3.1.0 has yet to be released

but please feel free to correct me if I am wrong.

How was this patch tested?

New unit tests were added.

Related JIRAs:

More discussion on this topic can be found here:

@fqaiser94
Contributor Author

cc @cloud-fan @dbtsai @viirya

@dbtsai
Member

dbtsai commented Aug 2, 2020

Jenkins, test this please.

@dbtsai
Member

dbtsai commented Aug 2, 2020

Jenkins, add to whitelist.

@SparkQA

SparkQA commented Aug 2, 2020

Test build #126944 has finished for PR 29322 at commit 19587e8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@fqaiser94
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 3, 2020

Test build #126947 has finished for PR 29322 at commit 19587e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 4, 2020

Test build #127062 has finished for PR 29322 at commit 7342514.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class RegExpExtractAll(subject: Expression, regexp: Expression, idx: Expression)
  • .doc("The name of a class that implements " +
  • trait CachedBatch
  • trait CachedBatchSerializer extends Serializable
  • trait SimpleMetricsCachedBatch extends CachedBatch
  • abstract class SimpleMetricsCachedBatchSerializer extends CachedBatchSerializer with Logging
  • case class ColumnarToRowExec(child: SparkPlan) extends ColumnarToRowTransition with CodegenSupport
  • case class ApplyColumnarRulesAndInsertTransitions(
  • class ColumnStatisticsSchema(a: Attribute) extends Serializable
  • class PartitionStatistics(tableSchema: Seq[Attribute]) extends Serializable
  • case class DefaultCachedBatch(numRows: Int, buffers: Array[Array[Byte]], stats: InternalRow)
  • class DefaultCachedBatchSerializer extends SimpleMetricsCachedBatchSerializer

@SparkQA

SparkQA commented Aug 4, 2020

Test build #127063 has finished for PR 29322 at commit 948fc9c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@fqaiser94 fqaiser94 changed the base branch from master to branch-0.5 August 12, 2020 00:27
@fqaiser94 fqaiser94 changed the base branch from branch-0.5 to master August 12, 2020 00:27
@SparkQA

SparkQA commented Aug 12, 2020

Test build #127351 has finished for PR 29322 at commit ad111ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@cloud-fan cloud-fan left a comment

LGTM except a few minor comments

@fqaiser94
Contributor Author

@cloud-fan thanks for your review! Also, could you remove the CORE, R, and WEB UI labels from this PR please? They were added incorrectly by the bot when I merged with master.

@SparkQA

SparkQA commented Aug 13, 2020

Test build #127390 has finished for PR 29322 at commit 2b0ac34.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@fqaiser94
Contributor Author

retest this please

@cloud-fan
Contributor

The last commit already passed Jenkins; I'm merging it to master, thanks!

@cloud-fan cloud-fan closed this in 0c850c7 Aug 13, 2020
@SparkQA

SparkQA commented Aug 13, 2020

Test build #127394 has finished for PR 29322 at commit 2b0ac34.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Hi, @cloud-fan . Could you update the Apache Jira issue, SPARK-32511, according to your revert, please?

@cloud-fan
Contributor

reopened
