
[SPARK-31793][SQL] Reduce the memory usage in file scan location metadata #28610

Closed

Conversation

gengliangwang (Member):

What changes were proposed in this pull request?

Currently, the data source scan node stores all of its file paths in its metadata. The metadata is retained when a SparkPlan is converted into SparkPlanInfo, which can be used to construct the Spark plan graph in the UI.

However, the list of paths can be very long (e.g., many partitions may remain even after partition pruning), while the UI pages only require up to 100 bytes for the location metadata. We can trim the paths stored in the metadata to reduce memory usage.
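The truncation idea can be sketched as follows. This is a minimal illustration of the approach described above; the object name, method name, and `stopAppendingThreshold` parameter are assumptions for the sketch, not necessarily the exact API added by this PR:

```scala
// Sketch only: append paths until the accumulated string reaches the
// threshold, then mark the remainder with "..." instead of materializing
// every path into the metadata string.
object LocationMetadataSketch {
  def buildLocationMetadata(paths: Seq[String], stopAppendingThreshold: Int): String = {
    val sb = new StringBuilder("[")
    var i = 0
    while (i < paths.length && sb.length < stopAppendingThreshold) {
      if (i > 0) sb.append(", ")
      sb.append(paths(i))
      i += 1
    }
    // If we stopped early, make the truncation visible to the reader.
    if (i < paths.length) sb.append(", ...")
    sb.append("]").toString
  }
}
```

With this shape, the metadata string stays bounded by roughly the threshold plus one path, regardless of how many partitions the scan covers.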

Why are the changes needed?

Reduce unnecessary memory cost.
In the heap dump of a driver, the SparkPlanInfo instances are quite large, which should be avoided:
[heap dump screenshot omitted]

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

@@ -116,6 +116,39 @@ class DataSourceScanExecRedactionSuite extends DataSourceScanRedactionTest {
assert(isIncluded(df.queryExecution, "Location"))
}
}

test("FileSourceScanExec metadata should contain limited file paths") {
gengliangwang (Member Author):

The metadata in V2 is not accessible. As this is simple, I think we can just test V1 here.

case f: FileSourceScanExec => f.metadata("Location")
}
assert(location.isDefined)
var found = false
Contributor:

How about `assert(location.drop(1).dropRight(1).split(",").length < paths.length)`?

@SparkQA

SparkQA commented May 22, 2020

Test build #122976 has finished for PR 28610 at commit 74dd5c7.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 22, 2020

Test build #122984 has finished for PR 28610 at commit 757626f.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 22, 2020

Test build #122985 has finished for PR 28610 at commit c64bf20.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

retest this please

@SparkQA

SparkQA commented May 22, 2020

Test build #122989 has finished for PR 28610 at commit c64bf20.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -2904,6 +2904,24 @@ private[spark] object Utils extends Logging {
props.forEach((k, v) => resultProps.put(k, v))
resultProps
}

/**
* Convert a sequence of [[Path]] to a metadata string. When the length of metadata string
@HyukjinKwon (Member) commented on May 22, 2020:

[error] /home/jenkins/workspace/SparkPullRequestBuilder@4/core/target/java/org/apache/spark/util/Utils.java:949: error: reference not found
[error]    * Convert a sequence of {@link Path} to a metadata string. When the length of metadata string
[error]    

The real error seems to be this. Maybe we should just convert it to `Path`.

@@ -116,6 +118,30 @@ class DataSourceScanExecRedactionSuite extends DataSourceScanRedactionTest {
assert(isIncluded(df.queryExecution, "Location"))
}
}

test("FileSourceScanExec metadata should contain limited file paths") {
Member:

Let's add the JIRA prefix `SPARK-31793:`, although the JIRA itself is an improvement.

@@ -55,10 +55,12 @@ trait DataSourceScanExec extends LeafExecNode {
// Metadata that describes more details of this scan.
protected def metadata: Map[String, String]

protected val maxMetadataValueLength = 100
Member:

private?

gengliangwang (Member Author):

No, it is used in both DataSourceScanExec and FileSourceScanExec.
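The point of the exchange above is that the threshold must be visible to a subclass, so `protected` rather than `private`. A simplified sketch, in which the trait layout is reduced and the `truncate` helper is hypothetical, for illustration only:

```scala
// Sketch: both the base trait and the file-scan subclass consult the
// threshold, so `private` would not compile for the subclass usage.
trait ScanNodeSketch {
  protected val maxMetadataValueLength = 100

  // Clip any metadata value to the threshold before it reaches the UI.
  protected def truncate(value: String): String =
    if (value.length > maxMetadataValueLength) {
      value.take(maxMetadataValueLength - 3) + "..."
    } else value
}

class FileScanSketch extends ScanNodeSketch {
  def locationMetadata(paths: Seq[String]): String =
    truncate(paths.mkString("[", ", ", "]"))
}
```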

@SparkQA

SparkQA commented May 22, 2020

Test build #123018 has finished for PR 28610 at commit 4fe6be1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 22, 2020

Test build #123016 has finished for PR 28610 at commit 0a089ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 23, 2020

Test build #123021 has finished for PR 28610 at commit 9d736ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member Author):

@cloud-fan @HyukjinKwon @maropu Thanks for the review.

@gengliangwang (Member Author):

Merging to master
