
[SPARK-31793][SQL] Reduce the memory usage in file scan location metadata #28610

Closed

Conversation

gengliangwang (Member):

What changes were proposed in this pull request?

Currently, the data source scan node stores all of its file paths in its metadata. The metadata is retained when a SparkPlan is converted into SparkPlanInfo, which can be used to construct the Spark plan graph in the UI.

However, the list of paths can be very long (e.g., many partitions may remain even after partition pruning), while the UI pages only require up to 100 bytes for the location metadata. We can trim the paths stored in the metadata to reduce memory usage.
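The truncation idea can be sketched as follows. This is a minimal illustration of the approach described above; the object name, method name, and `stopAppendingThreshold` parameter are assumptions for the sketch, not necessarily the exact API added by this PR:

```scala
// Sketch only: append paths until the accumulated string reaches the
// threshold, then mark the remainder with "..." instead of materializing
// every path into the metadata string.
object LocationMetadataSketch {
  def buildLocationMetadata(paths: Seq[String], stopAppendingThreshold: Int): String = {
    val sb = new StringBuilder("[")
    var i = 0
    while (i < paths.length && sb.length < stopAppendingThreshold) {
      if (i > 0) sb.append(", ")
      sb.append(paths(i))
      i += 1
    }
    // If we stopped early, make the truncation visible to the reader.
    if (i < paths.length) sb.append(", ...")
    sb.append("]").toString
  }
}
```

With this shape, the metadata string stays bounded by roughly the threshold plus one path, regardless of how many partitions the scan covers.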

Why are the changes needed?

Reduce unnecessary memory cost.
In the heap dump of a driver, the SparkPlanInfo instances are quite large, which should be avoided:
[heap dump screenshot omitted]

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

@@ -116,6 +116,39 @@ class DataSourceScanExecRedactionSuite extends DataSourceScanRedactionTest {
assert(isIncluded(df.queryExecution, "Location"))
}
}

test("FileSourceScanExec metadata should contain limited file paths") {
gengliangwang (Member Author):

The metadata in V2 is not accessible. As this is simple, I think we can just test V1 here.

case f: FileSourceScanExec => f.metadata("Location")
}
assert(location.isDefined)
var found = false
Contributor:

How about `assert(location.drop(1).dropRight(1).split(",").length < paths.length)`?

@SparkQA

SparkQA commented May 22, 2020

Test build #122976 has finished for PR 28610 at commit 74dd5c7.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 22, 2020

Test build #122984 has finished for PR 28610 at commit 757626f.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 22, 2020

Test build #122985 has finished for PR 28610 at commit c64bf20.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

retest this please

@SparkQA

SparkQA commented May 22, 2020

Test build #122989 has finished for PR 28610 at commit c64bf20.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -2904,6 +2904,24 @@ private[spark] object Utils extends Logging {
props.forEach((k, v) => resultProps.put(k, v))
resultProps
}

/**
* Convert a sequence of [[Path]] to a metadata string. When the length of metadata string
@HyukjinKwon (Member) commented on May 22, 2020:

[error] /home/jenkins/workspace/SparkPullRequestBuilder@4/core/target/java/org/apache/spark/util/Utils.java:949: error: reference not found
[error]    * Convert a sequence of {@link Path} to a metadata string. When the length of metadata string
[error]    

The real error seems to be this. Maybe we should just convert it to `Path`.

@@ -116,6 +118,30 @@ class DataSourceScanExecRedactionSuite extends DataSourceScanRedactionTest {
assert(isIncluded(df.queryExecution, "Location"))
}
}

test("FileSourceScanExec metadata should contain limited file paths") {
Member:

Let's add the JIRA prefix `SPARK-31793:`, although the JIRA itself is an improvement.

@@ -55,10 +55,12 @@ trait DataSourceScanExec extends LeafExecNode {
// Metadata that describes more details of this scan.
protected def metadata: Map[String, String]

protected val maxMetadataValueLength = 100
Member:

private?

gengliangwang (Member Author):

No, it is used in both DataSourceScanExec and FileSourceScanExec.
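The point of the exchange above is that the threshold must be visible to a subclass, so `protected` rather than `private`. A simplified sketch, in which the trait layout is reduced and the `truncate` helper is hypothetical, for illustration only:

```scala
// Sketch: both the base trait and the file-scan subclass consult the
// threshold, so `private` would not compile for the subclass usage.
trait ScanNodeSketch {
  protected val maxMetadataValueLength = 100

  // Clip any metadata value to the threshold before it reaches the UI.
  protected def truncate(value: String): String =
    if (value.length > maxMetadataValueLength) {
      value.take(maxMetadataValueLength - 3) + "..."
    } else value
}

class FileScanSketch extends ScanNodeSketch {
  def locationMetadata(paths: Seq[String]): String =
    truncate(paths.mkString("[", ", ", "]"))
}
```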

@SparkQA

SparkQA commented May 22, 2020

Test build #123018 has finished for PR 28610 at commit 4fe6be1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 22, 2020

Test build #123016 has finished for PR 28610 at commit 0a089ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 23, 2020

Test build #123021 has finished for PR 28610 at commit 9d736ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member Author):

@cloud-fan @HyukjinKwon @maropu Thanks for the review.

@gengliangwang (Member Author):

Merging to master
