[SPARK-31793][SQL] Reduce the memory usage in file scan location metadata #28610
Conversation
@@ -116,6 +116,39 @@ class DataSourceScanExecRedactionSuite extends DataSourceScanRedactionTest {
    assert(isIncluded(df.queryExecution, "Location"))
  }
}

test("FileSourceScanExec metadata should contain limited file paths") {
The metadata in V2 is not accessible. As this is simple, I think we can just test V1 here.
  case f: FileSourceScanExec => f.metadata("Location")
}
assert(location.isDefined)
var found = false
How about assert(location.drop(1).dropRight(1).split(",").length < paths.length)?
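For context, a minimal sketch of how that suggested check could sit in the V1 test. This is not the exact test added by the PR: `df` and `paths` are assumed to come from the surrounding test setup, and the bracketed, comma-separated rendering of the `Location` value is an assumption here.

```scala
import org.apache.spark.sql.execution.FileSourceScanExec

// Hypothetical sketch: locate the V1 file scan node and read its "Location" metadata.
val location = df.queryExecution.executedPlan.collectFirst {
  case f: FileSourceScanExec => f.metadata("Location")
}
assert(location.isDefined)
// Assuming the value renders as "[path1, path2, ...]": strip the brackets and count
// the comma-separated entries, which should be fewer than the paths actually scanned.
assert(location.get.drop(1).dropRight(1).split(",").length < paths.length)
```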
Test build #122976 has finished for PR 28610 at commit
Test build #122984 has finished for PR 28610 at commit
Test build #122985 has finished for PR 28610 at commit
retest this please
Test build #122989 has finished for PR 28610 at commit
@@ -2904,6 +2904,24 @@ private[spark] object Utils extends Logging {
    props.forEach((k, v) => resultProps.put(k, v))
    resultProps
  }

  /**
   * Convert a sequence of [[Path]] to a metadata string. When the length of metadata string
[error] /home/jenkins/workspace/SparkPullRequestBuilder@4/core/target/java/org/apache/spark/util/Utils.java:949: error: reference not found
[error] * Convert a sequence of {@link Path} to a metadata string. When the length of metadata string
[error]
The real error seems to be this. Maybe we should just convert it to `Path`.
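In other words, unidoc turns `[[Path]]` into `{@link Path}` in the generated Java sources and that reference cannot be resolved, so the scaladoc should mention `Path` in plain backticks instead. Below is a minimal sketch of what the new `Utils` helper could look like with that doc fix applied; the method name `buildLocationMetadata` and the exact truncation rule are assumptions for illustration, not the merged code.

```scala
import org.apache.hadoop.fs.Path

/**
 * Convert a sequence of `Path` to a metadata string. When the length of the metadata
 * string exceeds `stopAppendingThreshold`, stop appending the remaining paths.
 */
def buildLocationMetadata(paths: Seq[Path], stopAppendingThreshold: Int): String = {
  val metadata = new StringBuilder("[")
  var i = 0
  // Append paths only while the string is still below the threshold, so the result
  // stays bounded instead of growing with the number of partitions.
  while (i < paths.length && metadata.length < stopAppendingThreshold) {
    if (i > 0) metadata.append(", ")
    metadata.append(paths(i).toString)
    i += 1
  }
  metadata.append("]")
  metadata.toString
}
```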
@@ -116,6 +118,30 @@ class DataSourceScanExecRedactionSuite extends DataSourceScanRedactionTest {
    assert(isIncluded(df.queryExecution, "Location"))
  }
}

test("FileSourceScanExec metadata should contain limited file paths") {
Let's add the JIRA prefix SPARK-31793: to the test name, although the JIRA itself is an improvement.
@@ -55,10 +55,12 @@ trait DataSourceScanExec extends LeafExecNode {
  // Metadata that describes more details of this scan.
  protected def metadata: Map[String, String]

  protected val maxMetadataValueLength = 100
private?
No, it is used both in DataSourceScanExec and FileSourceScanExec
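A self-contained sketch (class names shortened, not the real Spark classes) of why `private` does not work here: a `private` trait member is not visible to subclasses, and the cap is read both in the base trait and in the concrete file scan node.

```scala
trait ScanExecLike {
  // Shared cap on how long a single metadata value may grow before being truncated.
  // `protected` so the concrete scan node below can read it; `private` would not compile there.
  protected val maxMetadataValueLength = 100

  protected def metadata: Map[String, String]

  // The trait itself also applies the cap when rendering metadata for explain/UI output.
  protected def truncate(value: String): String = value.take(maxMetadataValueLength)
}

class FileScanExecLike(paths: Seq[String]) extends ScanExecLike {
  // The subclass consults the same cap while building its "Location" entry.
  override protected def metadata: Map[String, String] =
    Map("Location" -> truncate(paths.mkString("[", ", ", "]")))
}
```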
Test build #123018 has finished for PR 28610 at commit
Test build #123016 has finished for PR 28610 at commit
Force-pushed from 4fe6be1 to 9d736ef
Test build #123021 has finished for PR 28610 at commit
@cloud-fan @HyukjinKwon @maropu Thanks for the review
Merging to master
What changes were proposed in this pull request?
Currently, the data source scan node stores all of its file paths in its metadata. The metadata is kept when a SparkPlan is converted into SparkPlanInfo, and SparkPlanInfo is used to construct the Spark plan graph in the UI.
However, the path list can be very large (for example, many partitions can remain even after partition pruning), while the UI only needs up to 100 bytes of the location metadata. We can reduce the number of paths stored in the metadata to cut memory usage.
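As a rough illustration, the scan node can build a bounded `Location` string instead of rendering every root path. The field and helper names below are taken from the review discussion above and are assumptions here, not the merged code; the fragment assumes it lives inside the scan node class.

```scala
// Sketch of the scan-node side; `relation` (a HadoopFsRelation) and
// `maxMetadataValueLength` are assumed to come from the surrounding class.
val locationDesc =
  relation.location.getClass.getSimpleName +
    Utils.buildLocationMetadata(relation.location.rootPaths, maxMetadataValueLength)

// The "Location" entry now has a bounded size rather than listing every partition path.
val metadata = Map(
  "Format" -> relation.fileFormat.toString,
  "Location" -> locationDesc)
```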
Why are the changes needed?
Reduce unnecessary memory cost.
In a heap dump of a driver, the SparkPlanInfo instances were quite large; this should be avoided.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests
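For illustration, a hedged sketch of the kind of check such a unit test can make, assuming the `buildLocationMetadata` helper sketched earlier (since `Utils` is `private[spark]`, this would have to run inside the org.apache.spark package); the paths and thresholds are made up.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.util.Utils

// With many partition paths, the metadata string should stay near the threshold
// instead of growing with the number of partitions.
val paths = (1 to 10000).map(i => new Path(s"/tmp/warehouse/tbl/part=$i"))
val meta = Utils.buildLocationMetadata(paths, 100) // threshold of ~100 characters

assert(meta.length < 200) // bounded: roughly the threshold plus at most one extra path
assert(meta.length < paths.mkString("[", ", ", "]").length) // far smaller than the full list
```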