I have searched in the issues and found no similar issues.
Describe the feature
Add metrics about block size distribution, on overall or on a per-application basis.
Motivation
In RSS, the size of the shuffle blocks generated by Spark tasks can vary greatly due to differences in configuration (memory, RDD partition count, etc.). Generally speaking, the larger the shuffle blocks a task generates, the more efficiently shuffle data is transferred from the Spark task to the shuffle server, and the smaller the index file is after the shuffle server persists the data, saving some disk space. During shuffle read, the entire index file is requested in a single read operation, so an overly large index file can negatively affect the shuffle read process. By collecting and utilizing the block size distribution, we can help optimize the configuration of Spark jobs and thereby achieve better performance.
Describe the solution
While the shuffle server caches jobs' shuffle partition data, synchronously record the block sizes (overall or per application) and collect them into a histogram-type Prometheus metric.
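The idea above can be sketched as follows. This is a minimal, self-contained illustration (the class and method names here are hypothetical, not Uniffle's actual code): block sizes observed while caching are counted into exponential buckets, which is the shape of data a Prometheus histogram metric with exponential bucket bounds would expose.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch of a block-size histogram with exponential buckets,
// mirroring what a Prometheus Histogram would collect on the shuffle server.
public class BlockSizeHistogram {
    // Bucket upper bounds: 1 KiB, 4 KiB, 16 KiB, ..., 64 MiB; one extra
    // slot at the end acts as the +Inf bucket.
    static final long[] UPPER_BOUNDS = {
        1L << 10, 1L << 12, 1L << 14, 1L << 16, 1L << 18,
        1L << 20, 1L << 22, 1L << 24, 1L << 26
    };

    private final AtomicLongArray buckets =
        new AtomicLongArray(UPPER_BOUNDS.length + 1);

    // Called once per shuffle block as it is cached on the shuffle server.
    public void observe(long blockSizeBytes) {
        int i = 0;
        while (i < UPPER_BOUNDS.length && blockSizeBytes > UPPER_BOUNDS[i]) {
            i++;
        }
        buckets.incrementAndGet(i); // falls through to +Inf if all bounds exceeded
    }

    // Cumulative count up to and including bucketIndex, the way Prometheus
    // exposes histogram buckets (each bucket counts everything <= its bound).
    public long cumulativeCount(int bucketIndex) {
        long sum = 0;
        for (int i = 0; i <= bucketIndex; i++) {
            sum += buckets.get(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        BlockSizeHistogram h = new BlockSizeHistogram();
        h.observe(512);        // <= 1 KiB bucket
        h.observe(8_192);      // <= 16 KiB bucket
        h.observe(200L << 20); // 200 MiB, larger than every bound -> +Inf
        System.out.println(h.cumulativeCount(0));                  // 1
        System.out.println(h.cumulativeCount(UPPER_BOUNDS.length)); // 3
    }
}
```

Exponential bounds suit this metric because block sizes span several orders of magnitude; in a real implementation the recording would simply be a `histogram.observe(size)` call on whichever Prometheus client the server's metrics system already uses.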
Additional context
No response
Are you willing to submit PR?
Yes I am willing to submit a PR!
…rt metrics (#1593)
### What changes were proposed in this pull request?
Added a shuffle block size metric of type histogram.
### Why are the changes needed?
Related feature: #1585.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New unit tests.
Co-authored-by: leslizhang <leslizhang@tencent.com>
Co-authored-by: Enrico Minack <github@enrico.minack.dev>