I have searched in the issues and found no similar issues.
Describe the feature
Add metrics about block size distribution, on overall or on a per-application basis.
Motivation
In RSS, the size of the shuffle blocks generated by Spark tasks can vary greatly due to differences in configuration (memory, RDD partition count, etc.). Generally speaking, the larger the shuffle blocks a task generates, the more efficiently shuffle data is transferred from the Spark task to the shuffle server, and the smaller the index file is after the shuffle server persists the data, saving some disk space. During shuffle read, the entire index file is requested in a single read operation, so an overly large index file can negatively affect the shuffle read process. By collecting and utilizing the block size distribution, we can help optimize the configuration of Spark jobs and thereby achieve better performance.
Describe the solution
While the shuffle server caches jobs' shuffle partition data, synchronously record the block sizes (overall or per application) and collect them into a histogram-type Prometheus metric.
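The idea above can be sketched as follows. This is a minimal, self-contained illustration (the class and method names here are hypothetical, not Uniffle's actual code): block sizes observed while caching are counted into exponential buckets, which is the shape of data a Prometheus histogram metric with exponential bucket bounds would expose.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch of a block-size histogram with exponential buckets,
// mirroring what a Prometheus Histogram would collect on the shuffle server.
public class BlockSizeHistogram {
    // Bucket upper bounds: 1 KiB, 4 KiB, 16 KiB, ..., 64 MiB; one extra
    // slot at the end acts as the +Inf bucket.
    static final long[] UPPER_BOUNDS = {
        1L << 10, 1L << 12, 1L << 14, 1L << 16, 1L << 18,
        1L << 20, 1L << 22, 1L << 24, 1L << 26
    };

    private final AtomicLongArray buckets =
        new AtomicLongArray(UPPER_BOUNDS.length + 1);

    // Called once per shuffle block as it is cached on the shuffle server.
    public void observe(long blockSizeBytes) {
        int i = 0;
        while (i < UPPER_BOUNDS.length && blockSizeBytes > UPPER_BOUNDS[i]) {
            i++;
        }
        buckets.incrementAndGet(i); // falls through to +Inf if all bounds exceeded
    }

    // Cumulative count up to and including bucketIndex, the way Prometheus
    // exposes histogram buckets (each bucket counts everything <= its bound).
    public long cumulativeCount(int bucketIndex) {
        long sum = 0;
        for (int i = 0; i <= bucketIndex; i++) {
            sum += buckets.get(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        BlockSizeHistogram h = new BlockSizeHistogram();
        h.observe(512);        // <= 1 KiB bucket
        h.observe(8_192);      // <= 16 KiB bucket
        h.observe(200L << 20); // 200 MiB, larger than every bound -> +Inf
        System.out.println(h.cumulativeCount(0));                  // 1
        System.out.println(h.cumulativeCount(UPPER_BOUNDS.length)); // 3
    }
}
```

Exponential bounds suit this metric because block sizes span several orders of magnitude; in a real implementation the recording would simply be a `histogram.observe(size)` call on whichever Prometheus client the server's metrics system already uses.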
Additional context
No response
Are you willing to submit PR?
Yes I am willing to submit a PR!
…rt metrics (#1593)
### What changes were proposed in this pull request?
Added a shuffle block size metric of type histogram.
### Why are the changes needed?
Related feature: #1585.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New unit tests.
Co-authored-by: leslizhang <leslizhang@tencent.com>
Co-authored-by: Enrico Minack <github@enrico.minack.dev>