[FEATURE] Add metrics about block size distribution, either overall or on a per-application basis. #1585

Closed
3 tasks done
leslizhang opened this issue Mar 15, 2024 · 2 comments

Comments

@leslizhang
Contributor

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the feature

Add metrics on the shuffle block size distribution, either overall or on a per-application basis.

Motivation

In RSS, the size of the shuffle blocks generated by Spark tasks varies widely with job configuration (memory, RDD partition count, etc.). In general, the larger the blocks a task produces, the more efficiently shuffle data is transferred from the Spark task to the shuffle server. Larger blocks also save some disk space once the shuffle server persists the data, because the index file is smaller. Index file size matters for reads as well: during shuffle read, the entire index file is requested in a single read operation, so an index file that is too large can slow the read down. Collecting the block size distribution would help us tune Spark job configurations for better performance.
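
To make the index-file argument concrete, here is a rough back-of-the-envelope sketch. It assumes the index file holds one fixed-size entry per block; the 40-byte entry size, the 1 GiB partition, and the helper name `index_file_bytes` are illustrative assumptions, not figures taken from this issue.

```python
# Rough estimate of the index file size for a single partition.
# ASSUMPTION: every block adds one fixed-size index entry; 40 bytes is illustrative.
INDEX_ENTRY_BYTES = 40
GIB = 1024 ** 3


def index_file_bytes(partition_bytes: int, block_bytes: int) -> int:
    """Index size when the partition is written as blocks of block_bytes each."""
    num_blocks = -(-partition_bytes // block_bytes)  # ceiling division
    return num_blocks * INDEX_ENTRY_BYTES


if __name__ == "__main__":
    # Same 1 GiB partition, three hypothetical block sizes.
    for block_kib in (4, 64, 1024):
        size = index_file_bytes(GIB, block_kib * 1024)
        print(f"{block_kib:>5} KiB blocks -> index of about {size / 1024:.0f} KiB")
```

Under these assumptions, the index shrinks in direct proportion to the block size, which is why many small blocks can make the single-read index fetch expensive.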

Describe the solution

When the shuffle server caches a job's shuffle partition data, synchronously record the size of each block, either overall or per application, and report it as a Prometheus histogram metric.
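
A minimal sketch of the idea, in plain Python with no dependencies. It mimics the cumulative-bucket semantics of a Prometheus histogram (each bucket counts observations less than or equal to its upper bound, plus a final +Inf bucket, a sum, and a count). The bucket bounds, the class `BlockSizeHistogram`, and the hook `on_block_cached` are hypothetical names chosen for illustration; a real shuffle server would more likely use the histogram type of a Prometheus client library.

```python
from bisect import bisect_left
from collections import defaultdict

# ASSUMPTION: bucket upper bounds (bytes) are illustrative, not from the issue.
BUCKETS = [1 << 10, 16 << 10, 256 << 10, 1 << 20, 16 << 20]


class BlockSizeHistogram:
    """Tiny histogram with Prometheus-style cumulative buckets."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket
        self.total = 0   # sum of all observed sizes
        self.count = 0   # number of observations

    def observe(self, size):
        # bisect_left returns the first bound >= size, i.e. the smallest
        # bucket whose upper bound still covers this observation.
        self.counts[bisect_left(self.bounds, size)] += 1
        self.total += size
        self.count += 1

    def cumulative(self):
        """Map each upper bound to the number of observations <= that bound."""
        result, running = {}, 0
        for bound, c in zip(self.bounds + [float("inf")], self.counts):
            running += c
            result[bound] = running
        return result


overall = BlockSizeHistogram(BUCKETS)
per_app = defaultdict(lambda: BlockSizeHistogram(BUCKETS))


def on_block_cached(app_id, block_size):
    """Hypothetical hook, called synchronously when a block is cached."""
    overall.observe(block_size)
    per_app[app_id].observe(block_size)


# Example: four blocks from two applications.
on_block_cached("app-1", 512)
on_block_cached("app-1", 2 << 20)
on_block_cached("app-2", 1024)
on_block_cached("app-2", 64 << 20)
print(overall.cumulative())
```

Keeping one overall histogram plus one per application costs a label (or map entry) per running application, so the per-application variant may need to be cleaned up when an application finishes.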

Additional context

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
@rickyma
Contributor

rickyma commented Mar 15, 2024

The PR for this feature will be created only after #1574 is merged, because both PRs modify the same files. @jerqi @zuston

jerqi pushed a commit that referenced this issue Apr 30, 2024
…rt metrics (#1593)

### What changes were proposed in this pull request?

Added a shuffle block size metric of type histogram.

### Why are the changes needed?
Related feature: #1585

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

new UTs.

Co-authored-by: leslizhang <leslizhang@tencent.com>
Co-authored-by: Enrico Minack <github@enrico.minack.dev>
@rickyma
Contributor

rickyma commented May 9, 2024

@jerqi @zuston We can close this.

@zuston zuston closed this as completed May 9, 2024