[ISSUE-309][FEATURE] Support ShuffleServer latency metrics. #327

Merged (7 commits) on Nov 16, 2022

Contributor

@leixm leixm commented Nov 16, 2022

What changes were proposed in this pull request?

For #309, support ShuffleServer latency metrics.

Why are the changes needed?

To determine accurately whether the current server load is causing large delays in the client's reads and writes.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests (UT).

@@ -119,4 +138,12 @@ public Gauge getGaugeGrpcOpen() {
  public Counter getCounterGrpcTotal() {
    return counterGrpcTotal;
  }

  public Map<String, Summary> getSendTimeSummaryMap() {

Contributor

What's the meaning of sendTime?

Contributor

@jerqi jerqi Nov 16, 2022

Could we have a better name? Transport time?

Contributor Author

It is the time between the client sending the request and ShuffleServerGrpcService receiving it.

Contributor Author

> Could we have a better name? Transport time?

Sounds great!
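
As a minimal sketch of the metric discussed in this thread: the transport time is the server's receive time minus the timestamp the client put into the request. The class and method names below are hypothetical and are not Uniffle's API. The only detail taken from the PR is the `sendTime > 0` check; the sketch assumes an unset proto3 `int64` field arrives as 0, so such requests are skipped.

```java
import java.util.OptionalLong;

/** Hypothetical helper for illustration; not part of Uniffle. */
final class TransportTimeSketch {
  private TransportTimeSketch() {}

  /**
   * Returns the milliseconds between the client's send timestamp and the
   * server's receive timestamp, or empty when the client did not set one.
   */
  static OptionalLong transportTimeMillis(long clientSendTimestamp, long serverReceiveTimestamp) {
    if (clientSendTimestamp <= 0) {
      // Assumption: 0 means the client never populated the field.
      return OptionalLong.empty();
    }
    return OptionalLong.of(serverReceiveTimestamp - clientSendTimestamp);
  }

  public static void main(String[] args) {
    long sent = System.currentTimeMillis(); // stamped on the client
    long received = sent + 12;              // simulated server receive time
    System.out.println(transportTimeMillis(sent, received));
    System.out.println(transportTimeMillis(0L, received));
  }
}
```

The value includes queueing on the server before the handler runs, which is why it can serve as a rough load signal.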

@@ -186,6 +186,11 @@ public void sendShuffleData(SendShuffleDataRequest req,
    String appId = req.getAppId();
    int shuffleId = req.getShuffleId();
    long requireBufferId = req.getRequireBufferId();
    long sendTime = req.getSendTime();
    if (sendTime > 0) {
      shuffleServer.getGrpcMetrics().recordSendTime(ShuffleServerGrpcMetrics.SEND_SHUFFLE_DATA_METHOD,

Contributor

Do we need to consider the data size when we calculate the metrics?

Contributor Author

I don't think the amount of data causes large fluctuations in latency. For example, 100K costs 1ms and 1M costs 10ms, which looks like normal fluctuation, whereas latency may rise to 10s when the server load is high (according to observations in the production environment). Of course, if we want to account for the amount of data, we could divide the sending time by the amount of data. Do you have any better suggestions?

Contributor

Makes sense. Could we add a comment explaining why we chose not to use the data size?
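
For reference, the alternative mentioned in this thread (dividing the sending time by the amount of data) could look roughly like the sketch below. It is not what the PR does; the class name is hypothetical, and reading the author's "100K" and "1M" as 100 KiB and 1 MiB is an assumption.

```java
/** Hypothetical illustration of size-normalized latency; not part of Uniffle. */
final class LatencyPerSizeSketch {
  private LatencyPerSizeSketch() {}

  /** Milliseconds of latency per KiB of payload; NaN when the payload size is unknown. */
  static double millisPerKib(long latencyMillis, long payloadBytes) {
    if (payloadBytes <= 0) {
      return Double.NaN;
    }
    return latencyMillis / (payloadBytes / 1024.0);
  }

  public static void main(String[] args) {
    // Figures quoted in the review thread, with sizes read as KiB/MiB (assumption).
    System.out.println(millisPerKib(1, 100 * 1024));      // 100K in 1 ms
    System.out.println(millisPerKib(10, 1024 * 1024));    // 1M in 10 ms
    System.out.println(millisPerKib(10_000, 1024 * 1024)); // 1M in 10 s under load
  }
}
```

With the quoted figures, the two normal cases come out at about 0.01 ms per KiB, while the loaded case is roughly a thousand times higher. This agrees with the author's point that load, not payload size, dominates the signal.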

@codecov-commenter

codecov-commenter commented Nov 16, 2022

Codecov Report

Merging #327 (55719de) into master (eae2621) will decrease coverage by 0.13%.
The diff coverage is 31.91%.

@@             Coverage Diff              @@
##             master     #327      +/-   ##
============================================
- Coverage     61.21%   61.08%   -0.14%     
- Complexity     1506     1507       +1     
============================================
  Files           185      185              
  Lines          9360     9405      +45     
  Branches        908      914       +6     
============================================
+ Hits           5730     5745      +15     
- Misses         3325     3355      +30     
  Partials        305      305              
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...java/org/apache/uniffle/common/util/Constants.java | 0.00% <ø> (ø) | |
| ...pache/uniffle/server/ShuffleServerGrpcService.java | 0.83% <0.00%> (-0.03%) | ⬇️ |
| ...org/apache/uniffle/common/metrics/GRPCMetrics.java | 40.00% <16.66%> (-6.52%) | ⬇️ |
| .../apache/uniffle/common/metrics/MetricsManager.java | 68.42% <20.00%> (-17.30%) | ⬇️ |
| ...pache/uniffle/server/ShuffleServerGrpcMetrics.java | 100.00% <100.00%> (ø) | |

@jerqi changed the title from "[FEATURE] Support ShuffleServer latency metrics." to "[ISSUE-309][FEATURE] Support ShuffleServer latency metrics." on Nov 16, 2022

    transportTimeSummaryMap.putIfAbsent(GET_SHUFFLE_DATA_METHOD,
        metricsManager.addSummary(GRPC_GET_SHUFFLE_DATA_SEND_LATENCY));
    transportTimeSummaryMap.putIfAbsent(GET_IN_MEMORY_SHUFFLE_DATA_METHOD,
        metricsManager.addSummary(GRPC_GET_IN_MEMORY_SHUFFLE_DATA_SEND_LATENCY));

Contributor

SEND -> TRANSPORT?
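
The snippet under review keeps one summary per gRPC method in a map keyed by method name. A self-contained sketch of that pattern follows. It uses `java.util.LongSummaryStatistics` as a stand-in for the Prometheus `Summary`, and every class and method name in it is hypothetical rather than taken from Uniffle.

```java
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a per-method summary map; not Uniffle's GRPCMetrics. */
final class TransportTimeRegistrySketch {
  private final Map<String, LongSummaryStatistics> transportTimeByMethod =
      new ConcurrentHashMap<>();

  /** Registers a method once; repeated calls keep the existing statistics. */
  void register(String methodName) {
    transportTimeByMethod.computeIfAbsent(methodName, name -> new LongSummaryStatistics());
  }

  /** Records one sample; returns false if the method was never registered. */
  boolean recordTransportTime(String methodName, long transportTimeMillis) {
    LongSummaryStatistics stats = transportTimeByMethod.get(methodName);
    if (stats == null) {
      return false;
    }
    synchronized (stats) { // LongSummaryStatistics itself is not thread-safe
      stats.accept(transportTimeMillis);
    }
    return true;
  }

  /** Returns a copy of the statistics for a method, or null if it is unregistered. */
  LongSummaryStatistics snapshot(String methodName) {
    LongSummaryStatistics stats = transportTimeByMethod.get(methodName);
    if (stats == null) {
      return null;
    }
    synchronized (stats) {
      LongSummaryStatistics copy = new LongSummaryStatistics();
      copy.combine(stats);
      return copy;
    }
  }
}
```

The sketch uses `computeIfAbsent` so that the value is only created when the key is missing; with `putIfAbsent`, the value expression is always evaluated before the call, whether or not the key already exists.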

@@ -76,6 +76,7 @@ message GetLocalShuffleDataRequest {
  int32 partitionNum = 5;
  int64 offset = 6;
  int32 length = 7;
  int64 sendTime = 8;

Contributor

Could we give this field a better name?

@@ -90,6 +91,7 @@ message GetMemoryShuffleDataRequest {
  int32 partitionId = 3;
  int64 lastBlockId = 4;
  int32 readBufferSize = 5;
  int64 sendTime = 6;

Contributor

@jerqi jerqi Nov 16, 2022

"Time" suggests a duration. This field holds a point in time, so it should be named timestamp.

* The amount of data will not cause great fluctuations in latency. For example, 100K costs 1ms,
* and 1M costs 10ms. This seems like a normal fluctuation, but it may rise to 10s when the server load is high.
* */
shuffleServer.getGrpcMetrics().recordTransportTime(ShuffleServerGrpcMetrics.SEND_SHUFFLE_DATA_METHOD,

Contributor

System.currentTimeMillis() - sendTime may be less than 0, because the two timestamps are generated on different machines.

Contributor

Would a negative value affect our metrics?

Contributor Author

Perhaps we can add a comment stating that the clocks of the client machine and the server machine should be in sync. Values less than 0 are filtered out by a check, but clocks that are out of sync will still affect the metrics.

Contributor

> Perhaps we can add a comment stating that the clocks of the client machine and the server machine should be in sync. Values less than 0 are filtered out by a check, but clocks that are out of sync will still affect the metrics.

OK. Ideally we would have a document telling users how to use the metrics and what to watch out for. We don't have one yet, so for now we can only add comments.
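
The filter discussed in this thread can be sketched as follows. Because the two timestamps come from different machines, the difference can be negative when the client's clock runs ahead of the server's. The class is hypothetical, and counting the dropped samples is an addition made here for illustration; the thread only mentions filtering them.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical recorder that drops negative deltas caused by clock skew; not part of Uniffle. */
final class SkewAwareRecorderSketch {
  private final LongAdder acceptedSamples = new LongAdder();
  private final LongAdder acceptedMillis = new LongAdder();
  private final LongAdder droppedNegative = new LongAdder();

  /** Records server-now minus client-send, skipping unset timestamps and negative results. */
  void record(long clientSendTimestamp, long serverNowMillis) {
    if (clientSendTimestamp <= 0) {
      return; // client did not set the timestamp
    }
    long elapsed = serverNowMillis - clientSendTimestamp;
    if (elapsed < 0) {
      // Client clock is ahead of the server clock: the sample is meaningless.
      droppedNegative.increment();
      return;
    }
    acceptedSamples.increment();
    acceptedMillis.add(elapsed);
  }

  long acceptedSamples() {
    return acceptedSamples.sum();
  }

  long acceptedMillis() {
    return acceptedMillis.sum();
  }

  long droppedNegative() {
    return droppedNegative.sum();
  }
}
```

As the author notes, this check only removes impossible values. A client clock that lags the server still inflates every sample it contributes, so the metric is only as good as the clock synchronization between the machines.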

@jerqi jerqi linked an issue Nov 16, 2022 that may be closed by this pull request
3 tasks
Contributor

@jerqi jerqi left a comment

LGTM, wait for CI, thanks @leixm

@jerqi jerqi merged commit 4004f44 into apache:master Nov 16, 2022