[ISSUE-309][FEATURE] Support ShuffleServer latency metrics. #327
Conversation
@@ -119,4 +138,12 @@ public Gauge getGaugeGrpcOpen() {
  public Counter getCounterGrpcTotal() {
    return counterGrpcTotal;
  }

  public Map<String, Summary> getSendTimeSummaryMap() {
What's the meaning of `sendTime`?
Could we have a better name? Transport time?
It is the time interval from when the client sends the request to when ShuffleServerGrpcService receives it.
Could we have a better name? Transport time?
Sounds great!
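The naming thread above settles on "transport time": the interval between the client stamping the request and the server receiving it, tracked per gRPC method. A minimal self-contained sketch of that idea follows; this is NOT the actual `ShuffleServerGrpcMetrics` class (which registers Prometheus `Summary` objects through a metrics manager), just an illustration of a per-method latency accumulator keyed by method name.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Sketch only: one latency accumulator per gRPC method, keyed by method
// name, mirroring the transportTimeSummaryMap this PR introduces. The
// real implementation records into Prometheus Summary objects instead.
public class GrpcTransportMetrics {
  public static final String SEND_SHUFFLE_DATA_METHOD = "sendShuffleData";

  private final Map<String, LongAdder> totalMs = new ConcurrentHashMap<>();
  private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

  // Record one observed transport time (in ms) for the given method.
  public void recordTransportTime(String method, long durationMs) {
    totalMs.computeIfAbsent(method, k -> new LongAdder()).add(durationMs);
    counts.computeIfAbsent(method, k -> new LongAdder()).increment();
  }

  // Average transport time in ms; 0 if nothing has been recorded yet.
  public double averageMs(String method) {
    LongAdder n = counts.get(method);
    if (n == null || n.sum() == 0) {
      return 0.0;
    }
    return (double) totalMs.get(method).sum() / n.sum();
  }
}
```

A `Summary` is preferable to a plain average in production because it also exposes quantiles, which is what makes the "latency rises to 10s under load" observation visible.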
@@ -186,6 +186,11 @@ public void sendShuffleData(SendShuffleDataRequest req,
    String appId = req.getAppId();
    int shuffleId = req.getShuffleId();
    long requireBufferId = req.getRequireBufferId();
    long sendTime = req.getSendTime();
    if (sendTime > 0) {
      shuffleServer.getGrpcMetrics().recordSendTime(ShuffleServerGrpcMetrics.SEND_SHUFFLE_DATA_METHOD,
Do we need to consider the data size when we calculate the metrics?
I don't think the amount of data will cause great fluctuations in latency. For example, 100K costs 1ms and 1M costs 10ms; that looks like normal fluctuation, but latency may rise to 10s when the server load is high (according to observations in our production environment). Of course, if we want to account for the amount of data, we could divide the sending time by the data size. Do you have any better suggestions?
Makes sense. Could we add some comments to explain why we don't choose to use the size of the data?
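The exchange above floats, but rejects, normalizing by payload size. For contrast, here is a hedged sketch of that rejected alternative; the class and method names are hypothetical and this is not part of the PR:

```java
// Hypothetical alternative discussed in review but NOT adopted by the PR:
// normalize transport time by payload size (ms per MB) so that large
// requests don't look artificially slow compared with small ones.
public class NormalizedLatency {
  public static double msPerMegabyte(long transportMs, long payloadBytes) {
    if (payloadBytes <= 0) {
      // Fall back to raw latency when the payload size is unknown or empty.
      return (double) transportMs;
    }
    return transportMs / (payloadBytes / (1024.0 * 1024.0));
  }
}
```

The PR keeps raw latency instead, because the reviewers' point stands: under high server load the latency blows up by orders of magnitude regardless of payload size, so raw transport time is the more useful load signal.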
Codecov Report
@@ Coverage Diff @@
## master #327 +/- ##
============================================
- Coverage 61.21% 61.08% -0.14%
- Complexity 1506 1507 +1
============================================
Files 185 185
Lines 9360 9405 +45
Branches 908 914 +6
============================================
+ Hits 5730 5745 +15
- Misses 3325 3355 +30
Partials 305 305
…uniffle into latency_metrics merge.
transportTimeSummaryMap.putIfAbsent(GET_SHUFFLE_DATA_METHOD,
    metricsManager.addSummary(GRPC_GET_SHUFFLE_DATA_SEND_LATENCY));
transportTimeSummaryMap.putIfAbsent(GET_IN_MEMORY_SHUFFLE_DATA_METHOD,
    metricsManager.addSummary(GRPC_GET_IN_MEMORY_SHUFFLE_DATA_SEND_LATENCY));
SEND -> TRANSPORT?
proto/src/main/proto/Rss.proto
Outdated
@@ -76,6 +76,7 @@ message GetLocalShuffleDataRequest {
  int32 partitionNum = 5;
  int64 offset = 6;
  int32 length = 7;
  int64 sendTime = 8;
Could we give a better name?
proto/src/main/proto/Rss.proto
Outdated
@@ -90,6 +91,7 @@ message GetMemoryShuffleDataRequest {
  int32 partitionId = 3;
  int64 lastBlockId = 4;
  int32 readBufferSize = 5;
  int64 sendTime = 6;
Time is a duration. This should be timestamp.
 * The amount of data will not cause great fluctuations in latency. For example, 100K costs 1ms,
 * and 1M costs 10ms. This seems like a normal fluctuation, but it may rise to 10s when the server load is high.
 */
shuffleServer.getGrpcMetrics().recordTransportTime(ShuffleServerGrpcMetrics.SEND_SHUFFLE_DATA_METHOD,
`System.currentTimeMillis() - sendTime` may be less than 0, because the two timestamps are generated on different machines.
Does the negative number influence our metrics?
Perhaps we can add a comment stating that the clocks of the client machine and the server machine should be in sync. For values less than 0 we apply a filter, but out-of-sync clocks will still affect the metrics.
Perhaps we can add a comment stating that the clocks of the client machine and the server machine should be in sync. For values less than 0 we apply a filter, but out-of-sync clocks will still affect the metrics.
OK. Actually we should have a document that tells users how to use the metrics and what to watch out for. But we don't have such documentation yet, so we can only add some comments.
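The clock-skew filter discussed above can be reduced to a simple guard. A sketch, under the assumption (consistent with the diff) that the client stamps `System.currentTimeMillis()` into the request's `sendTime` field and the server computes the difference on arrival; the class and method names here are illustrative, not the PR's:

```java
// Sketch of the negative-value filter discussed above: the send and
// receive timestamps come from two different machines, so the difference
// can be negative when clocks drift. Such samples are discarded rather
// than recorded, but skew smaller than the real latency still inflates
// or deflates the metric silently, hence the suggested code comment.
public class TransportTime {
  // Returns the transport time in ms, or -1 when the sample should be
  // dropped because of clock skew between client and server.
  public static long compute(long clientSendTimestampMs, long serverReceiveTimestampMs) {
    long transport = serverReceiveTimestampMs - clientSendTimestampMs;
    return transport >= 0 ? transport : -1;
  }
}
```

This matches the `if (sendTime > 0)` guard visible in the diff in spirit: both checks keep obviously invalid samples out of the Summary instead of trying to correct them.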
LGTM, wait for CI, thanks @leixm
What changes were proposed in this pull request?
For #309, support ShuffleServer latency metrics.
Why are the changes needed?
Accurately determine whether the current service load is causing large delays for the client's reads and writes.
Does this PR introduce any user-facing change?
No
How was this patch tested?
UT