[ISSUE-309][FEATURE] Support ShuffleServer latency metrics. #327
Conversation
@@ -119,4 +138,12 @@ public Gauge getGaugeGrpcOpen() {
  public Counter getCounterGrpcTotal() {
    return counterGrpcTotal;
  }

  public Map<String, Summary> getSendTimeSummaryMap() {
What's the meaning of `sendTime`?
Could we have a better name? Transport time?
It is the time interval from when the client sends the request to when ShuffleServerGrpcService receives it.
Could we have a better name? Transport time?
Sounds great!
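The naming thread above settles on "transport time": the interval between the client stamping the request and the server receiving it, tracked per gRPC method. A minimal self-contained sketch of that idea follows; this is NOT the actual `ShuffleServerGrpcMetrics` class (which registers Prometheus `Summary` objects through a metrics manager), just an illustration of a per-method latency accumulator keyed by method name.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Sketch only: one latency accumulator per gRPC method, keyed by method
// name, mirroring the transportTimeSummaryMap this PR introduces. The
// real implementation records into Prometheus Summary objects instead.
public class GrpcTransportMetrics {
  public static final String SEND_SHUFFLE_DATA_METHOD = "sendShuffleData";

  private final Map<String, LongAdder> totalMs = new ConcurrentHashMap<>();
  private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

  // Record one observed transport time (in ms) for the given method.
  public void recordTransportTime(String method, long durationMs) {
    totalMs.computeIfAbsent(method, k -> new LongAdder()).add(durationMs);
    counts.computeIfAbsent(method, k -> new LongAdder()).increment();
  }

  // Average transport time in ms; 0 if nothing has been recorded yet.
  public double averageMs(String method) {
    LongAdder n = counts.get(method);
    if (n == null || n.sum() == 0) {
      return 0.0;
    }
    return (double) totalMs.get(method).sum() / n.sum();
  }
}
```

A `Summary` is preferable to a plain average in production because it also exposes quantiles, which is what makes the "latency rises to 10s under load" observation visible.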
@@ -186,6 +186,11 @@ public void sendShuffleData(SendShuffleDataRequest req,
    String appId = req.getAppId();
    int shuffleId = req.getShuffleId();
    long requireBufferId = req.getRequireBufferId();
    long sendTime = req.getSendTime();
    if (sendTime > 0) {
      shuffleServer.getGrpcMetrics().recordSendTime(ShuffleServerGrpcMetrics.SEND_SHUFFLE_DATA_METHOD,
Do we need to consider the data size when we calculate the metrics?
I don't think the amount of data will cause great fluctuations in latency. For example, 100K costs 1ms and 1M costs 10ms; that looks like normal fluctuation, but latency may rise to 10s when the server load is high (according to observations in our production environment). Of course, if we want to account for the amount of data, we could divide the sending time by the data size. Do you have any better suggestions?
Makes sense. Could we add some comments to explain why we don't choose to use the size of the data?
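The exchange above floats, but rejects, normalizing by payload size. For contrast, here is a hedged sketch of that rejected alternative; the class and method names are hypothetical and this is not part of the PR:

```java
// Hypothetical alternative discussed in review but NOT adopted by the PR:
// normalize transport time by payload size (ms per MB) so that large
// requests don't look artificially slow compared with small ones.
public class NormalizedLatency {
  public static double msPerMegabyte(long transportMs, long payloadBytes) {
    if (payloadBytes <= 0) {
      // Fall back to raw latency when the payload size is unknown or empty.
      return (double) transportMs;
    }
    return transportMs / (payloadBytes / (1024.0 * 1024.0));
  }
}
```

The PR keeps raw latency instead, because the reviewers' point stands: under high server load the latency blows up by orders of magnitude regardless of payload size, so raw transport time is the more useful load signal.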
Codecov Report
@@ Coverage Diff @@
## master #327 +/- ##
============================================
- Coverage 61.21% 61.08% -0.14%
- Complexity 1506 1507 +1
============================================
Files 185 185
Lines 9360 9405 +45
Branches 908 914 +6
============================================
+ Hits 5730 5745 +15
- Misses 3325 3355 +30
Partials 305 305
…uniffle into latency_metrics merge.
transportTimeSummaryMap.putIfAbsent(GET_SHUFFLE_DATA_METHOD,
    metricsManager.addSummary(GRPC_GET_SHUFFLE_DATA_SEND_LATENCY));
transportTimeSummaryMap.putIfAbsent(GET_IN_MEMORY_SHUFFLE_DATA_METHOD,
    metricsManager.addSummary(GRPC_GET_IN_MEMORY_SHUFFLE_DATA_SEND_LATENCY));
SEND -> TRANSPORT?
proto/src/main/proto/Rss.proto
Outdated
@@ -76,6 +76,7 @@ message GetLocalShuffleDataRequest {
  int32 partitionNum = 5;
  int64 offset = 6;
  int32 length = 7;
  int64 sendTime = 8;
Could we give a better name?
proto/src/main/proto/Rss.proto
Outdated
@@ -90,6 +91,7 @@ message GetMemoryShuffleDataRequest {
  int32 partitionId = 3;
  int64 lastBlockId = 4;
  int32 readBufferSize = 5;
  int64 sendTime = 6;
Time is a duration. This should be timestamp.
 * The amount of data will not cause great fluctuations in latency. For example, 100K costs 1ms,
 * and 1M costs 10ms. This seems like a normal fluctuation, but it may rise to 10s when the server load is high.
 */
shuffleServer.getGrpcMetrics().recordTransportTime(ShuffleServerGrpcMetrics.SEND_SHUFFLE_DATA_METHOD,
`System.currentTimeMillis() - sendTime` may be less than 0, because the two timestamps are generated on different machines.
Does the negative number influence our metrics?
Perhaps we can add a comment stating that the clocks of the client machine and the server machine should be in sync. For values less than 0 we apply a filter, but out-of-sync clocks will still affect the metrics.
Perhaps we can add a comment stating that the clocks of the client machine and the server machine should be in sync. For values less than 0 we apply a filter, but out-of-sync clocks will still affect the metrics.
OK. Actually we should have a document that tells users how to use the metrics and what to watch out for. But we don't have such documentation yet, so we can only add some comments.
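The clock-skew filter discussed above can be reduced to a simple guard. A sketch, under the assumption (consistent with the diff) that the client stamps `System.currentTimeMillis()` into the request's `sendTime` field and the server computes the difference on arrival; the class and method names here are illustrative, not the PR's:

```java
// Sketch of the negative-value filter discussed above: the send and
// receive timestamps come from two different machines, so the difference
// can be negative when clocks drift. Such samples are discarded rather
// than recorded, but skew smaller than the real latency still inflates
// or deflates the metric silently, hence the suggested code comment.
public class TransportTime {
  // Returns the transport time in ms, or -1 when the sample should be
  // dropped because of clock skew between client and server.
  public static long compute(long clientSendTimestampMs, long serverReceiveTimestampMs) {
    long transport = serverReceiveTimestampMs - clientSendTimestampMs;
    return transport >= 0 ? transport : -1;
  }
}
```

This matches the `if (sendTime > 0)` guard visible in the diff in spirit: both checks keep obviously invalid samples out of the Summary instead of trying to correct them.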
LGTM, wait for CI, thanks @leixm
What changes were proposed in this pull request?
For #309, support ShuffleServer latency metrics.
Why are the changes needed?
Accurately determine whether the current service load is causing large delays for the client's reads and writes.
Does this PR introduce any user-facing change?
No
How was this patch tested?
UT