New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-4923] [Metrics] Expose input/output buffers and bufferPool usa… #2693
Conversation
3beb65b
to
e6edf01
Compare
@@ -220,4 +220,10 @@ boolean registerListener(NotificationListener listener) { | |||
throw new IllegalStateException("Already registered listener."); | |||
} | |||
} | |||
|
|||
public int getNumberOfQueuedBuffers() { | |||
synchronized (buffers) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid synchronization here? The metrics should never influence (block) the other code.
I would rather have the metrics be one off once in a while and avoid the lock here.
|
||
for (ResultSubpartition subpartition : subpartitions) { | ||
|
||
if (subpartition instanceof PipelinedSubpartition) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you add the getNumberOfQueuedBuffers()
method to ResultSubpartition
then you do not need to check and cast here.
@@ -227,4 +227,10 @@ public String toString() { | |||
getTotalNumberOfBuffers(), getTotalNumberOfBytes(), isFinished, readView != null, | |||
spillWriter != null); | |||
} | |||
|
|||
public int getNumberOfQueuedBuffers() { | |||
synchronized (buffers) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, try to avoid synchronization.
public int getNumberOfQueuedBuffers() { | ||
int totalBuffers = 0; | ||
|
||
for (Map.Entry<IntermediateResultPartitionID, InputChannel> entry: inputChannels.entrySet()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this map may change asynchronously, so it is probably better to catch exceptions here and re-try for some times. It it fails three times, return -1
for "unknown"
// metrics | ||
// ------------------------------------------------------------------------ | ||
|
||
private class InputBuffersGauge implements Gauge<Integer> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you pull these classes out into a separate file? The Task
class is very large already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had pull these classes out into TaskIOMetricGroup since the buffers metrics is Task I/O scope
Good addition, with some request for changes. I would suggest some name changes for the Gauges:
|
…ge as a metric for a Task
e6edf01
to
986d96f
Compare
This looks good, thanks! |
…s and bufferPool usage as a metrics This closes apache#2693
@@ -347,6 +348,12 @@ public Task( | |||
|
|||
// finally, create the executing thread, but do not start it | |||
executingThread = new Thread(TASK_THREADS_GROUP, this, taskNameWithSubtask); | |||
|
|||
// add metrics for buffers | |||
this.metrics.getIOMetricGroup().getBuffersGroup().gauge("inputQueueLength", new TaskIOMetricGroup.InputBuffersGauge(this)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could have a static iniitializer method in the TaskIOMetricGroup for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i mean one method that initializes all gauges. it is a simple way of consolidating the metrics in one place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have rebased merged this already.
How about doing that in a followup patch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stephan, have you merged this already? I will do that in a patch after discussion with Chesnay if you had merged it.
…s and bufferPool usage as a metrics This closes apache#2693
…s and bufferPool usage as a metrics This closes apache#2693
…s and bufferPool usage as a metrics This closes apache#2693
…s and bufferPool usage as a metrics This closes apache#2693
The buffers and buffer usage of Input/output bufferPool for a task reflect wether a task congestion.
So we expose the following Metrics as a Buffers MetricGroup on the TaskIOMetricGroup, all these metrics is a gauge.