-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-4888][metrics] instantiated job manager metrics missing important job statistics #2683
Conversation
val archivedJobs : JobsOverview = Await.result(future, timeout).asInstanceOf[JobsOverview] | ||
finishedJobs += ourJobs.getNumJobsFinished() + archivedJobs.getNumJobsFinished() | ||
finishedJobs | ||
}}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zentol What is your take on this change?
I'm uncertain if doing RPC calls in gauges is a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally i would say no, since there is always the chance it may block for the full timeout duration.
So in this case, in theory, with the default timeout of 10 seconds, we could block the reporter thread for half a minute. Now this isn't very likely since we query the MemoryArchivist within the JM, but still.
I'm just wondering whether it makes sense to add this metric; with FLIP-6 around the corner, which will make it obsolete anyway.
if we merge it I would like to see some shared object so that we don't do the same RPC call 3 times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zentol , @rmetzger not sure if you meant something like the following perhaps to work around the RPC calls (I guess this would eliminate them completely however are the changes a bit more "invasive" I would think):
So what we could do is where we receive the Job status change
case JobStatusChanged(jobID, newJobStatus, timeStamp, error) =>
Send a RemoveJob message with the newJobStatus (could be "cancelled" as far as I can see)
self ! decorateMessage(RemoveJob(jobID, removeJobFromStateBackend = true, newJobStatus))
Then for the RemoveJob message propagate this to the method that actually removes the job:
removeJob(graph.getJobID, clearPersistedJob, newJobStatus)
And finally besides taking the job out of the map representing the currentJobs maybe increment a counter for canceled archived jobs or have a different map that stores archived canceled jobs (and others for failed and finished) so that when it is time for sending out the monitoring metrics we would use this map/counter rather than doing any RPC calls at all.
What do you guys think ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think that would result in too much additional bookkeeping for the JobManager; as i understand it we would effectively copy a lot of the Archivist functionality into the JobManager.
One simple way to reduce the # of RPC calls from 3 to 1 is something like this:
public class RpcSavingContainer {
private JobsOverview overview;
private int count = 0;
public synchronized JobsOverview getOverview() {
if (count == 0) {
overview = // result of RPC call
}
count = count++ % 3;
return overview;
}
}
Each gauge would use the same instance of this container.
Another option would be to instantiate these metrics in the MemoryArchivist and not the JobManager.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think doing everything in one request and setting a short timeout (1 second?) is a good solution.
Thank you for opening a pull request. I agree that we should expose all numbers we show in the web interface as a metric as well. |
Abandoned |
No description provided.