Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-4888][metrics] instantiated job manager metrics missing important job statistics #2683

Closed
wants to merge 1 commit into from

Conversation

philippbussche
Copy link

No description provided.

val archivedJobs : JobsOverview = Await.result(future, timeout).asInstanceOf[JobsOverview]
finishedJobs += ourJobs.getNumJobsFinished() + archivedJobs.getNumJobsFinished()
finishedJobs
}})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zentol What is your take on this change?
I'm uncertain if doing RPC calls in gauges is a good idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally i would say no, since there is always the chance it may block for the full timeout duration.
So in this case, in theory, with the default timeout of 10 seconds, we could block the reporter thread for half a minute. Now this isn't very likely since we query the MemoryArchivist within the JM, but still.

I'm just wondering whether it makes sense to add this metric; with FLIP-6 around the corner, which will make it obsolete anyway.

if we merge it I would like to see some shared object so that we don't do the same RPC call 3 times.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zentol , @rmetzger not sure if you meant something like the following perhaps to work around the RPC calls (I guess this would eliminate them completely however are the changes a bit more "invasive" I would think):
So what we could do is where we receive the Job status change

case JobStatusChanged(jobID, newJobStatus, timeStamp, error) =>

Send a RemoveJob message with the newJobStatus (could be "cancelled" as far as I can see)

self ! decorateMessage(RemoveJob(jobID, removeJobFromStateBackend = true, newJobStatus))

Then for the RemoveJob message propagate this to the method that actually removes the job:

removeJob(graph.getJobID, clearPersistedJob, newJobStatus)

And finally besides taking the job out of the map representing the currentJobs maybe increment a counter for canceled archived jobs or have a different map that stores archived canceled jobs (and others for failed and finished) so that when it is time for sending out the monitoring metrics we would use this map/counter rather than doing any RPC calls at all.
What do you guys think ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think that would result in too much additional bookkeeping for the JobManager; as i understand it we would effectively copy a lot of the Archivist functionality into the JobManager.

One simple way to reduce the # of RPC calls from 3 to 1 is something like this:

public class RpcSavingContainer {
    private JobsOverview overview;
    private int count = 0;

    public synchronized JobsOverview getOverview() {
        if (count == 0) {
            overview = // result of RPC call
        }
        count = count++ % 3;
        return overview;
    }
}

Each gauge would use the same instance of this container.

Another option would be to instantiate these metrics in the MemoryArchivist and not the JobManager.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think doing everything in one request and setting a short timeout (1 second?) is a good solution.

@rmetzger
Copy link
Contributor

Thank you for opening a pull request. I agree that we should expose all numbers we show in the web interface as a metric as well.

@tillrohrmann
Copy link
Contributor

Abandoned

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants