Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-8887][flip-6] ClusterClient.getJobStatus can throw FencingTokenException #5648

Closed
wants to merge 4 commits into from

Conversation

yanghua
Copy link
Contributor

@yanghua yanghua commented Mar 7, 2018

What is the purpose of the change

This pull request fixed ClusterClient.getJobStatus throw FencingTokenException issue

Brief change log

  • try-catch request job status from job master gateway if catch a exception then goto : request job status from archived execution graph store

Verifying this change

This change is already covered by existing tests, such as DispatcherTest.testCacheJobExecutionResult.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

} else {
return FutureUtils.completedExceptionally(new FlinkJobNotFoundException(jobId));
try {
return jobManagerRunner.getJobManagerGateway().requestJobStatus(timeout);
Copy link
Contributor

@zentol zentol Mar 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an asynchronous call that isn't throwing the exception. You have to add a handler to the returned CompletableFuture. It also only properly resolves one of the exceptions, and IMO shouldn't catch Exception but the specific exceptions we want the workaround to work for to not hide other issues.

In any case, I'm not sure if adding workarounds to the Dispatcher is the right way to go. These issues revealed that some scenarios are not properly handled, and I would prefer waiting for @tillrohrmann to really fix this in the Dispatcher and related components.

We can temporarily handle both exceptions in the MiniClusterClient by adding a single retry (with a short sleep) if a specific exception occurs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I will try this way, thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think methods such as requestJob and requestOperatorBackPressureStats are equally affected by this bug. I would also like to hear @tillrohrmann's opinion first before proceeding.

@yanghua
Copy link
Contributor Author

yanghua commented Mar 7, 2018

hi @zentol how about this solution? I think try another way(fetch from archived execution graph store) from CompletableFuture.handler is a good choice.

Copy link
Contributor

@zentol zentol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are now somewhat doing correctly what you originally intended, but it still doesn't solve all problems as I said in my first comment.

The JobManager can still receive messages before the fencingToken was set (which causes the first exception in the JIRA). There's also another message that can sometimes occur where the JM RpcEndpoint wasn't started yet. Both of these can be fixed with a retry, as is the issue that you're currently addressing.

if (throwable != null) {
Throwable error = ExceptionUtils.stripCompletionException(throwable);

if (error instanceof FencingTokenException) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing else block causes other exceptions to be swallowed...

@zentol
Copy link
Contributor

zentol commented Oct 16, 2018

Already fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants