[FLINK-2804] [runtime] Add blocking job submission support for HA #1249

tillrohrmann · 2015-10-09T20:18:22Z

The JobClientActor is now responsible for receiving the JobStatus updates from
a newly elected leader. It uses the LeaderRetrievalService to be notified about
new leaders. The actor can only be used to submit a single job to the JM. Once
it has received a job from the Client, it tries to send it to the current leader.
If no leader is available, a connection timeout is triggered. If the job could
be sent to the JM, a submission timeout is triggered if the JobClientActor does
not receive a JobSubmitSuccess message within the timeout interval. If the
connection to the leader is lost after the job has been submitted, a connection
timeout is triggered if the JobClientActor cannot reconnect to another JM within
the timeout interval. The JobClient simply waits for the completion of the
future returned for the SubmitJobAndWait message.
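
To make the timeout rules above easier to follow, here is a minimal, Akka-free sketch of which timeout is armed in which situation. It is not the code in this PR: the class `JobClientSketch`, its methods, and the `Timeout` enum are hypothetical names chosen for illustration, and the sketch assumes that an unacknowledged job is simply handed to whichever leader shows up next.

```java
import java.util.Optional;

/** Simplified model of the timeout rules; all names here are hypothetical. */
final class JobClientSketch {

    /** The timeout that is currently armed. */
    enum Timeout { NONE, CONNECTION, SUBMISSION }

    private Optional<String> leader = Optional.empty(); // current leader address, if any
    private boolean acknowledged = false;               // JobSubmitSuccess was received

    /** Leader retrieval callback; a null address means the leader was lost. */
    void onLeaderChanged(String newLeaderAddress) {
        // Assumption: if the job is not yet acknowledged, it is (re)sent to the
        // new leader, so only the address needs to be tracked in this model.
        leader = Optional.ofNullable(newLeaderAddress);
    }

    /** Called when the current leader confirms the submission. */
    void onJobSubmitSuccess() {
        if (leader.isPresent()) {
            acknowledged = true;
        }
    }

    /** Returns the timeout that would fire if nothing else happens. */
    Timeout armedTimeout() {
        if (!leader.isPresent()) {
            return Timeout.CONNECTION;  // no leader known, or leader lost
        }
        if (!acknowledged) {
            return Timeout.SUBMISSION;  // job sent, waiting for JobSubmitSuccess
        }
        return Timeout.NONE;            // connected and submission confirmed
    }

    public static void main(String[] args) {
        JobClientSketch client = new JobClientSketch();
        System.out.println(client.armedTimeout()); // CONNECTION
        client.onLeaderChanged("jm-1");
        System.out.println(client.armedTimeout()); // SUBMISSION
        client.onJobSubmitSuccess();
        System.out.println(client.armedTimeout()); // NONE
        client.onLeaderChanged(null);
        System.out.println(client.armedTimeout()); // CONNECTION
        client.onLeaderChanged("jm-2");
        System.out.println(client.armedTimeout()); // NONE
    }
}
```

The sketch leaves out the actual timers and the JobStatus forwarding; it only shows that the connection timeout covers both "no leader yet" and "leader lost after submission", while the submission timeout covers the gap between sending the job and receiving JobSubmitSuccess.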

Added test cases for JobClientActor exceptions

This PR is based on extended versions of PR #1153 and #1227.

uce · 2015-10-10T09:54:02Z

+1 good to merge. We have to remove dc9daef before merging though (that was a temporary workaround because we didn't support blocking HA submissions). Should I close the other recovery PRs, as this one contains all the respective additions?

I've tested it out locally and it works fine. I think moving this to the job client actor is better than having the repeated retrieval in the job client. The additions of the submission and connection timeouts are also very good.

tillrohrmann · 2015-10-10T12:50:58Z

Thanks for the review @uce, and thanks for pointing out the temporary commit. I will remove it when I merge this PR. No, don't close the PRs. I'll commit them individually and close them via the commit messages, and will then simply rebase this one. Maybe we have HA in the master at the end of the day :-)

Internal actor states must only be modified within the actor thread. This avoids all the well-known issues coming with concurrency.

Fix RemoveCachedJob by introducing RemoveJob

Fix JobManagerITCase

Add removeJob which maintains the job in the SubmittedJobGraphStore

Make revokeLeadership not remove the jobs from the state backend

Fix shading problem with curator by hiding CuratorFramework in ChaosMonkeyITCase

This closes apache#1249.

uce and others added 7 commits October 12, 2015 00:56

[runtime] Add type parameter to ByteStreamStateHandle

3111a64

[FLINK-2652] [tests] Temporary ignore flakey PartitionRequestClientFactoryTest

c41999e

[FLINK-2792] [jobmanager, logging] Set actor message log level to TRACE

02ab42d

[FLINK-2354] [runtime] Add job graph and checkpoint recovery

ce6a943

This closes apache#1153.

[FLINK-2805] [blobmanager] Write JARs to file state backend for recovery

c8e6336

Move StateBackend enum to top level and org.apache.flink.runtime.state

Abstract blob store in blob server for recovery

This closes apache#1227.

tillrohrmann force-pushed the client-recovery branch from 6d81141 to 1da0d78 Compare October 12, 2015 12:08

asfgit closed this in d18f580 Oct 20, 2015

tillrohrmann deleted the client-recovery branch August 19, 2016 12:12

rmetzger added the component=<none> label Mar 14, 2019
