[FLINK-2804] [runtime] Add blocking job submission support for HA #1249

Closed
wants to merge 7 commits into from


tillrohrmann
Contributor

The JobClientActor is now responsible for receiving the JobStatus updates from
a newly elected leader. It uses the LeaderRetrievalService to be notified about
new leaders. The actor can only be used to submit a single job to the JM
(JobManager). Once it has received a job from the Client, it tries to send it to
the current leader. If no leader is available, a connection timeout is triggered.
If the job could be sent to the JM, a submission timeout is triggered if the
JobClientActor does not receive a JobSubmitSuccess message within the timeout
interval. If the connection to the leader is lost after having submitted a job,
a connection timeout is triggered if the JobClientActor cannot reconnect to
another JM within the timeout interval. The JobClient simply awaits the
completion of the future returned for the SubmitJobAndWait message.
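
The submission flow described above can be read as a small state machine. The following is only a minimal sketch under stated assumptions, not the actual JobClientActor: the class name `JobClientSketch`, its `State` enum, and all method names are hypothetical, the two timeouts are modeled as explicit method calls instead of scheduled actor messages, and the behavior when a new leader appears (resend if unacknowledged, otherwise just keep listening) is an assumption made for illustration.

```java
/** Hypothetical sketch of the submission flow; not Flink's JobClientActor. */
final class JobClientSketch {
    enum State {
        IDLE, AWAITING_LEADER, AWAITING_ACK, SUBMITTED,
        FAILED_CONNECTION_TIMEOUT, FAILED_SUBMISSION_TIMEOUT
    }

    private State state = State.IDLE;
    private String leader;        // null while no leader is known
    private boolean jobReceived;  // the sketch accepts exactly one job
    private boolean acknowledged; // true once JobSubmitSuccess arrived

    State state() { return state; }

    private boolean hasFailed() {
        return state == State.FAILED_CONNECTION_TIMEOUT
            || state == State.FAILED_SUBMISSION_TIMEOUT;
    }

    /** Leader notification; null means the current leader was lost. */
    void leaderChanged(String newLeader) {
        leader = newLeader;
        if (!jobReceived || hasFailed()) {
            return; // nothing to (re)send yet, or already terminated
        }
        if (leader == null) {
            state = State.AWAITING_LEADER; // a connection timer would start here
        } else if (acknowledged) {
            state = State.SUBMITTED;       // assumption: only status updates resume
        } else {
            state = State.AWAITING_ACK;    // assumption: job is sent to the new leader
        }
    }

    /** The client hands over its single job. */
    void submitJob() {
        if (jobReceived) {
            throw new IllegalStateException("only a single job can be submitted");
        }
        jobReceived = true;
        state = (leader == null) ? State.AWAITING_LEADER : State.AWAITING_ACK;
    }

    /** Models receipt of a JobSubmitSuccess message. */
    void jobSubmitSuccess() {
        if (state == State.AWAITING_ACK) {
            acknowledged = true;
            state = State.SUBMITTED;
        }
    }

    /** A timer firing; stale timeouts are ignored. */
    void connectionTimeout() {
        if (state == State.AWAITING_LEADER) {
            state = State.FAILED_CONNECTION_TIMEOUT;
        }
    }

    void submissionTimeout() {
        if (state == State.AWAITING_ACK) {
            state = State.FAILED_SUBMISSION_TIMEOUT;
        }
    }
}
```

In this sketch, calling `submitJob()` before any leader is known moves the state to `AWAITING_LEADER`, and a subsequent `connectionTimeout()` ends in `FAILED_CONNECTION_TIMEOUT`. A timeout that fires after the state has already moved on is ignored, which is one simple way to deal with timers that were scheduled for an earlier phase.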

Added test cases for JobClientActor exceptions

This PR is based on extended versions of PR #1153 and #1227.

@uce
Contributor

uce commented Oct 10, 2015

+1, good to merge. We have to remove dc9daef before merging though (that was a temporary workaround because we didn't support blocking HA submissions). Should I close the other recovery PRs, as this one contains all the respective additions?

I've tested it out locally and it works fine. I think moving this to the job client actor is better than having the repeated retrieval in the job client. The addition of the submission and connection timeouts is also very good.

@tillrohrmann
Contributor Author

Thanks for the review @uce. Thanks for pointing out the temporary commit. I will remove it when I merge this PR. No, don't close the PRs. I'll commit them individually and close them via the commit messages. I will then simply rebase this one. Maybe we have HA in the master at the end of the day :-)

uce and others added 7 commits October 12, 2015 00:56
Internal actor states must only be modified within the actor thread.
This avoids all the well-known issues coming with concurrency.

Fix RemoveCachedJob by introducing RemoveJob

Fix JobManagerITCase

Add removeJob which maintains the job in the SubmittedJobGraphStore

Make revokeLeadership not remove the jobs from the state backend

Fix shading problem with curator by hiding CuratorFramework in ChaosMonkeyITCase
Move StateBackend enum to top level and org.apache.flink.runtime.state

Abstract blob store in blob server for recovery

This closes apache#1227.
The JobClientActor is now responsible for receiving the JobStatus updates from
a newly elected leader. It uses the LeaderRetrievalService to be notified about
new leaders. The actor can only be used to submit a single job to the JM. Once
it has received a job from the Client, it tries to send it to the current leader.
If no leader is available, a connection timeout is triggered. If the job could
be sent to the JM, a submission timeout is triggered if the JobClientActor does
not receive a JobSubmitSuccess message within the timeout interval. If the
connection to the leader is lost after having submitted a job, a connection
timeout is triggered if the JobClientActor cannot reconnect to another JM within
the timeout interval. The JobClient simply awaits the completion of the
future returned for the SubmitJobAndWait message.

Added test cases for JobClientActor exceptions

This closes apache#1249.
tillrohrmann added a commit to tillrohrmann/flink that referenced this pull request Oct 15, 2015
tillrohrmann added a commit to tillrohrmann/flink that referenced this pull request Oct 16, 2015
tillrohrmann added a commit to tillrohrmann/flink that referenced this pull request Oct 19, 2015
tillrohrmann added a commit to tillrohrmann/flink that referenced this pull request Oct 19, 2015
@asfgit asfgit closed this in d18f580 Oct 20, 2015
cfmcgrady pushed a commit to cfmcgrady/flink that referenced this pull request Oct 23, 2015
lofifnc pushed a commit to lofifnc/flink that referenced this pull request Oct 23, 2015
@tillrohrmann tillrohrmann deleted the client-recovery branch August 19, 2016 12:12
3 participants