[FLINK-2804] [runtime] Add blocking job submission support for HA #1249
+1, good to merge. We have to remove dc9daef before merging, though (that was a temporary workaround because we didn't support blocking HA submissions). Should I close the other recovery PRs, since this one contains all the respective additions? I've tested it locally and it works fine. I think moving this to the job client actor is better than having the repeated retrieval in the job client. The addition of the submission and connection timeouts is also very good.
Thanks for the review, @uce, and for pointing out the temporary commit. I will remove it when I merge this PR. No, don't close the PRs. I'll commit them individually and close them via the commit messages, then simply rebase this one. Maybe we'll have HA in master by the end of the day :-)
Internal actor states must only be modified within the actor thread. This avoids all the well-known issues coming with concurrency.
Fix RemoveCachedJob by introducing RemoveJob
Fix JobManagerITCase
Add removeJob which maintains the job in the SubmittedJobGraphStore
Make revokeLeadership not remove the jobs from the state backend
Fix shading problem with curator by hiding CuratorFramework in ChaosMonkeyITCase
Move StateBackend enum to top level and org.apache.flink.runtime.state
Abstract blob store in blob server for recovery
This closes apache#1227.
The JobClientActor is now responsible for receiving the JobStatus updates from a newly elected leader. It uses the LeaderRetrievalService to be notified about new leaders. The actor can only be used to submit a single job to the JM. Once it has received a job from the Client, it tries to send it to the current leader. If no leader is available, a connection timeout is triggered. If the job could be sent to the JM, a submission timeout is triggered if the JobClientActor does not receive a JobSubmitSuccess message within the timeout interval. If the connection to the leader is lost after having submitted a job, a connection timeout is triggered if the JobClientActor cannot reconnect to another JM within the timeout interval. The JobClient simply awaits the completion of the future returned for the SubmitJobAndWait message.
Added test cases for JobClientActor exceptions
This closes apache#1249.
6d81141 to 1da0d78
The JobClientActor is now responsible for receiving the JobStatus updates from
a newly elected leader. It uses the LeaderRetrievalService to be notified about
new leaders. The actor can only be used to submit a single job to the JM. Once
it has received a job from the Client, it tries to send it to the current leader.
If no leader is available, a connection timeout is triggered. If the job could
be sent to the JM, a submission timeout is triggered if the JobClientActor does
not receive a JobSubmitSuccess message within the timeout interval. If the
connection to the leader is lost after having submitted a job, a connection
timeout is triggered if the JobClientActor cannot reconnect to another JM within
the timeout interval. The JobClient simply awaits the completion of the
future returned for the SubmitJobAndWait message.
Added test cases for JobClientActor exceptions
This PR is based on extended versions of PR #1153 and #1227.
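
To make the timeout rules in the description above easier to follow, here is a minimal, Akka-free sketch of the state transitions. It is not the code from this PR: the class `JobClientTimeoutSketch`, its method names, the `Status` enum, and the address `jm-1` are all hypothetical, and time is passed in explicitly as milliseconds instead of being driven by an actor scheduler. It also assumes that "a connection timeout is triggered" means a timer is started that fires unless a leader shows up within the interval, and that losing the leader cancels a pending submission timer.

```java
/** Hypothetical model of the connection/submission timeout logic; not Flink code. */
final class JobClientTimeoutSketch {

    enum Status { WAITING, CONNECTION_TIMEOUT, SUBMISSION_TIMEOUT }

    private static final long NONE = Long.MAX_VALUE; // marks an inactive timer

    private final long timeoutMillis;
    private String leader;                 // null while no leader is known
    private boolean jobReceived;
    private boolean submitAcknowledged;
    private long connectionDeadline = NONE;
    private long submissionDeadline = NONE;

    JobClientTimeoutSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** The client hands over its single job. */
    void submitJob(long now) {
        if (jobReceived) {
            throw new IllegalStateException("only a single job may be submitted");
        }
        jobReceived = true;
        if (leader != null) {
            sendToLeader(now);
        } else {
            connectionDeadline = now + timeoutMillis; // no leader: start connection timer
        }
    }

    /** Leader retrieval notification; a null leader means the connection was lost. */
    void leaderChanged(String newLeader, long now) {
        leader = newLeader;
        if (!jobReceived) {
            return; // nothing to time out before a job arrives
        }
        if (leader == null) {
            submissionDeadline = NONE; // assumption: leader loss cancels the submission timer
            if (connectionDeadline == NONE) {
                connectionDeadline = now + timeoutMillis;
            }
        } else {
            connectionDeadline = NONE;
            if (!submitAcknowledged) {
                sendToLeader(now); // (re)submit to the new leader
            }
        }
    }

    /** Models the arrival of a JobSubmitSuccess message. */
    void submitSuccess() {
        submitAcknowledged = true;
        submissionDeadline = NONE;
    }

    /** Reports whether one of the timers has expired at time {@code now}. */
    Status check(long now) {
        if (now >= connectionDeadline) return Status.CONNECTION_TIMEOUT;
        if (now >= submissionDeadline) return Status.SUBMISSION_TIMEOUT;
        return Status.WAITING;
    }

    private void sendToLeader(long now) {
        // A real actor would send the job to `leader` here.
        submissionDeadline = now + timeoutMillis;
    }

    public static void main(String[] args) {
        JobClientTimeoutSketch client = new JobClientTimeoutSketch(1000);
        client.submitJob(0);                     // no leader yet
        System.out.println(client.check(500));   // still within the interval
        client.leaderChanged("jm-1", 600);       // leader found, job is sent
        client.submitSuccess();
        client.leaderChanged(null, 2000);        // leader lost after submission
        System.out.println(client.check(3000));  // no new leader within the interval
    }
}
```

Keeping the two deadlines as separate fields mirrors the description: the connection timer only runs while no leader is known, and the submission timer only runs between sending the job and receiving the acknowledgement.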