Conversation

@mxm mxm commented Jul 29, 2016

These changes are required for FLINK-4272 (introduce a JobClient class for job control). Essentially, we want to be able to re-attach to a running job and monitor it. It shouldn't make any difference whether we just submitted the job or are recovering it from an existing JobID.

This PR modifies the JobClientActor to support two different operation modes: (a) submit a job and monitor it, (b) re-attach to an existing job and monitor it.

The JobClient class has been updated with methods to access this functionality. Before, the class just had submitJobAndWait and submitJobDetached. Now, it has the additional methods submitJob, attachToRunningJob, and awaitJobResult; a rough usage sketch follows the list below.

  • submitJob(..): submits the job and returns a future whose result can be retrieved with awaitJobResult
  • attachToRunningJob(..): re-attaches to a running job, reconstructs its class loader, and returns a future whose result can be retrieved with awaitJobResult
  • awaitJobResult(..): blocks until the future returned by either submitJob or attachToRunningJob has been completed
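
A rough usage sketch of the two paths (hypothetical: the argument lists are abbreviated here, the real methods also take the ActorSystem, Configuration, timeouts and class loader; JobListeningContext is the context type mentioned later in this review):

// Path A: submit a job and obtain a listening context for it.
JobListeningContext context = JobClient.submitJob(jobGraph /*, ... */);

// Path B: alternatively, re-attach to an already running job by its JobID.
// JobListeningContext context = JobClient.attachToRunningJob(jobID /*, ... */);

// Either way, block here until the job has finished or failed.
JobExecutionResult result = JobClient.awaitJobResult(context);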

val listeningBehaviour: ListeningBehaviour,

var client: ActorRef,
var listeningBehaviour: ListeningBehaviour,
Contributor Author

We might want to allow multiple clients here. Otherwise only the least recently registered client will receive updates.

@mxm mxm force-pushed the FLINK-4273 branch 3 times, most recently from 7636aea to 3164cc5 on August 11, 2016 14:47

mxm commented Aug 12, 2016

CC @rmetzger @tillrohrmann Could you please take a look? I would like to merge this. Tests are passing: https://travis-ci.org/mxm/flink/builds/151653198

Future<Object> submissionFuture = Patterns.ask(
	jobClientActor,
	new JobClientMessages.SubmitJobAndWait(jobGraph),
	new Timeout(AkkaUtils.INF_TIMEOUT()));
Contributor

Is there a reason not to use the default "akka.ask.timeout" here?

Contributor Author

That's not possible because the JobClientActor will complete this future with the result of the job execution, which may be infinitely delayed. In all other cases (i.e. timeout to register at the JobManager, failure to attach to the job, failure to submit the job), the JobClientActor will complete the future with a failure message.
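
A condensed, hypothetical sketch of the behaviour described here (the message classes are placeholders, not the actual Flink messages): the future returned by Patterns.ask is completed by the JobClientActor itself, either with the job result or with a failure, so its lifetime is tied to the job rather than to a fixed client-side timeout.

import akka.actor.ActorRef;
import akka.actor.Status;
import akka.actor.UntypedActor;

public class JobResultForwardingSketch extends UntypedActor {

	public static class SubmitAndWait {}   // placeholder for SubmitJobAndWait
	public static class JobFinished {}     // placeholder for the job result message
	public static class ConnectionLost {}  // placeholder for an internal failure

	private ActorRef submitter;

	@Override
	public void onReceive(Object message) {
		if (message instanceof SubmitAndWait) {
			submitter = getSender();              // keep the ask future open
		} else if (message instanceof JobFinished) {
			submitter.tell(message, getSelf());   // completes the future with the result
		} else if (message instanceof ConnectionLost) {
			submitter.tell(new Status.Failure(
				new RuntimeException("Lost connection to the JobManager.")),
				getSelf());                       // completes the future with a failure
		} else {
			unhandled(message);
		}
	}
}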

@rmetzger
Contributor

I did a quick pass over the code. I think this change needs another review by our Actor expert @tillrohrmann ;)

// retrieve classloader first before doing anything
ClassLoader classloader;
try {
classloader = retrieveClassLoader(jobID, jobManagerGateWay, configuration, timeout);
Contributor

What if the JobManager has already changed at this point? We would no longer be able to retrieve the ClassLoader then, would we?

Contributor Author

True, this code assumes that the JobManager doesn't change between retrieving the leading JobManager and retrieving the class loader. There is always some gap during which the JobManager could change. We could mitigate this by retrying in case it has changed.
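
A hypothetical sketch of the retry mentioned here (retrieveClassLoader comes from the snippet above; lookupLeadingJobManager and MAX_ATTEMPTS are made-up stand-ins for the leader retrieval step):

ClassLoader classloader = null;
Exception lastFailure = null;

for (int attempt = 0; attempt < MAX_ATTEMPTS && classloader == null; attempt++) {
	try {
		// Re-resolve the leading JobManager on every attempt, in case it changed
		// between leader retrieval and class loader retrieval.
		ActorGateway jobManagerGateWay = lookupLeadingJobManager(configuration, timeout);
		classloader = retrieveClassLoader(jobID, jobManagerGateWay, configuration, timeout);
	} catch (Exception e) {
		lastFailure = e;
	}
}

if (classloader == null) {
	throw new RuntimeException("Could not reconstruct the user code class loader.", lastFailure);
}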

Contributor

Yes I think that would be good. The user code class loader should always be retrievable if the job is still running.

Contributor Author

Actually, I'm not really sure about this corner case. We don't typically retry client-side operations in case the leader has changed after retrieving it. Instead, we just throw an error (see all the methods in ClusterClient). The JobClientActor is an exception in this regard, and it has to be, because it operates independently of the user function.

So we could fail if we can't reconstruct the class loader. That of course has the caveat that even if the user doesn't use custom classes for the JobExecutionResult or exceptions, the job retrieval may fail (e.g. a firewall blocking the BlobManager port). That's why I didn't want to enforce this step, but we could enforce it and fix any problems with the BlobManager communication if there are any.

Contributor

Yeah, that's the question: shall we fail or try to perform on a best-effort basis? If you have user code classes in your result, then the deserialization will fail later on, right? In this case, it would be better imo that the user tries the operation again, because the failure might have been caused by a leader change. On the other hand, you might only be interested in the cancel and stop job commands and not in the deserialized result.

Would it be possible to first connect to the JobManager and to try to reconstruct the class loader only if we want to wait for the job result? If that fails, then we throw an exception.

Contributor Author

In addition to the result, you'll also need the class loader for getting accumulators of a running job.

I agree that it would be nice to fail when the class loader can't be reconstructed, but only if it is really the only option. So we could start off with the class loader set to None in the JobListeningContext. When the class loader is needed, i.e. for accumulator retrieval or job execution result retrieval, it is fetched.
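
A hypothetical sketch of that lazy reconstruction (class and method names are illustrative, not the actual JobListeningContext API):

import org.apache.flink.api.common.JobID;

public class JobListeningContextSketch {

	private final JobID jobId;
	private ClassLoader classLoader;   // stays null until first needed

	public JobListeningContextSketch(JobID jobId) {
		this.jobId = jobId;
	}

	public ClassLoader getClassLoader() throws Exception {
		if (classLoader == null) {
			// Only now talk to the JobManager/BlobManager; a failure surfaces
			// exactly when the class loader is really required (accumulators,
			// job execution result), not when merely attaching to the job.
			classLoader = retrieveClassLoader(jobId);
		}
		return classLoader;
	}

	// Stand-in for the BLOB-based class loader retrieval shown earlier in the diff.
	private ClassLoader retrieveClassLoader(JobID jobId) throws Exception {
		throw new UnsupportedOperationException("sketch only");
	}
}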

Contributor

Yes that could be a good solution :-)

@tillrohrmann
Contributor

Good work @mxm. I made some minor comments inline.

Just for my own clarification: is it still planned to have a new kind of JobClient which is bound to a specific job and which can be used to issue job-specific calls such as cancel, stop, execution result retrieval, etc.? I thought that the ClusterClient is used to communicate with the cluster, whereas the JobClient is responsible for the job communication. Will this be a follow-up?

The test case JobClientActorTest.testConnectionTimeoutAfterJobRegistration is failing on Travis.

After addressing the comments +1 for merging.


mxm commented Aug 19, 2016

Thanks for the review @tillrohrmann. Yes, the plan is to have a JobClient API class (the existing JobClient class will be renamed) which uses the SubmissionContext to supervise submitted jobs or attach to existing jobs. All the job-related methods from ClusterClient will be moved to this new class.


mxm commented Aug 19, 2016

@tillrohrmann I've refactored the JobClientActor to include the common code in JobClientActorBase and have implementations for submitting/attaching in JobSubmissionClientActor and JobAttachmentClientActor.
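
A rough, hypothetical sketch of that split (the hook name onJobManagerConnected is illustrative, not the actual method): shared leader lookup, status forwarding and timeout handling live in the base class, and the two subclasses only differ in what happens once a JobManager connection exists.

import akka.actor.UntypedActor;

abstract class JobClientActorBaseSketch extends UntypedActor {

	@Override
	public void onReceive(Object message) {
		// Common handling: leader session messages, execution state updates,
		// connection timeouts, ... (omitted in this sketch). Once the connection
		// to the JobManager is established, delegate to the mode-specific hook.
		onJobManagerConnected();
	}

	// Mode-specific hook; illustrative name.
	protected abstract void onJobManagerConnected();
}

class JobSubmissionClientActorSketch extends JobClientActorBaseSketch {
	@Override
	protected void onJobManagerConnected() {
		// submit the JobGraph to the JobManager, then monitor the job
	}
}

class JobAttachmentClientActorSketch extends JobClientActorBaseSketch {
	@Override
	protected void onJobManagerConnected() {
		// register for the existing JobID at the JobManager, then monitor the job
	}
}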


mxm commented Aug 19, 2016

@tillrohrmann Pinging the actor now to check if it is still alive. Also added another test case for that.


mxm commented Aug 19, 2016

I've made the last changes concerning the lazy reconstruction of the class loader we discussed. Rebased to master. Should be good to go now.

@mxm mxm force-pushed the FLINK-4273 branch 2 times, most recently from 0b92621 to b7c6787 on August 19, 2016 16:04
Await.result(
	Patterns.ask(
		jobClientActor,
		JobClientMessages.getPing(),
Contributor

I think we can also use Akka's built-in message Identify to do the same. Then we don't have to introduce a new message type.
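
A minimal sketch of that suggestion (Identify and ActorIdentity are Akka's built-in messages; the wrapper class and method here are hypothetical):

import akka.actor.ActorIdentity;
import akka.actor.ActorRef;
import akka.actor.Identify;
import akka.pattern.Patterns;
import akka.util.Timeout;
import scala.concurrent.Await;

final class ActorLivenessProbe {

	static void ensureAlive(ActorRef jobClientActor, Timeout askTimeout) throws Exception {
		Object answer = Await.result(
			Patterns.ask(jobClientActor, new Identify(1), askTimeout),
			askTimeout.duration());

		// A live actor answers with an ActorIdentity carrying its reference; if it
		// has already terminated, the ask above times out and Await.result throws.
		ActorRef ref = ((ActorIdentity) answer).getRef();
		if (ref == null) {
			throw new IllegalStateException("The JobClientActor is no longer alive.");
		}
	}
}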

Contributor Author

Good idea.


mxm commented Aug 22, 2016

Updated according to our comment discussion.


mxm commented Aug 23, 2016

Merging this if there are no further comments.

Timeout.durationToTimeout(AkkaUtils.getDefaultTimeout())),
AkkaUtils.getDefaultTimeout());
Await.ready(jobSubmissionFuture, askTimeout);
} catch (Exception e) {
Contributor
@tillrohrmann tillrohrmann Aug 23, 2016

I think we can narrow down this exception here. Should be good to catch TimeoutException.

Contributor Author

We throw the exception anyway afterwards. The only difference is that we wrap the exception and throw only if the future has not been completed in the meantime.

We would have to catch InterruptedException, TimeoutException, and IllegalArgumentException. I'm not convinced this is necessary.

Contributor

But at least an IllegalArgumentException should not trigger the pinging of the job client actor. This should be handled differently.

Contributor Author

Ah, I thought you were commenting on the inner catch block. Yes, it makes sense to catch only TimeoutException and InterruptedException here for Await.ready. Await.result actually throws Exception.
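
A sketch of the narrowed handling agreed on here (variable names follow the snippet above; checkActorAlive is a hypothetical stand-in for the liveness probe): only a timeout or an interrupt should lead to pinging the actor, after which we wait for another round.

import java.util.concurrent.TimeoutException;
import akka.actor.ActorRef;
import scala.concurrent.Await;
import scala.concurrent.Future;
import scala.concurrent.duration.FiniteDuration;

final class AwaitSubmissionSketch {

	static void waitForCompletion(Future<Object> jobSubmissionFuture,
			ActorRef jobClientActor, FiniteDuration askTimeout) throws Exception {
		while (!jobSubmissionFuture.isCompleted()) {
			try {
				Await.ready(jobSubmissionFuture, askTimeout);
			} catch (InterruptedException | TimeoutException e) {
				// The future is simply not done yet: verify the JobClientActor is
				// still alive before waiting for another round.
				checkActorAlive(jobClientActor, askTimeout);
			}
		}
	}

	// Hypothetical stand-in for the liveness probe (e.g. the Identify sketch above);
	// expected to throw if the actor is gone.
	static void checkActorAlive(ActorRef jobClientActor, FiniteDuration timeout) throws Exception {
	}
}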

Contributor Author

Updated.

These changes are required for FLINK-4272 (introduce a JobClient class
for job control). Essentially, we want to be able to re-attach to a
running job and monitor it. It shouldn't make any difference whether we
just submitted the job or are recovering it from an existing JobID.

This PR modifies the JobClientActor to support two different operation
modes: a) submitJob and monitor b) re-attach to job and monitor

The JobClient class has been updated with methods to access this
functionality. Before the class just had `submitJobAndWait` and
`submitJobDetached`. Now, it has the additional methods `submitJob`,
`attachToRunningJob`, and `awaitJobResult`.

The job submission has been split up in two phases:

1a. submitJob(..)
Submit job and return a future which can be completed to
get the result with `awaitJobResult`

1b. attachToRunningJob(..)
Re-attach to a running job, reconstruct its class loader, and return a
future which can be completed with `awaitJobResult`

2. awaitJobResult(..)
Blocks until the returned future from either `submitJob` or
`attachToRunningJob` has been completed

- split up JobClientActor into a base class and two implementations
- JobClient: on waiting check JobClientActor liveness
- lazily reconstruct user class loader
- add additional tests for JobClientActor
- add test case to test resuming of jobs

This closes apache#2313

mxm commented Aug 25, 2016

Rebased to the changes on master. Merging after tests pass again.

@asfgit asfgit closed this in 259a3a5 Aug 25, 2016

mxm commented Aug 25, 2016

Thanks for the helpful review, @tillrohrmann and @rmetzger.
