Conversation

@mxm mxm commented Jul 29, 2016

These changes are required for FLINK-4272 (introduce a JobClient class for job control). Essentially, we want to be able to re-attach to a running job and monitor it. It shouldn't make any difference whether we just submitted the job or are recovering it from an existing JobID.

This PR modifies the JobClientActor to support two different operation modes: (a) submit a job and monitor it, (b) re-attach to an existing job and monitor it.

The JobClient class has been updated with methods to access this functionality. Before, the class just had submitJobAndWait and submitJobDetached. Now, it has the additional methods submitJob, attachToRunningJob, and awaitJobResult; a rough usage sketch follows the list below.

  • submitJob(..): submits the job and returns a future whose result can be retrieved with awaitJobResult
  • attachToRunningJob(..): re-attaches to a running job, reconstructs its class loader, and returns a future whose result can be retrieved with awaitJobResult
  • awaitJobResult(..): blocks until the future returned by either submitJob or attachToRunningJob has been completed
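
A rough usage sketch of the two paths (hypothetical: the argument lists are abbreviated here, the real methods also take the ActorSystem, Configuration, timeouts and class loader; JobListeningContext is the context type mentioned later in this review):

// Path A: submit a job and obtain a listening context for it.
JobListeningContext context = JobClient.submitJob(jobGraph /*, ... */);

// Path B: alternatively, re-attach to an already running job by its JobID.
// JobListeningContext context = JobClient.attachToRunningJob(jobID /*, ... */);

// Either way, block here until the job has finished or failed.
JobExecutionResult result = JobClient.awaitJobResult(context);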

val listeningBehaviour: ListeningBehaviour,

var client: ActorRef,
var listeningBehaviour: ListeningBehaviour,
Contributor Author

We might want to allow multiple clients here. Otherwise only the least recently registered client will receive updates.

@mxm mxm force-pushed the FLINK-4273 branch 3 times, most recently from 7636aea to 3164cc5 on August 11, 2016 14:47

mxm commented Aug 12, 2016

CC @rmetzger @tillrohrmann Could you please take a look? I would like to merge this. Tests are passing: https://travis-ci.org/mxm/flink/builds/151653198

Future<Object> submissionFuture = Patterns.ask(
	jobClientActor,
	new JobClientMessages.SubmitJobAndWait(jobGraph),
	new Timeout(AkkaUtils.INF_TIMEOUT()));
Contributor

Is there a reason not to use the default "akka.ask.timeout" here?

Contributor Author

That's not possible because the JobClientActor will complete this future with the result of the job execution, which may be infinitely delayed. In all other cases (i.e. timeout to register at the JobManager, failure to attach to the job, failure to submit the job), the JobClientActor will complete the future with a failure message.
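
A condensed, hypothetical sketch of the behaviour described here (the message classes are placeholders, not the actual Flink messages): the future returned by Patterns.ask is completed by the JobClientActor itself, either with the job result or with a failure, so its lifetime is tied to the job rather than to a fixed client-side timeout.

import akka.actor.ActorRef;
import akka.actor.Status;
import akka.actor.UntypedActor;

public class JobResultForwardingSketch extends UntypedActor {

	public static class SubmitAndWait {}   // placeholder for SubmitJobAndWait
	public static class JobFinished {}     // placeholder for the job result message
	public static class ConnectionLost {}  // placeholder for an internal failure

	private ActorRef submitter;

	@Override
	public void onReceive(Object message) {
		if (message instanceof SubmitAndWait) {
			submitter = getSender();              // keep the ask future open
		} else if (message instanceof JobFinished) {
			submitter.tell(message, getSelf());   // completes the future with the result
		} else if (message instanceof ConnectionLost) {
			submitter.tell(new Status.Failure(
				new RuntimeException("Lost connection to the JobManager.")),
				getSelf());                       // completes the future with a failure
		} else {
			unhandled(message);
		}
	}
}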

@rmetzger
Contributor

I did a quick pass over the code. I think this change needs another review by our Actor expert @tillrohrmann ;)

// retrieve classloader first before doing anything
ClassLoader classloader;
try {
classloader = retrieveClassLoader(jobID, jobManagerGateWay, configuration, timeout);
Contributor

What if the JobManager has already changed at this point? We would no longer be able to retrieve the ClassLoader then, would we?

Contributor Author

True, this code assumes that the JobManager doesn't change between retrieving the leading JobManager and retrieving the class loader. There is always some gap during which the JobManager could change. We could mitigate this by retrying in case it has changed.
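
A hypothetical sketch of the retry mentioned here (retrieveClassLoader comes from the snippet above; lookupLeadingJobManager and MAX_ATTEMPTS are made-up stand-ins for the leader retrieval step):

ClassLoader classloader = null;
Exception lastFailure = null;

for (int attempt = 0; attempt < MAX_ATTEMPTS && classloader == null; attempt++) {
	try {
		// Re-resolve the leading JobManager on every attempt, in case it changed
		// between leader retrieval and class loader retrieval.
		ActorGateway jobManagerGateWay = lookupLeadingJobManager(configuration, timeout);
		classloader = retrieveClassLoader(jobID, jobManagerGateWay, configuration, timeout);
	} catch (Exception e) {
		lastFailure = e;
	}
}

if (classloader == null) {
	throw new RuntimeException("Could not reconstruct the user code class loader.", lastFailure);
}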

Contributor

Yes I think that would be good. The user code class loader should always be retrievable if the job is still running.

Contributor Author

Actually, I'm not really sure about this corner case. We don't typically retry client-side operations in case the leader has changed after retrieving it. Instead, we just throw an error (see all the methods in ClusterClient). The JobClientActor is an exception in this regard, and it has to be, because it operates independently of the user function.

So we could fail if we can't reconstruct the class loader. That of course has the caveat that even if the user doesn't use custom classes for the JobExecutionResult or exceptions, the job retrieval may fail (e.g. a firewall blocking the BlobManager port). That's why I didn't want to enforce this step, but we could enforce it and fix any problems with the BlobManager communication if there are any.

Contributor

Yeah, that's the question: shall we fail or try to perform on a best-effort basis? If you have user code classes in your result, then the deserialization will fail later on, right? In this case, it would be better imo that the user tries the operation again, because the failure might have been caused by a leader change. On the other hand, you might only be interested in the cancel and stop job commands and not in the deserialized result.

Would it be possible to first connect to the JobManager and to try to reconstruct the class loader only if we want to wait for the job result? If that fails, then we throw an exception.

Contributor Author

In addition to the result, you'll also need the class loader for getting accumulators of a running job.

I agree that it would be nice to fail when the class loader can't be reconstructed, but only if it is really the only option. So we could start off with the class loader set to None in the JobListeningContext. When the class loader is needed, i.e. for accumulator retrieval or job execution result retrieval, it is fetched.
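
A hypothetical sketch of that lazy reconstruction (class and method names are illustrative, not the actual JobListeningContext API):

import org.apache.flink.api.common.JobID;

public class JobListeningContextSketch {

	private final JobID jobId;
	private ClassLoader classLoader;   // stays null until first needed

	public JobListeningContextSketch(JobID jobId) {
		this.jobId = jobId;
	}

	public ClassLoader getClassLoader() throws Exception {
		if (classLoader == null) {
			// Only now talk to the JobManager/BlobManager; a failure surfaces
			// exactly when the class loader is really required (accumulators,
			// job execution result), not when merely attaching to the job.
			classLoader = retrieveClassLoader(jobId);
		}
		return classLoader;
	}

	// Stand-in for the BLOB-based class loader retrieval shown earlier in the diff.
	private ClassLoader retrieveClassLoader(JobID jobId) throws Exception {
		throw new UnsupportedOperationException("sketch only");
	}
}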

Contributor

Yes that could be a good solution :-)

@tillrohrmann
Contributor

Good work @mxm. I made some minor comments inline.

Just for my own clarification: is it still planned to have a new kind of JobClient which is bound to a specific job and which can be used to issue job-specific calls such as cancel, stop, execution result retrieval, etc.? I thought that the ClusterClient is used to communicate with the cluster, whereas the JobClient is responsible for the job communication. Will this be a follow-up?

The test case JobClientActorTest.testConnectionTimeoutAfterJobRegistration is failing on Travis.

After addressing the comments +1 for merging.


mxm commented Aug 19, 2016

Thanks for the review @tillrohrmann. Yes, the plan is to have a JobClient API class (the existing JobClient class will be renamed) which uses the SubmissionContext to supervise submitted jobs or attach to existing jobs. All the job-related methods from ClusterClient will be moved to this new class.


mxm commented Aug 19, 2016

@tillrohrmann I've refactored the JobClientActor to include the common code in JobClientActorBase and have implementations for submitting/attaching in JobSubmissionClientActor and JobAttachmentClientActor.
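
A rough, hypothetical sketch of that split (the hook name onJobManagerConnected is illustrative, not the actual method): shared leader lookup, status forwarding and timeout handling live in the base class, and the two subclasses only differ in what happens once a JobManager connection exists.

import akka.actor.UntypedActor;

abstract class JobClientActorBaseSketch extends UntypedActor {

	@Override
	public void onReceive(Object message) {
		// Common handling: leader session messages, execution state updates,
		// connection timeouts, ... (omitted in this sketch). Once the connection
		// to the JobManager is established, delegate to the mode-specific hook.
		onJobManagerConnected();
	}

	// Mode-specific hook; illustrative name.
	protected abstract void onJobManagerConnected();
}

class JobSubmissionClientActorSketch extends JobClientActorBaseSketch {
	@Override
	protected void onJobManagerConnected() {
		// submit the JobGraph to the JobManager, then monitor the job
	}
}

class JobAttachmentClientActorSketch extends JobClientActorBaseSketch {
	@Override
	protected void onJobManagerConnected() {
		// register for the existing JobID at the JobManager, then monitor the job
	}
}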


mxm commented Aug 19, 2016

@tillrohrmann Pinging the actor now to check if it is still alive. Also added another test case for that.


mxm commented Aug 19, 2016

I've made the last changes concerning the lazy reconstruction of the class loader we discussed. Rebased to master. Should be good to go now.

@mxm mxm force-pushed the FLINK-4273 branch 2 times, most recently from 0b92621 to b7c6787 on August 19, 2016 16:04
Await.result(
	Patterns.ask(
		jobClientActor,
		JobClientMessages.getPing(),
Contributor

I think we can also use Akka's built-in message Identify to do the same. Then we don't have to introduce a new message type.
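
A minimal sketch of that suggestion (Identify and ActorIdentity are Akka's built-in messages; the wrapper class and method here are hypothetical):

import akka.actor.ActorIdentity;
import akka.actor.ActorRef;
import akka.actor.Identify;
import akka.pattern.Patterns;
import akka.util.Timeout;
import scala.concurrent.Await;

final class ActorLivenessProbe {

	static void ensureAlive(ActorRef jobClientActor, Timeout askTimeout) throws Exception {
		Object answer = Await.result(
			Patterns.ask(jobClientActor, new Identify(1), askTimeout),
			askTimeout.duration());

		// A live actor answers with an ActorIdentity carrying its reference; if it
		// has already terminated, the ask above times out and Await.result throws.
		ActorRef ref = ((ActorIdentity) answer).getRef();
		if (ref == null) {
			throw new IllegalStateException("The JobClientActor is no longer alive.");
		}
	}
}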

Contributor Author

Good idea.


mxm commented Aug 22, 2016

Updated according to our comment discussion.


mxm commented Aug 23, 2016

Merging this if there are no further comments.

Timeout.durationToTimeout(AkkaUtils.getDefaultTimeout())),
AkkaUtils.getDefaultTimeout());
Await.ready(jobSubmissionFuture, askTimeout);
} catch (Exception e) {
Contributor
@tillrohrmann tillrohrmann Aug 23, 2016

I think we can narrow down this exception here. Should be good to catch TimeoutException.

Contributor Author

We throw the exception anyway afterwards. The only difference is that we wrap the exception and throw only if the future has not been completed in the meantime.

We would have to catch InterruptedException, TimeoutException, and IllegalArgumentException. I'm not convinced this is necessary.

Contributor

But at least an IllegalArgumentException should not trigger the pinging of the job client actor. This should be handled differently.

Contributor Author

Ah, I thought you were commenting on the inner catch block. Yes, it makes sense to catch only TimeoutException and InterruptedException here for Await.ready. Await.result actually throws Exception.
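
A sketch of the narrowed handling agreed on here (variable names follow the snippet above; checkActorAlive is a hypothetical stand-in for the liveness probe): only a timeout or an interrupt should lead to pinging the actor, after which we wait for another round.

import java.util.concurrent.TimeoutException;
import akka.actor.ActorRef;
import scala.concurrent.Await;
import scala.concurrent.Future;
import scala.concurrent.duration.FiniteDuration;

final class AwaitSubmissionSketch {

	static void waitForCompletion(Future<Object> jobSubmissionFuture,
			ActorRef jobClientActor, FiniteDuration askTimeout) throws Exception {
		while (!jobSubmissionFuture.isCompleted()) {
			try {
				Await.ready(jobSubmissionFuture, askTimeout);
			} catch (InterruptedException | TimeoutException e) {
				// The future is simply not done yet: verify the JobClientActor is
				// still alive before waiting for another round.
				checkActorAlive(jobClientActor, askTimeout);
			}
		}
	}

	// Hypothetical stand-in for the liveness probe (e.g. the Identify sketch above);
	// expected to throw if the actor is gone.
	static void checkActorAlive(ActorRef jobClientActor, FiniteDuration timeout) throws Exception {
	}
}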

Contributor Author

Updated.

These changes are required for FLINK-4272 (introduce a JobClient class
for job control). Essentially, we want to be able to re-attach to a
running job and monitor it. It shouldn't make any difference whether we
just submitted the job or are recovering it from an existing JobID.

This PR modifies the JobClientActor to support two different operation
modes: a) submitJob and monitor b) re-attach to job and monitor

The JobClient class has been updated with methods to access this
functionality. Before the class just had `submitJobAndWait` and
`submitJobDetached`. Now, it has the additional methods `submitJob`,
`attachToRunningJob`, and `awaitJobResult`.

The job submission has been split up in two phases:

1a. submitJob(..)
Submit job and return a future which can be completed to
get the result with `awaitJobResult`

1b. attachToRunningJob(..)
Re-attach to a running job, reconstruct its class loader, and return a
future which can be completed with `awaitJobResult`

2. awaitJobResult(..)
Blocks until the returned future from either `submitJob` or
`attachToRunningJob` has been completed

- split up JobClientActor into a base class and two implementations
- JobClient: on waiting check JobClientActor liveness
- lazily reconstruct user class loader
- add additional tests for JobClientActor
- add test case to test resuming of jobs

This closes apache#2313

mxm commented Aug 25, 2016

Rebased to the changes on master. Merging after tests pass again.

@asfgit asfgit closed this in 259a3a5 Aug 25, 2016

mxm commented Aug 25, 2016

Thanks for the helpful review, @tillrohrmann and @rmetzger.
