[FLINK-4272] Create a JobClient for job control and monitoring #2732

mxm · 2016-10-31T18:02:27Z

Also includes: [FLINK-4274] Expose new JobClient in the DataSet/DataStream API

- rename JobClient class to JobClientActorUtils
- introduce JobClient interface with two implementations
  - JobClientEager: starts an actor system right away and monitors the job
    - Move ClusterClient#cancel, ClusterClient#stop, ClusterClient#getAccumulators to JobClient
  - JobClientLazy: starts an actor system when requests are made by encapsulating the eager job client
- Java and Scala API
  - JobClient integration
  - introduce ExecutionEnvironment#executeWithControl()
  - introduce StreamExecutionEnvironment#executeWithControl()
- report errors during job execution as JobExecutionException instead of ProgramInvocationException and adapt test cases
- provide finalizers to run code upon shutdown of client
- use ActorGateway in JobListeningContext
- add test case for JobClient implementations

mxm · 2016-11-01T11:06:36Z

Rebased to the latest changes on the master.

mxm · 2016-11-01T11:11:28Z

CC @rmetzger @aljoscha Could you take a look at the changes?


tillrohrmann

Thanks for your contribution @mxm. It's good to refactor how the client interacts with a running job.

I think that we should revisit the public interface of JobClient in order to decide which functionality we actually want to expose. I made some comments inline.

I think it would also be good to decouple the JobClient, and then also the ClusterClient, from the underlying RPC implementation. At the moment it is tightly coupled with Akka and the ActorGateways. Having Flip-6 in mind, it would be good to have an abstraction which hides these details. Otherwise, the newly introduced code has to be rewritten for Flip-6 again.

tillrohrmann · 2016-11-08T18:07:10Z

flink-clients/src/main/java/org/apache/flink/client/LocalExecutor.java

-					stop();
+			boolean sysoutPrint = isPrintingStatusDuringExecution();
+
+


two line linebreak

tillrohrmann · 2016-11-08T18:10:18Z

flink-clients/src/main/java/org/apache/flink/client/LocalExecutor.java

+						try {
+							stop();
+						} catch (Exception e) {
+							throw new RuntimeException("Failed to run cleanup", e);


This will crash the JobClientEager when calling JobClientEager.shutdown. Is this intended?

Thanks, catching exceptions per finalizer would make sense.
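
A minimal sketch of what "catching exceptions per finalizer" could look like. `FinalizerRunner` and its methods are hypothetical names used only for illustration and are not taken from the PR; the point is that a failing finalizer is recorded instead of aborting the shutdown loop.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of the PR): runs finalizers in the order they were added. */
public class FinalizerRunner {

    private final List<Runnable> finalizers = new ArrayList<>();

    public void addFinalizer(Runnable finalizer) {
        finalizers.add(finalizer);
    }

    /**
     * Runs every finalizer and catches failures individually, so that one
     * failing finalizer does not prevent the remaining ones from running.
     * Returns the collected failures; an empty list means a clean shutdown.
     */
    public List<RuntimeException> shutdown() {
        List<RuntimeException> failures = new ArrayList<>();
        for (Runnable finalizer : finalizers) {
            try {
                finalizer.run();
            } catch (RuntimeException e) {
                failures.add(e);
            }
        }
        return failures;
    }
}
```

With this shape, the caller of `shutdown()` decides whether to log or rethrow the collected failures.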

tillrohrmann · 2016-11-08T18:10:51Z

flink-clients/src/main/java/org/apache/flink/client/LocalExecutor.java

+			Runnable cleanup = new Runnable() {
+				@Override
+				public void run() {
+					if (shutDownAtEnd) {


Can't we move this if condition out of the runnable and only add the clean up runnable if shutDownAtEnd == true?

We could, but it wouldn't make any semantic difference since the enclosed variable must be final. Ok granted, it would spare us one object allocation, but that's a minor benefit.

tillrohrmann · 2016-11-08T18:15:38Z

flink-clients/src/main/java/org/apache/flink/client/RemoteExecutor.java

+							try {
+								stop();
+							} catch (Exception e) {
+								throw new RuntimeException("Failed to clean up.", e);


Same here with the exception. I think it is not a good practice to masquerade checked exceptions as unchecked exceptions, because it violates the contract defined by the Runnable interface.

Fine, then we need something like Runnable with a checked exception signature.
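
One way such a "Runnable with a checked exception signature" might look is sketched below. `ThrowingRunnable` and `runAndCapture` are hypothetical names chosen for this example, not something defined in the PR.

```java
/** Illustrative sketch only; names are assumptions, not part of the PR. */
public class ThrowingRunnableExample {

    /** Like Runnable, but run() is allowed to throw a checked exception. */
    @FunctionalInterface
    interface ThrowingRunnable {
        void run() throws Exception;
    }

    /**
     * Runs the finalizer and hands any exception back to the caller,
     * instead of wrapping it in a RuntimeException.
     * Returns null if the finalizer completed normally.
     */
    static Exception runAndCapture(ThrowingRunnable finalizer) {
        try {
            finalizer.run();
            return null;
        } catch (Exception e) {
            return e;
        }
    }
}
```

A finalizer that calls `stop()` could then be registered as a lambda without the try/catch wrapper shown in the diff above.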

tillrohrmann · 2016-11-08T18:15:58Z

flink-clients/src/main/java/org/apache/flink/client/RemoteExecutor.java

+				new Runnable() {
+					@Override
+					public void run() {
+						if (shutDownAtEnd) {


Maybe moving this out of the runnable.

This closure should be fine since Java demands the variable to be final.

The closure is fine, but by moving this out of the runnable, we can avoid registering Runnables if we don't need them.
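
The suggestion can be sketched roughly as follows, assuming a hypothetical helper `finalizersFor` that stands in for the registration code in the executors: the `shutDownAtEnd` check happens once at registration time, so no Runnable is created or registered when no cleanup is needed.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; the helper name is an assumption, not part of the PR. */
public class ConditionalCleanup {

    /**
     * Returns the finalizers to register. The flag is checked here, outside
     * the Runnable, so the list stays empty when shutDownAtEnd is false.
     */
    static List<Runnable> finalizersFor(boolean shutDownAtEnd, Runnable stop) {
        List<Runnable> finalizers = new ArrayList<>();
        if (shutDownAtEnd) {
            finalizers.add(stop);
        }
        return finalizers;
    }
}
```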

tillrohrmann · 2016-11-09T11:32:25Z

flink-core/src/main/java/org/apache/flink/api/common/JobClient.java

+
+import java.util.Map;
+
+/*


No JavaDoc comment

Missing *.

tillrohrmann · 2016-11-09T11:33:45Z

flink-core/src/main/java/org/apache/flink/api/common/JobClient.java

+/*
+ * An Flink job client interface to interact with running Flink jobs.
+ */
+public interface JobClient {


PublicEvolving?

Would it make sense to be able to retrieve the ClusterClient from the JobClient?

+1

I could pass the ClusterClient to the JobClient. I thought I would avoid that because it would expose the ClusterClient also from the regular Java API which is generally agnostic of job submission and cluster management.

tillrohrmann · 2016-11-09T11:36:57Z

flink-core/src/main/java/org/apache/flink/api/common/JobClient.java

+	 * when the client is shut down. Runnables are called
+	 * in the order they are added.
+	 */
+	void addFinalizer(Runnable finalizer) throws Exception;


Is this a method we want to expose to the user? Seems to me like something with which he shouldn't fiddle around.

Yes, that's an issue with sharing the interface across modules. Let me try to get rid of it for the interface.

tillrohrmann · 2016-11-09T11:38:11Z

flink-clients/src/main/java/org/apache/flink/client/program/JobClientEager.java

+	 */
+	@Override
+	public JobExecutionResult waitForResult() throws JobExecutionException {
+		LOG.info("Waiting for results of Job {}", jobListeningContext.getJobID());


Typo: "results of job {}"

tillrohrmann · 2016-11-09T13:04:05Z

flink-clients/src/main/java/org/apache/flink/client/program/JobClientLazy.java

+/**
+ * A detached job client which lazily initiates the cluster connection.
+ */
+public class JobClientLazy implements JobClient {


The distinction between JobClientEager and JobClientLazy feels a little bit clumsy. Can't we get rid of one of them and simply have a JobClientImpl? The only place where JobClientLazy is returned is when calling submitJobDetached. I think in this case, you don't expect to get a JobClient back because it is submitted in detached mode.

I found it rather clever to have a lazy implementation of the client which can be retrieved when needed. For the sake of keeping things simple, I would opt to remove it in favor of one implementation.
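
For readers unfamiliar with the pattern under discussion, here is a rough sketch of a lazy client that wraps an eagerly connecting one. The `Client` interface is a heavily reduced, hypothetical stand-in for JobClient, and none of the names below come from the PR.

```java
import java.util.function.Supplier;

/** Illustrative sketch only; all names are assumptions, not part of the PR. */
public class LazyClientSketch {

    /** Reduced stand-in for the JobClient interface. */
    interface Client {
        void cancel();
    }

    /** Creates the underlying (eager) client on first use, then delegates to it. */
    static final class LazyClient implements Client {
        private final Supplier<Client> factory;
        private Client delegate;

        LazyClient(Supplier<Client> factory) {
            this.factory = factory;
        }

        private synchronized Client delegate() {
            if (delegate == null) {
                delegate = factory.get();
            }
            return delegate;
        }

        @Override
        public void cancel() {
            delegate().cancel();
        }
    }
}
```

The cost of the pattern is a second class that mirrors every method of the interface, which is the extra complexity the review comment objects to.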

aljoscha · 2016-11-09T14:26:36Z

flink-core/src/main/java/org/apache/flink/api/common/JobClient.java

+	 * Runs finalization code to shutdown the client
+	 * and its dependencies.
+	 */
+	void shutdown();


Shutdown seems to be an internal implementation detail for the new finalizers. It should therefore not be in the public API. It even seems problematic to allow users to call it because it would prematurely call finalizers.

Shutdown should not be internal. The idea was that the user calls it to shut down the client and any code associated with it (e.g. the mini cluster).

But now there's cancel(), stop() and shutdown() which might be quite confusing for users. And nothing is done in shutdown() except calling the finalizers right now, correct?

Correct, let's see if we can solely dedicate the execution of shutdown to finalizers and shutdown hooks then.

I'm also not sure whether a JobClient should be allowed to shut down a cluster. Imagine that you have multiple jobs running on the same cluster. Then you don't want to have this kind of behaviour.

mxm · 2016-11-22T15:05:19Z

Thank you for your comments @tillrohrmann and @aljoscha. I'll make changes and get back to you.

mxm · 2017-11-02T16:08:50Z

This probably needs an overhaul by now. Have there been any other efforts to introduce a job client?

aljoscha · 2017-11-02T16:56:25Z

No work yet, but we still need it. 😅

mxm · 2017-11-04T12:09:18Z

This is still based on old runtime parts (JobManager), though the interface allows it to be ported to the new runtime (JobMaster). As the new one is about to supersede the old one, it might be sensible to port this to the new one first.

aljoscha · 2019-10-15T11:35:15Z

I'm closing this as "Abandoned", since there is no more activity and the code base has moved on quite a bit. Please re-open this if you feel otherwise and work should continue.

mxm · 2019-10-15T12:01:22Z

I agree that this is obsolete now. Thanks!

mxm force-pushed the FLINK-4272 branch from a3f5cc0 to 8735ff8 Compare November 1, 2016 11:06

mxm force-pushed the FLINK-4272 branch from 8735ff8 to e807245 Compare November 1, 2016 11:21

tillrohrmann requested changes Nov 9, 2016


aljoscha reviewed Nov 9, 2016


rmetzger added component=API/DataStream component=CommandLineClient labels Mar 14, 2019

aljoscha closed this Oct 15, 2019

aljoscha self-assigned this Oct 20, 2019

flinkbot removed the component=CommandLineClient label Mar 17, 2022
