[FLINK-14762][client] Enrich JobClient API #10311

tisonkun · 2019-11-25T09:35:42Z

What is the purpose of the change

This pull request is a rebase on master of #10185.

We generally enrich the JobClient API as described in FLIP-74, and let ClusterClient#submitJob return a CompletableFuture of JobClient.

For now I think we may or may not introduce dedicated tests, because the only implementation is a thin wrapper around ClusterClient. Maybe we can defer the test suite until other implementations arrive, because testing a wrapper gains us little.

Another tricky thing is lifecycle management, i.e. whether or not to close the ClusterClient when the ClusterClientJobClientAdapter is closed. Currently I use a boolean parameter moveOwnership to set this explicitly, but I am still looking for other solutions.

Verifying this change

This change is already covered by existing tests.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
The serializers: (no)
The runtime per-record code paths (performance sensitive): (no)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
The S3 file system connector: (no)

Documentation

Does this pull request introduce a new feature? (yes, it enriches JobClient which is to be public API).
If yes, how is the feature documented? (JavaDocs)

cc @aljoscha @kl0u

flinkbot · 2019-11-25T09:39:00Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit e661746 (Wed Dec 04 15:10:14 UTC 2019)

Warnings:

No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details

The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

flinkbot · 2019-11-25T10:02:24Z

CI report:

aadc1cf : FAILURE Build
daf85a7 : FAILURE Build
7fe9b0e : FAILURE Build
fe4dfd5 : SUCCESS Build
7d98e67 : FAILURE Build
08b96bb : SUCCESS Build
880bec8 : UNKNOWN
16fd227 : SUCCESS Build
2fb0f5b : UNKNOWN
bd5087f : SUCCESS Build
ca1cab6 : CANCELED Build
6d8f1af : FAILURE Build
7221e66 : SUCCESS Build
e661746 : FAILURE Build

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run travis re-run the last Travis build

kl0u

Hi @tisonkun, thanks a lot for the work. As an overarching comment, I noticed that you add a flag moveOwnership to the ClusterClientJobClientAdapter and that this also bubbles up to ClusterClient.submitJob().

I am wondering if this is needed, and from a conceptual point of view, I lean towards no. As a first point, I noticed that the only places where this is set to false are the ClientUtils.submitJob... methods, which are mainly used in tests. And second, why not always close the ClusterClient when the JobClient closes (which is the responsibility of the call site), and just let the user decide when to close the job client?

If the user wants to do something with the JobClient, he/she can create a new one (although we still need to figure out how to "retrieve" a jobClient).

WDYT?

kl0u · 2019-11-25T13:03:48Z

Also, we could simply return CompletableFuture<Void> from the JobClient.cancel() method instead of moving classes around. I do not have a strong opinion on this, but do you think there is any particular reason for returning an Acknowledge instead of Void?

tisonkun · 2019-11-26T01:14:07Z

> Hi @tisonkun, thanks a lot for the work. As an overarching comment, I noticed that you add a flag moveOwnership to the ClusterClientJobClientAdapter and that this also bubbles up to ClusterClient.submitJob().
>
> I am wondering if this is needed, and from a conceptual point of view, I lean towards no. As a first point, I noticed that the only places where this is set to false are the ClientUtils.submitJob... methods, which are mainly used in tests. And second, why not always close the ClusterClient when the JobClient closes (which is the responsibility of the call site), and just let the user decide when to close the job client?
>
> If the user wants to do something with the JobClient, he/she can create a new one (although we still need to figure out how to "retrieve" a jobClient).
>
> WDYT?

Yes, I also tend not to do so. At that moment I was a bit delirious from thinking about whether or not to close the cluster client when the job client is closed :/

As for shutting things down, I think whether or not we close the cluster client when the job client closes should still be configurable, because the cluster client "spawns" the job client and we may call submitJob multiple times within one cluster client (typically on a job management platform). We want neither to spawn a cluster client per job nor to close the shared cluster client when a job client is closed. The point here is: who is responsible for closing the cluster client, the job client or the caller?

I pushed a commit daf85a7 that allows customizing the action taken on close, to define this behavior while keeping code quality in mind. Does it make sense to you?
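
The ownership question above can be pictured with a minimal sketch. It is not the code from daf85a7: `SharedClusterClient` and `SketchJobClient` are hypothetical stand-ins, and the only assumption is that the job client receives its close behavior as a `Runnable` from whoever creates it.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class CloseActionSketch {

    /** Hypothetical stand-in for a cluster client that may serve several jobs. */
    static final class SharedClusterClient implements AutoCloseable {
        final AtomicInteger closeCalls = new AtomicInteger();

        @Override
        public void close() {
            closeCalls.incrementAndGet();
        }
    }

    /** Hypothetical job client: the creator decides what closing it means. */
    static final class SketchJobClient implements AutoCloseable {
        private final Runnable onClose;
        private final AtomicBoolean closed = new AtomicBoolean();

        SketchJobClient(Runnable onClose) {
            this.onClose = onClose;
        }

        @Override
        public void close() {
            // Run the close action at most once, even if close() is called again.
            if (closed.compareAndSet(false, true)) {
                onClose.run();
            }
        }
    }

    public static void main(String[] args) {
        // Platform case: many jobs share one cluster client, so closing a
        // job client does nothing and the caller closes the shared client.
        SharedClusterClient shared = new SharedClusterClient();
        SketchJobClient first = new SketchJobClient(() -> { });
        SketchJobClient second = new SketchJobClient(() -> { });
        first.close();
        second.close();
        System.out.println("shared closed " + shared.closeCalls.get() + " time(s)");
        shared.close();

        // Single-job case: the job client owns the cluster client and closes it.
        SharedClusterClient owned = new SharedClusterClient();
        SketchJobClient owner = new SketchJobClient(owned::close);
        owner.close();
        owner.close();
        System.out.println("owned closed " + owned.closeCalls.get() + " time(s)");
    }
}
```

Passing the action in keeps the "who closes the cluster client" decision at the call site rather than encoding it as a boolean flag.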

tisonkun · 2019-11-26T01:19:20Z

> Also, we could simply return CompletableFuture<Void> from the JobClient.cancel() method instead of moving classes around. I do not have a strong opinion on this, but do you think there is any particular reason for returning an Acknowledge instead of Void?

I did consider Void. Void should work well atm. My concerns are:

If we keep in mind the possibility of an RPC-based implementation of JobClient: in the current implementation, Akka doesn't allow null messages. Although we aren't tied to Akka as the RPC implementation, a non-null unit value is better than null representing Void.
Unfortunately Java doesn't have a built-in non-null unit value, so many Java projects have to implement their own. I think Acknowledge is Flink's unit value, which it is reasonable to move into flink-core.

On this point, I don't insist on using Acknowledge and could just use Void if you ask, or better, if you share some advantages of Void.
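
To make the Void versus unit-value trade-off concrete, here is a small sketch. `Ack` is a hypothetical unit type written for illustration only and is not Flink's Acknowledge class; both `cancelWith...` methods are likewise made up.

```java
import java.util.concurrent.CompletableFuture;

public class UnitValueSketch {

    /** Hypothetical unit type: exactly one instance, and it is never null. */
    enum Ack { INSTANCE }

    /** A future of Void can only ever complete normally with null. */
    static CompletableFuture<Void> cancelWithVoid() {
        return CompletableFuture.completedFuture(null);
    }

    /** A future of a unit type completes with a real, non-null object. */
    static CompletableFuture<Ack> cancelWithAck() {
        return CompletableFuture.completedFuture(Ack.INSTANCE);
    }

    public static void main(String[] args) {
        System.out.println(cancelWithVoid().join());
        System.out.println(cancelWithAck().join());
    }
}
```

The non-null result matters only if the value has to travel through a layer that rejects null messages; for a purely local API both signatures behave the same from the caller's point of view.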

kl0u

Hi @tisonkun ! Thanks for the work.

I left some comments and some additional remarks are:

I would suggest not moving the JobStatus class from the runtime to the core, but rather creating a copy of the enum in the core, with a method that maps the runtime JobStatus to a core JobStatus. The JobClient will return a core JobStatus.
I would not move the Acknowledge to the core but make the corresponding methods of the JobClient return a Void.
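
The first suggestion, a copied enum plus a mapping method, could look roughly like the sketch below. Both enums are hypothetical and carry only a reduced, assumed set of constants; the mapping by name is one possible choice, not something stated in the review.

```java
public class JobStatusMappingSketch {

    /** Hypothetical stand-in for the runtime-side enum. */
    enum RuntimeJobStatus { CREATED, RUNNING, FAILED, CANCELED, FINISHED }

    /** Hypothetical core-side copy that a JobClient would expose. */
    enum CoreJobStatus {
        CREATED, RUNNING, FAILED, CANCELED, FINISHED;

        /**
         * Maps by constant name. valueOf throws IllegalArgumentException if
         * the two enums ever drift apart, so a mismatch fails loudly.
         */
        static CoreJobStatus fromRuntime(RuntimeJobStatus status) {
            return CoreJobStatus.valueOf(status.name());
        }
    }

    public static void main(String[] args) {
        for (RuntimeJobStatus status : RuntimeJobStatus.values()) {
            System.out.println(status + " -> " + CoreJobStatus.fromRuntime(status));
        }
    }
}
```

The cost of this approach is that two enums must be kept in sync, which is the concern raised later in the thread.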

...-clients/src/main/java/org/apache/flink/client/deployment/ClusterClientJobClientAdapter.java

flink-clients/src/main/java/org/apache/flink/client/program/rest/RestClusterClient.java

flink-clients/src/main/java/org/apache/flink/client/program/ClusterClient.java

flink-clients/src/main/java/org/apache/flink/client/ClientUtils.java

...clients/src/main/java/org/apache/flink/client/deployment/AbstractSessionClusterExecutor.java

flink-core/src/main/java/org/apache/flink/core/execution/JobClient.java

kl0u

Thanks for the work @tisonkun ! I left some minor comments. After integrating them and having a green light on Travis, feel free to merge.

flink-clients/src/main/java/org/apache/flink/client/deployment/AbstractJobClusterExecutor.java

...-clients/src/main/java/org/apache/flink/client/deployment/ClusterClientJobClientAdapter.java

flink-runtime/src/main/java/org/apache/flink/runtime/jobgraph/JobStatus.java

...-clients/src/main/java/org/apache/flink/client/deployment/ClusterClientJobClientAdapter.java

flink-core/src/main/java/org/apache/flink/api/common/JobStatus.java

...-clients/src/main/java/org/apache/flink/client/deployment/ClusterClientJobClientAdapter.java

aljoscha · 2019-11-28T09:59:45Z

...-clients/src/main/java/org/apache/flink/client/deployment/ClusterClientJobClientAdapter.java

+	}
+
+	@Override
+	public CompletableFuture<Map<String, OptionalFailure<Object>>> getAccumulators(ClassLoader classLoader) {


Why not use Optional here? It seems OptionalFailure is a bit strange because it does not behave like an Optional, it's more like an Either type.

Actually, I think it's better to just return a CompletableFuture<Map<String, Object>>. We can fail the future if there is any failure in the map but otherwise just return a complete map. I don't think it's useful for the user to have to do all the unpacking. What do you think?

We already use Map<String, OptionalFailure<Object>> as the accumulator type in JobExecutionResult & ClusterClient. How we treat such unpacking is an independent investigation and out of scope here, in my opinion. Shall we open a ticket to consider it and keep this one from becoming too complex?
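
The unpacking proposed in the review comment, failing the whole future if any entry failed, can be sketched as follows. `ValueOrFailure` is a hypothetical Either-like holder written for this example; it is not Flink's OptionalFailure, and `unwrap` is an assumed helper name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

public class AccumulatorUnwrapSketch {

    /** Hypothetical holder: either a value or a failure, never both. */
    static final class ValueOrFailure<T> {
        final T value;
        final Throwable failure;

        private ValueOrFailure(T value, Throwable failure) {
            this.value = value;
            this.failure = failure;
        }

        static <T> ValueOrFailure<T> of(T value) {
            return new ValueOrFailure<>(value, null);
        }

        static <T> ValueOrFailure<T> ofFailure(Throwable failure) {
            return new ValueOrFailure<>(null, failure);
        }
    }

    /** Returns a plain map, or fails the future on the first failed entry. */
    static CompletableFuture<Map<String, Object>> unwrap(
            CompletableFuture<Map<String, ValueOrFailure<Object>>> wrapped) {
        return wrapped.thenApply(map -> {
            Map<String, Object> result = new HashMap<>();
            for (Map.Entry<String, ValueOrFailure<Object>> entry : map.entrySet()) {
                ValueOrFailure<Object> holder = entry.getValue();
                if (holder.failure != null) {
                    // Throwing inside thenApply completes the returned future exceptionally.
                    throw new CompletionException(
                            "Accumulator '" + entry.getKey() + "' failed", holder.failure);
                }
                result.put(entry.getKey(), holder.value);
            }
            return result;
        });
    }

    public static void main(String[] args) {
        Map<String, ValueOrFailure<Object>> accumulators = new HashMap<>();
        accumulators.put("records", ValueOrFailure.of(42L));
        System.out.println(unwrap(CompletableFuture.completedFuture(accumulators)).join());
    }
}
```

With this shape the user never handles the wrapper type, at the price of losing the successful entries whenever a single accumulator fails.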

tisonkun · 2019-11-28T16:31:05Z

Addressed the comments. @aljoscha please take a look at the update as well as my reply above.

tisonkun · 2019-11-29T02:06:04Z

Cherry-picked from the Slack channel. Feel free to react wherever you like.

Sorry, but when rebasing I cannot convince myself why we introduce a flink-core variant of JobStatus. ClusterClient will return the runtime JobStatus while JobClient returns the core JobStatus. Introducing such a difference doesn't make sense to me.
The runtime version of JobStatus doesn't depend on anything inside runtime; it is a self-contained enum. Shall we add it to o.a.f.api.common? Unlike ClosureCleaner, which could be used by connectors, I think JobStatus was previously a totally internal concept, so moving it should not break user setups and dependencies.

I've pushed a set of commits that we all agree on. The remaining problems are about getJobStatus and getAccumulator.

For getJobStatus, the main concern is where JobStatus lives and whether we introduce a variant of JobStatus. My opinion is above.
For getAccumulator, the main concern is whether Flink does the unpacking job for the user. I think we can do so, but maybe in a separate pull request, so that we first move forward with the set that has consensus.
So my idea is that we commit this set of commits as part 1 of FLINK-14762, and I start a new pull request to refactor getAccumulator and then implement its JobClient interface. Meanwhile, let's align on JobStatus.
Another point about JobStatus is that we already display this sort of status in the WebUI, so it is reasonable for it to be a core/common API (at least it is effectively user-facing).

tisonkun · 2019-11-29T02:55:55Z

tison 10:48 AM
I narrowed the change set to only unwrap accumulators inside the client code. Here is the diff: 6d8f1af
So the remaining concern from my side is the core variant of JobStatus. I will be OK with it if you can describe how we deal with these two JobStatus enums in the future.

aljoscha · 2019-11-29T13:24:27Z

This looks good for JobStatus now!

aljoscha · 2019-11-29T13:26:45Z

I think this is good to merge now! 💐

tisonkun · 2019-11-29T13:51:15Z

Travis fails intermittently on a known issue, https://issues.apache.org/jira/browse/FLINK-14894, which cannot be reproduced locally.

merging now...

thanks for your review!

rmetzger added review=description? component=Client/JobSubmission labels Nov 25, 2019

aljoscha self-assigned this Nov 25, 2019

aljoscha requested review from aljoscha and kl0u November 25, 2019 11:21

kl0u reviewed Nov 25, 2019

kl0u self-assigned this Nov 25, 2019

tisonkun force-pushed the FLINK-14762 branch from aadc1cf to daf85a7 Compare November 26, 2019 01:12

tisonkun mentioned this pull request Nov 26, 2019

[FLINK-14762][client] Implement ClusterClientJobClientAdapter #10185

Closed

tisonkun force-pushed the FLINK-14762 branch from 12debf3 to fe4dfd5 Compare November 26, 2019 05:01

kl0u requested changes Nov 27, 2019

tisonkun force-pushed the FLINK-14762 branch from fe4dfd5 to 7d98e67 Compare November 27, 2019 09:25

tisonkun commented Nov 27, 2019

flink-core/src/main/java/org/apache/flink/core/execution/JobClient.java (outdated)

kl0u approved these changes Nov 28, 2019

aljoscha reviewed Nov 28, 2019

tisonkun added 6 commits November 29, 2019 09:37

[FLINK-14762][client] Handle clients close gracefully

8ad05b3

[FLINK-14762][client] ClusterClient#submitJob returns CompletableFuture of JobID

944d08a

[FLINK-14762][tests] Introduce TestingJobClient

034e778

[FLINK-14762][client] Implement JobClient#cancel

7bf7bc0

[FLINK-14762][client] Implement JobClient#stopWithSavepoint

7a655d5

[FLINK-14762][client] Implement JobClient#triggerSavepoint

ca1cab6

tisonkun force-pushed the FLINK-14762 branch from bd5087f to ca1cab6 Compare November 29, 2019 01:56

[FLINK-14762][client] Implement JobClient#getAccumulators

7221e66

tisonkun force-pushed the FLINK-14762 branch from 6d8f1af to 7221e66 Compare November 29, 2019 04:12

tisonkun added 2 commits November 29, 2019 18:50

[FLINK-14762][api] Move JobStatus to flink-core

65ef2d8

[FLINK-14762][client] Implement JobClient#getJobStatus

e661746

tisonkun closed this in 3898a4b Nov 29, 2019

tisonkun deleted the FLINK-14762 branch November 29, 2019 13:55

5 participants