
[FLINK-15086][client] Explicitly close JobCluster in attach mode #10451

Closed
wants to merge 1 commit

Conversation

tisonkun (Member) commented on Dec 6, 2019

What is the purpose of the change

If a JobCluster is started in attached execution mode, it waits for a client to request its result and then shuts itself down. However, there is no guarantee that a user who obtains a JobClient from executeAsync will ever call getJobExecutionResult, so the cluster may leak.

Since the choice between detached and attached mode is currently made by the user, my opinion is that we should not close the JobCluster from the MiniDispatcher in attached mode, but instead let the client send a shutdown request on close.
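For illustration, a minimal sketch of the leak scenario described above, assuming the Flink 1.10-era StreamExecutionEnvironment/JobClient API (class name and job name are made up):

```java
import org.apache.flink.core.execution.JobClient;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AttachedLeakSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(1, 2, 3).print();

        // executeAsync submits the job and returns immediately with a JobClient handle.
        JobClient jobClient = env.executeAsync("leak-demo");

        // If the program ends here without ever requesting the result via
        // jobClient.getJobExecutionResult(...), a per-job cluster started in
        // attached mode keeps waiting for that request and never shuts down.
        // That is the leak described above.
    }
}
```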

Verifying this change

Maybe an end-to-end test for the resource leak? I don't have an idea how to verify a per-job resource leak in the UT/IT framework.

Does this pull request potentially affect one of the following parts:

N/A

Documentation

N/A

flinkbot (Collaborator) commented on Dec 6, 2019

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 219d652 (Fri Dec 06 02:22:18 UTC 2019)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

flinkbot (Collaborator) commented on Dec 6, 2019

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build

tisonkun (Member, Author) commented on Dec 6, 2019

Verified locally:

|                        | without patch            | with patch               |
| ---------------------- | ------------------------ | ------------------------ |
| execute + detach       | cluster correctly closed | cluster correctly closed |
| execute + attach       | cluster correctly closed | cluster correctly closed |
| executeAsync + detach  | cluster correctly closed | cluster correctly closed |
| executeAsync + attach  | cluster not closed       | cluster correctly closed |

Actually, here are some observations.

  1. executeAsync + detach:
    • The cluster will close itself when the job finishes, so the JobClient might not be able to query the cluster before it is closed. Since this combination effectively means attached execution (you still communicate with the job through the JobClient) while configuring detached mode, it is undefined behavior. In this case, the recommended practice would be: fire env.executeAsync and forget it.
  2. executeAsync + attach:
    • Resource leak if the user doesn't call JobClient#close in the main method. OK, we document it and highlight it. Now that ownership of the cluster moves to the user program, the user should take care of it.
    • Cluster shutdown before the job finishes. This happens if the user doesn't wait for the job result but closes the JobClient before the job finishes. The JobClient is the handle through which you talk to the job; if you release it, the job is an orphan, so closing the cluster is OK.

TBH, for a streaming workload, which doesn't have a result anyway, it doesn't matter: we just fire & forget, and close it when needed.
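For reference, a hedged sketch of the user-side handling discussed above, assuming the Flink 1.10-era JobClient API (in 1.10 getJobExecutionResult takes the user class loader; later versions drop that parameter; class and job names are made up):

```java
import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.core.execution.JobClient;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AttachedResultSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements("a", "b", "c").print();

        JobClient jobClient = env.executeAsync("attached-demo");

        // Bounded job: explicitly request the result so that an attached
        // per-job cluster knows it may shut down.
        JobExecutionResult result = jobClient
                .getJobExecutionResult(Thread.currentThread().getContextClassLoader())
                .get();
        System.out.println("Job " + result.getJobID() + " finished");

        // Streaming job instead: there is no final result to wait for, so fire
        // and forget, and stop the job through the client when it is no longer
        // needed, for example jobClient.cancel().get().
    }
}
```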

aljoscha (Contributor) commented on Dec 6, 2019

I think for executeAsync() we should always execute the cluster in detached mode. This has the unfortunate consequence that the cluster might go down before the user can query anything, but I think this is OK for now. In the future, we can enhance the JobClient/ClusterClient to talk directly to YARN (or the underlying execution system); this way, we can still get the job status (and some results) after a per-job YARN cluster has shut down.
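For context, a minimal sketch of how attached vs. detached execution is selected on the client side. It assumes the execution.attached option (DeploymentOptions.ATTACHED) introduced with the FLIP-73 executors in Flink 1.10; setting it to false corresponds to the detached behavior suggested here:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.DeploymentOptions;

public class DetachedModeSketch {

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // "execution.attached" = false: the per-job cluster shuts itself down when
        // the job finishes instead of waiting for a client to fetch the result.
        conf.setBoolean(DeploymentOptions.ATTACHED, false);
        System.out.println(conf);
    }
}
```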

tisonkun (Member, Author) commented on Dec 6, 2019

OK, makes sense. I will add this constraint to the documentation.

Here is the status from my side: I'm afraid I cannot provide the documentation today, but IIRC documentation can be checked in concurrently with or after the release since it is not a feature.

tisonkun closed this on Dec 6, 2019