
[FLINK-15086][client] Explicitly close JobCluster in attach mode #10451

Closed
wants to merge 1 commit

Conversation

tisonkun (Member) commented on Dec 6, 2019

What is the purpose of the change

If a JobCluster is started in attached execution mode, it waits for a client to request its result and then shuts itself down. However, there is no guarantee that a user who obtains a JobClient from executeAsync will ever call getJobExecutionResult, so the cluster may leak.

Since the choice between detached and attached mode is currently made by the user, my opinion is that we should not close the JobCluster from the MiniDispatcher in attached mode, but instead let the client send a shutdown request on close.
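For illustration, a minimal sketch of the leak scenario described above, assuming the Flink 1.10-era StreamExecutionEnvironment/JobClient API (class name and job name are made up):

```java
import org.apache.flink.core.execution.JobClient;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AttachedLeakSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(1, 2, 3).print();

        // executeAsync submits the job and returns immediately with a JobClient handle.
        JobClient jobClient = env.executeAsync("leak-demo");

        // If the program ends here without ever requesting the result via
        // jobClient.getJobExecutionResult(...), a per-job cluster started in
        // attached mode keeps waiting for that request and never shuts down.
        // That is the leak described above.
    }
}
```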

Verifying this change

Maybe an end-to-end test for the resource leak? I don't have an idea how to verify a per-job resource leak in the UT/IT framework.

Does this pull request potentially affect one of the following parts:

N/A

Documentation

N/A

flinkbot (Collaborator) commented on Dec 6, 2019

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 219d652 (Fri Dec 06 02:22:18 UTC 2019)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

flinkbot (Collaborator) commented on Dec 6, 2019

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build

tisonkun (Member, Author) commented on Dec 6, 2019

Verified locally:

|                        | without patch            | with patch               |
| ---------------------- | ------------------------ | ------------------------ |
| execute + detach       | cluster correctly closed | cluster correctly closed |
| execute + attach       | cluster correctly closed | cluster correctly closed |
| executeAsync + detach  | cluster correctly closed | cluster correctly closed |
| executeAsync + attach  | cluster not closed       | cluster correctly closed |

Actually, here are some observations.

  1. executeAsync + detach:
    • The cluster will close itself when the job finishes, so the JobClient might not be able to query the cluster before it is closed. Since this combination effectively means attached execution (you still communicate with the job through the JobClient) while configuring detached mode, it is undefined behavior. In this case, the recommended practice would be: fire env.executeAsync and forget it.
  2. executeAsync + attach:
    • Resource leak if the user doesn't call JobClient#close in the main method. OK, we document it and highlight it. Now that ownership of the cluster moves to the user program, the user should take care of it.
    • Cluster shutdown before the job finishes. This happens if the user doesn't wait for the job result but closes the JobClient before the job finishes. The JobClient is the handle through which you talk to the job; if you release it, the job is an orphan, so closing the cluster is OK.

TBH, for a streaming workload, which doesn't have a result anyway, it doesn't matter: we just fire & forget, and close it when needed.
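For reference, a hedged sketch of the user-side handling discussed above, assuming the Flink 1.10-era JobClient API (in 1.10 getJobExecutionResult takes the user class loader; later versions drop that parameter; class and job names are made up):

```java
import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.core.execution.JobClient;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AttachedResultSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements("a", "b", "c").print();

        JobClient jobClient = env.executeAsync("attached-demo");

        // Bounded job: explicitly request the result so that an attached
        // per-job cluster knows it may shut down.
        JobExecutionResult result = jobClient
                .getJobExecutionResult(Thread.currentThread().getContextClassLoader())
                .get();
        System.out.println("Job " + result.getJobID() + " finished");

        // Streaming job instead: there is no final result to wait for, so fire
        // and forget, and stop the job through the client when it is no longer
        // needed, for example jobClient.cancel().get().
    }
}
```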

aljoscha (Contributor) commented on Dec 6, 2019

I think for executeAsync() we should always execute the cluster in detached mode. This has the unfortunate consequence that the cluster might go down before the user can query anything, but I think this is OK for now. In the future, we can enhance the JobClient/ClusterClient to talk directly to YARN (or the underlying execution system); this way, we can still get the job status (and some results) after a per-job YARN cluster has shut down.
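For context, a minimal sketch of how attached vs. detached execution is selected on the client side. It assumes the execution.attached option (DeploymentOptions.ATTACHED) introduced with the FLIP-73 executors in Flink 1.10; setting it to false corresponds to the detached behavior suggested here:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.DeploymentOptions;

public class DetachedModeSketch {

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // "execution.attached" = false: the per-job cluster shuts itself down when
        // the job finishes instead of waiting for a client to fetch the result.
        conf.setBoolean(DeploymentOptions.ATTACHED, false);
        System.out.println(conf);
    }
}
```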

tisonkun (Member, Author) commented on Dec 6, 2019

OK, makes sense. I will add this constraint to the documentation.

Here is the status from my side: I'm afraid I cannot provide the documentation today, but IIRC documentation can be checked in concurrently with or after the release since it is not a feature.

tisonkun closed this on Dec 6, 2019