Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-7201] fix concurrency in JobLeaderIdService when shutdown the … #4347

Closed
wants to merge 1 commit into from

Conversation

storm-dance
Copy link

No description provided.

@StephanEwen
Copy link
Contributor

@XuPingyong Can you give us a bit of context for the review?
From the initial exception I would expect that there is something that also needs to be addressed in the JobLeaderIdService class...

@storm-dance
Copy link
Author

@StephanEwen , rpcService of ResourceManager executes with only one single thread, so there is no conflicts when resourcemanager is in service. When resourceManager is shutdown by the other thread, the rpcService had better stop first.

@tillrohrmann
Copy link
Contributor

tillrohrmann commented Jul 26, 2017

The underlying problem is that components such as the JobLeaderIdService, the SlotManager or the HearbeatManager weren't designed to be accessed concurrently. The assumption was that there is only a single modifying thread. This actually also applies to the ResourceManager itself.

When another thread calls ResourceManager#shutdown, then it will shutdown its components, then stop itself (the stopping happens concurrently) and then modifies its internal state (from the calling thread which is not necessarily the main thread). Since we don't wait until the actor has been stopped, the state clean up can lead to a concurrent modification exception.

In order to solve the problem I would propose the following:

  1. Make RpcEndpoint#shutDown non blocking and change the semantic that it initiates the shut down of the RpcEndpoint instead of shutting down all internal services.
  2. Use the Actor#postStop to call an internalShutDown method of the RpcEndpoint where we close the services.
  3. Wherever RpcEndpoint#shutDown needs to be blocking, obtain the termination future of the RpcEndpoint and then wait on it.

That way it should also be possible to call shut down from within the RpcEndpoint's main thread and still guarantee a proper service shut down.

@tillrohrmann
Copy link
Contributor

With the changes of #4420, this problem should be resolved. Could you please close this PR then @XuPingyong.

@storm-dance
Copy link
Author

Thanks @tillrohrmann !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants