[FLINK-7201] fix concurrency in JobLeaderIdService when shutdown the … #4347

storm-dance · 2017-07-15T02:58:08Z

No description provided.

…ResourceManager

StephanEwen · 2017-07-17T18:36:26Z

@XuPingyong Can you give us a bit of context for the review?
From the initial exception I would expect that there is something that also needs to be addressed in the JobLeaderIdService class...

storm-dance · 2017-07-18T03:27:42Z

@StephanEwen, the rpcService of the ResourceManager executes with a single thread, so there are no conflicts while the ResourceManager is in service. When the ResourceManager is shut down by another thread, the rpcService should be stopped first.

tillrohrmann · 2017-07-26T16:53:54Z

The underlying problem is that components such as the JobLeaderIdService, the SlotManager, or the HeartbeatManager weren't designed to be accessed concurrently. The assumption was that there is only a single modifying thread. This also applies to the ResourceManager itself.

When another thread calls ResourceManager#shutdown, the ResourceManager shuts down its components, stops itself (the stopping happens concurrently), and then modifies its internal state from the calling thread, which is not necessarily the main thread. Since we don't wait until the actor has been stopped, this state cleanup can lead to a concurrent modification exception.

In order to solve the problem I would propose the following:

Make RpcEndpoint#shutDown non-blocking and change its semantics so that it initiates the shutdown of the RpcEndpoint instead of shutting down all internal services.
Use Actor#postStop to call an internalShutDown method of the RpcEndpoint, where we close the services.
Wherever RpcEndpoint#shutDown needs to be blocking, obtain the termination future of the RpcEndpoint and wait on it.

That way it should also be possible to call shut down from within the RpcEndpoint's main thread and still guarantee a proper service shut down.
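The proposal above could be sketched roughly as follows. This is an illustration only, not Flink's implementation: `SketchEndpoint`, `registerJob`, and the single-threaded executor standing in for the actor's main thread are all hypothetical, and `internalShutDown` plays the role that the proposal assigns to the Actor#postStop hook.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for an RpcEndpoint; not Flink's actual class. */
class SketchEndpoint {
    // Stand-in for the actor's main thread: the only thread that touches state.
    private final ExecutorService mainThread = Executors.newSingleThreadExecutor();
    private final CompletableFuture<Void> terminationFuture = new CompletableFuture<>();
    private final AtomicBoolean shutDownInitiated = new AtomicBoolean(false);
    // Deliberately not thread-safe, like the services discussed above.
    private final Set<String> jobIds = new HashSet<>();

    /** Enqueues a state change on the main thread (rejected once terminated). */
    void registerJob(String jobId) {
        mainThread.execute(() -> jobIds.add(jobId));
    }

    /** Non-blocking: only initiates the shutdown and returns immediately. */
    void shutDown() {
        if (!shutDownInitiated.compareAndSet(false, true)) {
            return; // shutdown was already initiated by an earlier call
        }
        mainThread.execute(() -> {
            try {
                internalShutDown();
                terminationFuture.complete(null);
            } catch (Throwable t) {
                terminationFuture.completeExceptionally(t);
            } finally {
                mainThread.shutdown(); // stop accepting further work
            }
        });
    }

    /** Runs on the main thread, so the cleanup cannot race with other tasks. */
    private void internalShutDown() {
        jobIds.clear();
    }

    /** Completes once the internal shutdown has run on the main thread. */
    CompletableFuture<Void> getTerminationFuture() {
        return terminationFuture;
    }
}
```

A caller that needs blocking behaviour would call `shutDown()` and then wait on the future, for example with `endpoint.getTerminationFuture().join()`. Because the cleanup is enqueued rather than executed by the caller, the same call also works when issued from the main thread itself.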

tillrohrmann · 2017-07-31T09:26:39Z

With the changes of #4420, this problem should be resolved. Could you please close this PR then, @XuPingyong?

storm-dance · 2017-07-31T09:31:07Z

Thanks @tillrohrmann !

[FLINK-7201] fix concurrency in JobLeaderIdService when shutdown the …

2c04107

…ResourceManager

storm-dance closed this Jul 31, 2017

rmetzger added the component=Runtime/Coordination label Mar 14, 2019
