[FLINK-4606] [cluster management] Integrate the new ResourceManager with the existed FlinkResourceManager #2540

beyond1920 · 2016-09-23T07:33:21Z

This PR integrates the new ResourceManager with the existing FlinkResourceManager. The main changes are:

Move the useful RPC communication from the existing FlinkResourceManager to the new ResourceManager, e.g. registering an infoMessageListener, unregistering an infoMessageListener, and shutDownCluster.
Make ResourceManager an abstract class and extract the framework-specific behavior.
Implement a standalone ResourceManager based on the new base ResourceManager class.
Modify the test cases that are affected by the abstract ResourceManager class.
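
The split between an abstract base class and a framework-specific subclass can be sketched roughly as follows. This is only an illustration under simplifying assumptions: `ResourceManagerSketch`, `StandaloneSketch`, and the `String`-based worker type are hypothetical stand-ins, not the classes touched by this PR.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for an abstract resource manager.
abstract class ResourceManagerSketch<W> {
    private final Map<String, W> workers = new HashMap<>();

    // Framework-independent bookkeeping lives in the base class.
    final W registerTaskExecutor(String resourceId) {
        W worker = workerStarted(resourceId);
        workers.put(resourceId, worker);
        return worker;
    }

    int numberOfWorkers() {
        return workers.size();
    }

    // Framework-specific hook, implemented per deployment mode.
    protected abstract W workerStarted(String resourceId);
}

// Hypothetical standalone variant: there is nothing to allocate,
// so the worker is only wrapped and returned.
class StandaloneSketch extends ResourceManagerSketch<String> {
    @Override
    protected String workerStarted(String resourceId) {
        return "standalone-worker-" + resourceId;
    }
}
```

A caller only talks to the base-class API (`registerTaskExecutor`), while each deployment mode supplies its own `workerStarted`.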

…t members This closes apache#2351

This PR introduces a generic AkkaRpcActor which receives rpc calls as a RpcInvocation message. The RpcInvocation message is generated by the AkkaInvocationHandler which gets them from automatically generated Java Proxies. Add documentation for proxy based akka rpc service Log unknown message type in AkkaRpcActor but do not fail actor Use ReflectionUtil to extract RpcGateway type from RpcEndpoint This closes apache#2357.
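
For readers unfamiliar with proxy-based RPC, the idea of turning a method call on a generated Java proxy into an invocation message can be sketched like this. It uses only the JDK's `java.lang.reflect.Proxy`; `ProxyRpcSketch`, `GreeterGateway`, and `Invocation` are hypothetical names, and the real AkkaInvocationHandler and RpcInvocation may look quite different.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

class ProxyRpcSketch {
    // Hypothetical gateway interface with a single void method.
    interface GreeterGateway {
        void greet(String name);
    }

    // Hypothetical message describing one method call.
    static final class Invocation {
        final String methodName;
        final Object[] args;

        Invocation(String methodName, Object[] args) {
            this.methodName = methodName;
            this.args = args;
        }
    }

    // Creates a proxy that records each gateway call as an Invocation
    // in the outbox instead of executing it locally.
    static GreeterGateway createProxy(List<Invocation> outbox) {
        InvocationHandler handler = (proxy, method, args) -> {
            outbox.add(new Invocation(method.getName(), args));
            return null; // sufficient for void methods only
        };
        return (GreeterGateway) Proxy.newProxyInstance(
                GreeterGateway.class.getClassLoader(),
                new Class<?>[] {GreeterGateway.class},
                handler);
    }

    public static void main(String[] args) {
        List<Invocation> outbox = new ArrayList<>();
        createProxy(outbox).greet("flink");
        System.out.println(outbox.get(0).methodName);
    }
}
```

In a real system the outbox would be replaced by sending the message to the actor that hosts the endpoint.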

This closes apache#2360

…dpoint's main thread

This PR introduces an eager serialization for remote rpc invocation messages. That way it is possible to check whether the message is serializable and whether it exceeds the maximum allowed akka frame size. If either of these constraints is violated, a proper exception is thrown instead of simply swallowing the exception as Akka does it. Address PR comments This closes apache#2365.
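
The eager check described above can be illustrated with plain Java serialization. This is a minimal sketch, assuming a byte-size limit passed in as `maxFrameSize`; the class and method names are hypothetical, and the exception type is chosen only for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

class EagerSerializationSketch {
    // Serializes the message up front so that both failure modes surface
    // at the call site instead of being swallowed later.
    static byte[] serializeEagerly(Object message, int maxFrameSize) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(message);
        } catch (IOException e) {
            // Includes NotSerializableException for non-serializable messages.
            throw new IllegalArgumentException("Message is not serializable.", e);
        }
        byte[] payload = bytes.toByteArray();
        if (payload.length > maxFrameSize) {
            throw new IllegalArgumentException(
                    "Serialized message of " + payload.length
                            + " bytes exceeds the maximum frame size of "
                            + maxFrameSize + " bytes.");
        }
        return payload;
    }
}
```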

…ourceProfile [FLINK-4373] [cluster management] address comments This closes apache#2370.

…arted This PR allows the AkkaRpcActor to stash messages until the corresponding RpcEndpoint has been started. When receiving a Processing.START message, the AkkaRpcActor unstashes all messages and starts processing rpcs. When receiving a Processing.STOP message, it will stop processing messages and stash incoming messages again. Add test case for message stashing This closes apache#2358.
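
The start/stop stashing behavior can be modeled without Akka as a small state machine. The following is a sketch under that simplification; `StashingSketch` and its methods are hypothetical and do not use Akka's stash support.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class StashingSketch {
    private final Queue<String> stash = new ArrayDeque<>();
    private final List<String> processed = new ArrayList<>();
    private boolean started = false;

    // Corresponds to a START signal: replay stashed messages in arrival order.
    void start() {
        started = true;
        String message;
        while ((message = stash.poll()) != null) {
            processed.add(message);
        }
    }

    // Corresponds to a STOP signal: later messages are stashed again.
    void stop() {
        started = false;
    }

    void receive(String message) {
        if (started) {
            processed.add(message);
        } else {
            stash.add(message);
        }
    }

    List<String> processed() {
        return processed;
    }
}
```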

…tration at ResourceManager. This closes apache#2353

…system class loader.

The RpcGateway.getAddress method allows to retrieve the fully qualified address of the associated RpcEndpoint. This closes apache#2392.

This closes apache#2394.

…sourceManager registration. This closes apache#2395.

…asters Adapt related components to the changes in HighAvailabilityServices Add comments for getJobMasterElectionService in HighAvailabilityServices This closes apache#2377.

…itance This commit extends the RpcCompletenessTest such that it can now check for inherited remote procedure calls. All methods defined at the RpcGateway are considered native. This means that they need no RpcEndpoint counterpart because they are implemented by the RpcGateway implementation. This closes apache#2401. update comments remove native method annotation add line break

The recovery mode is not used any more by the latest CheckpointCoordinator. All difference in recovery logic between high-availability and non-high-availability is encapsulated in the HighAvailabilityServices.

…torToResourceManagerConnection

… java This closes apache#2400

This closes apache#2388

- add serial rpc service
- add a special rpcService implementation which directly executes the asynchronous calls serially one by one, it is just for testcase
- Change ResourceManagerLeaderContender code and TestingSerialRpcService code
- override shutdown logic to stop leadershipService
- use a mocked RpcService rather than TestingSerialRpcService for resourceManager HA test

This closes apache#2427

…r out of the rpc package The TaskExecutor, the JobMaster and the ResourceManager were still contained in the rpc package. With this commit, they will be moved out of this package. Now they are contained in dedicated packages on the o.a.f.runtime level. This closes apache#2438.

… as protected Give main thread execution context into the TaskExecutorToResourceManagerConnection

…not reachable This PR introduces a RpcConnectionException which is thrown if the rpc endpoint is not reachable when calling RpcService.connect. This closes apache#2405.

- associates JobMasters with JobID instead of InstanceID
- adds TaskExecutorGateway to slot
- adds SlotManager as RM constructor parameter
- adds LeaderRetrievalListener to SlotManager to keep track of the leader id
- tests the interaction JM->RM requestSlot
- tests the interaction RM->TM requestSlot

This closes apache#2463

Flink's future abstraction whose API is similar to Java 8's CompletableFuture. That's in order to ease a future transition to this class once we ditch Java 7. The current set of operations comprises:

- isDone to check the completion of the future
- get/getNow to obtain the future's value
- cancel to cancel the future (best effort basis)
- thenApplyAsync to transform the future's value into another value
- thenAcceptAsync to register a callback for a successful completion of the future
- exceptionallyAsync to register a callback for an exception completion of the future
- thenComposeAsync to transform the future's value and flatten the returned future
- handleAsync to register a callback which is called either with the regular result or the exceptional result

Additionally, Flink offers a CompletableFuture which can be completed with a regular value or an exception:

- complete/completeExceptionally

Complete FlinkCompletableFuture exceptionally with a CancellationException upon cancel

This closes apache#2472.
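
Since the abstraction is modeled on Java 8's CompletableFuture, the intended usage pattern can be illustrated with the JDK class itself. Note that this sketch uses `java.util.concurrent.CompletableFuture`, not Flink's future classes, and `FutureOperationsSketch` is a hypothetical name.

```java
import java.util.concurrent.CompletableFuture;

class FutureOperationsSketch {
    // Transforms a value and flattens a nested future,
    // using thenApplyAsync followed by thenComposeAsync.
    static int lengthPlusOne(String input) {
        CompletableFuture<String> source = new CompletableFuture<>();
        CompletableFuture<Integer> result = source
                .thenApplyAsync(String::length)
                .thenComposeAsync(len -> CompletableFuture.completedFuture(len + 1));
        source.complete(input);
        return result.join();
    }

    // handleAsync sees either the regular value or the failure.
    static String recover() {
        CompletableFuture<String> source = new CompletableFuture<>();
        CompletableFuture<String> handled = source.handleAsync(
                (value, failure) -> failure == null ? value : "fallback");
        source.completeExceptionally(new IllegalStateException("boom"));
        return handled.join();
    }
}
```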

…d ResourceSlot

…d TaskExecutor

… submission & setting up the ExecutionGraph This closes apache#2480

mxm

Thanks for the PR! The CI reports a checkstyle error.

This closes apache#2526.

mxm · 2016-09-26T13:27:52Z

flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java

@@ -66,15 +67,16 @@
 *     <li>{@link #requestSlot(SlotRequest)} requests a slot from the resource manager</li>
 * </ul>
 */
-public class ResourceManager extends RpcEndpoint<ResourceManagerGateway> implements LeaderContender {
+public abstract class ResourceManager<ResourceManagerGateway, WorkerType extends TaskExecutorRegistration> extends RpcEndpoint implements LeaderContender {


I believe this should be

ResourceManager<WorkerType extends TaskExecutorRegistration> extends RpcEndpoint<ResourceManagerGateway>...

The RpcCompletenessTest might have to be adapted for this to work.

@mxm, I initially used ResourceManager<WorkerType extends TaskExecutorRegistration> extends RpcEndpoint<ResourceManagerGateway>, but starting a subclass of this ResourceManager failed with an exception. For example, given public class StandaloneResourceManager extends ResourceManager<TaskExecutorRegistration>, starting this ResourceManager calls AkkaRpcService#startServer, which throws an exception because selfGatewayType is mistaken for the TaskExecutorRegistration class. So I changed it to ResourceManager<ResourceManagerGateway, WorkerType extends TaskExecutorRegistration> extends RpcEndpoint.

Yes, I see. I'll modify RpcEndpoint and RpcCompletenessTest for this to work.
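
The failure discussed here is consistent with a reflective lookup that reads the first type argument of the immediate generic superclass. The sketch below reproduces that effect with hypothetical stand-in classes (`Endpoint`, `Rm`, `StandaloneRm`, `RmGateway`, `WorkerRegistration`). Whether Flink's ReflectionUtil worked exactly this way is an assumption, and `gatewayTypeOf` is only one possible remedy, not necessarily the fix that was applied.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

class GatewayTypeSketch {
    // Hypothetical stand-ins for RpcEndpoint, the gateway, and the worker type.
    static class Endpoint<G> {}
    interface RmGateway {}
    static class WorkerRegistration {}

    // The declaration suggested in the review.
    static abstract class Rm<W extends WorkerRegistration> extends Endpoint<RmGateway> {}
    static class StandaloneRm extends Rm<WorkerRegistration> {}

    // Naive lookup: first type argument of the immediate generic superclass.
    static Class<?> naiveGatewayType(Class<?> clazz) {
        Type superType = clazz.getGenericSuperclass();
        if (superType instanceof ParameterizedType) {
            Type first = ((ParameterizedType) superType).getActualTypeArguments()[0];
            if (first instanceof Class) {
                return (Class<?>) first;
            }
        }
        throw new IllegalArgumentException("No type argument found for " + clazz);
    }

    // Walks up the hierarchy until the direct superclass is Endpoint itself.
    static Class<?> gatewayTypeOf(Class<?> clazz) {
        for (Class<?> c = clazz; c != null; c = c.getSuperclass()) {
            Type superType = c.getGenericSuperclass();
            if (superType instanceof ParameterizedType) {
                ParameterizedType parameterized = (ParameterizedType) superType;
                if (parameterized.getRawType() == Endpoint.class) {
                    Type first = parameterized.getActualTypeArguments()[0];
                    if (first instanceof Class) {
                        return (Class<?>) first;
                    }
                }
            }
        }
        throw new IllegalArgumentException("No gateway type found for " + clazz);
    }
}
```

With these stand-ins, the naive lookup on `StandaloneRm` yields the worker type rather than the gateway type, which mirrors the selfGatewayType mix-up reported above.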

mxm · 2016-09-26T13:48:26Z

Thank you for your changes. I'm trying to incorporate them in flip-6 now.

beyond1920 · 2016-09-27T06:01:30Z

@mxm, thanks for your review. I modified the PR based on your advice:

Fixed the checkstyle error, the AkkaRpcActorTest test case and the RpcCompletenessTest test case. Sorry for those mistakes; I will take more care next time.
About the ResourceManager: I initially used ResourceManager<WorkerType extends TaskExecutorRegistration> extends RpcEndpoint<ResourceManagerGateway>, but starting a subclass of this ResourceManager failed with an exception. For example, given public class StandaloneResourceManager extends ResourceManager<TaskExecutorRegistration>, starting this ResourceManager calls AkkaRpcService#startServer, which throws an exception because selfGatewayType is mistaken for the TaskExecutorRegistration class. So I changed it to ResourceManager<ResourceManagerGateway, WorkerType extends TaskExecutorRegistration> extends RpcEndpoint.

mxm · 2016-09-27T10:32:00Z

flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java

+	 * @param resourceID The worker resource id
+	 * @param taskExecutorGateway the task executor gateway
+	 */
+	protected abstract WorkerType workerStarted(ResourceID resourceID, TaskExecutorGateway taskExecutorGateway);


This is missing all the other abstract methods of the old ResourceManager. We will need requestNewWorkers, releasePendingWorker, releaseStartedWorker, and reacceptRegisteredWorkers.

@mxm, I omitted these methods for the following reasons:

We may need the requestNewWorkers method. However, the numWorkers parameter alone is not enough to allocate a certain number of new workers; the expected ResourceProfile of each worker also needs to be passed in.

We may need the releaseStartedWorker method. It is used to release started TaskExecutors when the ResourceManager receives a RemoveResource request, but I could not find any place that sends this request, so I omitted it.

We don't need the reacceptRegisteredWorkers method. It was used to consolidate the TaskExecutor view between the ResourceManager and the JobManager when the ResourceManager reconnects to the JobManager after a ResourceManager restart. In the new cluster management mode, however, the JobManager doesn't keep a view of the live TaskExecutors. The ResourceManager is responsible for receiving TaskExecutor registrations and maintaining the TaskExecutor view, so this method is not needed.

We don't need the releasePendingWorker method. It was only used to release pending requests when the ResourceManager consolidates the TaskExecutor view with the JobManager after a ResourceManager restart. As explained above, this logic is not needed in the new cluster management mode.

This closes #2540

mxm · 2016-09-28T15:43:17Z

This has been merged. Thank you. Could you close the PR?

tillrohrmann and others added 30 commits September 21, 2016 11:39

[FLINK-4346] [rpc] Add new RPC abstraction

04bcb71

[FLINK-4368] [distributed runtime] Eagerly initialize the RPC endpoin…

94e0092

…t members This closes apache#2351

[FLINK-4384] [rpc] Add "scheduleRunAsync()" to the RpcEndpoint

f5614a4

This closes apache#2360

[FLINK-4392] [rpc] Make RPC Service thread-safe

86f21bf

[FLINK-4386] [rpc] Add a utility to verify calls happen in the Rpc En…

4ca049b

…dpoint's main thread

[FLINK-4373] [cluster management] Introduce SlotID, AllocationID, Res…

9159ad6

…ourceProfile [FLINK-4373] [cluster management] address comments This closes apache#2370.

[FLINK-4355] [cluster management] Implement TaskManager side of regis…

fe90811

…tration at ResourceManager. This closes apache#2353

[FLINK-4403] [rpc] Use relative classloader for proxies, rather than …

6899837

…system class loader.

[FLINK-4414] [cluster] Add getAddress method to RpcGateway

d9baa58

The RpcGateway.getAddress method allows to retrieve the fully qualified address of the associated RpcEndpoint. This closes apache#2392.

[FLINK-4434] [rpc] Add a testing RPC service.

e6b0f12

This closes apache#2394.

[FLINK-4355] [cluster management] Add tests for the TaskManager -> Re…

5ea97a1

…sourceManager registration. This closes apache#2395.

[FLINK-4400] [cluster mngmt] Implement leadership election among JobM…

8fd8c99

…asters Adapt related components to the changes in HighAvailabilityServices Add comments for getJobMasterElectionService in HighAvailabilityServices This closes apache#2377.

[hotfix] Remove RecoveryMode from JobMaster

9e90412

The recovery mode is not used any more by the latest CheckpointCoordinator. All difference in recovery logic between high-availability and non-high-availability is encapsulated in the HighAvailabilityServices.

[hotfix] [clustermgnt] Set pending registration properly in TaskExecu…

ffd20e9

…torToResourceManagerConnection

[FLINK-4363] Implement TaskManager basic startup of all components in…

35e8010

… java This closes apache#2400

[FLINK-4347][cluster management] Implement SlotManager core

ba78de8

This closes apache#2388

[FLINK-4528] [rpc] Marks main thread execution methods in RpcEndpoint…

b779d19

… as protected Give main thread execution context into the TaskExecutorToResourceManagerConnection

[hotfix] Add self rpc gateway registration to TestingSerialRpcService

12af3b1

[FLINK-4451] [rpc] Throw RpcConnectionException when rpc endpoint is …

9718dcd

…not reachable This PR introduces a RpcConnectionException which is thrown if the rpc endpoint is not reachable when calling RpcService.connect. This closes apache#2405.

[hotfix] [taskmanager] Fixes TaskManager component creation at startup

2630543

[hotfix] Remove unused imports from SlotRequestRegistered/Rejected an…

a04c11c

…d ResourceSlot

[hotfix] Add methods defined in the gateway to the ResourceManager an…

04fbdb3

…d TaskExecutor

[FLINK-4408] [JobManager] Introduce JobMasterRunner and implement job…

3cda593

… submission & setting up the ExecutionGraph This closes apache#2480

mxm requested changes Sep 26, 2016

[FLINK-4580] [rpc] Report rpc invocation exceptions to the caller

2a61e74

This closes apache#2526.

mxm reviewed Sep 26, 2016

beyond1920 added 2 commits September 27, 2016 11:23

yarn slot manager

345dafc

integrate with existing FlinkResourceManager

25dd657

beyond1920 force-pushed the jira-4606 branch from 2f5ee42 to 25dd657 Compare September 27, 2016 04:14

change RpcCompletenessTest to adapted for abstract ResourceManager.

dcfda29

mxm reviewed Sep 27, 2016

asfgit force-pushed the flip-6 branch from ed5c83d to b955465 Compare September 28, 2016 08:21

mxm mentioned this pull request Sep 28, 2016

[FLINK-4703] RpcCompletenessTest: Add support for type arguments and subclasses #2561

Closed

asfgit pushed a commit that referenced this pull request Sep 28, 2016

[FLINK-4606] integrate features of old ResourceManager

3876630

This closes #2540

beyond1920 closed this Sep 29, 2016

asfgit pushed a commit that referenced this pull request Oct 2, 2016

[FLINK-4606] integrate features of old ResourceManager

eebe2c3

This closes #2540

asfgit pushed a commit that referenced this pull request Oct 6, 2016

[FLINK-4606] integrate features of old ResourceManager

04365c3

This closes #2540

StephanEwen pushed a commit to StephanEwen/flink that referenced this pull request Oct 14, 2016

[FLINK-4606] integrate features of old ResourceManager

1f198d8

This closes apache#2540

asfgit pushed a commit that referenced this pull request Oct 21, 2016

[FLINK-4606] integrate features of old ResourceManager

a3426fc

This closes #2540

tillrohrmann pushed a commit to tillrohrmann/flink that referenced this pull request Oct 31, 2016

[FLINK-4606] integrate features of old ResourceManager

5072345

This closes apache#2540

tillrohrmann pushed a commit to tillrohrmann/flink that referenced this pull request Oct 31, 2016

[FLINK-4606] integrate features of old ResourceManager

24242e4

This closes apache#2540

asfgit pushed a commit that referenced this pull request Nov 1, 2016

[FLINK-4606] integrate features of old ResourceManager

5219b40

This closes #2540

StephanEwen pushed a commit to StephanEwen/flink that referenced this pull request Dec 23, 2016

[FLINK-4606] integrate features of old ResourceManager

a35e582

This closes apache#2540

StephanEwen pushed a commit to StephanEwen/flink that referenced this pull request Dec 23, 2016

[FLINK-4606] integrate features of old ResourceManager

ed896b6

This closes apache#2540

StephanEwen pushed a commit to StephanEwen/flink that referenced this pull request Dec 23, 2016

[FLINK-4606] integrate features of old ResourceManager

9d1b5fb

This closes apache#2540

liuyuzhong7 pushed a commit to liuyuzhong7/flink that referenced this pull request Jan 4, 2017

[FLINK-4606] integrate features of old ResourceManager

02756c7

This closes apache#2540

liuyuzhong7 pushed a commit to liuyuzhong7/flink that referenced this pull request Jan 17, 2017

[FLINK-4606] integrate features of old ResourceManager

0346b92

This closes apache#2540

joseprupi pushed a commit to joseprupi/flink that referenced this pull request Feb 12, 2017

[FLINK-4606] integrate features of old ResourceManager

1cd3424

This closes apache#2540

rmetzger added the component=Runtime/Coordination label Mar 14, 2019
