SAMZA-2340: [Container Placements] Introduce ContainerManager for handling validation for failures & starts of active & standby containers#1180

Merged
rmatharu-zz merged 4 commits into apache:master from Sanil15:container-placement-service on Nov 5, 2019

@Sanil15 Sanil15 commented Oct 4, 2019

Summary

  • Introduce a ContainerManager that can act as a single entity maintaining state for validation, starts & failures of any requests relating to active & standby containers
  • The ContainerProcessManager & ContainerAllocator should call the ContainerManager for any state validation & for fetching the next set of actions for successful & failed starts & stops of active & standby containers
  • ContainerManager should encapsulate the related state & logic behind StandBy & Active Container control actions

What does this PR ADD?

  • Introduces a new class ContainerManager
  • Removes StandByContainerManager from ContainerProcessManager & ContainerAllocator. Encapsulates StandByContainerManager inside ContainerManager
  • Moves the logic to determine the next set of actions on container stop failures & container launch failures from ContainerProcessManager to ContainerManager
  • Moves the logic of validating allocated resources & expired requests from ContainerAllocator to ContainerManager
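
The encapsulation described in the list above can be pictured with a minimal sketch. This is not the code from this PR: the class names ending in `Sketch`, the `Action` enum, the method signatures, and the decision rules are all hypothetical and only illustrate a single manager that owns the standby manager and tells its callers what to do next.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical "next action" returned to the caller; not a Samza type. */
enum Action { RESTART_ON_SAME_HOST, REQUEST_ANY_HOST, FAILOVER_TO_STANDBY }

/** Hypothetical stand-in for StandByContainerManager; only tracks standby ids. */
class StandbyManagerSketch {
  private final Set<String> standbyIds = new HashSet<>();

  void registerStandby(String standbyContainerId) {
    standbyIds.add(standbyContainerId);
  }

  boolean hasStandbyFor(String activeContainerId) {
    // Assumption for illustration: standby "X-standby" backs active container "X".
    return standbyIds.contains(activeContainerId + "-standby");
  }
}

/** Hypothetical single entry point that owns the standby manager. */
class ContainerManagerSketch {
  private final boolean hostAffinityEnabled;
  private final StandbyManagerSketch standbyManager; // null when standby is disabled

  ContainerManagerSketch(boolean hostAffinityEnabled, StandbyManagerSketch standbyManager) {
    this.hostAffinityEnabled = hostAffinityEnabled;
    this.standbyManager = standbyManager;
  }

  /** Decides the next action after a container stops; the rules are invented for this sketch. */
  Action handleContainerStop(String containerId) {
    if (standbyManager != null && standbyManager.hasStandbyFor(containerId)) {
      return Action.FAILOVER_TO_STANDBY;
    }
    return hostAffinityEnabled ? Action.RESTART_ON_SAME_HOST : Action.REQUEST_ANY_HOST;
  }
}
```

In this shape the callers (here, whatever plays the role of ContainerProcessManager or ContainerAllocator) hold only a `ContainerManagerSketch` reference and never touch the standby manager directly, which is the property the list above is aiming for.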

Rationale

  • Centralizing the logic and state maintenance for active & standby containers helps to add & test new features faster and improve code readability

Testing:
Refactored unit tests and tested jobs with LXC and a real YARN cluster for several use cases. The list of all tests & results is compiled here: https://docs.google.com/spreadsheets/d/1v-fw0pHxKRobGkALDCno4FuPCsBhdepQ86vIGLHWu54/edit#gid=0

Note: This PR does not add any new behaviors in the AppMaster ecosystem.

@Sanil15 Sanil15 force-pushed the container-placement-service branch from bfe0792 to e33af80 Compare October 4, 2019 22:08
@Sanil15 Sanil15 changed the title Samza-2340 [WIP][DO_NOT_REVIEW]: Introduce ContainerManager for handling, validation failures & starts for active & standby containers Samza-2340 [WIP][DO_NOT_REVIEW]: Introduce ContainerManager for handling validation for failures & starts of active & standby containers Oct 4, 2019
@Sanil15 Sanil15 force-pushed the container-placement-service branch 2 times, most recently from f0f090b to 6827db8 Compare October 10, 2019 21:46
…ilures & starts for active & standby containers
@Sanil15 Sanil15 force-pushed the container-placement-service branch from 6827db8 to a9f06da Compare October 10, 2019 22:30
@Sanil15 Sanil15 changed the title Samza-2340 [WIP][DO_NOT_REVIEW]: Introduce ContainerManager for handling validation for failures & starts of active & standby containers Samza-2340 Introduce ContainerManager for handling validation for failures & starts of active & standby containers Oct 10, 2019

Sanil15 commented Oct 10, 2019

@rmatharu please have a look

Contributor

Can this and standbyContainerManager be made final?

Contributor Author

Yup we can

* Callbacks issued from {@link ClusterResourceManager} aka {@link ContainerProcessManager} are intercepted
* by ContainerManager to handle container failure and completions for both active and standby containers
*/
public class ContainerManager {
Contributor

Would it be possible to put together an interface:

public interface ContainerManager {
handleContainerLaunch(...)
handleContainerStop(...)
handleContainerLaunchFail(...)
handleExpiredResourceRequest(...)
}

that this class and StandbyContainerManager implement ?

Contributor Author

I tried it before (master...Sanil15:container-placement), but I did not see any value in adding a new contract. If we add this interface:

public interface Actions {
  void handleContainerStop();
  void handleContainerLaunchFail();
  void handleContainerLaunchSuccess();
  void handleContainerExpiredRequests();
}

Then we would have this

ActiveContainerManager implements ContainerManager {...}
StandByContainerManager implements ContainerManager {...}

We would need a composition between ActiveContainerManager and StandByContainerManager. ContainerProcessManager and ContainerAllocator would also compose ActiveContainerManager. At that point, does it matter whether they compose the new ContainerManager interface or the concrete ActiveContainerManager class?

Perhaps it would have made sense to add this if we had multilevel inheritance between these classes:

Container Manager <----- ActiveContainerManager <==== StandByContainerManager
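
To make the trade-off above concrete, here is a minimal sketch of the interface-plus-composition variant. It is not code from either branch: `ContainerActions`, `ActiveActionsSketch`, `StandbyActionsSketch`, the `String` container ids, and the forwarding behavior are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical contract along the lines discussed above; signatures are assumed. */
interface ContainerActions {
  void handleContainerStop(String containerId);
  void handleContainerLaunchFail(String containerId);
}

/** Hypothetical standby-side implementation that records the calls it receives. */
class StandbyActionsSketch implements ContainerActions {
  final List<String> log = new ArrayList<>();

  @Override
  public void handleContainerStop(String containerId) {
    log.add("standby-stop:" + containerId);
  }

  @Override
  public void handleContainerLaunchFail(String containerId) {
    log.add("standby-launch-fail:" + containerId);
  }
}

/** Hypothetical active-side implementation that composes the standby one. */
class ActiveActionsSketch implements ContainerActions {
  final List<String> log = new ArrayList<>();
  private final StandbyActionsSketch standby;

  ActiveActionsSketch(StandbyActionsSketch standby) {
    this.standby = standby;
  }

  @Override
  public void handleContainerStop(String containerId) {
    log.add("active-stop:" + containerId);
    standby.handleContainerStop(containerId); // forwarding is an assumption of this sketch
  }

  @Override
  public void handleContainerLaunchFail(String containerId) {
    log.add("active-launch-fail:" + containerId);
    standby.handleContainerLaunchFail(containerId);
  }
}
```

A caller can declare its field as `ContainerActions` or as `ActiveActionsSketch`; in both cases it ends up holding the same composed object, which is the reason given above for not adding the extra contract.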

Contributor

@rmatharu-zz rmatharu-zz left a comment

Took a pass. One main suggestion, others are minor.

@prateekm prateekm changed the title Samza-2340 Introduce ContainerManager for handling validation for failures & starts of active & standby containers SAMZA-2340: Introduce ContainerManager for handling validation for failures & starts of active & standby containers Oct 15, 2019
Contributor

@rmatharu-zz rmatharu-zz left a comment

Looks good to me.
Please feel free to check in after testing the following:
a. Job with host-affinity disabled: verify deployments and values for CPMMetrics.
b. Job with host-affinity enabled: verify deployments and values for CPMMetrics.
c. Job with standby enabled: check that constraints are met with a large number of containers (20-30).
d. Job with standby enabled, single node failure: verify failover and the failover metrics in CPMMetrics.
e. Job with standby enabled, two simultaneous node failures with RF=2 and RF=3.

For d and e you can use the lxc-setup; emulate e with
lxc-stop -k -n node1 && lxc-stop -k -n node2

@prateekm
Contributor

@rmatharu Let's hold off on merging this until we create the branch for the next release. Also @Sanil15, please confirm you've tested all the combinations Ray mentioned above before merge.

@Sanil15
Contributor Author

Sanil15 commented Oct 30, 2019

@Sanil15 Sanil15 changed the title SAMZA-2340: Introduce ContainerManager for handling validation for failures & starts of active & standby containers SAMZA-2340: [Container Placements] Introduce ContainerManager for handling validation for failures & starts of active & standby containers Nov 2, 2019
@rmatharu-zz rmatharu-zz merged commit 4f715d3 into apache:master Nov 5, 2019