SAMZA-2340: [Container Placements] Introduce ContainerManager for handling validation for failures & starts of active & standby containers#1180
Conversation
bfe0792 to
e33af80
Compare
f0f090b to
6827db8
Compare
…ilures & starts for active & standby containers
6827db8 to
a9f06da
Compare
|
@rmatharu please have a look |
There was a problem hiding this comment.
can this and standbyContainerManager be made final?
| * Callbacks issued from {@link ClusterResourceManager} aka {@link ContainerProcessManager} are intercepted | ||
| * by ContainerManager to handle container failure and completions for both active and standby containers | ||
| */ | ||
| public class ContainerManager { |
There was a problem hiding this comment.
Would it be possible to put together an interface:
public interface ContainerManager {
handleContainerLaunch(...)
handleContainerStop(...)
handleContainerLaunchFail(...)
handleExpiredResourceRequest(...)
}
that this class and StandbyContainerManager implement ?
There was a problem hiding this comment.
I tried it before (master...Sanil15:container-placement) but I did not see any value in adding a new contract because if we add this interface
public interface Actions {
void handleContainerStop();
void handleContainerLaunchFail();
void handleContainerLaunchSuccess();
void handleContainerExpiredRequests();
}
Then we would have this
ActiveContainerManager implements ContainerManager {...}
StandByContainerManager implements ContainerManager {...}
We would need a composition between ActiveContainerManager and StandByContainerManager. ContainerProcessManager and ContainerAllocator will also compose ActiveContainerManager, at that time if they compose this new interface ContainerManager vs a concrete class ActiveContainerManager does it matter?
Perhaps it would have made sense to add this if we had Multilevel Inheritance between these classes:
Container Manager <----- ActiveContainerManager <==== StandByContainerManager
samza-core/src/main/java/org/apache/samza/clustermanager/ContainerProcessManager.java
Outdated
Show resolved
Hide resolved
rmatharu-zz
left a comment
There was a problem hiding this comment.
Took a pass. One main suggestion, others are minor.
rmatharu-zz
left a comment
There was a problem hiding this comment.
Looks good to me.
Please feel free to check in after testing for
a. Job with host-affinity disabled, verify deployments and values for CPMMetrics.
b. Job with host-affinity enabled, verify deployments and values for CPMMetrics.
c. Job with standby enabled, check for constraints being met with large num containers (20-30).
d. Job with standby enabled, single node failure -- failover, and failover metrics in CPMMetrics.
e. Job with standby enabled, two simultaneous node failure with RF=2 and RF=3.
for d, e can use the lxc-setup,
emulate e by
lxc-stop -k -n node1 && lxc-stop -k -n node2.
|
@rmatharu @prateekm Please see the doc: https://docs.google.com/spreadsheets/d/1v-fw0pHxKRobGkALDCno4FuPCsBhdepQ86vIGLHWu54/edit#gid=0&range=13:13 Covers all the test cases for testing |
Summary
What does this PR ADD?
Rational
Testing:
Refactored unit test and tested jobs with LXC and real yarn cluster for several use cases, list of all tests & results compiled here: https://docs.google.com/spreadsheets/d/1v-fw0pHxKRobGkALDCno4FuPCsBhdepQ86vIGLHWu54/edit#gid=0
Note: This PR does not add any new behaviors in the AppMaster ecosystem.