SAMZA-2378: Container Placements support for Standby containers enabled jobs #1281

Merged

mynameborat merged 5 commits into apache:master from Sanil15:SAMZA-2378-standby-failover

Feb 28, 2020

Contributor

Sanil15 commented Feb 18, 2020

Design: SEP-22: Container Placements in Samza

Changes: This PR adds the following abilities:

Issue container placement actions on the active and standby containers of standby-enabled jobs
Queue an action on a container if an action is already in flight on its active or standby counterpart
Validate placement actions when standby containers are enabled, so that standby constraints (an active container and its standby container are never on the same host) are not violated
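
As a rough illustration of the standby constraint described above, here is a minimal sketch, not the code in this PR. The class name, the method name, and the shape of both maps are hypothetical; the only thing taken from the description is the rule that an active container and its standby must not share a host.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Illustrative only: every name here is hypothetical, not a Samza class. */
final class StandbyConstraintSketch {
  private StandbyConstraintSketch() { }

  /**
   * Returns true when placing containerId on targetHost keeps it on a different
   * host than each of its active/standby counterparts.
   *
   * @param counterparts containerId -> ids of its active/standby counterparts (assumed shape)
   * @param runningHosts containerId -> host the container currently runs on (assumed shape)
   */
  static boolean meetsStandbyConstraints(
      String containerId,
      String targetHost,
      Map<String, List<String>> counterparts,
      Map<String, String> runningHosts) {
    for (String other : counterparts.getOrDefault(containerId, Collections.emptyList())) {
      // A counterpart already running on the target host violates the constraint.
      if (targetHost.equals(runningHosts.get(other))) {
        return false;
      }
    }
    return true;
  }
}
```

A container with no registered counterparts passes the check trivially, which in this sketch is also the behavior for jobs without standby containers.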

Tests:
Integ test: End-to-end test for standby enabled job is added
Manual testing matrix: Container Placement Test Plan

API Changes:

This PR introduces a behavior change: when a container placement action is written to the metastore and read by the JobCoordinator (JC), the JC tries to reserve resources for the request. If it gets the resources, it attempts to stop the active container and then start the container on the allocated resources.
Each placement request is identified by a UUID, which is generated and returned to the client when it issues a control command
Responses for each placement action can be retrieved by querying the metastore directly by UUID

Upgrade Instructions: None

Usage Instructions: Instantiate ContainerPlacementMetadataStore to write container placement messages to the metastore, then query it by the generated UUID
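
The write-then-query-by-UUID flow can be sketched with an in-memory stand-in. This is not the ContainerPlacementMetadataStore API: the class, its methods, and the string-valued statuses below are all hypothetical and only show the shape of the interaction.

```java
import java.util.Map;
import java.util.Optional;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory stand-in; NOT the ContainerPlacementMetadataStore API. */
final class PlacementStoreSketch {
  private final Map<UUID, String> requests = new ConcurrentHashMap<>();
  private final Map<UUID, String> responses = new ConcurrentHashMap<>();

  /** Records a placement request and returns the UUID that identifies it. */
  UUID writeRequest(String processorId, String destinationHost) {
    UUID uuid = UUID.randomUUID();
    requests.put(uuid, processorId + " -> " + destinationHost);
    return uuid;
  }

  /** Records a response for a known request; unknown UUIDs are ignored. */
  void writeResponse(UUID uuid, String status) {
    if (requests.containsKey(uuid)) {
      responses.put(uuid, status);
    }
  }

  /** Looks up the latest response for a request, if one has been written. */
  Optional<String> readResponse(UUID uuid) {
    return Optional.ofNullable(responses.get(uuid));
  }
}
```

In this sketch a client calls writeRequest, keeps the returned UUID, and polls readResponse with it until a response appears.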


          Container Placements support for Standby containers enabled jobs

539ffd3

Sanil15 requested a review from rmatharu-zz

February 18, 2020 23:53

rmatharu-zz reviewed

samza-core/src/main/java/org/apache/samza/clustermanager/ContainerManager.java

@@ -48,8 +48,7 @@
                * ContainerManager encapsulates logic and state related to container placement actions like move, restarts for active container
                * if issued externally.
                *
-               * TODO SAMZA-2378: Container Placements for Standby containers enabled jobs
-               *      SAMZA-2379: Container Placements for job running in degraded state
+               * TODO SAMZA-2379: Container Placements for job running in degraded state

Contributor

rmatharu-zz Feb 21, 2020

We should probably prioritize this sooner rather than later

Contributor Author

Sanil15 Feb 21, 2020

Yup following up with RB

rmatharu-zz reviewed

samza-core/src/main/java/org/apache/samza/clustermanager/ContainerManager.java

                       LOG.info("Waiting for running container to shutdown due to existing ContainerPlacement action {}", actionMetaData);
                       return false;
                     } else if (actionStatus == ContainerPlacementMetadata.ContainerStatus.STOPPED) {
-                      allocator.runStreamProcessor(request, preferredHost);
+                      // If the job has standby containers enabled, always check standby constraints before issuing a start on container

Contributor

rmatharu-zz Feb 21, 2020 (edited)

Nit:

Can the lines above

if (hasActiveContainerPlacementAction(request.getProcessorId())) {
String processorId = request.getProcessorId();
ContainerPlacementMetadata actionMetaData = getPlacementActionMetadata(processorId).get();

be rewritten to:

Optional<ContainerPlacementMetadata> actionMetaData = getPlacementActionMetadata(processorId);
if (actionMetaData.isPresent()) {

and then use actionMetaData.get() everywhere after...

Contributor Author

Sanil15 Feb 21, 2020

hasActiveContainerPlacementAction checks that the metadata of the action is in either ACCEPTED or IN_PROGRESS; it is not a check for the presence of metadata

rmatharu-zz reviewed

samza-core/src/main/java/org/apache/samza/clustermanager/ContainerManager.java

+                        // Fallback to source host since the new allocated resource does not meet standby constraints
+                        allocator.requestResource(processorId, actionMetaData.getSourceHost());
+                        markContainerPlacementActionFailed(actionMetaData,
+                            String.format("allocated resource %s does not meet standby constraints now, falling back to source host", allocatedResource));

Contributor

rmatharu-zz Feb 21, 2020 (edited)

Update failedStandbyAllocations metric?

Also, it may be possible to expose a method on standby-container-manager to do the following:
resourceRequestState.releaseUnstartableContainer();
resourceRequestState.cancelResourceRequest(request);
containerAllocator.requestResource();

because this is already done in standby-container-manager as part of the checkStandbyConstraintsAndRunStreamProcessor method

Contributor Author

Sanil15 Feb 21, 2020

Update failedStandbyAllocations metric?

No, because we did not initiate a standby failover here. The user initiated two individual requests: a move of the standby to x-host and a move of the active to the standby host. These are two independent requests, so we do not need a metric update

Also it may be possible to expose a method on standby-container-manager to do the

Sure, will do

rmatharu-zz reviewed

samza-core/src/main/java/org/apache/samza/clustermanager/ContainerManager.java

+                        markContainerPlacementActionFailed(actionMetaData,
+                            String.format("allocated resource %s does not meet standby constraints now, falling back to source host", allocatedResource));
+                      } else {
+                        LOG.info("Status updated for ContainerPlacement action: ", actionMetaData);

Contributor

rmatharu-zz Feb 21, 2020

Status wasn't updated here?

Contributor Author

Sanil15 Feb 21, 2020

This method is invoked by the Allocator thread only when the active container has been stopped successfully; the signal for that successful stop comes from the AMRMClientAsync thread. Hence the state change (made by AMRMClientAsync) is already reflected in the metadata that is logged here

rmatharu-zz reviewed

samza-core/src/main/java/org/apache/samza/clustermanager/ContainerManager.java Outdated

Comment on lines 442 to 444

+                  if (hasActiveContainerPlacementAction(requestMessage.getProcessorId())
+                      || checkStandbyOrActiveContainerHasActivePlacementAction(requestMessage.getProcessorId())) {
+                    if (standbyContainerManager.isPresent()) {

Contributor

rmatharu-zz Feb 21, 2020 (edited)

This logic seems twisted and makes the code hard to read.
We check whether there is an active action on the container id, or whether there is an active action on its active/standby counterparts, and only after that do we check whether a standby-container-manager is present?

Would it be possible to
a. first check whether a standby-container-manager is present?
Or
b. have hasActiveContainerPlacementAction encapsulate both the check in (a) and the check for an active action on the active/standby counterparts of the given processor-id?
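
For illustration, option (b) might look roughly like the sketch below. It is not the refactor that was eventually merged. The class, the Status enum, the two maps, and the standbyEnabled flag are hypothetical, and the sketch assumes standby ids have the form "<activeId>-<replica>".

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of option (b); names and data shapes are assumptions. */
final class ActiveActionSketch {
  enum Status { ACCEPTED, IN_PROGRESS, SUCCEEDED, FAILED }

  private final Map<String, Status> latestStatus;            // processorId -> latest action status
  private final Map<String, List<String>> standbysByActive;  // activeId -> standby ids
  private final boolean standbyEnabled;                      // stands in for standbyContainerManager.isPresent()

  ActiveActionSketch(Map<String, Status> latestStatus,
      Map<String, List<String>> standbysByActive, boolean standbyEnabled) {
    this.latestStatus = latestStatus;
    this.standbysByActive = standbysByActive;
    this.standbyEnabled = standbyEnabled;
  }

  /** An action is in flight only while it is ACCEPTED or IN_PROGRESS. */
  private boolean isInFlight(String processorId) {
    Status status = latestStatus.get(processorId);
    return status == Status.ACCEPTED || status == Status.IN_PROGRESS;
  }

  /** Single entry point: checks the container itself, then its counterparts. */
  boolean hasActiveContainerPlacementAction(String processorId) {
    if (isInFlight(processorId)) {
      return true;
    }
    if (!standbyEnabled) {
      return false;
    }
    // Assumption: a standby id looks like "<activeId>-<replica>".
    int separator = processorId.indexOf('-');
    if (separator >= 0) {
      return isInFlight(processorId.substring(0, separator));
    }
    for (String standby : standbysByActive.getOrDefault(processorId, Collections.emptyList())) {
      if (isInFlight(standby)) {
        return true;
      }
    }
    return false;
  }
}
```

With this shape the caller makes one call and does not need to know whether standby containers are enabled.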

Contributor Author

Sanil15 Feb 22, 2020

I see your point, let me refactor it

rmatharu-zz reviewed

samza-core/src/main/java/org/apache/samza/clustermanager/ContainerManager.java Outdated

                  *
                  * @param requestMessage container placement request message
                  * @return Pair<ContainerPlacementMessage.StatusCode, String> which is status code & response suggesting if the request is valid
                  */
                 private Pair<ContainerPlacementMessage.StatusCode, String> validatePlacementAction(ContainerPlacementRequestMessage requestMessage) {
-                  String errorMessagePrefix = String.format("ContainerPlacement request: %s is rejected due to", requestMessage);
+                  String errorMessagePrefix = ContainerPlacementMessage.StatusCode.BAD_REQUEST + " reason: ";

Contributor

rmatharu-zz Feb 21, 2020

Instead can we define errorMessage here
as ContainerPlacementMessage.StatusCode.BAD_REQUEST + " reason: %s"

and later String.format(errorMessage, reason);
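
A minimal sketch of this suggestion, using a local stand-in enum rather than ContainerPlacementMessage.StatusCode; the class, constant, and method names are hypothetical.

```java
/** Hypothetical sketch; StatusCode here is a local stand-in, not the Samza enum. */
final class ErrorMessageSketch {
  enum StatusCode { BAD_REQUEST }

  // Template defined once; the reason is filled in at each rejection site.
  static final String ERROR_MESSAGE = StatusCode.BAD_REQUEST + " reason: %s";

  private ErrorMessageSketch() { }

  static String rejection(String reason) {
    return String.format(ERROR_MESSAGE, reason);
  }
}
```

Keeping the placeholder in the template means each rejection site passes only the reason, instead of concatenating onto a shared prefix.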

Contributor Author

Sanil15 Feb 22, 2020

sure

rmatharu-zz reviewed

samza-core/src/main/java/org/apache/samza/clustermanager/ContainerManager.java Outdated

Comment on lines 496 to 515

+                private boolean checkStandbyOrActiveContainerHasActivePlacementAction(String processorId) {
+                  if (standbyContainerManager.isPresent()) {
+                    // If requested placement action is on a standby container and its active container has a placement request,
+                    // this request shall not be de-queued until in-flight action on active container is complete
+                    if (StandbyTaskUtil.isStandbyContainer(processorId) && hasActiveContainerPlacementAction(
+                        StandbyTaskUtil.getActiveContainerId(processorId))) {
+                      return true;
+                    }
+                    // If requested placement action is on a standby container and its active container has a placement request,
+                    // this request shall not be de-queued until in-flight action on active container is complete
+                    if (!StandbyTaskUtil.isStandbyContainer(processorId)) {
+                      for (String standby : standbyContainerManager.get().getStandbyList(processorId)) {
+                        if (hasActiveContainerPlacementAction(standby)) {
+                          return true;
+                        }
+                      }
+                    }
+                  }
+                  return false;
+                }

Contributor

rmatharu-zz Feb 21, 2020

This method can be simplified/inlined with hasActiveAction?

Contributor Author

Sanil15 Feb 22, 2020

done


          Addressing feedback

f4976d3

rmatharu-zz reviewed

samza-core/src/main/java/org/apache/samza/clustermanager/StandbyContainerManager.java

Comment on lines +431 to +433

+                List<String> getStandbyList(String activeContainerId) {
+                  return this.standbyContainerConstraints.get(activeContainerId);
+                }

Contributor

rmatharu-zz Feb 21, 2020 (edited)

See the comment above on simplifying.
You can add additional methods here to simplify the caller's logic, so that the caller does not need to know the internals of the standby container manager.

Contributor Author

Sanil15 Feb 22, 2020

So I need this exposed from StandbyContainerManager because I need to refer to metadata maintained in ContainerManager for checking active actions on each standby replica, the method here just exposes the list of standby replicas!

rmatharu-zz reviewed

Contributor

rmatharu-zz left a comment

took a pass

Sanil15 added 2 commits

February 21, 2020 17:11


          Adressing comments inlining methods for checking active actions

81e8509


          Add metrics for failed container placement actions

3cfb544

Sanil15 mentioned this pull request

SAMZA-2379: Support Container Placements for job running in degraded state #1297

Merged

rmatharu-zz approved these changes

Contributor

rmatharu-zz left a comment

Seems like a test is failing; check with @lakshmi-manasa-g whether it's relevant/related.

Contributor

lakshmi-manasa-g commented Feb 28, 2020

Yes, the failing test (at org.apache.samza.system.azureblob.avro.TestAzureBlobOutputStream.testClose) in the last build here is being fixed in #1298.

For now, do an empty commit to trigger another build and it should most likely pass. This test is flaky and won't fail in all Travis builds.


          remove unwanted file

680de86

mynameborat merged commit d022167 into apache:master
