Skip to content

SAMZA-2685: Add job coordinator which does not do resource management#1529

Merged
cameronlee314 merged 7 commits intoapache:masterfrom
cameronlee314:static_resource
Sep 28, 2021
Merged

SAMZA-2685: Add job coordinator which does not do resource management#1529
cameronlee314 merged 7 commits intoapache:masterfrom
cameronlee314:static_resource

Conversation

@cameronlee314
Copy link
Contributor

@cameronlee314 cameronlee314 commented Sep 10, 2021

Feature: Adding a job coordinator which does not do resource management. For Samza on YARN, the ClusterBasedJobCoordinator does resource management. However, for Samza on Kubernetes, it is not necessary to have a job coordinator which does resource management, because Kubernetes controllers can take care of resource management.

Changes:

  1. Added a StaticResourceJobCoordinator which handles responsibilities like job model calculation, communicating job model to workers, and startpoint fanout.
  2. Added new abstraction layer CoordinatorCommunication for communication between job coordinator and workers. Before, the only communication option was an HTTP endpoint. The new abstraction layer allows us to start decoupling the coordination from the HTTP endpoint. This PR doesn't expose an option to plug in a custom communication layer yet, but there is an interface to start working off of.

Testing:

  1. Added unit tests
  2. Tested a Samza job using this new coordinator in Kubernetes

API changes (all changes are backwards compatible):

  1. Set the config job.coordinator.factory to org.apache.samza.coordinator.staticresource.StaticResourceJobCoordinatorFactory in order to use the new coordinator.
  2. Set the config job.coordinator.restart.signal.factory to define how to restart the Samza job when an input stream changes which will change the job model. This plug-in is dependent on where the Samza job is running (e.g. Kubernetes). Currently, there is only a no-op implementation of this restart signal.

Note: This PR reuses components like JobModelHelper, JobCoordinatorMetadataManager, and StartpointManager across ClusterBasedJobCoordinator and StaticResourceJobCoordinator. I considered consolidating ClusterBasedJobCoordinator and StaticResourceJobCoordinator even further to share code which encapsulates usage of multiple of the above components, but it's not yet clear how to fit the new and old flows together cleanly. There might be further divergence between the coordinators as we iterate on on StaticResourceJobCoordinator, so I felt it would be easier to leave them more decoupled for now. Also, keeping them decoupled reduces risk that a change will impact the existing ClusterBasedJobCoordinator.

Copy link
Contributor

@kw2542 kw2542 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a significant change, do we want a SEP for it?

@cameronlee314
Copy link
Contributor Author

This looks like a significant change, do we want a SEP for it?

@kw2542 That's a good point about having a SEP. Do you think it would be reasonable to get some implementation checked in and tried out in Kubernetes a little more before posting a SEP? I do already have some context to put into a SEP now if you think that would be better.
Does anyone else have thoughts on posting the SEP now vs. later?

@cameronlee314
Copy link
Contributor Author

This looks like a significant change, do we want a SEP for it?

@kw2542 That's a good point about having a SEP. Do you think it would be reasonable to get some implementation checked in and tried out in Kubernetes a little more before posting a SEP? I do already have some context to put into a SEP now if you think that would be better.
Does anyone else have thoughts on posting the SEP now vs. later?

Synced with @mynameborat about this. We will update the existing SEP-20 for Samza on Kubernetes to include this new job coordination flow.

@alnzng
Copy link
Contributor

alnzng commented Sep 23, 2021

LGTM

Copy link
Contributor

@mynameborat mynameborat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment. Looks good to me!

@mynameborat
Copy link
Contributor

Minor comment. Looks good to me!

There are some test failures. Doesn't seem related to your change although I'd just make sure if that is the case.

@cameronlee314
Copy link
Contributor Author

Minor comment. Looks good to me!

There are some test failures. Doesn't seem related to your change although I'd just make sure if that is the case.

This error has been happening on some other PRs as well, so it's very unlikely related to this change. It is being handled separately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants