Feature/job groups #24558
c26e6c7 to 3e93096
/test
Job 'Cilium-PR-K8s-1.25-kernel-4.19' failed.
Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed.
Haven't looked at the implementation yet. I noticed that the job groups API isn't documented as part of the hive API in Documentation/contributing/development/hive.rst in this PR, but it probably should be.
Fixed the description (was talking about workgroups rather than |
Ah, yes. I did not think of that. That seems like a good addition.
Great start! I think it makes sense to not add all the bells and whistles into this in one PR, but rather try to keep it minimal like it is. What I'd like to see is good godocs (strongly consider interface types for the API to minimize the visible API surface), decent tests of all different job types, and maybe think a bit more about the naming.
154378a to 35fb96f
/test
Job 'Cilium-PR-K8s-1.25-kernel-4.19' hit: #24648 (86.56% similarity)
Job 'Cilium-PR-K8s-1.16-kernel-4.19' failed.
Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed.
operator/cmd/root.go (Outdated)
@@ -106,6 +107,9 @@ var (
}
}),

// Provides a global job collection which cells can use to spawn job groups.
s/collection/registry/. The comment feels very jargony. Wonder if its purpose could be explained more clearly?
Very cool. One question:
Is the goal just to replace instances where we're doing
35fb96f to 394c6e8
Not just those instances. Also instances where we have
/test-1.27-net-next
/test-1.26-4.19
69971d0 to f43a4a6
The net-next test was hitting the same flake again. The net-next failure might be due to a new test added on main which causes stale branches to fail, so I rebased this branch to test that theory.
Would be nice to have the Guide to the Hive docs as part of this PR, ref. #24558 (review)
f43a4a6 to 4af2e11
4af2e11 to 74024c1
/test
This commit introduces a new package called `job`, which provides a `job.Collection` and a `job.Group`. The collection is created once as a cell in the hive. Other cells then take the collection as an input, make a new Group from the collection, and assign it to their own cell structure. The group has a number of functions to add different job types:
* AddOneShot - Adds a 'one shot' job which runs once and can exit early or remain running for the entire life of the cell. On error, the job can be retried.
* AddTimer - Adds a timer job which runs at a given interval or optionally when triggered.
* AddObserver - Adds an observer job which triggers on every item sent over the given stream.Observable.
The group implements hive.HookInterface and thus can be added directly to a lifecycle. The jobs are invoked with contexts linked to the cell's lifecycle. The group ensures proper shutdown behavior and job scheduling with minimal boilerplate for users.
Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
LB-IPAM is an easy target to implement the new job group since it is well tested and is a drop-in replacement.
Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
This section describes the purpose of the package and contains a comprehensive example of possible uses of the jobs.
Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
74024c1 to 44954ac
/test
Job 'Cilium-PR-K8s-1.26-kernel-4.19' failed.
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-4.19/82/
test-1.27-net-next failed with: Timed out while waiting for the machine to boot. Rerunning both.
/test-1.26-4.19
/test-1.27-net-next
@dylandreimerink 👋🏻 I see this review is post-merge, but here are my suggestions for improving content. You have a good conversational style. ✨
The `job package <https://pkg.go.dev/github.com/cilium/cilium/pkg/hive/job>`_ contains logic that
makes it easy to manage units of work that the package refers to as "jobs". These jobs are
scheduled as part of a job group. These jobs themselves come in a variety of flavors.
For better localization:
- scheduled as part of a job group. These jobs themselves come in a variety of flavors.
+ scheduled as part of a job group. These jobs themselves come in several varieties.
Every job, fundamentally is a callback function provided by the user with additional logic which
is slightly different for each job type. The jobs and groups manage a lot of the boilerplate
surrounding lifecycle management. The callbacks are called from the job to perform the actual
work.

Take the following somewhat contrived example:
Grammar, clarity
- Every job, fundamentally is a callback function provided by the user with additional logic which
- is slightly different for each job type. The jobs and groups manage a lot of the boilerplate
- surrounding lifecycle management. The callbacks are called from the job to perform the actual
- work.
- Take the following somewhat contrived example:
+ Every job is a callback function provided by the user with additional logic which
+ differs slightly for each job type. The jobs and groups manage a lot of the boilerplate
+ surrounding lifecycle management. The callbacks are called from the job to perform the actual
+ work.
+ Consider the following example:
The above example shows a number of use-cases in one cell. We start by requesting the job.Registry
via the constructor. We can use the registry to create job groups, in most cases one will be enough.
To this group we can add our jobs in the constructor. Any jobs added in the constructor are queued
until the lifecycle of our cell starts. The group is added to the lifecycle and manages this
internally. Jobs can also be added at runtime which can be handy for dynamic workloads while still
guaranteeing a clean shutdown.
Grammar, clarity, use active voice in present tense, avoid "we" and focus instead on a reader's actions ("you" often works)
- The above example shows a number of use-cases in one cell. We start by requesting the job.Registry
- via the constructor. We can use the registry to create job groups, in most cases one will be enough.
- To this group we can add our jobs in the constructor. Any jobs added in the constructor are queued
- until the lifecycle of our cell starts. The group is added to the lifecycle and manages this
- internally. Jobs can also be added at runtime which can be handy for dynamic workloads while still
- guaranteeing a clean shutdown.
+ The preceding example shows a number of use cases in one cell. The cell starts by requesting the job.Registry
+ by way of the constructor. The registry can create job groups; in most cases, one is enough.
+ You can add jobs in the constructor to this group. Any jobs added in the constructor are queued
+ until the lifecycle of the cell starts. The group is added to the lifecycle and manages jobs
+ internally. You can also add jobs at runtime, which can be handy for dynamic workloads while still
+ guaranteeing a clean shutdown.
A job group will cancel the context to all jobs when the lifecycle ends. Any job callbacks are
expected to exit as soon as possible when the ``ctx`` is "Done". The group will make sure that all
jobs are properly shutdown before the cell stops. If callbacks that do not stop within reasonable
amount of time may cause hive to perform a hard shutdown.
Use present tense, grammar
- A job group will cancel the context to all jobs when the lifecycle ends. Any job callbacks are
- expected to exit as soon as possible when the ``ctx`` is "Done". The group will make sure that all
- jobs are properly shutdown before the cell stops. If callbacks that do not stop within reasonable
- amount of time may cause hive to perform a hard shutdown.
+ A job group cancels the context to all jobs when the lifecycle ends. Any job callbacks are
+ expected to exit as soon as the ``ctx`` is "Done". The group makes sure that all
+ jobs are properly shut down before the cell stops. Callbacks that do not stop within a reasonable
+ amount of time may cause the hive to perform a hard shutdown.
There are 3 job types: one-shot jobs, timer jobs, and observer jobs. One shot jobs run a limited
amount of times, they can be used for short running jobs or jobs that span the entire lifecycle.
Once the callback exits without error, its never called again. A one-shot can optionally have retry
logic and/or trigger hive shutdown if it fails. Timers are called on a specified interval but they
can also be externally triggered. Lastly, we have observer jobs which are invoked for every event
on a ``stream.Observable``.
Consistent hyphens, grammar
- There are 3 job types: one-shot jobs, timer jobs, and observer jobs. One shot jobs run a limited
- amount of times, they can be used for short running jobs or jobs that span the entire lifecycle.
- Once the callback exits without error, its never called again. A one-shot can optionally have retry
- logic and/or trigger hive shutdown if it fails. Timers are called on a specified interval but they
- can also be externally triggered. Lastly, we have observer jobs which are invoked for every event
- on a ``stream.Observable``.
+ There are 3 job types: one-shot jobs, timer jobs, and observer jobs. One-shot jobs run a limited
+ number of times: you can use them for brief jobs, or for jobs that span the entire lifecycle.
+ Once the callback exits without error, it is never called again. Optionally, a one-shot job can include retry
+ logic and/or trigger hive shutdown if it fails. Timers are called on a specified interval but they
+ can also be externally triggered. Lastly, observer jobs are invoked for every event
+ on a ``stream.Observable``.
@zacharysarah thoughts on "use them" vs. "you can use them"? The former seems more concise and imperative.
(Otherwise all of this feedback looks good, @dylandreimerink / @zacharysarah do you think either of you would be able to apply these diffs in a fresh PR to the tree?)
@joestringer Sorry, I'm just now seeing this comment. You could use either, but "use them" is totally appropriate here, and I kinda like it.
PR cilium#24558 was merged before the docs feedback was in; this PR applies the suggested improvements to the Hive docs related to jobs.
Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
This PR introduces a new job group system. Up to this point we did not have a "good" way to manage jobs/work/routines in hive cells. While the lifecycle Start and Stop hook methods are really great for guaranteeing ordered startup and shutdown, they are not great on the usage side. Most code expects to have a context and to just stop when it is cancelled, which is reasonable. However, converting between the two paradigms takes a lot of boilerplate code: a waitgroup, a context, and a cancel function.
Until now we have been using cilium/workerpool to reduce the manual work. However, workerpool requires you to specify the number of goroutines up front, and submitting more work than there are workers queues the extra work. It is simply not designed for what we are doing with it. This job system is designed to replace both of these older methods.
In the pre-hive code it is common to see pkg/controller used. It did what lifecycles do for us now, but with some extra features such as retries, periodic invocation, and triggers. It also monitors the controllers and makes that info available as metrics. In anticipation of replacing the current controller system with this new job group system, I have added a few features which are present in pkg/controller so future migration of code that uses controllers is easier.