Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/job groups #24558

Merged
merged 3 commits into from Apr 26, 2023
Merged

Feature/job groups #24558

merged 3 commits into from Apr 26, 2023

Conversation

dylandreimerink
Copy link
Member

@dylandreimerink dylandreimerink commented Mar 24, 2023

This PR introduces a new job group system. Up to this point we did not have a "good" way to manage jobs/work/routines in hive cells. While the lifecycle Start and Stop hook methods are really great for guaranteeing ordered startup and shutdown, they are not great on the usage side. Most code expect to have a context and just stop when it cancels, which is reasonable. However converting between the two paradigms takes a lot of boilerplate code: a waitgroup, context and cancel function.

Until now we have been using cilium/workerpool to reduce the manual work, however, workerpool requires you to specify the amount of goroutines upfront, and submitting more work than workers will queue work. It is simply not designed for what we are doing with it. This job system is designed to replace both of these older methods.

In the pre-hive code it is common to see the pkg/controller used. It did what lifecycles do for us now but with some extra features such as retries, periodic invocation and triggers. It also monitors the controllers and makes that info available as metrics. In anticipation of replacing the current controller system with this new job group system, I have added a few features which are present in pkg/controllers so future migration of code that uses controllers is easier.

Added a new job group system to manage the lifecycle of jobs within cells

@dylandreimerink dylandreimerink added the release-note/misc This PR makes changes that have no direct user impact. label Mar 24, 2023
@dylandreimerink
Copy link
Member Author

dylandreimerink commented Mar 27, 2023

/test

Job 'Cilium-PR-K8s-1.25-kernel-4.19' failed:

Click to show.

Test Name

K8sDatapathServicesTest Checks E/W loadbalancing (ClusterIP, NodePort from inside cluster, etc) Tests NodePort inside cluster (kube-proxy) with IPSec and externalTrafficPolicy=Local

Failure Output

FAIL: Request from k8s1 to service http://[fd04::11]:32708 failed

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.25-kernel-4.19 so I can create one.

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:

Click to show.

Test Name

K8sDatapathServicesTest Checks N/S loadbalancing Tests with XDP, direct routing, SNAT and Maglev

Failure Output

FAIL: Can not connect to service "tftp://[fd04::11]:32014/hello" from outside cluster (10/10)

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next so I can create one.

@dylandreimerink dylandreimerink marked this pull request as ready for review March 27, 2023 15:51
@dylandreimerink dylandreimerink requested review from a team as code owners March 27, 2023 15:51
Copy link
Member

@tklauser tklauser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't looked at the implementation yet. I noticed that the job groups API isn't documented as part of the hive API in Documentation/contributing/development/hive.rst in this PR, but it probably should be.

@joamaki
Copy link
Contributor

joamaki commented Mar 28, 2023

Fixed the description (was talking about workgroups rather than cilium/workerpool). Tagged @tommyp1ckles for thoughts on status/health.

@dylandreimerink
Copy link
Member Author

Haven't looked at the implementation yet. I noticed that the job groups API isn't documented as part of the hive API in Documentation/contributing/development/hive.rst

Ah, yes. I did not think of that. That seems like a good addition

Copy link
Contributor

@joamaki joamaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great start! I think it makes sense to not add all the bells and whistles into this in one PR, but rather try to keep it minimal like it is. What I'd like to see is good godocs (strongly consider interface types for the API to minimize the visible API surface), decent tests of all different job types, and maybe think a bit more about the naming.

pkg/hive/job/group.go Outdated Show resolved Hide resolved
pkg/hive/job/group.go Outdated Show resolved Hide resolved
@dylandreimerink dylandreimerink force-pushed the feature/job-groups branch 3 times, most recently from 154378a to 35fb96f Compare March 30, 2023 14:57
@dylandreimerink
Copy link
Member Author

dylandreimerink commented Mar 31, 2023

/test

Job 'Cilium-PR-K8s-1.25-kernel-4.19' hit: #24648 (86.56% similarity)

Job 'Cilium-PR-K8s-1.16-kernel-4.19' failed:

Click to show.

Test Name

K8sDatapathConfig MonitorAggregation Checks that monitor aggregation restricts notifications

Failure Output


If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.16-kernel-4.19 so I can create one.

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:

Click to show.

Test Name

K8sUpdates Tests upgrade and downgrade from a Cilium stable image to master

Failure Output

FAIL: Expected

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next so I can create one.

operator/pkg/lbipam/cell.go Outdated Show resolved Hide resolved
@@ -106,6 +107,9 @@ var (
}
}),

// Provides a global job collection which cells can use to spawn job groups.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/collection/registry/. The comment feels very jargony. Wonder if it's purpose could be explained more clearly?

operator/pkg/lbipam/lbipam.go Outdated Show resolved Hide resolved
pkg/hive/job/job.go Show resolved Hide resolved
pkg/hive/job/job.go Outdated Show resolved Hide resolved
pkg/hive/job/job.go Show resolved Hide resolved
pkg/hive/job/job.go Outdated Show resolved Hide resolved
pkg/hive/job/job.go Show resolved Hide resolved
pkg/hive/job/job_test.go Outdated Show resolved Hide resolved
pkg/hive/job/job_test.go Show resolved Hide resolved
pkg/hive/job/job.go Show resolved Hide resolved
pkg/hive/job/job.go Show resolved Hide resolved
@tommyp1ckles
Copy link
Contributor

tommyp1ckles commented Apr 3, 2023

very cool

One question:

Until now we have been using cilium/workerpool to reduce the manual work, however, workerpool requires you to specify the amount of goroutines upfront, and submitting more work than workers will queue work. It is simply not designed for what we are doing with it. This job system is designed to replace both of these older methods.

Is the goal just to replace instances where we're doing workerpool.New(1), otherwise do we still want to support specifying the parallelism of a group?

@dylandreimerink
Copy link
Member Author

Is the goal just to replace instances where we're doing workerpool.New(1), otherwise do we still want to support specifying the parallelism of a group?

Not just those instances. Also instances where we have workerpool.New(2) or workerpool.New(3) in Cells as a shortcut for goroutine management of multiple routines. In addition, the intention is that this is a modular alternative for the pkg/controller package as well.

@dylandreimerink
Copy link
Member Author

/test-1.27-net-next

@dylandreimerink
Copy link
Member Author

/test-1.26-4.19

@dylandreimerink
Copy link
Member Author

dylandreimerink commented Apr 21, 2023

The net-next test was hitting the same flake again.

It looks like net next failing might be due to a new test added on main which causes stale branches to fail, re-based this branch to test that theory.

@tklauser tklauser self-requested a review April 21, 2023 08:59
Copy link
Member

@tklauser tklauser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to have the Guide to the Hive docs as part of this PR, ref. #24558 (review)

pkg/hive/job/job.go Outdated Show resolved Hide resolved
@dylandreimerink
Copy link
Member Author

/test

This commit introduces a new package call `job`. Which provides a
`job.Collection` and `job.Group`. The collection is created once as a
cell in the hive. Other cells will then take the collection as a input
and make a new Group from the collection and assign it to their own
cell structure.

The group has a number of function to add different job types:

* AddOneShot - Adds a 'one shot' job which runs once, can exit early or
  remain running for the entire life of the cell. On error, the job can
  be retried.
* AddTimer - Adds a timer job which runs at an given interval or
  optionally when triggered.
* AddObserver - Adds a observer job which triggers on every item sent
  over the given stream.Observable.

The group implements hive.HookInterface and thus can be added directly
to a lifecycle. The jobs are invoked with contexts linked to the
cells lifecycle. The group ensured proper shutdown behavior and job
scheduling with minimal boilerplate for users.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
LB-IPAM is an easy target to implement the new job group since it is
well tested and is a drop in replacement.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
This section describes the purpose of the package and contains a
comprehensive example of possible uses of the jobs.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
@dylandreimerink
Copy link
Member Author

dylandreimerink commented Apr 21, 2023

/test

Job 'Cilium-PR-K8s-1.26-kernel-4.19' failed:

Click to show.

Test Name

K8sDatapathConfig Host firewall With native routing

Failure Output

FAIL: Error deleting resource /home/jenkins/workspace/Cilium-PR-K8s-1.26-kernel-4.19/src/github.com/cilium/cilium/test/k8s/manifests/host-policies.yaml: Cannot retrieve "cilium-g8r69"'s policy revision: cannot get policy revision: ""

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-4.19/82/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.26-kernel-4.19 so I can create one.

@dylandreimerink
Copy link
Member Author

test-1.27-net-next failed with: Timed out while waiting for the machine to boot. Rerunning both

@dylandreimerink
Copy link
Member Author

/test-1.26-4.19

@dylandreimerink
Copy link
Member Author

/test-1.27-net-next

@joamaki joamaki added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 26, 2023
@ti-mo ti-mo merged commit bdd1b4a into cilium:main Apr 26, 2023
56 checks passed
Copy link
Contributor

@zacharysarah zacharysarah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dylandreimerink 👋🏻 I see this review is post-merge, but here are my suggestions for improving content. You have a good conversational style. ✨


The `job package <https://pkg.go.dev/github.com/cilium/cilium/pkg/hive/job>`_ contains logic that
makes it easy to manage units of work that the package refers to as "jobs". These jobs are
scheduled as part of a job group. These jobs themselves come in a variety of flavors.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For better localization:

Suggested change
scheduled as part of a job group. These jobs themselves come in a variety of flavors.
scheduled as part of a job group. These jobs themselves come in several varieties.

Comment on lines +887 to +892
Every job, fundamentally is a callback function provided by the user with additional logic which
is slightly different for each job type. The jobs and groups manage a lot of the boilerplate
surrounding lifecycle management. The callbacks are called from the job to perform the actual
work.

Take the following somewhat contrived example:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammar, clarity

Suggested change
Every job, fundamentally is a callback function provided by the user with additional logic which
is slightly different for each job type. The jobs and groups manage a lot of the boilerplate
surrounding lifecycle management. The callbacks are called from the job to perform the actual
work.
Take the following somewhat contrived example:
Every job is a callback function provided by the user with additional logic which
differs slightly for each job type. The jobs and groups manage a lot of the boilerplate
surrounding lifecycle management. The callbacks are called from the job to perform the actual
work.
Consider the following example:

Comment on lines +1009 to +1014
The above example shows a number of use-cases in one cell. We start by requesting the job.Registry
via the constructor. We can use the registry to create job groups, in most cases one will be enough.
To this group we can add our jobs in the constructor. Any jobs added in the constructor are queued
until the lifecycle of our cell starts. The group is added to the lifecycle and manages this
internally. Jobs can also be added at runtime which can be handy for dynamic workloads while still
guaranteeing a clean shutdown.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammar, clarity, use active voice in present tense, avoid "we" and focus instead on a reader's actions ("you" often works)

Suggested change
The above example shows a number of use-cases in one cell. We start by requesting the job.Registry
via the constructor. We can use the registry to create job groups, in most cases one will be enough.
To this group we can add our jobs in the constructor. Any jobs added in the constructor are queued
until the lifecycle of our cell starts. The group is added to the lifecycle and manages this
internally. Jobs can also be added at runtime which can be handy for dynamic workloads while still
guaranteeing a clean shutdown.
The preceding example shows a number of use cases in one cell. The cell starts by requesting the job.Registry
by way of the constructor. The registry can create job groups; in most cases, one is enough.
You can add jobs in the constructor to this group. Any jobs added in the constructor are queued
until the lifecycle of the cell starts. The group is added to the lifecycle and manages jobs
internally. You can also add jobs at runtime, which can be handy for dynamic workloads while still
guaranteeing a clean shutdown.

Comment on lines +1016 to +1019
A job group will cancel the context to all jobs when the lifecycle ends. Any job callbacks are
expected to exit as soon as possible when the ``ctx`` is "Done". The group will make sure that all
jobs are properly shutdown before the cell stops. If callbacks that do not stop within reasonable
amount of time may cause hive to perform a hard shutdown.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use present tense, grammar

Suggested change
A job group will cancel the context to all jobs when the lifecycle ends. Any job callbacks are
expected to exit as soon as possible when the ``ctx`` is "Done". The group will make sure that all
jobs are properly shutdown before the cell stops. If callbacks that do not stop within reasonable
amount of time may cause hive to perform a hard shutdown.
A job group cancels the context to all jobs when the lifecycle ends. Any job callbacks are
expected to exit as soon as the ``ctx`` is "Done". The group makes sure that all
jobs are properly shut down before the cell stops. Callbacks that do not stop within a reasonable
amount of time may cause the hive to perform a hard shutdown.

Comment on lines +1021 to +1026
There are 3 job types: one-shot jobs, timer jobs, and observer jobs. One shot jobs run a limited
amount of times, they can be used for short running jobs or jobs that span the entire lifecycle.
Once the callback exits without error, its never called again. A one-shot can optionally have retry
logic and/or trigger hive shutdown if it fails. Timers are called on a specified interval but they
can also be externally triggered. Lastly, we have observer jobs which are invoked for every event
on a ``stream.Observable``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consistent hyphens, grammar

Suggested change
There are 3 job types: one-shot jobs, timer jobs, and observer jobs. One shot jobs run a limited
amount of times, they can be used for short running jobs or jobs that span the entire lifecycle.
Once the callback exits without error, its never called again. A one-shot can optionally have retry
logic and/or trigger hive shutdown if it fails. Timers are called on a specified interval but they
can also be externally triggered. Lastly, we have observer jobs which are invoked for every event
on a ``stream.Observable``.
There are 3 job types: one-shot jobs, timer jobs, and observer jobs. One-shot jobs run a limited
number of times: you can use them for brief jobs, or for jobs that span the entire lifecycle.
Once the callback exits without error, it is never called again. Optionally, a one-shot job can include retry
logic and/or trigger hive shutdown if it fails. Timers are called on a specified interval but they
can also be externally triggered. Lastly, observer jobs are invoked for every event
on a ``stream.Observable``.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zacharysarah thoughts on "use them" vs. "you can use them"? The former seems more concise and imperative.

(Otherwise all of this feedback looks good, @dylandreimerink / @zacharysarah do you think either of you would be able to apply these diffs in a fresh PR to the tree?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joestringer Sorry, I'm just now seeing this comment. You could use either, but "use them" is totally appropriate here, and I kinda like it.

dylandreimerink added a commit to dylandreimerink/cilium that referenced this pull request May 17, 2023
PR cilium#24558 got merged before the docs feedback was in, this PR applies
the suggested improvements to the Hive docs related to jobs.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
joestringer pushed a commit that referenced this pull request May 22, 2023
PR #24558 got merged before the docs feedback was in, this PR applies
the suggested improvements to the Hive docs related to jobs.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants