Feature/job groups #24558

dylandreimerink · 2023-03-24T15:30:18Z

This PR introduces a new job group system. Up to this point we did not have a "good" way to manage jobs/work/routines in hive cells. While the lifecycle Start and Stop hook methods are really great for guaranteeing ordered startup and shutdown, they are not great on the usage side. Most code expects to receive a context and to simply stop when it is cancelled, which is reasonable. However, converting between the two paradigms takes a lot of boilerplate code: a waitgroup, a context, and a cancel function.
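To make the boilerplate concrete, here is a minimal sketch of what bridging Start/Stop hooks to context-driven code tends to look like by hand. The `worker` type, its simplified `Start`/`Stop` signatures, and `demoHooks` are assumptions made for illustration; they are not the hive hook types.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// worker is a hypothetical cell component that owns one background goroutine.
// Start and Stop stand in for lifecycle hooks; their signatures are
// simplified for illustration and are not the hive hook signatures.
type worker struct {
	wg     sync.WaitGroup
	cancel context.CancelFunc
	exited bool
}

// run is written the way most code expects to be called: it receives a
// context and returns once that context is cancelled.
func (w *worker) run(ctx context.Context) {
	defer func() { w.exited = true }()
	<-ctx.Done()
}

// Start creates the context, stores the cancel function, and tracks the
// goroutine in the wait group.
func (w *worker) Start() error {
	ctx, cancel := context.WithCancel(context.Background())
	w.cancel = cancel
	w.wg.Add(1)
	go func() {
		defer w.wg.Done()
		w.run(ctx)
	}()
	return nil
}

// Stop cancels the context and blocks until the goroutine has returned.
func (w *worker) Stop() error {
	w.cancel()
	w.wg.Wait()
	return nil
}

// demoHooks starts and stops one worker and reports whether its goroutine
// had exited by the time Stop returned.
func demoHooks() bool {
	w := &worker{}
	if err := w.Start(); err != nil {
		return false
	}
	if err := w.Stop(); err != nil {
		return false
	}
	return w.exited
}

func main() {
	fmt.Println("goroutine exited before Stop returned:", demoHooks())
}
```

Every component that wants a context-driven goroutine ends up repeating the three fields and the Start/Stop bodies, which is the repetition a job group is meant to absorb.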

Until now we have been using cilium/workerpool to reduce the manual work. However, workerpool requires you to specify the number of goroutines upfront, and submitting more work than there are workers queues the excess work. It is simply not designed for what we are doing with it. This job system is designed to replace both of these older methods.
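The queuing problem can be illustrated with a small stand-in pool built only on the standard library. This is a sketch of the general fixed-size pool behaviour described above, not the cilium/workerpool implementation or API; `fixedPool`, `submit`, `shutdown`, and `demoPool` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// fixedPool is a hypothetical fixed-size pool: the number of worker
// goroutines is chosen upfront and extra tasks wait in a queue.
type fixedPool struct {
	tasks chan func()
	wg    sync.WaitGroup
}

func newFixedPool(workers int) *fixedPool {
	p := &fixedPool{tasks: make(chan func(), 16)}
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			// Each worker runs one task at a time until the queue is closed.
			for task := range p.tasks {
				task()
			}
		}()
	}
	return p
}

// submit queues a task; it only runs once a worker becomes free.
func (p *fixedPool) submit(task func()) { p.tasks <- task }

// shutdown closes the queue and waits for all workers to drain it.
func (p *fixedPool) shutdown() {
	close(p.tasks)
	p.wg.Wait()
}

// demoPool submits a long-running routine and a second task to a pool with
// one worker. It reports whether the second task had started while the
// first was still running, and whether it had run after the first finished.
func demoPool() (startedWhileBusy, startedAfterRelease bool) {
	p := newFixedPool(1)
	running := make(chan struct{})
	release := make(chan struct{})
	var second atomic.Bool

	p.submit(func() { close(running); <-release }) // long-running routine
	p.submit(func() { second.Store(true) })        // waits behind it

	<-running // the only worker is now occupied
	startedWhileBusy = second.Load()
	close(release)
	p.shutdown()
	return startedWhileBusy, second.Load()
}

func main() {
	fmt.Println(demoPool()) // prints "false true"
}
```

With a single worker, a routine that is meant to live as long as the cell would keep the second task queued for the cell's whole lifetime, which is why the pool size has to be matched by hand to the number of routines.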

In the pre-hive code it is common to see pkg/controller used. It did what lifecycles do for us now, but with some extra features such as retries, periodic invocation, and triggers. It also monitors the controllers and makes that info available as metrics. In anticipation of replacing the current controller system with this new job group system, I have added a few features which are present in pkg/controller so that future migration of code that uses controllers is easier.

Added a new job group system to manage the lifecycle of jobs within cells

dylandreimerink · 2023-03-27T14:47:26Z

/test

Job 'Cilium-PR-K8s-1.25-kernel-4.19' failed:

Test Name

K8sDatapathServicesTest Checks E/W loadbalancing (ClusterIP, NodePort from inside cluster, etc) Tests NodePort inside cluster (kube-proxy) with IPSec and externalTrafficPolicy=Local

Failure Output

FAIL: Request from k8s1 to service http://[fd04::11]:32708 failed

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.25-kernel-4.19 so I can create one.

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:

Test Name

K8sDatapathServicesTest Checks N/S loadbalancing Tests with XDP, direct routing, SNAT and Maglev

Failure Output

FAIL: Can not connect to service "tftp://[fd04::11]:32014/hello" from outside cluster (10/10)

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next so I can create one.

tklauser

Haven't looked at the implementation yet. I noticed that the job groups API isn't documented as part of the hive API in Documentation/contributing/development/hive.rst in this PR, but it probably should be.

joamaki · 2023-03-28T09:35:42Z

Fixed the description (was talking about workgroups rather than cilium/workerpool). Tagged @tommyp1ckles for thoughts on status/health.

dylandreimerink · 2023-03-28T11:32:37Z

Haven't looked at the implementation yet. I noticed that the job groups API isn't documented as part of the hive API in Documentation/contributing/development/hive.rst

Ah, yes. I did not think of that. That seems like a good addition.

joamaki

Great start! I think it makes sense to not add all the bells and whistles into this in one PR, but rather try to keep it minimal like it is. What I'd like to see is good godocs (strongly consider interface types for the API to minimize the visible API surface), decent tests of all different job types, and maybe think a bit more about the naming.

pkg/hive/job/group.go

dylandreimerink · 2023-03-31T08:32:11Z

/test

Job 'Cilium-PR-K8s-1.25-kernel-4.19' hit: #24648 (86.56% similarity)

Job 'Cilium-PR-K8s-1.16-kernel-4.19' failed:

Test Name

K8sDatapathConfig MonitorAggregation Checks that monitor aggregation restricts notifications

Failure Output

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.16-kernel-4.19 so I can create one.

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:

Test Name

K8sUpdates Tests upgrade and downgrade from a Cilium stable image to master

Failure Output

FAIL: Expected

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next so I can create one.

operator/pkg/lbipam/cell.go

joamaki · 2023-03-31T16:59:19Z

operator/cmd/root.go

@@ -106,6 +107,9 @@ var (
 			}
 		}),

+		// Provides a global job collection which cells can use to spawn job groups.


s/collection/registry/. The comment feels very jargony. I wonder if its purpose could be explained more clearly?

operator/pkg/lbipam/lbipam.go

pkg/hive/job/job.go

pkg/hive/job/job_test.go

pkg/hive/job/job.go

operator/pkg/lbipam/lbipam.go

tommyp1ckles · 2023-04-03T03:18:32Z

very cool

One question:

Until now we have been using cilium/workerpool to reduce the manual work. However, workerpool requires you to specify the number of goroutines upfront, and submitting more work than there are workers queues the excess work. It is simply not designed for what we are doing with it. This job system is designed to replace both of these older methods.

Is the goal just to replace instances where we're doing workerpool.New(1)? Otherwise, do we still want to support specifying the parallelism of a group?

dylandreimerink · 2023-04-03T13:29:41Z

Is the goal just to replace instances where we're doing workerpool.New(1)? Otherwise, do we still want to support specifying the parallelism of a group?

Not just those instances. Also instances where we have workerpool.New(2) or workerpool.New(3) in Cells as a shortcut for goroutine management of multiple routines. In addition, the intention is that this is a modular alternative for the pkg/controller package as well.

dylandreimerink · 2023-04-20T15:41:51Z

/test-1.27-net-next

dylandreimerink · 2023-04-20T15:42:45Z

/test-1.26-4.19

dylandreimerink · 2023-04-21T08:39:29Z

The net-next test was hitting the same flake again.

It looks like the net-next failure might be due to a new test added on main which causes stale branches to fail. Rebased this branch to test that theory.

tklauser

Would be nice to have the Guide to the Hive docs as part of this PR, ref. #24558 (review)

pkg/hive/job/job.go

dylandreimerink · 2023-04-21T14:47:57Z

/test

This commit introduces a new package called `job`, which provides a `job.Collection` and a `job.Group`. The collection is created once as a cell in the hive. Other cells then take the collection as an input, create a new Group from it, and assign it to their own cell structure.

The group has a number of functions to add different job types:

* AddOneShot - Adds a 'one shot' job which runs once and can exit early or remain running for the entire life of the cell. On error, the job can be retried.
* AddTimer - Adds a timer job which runs at a given interval or, optionally, when triggered.
* AddObserver - Adds an observer job which triggers on every item sent over the given stream.Observable.

The group implements hive.HookInterface and thus can be added directly to a lifecycle. The jobs are invoked with contexts linked to the cell's lifecycle. The group ensures proper shutdown behavior and job scheduling with minimal boilerplate for users.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
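The pattern this commit message describes can be sketched with the standard library alone. The following is a toy stand-in, assuming a much simpler shape than the real package: `toyGroup`, its lowercase `addOneShot`/`start`/`stop` methods, and `demoGroup` are hypothetical and do not reproduce the `job` package's signatures, retry handling, timers, or observers.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
)

// toyGroup is a hypothetical, minimal job group: jobs are registered first,
// then all started with a shared context, and stopped together.
type toyGroup struct {
	jobs   []func(ctx context.Context)
	wg     sync.WaitGroup
	cancel context.CancelFunc
}

// addOneShot registers a job that runs once. It may return early or keep
// running until its context is cancelled.
func (g *toyGroup) addOneShot(fn func(ctx context.Context)) {
	g.jobs = append(g.jobs, fn)
}

// start launches every registered job in its own goroutine.
func (g *toyGroup) start() error {
	ctx, cancel := context.WithCancel(context.Background())
	g.cancel = cancel
	for _, fn := range g.jobs {
		g.wg.Add(1)
		go func(fn func(ctx context.Context)) {
			defer g.wg.Done()
			fn(ctx)
		}(fn)
	}
	return nil
}

// stop cancels the shared context and waits for all jobs to return.
func (g *toyGroup) stop() error {
	g.cancel()
	g.wg.Wait()
	return nil
}

// demoGroup runs one job that exits early and one that runs until the
// group is stopped, and returns how many jobs finished.
func demoGroup() int32 {
	var finished atomic.Int32
	g := &toyGroup{}
	g.addOneShot(func(ctx context.Context) { finished.Add(1) })
	g.addOneShot(func(ctx context.Context) {
		<-ctx.Done() // remain running for the life of the group
		finished.Add(1)
	})
	_ = g.start()
	_ = g.stop()
	return finished.Load()
}

func main() {
	fmt.Println("jobs finished:", demoGroup()) // prints "jobs finished: 2"
}
```

The point of the sketch is that the wait group, context, and cancel function live in one place, so a cell only has to register functions that accept a context.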

LB-IPAM is an easy target to implement the new job group since it is well tested and is a drop-in replacement.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>

This section describes the purpose of the package and contains a comprehensive example of possible uses of the jobs.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>

dylandreimerink · 2023-04-21T18:06:31Z

/test

Job 'Cilium-PR-K8s-1.26-kernel-4.19' failed:

Test Name

K8sDatapathConfig Host firewall With native routing

Failure Output

FAIL: Error deleting resource /home/jenkins/workspace/Cilium-PR-K8s-1.26-kernel-4.19/src/github.com/cilium/cilium/test/k8s/manifests/host-policies.yaml: Cannot retrieve "cilium-g8r69"'s policy revision: cannot get policy revision: ""

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-4.19/82/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.26-kernel-4.19 so I can create one.

dylandreimerink · 2023-04-22T10:35:54Z

test-1.27-net-next failed with: Timed out while waiting for the machine to boot. Rerunning both.

dylandreimerink · 2023-04-22T10:36:03Z

/test-1.26-4.19

dylandreimerink · 2023-04-22T10:36:11Z

/test-1.27-net-next

zacharysarah

@dylandreimerink 👋🏻 I see this review is post-merge, but here are my suggestions for improving content. You have a good conversational style. ✨

zacharysarah · 2023-04-27T03:11:36Z