-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
daemon: Top-level composition into a hierarchy of cells #21736
Conversation
97226cd
to
ee44b4c
Compare
ee44b4c
to
41f0707
Compare
Next steps after this in #21746 which converts few of the easy "outer" components in *Daemon into modules. Worth taking a look as that perhaps gives a clearer picture of where this is heading. |
/test |
daemon/cmd/daemon_main.go
Outdated
@@ -1679,7 +1702,8 @@ func runDaemon(ctx context.Context, cleaner *daemonCleanup, shutdowner hive.Shut | |||
} | |||
bootstrapStats.k8sInit.End(true) | |||
restoreComplete := d.initRestore(restoredEndpoints) | |||
if wgAgent != nil { | |||
|
|||
if wgAgent, ok := d.datapath.WireguardAgent().(*wireguard.Agent); ok && wgAgent != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than casting, can you add whatever methods are missing to the interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That will create an import cycle (ipcache depends on pkg/datapath, Init() takes *IPCache). I'll instead pass *wg.Agent down as a separate param. It'll be a bit ugly for a short while with this "init after datapath restored". Moving datapath related initialization into pkg/datapath is not far off now though. This was done in the "top-level composition" commit.
@@ -1090,17 +1106,6 @@ func initEnv() { | |||
neighborsmap.SizeofNeighKey6+neighborsmap.SizeOfNeighValue, | |||
lbmap.SizeofSockRevNat6Key+lbmap.SizeofSockRevNat6Value) | |||
|
|||
// Prepopulate option.Config with options from CLI. | |||
option.Config.Populate(Vp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bunch of other stuff in initEnv
relies on option.Config -- is this going to be a problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I followed, the init()
function earlier will ensure that cobra.OnInitialize()
is set up to call option.Config.Populate(Vp)
, so then when main()
-> cmd.Execute()
-> RootCmd.Execute()
is called, it should call the OnInitialize() functions to populate the config and then subsequently call RootCmd.Run
-> runApp()
-> initEnv()
and retain the same ordering property as today.
Handy link for cross-reference: https://pkg.go.dev/github.com/spf13/cobra#OnInitialize
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, the order remains the same, I basically just split it so that all commands get option.Config.Populate
but only root command calls initEnv
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the three-layering. I didn't take a closer look at the subsequent PR yet but at a high level it seems like a reasonable step in the right direction.
While reviewing the second commit, I wondered if some of the refactoring / code moves could be split into dedicated commits in order to make the rest of the functional changes a bit more obvious. I know it's not all that much code, for example the factoring out of the datapath initialization into newDatapath()
, but every line of code that is only moving from one place to another into a dedicated function makes it more difficult to pay attention to the functional changes. I realize the wrapping in the function is also then essential to some of the refactoring/reordering, but if there's a way to e.g. move the code to a function without changing anything else in one commit, then do the functional change separately, that would be easier to review. I noted one other example below with only reordering of some node init logic that might be another candidate for a dedicated commit.
At a glance of the testsuite, it seems like the monitor initialization ordering is being changed. I was scanning through to try to locate why this might be, but it wasn't obvious to me just yet. The reason I suspect this is in this failure:
⚠️ Number of "level=warning" in logs: 53
...
Top 3 errors/warnings:
Notifying monitor about endpoint regeneration failed
Failed to send agent start monitor message
Notably also, other failures in the ginkgo suite appear to be giving a signal that the monitor event processing / output might not be keeping up.
After one pass, I'm not confident I understand the totality of the changes well enough to give an explicit approval, but generally they seem like reasonable changes that are a necessary step on the path to modularizing the agent. 👍 Maybe I'll sleep on it and take another pass.
daemon/cmd/daemon_main.go
Outdated
lc.Append(hive.Hook{ | ||
OnStart: func(context.Context) error { | ||
if err := enableIPForwarding(); err != nil { | ||
log.Fatalf("enabling IP forwarding via sysctl failed: %s", err) | ||
} | ||
|
||
iptablesManager.Init() | ||
return nil | ||
}}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that this is in the OnStart()
, this logic is actually deferred until later instead of happening in between the node initialization and the wireguard initialization right? I don't know if that's consequential but it wasn't called out in the commit message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double checked that it's OK and modified commit message.
daemon/cmd/daemon_main.go
Outdated
// If there is one device specified, use it to derive better default | ||
// allocation prefixes | ||
node.InitDefaultPrefix(option.Config.DirectRoutingDevice) | ||
|
||
if option.Config.IPv6NodeAddr != "auto" { | ||
if ip := net.ParseIP(option.Config.IPv6NodeAddr); ip == nil { | ||
log.WithField(logfields.IPAddr, option.Config.IPv6NodeAddr).Fatal("Invalid IPv6 node address") | ||
} else { | ||
if !ip.IsGlobalUnicast() { | ||
log.WithField(logfields.IPAddr, ip).Fatal("Invalid IPv6 node address: not a global unicast address") | ||
} | ||
|
||
node.SetIPv6(ip) | ||
} | ||
} | ||
|
||
if option.Config.IPv4NodeAddr != "auto" { | ||
if ip := net.ParseIP(option.Config.IPv4NodeAddr); ip == nil { | ||
log.WithField(logfields.IPAddr, option.Config.IPv4NodeAddr).Fatal("Invalid IPv4 node address") | ||
} else { | ||
node.SetIPv4(ip) | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any chance we could make this ordering change before as a separate commit, in case this has an unknown consequence? It would be helpful to be able to git bisect down to this specific change if it triggers an issue. I know it's called out in the commit message, but it's hard to say whether that would be enough to pinpoint the change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For review purposes I ended up doing this myself, which made it a bit easier for me to reason about anyway (no changes, just git commit splitting): https://github.com/joestringer/cilium/pull/new/review/agent-composition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for splitting it up. Reset to your branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted this change in the end to minimize the number of functional changes in this PR as I hit too many surprises.
if params.Config.SkipDaemon { | ||
return nil | ||
} | ||
func newDaemonPromise(params daemonParams) promise.Promise[*Daemon] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you paint a bit more of a picture of the rationale behind the promise usage here? I'm sure there's good reasons, but I was not able to parse them from the commit message :-)
Here's the current explanation:
To facilitate gradual move of components into cells from Daemon struct,
this expose it as promise.Promise[*Daemon]. This can be then further
projected down to individual objects as needed. E.g. by adding provider:
func projectIPCache(promise.Promise[*Daemon]) promise.Promise[*IPCache].
Looking through the commits, I only see daemonPromise.Await(ctx)
being used in unit tests, so maybe it's making some aspect of dependency injection or synchronization easier? My naive read of the description above is "use promises in the Daemon so that we can use promises in subsequent components".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking a little bit more, the overall structure right now is effectively one giant function that triggers all of the initializers across the entire codebase. I think the rationale here is about trying to make each individual dependency more explicit, so that over time we can split up the giant function into lots of smaller modules that are defined independently with only the logic relevant to those modules. Then it'll be hopefully easier to reason about initialization ordering for each hunk of logic, since there will only be a list of promises that are required and provided and anything beyond that is not relevant to this module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second day thoughts on the above is that most dependencies should already be handled by the Hive / Cell design & parameters being automatically resolved. But probably there is more use for promises in cases where a module runs its own dedicated goroutines for module-internal logic, and that logic needs to rely on the availability of some other module. I may have a good use for this in ipcache, since there's currently some ugly logic around doing some initial setup and then retrying the main loop if it gets triggered before identity allocation or k8s is fully synced. Not sure how much of that can be resolved via Cells vs. promises yet, but we can figure that out over time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reason for the Promise[*Daemon]
right now is that we're using it in tests (daemon_test.go, controlplane) to grab *Daemon
and use it to validate results.
We can also use it to "lift" things out from *Daemon
and depend on them in other components, but of course preferably we would not depend on Promise[*Foo]
and would rather provide *Foo
normally via injection. I'll reword the commits to not suggest this and just document the current use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall the changes look good to me. There's several ordering modifications in the last commit, so it'd be nice if we could split each ordering change into a dedicated commit in a similar style to how I did here just to make things a bit more bisectable in case of problems. Other than that, I have a few discussion points above but I don't think that any of these are particularly consequential to the structural changes being proposed in this PR.
|
||
// Start running the daemon in the background (blocks on API server's Serve()) to allow rest | ||
// of the start hooks to run. | ||
go runDaemon(ctx, cleaner, params.Shutdowner, params.Clientset) | ||
if !option.Config.DryMode { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting! That means that the other stuff in runDaemon
is no longer necessary for unit testing 👍
I noticed that the runDaemon()
function still has some !option.Config.DryMode
checks inside, so we can probably drop those since the check now occurs here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the DryMode checks from runDaemon and initEnv.
41f0707
to
0610a04
Compare
/test |
0610a04
to
4218d92
Compare
0668d3f
to
b3e1d5f
Compare
/test-1.16-4.9 |
fa4b096
to
7ea5b04
Compare
/test Job 'Cilium-PR-K8s-1.24-kernel-4.19' has 3 failures but they might be new flakes since it also hit 1 known flakes: #17628 (92.24) |
7ea5b04
to
ef6b3b3
Compare
/test Job 'Cilium-PR-K8s-1.24-kernel-4.19' has 2 failures but they might be new flakes since it also hit 1 known flakes: #17628 (92.24) |
ConformanceIngress (Shared LB) is recently enabled, I'm not sure how stable those tests are yet (cc @sayboras @youngnick ). |
The original flaky tests were fixed as part of Gateway API work, however it seems like there is one more case, for the time being, we can mark it back to optional. I will take a closer look once I am back to work. |
Confirmed one of the failures on I've marked that job as no longer required for now. |
ef6b3b3
to
f08d8c8
Compare
Package names within github.com/cilium/cilium are unique enough that we can trim it down: github.com/cilium/cilium/pkg/k8s/resource.Resource[*github.com/cilium/cilium/.../v1.Service] -> resource.Resource[*v1.Service] In the rare cases there is ambiquity (like above, is it corev1 or slim_corev1?), it's easily resolvable looking at the constructor implementation. Keeping the output human readable is more important. Signed-off-by: Jussi Maki <jussi@isovalent.com>
Move the datapath initialization code to a new dedicated function. The IptablesManager.Init call is deferred to a start hook to allow "cilium-agent objects" to work. This does not change the initialization order as nothing in newDatapath uses iptablesManager and the returned Datapath interface is only used when initializing *Daemon which happens from a start hook. Signed-off-by: Jussi Maki <jussi@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
Move the CRD identity allocation mode check from initEnv to NewDaemon as it checks if k8s-client is enabled, which is not yet configured when initEnv() is invoked. Signed-off-by: Jussi Maki <jussi@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
initEnv() is only relevant when running the agent as root via the cilium-agent binary and should be skipped in integration tests. This moves the initEnv call into runApp() from registerDaemonHooks(). Since we do want option.Config.Populate and logging to be set up for all commands (dot-graph, objects), these are moved from initEnv() into cobra.OnInitialize hook (executed prior to the command run hook). Signed-off-by: Jussi Maki <jussi@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
This is the first step towards composing the agent as a hierarchy of cells divided into 3 layers: infrastructure, control plane and datapath. By dividing into these layers we can start having a clearer separation of responsibilities and allow writing integration tests without having to use "DryMode" as a crutch. The "DryMode" becomes mocked "Infrastructure" and "Datapath". daemon tests and test/controlplane are refactored to run the agent via cmd.ControlPlane rather than constructing it directly via NewDaemon. Since the tests use *Daemon to validate, provide it asynchronously via Promise[*Daemon]. Providing it directly would require splitting NewDaemon into pure object construction and start hook which is currently non-trivial. Signed-off-by: Jussi Maki <jussi@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
initEnv and runDaemon are now only invoked when not running in DryMode, so DryMode checks in these can now be removed. Signed-off-by: Jussi Maki <jussi@isovalent.com>
Since we're still using the global getters and setters in pkg/node it's fairly dangerous to swap it out with SetLocalNodeStore as any code that runs prior to that call and uses pkg/node may end up modifying state that gets thrown away. To make the transition safer, revert back to the old semantics and keep a singleton LocalNodeStore that needs to be reset in tests via Uninitialize. We'll need to revisit this when the global getters&setters are removed. This fixes an issue in a prior commit that moved wireguard agent initialization earlier than SetLocalNodeStore. Signed-off-by: Jussi Maki <jussi@isovalent.com>
f08d8c8
to
8233887
Compare
/test |
/ci-external-workloads |
/test-runtime |
/ci-multicluster |
These passed earlier and they are failing in master as well currently (https://github.com/cilium/cilium/actions/runs/3468923193), so marking ready-to-merge. |
First commit cleans up the
cilium-agent objects
output.Second commit adds
daemon/cmd/cells.go
that expresses the agent as a composition of 3 top-level modules and refactors daemon_test.go and test/controlplane to use thecmd.ControlPlane
. This removes the last blocker from being able to start implementing more of the agent as cells.Here is how the output of
cilium-agent objects
looks now. The agent is divided into 3 top-level modules,infrastructure, control plane and datapath. The infrastructure and datapath modules are meant to contain implementations that talk to real systems, either external or the local kernel. Control plane is meant to contain pure business logic that can be integration tested in non-privileged platform independent tests (e.g. on both macOS and Linux) by mocking the infrastructure and datapath modules. Preferably over time we would have the source code organization reflect this division (if we can stomach the backport pain).