Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

daemon: Implement route-based device detection #17219

Merged

Conversation

joamaki
Copy link
Contributor

@joamaki joamaki commented Aug 23, 2021

This reimplements the device detection logic to use the route
information rather than detecting devices based on k8s nodeIP and
the device with the default route.

The devices are discovered by finding all devices mentioned in global
unicast routes and filtering out cilium-managed devices by prefix.

Motivation for this change is to handle cases like #15960 more reliably
and not have the user specify devices manually. This also prepares for
later work to detect devices at runtime to allow e.g. use of KPR with ENI,
which adds devices dynamically.

The reason for introducing DeviceManager is to later add support for
dynamically reconfiguring the datapath for devices added at runtime.
Also the detected devices are now also used for host firewall and
bandwidth manager so makes sense for this logic to be moved out.

Fixes: #15960

Detect devices from global unicast routes in addition to only
looking for the device with the Kubernetes Node IP and the one with
default route.  This expands the set of devices used for kube-proxy
replacement, host firewall and bandwidth manager and should reduce
the need to specify devices manually.

@joamaki joamaki requested review from brb and borkmann August 23, 2021 15:14
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 23, 2021
@joamaki joamaki force-pushed the pr/joamaki/route-based-device-detection branch from 780dbc0 to 19a42da Compare August 23, 2021 15:16

// updateDevicesFromRoutes processes a batch of routes and updates the set of
// devices. Returns true if devices changed.
func (dm *DeviceManager) updateDevicesFromRoutes(routes []netlink.Route) bool {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A follow-up PR will reuse this logic to detect devices at runtime and reload the datapath.

return nil
}

func (dm *DeviceManager) GetDevices() []string {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a follow-up with runtime detection of devices the plan is to stop using option.Config.Devices throughout and rather call into device manager to ask for the list of devices as it may mutate over time. Hence I already have the state for the devices and this function here.

See #17187 for the work in progress for the runtime detection and reasoning behind the structure of this file.

@@ -112,6 +112,8 @@ type Daemon struct {
monitorAgent *monitoragent.Agent
ciliumHealth *health.CiliumHealth

deviceManager *DeviceManager
Copy link
Contributor Author

@joamaki joamaki Aug 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A more suitable place for DeviceManager might be in pkg/datapath, but I'd like to leave it here for now.

daemon/cmd/devices.go Outdated Show resolved Hide resolved
@joamaki
Copy link
Contributor Author

joamaki commented Aug 23, 2021

test-me-please

daemon/cmd/devices.go Outdated Show resolved Hide resolved
@joamaki joamaki force-pushed the pr/joamaki/route-based-device-detection branch from 19a42da to b6bfcd9 Compare August 30, 2021 07:47
@joamaki joamaki added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Aug 30, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 30, 2021
@joamaki
Copy link
Contributor Author

joamaki commented Aug 30, 2021

test-me-please

@joamaki joamaki force-pushed the pr/joamaki/route-based-device-detection branch 2 times, most recently from 6bd1d94 to c75e3f1 Compare August 30, 2021 14:02
@joamaki joamaki force-pushed the pr/joamaki/route-based-device-detection branch from c75e3f1 to cd0c249 Compare September 1, 2021 15:23
@joamaki
Copy link
Contributor Author

joamaki commented Sep 1, 2021

test-me-please

@joamaki
Copy link
Contributor Author

joamaki commented Sep 3, 2021

test-runtime

@joamaki
Copy link
Contributor Author

joamaki commented Sep 3, 2021

test-gke

@joamaki joamaki force-pushed the pr/joamaki/route-based-device-detection branch 2 times, most recently from 231790e to 23bc0f4 Compare September 6, 2021 09:02
@joamaki
Copy link
Contributor Author

joamaki commented Sep 6, 2021

test-runtime

daemon/cmd/devices_test.go Outdated Show resolved Hide resolved
@joamaki joamaki force-pushed the pr/joamaki/route-based-device-detection branch 2 times, most recently from e235b4b to 8dadc1a Compare September 21, 2021 11:26
@joamaki
Copy link
Contributor Author

joamaki commented Sep 21, 2021

test-me-please

Job 'Cilium-PR-K8s-1.16-net-next' failed and has not been observed before, so may be related to your PR:

Click to show.

Test Name

K8sDemosTest Tests Star Wars Demo

Failure Output

FAIL: DNS entry is not ready after timeout

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.16-net-next so I can create a new GitHub issue to track it.

@joamaki joamaki force-pushed the pr/joamaki/route-based-device-detection branch from 8dadc1a to 4bb3944 Compare September 22, 2021 09:02
@joamaki
Copy link
Contributor Author

joamaki commented Sep 22, 2021

test-me-please

Job 'Cilium-PR-K8s-1.16-net-next' failed and has not been observed before, so may be related to your PR:

Click to show.

Test Name

K8sBookInfoDemoTest Bookinfo Demo Tests bookinfo demo

Failure Output

FAIL: DNS entry is not ready after timeout

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.16-net-next so I can create a new GitHub issue to track it.

@joamaki
Copy link
Contributor Author

joamaki commented Sep 23, 2021

test-1.16-netnext

@joamaki
Copy link
Contributor Author

joamaki commented Sep 27, 2021

@pchaigno PTAL

Copy link
Member

@pchaigno pchaigno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you break commit daemon: Address review comments on device detection and merge it in the respective commits it fixes/improves? Having it in a separate commit hinders reviewing and doesn't make sense once merged.

Ah I was just trying to make reviewing easier. I'll squash.

I didn't mean you should squash everything; that doesn't make reviews any easier ;-)

If you're looking for guidance on how to split commits in pull requests, maybe a good starting point is https://kernelnewbies.org/PatchPhilosophy#How_to_break_up_changes. For this specific pull request, a good example is the code you moved from daemon/cmd/kube_proxy_replacement{,_test}.go to a new file; that move should be in its own commit, before the rest. It would help reviewers because we wouldn't have to re-review existing code that was just moved around.

daemon/cmd/daemon.go Outdated Show resolved Hide resolved
daemon/cmd/devices.go Outdated Show resolved Hide resolved
daemon/cmd/devices.go Outdated Show resolved Hide resolved
daemon/cmd/devices.go Outdated Show resolved Hide resolved
daemon/cmd/devices.go Outdated Show resolved Hide resolved
daemon/cmd/devices.go Outdated Show resolved Hide resolved
daemon/cmd/devices.go Outdated Show resolved Hide resolved
daemon/cmd/devices.go Outdated Show resolved Hide resolved
daemon/cmd/devices.go Outdated Show resolved Hide resolved
daemon/cmd/devices.go Outdated Show resolved Hide resolved
This reimplements the device detection logic to use the route
information rather than detecting devices based on k8s nodeIP and
the device with the default route.

The devices are discovered by finding all devices mentioned in global
unicast routes and filtering out cilium-managed devices by prefix.

The reason for introducing DeviceManager is to later add support for
dynamically reconfiguring the datapath for devices added at runtime.
Also the detected devices are now also used for host firewall and
bandwidth manager so makes sense for this logic to be moved out.

Fixes: cilium#15960

Signed-off-by: Jussi Maki <jussi@isovalent.com>
Since IPsec is not yet using the detected devices, disable
detection if only IPsec is enabled.

This fixes a failure in IPsec L7 tests. Root cause still
unknown why detecting the devices causes issues.

Signed-off-by: Jussi Maki <jussi@isovalent.com>
@joamaki joamaki force-pushed the pr/joamaki/route-based-device-detection branch from 4bb3944 to 78faa05 Compare September 30, 2021 08:52
@joamaki
Copy link
Contributor Author

joamaki commented Sep 30, 2021

test-me-please

Job 'Cilium-PR-K8s-1.16-net-next' failed and has not been observed before, so may be related to your PR:

Click to show.

Test Name

K8sChaosTest Connectivity demo application Endpoint can still connect while Cilium is not running

Failure Output

FAIL: DNS entry is not ready after timeout

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.16-net-next so I can create a new GitHub issue to track it.

@joamaki
Copy link
Contributor Author

joamaki commented Oct 1, 2021

test-1.16-netnext

@joamaki joamaki added the area/daemon Impacts operation of the Cilium daemon. label Oct 1, 2021
@pchaigno
Copy link
Member

pchaigno commented Oct 1, 2021

I think the two tests failing in k8s-1.16-kernel-netnext are new, so likely a case of #15474. You can confirm by checking you don't have the related changes (the changes those tests test) in your branch. Happy to help on Slack.

If that's confirmed, then I think we're good to go. The remaining cilium/agent review is for changes under daemon/ but I'm familiar with the specific code you touched and I approved already.

@joamaki
Copy link
Contributor Author

joamaki commented Oct 1, 2021

I think the two tests failing in k8s-1.16-kernel-netnext are new, so likely a case of #15474. You can confirm by checking you don't have the related changes (the changes those tests test) in your branch. Happy to help on Slack.

If that's confirmed, then I think we're good to go. The remaining cilium/agent review is for changes under daemon/ but I'm familiar with the specific code you touched and I approved already.

cilium % git diff upstream/master -- test/k8sT/Services.go|grep "Checks conn"
-       }, "Checks connectivity when skipping socket lb in pod ns", func() {

Yep, it's a new test.

@joamaki joamaki added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 1, 2021
@borkmann borkmann merged commit 74fbca2 into cilium:master Oct 1, 2021
@pchaigno
Copy link
Member

pchaigno commented Oct 1, 2021

Yep, it's a new test.

Yep, but are the changes added with that test included in your branch?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/daemon Impacts operation of the Cilium daemon. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cilium agent device auto-detection issue
7 participants