Loopback lo device no longer considered for Direct Routing in 1.15 #30889

Closed
2 of 3 tasks
dkulchinsky opened this issue Feb 21, 2024 · 11 comments · Fixed by #31200
Labels
kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. kind/regression This functionality worked fine before, but was broken in a newer release of Cilium. release-blocker/1.15 This issue will prevent the release of the next version of Cilium. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.

Comments

@dkulchinsky

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

After upgrading Cilium from 1.14.4 to 1.15.1, cilium-agent failed to initialize with the following error:

level=fatal msg="failed to start: daemon creation failed: failed to detect devices: unable to determine direct routing device. Use --direct-routing-device to specify it" subsys=daemon

Our setup is a bit unusual: we have 4 transit devices (named vlanXXX), and the node's IP is a global public IP assigned to the lo device.

We set the following flag:

--devices='vlan100,vlan200,vlan300,vlan400'

On Cilium 1.14.4 the agent initializes correctly and picks the lo device for Direct Routing (which makes sense, since the IP there matches the node's IP):

KubeProxyReplacement:    True   [lo <node ip> (Direct Routing), vlan100 172.27.128.114, vlan200 172.27.130.114, vlan300 172.27.132.114, vlan400 172.27.134.114]

The exact same setup fails on Cilium 1.15.1. As suggested by the error message, we tried adding --direct-routing-device='lo', but this had no effect. From what I was able to gather, the lo device is simply being ignored.

Cilium Version

1.15.1

Kernel Version

6.1.78

Kubernetes Version

1.26.14

Regression

This setup works as expected on 1.14.4

Sysdump

Since Cilium is unable to initialize, I can't generate a sysdump, and I had to revert to 1.14.4, where things work as expected.

Relevant log output

level=info msg="Node addresses updated" device=mgmt1 node-addresses="172.27.140.117 (mgmt1)" subsys=node-address
level=info msg="Node addresses updated" device=vlan100 node-addresses="172.27.128.117 (vlan100)" subsys=node-address
level=info msg="Node addresses updated" device=vlan200 node-addresses="172.27.130.117 (vlan200)" subsys=node-address
level=info msg="Node addresses updated" device=vlan300 node-addresses="172.27.132.117 (vlan300)" subsys=node-address
level=info msg="Node addresses updated" device=vlan400 node-addresses="172.27.134.117 (vlan400)" subsys=node-address
level=fatal msg="failed to start: daemon creation failed: failed to detect devices: unable to determine direct routing device. Use --direct-routing-device to specify it" subsys=daemon

Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct
@dkulchinsky dkulchinsky added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels Feb 21, 2024
@brb
Member

brb commented Feb 21, 2024

Thanks for the issue! We are aware of it. Hopefully, it will get fixed in the next release.

@dkulchinsky
Author

That's great @brb! Thank you for the prompt reply.

Is there an issue that already tracks this (I might have missed it in my search)? I'm happy to close this if it's a dup.

@brb
Member

brb commented Feb 22, 2024

I'm not aware of any GH issue (maybe @joamaki @bimmlerd?), so let's keep it open.

@brb brb added release-blocker/1.15 This issue will prevent the release of the next version of Cilium. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. and removed needs/triage This issue requires triaging to establish severity and next steps. labels Feb 22, 2024
@julianwiedmann julianwiedmann added the kind/regression This functionality worked fine before, but was broken in a newer release of Cilium. label Feb 22, 2024
@brb
Member

brb commented Feb 27, 2024

Fix PR #30996.

@joamaki
Contributor

joamaki commented Feb 28, 2024

@dkulchinsky would you be able to test the fix in #30996? e.g.
helm install cilium -n kube-system oci://quay.io/cilium-charts-dev/cilium --version 1.15.1-dev.35-HEAD-dba536ce94

@dkulchinsky
Author

Thank you @joamaki!

I'm traveling right now, but will test this out and report back as soon as I can.

@dkulchinsky
Author

Hey @joamaki, @bimmlerd

Deployed 1.15.1-dev.35-HEAD-dba536ce94 and it works as expected 👍🏼

I did have to set direct-routing-device: "lo" and add lo to devices.

@joestringer
Member

@brb I see that you set release-blocker/1.15 on this issue. To really block the release, we must see active ongoing movement on the PR. Is there an expectation to land the fix this week? Otherwise I think that we are setting the wrong expectations by having the release-blocker/1.15 label on this issue.

@brb
Member

brb commented Mar 6, 2024

Is there an expectation to land the fix this week?

I will defer an answer to @bimmlerd who is working on the fix (#30996). Anyway, when do we want to do the next v1.15 release?

Your concern makes sense to me. I set the label to indicate the importance of the regression, and to avoid it slipping unnoticed.

@bimmlerd
Member

bimmlerd commented Mar 6, 2024

Reopening - this is primarily about 1.15 and I'm only working on the backport now. EDIT: backport in #31206

@bimmlerd bimmlerd reopened this Mar 6, 2024
@julianwiedmann
Member

Merged now 🎉.


6 participants