ipam/crd: Fix agent fatal on router initialization #22477
Thanks for the PR. Could you describe under what conditions this may occur? It would be good to include that information in the release-note.
There are also some linting errors that need to be fixed.
Currently, when a cilium-agent in eni/alibabacloud mode initializes router info, it might encounter the following fatal error:

```
level=fatal msg="Error while creating daemon" error="failed to create router info invalid ip: " subsys=daemon
```

The gateway IP of the routing info is derived from the subnet CIDR, but eni.Subnet.CIDR in the InstanceMap is left empty after ENI creation. Normally it is filled in by a later resyncTrigger after maintainIPPool. However, if another goroutine (a periodic resync or the pool maintainer of another node) happens to sync the local InstanceMap cache to k8s first, the cilium-agent is informed of that ENI IP pool with an empty CIDR, and router IP allocation fails fatally because the gateway IP is empty.

This patch fixes this by filling in the CIDR right after ENI creation.

Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
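The commit message says the gateway IP is derived from the subnet CIDR. The minimal Go sketch below shows why an empty CIDR surfaces as `invalid ip: ` with nothing after the colon. It assumes, for illustration only, that the gateway is the first host address of the subnet; `deriveGatewayIP` and `newRoutingInfo` are hypothetical stand-ins, not Cilium's actual functions.

```go
package main

import (
	"fmt"
	"net"
)

// deriveGatewayIP is HYPOTHETICAL: it assumes the gateway is the first host
// address of the subnet (network address + 1, valid for prefixes up to /30)
// and returns "" when the CIDR cannot be parsed, e.g. when it is empty.
func deriveGatewayIP(cidr string) string {
	_, ipNet, err := net.ParseCIDR(cidr)
	if err != nil {
		return ""
	}
	gw := make(net.IP, len(ipNet.IP))
	copy(gw, ipNet.IP)
	gw[len(gw)-1]++ // network address + 1
	return gw.String()
}

// newRoutingInfo is HYPOTHETICAL: it rejects an empty or malformed gateway
// with an error shaped like the one in the fatal log line.
func newRoutingInfo(gateway string) (net.IP, error) {
	ip := net.ParseIP(gateway)
	if ip == nil {
		return nil, fmt.Errorf("invalid ip: %s", gateway)
	}
	return ip, nil
}

func main() {
	for _, cidr := range []string{"10.0.1.0/24", ""} {
		gw := deriveGatewayIP(cidr)
		_, err := newRoutingInfo(gw)
		fmt.Printf("cidr=%q gateway=%q err=%v\n", cidr, gw, err)
	}
}
```

With a populated CIDR the gateway resolves to `10.0.1.1`; with the empty CIDR published before the resync, the gateway is empty and the error message ends right after `invalid ip: `, matching the log.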
@christarazi Sorry, fixed the linting errors and rebased.
When an ENI is created, we insert it with an empty CIDR into the InstanceMap (cilium/pkg/alibabacloud/eni/node.go, line 181 in 8a3cefd) and then trigger resyncTrigger, which updates the CIDR from the cloud API. Before that resync runs, another goroutine (a periodic resync or the pool maintainer of another node) might trigger a k8sSync and submit the ENI with the empty CIDR to the CiliumNode (cilium/pkg/ipam/node_manager.go, line 481 in 8a3cefd).
The cilium-agent might then encounter this fatal error.
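The race described above can be modelled with a small Go sketch. Everything in it is a HYPOTHETICAL simplification (the `Subnet`, `ENI`, and `InstanceMap` types, `createENI`, `resync`, and the `vsw-1` subnet ID are illustrative, not Cilium's real definitions), and the goroutine interleaving is forced sequentially so the window between ENI creation and resync is reproducible. `Snapshot` stands in for the k8sSync that publishes to the CiliumNode.

```go
package main

import (
	"fmt"
	"sync"
)

// HYPOTHETICAL simplified types, not Cilium's real definitions.
type Subnet struct {
	ID   string
	CIDR string
}

type ENI struct {
	ID     string
	Subnet Subnet
}

// InstanceMap is a mutex-protected cache of ENIs keyed by ID.
type InstanceMap struct {
	mu   sync.Mutex
	enis map[string]ENI
}

func NewInstanceMap() *InstanceMap { return &InstanceMap{enis: map[string]ENI{}} }

func (m *InstanceMap) Update(e ENI) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.enis[e.ID] = e
}

// Snapshot models what a k8sSync from another goroutine would publish.
func (m *InstanceMap) Snapshot() map[string]ENI {
	m.mu.Lock()
	defer m.mu.Unlock()
	out := make(map[string]ENI, len(m.enis))
	for k, v := range m.enis {
		out[k] = v
	}
	return out
}

// createENI inserts a freshly created ENI. fillCIDR=false models the
// behaviour before the patch (CIDR left empty until a later resync);
// fillCIDR=true models filling the CIDR right after creation.
func createENI(m *InstanceMap, id string, subnet Subnet, fillCIDR bool) {
	e := ENI{ID: id, Subnet: Subnet{ID: subnet.ID}}
	if fillCIDR {
		e.Subnet.CIDR = subnet.CIDR
	}
	m.Update(e)
}

// resync models the later resyncTrigger that refreshes the CIDR from the cloud API.
func resync(m *InstanceMap, id string, subnet Subnet) {
	m.Update(ENI{ID: id, Subnet: subnet})
}

func main() {
	subnet := Subnet{ID: "vsw-1", CIDR: "10.0.1.0/24"}
	for _, fill := range []bool{false, true} {
		m := NewInstanceMap()
		createENI(m, "eni-1", subnet, fill)
		published := m.Snapshot() // k8sSync runs inside the window, before resync
		resync(m, "eni-1", subnet)
		fmt.Printf("fillCIDR=%v published CIDR=%q\n", fill, published["eni-1"].Subnet.CIDR)
	}
}
```

Without the fix the snapshot taken inside the window carries an empty CIDR even though the resync repairs the local cache moments later; filling the CIDR at creation time closes the window regardless of when the other goroutine syncs.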
/test

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed.