Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ipam/crd: Fix agent fatal on router initialization #22477

Merged
merged 1 commit into from
Dec 10, 2022

Conversation

jaffcheng
Copy link
Contributor

@jaffcheng jaffcheng commented Dec 1, 2022

Currently, when a cilium-agent in eni/alibabacloud mode initializes router info, it might encounter the following fatal:

level=fatal msg="Error while creating daemon" error="failed to create router info invalid ip: " subsys=daemon

The gateway IP of routing info is derived from the CIDR of the subnet, the eni.Subnet.CIDR in InstanceMap is set as empty after ENI creation. In normal cases it will be filled by a later resyncTrigger after maintainIPPool. But if another goroutine (periodic resync or pool maintainer of another node) happens to sync local InstanceMap cache to k8s, cilium-agent would be informed of that ENI IP pool with empty cidr and router IP allocation would fatal due to empty gateway IP.

This patch fixes this by filling the CIDR right after ENI creation.

Signed-off-by: Jaff Cheng jaff.cheng.sh@gmail.com

ipam/crd: Fix router initialization fatal when ENI data race happens

@jaffcheng jaffcheng requested review from a team as code owners December 1, 2022 11:53
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Dec 1, 2022
@aanm aanm added the kind/community-contribution This was a contribution made by a community member. label Dec 1, 2022
@christarazi christarazi added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. area/eni Impacts ENI based IPAM. sig/ipam IP address management, including cloud IPAM area/alibaba Impacts Alibaba based IPAM. labels Dec 2, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Dec 2, 2022
Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Could you describe under what conditions this may occur? It would be good to include that information in the release-note.

There are also some linting errors that need to be fixed.

Currently, when a cilium-agent in eni/alibabacloud mode initializes
router info, it might encounter the following fatal:

```
level=fatal msg="Error while creating daemon" error="failed to create router info invalid ip: " subsys=daemon
```

The gateway IP of routing info is derived from the CIDR of the subnet,
the eni.Subnet.CIDR in InstanceMap is set as empty after ENI creation.
In normal cases it will be filled by a later resyncTrigger after maintainIPPool.
But if another goroutine (periodic resync or pool maintainer of another node)
happens to sync local InstanceMap cache to k8s, cilium-agent would be
informed of that ENI IP pool with empty cidr and router IP allocation would
fatal due to empty gateway IP.

This patch fixes this by filling the CIDR right after ENI creation.

Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
@jaffcheng jaffcheng force-pushed the fix-router-init-fatal-upstream branch from d3754cc to 8a3cefd Compare December 5, 2022 10:10
@jaffcheng
Copy link
Contributor Author

@christarazi Sorry, fixed the linting errors and rebased.

Could you describe under what conditions this may occur?

When an ENI is created, we insert that ENI with an empty CIDR into InstanceManager:

n.manager.UpdateENI(instanceID, eni)

and then triggers resyncTrigger that updates the CIDR from cloud API.
Before that resync, another goroutine(periodic resync or pool maintainer of another node) might trigger a k8sSync and submit the ENI with the empty CIDR to ciliumnode.
node.k8sSync.Trigger()

The cilium-agent might then encounter this fatal.

@christarazi
Copy link
Member

christarazi commented Dec 6, 2022

/test

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:

Click to show.

Test Name

K8sDatapathConfig Check BPF masquerading with ip-masq-agent DirectRouting

Failure Output

FAIL: Found 1 k8s-app=cilium logs matching list of errors that must be investigated:

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next so I can create one.

@aanm aanm added the needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch label Dec 10, 2022
@aanm aanm merged commit 4d7acac into cilium:master Dec 10, 2022
@jaffcheng jaffcheng deleted the fix-router-init-fatal-upstream branch December 14, 2022 02:51
@joestringer joestringer added backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. and removed needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. labels Dec 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/alibaba Impacts Alibaba based IPAM. area/eni Impacts ENI based IPAM. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. kind/bug This is a bug in the Cilium logic. kind/community-contribution This was a contribution made by a community member. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/ipam IP address management, including cloud IPAM
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants