eni: Fix unexpected IP release when agent restarts #9888
Conversation
@tgraf @aanm @ungureanuvladvictor Please take a look
Release note label not set, please set the appropriate release note.
test-me-please (known flake https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/16410/)
test-me-please
pkg/ipam/allocator.go
Outdated
@@ -44,7 +44,7 @@ var (
 )

 // AllocateIP allocates a IP address.
-func (ipam *IPAM) AllocateIP(ip net.IP, owner string) (err error) {
+func (ipam *IPAM) AllocateIP(ip net.IP, owner string, refresh bool) (err error) {
Please introduce a new function AllocateIPAndSyncUpstream() instead of adding an argument. It will document the difference between refreshing the node immediately or postponing it. Reading true or false in a function argument makes the code hard to understand without looking up the definition of the function.
pkg/ipam/crd.go
Outdated
// TriggerRefresh updates the custom resource in the apiserver based on the latest
// information in the local node store
func (a *crdAllocator) TriggerRefresh(reason string) error {
Let's call this SyncUpstream() to be more generic. The concept of refreshing the node is clear for the CRD IPAM where there is a CiliumNode CRD, but it is not clear from a general-purpose IPAM perspective.
daemon/daemon.go
Outdated
@@ -450,6 +450,20 @@ func NewDaemon(ctx context.Context, dp datapath.Datapath) (*Daemon, *endpointRes
 		return nil, nil, err
 	}

+	// Trigger refresh and update custom resource in the apiserver with all restored endpoints
If I understand the intent of this code correctly, then you are adding this code because of the following:
store.refreshTrigger.TriggerWithReason("initial sync")
which also triggers a sync of the node, is done too early and is not blocking. Correct?
Instead of adding this code explicitly, please extend pkg/trigger with a function to wait for the first execution of the trigger, and then invoke the existing trigger instead of introducing a parallel way of invoking the sync. That way, the metrics on trigger invocations remain correct.
Thanks for the comments! I understand and absolutely agree with your idea that we should do all the sync actions from the existing trigger. I don't fully understand:
please extend pkg/trigger with a function to wait for the first execution of the trigger and then invoke the existing trigger
Do you mean extending pkg/trigger with a function, say TriggerAfter, which waits for restoration of pods, router and health endpoints and then invokes the existing trigger?
And also replace the following:
store.refreshTrigger.TriggerWithReason("initial sync")
with
store.refreshTrigger.TriggerAfter(checkRestoration)
Is my understanding right?
force-pushed from 1860ea7 to 002a5c9
wip, please don't review
force-pushed from 8a76f13 to 6640e69
force-pushed from 6640e69 to 90841da
force-pushed from 90841da to af873de
@tgraf comment addressed, please take another look
test-me-please
When cilium agent in eni mode restarts, the ciliumnode custom resource is cleared and refilled by several updates. Specifically, the Status.IPAM.Used map, which holds all used IPs, is first updated to an empty map before endpoints finish restoration. This becomes critical if `--aws-release-excess-ips` is enabled, since cilium operator treats an empty IPAM.Used map as meaning no addresses are in use and hence releases addresses arbitrarily, causing restored endpoints to be disconnected. This patch fixes this by combining per-endpoint update requests into one update request after all endpoint restoration finishes, so that Status.IPAM.Used keeps the desired state during agent restart. Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
force-pushed from af873de to 66e4e47
test-me-please
Last ginkgo tests failed, please help retrigger the tests
test-me-please
test-me-please Sorry folks, the previous test run was on a bogus machine so I needed to restart it.
K8s 1.18 vm provisioning fail: https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/18154/
test-me-please