Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

eni: Fix unexpected IP release when agent restarts #9888

Conversation

jaffcheng
Copy link
Contributor

@jaffcheng jaffcheng commented Jan 16, 2020

When cilium agent in eni mode restarts, ciliumnode custom resource
is cleared and refilled by several updates. Specifically,
Status.IPAM.Used map which holds all used IPs is first updated
to an empty map before endpoints finish restoration.

This becomes critical if --aws-release-excess-ips is enabled
since cilium operator treats empty IPAM.Used map as no address used
hence releases addresses arbitraryly, causing restored endpoints
disconnected.

This patch fixes this by combining per endpoint update requests
into one update request after all endpoint restoration finishes so
that Status.IPAM.Used keeps the desired state during agent restart.

Signed-off-by: Jaff Cheng jaff.cheng.sh@gmail.com


This change is Reviewable

@jaffcheng jaffcheng requested a review from a team January 16, 2020 16:53
@jaffcheng jaffcheng requested review from a team as code owners January 16, 2020 16:53
@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.7.0 Jan 16, 2020
@jaffcheng
Copy link
Contributor Author

@tgraf @aanm @ungureanuvladvictor Please take a look

@ungureanuvladvictor ungureanuvladvictor added the area/eni Impacts ENI based IPAM. label Jan 16, 2020
@maintainer-s-little-helper
Copy link

Release note label not set, please set the appropriate release note.

@coveralls
Copy link

coveralls commented Jan 16, 2020

Coverage Status

Coverage decreased (-0.04%) to 45.619% when pulling 66e4e47 on ctripcloud:pr/jaffcheng/trigger-refresh-only-once-during-restore into f03b69c on cilium:master.

@ungureanuvladvictor
Copy link
Member

ungureanuvladvictor commented Jan 16, 2020

@aanm aanm added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Jan 20, 2020
@aanm aanm added this to the 1.7 milestone Jan 20, 2020
@raybejjani
Copy link
Contributor

test-me-please

@@ -44,7 +44,7 @@ var (
)

// AllocateIP allocates a IP address.
func (ipam *IPAM) AllocateIP(ip net.IP, owner string) (err error) {
func (ipam *IPAM) AllocateIP(ip net.IP, owner string, refresh bool) (err error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please introduce a new function AllocateIPAndSyncUpstream() instead of adding an argument. It will document the difference between refreshing the node immediately or postponing it. Reading true or false to a function argument makes the code hard to understand without looking up the definition of the function.

pkg/ipam/crd.go Outdated

// TriggerRefresh updates the custom resource in the apiserver based on the latest
// information in the local node store
func (a *crdAllocator) TriggerRefresh(reason string) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's call this SyncUpstream() to be more generic. The concept of refreshing the node is clear for the CRD IPAM where there is a CiliumNode CRD but it is not clear from a general-purpose IPAM perspective.

daemon/daemon.go Outdated
@@ -450,6 +450,20 @@ func NewDaemon(ctx context.Context, dp datapath.Datapath) (*Daemon, *endpointRes
return nil, nil, err
}

// Trigger refresh and update custom resource in the apiserver with all restored endpoints
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand the intent of this code correctly, then you are adding this code because the following:

	store.refreshTrigger.TriggerWithReason("initial sync")

which also triggers a sync of the node is done too early and not blocking. Correct?

Instead of adding this code explicitly, please extend pkg/trigger with a function to wait for the first execution of the trigger and then invoke the existing trigger instead of introducing a parallel way of invoking the sync. That way, the metrics on trigger invocations remain correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments! I understand and absolutely agree with your idea that we should do all the sync actions from the existing trigger. I don't fully understand

please extend pkg/trigger with a function to wait for the first execution of the trigger and then invoke the existing trigger

Do you mean extending pkg/trigger with a function, say TriggerAfter, which waits for restoration of pods, router and health endpoints and then invoke the existing trigger?

And also replace the following:

	store.refreshTrigger.TriggerWithReason("initial sync")

with

	store.refreshTrigger.TriggerAfter(checkRestoration)

Is my understanding right?

@aanm aanm modified the milestones: 1.7, 1.8 Feb 4, 2020
@aanm aanm added needs-backport/1.7 dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. and removed needs-backport/1.7 labels Feb 4, 2020
@jaffcheng jaffcheng force-pushed the pr/jaffcheng/trigger-refresh-only-once-during-restore branch 3 times, most recently from 1860ea7 to 002a5c9 Compare February 9, 2020 14:17
@jaffcheng
Copy link
Contributor Author

wip, please don't review

@aanm aanm removed this from the 1.7 milestone Feb 11, 2020
@aanm aanm added the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Mar 4, 2020
@jaffcheng jaffcheng force-pushed the pr/jaffcheng/trigger-refresh-only-once-during-restore branch from 8a76f13 to 6640e69 Compare March 4, 2020 11:03
daemon/daemon.go Outdated Show resolved Hide resolved
@jaffcheng jaffcheng force-pushed the pr/jaffcheng/trigger-refresh-only-once-during-restore branch from 6640e69 to 90841da Compare March 5, 2020 03:59
@joestringer joestringer added this to Needs backport from master in 1.7.2 Mar 5, 2020
@jaffcheng jaffcheng force-pushed the pr/jaffcheng/trigger-refresh-only-once-during-restore branch from 90841da to af873de Compare March 9, 2020 08:40
@jaffcheng
Copy link
Contributor Author

@tgraf comment addressed, please take another look

@tgraf
Copy link
Member

tgraf commented Mar 12, 2020

test-me-please

When cilium agent in eni mode restarts, ciliumnode custom resource
is cleared and refilled by several updates. Specifically,
Status.IPAM.Used map which holds all used IPs is first updated
to an empty map before endpoints finish restoration.

This becomes critical if `--aws-release-excess-ips` is enabled
since cilium operator treats empty IPAM.Used map as no address used
hence releases addresses arbitraryly, causing restored endpoints
disconnected.

This patch fixes this by combining per endpoint update requests
into one update request after all endpoint restoration finishes so
that Status.IPAM.Used keeps the desired state during agent restart.

Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
@jaffcheng jaffcheng force-pushed the pr/jaffcheng/trigger-refresh-only-once-during-restore branch from af873de to 66e4e47 Compare March 17, 2020 10:30
@tgraf
Copy link
Member

tgraf commented Mar 17, 2020

test-me-please

@jaffcheng
Copy link
Contributor Author

Last ginkgo tests failed, please help retrigger the tests

@tgraf
Copy link
Member

tgraf commented Mar 24, 2020

test-me-please

@raybejjani
Copy link
Contributor

raybejjani commented Mar 24, 2020

test-me-please

Sorry folks, the previous test run was on a bogus machine so I needed to restart it.

@tklauser tklauser removed the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Mar 24, 2020
@tklauser
Copy link
Member

K8s 1.18 vm provisioning fail: https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/18154/

@tklauser
Copy link
Member

test-me-please

@tgraf tgraf merged commit 91071da into cilium:master Mar 24, 2020
@jaffcheng jaffcheng deleted the pr/jaffcheng/trigger-refresh-only-once-during-restore branch March 25, 2020 03:17
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.7 in 1.7.2 Mar 30, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.7 in 1.7.2 Mar 30, 2020
@tklauser tklauser moved this from Backport pending to v1.7 to Backport done to v1.7 in 1.7.2 Apr 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/eni Impacts ENI based IPAM. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
1.7.2
Backport done to v1.7
Development

Successfully merging this pull request may close these issues.

None yet

7 participants