ipam,alibabacloud: Improve event driven instance resync #25619

jaffcheng · 2023-05-23T12:50:35Z

Currently in AWS/Alibabacloud IPAM modes, every time an IP allocation/release event happens, the Cilium operator triggers a resync and fetches all ENIs from the instance API. In a large and frequently changing cluster, the full sync might take several tens of seconds. This severely slows down IP allocation and puts a lot of pressure on the instance API. The full sync should be unnecessary here, since we only need to update the ENIs of the instance that triggered the event.

This patch introduces a new Node.instanceSync trigger to replace NodeManager.resyncTrigger. Whenever an instance is mutated due to IP pool maintenance, the trigger attempts to incrementally resynchronize the corresponding instance to the local cache. This is achieved through the newly introduced InstanceSync method of the AllocationImplementation interface.
While this feature is implemented for Alibabacloud, AWS and Azure still fall back to full resynchronization.

Here is some timing data from an Alibabacloud cluster, measured during different periods:

                              full sync    InstanceSync
time cost with ~8000  ENIs    ~20s         ~2s
time cost with ~13000 ENIs    ~35s         ~2s

Related: #25073

christarazi · 2023-05-23T16:38:36Z

/test

christarazi · 2023-05-23T17:23:15Z

/test-vagrant

tommyp1ckles · 2023-05-23T21:35:14Z

/test-runtime

christarazi · 2023-05-31T16:31:23Z

Moving to draft while comments are being addressed. Feel free to mark ready to merge when you are ready for reviewers to take another look.

jaffcheng · 2023-06-03T10:07:38Z

@tommyp1ckles Refactored, please take another look, thank you!

tommyp1ckles · 2023-06-15T04:46:24Z

pkg/alibabacloud/eni/instances.go

+			"numSecurityGroups": len(securityGroups),
+		}).Info("Synchronized ENI information")
+
+		m.mutex.Lock()


nit: it may be better to defer the unlock here; that makes inadvertently introducing lock bugs less likely.
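
To make the suggestion concrete, here is a minimal sketch of the deferred-unlock pattern. It is not the code under review: `instanceCache` and `update` are hypothetical names, and the only point shown is that the deferred `Unlock` runs on every return path, including early error returns.

```go
package main

import (
	"fmt"
	"sync"
)

// instanceCache is a hypothetical cache guarded by a mutex.
type instanceCache struct {
	mutex sync.Mutex
	enis  map[string]int // instance ID -> number of ENIs
}

// update holds the lock until the function returns, whichever path is taken.
func (c *instanceCache) update(instanceID string, numENIs int) error {
	c.mutex.Lock()
	defer c.mutex.Unlock()

	if numENIs < 0 {
		// Early return: the deferred Unlock still runs, so the lock is not leaked.
		return fmt.Errorf("invalid ENI count %d for %s", numENIs, instanceID)
	}
	c.enis[instanceID] = numENIs
	return nil
}

func main() {
	c := &instanceCache{enis: map[string]int{}}
	fmt.Println(c.update("i-1", -1)) // takes the error path
	fmt.Println(c.update("i-1", 3))  // would block forever if the lock had leaked
}
```

With a manual `Unlock` at the end of the function, a later change that adds an early return can silently leave the mutex held; `defer` removes that failure mode at the cost of holding the lock for the rest of the function body.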

tommyp1ckles · 2023-06-15T05:10:42Z

pkg/ipam/node_manager.go

+	// SupportInstanceSync returns whether the instance incremental sync is
+	// supported. If not, full Resync is called instead.
+	SupportInstanceSync() bool
+
+	// InstanceSync is called to sync the state of the specified instance with
+	// external APIs or systems.
+	InstanceSync(ctx context.Context, instanceID string) time.Time
+


Looking at this a bit more, it feels a bit cumbersome for the node manager to have to branch based on what the underlying instance API implementation does.

At the end of the day, the IPAM manager just wants to request a resync of the instance state. Having a function that returns whether the implementation supports a feature feels like we're leaking implementation details a bit. It also makes the node handler code more complex.

I wonder if this optimization couldn't be done more opaquely by the underlying implementations?

jaffcheng · 2023-06-15T11:39:28Z

Thanks for the comments, please take another look

jaffcheng · 2023-06-15T12:21:36Z

After the previous force push, a linting error appeared that I don't see locally; looks like I have to rebase.

tommyp1ckles

Thanks for the changes 🥳 I think this all makes sense to me.

I'm not as familiar with the Alibaba implementation, so I'm approving for Azure/general-IPAM.

@christarazi do you mind taking a look as well 🙏

tommyp1ckles · 2023-06-15T21:03:50Z

@jaffcheng One more thought: do you have any data from testing this that shows the reduction in latency? If so, it might be good to add it here and in the commit message for future reference.

christarazi

LGTM thanks for the PR!

My only nit: please remove the Fixes from the commit msg, because #25073 will not be fully resolved for all cloud-based IPAM modes that suffer from the full sync. AFAIU, this PR only resolves it for Alibaba. You can use Related in the commit msg instead.

christarazi · 2023-06-16T06:19:22Z

/test

tommyp1ckles · 2023-06-16T06:45:25Z

> @tommyp1ckles @christarazi Thanks for the comments, I have updated the commit message, please take another look. Looks like the linting error is from the main branch.

awesome 🙏

tommyp1ckles · 2023-06-16T22:17:57Z

@gandro you mind taking a look as well?

gandro

Nice work, the new iteration looks much cleaner! Thanks

jaffcheng · 2023-06-19T13:34:01Z

Thanks again for the informative guidance! If a rebase for retest is needed, please let me know.

gandro · 2023-06-19T14:10:59Z

> Thanks again for the informative guidance! If a rebase for retest is needed, please let me know.

The remaining failures look unrelated (i.e. `too many arguments in call to newGlobalServiceCache`) - but could you do a rebase nonetheless so we can re-test?

jaffcheng · 2023-06-19T14:49:59Z

> The remaining failures look unrelated (i.e. `too many arguments in call to newGlobalServiceCache`) - but could you do a rebase nonetheless so we can re-test?

Sure, rebased

gandro · 2023-06-19T14:58:31Z

/test

gandro · 2023-06-21T08:37:29Z

CI-ginko failed with `The Service "echo-b" is invalid: spec.ports[0].nodePort: Invalid value: 31414: provided port is already allocated`, which is a known flake (e.g. #13071). Let's see if I can restart it.

gandro · 2023-06-21T08:53:49Z

CI is green. This would be ready to merge if it weren't for the feature freeze.

ldelossa · 2023-06-28T12:20:36Z

@jaffcheng - can you rebase this PR on main and push up? That will fix the issue where the gateway-api tests are stuck.

Currently in AWS/Alibabacloud ipam modes, every time the IP allocation/release event happens, the cilium operator triggers a resync and fetches all the ENIs from instance API. In a large and frequently changing cluster, the full sync might take several tens of seconds. This slows down the IP allocation process severely and also imposes a lot of pressure on the instance API. The full sync here should be unnecessary since we only need to update the ENIs of the instance that triggered the event.

This patch introduces a new `Node.instanceSync` trigger to replace `NodeManager.resyncTrigger`. Whenever an instance is mutated due to IP pool maintenance, the trigger attempts to incrementally resynchronize the corresponding instance to the local cache. This is achieved through the newly introduced `InstanceSync` method of the `AllocationImplementation` interface. While this feature is implemented for Alibabacloud, AWS and Azure still fall back to full resynchronization.

Here are some time cost data from an Alibabacloud cluster during different periods:

                              full sync    InstanceSync
time cost with ~8000  ENIs    ~20s         ~2s
time cost with ~13000 ENIs    ~35s         ~2s

Related: cilium#25073

Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>

jaffcheng · 2023-06-28T14:50:35Z

@ldelossa Sure, rebased

gandro · 2023-06-28T15:11:39Z

/test

ldelossa · 2023-06-29T15:09:26Z

All required tests pass. Merging.

jaffcheng requested review from a team as code owners May 23, 2023 12:50

jaffcheng requested review from gandro, tommyp1ckles and christarazi May 23, 2023 12:50

maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 23, 2023

github-actions bot added the kind/community-contribution This was a contribution made by a community member. label May 23, 2023

tommyp1ckles reviewed May 23, 2023

christarazi marked this pull request as draft May 31, 2023 16:30

jaffcheng force-pushed the improve-ipam-resync-upstream branch 2 times, most recently from fed0574 to 9849301 Compare June 3, 2023 09:44

jaffcheng marked this pull request as ready for review June 3, 2023 10:04

gandro requested a review from tommyp1ckles June 13, 2023 08:25

tommyp1ckles reviewed Jun 15, 2023

jaffcheng force-pushed the improve-ipam-resync-upstream branch from 9849301 to 1a2ade9 Compare June 15, 2023 11:32

jaffcheng requested a review from tommyp1ckles June 15, 2023 11:39

jaffcheng force-pushed the improve-ipam-resync-upstream branch from 1a2ade9 to 0e2529a Compare June 15, 2023 12:12

tommyp1ckles approved these changes Jun 15, 2023

christarazi approved these changes Jun 15, 2023

jaffcheng force-pushed the improve-ipam-resync-upstream branch from 0e2529a to 23beafd Compare June 16, 2023 04:29

gandro approved these changes Jun 19, 2023

gandro added release-note/misc This PR makes changes that have no direct user impact. sig/ipam IP address management, including cloud IPAM area/alibaba Impacts Alibaba based IPAM. area/ipam Impacts IP address management functionality. labels Jun 19, 2023

maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jun 19, 2023

jaffcheng force-pushed the improve-ipam-resync-upstream branch from 23beafd to 5397643 Compare June 19, 2023 14:46

ti-mo added kind/enhancement This would improve or streamline existing functionality. dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs labels Jun 20, 2023

jaffcheng force-pushed the improve-ipam-resync-upstream branch from 5397643 to 7db0cbd Compare June 28, 2023 14:48

ldelossa removed the dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs label Jun 29, 2023

maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 29, 2023

ldelossa merged commit cc9f038 into cilium:main Jun 29, 2023
64 of 65 checks passed

jaffcheng deleted the improve-ipam-resync-upstream branch June 30, 2023 03:08

tklauser mentioned this pull request Jul 10, 2023

ipam/azure: fix crash due to race condition when handling new node. #26658

Merged
