Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ipam,alibabacloud: Improve event driven instance resync #25619

Merged
merged 1 commit into from Jun 29, 2023

Conversation

jaffcheng
Copy link
Contributor

@jaffcheng jaffcheng commented May 23, 2023

Currently in AWS/Alibabacloud ipam modes, every time the IP allocation/release event happens, the cilium operator triggers a resync and fetches all the ENIs from instance API. In a large and frequently changing cluster, the full sync might take several tens of seconds. This slows down the IP allocation process severely and also imposes a lot of pressure on the instance API. The full sync here should be unnecessary since we only need to update the ENIs of the instance that triggered the event.

This patch introduces a new Node.instanceSync trigger to replace NodeManager.resyncTrigger. Whenever an instance is mutated due to IP pool maintenance, the trigger attempts to incrementally resynchronize the corresponding instance to the local cache. This is achieved through the newly introduced InstanceSync method of the AllocationImplementation interface.
While this feature is implemented for Alibabacloud, AWS and Azure still fall back to full resynchronization.

Here are some time cost data from an Alibabacloud cluster during different periods:

                              full sync    InstanceSync
time cost with ~8000  ENIs    ~20s         ~2s
time cost with ~13000 ENIs    ~35s         ~2s

Related: #25073

ipam,alibabacloud: Improve event driven instance resync

@jaffcheng jaffcheng requested review from a team as code owners May 23, 2023 12:50
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 23, 2023
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label May 23, 2023
@christarazi
Copy link
Member

/test

@christarazi
Copy link
Member

/test-vagrant

pkg/ipam/node_manager.go Outdated Show resolved Hide resolved
pkg/alibabacloud/eni/instances.go Show resolved Hide resolved
pkg/alibabacloud/api/api.go Outdated Show resolved Hide resolved
pkg/alibabacloud/eni/instances.go Outdated Show resolved Hide resolved
@tommyp1ckles
Copy link
Contributor

/test-runtime

@christarazi christarazi marked this pull request as draft May 31, 2023 16:30
@christarazi
Copy link
Member

Moving to draft while comments are being addressed. Feel free to mark ready to merge when are ready for reviewers to take another look.

@jaffcheng jaffcheng force-pushed the improve-ipam-resync-upstream branch 2 times, most recently from fed0574 to 9849301 Compare June 3, 2023 09:44
@jaffcheng jaffcheng marked this pull request as ready for review June 3, 2023 10:04
@jaffcheng
Copy link
Contributor Author

@tommyp1ckles Refactored, please take another look, thank you!

@gandro gandro requested a review from tommyp1ckles June 13, 2023 08:25
"numSecurityGroups": len(securityGroups),
}).Info("Synchronized ENI information")

m.mutex.Lock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it may be better to defer the unlock here, makes inadvertently introducing lock bugs less likely.

pkg/ipam/node_manager.go Outdated Show resolved Hide resolved
Comment on lines 122 to 125
// SupportInstanceSync returns whether the instance incremental sync is
// supported. If not, full Resync is called instead.
SupportInstanceSync() bool

// InstanceSync is called to sync the state of the specified instance with
// external APIs or systems.
InstanceSync(ctx context.Context, instanceID string) time.Time

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this a bit more, it feels a bit cumbersome for the node manager to have to branch based on what the underlying instance API implementation does.

At the end of the day, the IPAM manager just wants to request a resync of the instance state, having a function that returns whether the implementation supports a feature feels like we're leaking implementation details a bit. It also makes the node handler code more complex.

I wonder if this optimization couldn't be done more opaquely by the underlying implementations?

@jaffcheng jaffcheng force-pushed the improve-ipam-resync-upstream branch from 9849301 to 1a2ade9 Compare June 15, 2023 11:32
@jaffcheng
Copy link
Contributor Author

Thanks for the comments, please take another look

@jaffcheng
Copy link
Contributor Author

After previous force push, a linting error appeared that I don't have locally, looks like l have to rebase.

Copy link
Contributor

@tommyp1ckles tommyp1ckles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes 🥳 I think this all makes sense to me.

I'm not as familiar with the alibaba implementation, so I'm approving for Azure/general-IPAM

@christarazi do you mind taking a look as well 🙏

@tommyp1ckles
Copy link
Contributor

@jaffcheng One more thought, do you have any data from testing this showing the reduction in latency. If so, might be good to add here/commit message for future reference.

Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks for the PR!

My only nit is please remove the Fixes from the commit msg because #25073 will not be fully resolved for all cloud-based IPAM modes which suffer from the full sync. AFAIU, this PR only resolves it for Alibaba. You can used Related in the commit msg instead.

@jaffcheng jaffcheng force-pushed the improve-ipam-resync-upstream branch from 0e2529a to 23beafd Compare June 16, 2023 04:29
@christarazi
Copy link
Member

/test

@tommyp1ckles
Copy link
Contributor

@tommyp1ckles @christarazi Thanks for the comments, I have updated the commit message, please take another look. Looks like the linting error is from the main branch.

awesome 🙏

@tommyp1ckles
Copy link
Contributor

@gandro you mind taking a look as well?

Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, the new iteration looks much cleaner! Thanks

@jaffcheng
Copy link
Contributor Author

Thanks again for the informative guidance! If a rebase for retest is needed, please let me know.

@gandro gandro added release-note/misc This PR makes changes that have no direct user impact. sig/ipam IP address management, including cloud IPAM area/alibaba Impacts Alibaba based IPAM. area/ipam Impacts IP address management functionality. labels Jun 19, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jun 19, 2023
@gandro
Copy link
Member

gandro commented Jun 19, 2023

Thanks again for the informative guidance! If a rebase for retest is needed, please let me know.

The remaining failures look unrelated (i.e. too many arguments in call to newGlobalServiceCache) - but could you do a rebase nontheless so we can re-test?

@jaffcheng jaffcheng force-pushed the improve-ipam-resync-upstream branch from 23beafd to 5397643 Compare June 19, 2023 14:46
@jaffcheng
Copy link
Contributor Author

The remaining failures look unrelated (i.e. too many arguments in call to newGlobalServiceCache) - but could you do a rebase nontheless so we can re-test?

Sure, rebased

@gandro
Copy link
Member

gandro commented Jun 19, 2023

/test

@ti-mo ti-mo added kind/enhancement This would improve or streamline existing functionality. dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs labels Jun 20, 2023
@gandro
Copy link
Member

gandro commented Jun 21, 2023

CI-ginko failed with The Service "echo-b" is invalid: spec.ports[0].nodePort: Invalid value: 31414: provided port is already allocated which is a known flake (e.g. #13071). Let's see if I can restart it.

@gandro
Copy link
Member

gandro commented Jun 21, 2023

CI is green. This would be ready to merge if it wasn't for the feature freeze

@ldelossa
Copy link
Contributor

@jaffcheng - can you rebase this PR on main and push up? That will fix the issue where the gateway-api tests are stuck.

Currently in AWS/Alibabacloud ipam modes, every time the
IP allocation/release event happens, the cilium operator
triggers a resync and fetches all the ENIs from instance
API. In a large and frequently changing cluster, the full
sync might take several tens of seconds.
This slows down the IP allocation process severely and
also imposes a lot of pressure on the instance API.
The full sync here should be unnecessary since we only
need to update the ENIs of the instance that triggered
the event.

This patch introduces a new `Node.instanceSync` trigger
to replace `NodeManager.resyncTrigger`. Whenever an
instance is mutated due to IP pool maintenance, the trigger
attempts to incrementally resynchronize the corresponding
instance to the local cache. This is achieved through the
newly introduced `InstanceSync` method of the
`AllocationImplementation` interface.
While this feature is implemented for Alibabacloud, AWS
and Azure still fall back to full resynchronization.

Here are some time cost data from an Alibabacloud cluster
during different periods:

                              full sync    InstanceSync
time cost with ~8000  ENIs    ~20s         ~2s
time cost with ~13000 ENIs    ~35s         ~2s

Related: cilium#25073
Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
@jaffcheng jaffcheng force-pushed the improve-ipam-resync-upstream branch from 5397643 to 7db0cbd Compare June 28, 2023 14:48
@jaffcheng
Copy link
Contributor Author

@ldelossa Sure, rebased

@gandro
Copy link
Member

gandro commented Jun 28, 2023

/test

@ldelossa ldelossa removed the dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs label Jun 29, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 29, 2023
@ldelossa
Copy link
Contributor

All required tests pass. Merging.

@ldelossa ldelossa merged commit cc9f038 into cilium:main Jun 29, 2023
64 of 65 checks passed
@jaffcheng jaffcheng deleted the improve-ipam-resync-upstream branch June 30, 2023 03:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/alibaba Impacts Alibaba based IPAM. area/ipam Impacts IP address management functionality. kind/community-contribution This was a contribution made by a community member. kind/enhancement This would improve or streamline existing functionality. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact. sig/ipam IP address management, including cloud IPAM
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants