ipam,alibabacloud: Improve event driven instance resync #25619
/test
/test-vagrant
/test-runtime
Moving to draft while comments are being addressed. Feel free to mark it ready to merge when you are ready for reviewers to take another look.
fed0574 to 9849301
@tommyp1ckles Refactored, please take another look, thank you!
"numSecurityGroups": len(securityGroups),
}).Info("Synchronized ENI information")

m.mutex.Lock()
nit: it may be better to defer the unlock here, makes inadvertently introducing lock bugs less likely.
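A minimal sketch of the pattern suggested above, assuming a mutex-guarded cache. The `eniCache` type and its methods are hypothetical and are not the types used in this PR; the point is only that a deferred unlock covers every return path, including ones added later.

```go
package main

import "sync"

// eniCache is a hypothetical stand-in for a mutex-guarded instance cache.
// It is not the type used in this PR.
type eniCache struct {
	mutex sync.Mutex
	enis  map[string][]string // instance ID -> ENI IDs
}

// update replaces the ENIs recorded for one instance.
func (c *eniCache) update(instanceID string, enis []string) {
	c.mutex.Lock()
	// Deferring the unlock right after taking the lock releases the mutex
	// on every return path, including the early return below.
	defer c.mutex.Unlock()

	if c.enis == nil {
		c.enis = make(map[string][]string)
	}
	if len(enis) == 0 {
		delete(c.enis, instanceID)
		return
	}
	c.enis[instanceID] = enis
}

// count returns how many ENIs are recorded for one instance.
func (c *eniCache) count(instanceID string) int {
	c.mutex.Lock()
	defer c.mutex.Unlock()
	return len(c.enis[instanceID])
}
```

With a manual `Unlock()` at the end of `update`, the early return in the empty-slice branch would leave the mutex held; the deferred form avoids that class of bug.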
pkg/ipam/node_manager.go
Outdated
// SupportInstanceSync returns whether the instance incremental sync is
// supported. If not, full Resync is called instead.
SupportInstanceSync() bool

// InstanceSync is called to sync the state of the specified instance with
// external APIs or systems.
InstanceSync(ctx context.Context, instanceID string) time.Time
Looking at this a bit more, it feels a bit cumbersome for the node manager to have to branch based on what the underlying instance API implementation does.
At the end of the day, the IPAM manager just wants to request a resync of the instance state, having a function that returns whether the implementation supports a feature feels like we're leaking implementation details a bit. It also makes the node handler code more complex.
I wonder if this optimization couldn't be done more opaquely by the underlying implementations?
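One way the idea above could look, as a rough sketch rather than the code in this PR: the caller always asks for an instance sync, and a backend without a per-instance API quietly falls back to a full resync. The `InstanceSync` signature is taken from the diff above; the `syncer` interface, the `Resync` signature, both backend types, and `runExample` are assumptions made for illustration.

```go
package main

import (
	"context"
	"time"
)

// syncer is a hypothetical, trimmed-down view of the allocation
// implementation: callers only ever ask for an instance sync.
type syncer interface {
	Resync(ctx context.Context) time.Time
	InstanceSync(ctx context.Context, instanceID string) time.Time
}

// fullOnlyBackend models a backend without a per-instance API.
type fullOnlyBackend struct{ fullSyncs int }

func (b *fullOnlyBackend) Resync(ctx context.Context) time.Time {
	b.fullSyncs++
	return time.Now()
}

// InstanceSync hides the missing capability by doing a full resync.
func (b *fullOnlyBackend) InstanceSync(ctx context.Context, instanceID string) time.Time {
	return b.Resync(ctx)
}

// incrementalBackend models a backend that can refresh a single instance.
type incrementalBackend struct {
	fullSyncs     int
	instanceSyncs map[string]int
}

func (b *incrementalBackend) Resync(ctx context.Context) time.Time {
	b.fullSyncs++
	return time.Now()
}

func (b *incrementalBackend) InstanceSync(ctx context.Context, instanceID string) time.Time {
	if b.instanceSyncs == nil {
		b.instanceSyncs = make(map[string]int)
	}
	b.instanceSyncs[instanceID]++
	return time.Now()
}

// requestSync is all the caller needs: no capability check, no branching.
func requestSync(ctx context.Context, s syncer, instanceID string) time.Time {
	return s.InstanceSync(ctx, instanceID)
}

// runExample drives both backends through the same call path and reports
// how many full and per-instance syncs each one performed.
func runExample() (fullOnlyFull, incrementalFull, incrementalPerInstance int) {
	ctx := context.Background()
	a := &fullOnlyBackend{}
	b := &incrementalBackend{}
	requestSync(ctx, a, "i-1")
	requestSync(ctx, b, "i-1")
	return a.fullSyncs, b.fullSyncs, b.instanceSyncs["i-1"]
}
```

In this shape a method like `SupportInstanceSync` is no longer needed by the caller, since the choice between incremental and full sync lives entirely inside each implementation.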
9849301 to 1a2ade9
Thanks for the comments, please take another look.
1a2ade9 to 0e2529a
After the previous force push, a linting error appeared that I don't have locally. Looks like I have to rebase.
Thanks for the changes 🥳 I think this all makes sense to me.
I'm not as familiar with the alibaba implementation, so I'm approving for Azure/general-IPAM
@christarazi do you mind taking a look as well 🙏
@jaffcheng One more thought: do you have any data from testing this that shows the reduction in latency? If so, it might be good to add it here or to the commit message for future reference.
LGTM thanks for the PR!
My only nit is please remove the `Fixes` from the commit msg because #25073 will not be fully resolved for all cloud-based IPAM modes which suffer from the full sync. AFAIU, this PR only resolves it for Alibaba. You can use `Related` in the commit msg instead.
0e2529a to 23beafd
/test
awesome 🙏
@gandro you mind taking a look as well?
Nice work, the new iteration looks much cleaner! Thanks
Thanks again for the informative guidance! If a rebase for retest is needed, please let me know.
The remaining failures look unrelated (i.e. |
23beafd to 5397643
Sure, rebased
/test
CI-ginko failed with |
CI is green. This would be ready to merge if it wasn't for the feature freeze |
@jaffcheng - can you rebase this PR on |
Currently in AWS/Alibabacloud ipam modes, every time an IP allocation/release event happens, the cilium operator triggers a resync and fetches all the ENIs from the instance API. In a large and frequently changing cluster, the full sync might take several tens of seconds. This slows down the IP allocation process severely and also imposes a lot of pressure on the instance API. The full sync here should be unnecessary since we only need to update the ENIs of the instance that triggered the event.

This patch introduces a new `Node.instanceSync` trigger to replace `NodeManager.resyncTrigger`. Whenever an instance is mutated due to IP pool maintenance, the trigger attempts to incrementally resynchronize the corresponding instance to the local cache. This is achieved through the newly introduced `InstanceSync` method of the `AllocationImplementation` interface. While this feature is implemented for Alibabacloud, AWS and Azure still fall back to full resynchronization.

Here is some time cost data from an Alibabacloud cluster during different periods:

                              full sync    InstanceSync
time cost with ~8000 ENIs     ~20s         ~2s
time cost with ~13000 ENIs    ~35s         ~2s

Related: cilium#25073

Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
5397643 to 7db0cbd
@ldelossa Sure, rebased
/test
All required tests pass. Merging.
Currently in AWS/Alibabacloud ipam modes, every time an IP allocation/release event happens, the cilium operator triggers a resync and fetches all the ENIs from the instance API. In a large and frequently changing cluster, the full sync might take several tens of seconds. This slows down the IP allocation process severely and also imposes a lot of pressure on the instance API. The full sync here should be unnecessary since we only need to update the ENIs of the instance that triggered the event.
This patch introduces a new `Node.instanceSync` trigger to replace `NodeManager.resyncTrigger`. Whenever an instance is mutated due to IP pool maintenance, the trigger attempts to incrementally resynchronize the corresponding instance to the local cache. This is achieved through the newly introduced `InstanceSync` method of the `AllocationImplementation` interface.
While this feature is implemented for Alibabacloud, AWS and Azure still fall back to full resynchronization.
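Here is some time cost data from an Alibabacloud cluster during different periods:

| | full sync | InstanceSync |
| --- | --- | --- |
| time cost with ~8000 ENIs | ~20s | ~2s |
| time cost with ~13000 ENIs | ~35s | ~2s |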
Related: #25073
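
To illustrate why the per-instance path is cheaper, here is a toy sketch of the two strategies against a local cache keyed by instance ID. Everything in it (`eni`, `instanceAPI`, `cache`, `fakeAPI`, and the method names) is hypothetical and only mirrors the idea described above; it is not the Cilium or Alibabacloud API.

```go
package main

import "sync"

// eni and instanceAPI are hypothetical stand-ins for the cloud instance API.
type eni struct{ ID string }

type instanceAPI interface {
	ListAll() map[string][]eni               // every ENI, grouped by instance ID
	ListForInstance(instanceID string) []eni // ENIs of a single instance
}

// cache is a hypothetical local cache of ENIs keyed by instance ID.
type cache struct {
	mutex      sync.RWMutex
	byInstance map[string][]eni
}

// fullResync replaces the whole cache; its cost grows with the cluster size.
func (c *cache) fullResync(api instanceAPI) {
	all := api.ListAll()
	c.mutex.Lock()
	defer c.mutex.Unlock()
	c.byInstance = all
}

// instanceSync refreshes only the entry of the instance that changed.
func (c *cache) instanceSync(api instanceAPI, instanceID string) {
	enis := api.ListForInstance(instanceID)
	c.mutex.Lock()
	defer c.mutex.Unlock()
	if c.byInstance == nil {
		c.byInstance = make(map[string][]eni)
	}
	c.byInstance[instanceID] = enis
}

// fakeAPI is an in-memory API that counts how many ENIs were fetched.
type fakeAPI struct {
	data    map[string][]eni
	fetched int
}

func (f *fakeAPI) ListAll() map[string][]eni {
	out := make(map[string][]eni, len(f.data))
	for id, enis := range f.data {
		out[id] = append([]eni(nil), enis...) // copy so the cache owns its data
		f.fetched += len(enis)
	}
	return out
}

func (f *fakeAPI) ListForInstance(instanceID string) []eni {
	enis := f.data[instanceID]
	f.fetched += len(enis)
	return append([]eni(nil), enis...)
}
```

In this toy model a full resync fetches every ENI in the cluster, while an instance sync fetches only the ENIs attached to the one instance that triggered the event.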