ipset: Rework the reconciler to use batch ops #31638
Commit 2feaa25 does not match "(?m)^Signed-off-by:". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin
Commit e2a3c24 does not match "(?m)^Signed-off-by:". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin
A performance test done by @giorio94 against the previous implementation (based on a multiple addresses per table entry model) highlighted that the The current implementation already addresses this partially, since moving the table to a single address per entry model simplifies the |
/test
/test
/test
/test
Thanks! A couple of minor comments, then I'll do the final pass.
Stop the rate limiter goroutine during reconciler termination if rate limiting was enabled in the configuration. Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Rework the ipset table to store a single address per entry. This allows rewriting the reconciler in a more canonical way, instead of mixing address insertions and removals in the Update operation. In the current setup, the Update and Delete operations respectively add and remove a single IP address in the related set. The Prune operation performs a full reconciliation for both IPv4 and IPv6 sets. Since each table entry now stores a single address instead of an entire set, the usage of immutable sets is replaced by a simpler netip.Addr field. Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
The current ipset reconciler logic might remove ipset entries too aggressively at startup, when the Cilium managed IP sets still contain entries related to a previous agent run. This may happen if a full reconciliation is performed before the initial nodes listing (either from k8s or kvstore) is complete. This in turn may result in temporary traffic disruption until the IP sets have been fully reloaded following the completion of the initial nodes listing. To avoid depending on k8s or the kvstore in the ipset cell to explicitly wait for the listing, we rely on the handler pattern: the ipset manager exposes a way to grab a handler (namely, an initializer) that blocks the Prune operation until the initialization is explicitly marked as completed. The ipset manager consumer is in charge of signalling the completion of the initialization and the start of the full reconciliation. The ipset manager waits on the initializers and, as soon as all have been marked as done, triggers a full reconciliation to immediately perform a Prune operation. Since the Prune operation is always executed as soon as the initializers have been marked as done, the full reconciliation interval is increased to 30 minutes. In a subsequent commit an ipset initializer is used by the node manager to signal that address pruning can be performed safely. Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
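The initializer pattern described above can be sketched roughly as follows. This is an illustration under assumptions, not the manager from this PR: the `Manager` type and the `NewInitializer`, `Start`, `Ready` and `Prune` names are hypothetical, and `Prune` only reports whether pruning would be allowed instead of touching any IP set.

```go
package main

import (
	"fmt"
	"sync"
)

// Manager is a hypothetical stand-in for the ipset manager.
type Manager struct {
	wg    sync.WaitGroup
	ready chan struct{} // closed once every initializer is done
}

func NewManager() *Manager { return &Manager{ready: make(chan struct{})} }

// NewInitializer registers a pending initializer and returns the
// function its owner calls to mark it as done. Calling the returned
// function more than once is harmless.
func (m *Manager) NewInitializer() (done func()) {
	m.wg.Add(1)
	var once sync.Once
	return func() { once.Do(m.wg.Done) }
}

// Start must be called after all initializers have been registered.
// It waits for them in the background and then unblocks pruning.
func (m *Manager) Start() {
	go func() {
		m.wg.Wait()
		close(m.ready) // a real manager would trigger a full reconciliation here
	}()
}

// Ready is closed when all initializers have been marked as done.
func (m *Manager) Ready() <-chan struct{} { return m.ready }

// Prune reports whether pruning is allowed; it is skipped while any
// initializer is still pending.
func (m *Manager) Prune() bool {
	select {
	case <-m.ready:
		return true
	default:
		return false
	}
}

func main() {
	m := NewManager()
	nodesDone := m.NewInitializer()
	meshDone := m.NewInitializer()
	m.Start()

	fmt.Println("prune allowed:", m.Prune())
	nodesDone()
	meshDone()
	<-m.Ready()
	fmt.Println("prune allowed:", m.Prune())
}
```

In this sketch each consumer (for example the node manager) holds its own `done` function, so the manager never needs to know where the initial listing comes from.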
Add a unit test to check that the ipset reconciler skips the Prune operation up until all initializers have been marked as done. Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
The Cilium managed IP sets contain the node addresses for which we do not SNAT the outgoing traffic. During an agent restart, while we are waiting for the initial nodes listing to complete, we don't want to remove addresses too early, because those addresses may still be related to existing nodes not yet seen in the in-progress listing. In order to avoid temporary pod traffic disruption, the node manager takes an ipset initializer reference and marks it as done only after the initial listing, from either k8s or kvstore, is complete. This in turn triggers a full IP sets reconciliation, to safely remove stale addresses that are still in the sets. Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
When clustermesh is enabled, the nodes manager receives events about all the remote cluster nodes. Those events are handled just like node events coming from the local cluster, so the remote node IPs are added to the kernel IP sets. To avoid pruning remote node addresses in kernel IP sets too early after an agent restart, the clustermesh module takes an initializer handler to the ipset manager, delaying the deletion of stale addresses until the listing of all the remote cluster nodes is completed. Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Performing an ipset list operation at each reconciler update can become expensive in terms of memory, because it requires the allocation of a set of addresses each time it is called. In a very large cluster or in a large multi-cluster setup, the cardinality of this list can be pretty high. When the cluster is experiencing node churn and ipset.RemoveFromIPSet is called many times, this can lead to very high GC pressure, because we rely on `ipset list` in the reconciler Delete operation to be sure that the IP set already exists. To avoid that, reconcile the ipset changes using `ipset restore` in order to add multiple IPs in one go and avoid the overhead of executing ipset many times. Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
If the incremental round found no work to do, we don't need to acquire the WriteTxn. This can happen when objects are updated but are not marked pending (e.g. changes are being made that do not need to be reconciled). Signed-off-by: Jussi Maki <jussi@isovalent.com>
Thanks!
/test
Rework the ipset table to store a single address per entry. Consequently, rewrite the ipset reconciler to apply changes in batches using `ipset restore`.
Finally, introduce the concept of an ipset manager initializer, to delay address pruning during an agent restart until the initial nodes listing from k8s or kvstore is completed.
Refer to the individual commit messages and the related issue for further details.
Fixes: #31537