Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ipset: Rework the reconciler to use batch ops #31638

Conversation

pippolo84
Copy link
Member

@pippolo84 pippolo84 commented Mar 27, 2024

Rework the ipset table to store a single address per entry. Consequently, rewrite the ipset reconciler to:

  • use batch operations and ipset restore
  • delete addresses only in DeleteBatch and Prune operations

Finally, introduce the concept of an ipset manager initializer, to delay the addresses pruning during agent restart, before the initial nodes listing from k8s or kvstore is completed.

Refer to the individual commit messages and the related issue for further details.

Fixes: #31537

@pippolo84 pippolo84 added the dont-merge/preview-only Only for preview or testing, don't merge it. label Mar 27, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 27, 2024
@pippolo84 pippolo84 force-pushed the pr/pippolo84/fix-ipset-deletion-at-restart branch 2 times, most recently from 8345e8c to e468cab Compare April 3, 2024 09:26
@pippolo84 pippolo84 added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Apr 3, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 3, 2024
@pippolo84 pippolo84 requested a review from giorio94 April 3, 2024 09:50
@pippolo84 pippolo84 force-pushed the pr/pippolo84/fix-ipset-deletion-at-restart branch 3 times, most recently from 4b7ded2 to 2feaa25 Compare April 4, 2024 12:26
@maintainer-s-little-helper
Copy link

Commit 2feaa25 does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Apr 4, 2024
@pippolo84 pippolo84 force-pushed the pr/pippolo84/fix-ipset-deletion-at-restart branch from 2feaa25 to e2a3c24 Compare April 4, 2024 12:29
@maintainer-s-little-helper
Copy link

Commit e2a3c24 does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@pippolo84 pippolo84 force-pushed the pr/pippolo84/fix-ipset-deletion-at-restart branch from e2a3c24 to c19d61d Compare April 4, 2024 15:11
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Apr 4, 2024
@pippolo84
Copy link
Member Author

A performance test done by @giorio94 against the previous implementation (based on a multiple addresses per table entry model) highlighted that the ipset list operation can become quite expensive in large multi-cluster setup in terms of memory allocation, leading to a high GC pressure. Here the related pprof:

pprof

The current implementation already addresses this partially, since moving the table to a single address per entry model simplifies the Update implementation, removing the need for relying on ipset list in favor of ipset create.
To be on the safe side, the last commit updates the implementation of the Delete operation too, so that the entire incremental reconciliation path will not use the ipset list anymore.

@pippolo84 pippolo84 marked this pull request as ready for review April 4, 2024 15:23
@pippolo84 pippolo84 requested review from a team as code owners April 4, 2024 15:23
@pippolo84 pippolo84 requested a review from joamaki April 4, 2024 15:23
@pippolo84

This comment was marked as outdated.

@pippolo84 pippolo84 added area/iptables Impacts how Cilium interacts with iptables. and removed dont-merge/preview-only Only for preview or testing, don't merge it. labels Apr 4, 2024
@pippolo84 pippolo84 force-pushed the pr/pippolo84/fix-ipset-deletion-at-restart branch from c19d61d to 95296f8 Compare April 4, 2024 16:12
@pippolo84
Copy link
Member Author

/test

@pippolo84 pippolo84 force-pushed the pr/pippolo84/fix-ipset-deletion-at-restart branch 3 times, most recently from 1a632cb to 869fef3 Compare April 17, 2024 12:43
@pippolo84 pippolo84 requested a review from a team as a code owner April 17, 2024 12:43
@pippolo84 pippolo84 requested a review from thorn3r April 17, 2024 12:43
@pippolo84
Copy link
Member Author

/test

@pippolo84 pippolo84 requested a review from giorio94 April 17, 2024 12:49
@pippolo84 pippolo84 force-pushed the pr/pippolo84/fix-ipset-deletion-at-restart branch 2 times, most recently from c8cc12e to cd30bea Compare April 17, 2024 15:26
@pippolo84
Copy link
Member Author

/test

@pippolo84 pippolo84 force-pushed the pr/pippolo84/fix-ipset-deletion-at-restart branch from cd30bea to 712cd4f Compare April 17, 2024 15:41
@pippolo84
Copy link
Member Author

/test

Copy link
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! A couple of minor comments, then I'll do the final pass.

daemon/cmd/status_test.go Outdated Show resolved Hide resolved
pkg/datapath/iptables/ipset/ipset.go Show resolved Hide resolved
pkg/clustermesh/ipsetsyncer.go Outdated Show resolved Hide resolved
@giorio94 giorio94 removed the request for review from thorn3r April 18, 2024 14:21
pippolo84 and others added 8 commits April 18, 2024 19:07
Stop the rate limiter goroutine during reconciler termination if rate
limiting was enabled in the configuration.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Rework the ipset table to store a single address per entry. This allows
to rewrite the reconciler in a more canonical way, instead of mixing
address insertions and removals in the Update operation.

In the current setup, the Update and Delete respectively add and remove
a single IP address in the related set.  The Prune operation performs a
full reconciliation for both IPv4 and IPv6 sets.

Since each table entry now stores a single address instead of an entire
set, the usage of immutable sets is replaced by a simpler netip.Addr
field.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
The current ipset reconciler logic might remove ipset entries too
aggressively at startup, when the Cilium managed IP sets still contain
entries related to a previous agent run.  This may happen if a full
reconciliation is performed before the initial nodes listing (either
from k8s or kvstore) is complete.  This in turn may result in temporary
traffic disruption until the IP sets have been fully reloaded following
the completion of the initial nodes listing.

To avoid depending from k8s or the kvstore in the ipset cell to
explicitly wait for the listing, we rely on the handler pattern: the
ipset manager exposes a way to grab a handler (namely, an initializer)
that blocks the Prune operation until the initialization is explicitly
marked as completed.  The ipset manager consumer is in charge of
signalling the completion of the initialization and the start of the
full reconciliation.

The ipset manager waits on the initializers and as soon as all have been
marked as done triggers a full reconciliation to immediately perform a
Prune operation.

Since the Prune operation is always executed as soon as the initializers
have been marked as done, the full reconciliation interval is increased
to 30 minutes.

In a subsequent commit an ipset initializer is used by the node manager
to signal the possibility to safely perform address pruning.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Add a unit test to check that the ipset reconciler skips the Prune
operation up until all initializers have been marked as done.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
The Cilium managed IP sets contain the node addresses for which we do
not SNAT the outgoing traffic.  During an agent restart, while we are
waiting for the initial nodes listing to complete, we don't want to
remove addresses too early, because those addresses may still be related
to existing nodes not yet seen in the in-progress listing.

In order to avoid temporary pod traffic disruption, the node manager
takes an ipset initializer reference and mark it as done only after the
initial listing, from either k8s or kvstore, is complete.
This in turn triggers a full IP sets reconciliation, to safely remove
stale addresses that are still in the sets.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
When clustermesh is enabled, the nodes manager receives events about all
the remote cluster nodes. Those events are handled just as nodes events
coming from the local cluster, so the remote nodes IPs are added to the
kernel IP sets.

To avoid pruning remote nodes addresses in kernel IP sets too early
after an agent restart, clustermesh module takes an initializer handler
to the ipset manager, delayiing the stale addresses deletion until the
listing of all the remote clusters nodes is completed.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Performing an ipset list operation at each reconciler update can become
expensive in terms of memory, because it requires the allocation of a
set of addresses each time it is called. In a very large cluster or in a
large multi-cluster setup, the cardinality of this list can be pretty
high. When the cluster is experiencing node churn, and the
ipset.RemoveFromIPSet is called many times, this can lead to a very high
GC pressure, because we rely on `ipset list` in the reconciler Delete
operation to be sure that the IP set already exists.

To avoid that, reconcile the ipset changes using "ipset restore" in
order to add multiple IPs in one go and avoid the overhead of executing
ipset many times.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
If the incremental round found no work to do we don't need to
acquire the WriteTxn. This can happen when objects are updated
but are not marked pending (e.g. changes are being made that
do not need to be reconciled).

Signed-off-by: Jussi Maki <jussi@isovalent.com>
@pippolo84 pippolo84 force-pushed the pr/pippolo84/fix-ipset-deletion-at-restart branch from 712cd4f to fe16690 Compare April 18, 2024 17:07
@pippolo84 pippolo84 requested a review from giorio94 April 18, 2024 18:48
Copy link
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@giorio94
Copy link
Member

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 19, 2024
@julianwiedmann julianwiedmann added this pull request to the merge queue Apr 19, 2024
Merged via the queue into cilium:main with commit b3ffa77 Apr 19, 2024
62 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/iptables Impacts how Cilium interacts with iptables. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ipset entries transitory drop on agent restart due to the new reconcile logic
4 participants