hubble/peer: handle burst of change notifications #12024
Conversation
test-me-please
Looks good! One optional nit.
There was also an offline discussion about maybe using a CondVar in the implementation instead of the notify channel trick, but I want to run some benchmarks first before changing things prematurely (especially since this is not on any critical path).
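For readers unfamiliar with the pattern, the "notify channel trick" referenced here is typically a non-blocking send on a 1-buffered channel that coalesces wake-ups. The sketch below contrasts that with a `sync.Cond`-based variant; it is purely illustrative and not the code in this PR.

```go
package sketch

import "sync"

// notifier wakes a single waiting goroutine using a 1-buffered channel.
// A non-blocking send coalesces multiple notifications into one wake-up.
type notifier struct {
	ch chan struct{}
}

func newNotifier() *notifier {
	return &notifier{ch: make(chan struct{}, 1)}
}

func (n *notifier) notify() {
	select {
	case n.ch <- struct{}{}: // queue a wake-up if none is pending
	default: // a wake-up is already queued; drop this one
	}
}

func (n *notifier) wait() {
	<-n.ch
}

// condNotifier is the sync.Cond-based alternative being discussed.
type condNotifier struct {
	mu      sync.Mutex
	cond    *sync.Cond
	pending bool
}

func newCondNotifier() *condNotifier {
	c := &condNotifier{}
	c.cond = sync.NewCond(&c.mu)
	return c
}

func (c *condNotifier) notify() {
	c.mu.Lock()
	c.pending = true
	c.mu.Unlock()
	c.cond.Signal()
}

func (c *condNotifier) wait() {
	c.mu.Lock()
	for !c.pending {
		c.cond.Wait()
	}
	c.pending = false
	c.mu.Unlock()
}
```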
Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
f1abeb0 to e27ee36 (compare)
test-me-please
LGTM, a couple of questions.
- Will there be a corresponding change in the command line option / Helm template?
- I'm always curious to see the actual difference in resource usage whenever we introduce an optimization.
This option has never been made configurable.
Good question! The memory allocation for a channel of this type and size would have been on the order of ~1M (verified using […])
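The exact verification method is cut off above, but as a rough illustration of where a figure like that comes from: a buffered channel's backing array costs roughly capacity × element size. The element type below is a hypothetical stand-in, not the actual type used by the peer service, so the printed number is only an estimate.

```go
package main

import (
	"fmt"
	"unsafe"
)

// changeNotification is a stand-in for the element type actually buffered;
// the real type used by the peer service may differ.
type changeNotification struct {
	name    string
	address string
	typ     int32
}

func main() {
	const capacity = 65_536

	// A buffered channel allocates its backing array up front, so its cost
	// is roughly capacity * element size plus a small fixed header.
	elemSize := unsafe.Sizeof(changeNotification{})
	fmt.Printf("element size: %d bytes\n", elemSize)
	fmt.Printf("estimated channel buffer: ~%d KiB\n", capacity*elemSize/1024)
}
```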
K8s-1.17-Kernel-4.19 failure is a known flake (IPv4 fragments).
When a client calls Notify, a change notification is sent for every node
in the cluster which allows the client to discover all nodes in the
cluster. Subsequently, only changes are notified (when a node is updated
or deleted or a new node is added to the cluster).
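For illustration, a client consuming this stream could look roughly like the sketch below. The import path, socket address, and message/field names are assumed from memory of the Cilium peer API and may not match the generated code exactly.

```go
package main

import (
	"context"
	"fmt"
	"log"

	peerpb "github.com/cilium/cilium/api/v1/peer" // assumed import path
	"google.golang.org/grpc"
)

func main() {
	// NOTE: the target address is a placeholder; adjust to where the peer
	// service is actually exposed in your deployment.
	conn, err := grpc.Dial("unix:///var/run/cilium/hubble.sock", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := peerpb.NewPeerClient(conn)
	stream, err := client.Notify(context.Background(), &peerpb.NotifyRequest{})
	if err != nil {
		log.Fatal(err)
	}

	// The first messages describe every node already in the cluster; after
	// that, only incremental changes (added/updated/deleted) are streamed.
	for {
		cn, err := stream.Recv()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%v peer=%s address=%s\n", cn.GetType(), cn.GetName(), cn.GetAddress())
	}
}
```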
With a sufficiently large cluster, the initial burst of notifications
can be massive, and the previous implementation did not account for
this: it used a default buffer of size 10, assuming that would be large
enough to serialize notifications and send them to the client before
new notifications were pushed by the node manager. It appears that the
node manager iterates over a hash map while holding a lock when it
calls the notification handler, and this process is much faster than
serializing and sending these notifications over to the remote client
via gRPC.
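To make that producer/consumer imbalance concrete, here is a simplified sketch of the call pattern described above; types and names are illustrative, not the node manager's actual code.

```go
package sketch

import "sync"

// Node and NodeHandler stand in for the real node manager types.
type Node struct{ Name string }

type NodeHandler interface {
	NodeAdd(n Node)
}

type manager struct {
	mu    sync.Mutex
	nodes map[string]Node
}

// notifyAll invokes the handler synchronously for every known node while the
// manager's lock is held. With a large cluster, this loop finishes far faster
// than the gRPC consumer can serialize and send the notifications, which is
// why a small fixed-size channel buffer overflows.
func (m *manager) notifyAll(h NodeHandler) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, n := range m.nodes {
		h.NodeAdd(n)
	}
}
```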
However, as the initial implementation used a buffered channel,
increasing the buffer size to a very large value would dramatically
increase memory usage, most of it unused in the common case (in Go, a
buffered channel allocates up front enough memory to hold every element
it is able to buffer).
This PR fixes the problem by using a concurrency-safe buffer
implementation backed by a slice that is only allowed to grow to a
predefined maximum size. This size defaults to 65_536 entries, which
should be enough to accommodate bursts of notifications for large
clusters in cluster-mesh.
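As a rough idea of the shape of such a buffer (names, signatures, and details are illustrative and differ from the actual implementation in this PR), a slice-backed, mutex-protected FIFO with a maximum size could look like this:

```go
package sketch

import (
	"errors"
	"sync"
)

// ErrFull is returned when the buffer has reached its maximum size.
var ErrFull = errors.New("buffer full")

// buffer is a concurrency-safe FIFO backed by a slice. Unlike a buffered
// channel, it only allocates as it grows, and growth is capped at maxSize.
// It assumes a single reader.
type buffer struct {
	mu      sync.Mutex
	data    []interface{}
	maxSize int
	notify  chan struct{} // wakes the waiting reader
}

func newBuffer(maxSize int) *buffer {
	return &buffer{
		maxSize: maxSize,
		notify:  make(chan struct{}, 1),
	}
}

// Push appends an element, failing once maxSize is reached so that a slow
// consumer cannot make memory usage grow without bound.
func (b *buffer) Push(v interface{}) error {
	b.mu.Lock()
	if len(b.data) >= b.maxSize {
		b.mu.Unlock()
		return ErrFull
	}
	b.data = append(b.data, v)
	b.mu.Unlock()

	// Non-blocking wake-up; redundant notifications are coalesced.
	select {
	case b.notify <- struct{}{}:
	default:
	}
	return nil
}

// Pop blocks until at least one element is available and returns the oldest.
func (b *buffer) Pop() interface{} {
	for {
		b.mu.Lock()
		if len(b.data) > 0 {
			v := b.data[0]
			b.data = b.data[1:]
			b.mu.Unlock()
			return v
		}
		b.mu.Unlock()
		<-b.notify // wait for a producer to push
	}
}
```

Compared to a buffered channel, a structure like this only pays for the entries actually queued, while the maxSize cap still bounds worst-case memory usage.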