hubble/relay: flush old flows when the buffer drain timeout is reached #13776

Merged
merged 2 commits into master from pr/rolinh/hubble-relay-fix-buffer-drain on Oct 27, 2020

Conversation

@rolinh (Member) commented Oct 27, 2020

Draining only 1 flow every time the buffer drain timeout is reached
proves insufficient in certain cases. The worst case typically occurs
for a request that combines historic data, filters, and follow-mode.
With the Hubble CLI, this corresponds to a request such as:

    hubble observe --last=100 --follow --pod kube-system/coredns-f9fd979d6-66mfg

Before this patch, the behavior for a request like this, assuming 250
flows matching the request are stored in the Hubble instances' ring
buffers and new flows matching the filter criteria arrive
infrequently, was to forward 150 flows and then drain 1 flow every
time the drain timeout was reached. For a buffer size of 100, this
means it would take at least 100 seconds (with the default buffer
drain timeout of 1s) to drain the buffer, unless enough new flows
arrived in the meantime to fill the buffer again.
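
For illustration, here is a minimal Go sketch of the pre-patch drain
loop, assuming a ticker-driven select loop (the names flow, queue,
drainOld, and forward are hypothetical, not Relay's actual API): while
incoming flows keep the buffer full, each new flow pushes the oldest
one out, but once traffic slows, only one buffered flow is forwarded
per timeout tick.

    // Package sketch illustrates the pre-patch drain behavior; this is
    // a hedged sketch, not Relay's actual implementation.
    package sketch

    import "time"

    type flow struct{ ts time.Time }

    // queue stands in for Relay's time-ordered flow buffer.
    type queue interface {
        Push(flow)
        Pop() flow // remove and return the oldest flow
        Len() int
    }

    func drainOld(incoming <-chan flow, buf queue, bufferSize int,
        bufferDrainTimeout time.Duration, forward func(flow)) {
        for {
            select {
            case f, ok := <-incoming:
                if !ok {
                    return
                }
                buf.Push(f)
                if buf.Len() > bufferSize {
                    forward(buf.Pop()) // QUEUE FULL: forward the oldest flow
                }
            case <-time.After(bufferDrainTimeout):
                if buf.Len() > 0 {
                    forward(buf.Pop()) // TIMEOUT REACHED: drain exactly 1 flow
                }
            }
        }
    }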

The new code drains all flows in the queue whose timestamp is older
than `now-bufferDrainTimeout`. In other words, the old behavior looked
like this:

    ...
    QUEUE FULL draining flow 149
    QUEUE FULL draining flow 150
    TIMEOUT REACHED
    DRAINED 1 flow
    TIMEOUT REACHED
    DRAINED 1 flow
    TIMEOUT REACHED
    DRAINED 1 flow
    ...

And this is the new behavior:

    ...
    QUEUE FULL draining flow 149
    QUEUE FULL draining flow 150
    TIMEOUT REACHED
    DRAINED 100 flow
    TIMEOUT REACHED
    DRAINED 0 flow
    TIMEOUT REACHED
    DRAINED 0 flow
    ...

If new flows are received, but not in sufficient numbers to fill the
buffer within the `bufferDrainTimeout` window, the behavior looks
something like this:

    ...
    QUEUE FULL draining flow 149
    QUEUE FULL draining flow 150
    TIMEOUT REACHED
    DRAINED 100 flow
    TIMEOUT REACHED
    DRAINED 34 flow
    TIMEOUT REACHED
    DRAINED 72 flow
    ...

In effect, a time window of `bufferDrainTimeout` (defaulting to 1s) is
used to sort flows before forwarding them to the client. This
dramatically improves the query experience for requests in follow-mode.
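
Since the PR touches Relay's priority queue (pkg/hubble/relay/queue,
per the review comment below), the idea can be illustrated with a
min-heap keyed by timestamp. The self-contained Go sketch below is
hypothetical (flowHeap and drainOlderThan are made-up names, and it
stores bare timestamps instead of flow events): on a timeout tick, it
pops everything older than now-bufferDrainTimeout.

    // A hedged, self-contained sketch of the new flush logic; not
    // Relay's actual code.
    package main

    import (
        "container/heap"
        "fmt"
        "time"
    )

    // flowHeap is a min-heap of flow timestamps (hypothetical type).
    type flowHeap []time.Time

    func (h flowHeap) Len() int           { return len(h) }
    func (h flowHeap) Less(i, j int) bool { return h[i].Before(h[j]) }
    func (h flowHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }

    func (h *flowHeap) Push(x interface{}) { *h = append(*h, x.(time.Time)) }

    func (h *flowHeap) Pop() interface{} {
        old := *h
        x := old[len(old)-1]
        *h = old[:len(old)-1]
        return x
    }

    // drainOlderThan pops and counts every flow whose timestamp is
    // strictly older than t, mirroring the patched timeout branch.
    func drainOlderThan(h *flowHeap, t time.Time) int {
        drained := 0
        for h.Len() > 0 && (*h)[0].Before(t) {
            heap.Pop(h)
            drained++
        }
        return drained
    }

    func main() {
        h := &flowHeap{}
        now := time.Now()
        // Five buffered flows, 4s to 0s old.
        for i := 0; i < 5; i++ {
            heap.Push(h, now.Add(time.Duration(i-4)*time.Second))
        }
        // On a timeout tick, flush everything older than now-bufferDrainTimeout.
        bufferDrainTimeout := time.Second
        n := drainOlderThan(h, now.Add(-bufferDrainTimeout))
        fmt.Printf("DRAINED %d flow\n", n) // DRAINED 3 flow
    }

Because the heap always yields the oldest entry first, the flushed
flows leave the buffer in timestamp order, preserving the sorted
stream the client expects.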

Fixes: #13775

@rolinh added the release-note/bug, sig/hubble, and needs-backport/1.8 labels on Oct 27, 2020
@rolinh rolinh requested review from gandro and a team October 27, 2020 10:57
@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.10.0 Oct 27, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.9.0-rc3 Oct 27, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.8.5 Oct 27, 2020
@gandro (Member) left a comment

Looks good to me! One small unnecessary import in the unit tests.

pkg/hubble/relay/queue/priority_queue_test.go (outdated, resolved)
Suggested-by: Tom Payne <tom@isovalent.com>
Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
@rolinh force-pushed the pr/rolinh/hubble-relay-fix-buffer-drain branch from 2a9a06e to a5bb295 on October 27, 2020 13:03
@rolinh rolinh requested a review from gandro October 27, 2020 13:03
@rolinh (Member, Author) commented Oct 27, 2020

test-me-please

@maintainer-s-little-helper bot added the ready-to-merge label on Oct 27, 2020
@pchaigno pchaigno merged commit 7d56068 into master Oct 27, 2020
@pchaigno pchaigno deleted the pr/rolinh/hubble-relay-fix-buffer-drain branch October 27, 2020 18:12
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.9 in 1.9.0-rc3 Oct 27, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.9 to Backport done to v1.9 in 1.9.0-rc3 Oct 28, 2020
@christarazi christarazi removed this from Needs backport from master in 1.8.5 Oct 29, 2020
@christarazi christarazi added this to Needs backport from master in 1.8.6 Oct 29, 2020
@joestringer joestringer moved this from Needs backport from master to Backport pending to v1.8 in 1.8.6 Nov 5, 2020
@joestringer joestringer moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.6 Nov 7, 2020
Labels
ready-to-merge: This PR has passed all tests and received consensus from code owners to merge.
release-note/bug: This PR fixes an issue in a previous release of Cilium.
sig/hubble: Impacts hubble server or relay
Projects
1.8.6: Backport done to v1.8
1.9.0-rc3: Backport done to v1.9
Development

Successfully merging this pull request may close these issues.

Hubble Relay slow to respond in follow-mode
7 participants