hubble/relay: flush old flows when the buffer drain timeout is reached #13776

Merged
merged 2 commits into master from pr/rolinh/hubble-relay-fix-buffer-drain on Oct 27, 2020

Conversation

@rolinh (Member) commented Oct 27, 2020

Draining only 1 flow every time the buffer drain timeout is reached
proves insufficient in certain cases. The worst case typically occurs
for a request that combines historic data, filters, and follow-mode.
With the Hubble CLI, this corresponds to a request such as:

    hubble observe --last=100 --follow --pod kube-system/coredns-f9fd979d6-66mfg

Before this patch, the behavior for a request like this, assuming 250
flows matching the request are stored in the Hubble instances' ring
buffers and new flows matching the filter criteria arrive
infrequently, was to forward 150 flows and then drain 1 flow every
time the drain timeout was reached. For a buffer size of 100, this
means it would take at least 100 seconds (with the default buffer
drain timeout of 1s) to drain the buffer, unless enough new flows
arrived in the meantime to fill the buffer again.
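
For illustration, here is a minimal Go sketch of the pre-patch drain
loop, assuming a ticker-driven select loop (the names flow, queue,
drainOld, and forward are hypothetical, not Relay's actual API): while
incoming flows keep the buffer full, each new flow pushes the oldest
one out, but once traffic slows, only one buffered flow is forwarded
per timeout tick.

    // Package sketch illustrates the pre-patch drain behavior; this is
    // a hedged sketch, not Relay's actual implementation.
    package sketch

    import "time"

    type flow struct{ ts time.Time }

    // queue stands in for Relay's time-ordered flow buffer.
    type queue interface {
        Push(flow)
        Pop() flow // remove and return the oldest flow
        Len() int
    }

    func drainOld(incoming <-chan flow, buf queue, bufferSize int,
        bufferDrainTimeout time.Duration, forward func(flow)) {
        for {
            select {
            case f, ok := <-incoming:
                if !ok {
                    return
                }
                buf.Push(f)
                if buf.Len() > bufferSize {
                    forward(buf.Pop()) // QUEUE FULL: forward the oldest flow
                }
            case <-time.After(bufferDrainTimeout):
                if buf.Len() > 0 {
                    forward(buf.Pop()) // TIMEOUT REACHED: drain exactly 1 flow
                }
            }
        }
    }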

The new code drains all flows in the queue whose timestamp is older
than `now-bufferDrainTimeout`. In other words, the old behavior looked
like this:

    ...
    QUEUE FULL draining flow 149
    QUEUE FULL draining flow 150
    TIMEOUT REACHED
    DRAINED 1 flow
    TIMEOUT REACHED
    DRAINED 1 flow
    TIMEOUT REACHED
    DRAINED 1 flow
    ...

And this is the new behavior:

    ...
    QUEUE FULL draining flow 149
    QUEUE FULL draining flow 150
    TIMEOUT REACHED
    DRAINED 100 flow
    TIMEOUT REACHED
    DRAINED 0 flow
    TIMEOUT REACHED
    DRAINED 0 flow
    ...

If new flows are received, but not in sufficient numbers to fill the
buffer within the `bufferDrainTimeout` window, the behavior looks
something like this:

    ...
    QUEUE FULL draining flow 149
    QUEUE FULL draining flow 150
    TIMEOUT REACHED
    DRAINED 100 flow
    TIMEOUT REACHED
    DRAINED 34 flow
    TIMEOUT REACHED
    DRAINED 72 flow
    ...

In effect, a time window of `bufferDrainTimeout` (defaulting to 1s) is
used to sort flows before forwarding them to the client. This
dramatically improves the query experience for requests in follow-mode.
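
Since the PR touches Relay's priority queue (pkg/hubble/relay/queue,
per the review comment below), the idea can be illustrated with a
min-heap keyed by timestamp. The self-contained Go sketch below is
hypothetical (flowHeap and drainOlderThan are made-up names, and it
stores bare timestamps instead of flow events): on a timeout tick, it
pops everything older than now-bufferDrainTimeout.

    // A hedged, self-contained sketch of the new flush logic; not
    // Relay's actual code.
    package main

    import (
        "container/heap"
        "fmt"
        "time"
    )

    // flowHeap is a min-heap of flow timestamps (hypothetical type).
    type flowHeap []time.Time

    func (h flowHeap) Len() int           { return len(h) }
    func (h flowHeap) Less(i, j int) bool { return h[i].Before(h[j]) }
    func (h flowHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }

    func (h *flowHeap) Push(x interface{}) { *h = append(*h, x.(time.Time)) }

    func (h *flowHeap) Pop() interface{} {
        old := *h
        x := old[len(old)-1]
        *h = old[:len(old)-1]
        return x
    }

    // drainOlderThan pops and counts every flow whose timestamp is
    // strictly older than t, mirroring the patched timeout branch.
    func drainOlderThan(h *flowHeap, t time.Time) int {
        drained := 0
        for h.Len() > 0 && (*h)[0].Before(t) {
            heap.Pop(h)
            drained++
        }
        return drained
    }

    func main() {
        h := &flowHeap{}
        now := time.Now()
        // Five buffered flows, 4s to 0s old.
        for i := 0; i < 5; i++ {
            heap.Push(h, now.Add(time.Duration(i-4)*time.Second))
        }
        // On a timeout tick, flush everything older than now-bufferDrainTimeout.
        bufferDrainTimeout := time.Second
        n := drainOlderThan(h, now.Add(-bufferDrainTimeout))
        fmt.Printf("DRAINED %d flow\n", n) // DRAINED 3 flow
    }

Because the heap always yields the oldest entry first, the flushed
flows leave the buffer in timestamp order, preserving the sorted
stream the client expects.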

Fixes: #13775

@rolinh added the release-note/bug, sig/hubble, and needs-backport/1.8 labels on Oct 27, 2020
@rolinh rolinh requested review from gandro and a team October 27, 2020 10:57
@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.10.0 Oct 27, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.9.0-rc3 Oct 27, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.8.5 Oct 27, 2020
@gandro (Member) left a comment

Looks good to me! One small unnecessary import in the unit tests.

pkg/hubble/relay/queue/priority_queue_test.go (outdated, resolved)
Suggested-by: Tom Payne <tom@isovalent.com>
Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
@rolinh force-pushed the pr/rolinh/hubble-relay-fix-buffer-drain branch from 2a9a06e to a5bb295 on October 27, 2020 13:03
@rolinh rolinh requested a review from gandro October 27, 2020 13:03
@rolinh (Member, Author) commented Oct 27, 2020

test-me-please

@maintainer-s-little-helper bot added the ready-to-merge label on Oct 27, 2020
@pchaigno pchaigno merged commit 7d56068 into master Oct 27, 2020
@pchaigno pchaigno deleted the pr/rolinh/hubble-relay-fix-buffer-drain branch October 27, 2020 18:12
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.9 in 1.9.0-rc3 Oct 27, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.9 to Backport done to v1.9 in 1.9.0-rc3 Oct 28, 2020
@christarazi christarazi removed this from Needs backport from master in 1.8.5 Oct 29, 2020
@christarazi christarazi added this to Needs backport from master in 1.8.6 Oct 29, 2020
@joestringer joestringer moved this from Needs backport from master to Backport pending to v1.8 in 1.8.6 Nov 5, 2020
@joestringer joestringer moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.6 Nov 7, 2020
Labels
ready-to-merge: This PR has passed all tests and received consensus from code owners to merge.
release-note/bug: This PR fixes an issue in a previous release of Cilium.
sig/hubble: Impacts hubble server or relay
Projects
1.8.6: Backport done to v1.8
1.9.0-rc3: Backport done to v1.9
Development

Successfully merging this pull request may close these issues.

Hubble Relay slow to respond in follow-mode
7 participants