hubble/relay: flush old flows when the buffer drain timeout is reached #13776
Merged
Conversation
rolinh added the release-note/bug (This PR fixes an issue in a previous release of Cilium.), sig/hubble (Impacts hubble server or relay), and needs-backport/1.8 labels on Oct 27, 2020
gandro requested changes on Oct 27, 2020
Looks good to me! One small unnecessary import in the unit tests.
Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Draining only 1 flow every time the buffer drain timeout is reached proves to be insufficient in certain cases. The worst case typically happens for a request with historic data, some filters, and follow mode. Using the Hubble CLI, this means requests like this one:

```
hubble observe --last=100 --follow --pod kube-system/coredns-f9fd979d6-66mfg
```

The behavior before this patch for a request like this, assuming 250 flows matching the request are stored in the Hubble instances' ring buffers and new flows matching the filter criteria are infrequent, would be to forward 150 flows, then drain 1 flow to forward every time the drain timeout is reached. This means that for a buffer size of 100, it would take at least 100 seconds (with the default buffer drain timeout of 1s) to drain the buffer (unless enough new flows are received in the meantime and fill the buffer again).

The new code drains all flows in the queue whose timestamp is older than `now-bufferDrainTimeout`. In other words, this would be the old behavior:

```
...
QUEUE FULL draining flow 149
QUEUE FULL draining flow 150
TIMEOUT REACHED DRAINED 1 flow
TIMEOUT REACHED DRAINED 1 flow
TIMEOUT REACHED DRAINED 1 flow
...
```

And this is the new behavior:

```
...
QUEUE FULL draining flow 149
QUEUE FULL draining flow 150
TIMEOUT REACHED DRAINED 100 flow
TIMEOUT REACHED DRAINED 0 flow
TIMEOUT REACHED DRAINED 0 flow
...
```

If new flows are received, but not in sufficient numbers to fill the buffer within the time window of `bufferDrainTimeout`, the behavior would look something like this:

```
...
QUEUE FULL draining flow 149
QUEUE FULL draining flow 150
TIMEOUT REACHED DRAINED 100 flow
TIMEOUT REACHED DRAINED 34 flow
TIMEOUT REACHED DRAINED 72 flow
...
```

In effect, this means that a time window of `bufferDrainTimeout` (defaulting to 1s) is used to sort flows before forwarding them to the client. This dramatically improves the query experience for requests in follow mode.

Fixes: #13775

Suggested-by: Tom Payne <tom@isovalent.com>
Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
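To make the new draining strategy concrete, here is a minimal Go sketch of the idea, assuming a queue kept sorted by flow timestamp. The `flow` type and `drainOlderThan` function are hypothetical illustrations, not Hubble Relay's actual API:

```go
package main

import (
	"fmt"
	"time"
)

// flow is a hypothetical stand-in for a Hubble flow; only the
// timestamp matters for the drain logic sketched here.
type flow struct {
	ts time.Time
}

// drainOlderThan pops every flow whose timestamp is older than the
// given deadline from the front of a timestamp-sorted queue and
// returns the drained flows along with the remaining queue. This
// mirrors the idea in the patch: on each drain-timeout tick, flush
// all flows older than now-bufferDrainTimeout instead of just one.
func drainOlderThan(queue []flow, deadline time.Time) (drained, remaining []flow) {
	i := 0
	for i < len(queue) && queue[i].ts.Before(deadline) {
		i++
	}
	return queue[:i], queue[i:]
}

func main() {
	const bufferDrainTimeout = time.Second
	now := time.Now()

	// Simulate a queue holding flows of various ages.
	queue := []flow{
		{ts: now.Add(-3 * time.Second)},
		{ts: now.Add(-2 * time.Second)},
		{ts: now.Add(-500 * time.Millisecond)}, // too recent to drain
	}

	drained, remaining := drainOlderThan(queue, now.Add(-bufferDrainTimeout))
	fmt.Printf("TIMEOUT REACHED DRAINED %d flows, %d remaining\n", len(drained), len(remaining))
}
```

Because flows newer than `now-bufferDrainTimeout` stay in the queue, each one-second window effectively acts as a sorting buffer before flows are forwarded to the client.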
rolinh force-pushed the pr/rolinh/hubble-relay-fix-buffer-drain branch from 2a9a06e to a5bb295 on October 27, 2020 13:03
gandro approved these changes on Oct 27, 2020
test-me-please
maintainer-s-little-helper bot added the ready-to-merge label (This PR has passed all tests and received consensus from code owners to merge.) on Oct 27, 2020
maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.9 in 1.9.0-rc3 on Oct 27, 2020
maintainer-s-little-helper bot moved this from Backport pending to v1.9 to Backport done to v1.9 in 1.9.0-rc3 on Oct 28, 2020
joestringer moved this from Needs backport from master to Backport pending to v1.8 in 1.8.6 on Nov 5, 2020