hubble: Never fail with ErrInvalidRead #17046
in addition to unit tests, manually validated the fix with
Thanks for doing this! I left an inline comment
Currently GetFlows() fails with the following error when a position in the ring buffer being read by Ring.read() has been overwritten:

    requested data has been overwritten and is no longer available

This turned out to be impractical as it makes it difficult to read all the flows in the ring buffer (e.g., hubble observe --all). GetFlows() would fail if Hubble observes a single flow between the reader rewinding to the oldest position and retrieving the entry.

This patch modifies Ring.read() so that GetFlows() returns LostEvent instead of stopping with an error. The caller of GetFlows() can then decide how to handle LostEvent.

Note that this makes the behavior of Ring.read() consistent with that of Ring.readFrom() used in the follow mode. It generates LostEvent and continues following instead of failing with ErrInvalidRead.

Fixes: cilium#17036

Signed-off-by: Michi Mutsuzaki <michi@isovalent.com>
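The race described above can be illustrated with a toy ring buffer. This is only a sketch under stated assumptions, not Hubble's actual implementation: the `Event` and `Ring` types, their fields, and the `Write`, `Oldest`, and `Read` methods are all hypothetical names invented for this example. It shows a reader that rewinds to the oldest position, loses that slot to a single new write, and receives a lost-event placeholder instead of an error.

```go
package main

import "fmt"

// Event is a hypothetical stand-in for a Hubble event. Lost marks a
// placeholder returned when the requested slot has been overwritten.
type Event struct {
	Seq  uint64
	Lost bool
}

// Ring is a toy fixed-size ring buffer (an assumption for illustration,
// not the real Hubble ring).
type Ring struct {
	slots []Event
	next  uint64 // sequence number the next write will receive
}

func NewRing(size int) *Ring { return &Ring{slots: make([]Event, size)} }

// Write stores a new event, overwriting the oldest slot once the ring is full.
func (r *Ring) Write() {
	r.slots[r.next%uint64(len(r.slots))] = Event{Seq: r.next}
	r.next++
}

// Oldest returns the oldest sequence number still held in the ring.
func (r *Ring) Oldest() uint64 {
	n := uint64(len(r.slots))
	if r.next <= n {
		return 0
	}
	return r.next - n
}

// Read returns the event at seq. ok is false when seq has not been written
// yet. An overwritten position yields a Lost placeholder rather than an
// error, so the caller can keep reading.
func (r *Ring) Read(seq uint64) (ev Event, ok bool) {
	if seq >= r.next {
		return Event{}, false
	}
	if seq < r.Oldest() {
		return Event{Seq: seq, Lost: true}, true
	}
	return r.slots[seq%uint64(len(r.slots))], true
}

func main() {
	r := NewRing(4)
	for i := 0; i < 4; i++ {
		r.Write()
	}
	start := r.Oldest() // the reader rewinds to the oldest position
	r.Write()           // a single new flow overwrites that position

	for seq := start; ; seq++ {
		ev, ok := r.Read(seq)
		if !ok {
			break // caught up with the writer
		}
		fmt.Println(ev.Seq, ev.Lost)
	}
}
```

In this sketch the first read reports a lost event and the remaining four reads succeed, which mirrors the idea of letting the caller decide what to do with a lost entry rather than aborting the whole request.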
cfedde5 to 79be7fc
Nice work! Should we mark this for backport too?
test-me-please
restarting these 2: aks: https://github.com/cilium/cilium/actions/runs/1101961941
ci-aks
ci-eks
https://github.com/cilium/cilium/actions/runs/1102872491 legitimate failures 😐
Looking at the failure, it is not related to the PR, it's a connectivity problem, not a problem with flow visibility (i.e. Hubble). Given that more and more users are hitting the error that this PR is addressing, I'll mark this ready-to-merge.
Currently, hubble flows are retrieved during sysdump collection by passing the `--follow` parameter to hubble observe. According to the comment, this appeared to be a necessary hack to prevent the "requested data has been overwritten and is no longer available" error. Yet, the consequence is that the hubble observe command becomes blocking, and we rely solely on the specified timeout for its termination. When capturing a sysdump, though, we are interested in storing as many flows as possible from before that moment (e.g., to investigate the causes of a connectivity test failure), not the ones occurring during the collection of the sysdump itself.

Given that the original reason for using the `--follow` parameter was fixed quite some time ago [1] and the fix is included in every Cilium version supported today, let's just get rid of it. The side effects include the early termination of the collection process as soon as all the flows have been retrieved, as well as a reduction in the size of the sysdumps when increasing the timeout period, given that we no longer block until its expiration (this is especially relevant in CI tests, as the sysdumps are currently too large to be uploaded on GH). Nonetheless, the timeout parameter is preserved to interrupt the retrieval if it takes too long.

[1]: cilium/cilium#17046

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Currently GetFlows() fails with the following error when a position in
the ring buffer being read by Ring.read() has been overwritten:
    requested data has been overwritten and is no longer available
This turned out to be impractical as it makes it difficult to read all
the flows in the ring buffer (e.g., hubble observe --all). GetFlows()
would fail if Hubble observes a single flow between the reader rewinding
to the oldest position and retrieving the entry.
This patch modifies Ring.read() so that GetFlows() returns LostEvent
instead of stopping with an error. The caller of GetFlows() can then
decide how to handle LostEvent.
Note that this makes the behavior of Ring.read() consistent with that
of Ring.readFrom() used in the follow mode. It generates LostEvent and
continues following instead of failing with ErrInvalidRead.
Fixes: #17036
Signed-off-by: Michi Mutsuzaki <michi@isovalent.com>