Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hubble: Never fail with ErrInvalidRead #17046

Merged
merged 1 commit into from Aug 18, 2021

Conversation

michi-covalent
Copy link
Contributor

@michi-covalent michi-covalent commented Aug 3, 2021

Currently GetFlows() fails with the following error when a position in
the ring buffer being read by Ring.read() has been overwritten:

requested data has been overwritten and is no longer available

This turned out to be impractical as it makes it difficult to read all
the flows in the ring buffer (e.g.. hubble observe --all). GetFlows()
would fail if Hubble observes a single flow between the reader rewinding
to the oldest position and retrieving the entry.

This patch modifies Ring.read() so that GetFlows() returns LostEvent
instead of stopping with an error. The caller of GetFlows() can then
decide how to handle LostEvent.

Note that this makes the behavior of Ring.read() consistent with that
of Ring.readFrom() used in the follow mode. It generates LostEvent and
continues following instead of failing with ErrInvalidRead.

Fixes: #17036

Signed-off-by: Michi Mutsuzaki michi@isovalent.com

@michi-covalent michi-covalent requested a review from a team August 3, 2021 23:50
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 3, 2021
@michi-covalent michi-covalent added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Aug 3, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 3, 2021
@michi-covalent
Copy link
Contributor Author

in addition to unit tests, manually validated the fix with hubble observe --all --debug:

# hubble observe --all --debug | head
time="2021-08-03T23:49:59Z" level=debug msg="Sending GetFlows request" request="number:18446744073709551615"
unknown response type: &{LostEvents:source:HUBBLE_RING_BUFFER  num_events_lost:1}
Aug  3 23:47:12.074: default/pod-to-a-9dc6d768c-rc7nc:49527 <- kube-system/kube-dns-6c7b8dc9f9-x962b:53 to-endpoint FORWARDED (UDP)
Aug  3 23:47:12.074: default/pod-to-a-9dc6d768c-rc7nc:34082 -> default/echo-a-56f9849b6b-jlbpr:8080 to-stack FORWARDED (TCP Flags: SYN)
Aug  3 23:47:12.074: default/pod-to-a-9dc6d768c-rc7nc:34082 -> default/echo-a-56f9849b6b-jlbpr:8080 to-endpoint FORWARDED (TCP Flags: SYN)
Aug  3 23:47:12.074: default/pod-to-a-9dc6d768c-rc7nc:34082 <- default/echo-a-56f9849b6b-jlbpr:8080 to-stack FORWARDED (TCP Flags: SYN, ACK)
Aug  3 23:47:12.074: default/pod-to-a-9dc6d768c-rc7nc:34082 <- default/echo-a-56f9849b6b-jlbpr:8080 to-endpoint FORWARDED (TCP Flags: SYN, ACK)
Aug  3 23:47:12.074: default/pod-to-a-9dc6d768c-rc7nc:34082 -> default/echo-a-56f9849b6b-jlbpr:8080 to-stack FORWARDED (TCP Flags: ACK)
Aug  3 23:47:12.074: default/pod-to-a-9dc6d768c-rc7nc:34082 -> default/echo-a-56f9849b6b-jlbpr:8080 to-endpoint FORWARDED (TCP Flags: ACK)
Aug  3 23:47:12.074: default/pod-to-a-9dc6d768c-rc7nc:34082 -> default/echo-a-56f9849b6b-jlbpr:8080 to-stack FORWARDED (TCP Flags: ACK, PSH)
Aug  3 23:47:12.074: default/pod-to-a-9dc6d768c-rc7nc:34082 -> default/echo-a-56f9849b6b-jlbpr:8080 to-endpoint FORWARDED (TCP Flags: ACK, PSH)

Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! I left an inline comment

pkg/hubble/observer/local_observer.go Show resolved Hide resolved
Currently GetFlows() fails with the following error when a position in
the ring buffer being read by Ring.read() has been overwritten:

    requested data has been overwritten and is no longer available

This turned out to be impractical as it makes it difficult to read all
the flows in the ring buffer (e.g.. hubble observe --all). GetFlows()
would fail if Hubble observes a single flow between the reader rewinding
to the oldest position and retrieving the entry.

This patch modifies Ring.read() so that GetFlows() returns LostEvent
instead of stopping with an error. The caller of GetFlows() can then
decide how to handle LostEvent.

Note that this makes the behavior of Ring.read() consistent with that
of Ring.readFrom() used in the follow mode. It generates LostEvent and
continues following instead of failing with ErrInvalidRead.

Fixes: cilium#17036

Signed-off-by: Michi Mutsuzaki <michi@isovalent.com>
Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Should we mark this for backport too?

@michi-covalent
Copy link
Contributor Author

test-me-please

@michi-covalent
Copy link
Contributor Author

restarting these 2:

aks: https://github.com/cilium/cilium/actions/runs/1101961941
eks: https://github.com/cilium/cilium/actions/runs/1101961955

Error : .github#L1
An unexpected error has occurred and we've been automatically notified. Errors are sometimes temporary, so please try again.

@michi-covalent
Copy link
Contributor Author

ci-aks

@michi-covalent
Copy link
Contributor Author

ci-eks

@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.10.4 Aug 5, 2021
@michi-covalent
Copy link
Contributor Author

https://github.com/cilium/cilium/actions/runs/1102872491

legitimate failures 😐

Error: Connectivity test failed: 2 tests failed
📋 Test Report
❌ 2/9 tests failed (2/81 actions), 0 tests skipped, 0 scenarios skipped:
Test [client-ingress]:
  ❌ client-ingress/client-to-client/ping-1: cilium-test/client2-666976c95b-fpfxz (10.240.0.42) -> cilium-test/client2-666976c95b-fpfxz (10.240.0.42:0)
Test [client-egress]:
  ❌ client-egress/pod-to-pod/curl-0: cilium-test/client2-666976c95b-fpfxz (10.240.0.42) -> cilium-test/echo-other-node-697d5d69b7-jk4kp (10.240.0.49:8080)

@gandro
Copy link
Member

gandro commented Aug 18, 2021

Looking at the failure, it is not related to the PR, it's a connectivity problem, not a problem with flow visibility (i.e. Hubble). Given that more and more users are hitting the error that this PR is addressing, I'll mark this ready-to-merge.

@gandro gandro added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 18, 2021
@ti-mo ti-mo merged commit 98b5fb3 into cilium:master Aug 18, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.10.4 Aug 18, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.10 in 1.10.4 Aug 19, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.10 in 1.10.4 Aug 19, 2021
giorio94 added a commit to giorio94/cilium-cli that referenced this pull request Jan 18, 2024
Currently, hubble flows are retrieved during sysdump collection passing
the `--follow` parameter to hubble observe. According to the comment,
this appeared to be a necessary hack to prevent the "requested data has
been overwritten and is no longer available" error. Yet, the consequence
is that the hubble observe command becomes blocking, and we relying on the
specified timeout only for its termination. When capturing a sysdump,
though, we are interested in storing (as many as possible) flows prior to
that moment (e.g., to investigate the causes of a connectivity test failure),
not the ones occurring during the collection of the sysdump itself.

Given that the original reason for using the `--follow` parameter got
fixed quite some time ago [1] and the fix is included in any Cilium
versions supported today, let's just get rid of it. The side effects
include the early termination of the collection process as soon as
all the flows have been retrieved, as well as the reduction of the
size of the sysdumps when increasing the timeout period, given that
we do no longer block until its expiration (this is relevant especially
in CI tests, as they are currently too large to be uploaded on GH).
Nonetheless, the timeout parameter is preserved to interrupt the
retrieval if taking too long.

[1]: cilium/cilium#17046

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
giorio94 added a commit to giorio94/cilium-cli that referenced this pull request Jan 19, 2024
Currently, hubble flows are retrieved during sysdump collection passing
the `--follow` parameter to hubble observe. According to the comment,
this appeared to be a necessary hack to prevent the "requested data has
been overwritten and is no longer available" error. Yet, the consequence
is that the hubble observe command becomes blocking, and we relying on the
specified timeout only for its termination. When capturing a sysdump,
though, we are interested in storing (as many as possible) flows prior to
that moment (e.g., to investigate the causes of a connectivity test failure),
not the ones occurring during the collection of the sysdump itself.

Given that the original reason for using the `--follow` parameter got
fixed quite some time ago [1] and the fix is included in any Cilium
versions supported today, let's just get rid of it. The side effects
include the early termination of the collection process as soon as
all the flows have been retrieved, as well as the reduction of the
size of the sysdumps when increasing the timeout period, given that
we do no longer block until its expiration (this is relevant especially
in CI tests, as they are currently too large to be uploaded on GH).
Nonetheless, the timeout parameter is preserved to interrupt the
retrieval if taking too long.

[1]: cilium/cilium#17046

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
aditighag pushed a commit to cilium/cilium-cli that referenced this pull request Jan 22, 2024
Currently, hubble flows are retrieved during sysdump collection passing
the `--follow` parameter to hubble observe. According to the comment,
this appeared to be a necessary hack to prevent the "requested data has
been overwritten and is no longer available" error. Yet, the consequence
is that the hubble observe command becomes blocking, and we relying on the
specified timeout only for its termination. When capturing a sysdump,
though, we are interested in storing (as many as possible) flows prior to
that moment (e.g., to investigate the causes of a connectivity test failure),
not the ones occurring during the collection of the sysdump itself.

Given that the original reason for using the `--follow` parameter got
fixed quite some time ago [1] and the fix is included in any Cilium
versions supported today, let's just get rid of it. The side effects
include the early termination of the collection process as soon as
all the flows have been retrieved, as well as the reduction of the
size of the sysdumps when increasing the timeout period, given that
we do no longer block until its expiration (this is relevant especially
in CI tests, as they are currently too large to be uploaded on GH).
Nonetheless, the timeout parameter is preserved to interrupt the
retrieval if taking too long.

[1]: cilium/cilium#17046

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
1.10.4
Backport done to v1.10
Development

Successfully merging this pull request may close these issues.

hubble observe fails with "requested data has been overwritten and is no longer available"
5 participants