hubble: Never fail with ErrInvalidRead #17046

michi-covalent · 2021-08-03T23:50:28Z

Currently GetFlows() fails with the following error when a position in
the ring buffer being read by Ring.read() has been overwritten:

requested data has been overwritten and is no longer available

This turned out to be impractical as it makes it difficult to read all
the flows in the ring buffer (e.g.. hubble observe --all). GetFlows()
would fail if Hubble observes a single flow between the reader rewinding
to the oldest position and retrieving the entry.

This patch modifies Ring.read() so that GetFlows() returns LostEvent
instead of stopping with an error. The caller of GetFlows() can then
decide how to handle LostEvent.

Note that this makes the behavior of Ring.read() consistent with that
of Ring.readFrom() used in the follow mode. It generates LostEvent and
continues following instead of failing with ErrInvalidRead.

Fixes: #17036

Signed-off-by: Michi Mutsuzaki michi@isovalent.com

michi-covalent · 2021-08-03T23:51:50Z

in addition to unit tests, manually validated the fix with hubble observe --all --debug:

# hubble observe --all --debug | head
time="2021-08-03T23:49:59Z" level=debug msg="Sending GetFlows request" request="number:18446744073709551615"
unknown response type: &{LostEvents:source:HUBBLE_RING_BUFFER  num_events_lost:1}
Aug  3 23:47:12.074: default/pod-to-a-9dc6d768c-rc7nc:49527 <- kube-system/kube-dns-6c7b8dc9f9-x962b:53 to-endpoint FORWARDED (UDP)
Aug  3 23:47:12.074: default/pod-to-a-9dc6d768c-rc7nc:34082 -> default/echo-a-56f9849b6b-jlbpr:8080 to-stack FORWARDED (TCP Flags: SYN)
Aug  3 23:47:12.074: default/pod-to-a-9dc6d768c-rc7nc:34082 -> default/echo-a-56f9849b6b-jlbpr:8080 to-endpoint FORWARDED (TCP Flags: SYN)
Aug  3 23:47:12.074: default/pod-to-a-9dc6d768c-rc7nc:34082 <- default/echo-a-56f9849b6b-jlbpr:8080 to-stack FORWARDED (TCP Flags: SYN, ACK)
Aug  3 23:47:12.074: default/pod-to-a-9dc6d768c-rc7nc:34082 <- default/echo-a-56f9849b6b-jlbpr:8080 to-endpoint FORWARDED (TCP Flags: SYN, ACK)
Aug  3 23:47:12.074: default/pod-to-a-9dc6d768c-rc7nc:34082 -> default/echo-a-56f9849b6b-jlbpr:8080 to-stack FORWARDED (TCP Flags: ACK)
Aug  3 23:47:12.074: default/pod-to-a-9dc6d768c-rc7nc:34082 -> default/echo-a-56f9849b6b-jlbpr:8080 to-endpoint FORWARDED (TCP Flags: ACK)
Aug  3 23:47:12.074: default/pod-to-a-9dc6d768c-rc7nc:34082 -> default/echo-a-56f9849b6b-jlbpr:8080 to-stack FORWARDED (TCP Flags: ACK, PSH)
Aug  3 23:47:12.074: default/pod-to-a-9dc6d768c-rc7nc:34082 -> default/echo-a-56f9849b6b-jlbpr:8080 to-endpoint FORWARDED (TCP Flags: ACK, PSH)

gandro

Thanks for doing this! I left an inline comment

pkg/hubble/observer/local_observer.go

Currently GetFlows() fails with the following error when a position in the ring buffer being read by Ring.read() has been overwritten: requested data has been overwritten and is no longer available This turned out to be impractical as it makes it difficult to read all the flows in the ring buffer (e.g.. hubble observe --all). GetFlows() would fail if Hubble observes a single flow between the reader rewinding to the oldest position and retrieving the entry. This patch modifies Ring.read() so that GetFlows() returns LostEvent instead of stopping with an error. The caller of GetFlows() can then decide how to handle LostEvent. Note that this makes the behavior of Ring.read() consistent with that of Ring.readFrom() used in the follow mode. It generates LostEvent and continues following instead of failing with ErrInvalidRead. Fixes: cilium#17036 Signed-off-by: Michi Mutsuzaki <michi@isovalent.com>

gandro

Nice work! Should we mark this for backport too?

michi-covalent · 2021-08-05T15:12:46Z

test-me-please

michi-covalent · 2021-08-05T20:41:03Z

restarting these 2:

aks: https://github.com/cilium/cilium/actions/runs/1101961941
eks: https://github.com/cilium/cilium/actions/runs/1101961955

Error : .github#L1
An unexpected error has occurred and we've been automatically notified. Errors are sometimes temporary, so please try again.

michi-covalent · 2021-08-05T20:41:14Z

ci-aks

michi-covalent · 2021-08-05T20:41:25Z

ci-eks

michi-covalent · 2021-08-05T22:18:28Z

https://github.com/cilium/cilium/actions/runs/1102872491

legitimate failures 😐

Error: Connectivity test failed: 2 tests failed
📋 Test Report
❌ 2/9 tests failed (2/81 actions), 0 tests skipped, 0 scenarios skipped:
Test [client-ingress]:
  ❌ client-ingress/client-to-client/ping-1: cilium-test/client2-666976c95b-fpfxz (10.240.0.42) -> cilium-test/client2-666976c95b-fpfxz (10.240.0.42:0)
Test [client-egress]:
  ❌ client-egress/pod-to-pod/curl-0: cilium-test/client2-666976c95b-fpfxz (10.240.0.42) -> cilium-test/echo-other-node-697d5d69b7-jk4kp (10.240.0.49:8080)

gandro · 2021-08-18T09:54:38Z

Looking at the failure, it is not related to the PR, it's a connectivity problem, not a problem with flow visibility (i.e. Hubble). Given that more and more users are hitting the error that this PR is addressing, I'll mark this ready-to-merge.

Currently, hubble flows are retrieved during sysdump collection passing the `--follow` parameter to hubble observe. According to the comment, this appeared to be a necessary hack to prevent the "requested data has been overwritten and is no longer available" error. Yet, the consequence is that the hubble observe command becomes blocking, and we relying on the specified timeout only for its termination. When capturing a sysdump, though, we are interested in storing (as many as possible) flows prior to that moment (e.g., to investigate the causes of a connectivity test failure), not the ones occurring during the collection of the sysdump itself. Given that the original reason for using the `--follow` parameter got fixed quite some time ago [1] and the fix is included in any Cilium versions supported today, let's just get rid of it. The side effects include the early termination of the collection process as soon as all the flows have been retrieved, as well as the reduction of the size of the sysdumps when increasing the timeout period, given that we do no longer block until its expiration (this is relevant especially in CI tests, as they are currently too large to be uploaded on GH). Nonetheless, the timeout parameter is preserved to interrupt the retrieval if taking too long. [1]: cilium/cilium#17046 Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

michi-covalent requested a review from a team August 3, 2021 23:50

maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 3, 2021

michi-covalent requested a review from glibsm August 3, 2021 23:50

maintainer-s-little-helper bot assigned glibsm Aug 3, 2021

michi-covalent added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Aug 3, 2021

maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 3, 2021

gandro requested changes Aug 4, 2021

View reviewed changes

pkg/hubble/observer/local_observer.go Show resolved Hide resolved

michi-covalent force-pushed the pr/michi/invalid-read branch from cfedde5 to 79be7fc Compare August 4, 2021 20:23

gandro approved these changes Aug 5, 2021

View reviewed changes

glibsm approved these changes Aug 5, 2021

View reviewed changes

maintainer-s-little-helper bot unassigned glibsm Aug 5, 2021

michi-covalent added the needs-backport/1.10 label Aug 5, 2021

maintainer-s-little-helper bot added this to Needs backport from master in 1.10.4 Aug 5, 2021

gandro added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 18, 2021

ti-mo merged commit 98b5fb3 into cilium:master Aug 18, 2021

ti-mo mentioned this pull request Aug 18, 2021

v1.10 backports 2021-08-18 #17183

Merged

ti-mo added backport-pending/1.10 and removed needs-backport/1.10 labels Aug 18, 2021

maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.10.4 Aug 18, 2021

tklauser added backport-done/1.10 and removed backport-pending/1.10 labels Aug 19, 2021

maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.10 in 1.10.4 Aug 19, 2021

joestringer mentioned this pull request Sep 1, 2021

Prepare for release v1.10.4 #17287

Merged

giorio94 mentioned this pull request Jan 18, 2024

sysdump: don't specify --follow while collecting hubble flows cilium/cilium-cli#2240

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

hubble: Never fail with ErrInvalidRead #17046

hubble: Never fail with ErrInvalidRead #17046

michi-covalent commented Aug 3, 2021 •

edited

michi-covalent commented Aug 3, 2021

gandro left a comment

gandro left a comment

michi-covalent commented Aug 5, 2021

michi-covalent commented Aug 5, 2021

michi-covalent commented Aug 5, 2021

michi-covalent commented Aug 5, 2021

michi-covalent commented Aug 5, 2021

gandro commented Aug 18, 2021

hubble: Never fail with ErrInvalidRead #17046

hubble: Never fail with ErrInvalidRead #17046

Conversation

michi-covalent commented Aug 3, 2021 • edited

michi-covalent commented Aug 3, 2021

gandro left a comment

Choose a reason for hiding this comment

gandro left a comment

Choose a reason for hiding this comment

michi-covalent commented Aug 5, 2021

michi-covalent commented Aug 5, 2021

michi-covalent commented Aug 5, 2021

michi-covalent commented Aug 5, 2021

michi-covalent commented Aug 5, 2021

gandro commented Aug 18, 2021

michi-covalent commented Aug 3, 2021 •

edited