Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hubble/container: Properly deal with nil values in RingReader #11323

Merged
merged 1 commit into from May 6, 2020

Conversation

gandro
Copy link
Member

@gandro gandro commented May 4, 2020

When the ring-buffer is not full, we must treat nil values the same as
reads outside of the readable indices. This commit ensures that read
treats nil values the same way readFrom does.

This has an impact on the RingReader, as we should not try to either
increase or decrease the index if the current value is unreadable.
If the current index is unreadable, then we have either read past the
last written value, or we have tried to access an index which has
already been overwritten.

This commit therefore adds an error return type to
RingReader.{Next,Prev} in order to determine which of the two cases
occured.

Most notably, Next may now return ErrInvalidData. This means this
means that the value it tried to access just got overwritten by the
writer from behind. In such cases we want to return an error, as
otherwise RingReader.Next would silently stop reading and
subsequentially terminate the request as if it had reached most
recent value in the ring buffer (which it has not).

The error is also added to Previous, however in that case
we only ever expect ErrInvalidData, as it traverses the ring buffer
in the opposite direction of the writer.

The check for concurrent writes is currently missing from readFrom,
which can suffer from the same problem. This will be added in a
follow-up PR.

Fixes: cilium/hubble#131

@gandro gandro added release-note/bug This PR fixes an issue in a previous release of Cilium. sig/hubble Impacts hubble server or relay labels May 4, 2020
@gandro gandro requested a review from a team May 4, 2020 19:32
@maintainer-s-little-helper
Copy link

Commit 357e0a4b493be4c991d31b68f0a3b0f315366277 does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label May 4, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.8.0 May 4, 2020
@gandro gandro force-pushed the pr/gandro/ringreader-detect-concurrent-write branch from 357e0a4 to b06ee8f Compare May 4, 2020 19:32
@maintainer-s-little-helper
Copy link

Commit 357e0a4b493be4c991d31b68f0a3b0f315366277 does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@gandro
Copy link
Member Author

gandro commented May 4, 2020

test-me-please

pkg/hubble/container/ring_reader_test.go Outdated Show resolved Hide resolved
pkg/hubble/container/ring_reader.go Outdated Show resolved Hide resolved
@coveralls
Copy link

coveralls commented May 4, 2020

Coverage Status

Coverage increased (+0.03%) to 44.479% when pulling 4229ac8 on pr/gandro/ringreader-detect-concurrent-write into 5aeea78 on master.

@gandro gandro force-pushed the pr/gandro/ringreader-detect-concurrent-write branch from b06ee8f to 8835e85 Compare May 4, 2020 20:18
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label May 4, 2020
@gandro
Copy link
Member Author

gandro commented May 4, 2020

test-me-please

@glibsm
Copy link
Member

glibsm commented May 4, 2020

Are we expecting that the --since 10y is mostly fixed by this patch, or will those types of queries more often than not surface the concurrent writer error on busy nodes?

@gandro
Copy link
Member Author

gandro commented May 4, 2020

Are we expecting that the --since 10y is mostly fixed by this patch, or will those types of queries more often than not surface the concurrent writer error on busy nodes?

I believe they will suffer from the concurrent writer issue.

I'm currently thinking that the way to implement something like hubble observe --all (without changing the ring buffer implementation or introducing the lock) is to iterate the ring buffer from the most recent to the oldest entry. This way we can just stop once we hit the writer cursor.

@gandro
Copy link
Member Author

gandro commented May 4, 2020

I should also note: The concurrent write error only occurs if there really is a conflict, i.e. the events we were about to read got lost. If the reader is reading quick enough to stay ahead of the writer, it can actually "loop around", i.e. dump the ring buffer content which existed at the time the request was, followed by all flows that have been added since the request was started.

Me and @rolinh didn't consider this a bug, therefore I didn't "fix" it. It can however easily be avoided if we never read past the initial LastParallelWrite value. I would mean however that hubble observe will never the very most recent flows that occured after the request was started.

Copy link
Contributor

@michi-covalent michi-covalent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now i see all the flows from ring buffer:

bash-5.0# hubble observe --last 10000 -o json | wc -l
4095
bash-5.0# hubble observe --since 1h -o json | wc -l
4095

sometimes i get more than len(ringbuffer) - 1 flows, but i guess this can happen if new flows are coming in as we read.

regarding @gandro 's comment, i think it's totally fine to stop at the initial LastParallelWrite index. you can always follow if you want to get more.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 4, 2020
@rolinh
Copy link
Member

rolinh commented May 5, 2020

sometimes i get more than len(ringbuffer) - 1 flows, but i guess this can happen if new flows are coming in as we read.

This feels non intuitive at first but it's definitely not a problem as long as the reader is able to eventually catch up with the writers.

regarding @gandro 's comment, i think it's totally fine to stop at the initial LastParallelWrite index. you can always follow if you want to get more.

The problem is not really to "not get enough" but that the caller will in effect miss the most recent flows. I'd be in favor of keeping the current behavior unless it's a source of bugs.

Copy link
Member

@rolinh rolinh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for having invested time into this issue. I know the ring buffer is tricky to work with :)
I think this is definitely going to the right direction. That said, I have one concern with a suggested change. Could you have a look at it please?

pkg/hubble/container/ring_reader.go Outdated Show resolved Hide resolved
@rolinh rolinh removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 5, 2020
@gandro gandro force-pushed the pr/gandro/ringreader-detect-concurrent-write branch from 8835e85 to bbbc6ba Compare May 5, 2020 12:46
@gandro
Copy link
Member Author

gandro commented May 5, 2020

test-me-please

@gandro gandro requested a review from rolinh May 5, 2020 12:47
Comment on lines -515 to -519
if !ok {
t.Errorf("Should be able to read position %x", lastWrite)
}
if entry != nil {
t.Errorf("Read Event should be %+v, got %+v instead", nil, entry)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewer note: This test relied on read returning nil values. With this PR, we never allow nil values to be returned from read (same as readFrom, which had this restriction already for a while)

Copy link
Member

@rolinh rolinh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're almost there but I left a few comments. Note that the commit message is really hard to comprehend, eg:

If the current index is unreadable, then we have reached either end of the ring buffer.

What's the "end" of a ring structure?

Next may return ErrConcurrentRead

I think that the error name has changed.

// ErrConcurrentWrite indicates that a value was overwritten
ErrConcurrentWrite = errors.New("concurrent write detected")
// ErrEndOfBuffer is an alias for io.EOF
ErrEndOfBuffer = io.EOF
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this alias is strictly necessary as io.EOF is widely accepted as meaning "end of whatever".
Besides, I'm still trying to understand what the "end" of a ring structure can be :)

@@ -24,6 +26,13 @@ import (
"github.com/cilium/cilium/pkg/lock"
)

var (
// ErrConcurrentWrite indicates that a value was overwritten
ErrConcurrentWrite = errors.New("concurrent write detected")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your commit message mentions ErrConcurrentRead. I think that both names are misleading anyway as reading concurrently should not be a problem and a "concurrent write" error while reading is rather something unexpected. What about a more generic name like ErrInvalidRead for instance?

Copy link
Member Author

@gandro gandro May 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, the ConcurrentRead is a typo in the commit messages. I'll think about the naming some more. I don't want it to be too generic, since the error is also presented to the user, and invalid read sounds like something is broken internally, when in fact they just tried to read stale data.

pkg/hubble/container/ring.go Outdated Show resolved Hide resolved
pkg/hubble/container/ring.go Outdated Show resolved Hide resolved
pkg/hubble/container/ring_reader.go Outdated Show resolved Hide resolved
if e == nil {
e, err := reader.Previous()
if err != nil {
idx++ // we went backward 1 too far
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we went backward 1 too far

Unless the error is io.EOF, no?

Copy link
Member Author

@gandro gandro May 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kinda, I was not sure how to handle this properly. If Previous returns io.EOF, then something went horribly wrong. But yeah, we can check the error for ErrConcurrentWrite and propagate everything else.

@gandro gandro added this to In progress in Hubble server May 5, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 5, 2020
@gandro gandro force-pushed the pr/gandro/ringreader-detect-concurrent-write branch from bbbc6ba to bee1bf7 Compare May 5, 2020 17:00
@gandro
Copy link
Member Author

gandro commented May 5, 2020

test-me-please

pkg/hubble/container/ring.go Outdated Show resolved Hide resolved
@aanm aanm removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 6, 2020
When the ring-buffer is not full, we must treat nil values the same as
reads outside of the readable indices. This commit ensures that `read`
treats nil values the same way `readFrom` does.

This has an impact on the RingReader, as we should not try to either
increase or decrease the index if the current value is unreadable.
If the current index is unreadable, then we have either read past the
last written value, or we have tried to access an index which has
already been overwritten.

This commit therefore adds an error return type to
`RingReader.{Next,Prev}` in order to determine which of the two cases
occured.

Most notably, `Next` may now return ErrInvalidData. This means this
means that the value it tried to access just got overwritten by the
writer from behind. In such cases we want to return an error, as
otherwise `RingReader.Next` would silently stop reading and
subsequentially terminate the request as if it had reached most
recent value in the ring buffer (which it has not).

The error is also added to `Previous`, however in that case
we only ever expect ErrInvalidData, as it traverses the ring buffer
in the opposite direction of the writer.

The check for concurrent writes is currently missing from `readFrom`,
which can suffer from the same problem. This will be added in a
follow-up PR.

Fixes: cilium/hubble#131

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/ringreader-detect-concurrent-write branch from bee1bf7 to 4229ac8 Compare May 6, 2020 08:45
@gandro
Copy link
Member Author

gandro commented May 6, 2020

test-me-please

Copy link
Member

@rolinh rolinh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks for the great work! LGTM.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 6, 2020
@rolinh rolinh merged commit 8fab042 into master May 6, 2020
1.8.0 automation moved this from In progress to Merged May 6, 2020
@rolinh rolinh deleted the pr/gandro/ringreader-detect-concurrent-write branch May 6, 2020 12:12
@rolinh rolinh moved this from In progress (1.8) to Done in Hubble server May 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/hubble Impacts hubble server or relay
Projects
No open projects
1.8.0
  
Merged
Hubble server
  
Done
Development

Successfully merging this pull request may close these issues.

observe does not print anything when the ring buffer is full
6 participants