Fix reordering/reentrancy bug in NIOAsyncWriter + NIOAsyncChannel #2587
Conversation
Force-pushed from 3609bb1 to b0e3e99.
```diff
 ///
-/// - Note: This is guaranteed to be called _exactly_ once.
+/// - Note: This is guaranteed to be called _at most_ once.
```
I'm a bit concerned about this change in behaviour, users could be relying on it as a signal to e.g. tear some things down.
So the behaviour here changed: we only call `didTerminate` when somebody called `writer.finish()` and not on `sink.finish()` anymore. The logic before was quite weird and it caused problems with overlapping exclusive access that I fixed in previous PRs. Now the logic is that a call to `didTerminate` should result in the underlying resource triggering termination, and then a call to `sink.didFinish` indicates that this is done.
I am on the fence with this change so happy to discuss this more.
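To make the lifecycle described above concrete, here is a minimal sketch. The types and names (`RecordingDelegate`, `Sink`, `writerFinished`) are illustrative stand-ins, not the real `NIOAsyncWriter` API: `didTerminate` asks the delegate to tear the underlying resource down, while the sink's `finish` is terminal and must not trigger `didTerminate` again.

```swift
// Hedged sketch of the intended lifecycle; simplified stand-ins for the
// real NIOAsyncWriter / NIOAsyncWriterSinkDelegate types.
final class RecordingDelegate {
    private(set) var didTerminateCalls = 0
    func didTerminate() { didTerminateCalls += 1 }
}

final class Sink {
    let delegate: RecordingDelegate
    private var finished = false
    init(delegate: RecordingDelegate) { self.delegate = delegate }

    /// Called when the writer side finished: triggers termination at most once.
    func writerFinished() {
        guard !finished else { return }
        finished = true
        delegate.didTerminate()
    }

    /// Called once the resource's teardown completed: terminal, no callback.
    func finish() {
        finished = true // intentionally does NOT call delegate.didTerminate()
    }
}
```

This mirrors why `didTerminate` moves from "exactly once" to "at most once": a sink that only ever sees `finish()` never emits the termination event.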
Do you think anyone outside of NIO actually implements this? Might be worth having a quick search.
I just did a quick GH search and couldn't find anyone besides grpc-swift using this, and there we are only relying on `didTerminate` being called from `writer.finish` and not from `sink.finish`, which is the intention of this PR.
```swift
case .retry:
    // The yield has to be retried. We are recursively calling yield here
    // but don't expect to blow the stack.
```
Why do we think this won't happen?
I changed this to a while loop now.
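The recursion-to-loop change can be sketched like this. The `YieldAction` enum and `processYield` function are hypothetical illustrations, not the actual state machine: the point is that a `.retry` outcome re-enters the loop instead of recursing, so repeated retries cannot grow the call stack.

```swift
// Illustrative sketch: drain a queue of actions with a loop instead of
// recursion, so `.retry` cannot blow the stack no matter how often it occurs.
enum YieldAction {
    case callDidYield
    case retry
    case done
}

func processYield(actions: inout [YieldAction]) -> Int {
    var processed = 0
    // Instead of `case .retry: return processYield(...)` we simply loop
    // and re-evaluate on the next iteration.
    while !actions.isEmpty {
        let action = actions.removeFirst()
        switch action {
        case .retry:
            continue // re-enter the loop; stack depth stays constant
        case .callDidYield, .done:
            processed += 1
        }
    }
    return processed
}
```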
```swift
inDelegateOutcall: Bool,
suspendedYields: _TinyArray<SuspendedYield>,
cancelledYields: [YieldID],
knownYieldIDs: _TinyArray<YieldID>,
```
What are the known yield IDs?
I added a comment to clarify this, but they are the yields that were already buffered before we got the finish.
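A small sketch of that idea, with hypothetical names (`FinishingState`, `mayProceed`) that only mirror the diff above, not the real state machine: the known yield IDs gate which suspended yields are still allowed to complete after `finish()` arrived.

```swift
// Illustrative sketch: only yields that were buffered before the finish
// (i.e. whose IDs are "known") may still proceed; later yields are rejected.
struct YieldID: Hashable {
    let value: Int
}

struct FinishingState {
    var knownYieldIDs: Set<YieldID>

    /// A suspended yield may proceed only if it was buffered before finish.
    func mayProceed(_ id: YieldID) -> Bool {
        knownYieldIDs.contains(id)
    }
}
```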
# Motivation

While testing the latest async interfaces we found a potential reordering/reentrancy bug in the `NIOAsyncWriter`. This was caused by our latest performance changes, where we fast-pathed calls in `didYield` to not hop. The problem was in the following flow:

1. Task 1: Calls `outbound.write()` -> which led to an `EventLoop` enqueue with `didYield`
2. Task 1: Calls `outbound.write()` -> which led to an `EventLoop` enqueue with `didYield`
3. EventLoop: While processing the write from 1. the channel became **not** writable
4. Task 1: Calls `outbound.write()` -> which led to buffering the write in the writer's state machine since we are **not** writable
5. EventLoop: While still processing the write from 1. the channel became writable again -> We informed the `NIOAsyncWriter` about this, which unbuffered the write from 4. that was stored in the state machine and called `didYield`. Since we are on the EventLoop already, we processed the write right away

The above showcases a flow where we reordered the writes from 2. and 4.

# Modification

This PR fixes the above issue while upholding a few constraints:

1. Produce as few context switches as possible
2. Minimize allocations

I tried different approaches but in the end decided to do the following:

1. Make sure to never call `didYield`/`didTerminate` from calls on the `NIOAsyncWriter.Sink`
2. Don't coalesce the elements of different writes in the `NIOAsyncWriter`, but rather use the suspended tasks to retry a write after they were suspended. I chose to do this since I wanted to avoid any allocation (remember, writers are `some Sequence`) and because we assume that continuous contention in a multi-producer pattern is low.
3. Make sure that `Sink.finish()` is terminal and does not lead to a `didTerminate` event. This is in line with 1.

One important thing to call out: since our `writer.finish()` method is not `async`, we have to buffer the finish event and deliver it with the yield that got buffered before we transitioned to `writerFinished`.

# Result

No more reordering/reentrancy problems in `NIOAsyncWriter` or `NIOAsyncChannel`.
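The buffered-finish behaviour can be sketched as a tiny state machine. The names (`WriterStateMachine`, `bufferYield`, `didFlushYield`) follow the description but are a simplified illustration, not swift-nio's actual implementation: because `finish()` cannot suspend, it only records the finish while yields are still buffered, and `didTerminate` is delivered when the last buffered yield flushes.

```swift
// Illustrative sketch of deferring the terminate event until the yields
// that were buffered before finish() have been flushed.
enum WriterState: Equatable {
    case streaming(bufferedYields: Int)
    case writerFinished(bufferedYields: Int) // finish recorded, yields pending
    case finished
}

struct WriterStateMachine {
    private(set) var state: WriterState = .streaming(bufferedYields: 0)

    mutating func bufferYield() {
        if case .streaming(let n) = state {
            state = .streaming(bufferedYields: n + 1)
        }
    }

    /// Synchronous finish: returns true if didTerminate may fire right away.
    mutating func finish() -> Bool {
        switch state {
        case .streaming(0):
            state = .finished
            return true // nothing buffered: terminate immediately
        case .streaming(let n):
            state = .writerFinished(bufferedYields: n)
            return false // buffer the finish event for later delivery
        default:
            return false
        }
    }

    /// Returns true when the buffered finish should now be delivered.
    mutating func didFlushYield() -> Bool {
        switch state {
        case .streaming(let n):
            state = .streaming(bufferedYields: n - 1)
            return false
        case .writerFinished(1):
            state = .finished
            return true // last pre-finish yield flushed: deliver didTerminate
        case .writerFinished(let n):
            state = .writerFinished(bufferedYields: n - 1)
            return false
        case .finished:
            return false
        }
    }
}
```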
Force-pushed from 05e8ce4 to 09c595c.
```swift
switch action {
case .callDidYield(let delegate):
    // We are allocating a new Deque for every write here
```
Ouch. I didn't realise we did this.
Yeah, this is one of the early design choices, because users can yield us any kind of sequence and we have to normalise this in the `Deque` case. Right now we have provided a fast path for a single element; we ought to be able to do the same for `Deque`/`Array` and more.
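One way the single-element fast path could look, as a hedged sketch (the `YieldedElements` enum and `normalise` function are hypothetical, not swift-nio's types, and a plain `Array` stands in for `Deque`): a one-element yield is passed through directly, while longer sequences are copied into a buffer.

```swift
// Illustrative sketch: avoid allocating a buffer for the common case of a
// single-element yield; only multi-element sequences get normalised.
enum YieldedElements<Element> {
    case single(Element)        // fast path: no buffer allocation
    case multiple([Element])    // general path: normalised into storage
}

func normalise<S: Sequence>(_ sequence: S) -> YieldedElements<S.Element> {
    var iterator = sequence.makeIterator()
    guard let first = iterator.next() else { return .multiple([]) }
    guard let second = iterator.next() else { return .single(first) }
    var buffer = [first, second]
    while let next = iterator.next() { buffer.append(next) }
    return .multiple(buffer)
}
```

Extending this shape to recognise an incoming `Deque`/`Array` and reuse its storage is the follow-up optimisation hinted at above.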