
Fix reordering/reentrancy bug in NIOAsyncWriter + NIOAsyncChannel #2587

Merged
merged 2 commits into main from fb-async-writer-reentrancy on Nov 8, 2023

Conversation

FranzBusch
Contributor

Motivation

While testing the latest async interfaces we found a potential reordering/reentrancy bug in the `NIOAsyncWriter`. This was caused by our latest performance changes, where we fast-pathed calls to `didYield` to avoid hopping. The problem was in the following flow:

  1. Task 1: Calls `outbound.write()` -> which led to an `EventLoop` enqueue with `didYield`
  2. Task 1: Calls `outbound.write()` -> which led to an `EventLoop` enqueue with `didYield`
  3. EventLoop: While processing the write from 1. the channel became **not** writable
  4. Task 1: Calls `outbound.write()` -> which led to buffering the write in the writer's state machine since we are **not** writable
  5. EventLoop: While still processing the write from 1. the channel became writable again -> We informed the `NIOAsyncWriter` about this, which unbuffered the write from 4. that was stored in the state machine and called `didYield`. Since we were already on the `EventLoop`, we processed the write right away

The flow above shows how the writes from 2. and 4. got reordered.
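The reordering can be modelled with a small, self-contained sketch (purely illustrative; the queue, buffer, and write numbers are hypothetical stand-ins for the real `EventLoop` and state machine — here the second and third writes swap, mirroring the PR's writes 2. and 4.):

```swift
// Toy model of the fast-path reordering (illustrative only; none of these
// names are NIO's). Writes 1 and 2 enqueue their didYield onto a mock
// event-loop queue; write 3 is buffered because the channel is unwritable.
var delivered: [Int] = []          // order in which "didYield" sees writes
var loopQueue: [() -> Void] = []   // mock EventLoop task queue
var bufferedWrites: [Int] = []     // writer state machine buffer

loopQueue.append { delivered.append(1) }  // step 1: enqueued didYield
loopQueue.append { delivered.append(2) }  // step 2: enqueued didYield
bufferedWrites.append(3)                  // step 4: buffered, not writable

// Steps 3/5: while the loop processes write 1, the channel turns writable
// again; since we are already on the loop, the fast path delivers the
// buffered write 3 immediately, jumping ahead of the enqueued write 2.
loopQueue.removeFirst()()                    // write 1 runs
delivered.append(contentsOf: bufferedWrites) // fast-path delivery of write 3
bufferedWrites.removeAll()
loopQueue.removeFirst()()                    // write 2 finally runs

print(delivered) // [1, 3, 2]
```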

Modification

This PR fixes the above issue while upholding a few constraints:

  1. Produce as few context switches as possible
  2. Minimize allocations

I tried different approaches but in the end decided to do the following:

  1. Make sure to never call `didYield`/`didTerminate` from calls on the `NIOAsyncWriter.Sink`
  2. Don't coalesce the elements of different writes in the `NIOAsyncWriter`, but rather use the suspended tasks to retry a write after they were suspended. I chose to do this since I wanted to avoid any allocation (remember writers are `some Sequence`) and because we assume that continuous contention in a multi-producer pattern is low.
  3. Make sure that `Sink.finish()` is terminal and does not lead to a `didTerminate` event. This is in line with 1.

One important thing to call out: since our `writer.finish()` method is not `async`, we have to buffer the finish event and deliver it with the yield that got buffered before we transitioned to `writerFinished`.
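A minimal sketch of that buffering, assuming a hypothetical simplified state machine (the enum and function names are illustrative, not NIO's actual implementation):

```swift
// Because finish() cannot suspend, the state machine only records the
// finish and delivers it together with the previously buffered yields.
enum State {
    case streaming(bufferedYields: [Int])
    case writerFinished(bufferedYields: [Int]) // finish seen, yields pending
    case finished
}

var state = State.streaming(bufferedYields: [42])

// writer.finish() is synchronous: it only transitions the state and returns.
func finish() {
    if case .streaming(let yields) = state {
        state = .writerFinished(bufferedYields: yields)
    }
}

// When the sink drains the buffered yields it also observes the finish.
func drain() -> (yields: [Int], didFinish: Bool) {
    if case .writerFinished(let yields) = state {
        state = .finished
        return (yields, true)
    }
    return ([], false)
}

finish()
let drained = drain()
print(drained.yields, drained.didFinish) // [42] true
```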

Result

No more reordering/reentrancy problems in `NIOAsyncWriter` or `NIOAsyncChannel`.

@FranzBusch FranzBusch added the patch-version-bump-only For PRs that when merged will only cause a bump of the patch version, ie. 1.0.x -> 1.0.(x+1) label Nov 6, 2023
///
/// - Note: This is guaranteed to be called _exactly_ once.
/// - Note: This is guaranteed to be called _at most_ once.
Contributor

I'm a bit concerned about this change in behaviour, users could be relying on it as a signal to e.g. tear some things down.

Contributor Author

So the behaviour here changed so that we only call `didTerminate` when somebody called `writer.finish()` and no longer on `sink.finish()`. The logic before was quite weird and it caused problems with overlapping exclusive access that I fixed in previous PRs. Now the logic is that a call to `didTerminate` should result in the underlying resource triggering termination, and then a call to `sink.didFinish` indicates that this is done.
I am on the fence with this change, so I'm happy to discuss this more.

Contributor

Do you think anyone outside of NIO actually implements this? Might be worth having a quick search.

Contributor Author

I just did a quick GH search and couldn't find anyone besides grpc-swift using this, and there we are only relying on `didTerminate` being called from `writer.finish` and not from `sink.finish`, which is the intention of this PR.


case .retry:
// The yield has to be retried. We are recursively calling yield here
// but don't expect to blow the stack.
Contributor

Why do we think this won't happen?

Contributor Author

I changed this to a while loop now
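A sketch of what such a loop-based retry can look like (hypothetical names and a mock state machine; the real one returns richer actions):

```swift
// Each attempt asks the state machine what to do; on .retry we loop
// instead of recursing, so repeated contention cannot grow the stack.
enum YieldAction { case yielded, retry }

var attemptsUntilAccepted = 3
func stateMachineYield() -> YieldAction {
    attemptsUntilAccepted -= 1
    return attemptsUntilAccepted > 0 ? .retry : .yielded
}

var attempts = 0
yieldLoop: while true {
    attempts += 1
    switch stateMachineYield() {
    case .yielded:
        break yieldLoop  // element accepted, we are done
    case .retry:
        continue         // re-drive the yield without recursion
    }
}
print(attempts) // 3
```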

Sources/NIOCore/AsyncSequences/NIOAsyncWriter.swift (outdated, resolved)
inDelegateOutcall: Bool,
suspendedYields: _TinyArray<SuspendedYield>,
cancelledYields: [YieldID],
knownYieldIDs: _TinyArray<YieldID>,
Contributor

What are the known yield IDs?

Contributor Author

I added a comment to clarify this but they are the buffered yields before we got the finish.
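In other words (a hypothetical illustration, not NIO's code): the known yield IDs are the IDs of yields that were already buffered when the finish arrived, and only those may still complete:

```swift
typealias YieldID = Int

// IDs of yields buffered before finish() was called;
// any yield arriving with an unknown ID can be rejected.
let knownYieldIDs: [YieldID] = [1, 2]
func mayComplete(_ id: YieldID) -> Bool { knownYieldIDs.contains(id) }

print(mayComplete(2), mayComplete(5)) // true false
```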


switch action {
case .callDidYield(let delegate):
// We are allocating a new Deque for every write here
Contributor

Ouch. I didn't realise we did this.

Contributor Author

Yeah, this was one of the early design choices, because users can yield us any kind of sequence and we have to normalise it into the `Deque` case. Right now we provide a fast path for a single element; we ought to be able to do the same for `Deque`/`Array` and more.
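A sketch of such a single-element fast path (illustrative; the real code normalises into a `Deque` from swift-collections, while this sketch uses `Array` to stay dependency-free, and the enum and function names are made up):

```swift
// Only materialise a container when the yielded sequence has more than
// one element; a single element skips the container allocation entirely.
enum Yield<Element> {
    case single(Element)      // fast path: no container allocated
    case multiple([Element])  // general path: sequence materialised
}

func normalise<S: Sequence>(_ sequence: S) -> Yield<S.Element> {
    var iterator = sequence.makeIterator()
    guard let first = iterator.next() else { return .multiple([]) }
    guard let second = iterator.next() else { return .single(first) }
    var elements = [first, second]
    while let next = iterator.next() { elements.append(next) }
    return .multiple(elements)
}

switch normalise([7]) {
case .single(let element): print("single", element)     // taken: single 7
case .multiple(let elements): print("multiple", elements)
}
```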

@FranzBusch FranzBusch merged commit 035141d into apple:main Nov 8, 2023
7 of 8 checks passed
@FranzBusch FranzBusch deleted the fb-async-writer-reentrancy branch November 8, 2023 10:38