Parallelize epoll events on thread pool and process events in the same thread #35330

Merged May 3, 2020 (13 commits)

Conversation

@kouvel (Member) commented Apr 23, 2020

It was seen that on larger machines, throughput drops significantly when fewer epoll threads are used. One issue was that the epoll threads could not queue work items to the thread pool quickly enough to keep thread pool threads fully occupied. When the thread pool is not fully occupied, thread pool threads end up waiting for work, and enqueues become much slower because each one must release a thread. This creates a positive feedback loop, with many thread pool threads being released to look for work items and finding none. It also doesn't help that the thread pool requests many more threads than necessary for the number of work items enqueued, and that the enqueue overhead is repeated for each epoll socket event.

Following @adamsitnik's idea of batching the enqueues to the thread pool and requesting only one thread to limit the overhead, this change tries to reduce the overhead on epoll threads by delegating it to the thread pool, where that work is automatically parallelized, and to decrease the number of redundant thread pool work items a bit. In outline (a minimal sketch of the scheduling pattern follows the list):

  • The epoll thread enqueues socket events into a concurrent queue specific to that epoll thread. When the queue is busy enough, enqueues do not contend.
  • The epoll thread then schedules a work item to the thread pool to process the events, if one is not already scheduled.
  • When the work item runs, it dequeues an event, schedules another work item if necessary to parallelize the work, processes the event, and continues until the event queue is empty.
  • At most one work item is scheduled to the thread pool at a time, to avoid over-parallelizing the work.
  • Since socket events are now already processed on a thread pool thread, the change also avoids scheduling a redundant thread pool work item to perform the socket operations and user callbacks.
  • The change is relatively more beneficial when fewer epoll threads are used. A heuristic for the epoll thread count is being discussed separately; for now, this change doesn't change the number of epoll threads.
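
For illustration, a minimal sketch of that scheduling pattern (the type and member names here are hypothetical, not the PR's actual code): the epoll thread enqueues events and schedules at most one thread pool work item, and the work item replicates itself while events remain.

```csharp
using System.Collections.Concurrent;
using System.Threading;

internal sealed class EventQueueProcessor
{
    private readonly ConcurrentQueue<int> _events = new ConcurrentQueue<int>(); // stand-in for socket events
    private int _workItemScheduled; // 0 = none pending, 1 = one scheduled

    // Called from the epoll thread for each event returned by epoll_wait.
    public void Enqueue(int socketEvent)
    {
        _events.Enqueue(socketEvent);
        ScheduleWorkItemIfNeeded();
    }

    private void ScheduleWorkItemIfNeeded()
    {
        // Schedule a work item only if one is not already scheduled.
        if (Interlocked.CompareExchange(ref _workItemScheduled, 1, 0) == 0)
        {
            ThreadPool.UnsafeQueueUserWorkItem(s => ((EventQueueProcessor)s!).ProcessEvents(), this);
        }
    }

    private void ProcessEvents()
    {
        // Clear the flag first so a concurrent Enqueue can schedule a new work item;
        // a redundant work item is harmless, a stranded event would not be.
        Volatile.Write(ref _workItemScheduled, 0);

        while (_events.TryDequeue(out int socketEvent))
        {
            // More events may remain: speculatively schedule one replica to parallelize.
            ScheduleWorkItemIfNeeded();

            ProcessEvent(socketEvent); // socket operation + user callback run on this thread
        }
    }

    private static void ProcessEvent(int socketEvent) { /* placeholder */ }
}
```

The CAS on the flag keeps at most one work item pending at a time, and clearing the flag before draining guarantees that an event enqueued mid-drain is either seen by this work item or gets its own work item.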

@kouvel added this to the 5.0 milestone Apr 23, 2020
@kouvel self-assigned this Apr 23, 2020
@ghost commented Apr 23, 2020

Tagging subscribers to this area: @dotnet/ncl
Notify danmosemsft if you want to be subscribed.

@kouvel (Member Author) commented Apr 23, 2020

Perf on JsonPlatform benchmark

  • Connections: 512
  • The current default number of epoll threads for 512 connections is 16
  • Hill climbing disabled. Other issues were identified there; I removed that variable for these tests.
  • Some of the data was collected with only the first commit applied. I have verified that perf is about the same with the other commits.

12-proc x64 machine

| Epoll threads | Before | After | Diff |
| --- | --- | --- | --- |
| 16 | 455776 | 493383 | 8.3% |
| 4 | 484207 | 517855 | 6.9% |
| 2 | 505386 | 546309 | 8.1% |
| 1 | 526676 | 561758 | 6.7% |
| Max | 526676 | 561758 | 6.7% |

28-proc x64 machine

| Epoll threads | Before | After | Diff |
| --- | --- | --- | --- |
| 16 | 914569 | 966343 | 5.7% |
| 4 | 1045350 | 1083586 | 3.7% |
| 2 | 992084 | 1123756 | 13.3% |
| 1 | 770677 | 1145468 | 48.6% |
| Max | 1045350 | 1145468 | 9.6% |

56-proc x64 machine (2-socket), limited to 1 socket

| Epoll threads | Before | After | Diff |
| --- | --- | --- | --- |
| 16 | 1022563 | 1067926 | 4.4% |
| 4 | 1101562 | 1172907 | 6.5% |
| 2 | 1098544 | 1202171 | 9.4% |
| 1 | 606518 | 1191682 | 96.5% |
| Max | 1101562 | 1202171 | 9.1% |

56-proc x64 machine (2-socket), not limited

| Epoll threads | Before | After | Diff |
| --- | --- | --- | --- |
| 16 | 615418 | 1177954 | 91.4% |
| 4 | 510258 | 1234930 | 142.0% |
| 2 | 367121 | 1134470 | 209.0% |
| 1 | 131383 | 1153392 | 777.9% |
| Max | 615418 | 1234930 | 100.7% |

32-proc arm64 machine

| Epoll threads | Before | After | Diff |
| --- | --- | --- | --- |
| 16 | 481059 | 496148 | 3.1% |
| 4 | 503988 | 526638 | 4.5% |
| 2 | 481798 | 505688 | 5.0% |
| 1 | 448148 | 437307 | -2.4% |
| Max | 503988 | 526638 | 4.5% |

@kouvel (Member Author) commented Apr 23, 2020

CC @adamsitnik @stephentoub @tmds

@kouvel (Member Author) commented Apr 23, 2020

The event-processing work items can theoretically become long-running; we should probably fix that.

@stephentoub (Member) left a comment

Thanks for working on this.

```csharp
{
    ManualResetEventSlim? e = Event;
    if (e != null)
    {
        // Sync operation. Signal waiting thread to continue processing.
        e.Set();
    }
    else if (processAsyncOperationSynchronously)
```
Member

I think we need to change slightly how this is structured, in particular for the e != null case. We get there if a synchronous operation is performed on a Socket on which an asynchronous operation was ever performed. In such cases, we've permanently moved the socket to be non-blocking, which means we need to simulate the blocking behavior of all subsequent sync operations, and we do that by using an MRES instead of a callback. We don't want to require a thread pool thread just to set that event, as doing so could lead to thread pool starvation, with a sync operation blocked on a thread pool thread, waiting for the thread pool to process the work item that would unblock it. So, we want the epoll thread to set such an MRES rather than queuing a work item to do it.

Member

> So, we want the epoll thread to set such an MRES rather than queuing a work item to do it.

I am afraid that it would require us to make the enqueueing more complex and decrease throughput in a noticeable way.

To set the MRES on an epoll thread, we would need to check two queues (sends and receives), which would require taking two locks:

and then perform zero to two casts (depending on whether the queues are empty):

```csharp
get { return CallbackOrEvent as ManualResetEventSlim; }
```

Of course, we would have to do that for every epoll event returned by epoll_wait.

With @kouvel's proposal, after we receive an epoll notification we just add a batch of simple events to a queue (which is very fast) and schedule a work item to the thread pool.

@stephentoub (Member) commented Apr 23, 2020

> I am afraid that it would require us to make the enqueueing more complex and decrease throughput in a noticeable way.

We only need to do it if the event is for a sync operation, which can be checked cheaply.

The alternative is potential deadlock / long delays while waiting for the thread pool's starvation detection to introduce more threads.
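
As a toy illustration of that failure mode (separate from the Sockets code, and simplified): block every initially-available worker thread on an event that only a later thread pool work item can set, and progress then depends on the pool's slow starvation-detection thread injection.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class StarvationDemo
{
    static void Main()
    {
        var mres = new ManualResetEventSlim();
        int workers = Environment.ProcessorCount; // default min worker thread count

        var blockers = new Task[workers];
        for (int i = 0; i < workers; i++)
        {
            blockers[i] = Task.Run(() => mres.Wait()); // every available worker blocks here
        }

        var sw = Stopwatch.StartNew();
        Task setter = Task.Run(() => mres.Set()); // queued behind the blocked workers

        setter.Wait(); // completes only once the pool injects an extra thread
        Console.WriteLine($"Set() ran after {sw.ElapsedMilliseconds} ms"); // typically 500+ ms
        Task.WaitAll(blockers);
    }
}
```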

Member

> We only need to do it if the event is for a sync operation, which can be checked cheaply.

But we don't know this when we receive the epoll notification. To check it, we need to translate the socket handle to a socket context:

```csharp
_handleToContextMap.TryGetValue(handle, out SocketAsyncContext? context);
```

and then take the two locks that I've described above.

@stephentoub (Member) commented Apr 23, 2020

Then we could consider having a dedicated epoll thread for sync-over-nonblocking sockets. EDIT: I realize this won't work well, as the socket may already be associated with a particular epoll.

I understand your pushback, but I think this is a big deal. Convince me it's not if you disagree :)

@tmds (Member) commented Apr 24, 2020

The operation may be cancelled by the time this gets dequeued for execution.

Looks like ProcessQueuedOperation accounts for that.

So the suggested change is something like:

  • processAsyncOperationSynchronously becomes processAsyncOperationOnConcurrentQueue, and enqueues to the ConcurrentQueue<AsyncOperation>
  • HandleEvents gets called directly on the epoll thread; ConcurrentQueue processing is deferred to the ThreadPool after calling HandleEvents.

Member Author

Taking the lock for each operation queue on the epoll thread seems to reduce RPS by ~20-25 K. In the new commits I made changes to track whether the first operation in each queue is synchronous, and to check that speculatively from the epoll thread. The speculative check, along with the concurrent dictionary dequeue, is not making a noticeable difference to RPS. Considering that the alternative is not incorrect, it seems a speculative check would be enough to avoid the starvation issue.

I don't think there's a need to queue AsyncOperation to the queue, but in any case that involves taking the lock, and doing so seems to reduce RPS by an additional 15-20 K on top of taking the locks. I'm not entirely sure why, though there is a bit more work involved in extracting those operations.
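
A sketch of what such a speculative check could look like; the queue internals below are invented for illustration (only the IsNextOperationSynchronous_Speculative name appears in the change). The flag is written under the queue lock but read lock-free from the epoll thread, and a stale read is tolerable because the epoll thread re-checks under the lock.

```csharp
using System.Collections.Generic;
using System.Threading;

internal sealed class OperationQueue
{
    private readonly object _lock = new object();
    private readonly Queue<(bool IsSync, ManualResetEventSlim? Event)> _ops =
        new Queue<(bool, ManualResetEventSlim?)>();
    private bool _nextIsSync; // written under _lock, read lock-free by the epoll thread

    public bool IsNextOperationSynchronous_Speculative => Volatile.Read(ref _nextIsSync);

    public void Enqueue(bool isSync, ManualResetEventSlim? evt) // sync operations carry an event
    {
        lock (_lock)
        {
            _ops.Enqueue((isSync, evt));
            if (_ops.Count == 1)
            {
                _nextIsSync = isSync;
            }
        }
    }

    // Called from the epoll thread only when the speculative flag was true;
    // re-checks under the lock and signals the blocked sync waiter inline.
    public bool TrySignalSyncHead()
    {
        lock (_lock)
        {
            if (_ops.Count > 0 && _ops.Peek().IsSync)
            {
                _ops.Dequeue().Event!.Set();
                _nextIsSync = _ops.Count > 0 && _ops.Peek().IsSync;
                return true;
            }

            return false; // not actually sync: fall back to the thread pool path
        }
    }
}
```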

Member

The code would be simpler using ConcurrentQueue<AsyncOperation>.

> Taking the lock for each operation queue on the epoll thread seems to reduce RPS by ~20-25 K.

I plan to take a shot at replacing these locks with Interlocked operations.

Member

> that involves taking the lock, and doing so seems to reduce RPS by an additional 15-20 K on top of taking the locks. I'm not entirely sure why, though there is a bit more work involved in extracting those operations.

What lock does this refer to?

Member Author

It's the lock taken in SocketAsyncContext.HandleEvent before this change (linked above by @adamsitnik); in this change, it's the lock taken in SocketAsyncContext.ProcessSyncEventOrGetAsyncEvent.

> The code would be simpler using ConcurrentQueue<AsyncOperation>

Agreed. I wanted to get that to work and tried it first, but there currently seem to be obstacles to doing that.

```diff
@@ -387,5 +445,17 @@ private bool TryRegister(SafeSocketHandle socket, IntPtr handle, out Interop.Err
         Interop.Sys.SocketEvents.Read | Interop.Sys.SocketEvents.Write, handle);
     return error == Interop.Error.SUCCESS;
 }

+private struct Event
```
Member

readonly

Member

+1 for readonly. I think we could also give it a less generic name, EpollEvent for example.

Suggested change:

```diff
-private struct Event
+private readonly struct EpollEvent
```

Member

BTW, do you think that changing the SocketEvent size from 4 bytes to 1 could improve perf in any way?

@stephentoub (Member) commented Apr 23, 2020

Member Author

Done
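
For context, the suggested shape would look something like the following (the fields here are assumptions for illustration, not the PR's actual layout); marking the struct readonly lets the compiler avoid defensive copies when instances are stored in readonly fields or passed by `in` reference.

```csharp
private readonly struct EpollEvent
{
    public EpollEvent(System.IntPtr handle, int events)
    {
        Handle = handle;
        Events = events;
    }

    public System.IntPtr Handle { get; } // assumed field: the socket handle
    public int Events { get; }           // assumed field: the epoll event mask
}
```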

@stephentoub (Member) commented

> The event-processing work items can theoretically become long-running; we should probably fix that.

By only executing at most N work items and then scheduling a replica and exiting? Seems reasonable.

@kouvel (Member Author) commented Apr 23, 2020

Yea that's what I was thinking. Maybe a batch of EventBufferCount or something like that. Will need to check perf.

@kouvel (Member Author) commented Apr 23, 2020

A smaller threshold would be better for that issue, but probably worse for perf. For instance, if processing a request takes 10 ms, it wouldn't take many of those for the work item to appear long-running. Maybe also something time-based like the thread pool does, or something else.


```csharp
// An event was successfully dequeued, and as there may be more events to process, speculatively schedule a work
// item to parallelize processing of events. Since this is only for additional parallelization, doing so
// speculatively is ok.
```
Member

Maybe it's worth mentioning explicitly that the parallelization makes it impossible for continuations to block one another.

Member Author

I don't think it's possible for a continuation to block on another continuation for the same socket, even without the parallelization done here. If a continuation blocks on a synchronous socket operation, the next epoll event that serves that blocking operation would schedule another work item. The parallelization done here is only for making use of more procs.

Member

> I don't think it's possible for a continuation to block on another continuation for the same socket even without the parallelization done here.

It may be on different sockets. And we shouldn't rely on the next epoll_wait with events to get things moving.

Member Author

If parallelization is not done here, and a blocking socket operation is performed on this thread as a result of running a user continuation (on any socket), then the set of IO events already queued would not release that blocking operation anyway, since it's a new operation that was added. Only another epoll event would get that blocking operation moving (as was the case before this change too).

However, if it's some other kind of blocking on already-queued work, for example if a user continuation blocks waiting for another already-in-progress socket operation to complete, then the parallelization done here would help unblock it more quickly, though that's not guaranteed and it can still lead to thread pool starvation; those kinds of blocking could already cause potential thread pool starvation before this change.

I'm trying to understand whether there would be a correctness issue from not parallelizing here. If there is a possibility of a correctness issue, then it may be necessary to queue up a replica before the loop instead.

Member Author

To alleviate potential issues from other kinds of blocking in user callbacks, maybe it would be safer to ensure there is a queued replica (non-speculatively) if one has not yet been scheduled.

Member Author

Made the change I mentioned above in the latest update

@tmds (Member) commented Apr 23, 2020

> Maybe a batch of EventBufferCount or something like that.

Maybe something that relates to how many events we get from epoll? EventBufferCount is the upper bound for that.
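
Reusing the fields and helper from the sketch near the top of this PR, a bounded version of ProcessEvents could cap each work item's batch and hand off the remainder (MaxEventsPerWorkItem is an assumed constant; the bound being discussed is EventBufferCount, or a count derived from how many events epoll_wait returned):

```csharp
private const int MaxEventsPerWorkItem = 32; // assumed cap, e.g. EventBufferCount

private void ProcessEvents()
{
    // Clear the flag first so a concurrent Enqueue can schedule a new work item.
    Volatile.Write(ref _workItemScheduled, 0);

    int processed = 0;
    while (_events.TryDequeue(out int socketEvent))
    {
        ScheduleWorkItemIfNeeded(); // speculative replica, as before

        ProcessEvent(socketEvent);

        if (++processed >= MaxEventsPerWorkItem)
        {
            // Hand the remainder to a fresh work item and exit, so this work
            // item never appears long-running to the thread pool.
            if (!_events.IsEmpty)
            {
                ScheduleWorkItemIfNeeded();
            }
            return;
        }
    }
}
```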

@adamsitnik added the os-linux (Linux OS (any supported distro)) and tenet-performance (Performance related issue) labels Apr 23, 2020
@kouvel (Member Author) commented Apr 25, 2020

No significant change to perf on the 28-proc x64 machine with the latest changes

…irst dequeue, delegating scheduling of more work items to other threads
@kouvel (Member Author) commented Apr 30, 2020

Addressed feedback above and added a few more comments.

@kouvel (Member Author) commented May 1, 2020

Updated numbers below with preview 5 SDK. These are with hill climbing disabled.

JsonPlatform

28-proc x64 machine

512 connections:

| Epoll threads | Before | After | Diff |
| --- | --- | --- | --- |
| 16 | 937492 | 983232 | 4.9% |
| 4 | 1054384 | 1095836 | 3.9% |
| 2 | 1004742 | 1136945 | 13.2% |
| 1 | 717291 | 1175142 | 63.8% |
| Max | 1054384 | 1175142 | 11.5% |

12-proc x64 machine

512 connections:

| Epoll threads | Before | After | Diff |
| --- | --- | --- | --- |
| 16 | 462025 | 502741 | 8.8% |
| 4 | 486645 | 536467 | 10.2% |
| 2 | 509969 | 568554 | 11.5% |
| 1 | 525168 | 586676 | 11.7% |
| Max | 525168 | 586676 | 11.7% |

32-proc arm64 machine

512 connections:

| Epoll threads | Before | After | Diff |
| --- | --- | --- | --- |
| 16 | 478542 | 501044 | 4.7% |
| 4 | 527463 | 533571 | 1.2% |
| 2 | 495679 | 516062 | 4.1% |
| 1 | 471666 | 448060 | -5.0% |
| Max | 527463 | 533571 | 1.2% |

I'm not seeing an issue with epoll thread count > 1. Decreasing latency between getting an epoll notification and scheduling a thread didn't seem to help.

12-proc x64 machine with cpuset 0-3

To roughly simulate a smaller VM.

256 connections:

| Epoll threads | Before | After | Diff |
| --- | --- | --- | --- |
| 8 | 343991 | 342262 | -0.5% |
| 4 | 343323 | 345447 | 0.6% |
| 2 | 341536 | 340198 | -0.4% |
| 1 | 339642 | 341089 | 0.4% |
| Max | 343991 | 345447 | 0.4% |

FortunesPlatform

This benchmark seems to be affected by the number of connections and epoll threads. On the x64 machines, in some cases with 512 connections and 1 epoll thread the change seems to be performing slightly worse than the baseline, while with 256 connections and 1 epoll thread the change seems to be performing slightly better.

28-proc x64 machine

256 connections:

| Epoll threads | Before | After | Diff |
| --- | --- | --- | --- |
| 8 | 295163 | 303885 | 3.0% |
| 4 | 308736 | 314709 | 1.9% |
| 2 | 319814 | 324905 | 1.6% |
| 1 | 322504 | 334484 | 3.7% |
| Max | 322504 | 334484 | 3.7% |

512 connections:

| Epoll threads | Before | After | Diff |
| --- | --- | --- | --- |
| 16 | 295230 | 306175 | 3.7% |
| 4 | 313131 | 303398 | -3.1% |
| 2 | 320548 | 311698 | -2.8% |
| 1 | 326749 | 314887 | -3.6% |
| Max | 326749 | 314887 | -3.6% |

12-proc x64 machine

256 connections:

| Epoll threads | Before | After | Diff |
| --- | --- | --- | --- |
| 8 | 125501 | 131392 | 4.7% |
| 4 | 124756 | 131443 | 5.4% |
| 2 | 126811 | 133650 | 5.4% |
| 1 | 131359 | 138170 | 5.2% |
| Max | 131359 | 138170 | 5.2% |

512 connections:

| Epoll threads | Before | After | Diff |
| --- | --- | --- | --- |
| 16 | 123050 | 131718 | 7.0% |
| 4 | 122490 | 128844 | 5.2% |
| 2 | 125194 | 128060 | 2.3% |
| 1 | 127967 | 129658 | 1.3% |
| Max | 127967 | 131718 | 2.9% |

32-proc arm64 machine

256 connections:

| Epoll threads | Before | After | Diff |
| --- | --- | --- | --- |
| 8 | 85678 | 93139 | 8.7% |
| 4 | 95925 | 94786 | -1.2% |
| 2 | 89028 | 103297 | 16.0% |
| 1 | 87266 | 99467 | 14.0% |
| Max | 95925 | 103297 | 7.7% |

512 connections:

| Epoll threads | Before | After | Diff |
| --- | --- | --- | --- |
| 16 | 96175 | 92792 | -3.5% |
| 4 | 100371 | 100783 | 0.4% |
| 2 | 98390 | 106206 | 7.9% |
| 1 | 100739 | 102455 | 1.7% |
| Max | 100739 | 106206 | 5.4% |

12-proc x64 machine with cpuset 0-3

To roughly simulate a smaller VM.

256 connections:

| Epoll threads | Before | After | Diff |
| --- | --- | --- | --- |
| 8 | 79052 | 90081 | 14.0% |
| 4 | 78456 | 88628 | 13.0% |
| 2 | 78437 | 88895 | 13.3% |
| 1 | 78357 | 88789 | 13.3% |
| Max | 79052 | 90081 | 14.0% |

@kouvel (Member Author) commented May 1, 2020

For the sync operation case, I tried having a server do synchronous reads after an async operation on 256 sockets, while a client writes to those sockets using async operations. Before the fixes, the starvation issue appeared very quickly (within a couple of seconds). With the baseline, and after the fixes, it did not hit a noticeable starvation issue even after minutes. I don't think it would be easy to repro the races, or at least they don't seem frequent enough to trigger a starvation sequence.
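
A rough, simplified sketch of the shape of that repro (a hypothetical standalone program, with one connection shown instead of 256):

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

class SyncAfterAsyncRepro
{
    static async Task Main()
    {
        using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));
        listener.Listen(1);

        using var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        await client.ConnectAsync(listener.LocalEndPoint!);
        using Socket server = await listener.AcceptAsync();

        byte[] buffer = new byte[1];

        // One async receive permanently moves the socket to non-blocking mode...
        Task<int> pending = server.ReceiveAsync(new ArraySegment<byte>(buffer), SocketFlags.None);
        await client.SendAsync(new ArraySegment<byte>(buffer), SocketFlags.None);
        await pending;

        // ...so every later sync Receive is simulated blocking: it waits on an event
        // that an epoll notification must set. If that signal has to go through a
        // saturated thread pool, these reads stall (the starvation scenario above).
        Task writes = Task.Run(async () =>
        {
            for (int i = 0; i < 1000; i++)
            {
                await client.SendAsync(new ArraySegment<byte>(buffer), SocketFlags.None);
            }
        });

        for (int i = 0; i < 1000; i++)
        {
            server.Receive(buffer); // synchronous read on the now non-blocking socket
        }

        await writes;
    }
}
```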

@kouvel (Member Author) commented May 1, 2020

Hopefully this PR is close to check-in now. I probably won't be able to spend much time on it this week or next, but let me know if I can help to unblock.

@adamsitnik (Member) left a comment

Looks great to me! Thank you @kouvel!!

@adamsitnik (Member) commented

The PR looks ready to me.

@stephentoub @tmds is there anything that should be addressed? If not, I would like to merge this PR.

@benaadams (Member) commented

> FortunesPlatform
>
> This benchmark seems to be affected by the number of connections and epoll threads.

It does have two types of socket connections, one for the DB and the other for HTTP, so the dual load might be a factor.

@kouvel (Member Author) commented May 1, 2020

@sebastienros kindly collected some numbers on an x64 VM with 4 procs in the same modes as above; here are the results.

JsonPlatform

256 connections:

| Epoll threads | Before | After | Diff |
| --- | --- | --- | --- |
| 8 | 138547 | 145501 | 5.0% |
| 1 | 142676 | 141960 | -0.5% |
| Max | 142676 | 145501 | 2.0% |

FortunesPlatform

256 connections:

| Epoll threads | Before | After | Diff |
| --- | --- | --- | --- |
| 8 | 16736 | 16980 | 1.5% |
| 1 | 14911 | 15216 | 2.0% |
| Max | 16736 | 16980 | 1.5% |

The numbers are pretty close; there doesn't appear to be a noticeable regression. On this VM, FortunesPlatform seems to perform better with more epoll threads, both before and after the change. The extra load may have something to do with it. I'm not seeing a clear pattern yet, but the diff between 8 and 1 epoll threads seems to be higher than on the other machines.

@stephentoub (Member) commented

> For the sync operation case, I tried having a server do synchronous reads after an async operation on 256 sockets, while a client writes to those sockets using async operations. Before the fixes, the starvation issue appeared very quickly (within a couple of seconds). With the baseline, and after the fixes, it did not hit a noticeable starvation issue even after minutes. I don't think it would be easy to repro the races, or at least they don't seem frequent enough to trigger a starvation sequence.

Ok, thanks for fixing and confirming.

@sebastienros (Member) commented May 3, 2020

First updated numbers should be available tomorrow morning.
Edit: Assuming a runtime build is successful before midnight


```csharp
if ((events & Interop.Sys.SocketEvents.Read) != 0 &&
    _receiveQueue.IsNextOperationSynchronous_Speculative &&
    _receiveQueue.ProcessSyncEventOrGetAsyncEvent(this) == null)
```
Member

@kouvel @stephentoub @adamsitnik I think there is an issue when IsNextOperationSynchronous_Speculative is true but the operation is not really a SyncEvent operation. The queue moves to Processing, but no one dispatches the operation.

Member Author

Nice catch, will fix

@kouvel deleted the ParallelizeEpollEventsProcessInline branch May 6, 2020 13:12
kouvel added a commit to kouvel/runtime that referenced this pull request May 7, 2020
Fixes dotnet#35330 (comment) by skipping state transitions when an async operation needs to be processed.
stephentoub pushed a commit that referenced this pull request May 7, 2020
Fixes #35330 (comment) by skipping state transitions when an async operation needs to be processed.
tmds added a commit to tmds/aspnetcore that referenced this pull request May 15, 2020
Applies the technique from dotnet/runtime#35330 to IOQueue.
@ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
Labels: area-System.Net.Sockets, os-linux, tenet-performance