
[release/5.0] Fix regression in SocketAsyncContext.Unix (sequential processing of reads & writes) #46745

Merged: 1 commit merged into dotnet:release/5.0 on Feb 11, 2021

Conversation

@antonfirsov (Member) commented Jan 8, 2021

(Servicing bar check, as there are no customer reports at this stage.)

Backport of #45683
Fixes #45673

/cc @geoffkizer @tmds @karelz

Customer Impact

Found by @tmds (Red Hat) via code inspection.

When we get epoll notifications for a particular socket, we can get either a read notification, a write notification, or both. Each notification will cause us to perform the IO and then invoke the user callback.

Prior to PR #37974, when we got both notifications, we would dispatch one notification to the thread pool so that both user callbacks can be processed concurrently.

Unfortunately, #37974 inadvertently broke this behavior, and instead resulted in the notifications being processed in sequence. This means that the second IO operation and callback won't be invoked until the first callback completes, which could in theory take arbitrarily long.

This can lead to unexpected behavior and, at worst, deadlocks. It's probably not that common in practice, but it would be extremely hard to diagnose if it was hit.
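
For illustration, here is a minimal, hypothetical usage pattern (not taken from the linked issue; the method name and buffers are invented for this sketch) in which the sequential processing can deadlock: the receive continuation blocks waiting for the send, but with the regression the send completion is only dispatched after that continuation returns.

using System;
using System.Net.Sockets;
using System.Threading.Tasks;

static async Task IllustrateHazardAsync(Socket socket)
{
    byte[] outgoing = new byte[1024];
    byte[] incoming = new byte[1024];

    // If both operations end up pending at the same time, epoll may report read and
    // write readiness for this socket in a single notification.
    Task<int> send = socket.SendAsync(new ArraySegment<byte>(outgoing), SocketFlags.None);
    Task<int> receive = socket.ReceiveAsync(new ArraySegment<byte>(incoming), SocketFlags.None);

    await receive;  // the receive continuation runs first...
    send.Wait();    // ...and blocks the thread that, with the regression, would also have
                    // to invoke the send completion afterwards, so Wait() can hang.
}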

Regression?

Yes, caused by #37974 in 5.0.

Testing

It's not really possible to write a test for this because it requires epoll to signal both the read and the write notification in the same epoll event, which is timing dependent and very difficult to make happen consistently.

Performance

Since the change alters a performance-critical path, we decided to run some of the TechEmpower benchmarks. As expected, there is no regression. For results see #46745 (comment).

Risk

Low. The fix basically reverts to the previous correct behavior here (the same behavior as 3.1) and isolates the changes from #37974 so they do not affect the common path. The fix has been in master for a month. We ran TechEmpower as a perf check, which also provided some stress coverage (although TechEmpower didn't hit the problem before it was fixed either).

…tnet#45683)

* SocketAsyncContext.Unix: fix double processing of AsyncOperations

When a socket event occurs, the pending operations get triggered
to continue their work by calling the Process method.

The changes in dotnet#37974
cause Process to be called twice on the same AsyncOperation.

When Process is called, the operation can complete, and the
AsyncOperation instance may be reused for a different operation.

* Remove processAsyncEvents
@ghost commented Jan 8, 2021

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: antonfirsov
Assignees: -
Labels: area-System.Net.Sockets
Milestone: -

@antonfirsov (Member, Author) commented Jan 8, 2021

Although it is very unlikely (if not impossible) that this will result in a perf regression, we need to run TechEmpower with these changes.

@adamsitnik can you help with this? I'm fine doing it myself after a quick offline guidance, but I'm out of practice.

@antonfirsov added the Servicing-consider label Jan 8, 2021
@antonfirsov requested a review from a team Jan 8, 2021 14:05

@adamsitnik (Member) commented:

> can you help with this? I'm fine doing it myself after a quick offline guidance, but I'm out of practice

https://github.com/aspnet/Benchmarks/blob/master/scenarios/README.md

dotnet tool install Microsoft.Crank.Controller --version "0.1.0-*" --global
crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/master/scenarios/platform.benchmarks.yml \
      --scenario plaintext --profile aspnet-citrine-lin

To profile and get a trace file: --application.collect true
To use a given .dll in the published app: --application.options.outputFiles $pathToFile.dll

@tmds (Member) commented Jan 8, 2021

> Although it is very unlikely (if not impossible) that this will result in a perf regression, we need to run TechEmpower with these changes.

This shouldn't impact TechEmpower, and if it does, we'd still want the change.
So this is mostly to check the impact.

@antonfirsov (Member, Author) commented Jan 8, 2021

I did 3 runs each of the plaintext and json scenarios. As expected, the differences do not seem to be statistically significant.

plaintext (RPS)

          release/5.0           pr        delta
           10,764,605   10,781,788
           10,835,379   10,909,136
           10,798,054   10,806,531
average    10,799,346   10,832,485       0.31%

json (RPS)

          release/5.0           pr        delta
            1,212,258    1,199,259
            1,209,587    1,197,367
            1,199,528    1,212,231
average 1,207,124.333 1,202,952.333     -0.35%

@tmds (Member) commented Jan 12, 2021

@antonfirsov @geoffkizer thanks!


@adamsitnik (Member) left a comment:


LGTM, thank you for verifying perf before merging the changes, @antonfirsov!

@karelz (Member) commented Jan 12, 2021

@danmosemsft this is ready for 5.0 servicing ...

@antonfirsov (Member, Author) commented:

Test failures are unrelated:

#15376, #41530, #41531 and a hang in System.Threading.Channels.Tests.BoundedChannelTests.ManyProducerConsumer_ConcurrentReadWrite_WithBufferedCapacity_Success.

@danmoseley (Member) commented:

Tactics is sympathetic but is looking for more clarity around how likely this is to happen, given that no customers have reported it. Maybe @geoffkizer could come to a future Tactics meeting.

@geoffkizer (Contributor) commented:

It's pretty unlikely, because it requires a combination of a couple of factors, plus lucky timing where both notifications occur at the same time. It's a bit more likely to happen under load, because load stresses the timing factors.

Even when it does happen, the impact is largely benign unless you happen to do something complicated in the first callback, which is unusual.

The problem is that if you do hit this, the impact could be very very bad (deadlock), and extremely hard to diagnose.

@leecow modified the milestones: 5.0.x, 5.0.3 Jan 12, 2021
@danmoseley (Member) commented:

I wonder how we can help Tactics make a decision (about whether to wait for customer evidence of impact or not)

@geoffkizer any concerns about the risk here -- it looks like the code was refactored since 3.1, so I can't simply "eyeball and see it's the same". I don't know how to evaluate the risk of something in this code. Reducing the risk would reduce the numerator in risk/benefit, so I wonder if we could/should do more to reduce it somehow.

@stephentoub any other perspective about the risk here?

@geoffkizer (Contributor) commented:

Here's the change that introduced the bug: antonfirsov@205f997#diff-0784c9867dba825371f6c43638e11610bbc7992527e51b6f25b31971641f5176

The only code that matters here is the change to ProcessSyncEventOrGetAsyncEvent. The rest of the code only kicks in when you have a particular app context switch set to enable "inline completions" (which is what #37974 was intended to introduce).

Here's the fix we want to backport: antonfirsov@6e12567#diff-0784c9867dba825371f6c43638e11610bbc7992527e51b6f25b31971641f5176

Again, just look at ProcessSyncEventOrGetAsyncEvent and ignore the changes in HandleInlineCompletions.

You'll see that the fix simply reverts the code in ProcessSyncEventOrGetAsyncEvent to what it did previously.

It's true that all this code has been refactored a bit since 3.1, but at least we are reverting to a known state in 5.0 here.

@geoffkizer (Contributor) commented:

(BTW if anyone knows how to make github show that diff directly, please let me know :))

@geoffkizer (Contributor) commented:

All that said, this code did change a fair amount during 5.0, so I'm not sure how much stock to put into reverting to this previous 5.0 behavior. It looks like ProcessSyncEventOrGetAsyncEvent and the related logic were heavily refactored in this change and some previous ones: antonfirsov@8157271#diff-0784c9867dba825371f6c43638e11610bbc7992527e51b6f25b31971641f5176

That change landed roughly 6 weeks before the bad PR above, so the "known 5.0 state" was only the state of the code for about 6 weeks in the middle of the 5.0 cycle.

@geoffkizer (Contributor) commented:

Adding @kouvel who made the change I mentioned in the last comment.

The only thing that I think is potentially scary here is that with this fix, we now go through the "two operations to process, so schedule one" path in HandleEvents. This path didn't exist in 3.1, and was only really exercised for that 6 week period during 5.0 before the bad PR went in.

Kount, any thoughts here?

@tmds (Member) commented Jan 13, 2021

The fix is related to this comment:

// This method is called from a thread pool thread. When we have only one operation to process, process it
// synchronously to avoid an extra thread pool work item. When we have two operations to process, processing both
// synchronously may delay the second operation, so schedule one onto the thread pool and process the other
// synchronously. There might be better ways of doing this.
if (sendOperation == null)
{
    receiveOperation?.Process();
}
else
{
    receiveOperation?.Schedule();
    sendOperation.Process();
}

The bug causes Process to be called for both operations, while we want to Schedule one and Process the other. That is to avoid a deadlock between the operations. Something like:

Task sendOp = socket.SendAsync(...);
Task receiveOp = socket.ReceiveAsync(...);
await receiveOp;
sendOp.Wait();

I don't think this is a common pattern when using a Socket. Typically the Send and Receive operations are interleaved, or independent. Interleaved doesn't have an issue.
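
For contrast, a hypothetical interleaved request/response loop (the method name and buffer handling below are invented for this sketch) has at most one operation pending at any time, so there is never a second completion waiting behind the first callback:

using System;
using System.Net.Sockets;
using System.Threading.Tasks;

static async Task RequestResponseLoopAsync(Socket socket)
{
    byte[] buffer = new byte[4096];
    while (true)
    {
        // Only one socket operation is outstanding at a time in this pattern.
        int received = await socket.ReceiveAsync(new ArraySegment<byte>(buffer), SocketFlags.None);
        if (received == 0)
            break; // peer closed the connection

        await socket.SendAsync(new ArraySegment<byte>(buffer, 0, received), SocketFlags.None);
    }
}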

Independent would be something like:

// Receive loop
while (true)
{
    await socket.ReceiveAsync(...);
    ProcessData();
}

// Send loop
while (true)
{
    await socket.SendAsync(...);
}

When both events happen simultaneously, the SendAsync continuation will be delayed by ProcessData.
In the regular case, SendAsync completes immediately. It only becomes asynchronous when the peer doesn't acknowledge receiving the data and the local send buffer is full; then we wait for the peer's acknowledgements. ProcessData adds a (potentially unbounded) amount of delay on top of that.

It is not likely to happen, but when it does, it would be hard to debug that the ReceiveAsync continuation is what keeps the SendAsync from completing.

@kouvel (Member) commented Jan 13, 2021

A couple of questions regarding:

> The bug causes Process to be called for both operations, while we want to Schedule one and Process the other. That is to avoid a deadlock between the operations.

  • From what I can tell between the change that caused the issue and the change that fixed it, the call to Process is moved from ProcessSyncEventOrGetAsyncEvent to HandleEventsInline
  • I don't see any change when going through HandleEvents, so it seems like the issue is only relevant when going through HandleEventsInline. Is there an issue in the default path?
  • When going through HandleEventsInline, even after the fix it looks like Process would be called on both operations (just in a different place). How does the fix solve the problem quoted above?

@tmds (Member) commented Jan 13, 2021

> I don't see any change when going through HandleEvents, so it seems like the issue is only relevant when going through HandleEventsInline. Is there an issue in the default path?

When HandleEvents calls ProcessSyncEventOrGetAsyncEvent, that method calls Process on the operation and returns null here (instead of handing it back to HandleEvents):

if (processAsyncEvents)
{
    op.Process();
    return null;
}
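
To make the consequence concrete, here is a small runnable toy model (not the runtime's actual types; AsyncOperation, Process and Schedule below are stand-ins with the same shape as in the snippet quoted earlier) of the "schedule one, process the other" dispatch that the fix makes reachable again. With the bug, both callbacks have already run sequentially inside ProcessSyncEventOrGetAsyncEvent, so this logic never gets a chance to run them concurrently.

using System;
using System.Threading;

// Toy stand-in for the runtime's AsyncOperation: Process runs the user callback inline,
// Schedule pushes it to the thread pool.
class AsyncOperation
{
    private readonly Action _callback;
    public AsyncOperation(Action callback) => _callback = callback;

    public void Process() => _callback();
    public void Schedule() => ThreadPool.QueueUserWorkItem(_ => _callback());
}

static class EventDispatchSketch
{
    // Mirrors the dispatch logic quoted above: with two completions ready, one goes to
    // the thread pool and the other runs inline, so the user callbacks can overlap.
    public static void HandleBothEvents(AsyncOperation receiveOperation, AsyncOperation sendOperation)
    {
        if (sendOperation == null)
        {
            receiveOperation?.Process();
        }
        else
        {
            receiveOperation?.Schedule();
            sendOperation.Process();
        }
    }
}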

@kouvel (Member) commented Jan 13, 2021

Ah, I missed that the default for the processAsyncEvents parameter to ProcessSyncEventOrGetAsyncEvent is true. So then the issue exists in default mode, probably worth fixing to avoid the rare and bad cases.

@danmoseley added the NO-MERGE label Jan 14, 2021
@danmoseley (Member) commented:

I'm gonna mark no merge for a bit until we're ready to go back to tactics with our updated thoughts and reasoning.

@danmoseley removed the Servicing-consider label Jan 19, 2021
@danmoseley (Member) commented:

Removing servicing-consider for now. Let's apply it again when you believe we are ready to go to tactics to discuss risk/benefit again.

@karelz modified the milestones: 5.0.3, 5.0.x Feb 4, 2021
@karelz (Member) commented Feb 4, 2021

@geoffkizer are we ready to go back to Tactics?

@karelz (Member) commented Feb 9, 2021

@danmosemsft we will bring it to Tactics on Thu if that is ok -- @geoffkizer should be there to talk about it (just sent him an invite).

@danmoseley (Member) commented:

Sounds good to me

@danmoseley added the Servicing-consider label and removed the NO-MERGE label Feb 9, 2021
@leecow added the Servicing-approved label and removed the Servicing-consider label Feb 11, 2021
@leecow modified the milestones: 5.0.x, 5.0.4 Feb 11, 2021
@Anipik merged commit 2a430b4 into dotnet:release/5.0 Feb 11, 2021
@ghost locked as resolved and limited conversation to collaborators Mar 13, 2021