Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase some of the DnsEndPointTest timeouts #58129

Merged
merged 7 commits into from
Sep 3, 2021

Conversation

antonfirsov
Copy link
Member

Long story short:
I have spent several hours trying to get a test run on SLES in order to repro #57929, no success so far, and it may take days to make progress, since I'm unfamiliar with SLES.

I recommend to bump the timeout values for now, and see if it helps with the issue. If not, we may invest into another round of investigation.

(hopefully) fixes #57929

@ghost
Copy link

ghost commented Aug 25, 2021

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

Long story short:
I have spent several hours trying to get a test run on SLES in order to repro #57929, no success so far, and it may take days to make progress, since I'm unfamiliar with SLES.

I recommend to bump the timeout values for now, and see if it helps with the issue. If not, we may invest into another round of investigation.

(hopefully) fixes #57929

Author: antonfirsov
Assignees: -
Labels:

area-System.Net.Sockets

Milestone: -

@antonfirsov antonfirsov changed the title Sockets/dns connect timeouts Increase some of the DnsEndPointTest timeouts Aug 25, 2021
@antonfirsov
Copy link
Member Author

/azp run runtime-libraries-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@@ -272,7 +272,7 @@ public void Socket_ConnectAsyncDnsEndPoint_HostNotFound()
bool willRaiseEvent = sock.ConnectAsync(args);
if (willRaiseEvent)
{
Assert.True(complete.WaitOne(TestSettings.PassingTestTimeout), "Timed out while waiting for connection");
Assert.True(complete.WaitOne(TimeSpan.FromSeconds(30)), "Timed out while waiting for connection");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would do it as TestSettings.PassingTestTimeout * 3 e.g. factor of baseline instead of hardcoded value.
(or add new constant)

Copy link
Member Author

@antonfirsov antonfirsov Aug 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can rename the 10 sec PassingTestTimeout to PassingTestShortTimeout, and define another constant PassingTestLongTimeout = 30 sec ?

But maybe better to do it in a separate PR, to keep the changes in this one minimal? (It's about to be backported.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be actually better to do the rename in the release/* branches too to keep the code in sync, avoiding conflicts in future PR-s. I'm undecided.

@wfurt
Copy link
Member

wfurt commented Aug 25, 2021

Did you get VM from the Helix gallery @antonfirsov ?
Can you check content of /etc/resolv.conf and /etc/nsswitch.conf?
AFAIK SLES should be using normal glibc resolver....

@antonfirsov
Copy link
Member Author

antonfirsov commented Aug 26, 2021

@wfurt also reacting to #57929 (comment):

Note the most problematic tests (Socket_ConnectAsyncDnsEndPoint_HostNotFound and Socket_StaticConnectAsync_HostNotFound) are negative tests which check the error code resolving notahostname.invalid.corp.microsoft.com. There is a sync variant of those tests which never fails:

[OuterLoop]
[Fact]
public void Socket_ConnectDnsEndPoint_Failure()
{
using (Socket sock = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp))
{
SocketException ex = Assert.ThrowsAny<SocketException>(() =>
{
sock.Connect(new DnsEndPoint(Configuration.Sockets.InvalidHost, UnusedPort));
});
SocketError errorCode = ex.SocketErrorCode;
Assert.True((errorCode == SocketError.HostNotFound) || (errorCode == SocketError.NoData),
$"SocketErrorCode: {errorCode}");

I believe this means that the OS level name resolution works fine, and the problem is more likely some sort of threadpool execution timeout, since async name resolution is actually is async-over-sync on Unix.

I don't think this is CPU intensive test.

It isn't but the rest of the socket test suite is using threadpool threads in parallel quite heavily, and we've seen timeouts because of that. The only difference between release/* and main queues is CPU, which can explain different thread pool behavior. Of course, this is only speculation, but I would rather merge this change on trial and error basis, than spend days figuring out SLES package management. If the the failures disappear, or get reduced significantly after this change, that would be a sign that there is no serious bug, if not, that would mean that we actually have an async hang, which would justify the investigation.

@wfurt
Copy link
Member

wfurt commented Aug 26, 2021

Do we have any stats how long does it take to run the sync test comparing to other platforms?
I'm generally OK making the timeout longer and waiting if that makes difference.

@antonfirsov
Copy link
Member Author

antonfirsov commented Aug 26, 2021

I did some microbenchmarks for the failing connect, numbers were something like:

Windows (surface laptop): 280 ms, Ubuntu 100ms, SLES: 5ms. So weirdly SLES is actually the fastest (at least on the Helix DTL machine I got). I think if there is a problem it's not with DNS but rather with the threadpool or our async-over-sync resolution code. (Or the way it works with DnsConnectAsync.)

@wfurt
Copy link
Member

wfurt commented Aug 26, 2021

Would it make sense to move it to the non-parallel bucket?

@antonfirsov
Copy link
Member Author

Would it make sense to move it to the non-parallel bucket?

I think it would make the error go away with higher chance, but also hide potential issues with the async-over-sync code, plus the usual penalty in test run time on all platforms, so I prefer not to do it in this particular case.

@antonfirsov
Copy link
Member Author

Reacting to #58129 (comment), I pushed a change that renames existing TestSettings.PassingTestTimeout to PassingTestShortTimeout and introducing PassingTestLongTimeout = 30sec. This is a necessary change to improve the maintainability for the code on master, backporting the entire change to 6.0 will make it easier to adapt future test improvements.

@antonfirsov

This comment has been minimized.

@azure-pipelines

This comment has been minimized.

@karelz karelz added this to the 7.0.0 milestone Aug 31, 2021
@@ -8,9 +8,10 @@ namespace System.Net.Sockets.Tests
public static class TestSettings
{
// Timeout values in milliseconds.
public const int PassingTestTimeout = 10000;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you keep this untouched to be consistent with other tests? e.g. PassingTestTimeout is normal/default timeout. If we need anything different for come case we would make it with specific name - like PassingTestLongTimeout (that part looks ok to me)

Copy link
Member Author

@antonfirsov antonfirsov Sep 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted this renaming to reduce the diff size.

I think more verbosity in the names wouldn't hurt here. 10 sec is quite a short time in comparison to Http.Tests's default PassingTestTimeout=30sec. We use hardcoded 30 seconds at a bunch of places in socket tests, which we will eventually replace with PassingTestLongTimeout, meaning that the "default" PassingTestTimeout=10sec will become much less common.

@karelz
Copy link
Member

karelz commented Sep 1, 2021

Test run reported networking test failures 4x this specific test:

System.Net.Http.Functional.Tests.SocketsHttpHandler_HttpClientHandler_ServerCertificates_Test.UseCallback_ValidCertificate_ExpectedValuesDuringCallback(remoteServer: (BaseUri: https://corefx-net-http11.azurewebsites.net/, HttpVersion: 1.1), url: https://corefx-net-http11.azurewebsites.net/Echo.ashx, checkRevocation: True) [FAIL]

      System.Net.Http.HttpRequestException : The SSL connection could not be established, see inner exception.
      ---- Expected None, got RemoteCertificateChainErrors with chain status RevocationStatusUnknown, OfflineRevocation
      Expected: True
      Actual:   False
      Stack Trace:
        /_/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/ConnectHelper.cs(92,0): at System.Net.Http.ConnectHelper.EstablishSslConnectionAsync(SslClientAuthenticationOptions sslOptions, HttpRequestMessage request, Boolean async, Stream stream, CancellationToken cancellationToken)
        /_/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/HttpConnectionPool.cs(1364,0): at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
        /_/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/HttpConnectionPool.cs(1430,0): at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
        /_/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/HttpConnectionPool.cs(454,0): at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(HttpRequestMessage request)
        /_/src/libraries/Common/src/System/Threading/Tasks/TaskCompletionSourceWithCancellation.cs(21,0): at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
        /_/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/HttpConnectionPool.cs(548,0): at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
        /_/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/HttpConnectionPool.cs(988,0): at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
        /_/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/RedirectHandler.cs(30,0): at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
        /_/src/libraries/Common/tests/System/Net/Http/HttpClientHandlerTestBase.cs(119,0): at System.Net.Http.Functional.Tests.HttpClientHandlerTestBase.VersionCheckerHttpHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
        /_/src/libraries/System.Net.Http/src/System/Net/Http/HttpClient.cs(527,0): at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
        /_/src/libraries/Common/tests/System/Net/Http/HttpClientHandlerTest.ServerCertificates.cs(173,0): at System.Net.Http.Functional.Tests.HttpClientHandler_ServerCertificates_Test.UseCallback_ValidCertificate_ExpectedValuesDuringCallback(RemoteServer remoteServer, Uri url, Boolean checkRevocation)
        --- End of stack trace from previous location ---
        ----- Inner Stack Trace -----
        /_/src/libraries/Common/tests/System/Net/Http/HttpClientHandlerTest.ServerCertificates.cs(159,0): at System.Net.Http.Functional.Tests.HttpClientHandler_ServerCertificates_Test.<>c__DisplayClass14_1.<UseCallback_ValidCertificate_ExpectedValuesDuringCallback>b__0(HttpRequestMessage request, X509Certificate2 cert, X509Chain chain, SslPolicyErrors errors)
        /_/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/ConnectHelper.cs(49,0): at System.Net.Http.ConnectHelper.<>c__DisplayClass1_0.<SetUpRemoteCertificateValidationCallback>b__0(Object sender, X509Certificate certificate, X509Chain chain, SslPolicyErrors sslPolicyErrors)
        /_/src/libraries/System.Net.Security/src/System/Net/Security/SecureChannel.cs(998,0): at System.Net.Security.SecureChannel.VerifyRemoteCertificate(RemoteCertificateValidationCallback remoteCertValidationCallback, SslCertificateTrust trust, ProtocolToken& alertToken, SslPolicyErrors& sslPolicyErrors, X509ChainStatusFlags& chainStatus)
        /_/src/libraries/System.Net.Security/src/System/Net/Security/SslStream.Implementation.cs(639,0): at System.Net.Security.SslStream.CompleteHandshake(ProtocolToken& alertToken, SslPolicyErrors& sslPolicyErrors, X509ChainStatusFlags& chainStatus)
        /_/src/libraries/System.Net.Security/src/System/Net/Security/SslStream.Implementation.cs(652,0): at System.Net.Security.SslStream.CompleteHandshake(SslAuthenticationOptions sslAuthenticationOptions)
        /_/src/libraries/System.Net.Security/src/System/Net/Security/SslStream.Implementation.cs(435,0): at System.Net.Security.SslStream.ForceAuthenticationAsync[TIOAdapter](TIOAdapter adapter, Boolean receiveFirst, Byte[] reAuthenticationData, Boolean isApm)
        /_/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/ConnectHelper.cs(68,0): at System.Net.Http.ConnectHelper.EstablishSslConnectionAsync(SslClientAuthenticationOptions sslOptions, HttpRequestMessage request, Boolean async, Stream stream, CancellationToken cancellationToken)

That may be just some networking hiccup I guess ... cc @wfurt

Rerunning the failed legs.

@antonfirsov

This comment has been minimized.

@azure-pipelines

This comment has been minimized.

Copy link
Member

@wfurt wfurt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@karelz
Copy link
Member

karelz commented Sep 2, 2021

Test failures:

  • BeginReceiveFromV4BoundToSpecificV4_Success timeout - likely unrelated - @antonfirsov can you please confirm?
    System.Net.Sockets.Tests.DualModeConnectionlessBeginReceiveFrom.BeginReceiveFromV4BoundToSpecificV4_Success [FAIL]
      System.TimeoutException : The operation has timed out.
      Stack Trace:
        /_/src/libraries/System.Net.Sockets/tests/FunctionalTests/DualModeSocketTest.cs(1612,0): at System.Net.Sockets.Tests.DualModeConnectionlessBeginReceiveFrom.BeginReceiveFrom_Helper(IPAddress listenOn, IPAddress connectTo, Boolean expectedToTimeout)
        /_/src/libraries/System.Net.Sockets/tests/FunctionalTests/DualModeSocketTest.cs(1525,0): at System.Net.Sockets.Tests.DualModeConnectionlessBeginReceiveFrom.BeginReceiveFromV4BoundToSpecificV4_Success()
  • Azure storage outage? -- .packages/microsoft.dotnet.helix.sdk/6.0.0-beta.21427.6/tools/Microsoft.DotNet.Helix.Sdk.MonoQueue.targets(55,5): error : (NETCORE_ENGINEERING_TELEMETRY=Helix) StorageException: nodename nor servname provided, or not known (helixde8s23ayyeko0k025g8.blob.core.windows.net:443) (logs)
  • 2x System.Runtime.Tests killed - unrelated failures - Debian.11 and Debian.10
/root/helix/work/workitem /root/helix/work/workitem
  Discovering: System.Runtime.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Runtime.Tests (found 28 of 6264 test cases)
  Starting:    System.Runtime.Tests (parallel test collections = on, max threads = 2)
./RunTests.sh: line 162:    24 Killed
  • runtime-staging - unrelated Helix hiccups

Rerunning tests.

@antonfirsov
Copy link
Member Author

Yes, BeginReceiveFromV4BoundToSpecificV4_Success is unrelated.

@karelz
Copy link
Member

karelz commented Sep 3, 2021

Same System.Runtime.Tests unrelated failures on Debian.10 and Debian.11 - tracked by #58616:

  Discovering: System.Runtime.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Runtime.Tests (found 28 of 6264 test cases)
  Starting:    System.Runtime.Tests (parallel test collections = on, max threads = 2)
./RunTests.sh: line 162:    24 Killed

@karelz karelz merged commit 566b3eb into dotnet:main Sep 3, 2021
@karelz
Copy link
Member

karelz commented Sep 3, 2021

/backport to release/6.0

@github-actions
Copy link
Contributor

github-actions bot commented Sep 3, 2021

Started backporting to release/6.0: https://github.com/dotnet/runtime/actions/runs/1197645068

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tests time out: System.Net.Sockets.Tests.DnsEndPointTest - mostly on SLES.15 in release/6.0 branches
3 participants