Intermittent Unknown error 258 with no obvious cause #1530

Open · deadwards90 opened this issue Mar 3, 2022 · 75 comments

Comments

@deadwards90

Describe the bug

On occasions we will see the following error

Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): Unknown error 258
   at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.SqlCommand.InternalEndExecuteReader(IAsyncResult asyncResult, Boolean isInternal, String endMethod)
   at Microsoft.Data.SqlClient.SqlCommand.EndExecuteReaderInternal(IAsyncResult asyncResult)
   at Microsoft.Data.SqlClient.SqlCommand.EndExecuteReaderAsync(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)

However, SQL Server shows no long-running queries and is not using much of its resources during the periods when this happens.

It looks to be more of an intermittent connection issue but we're unable to find any sort of root cause.

To reproduce

We're not sure of the reproduction steps; I've been unable to reproduce this myself by simulating load. From what we can tell it is more likely to happen when the pod is busy (not just with HTTP, but also handling events from an external source), but it can equally happen when nothing much is happening on the pod, which has caused us a substantial amount of confusion.

Expected behavior

Either more information on what the cause might be, or some solution to the issue. I realise the driver might not actually know the cause and that, from its point of view, it may genuinely be a timeout. We're not entirely sure where the problem lies yet, which is the biggest issue.

Further technical details

Microsoft.Data.SqlClient version: 3.0.1
.NET target: Core 3.1
SQL Server version: Microsoft SQL Azure (RTM) - 12.0.2000.8
Operating system: Docker Container - mcr.microsoft.com/dotnet/aspnet:3.1

Additional context

  • Running in AKS, against Elastic Pools.
  • SQL Server shows no long running queries
  • We sometimes get a TimeoutEvent from the metrics that are collected from the pool. On occasions when we do get them, the error_state will be different.
    • For example, we had one this morning that was 145. We don't know what this means and can find no information on what these codes relate to. I've raised a ticket with the Azure Docs team to look at this. I'll add more here as they happen; we've not been keeping track of the error_state codes as we're not sure they're even relevant.
  • This might be related to this ticket - Execution Timeout Expired Error (258, ReadSniSyncOverAsync) #647
    • However we don't see the ReadSniSyncOverAsync
  • We do have event counter metrics being exported to Prometheus but have found no obvious indicators that something is wrong
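For anyone who wants to watch the same counters in-process rather than through an exporter, a minimal sketch of an EventCounters listener is below. This is not our exact setup: the Microsoft.Data.SqlClient.EventSource provider name is the documented one, but the counter payload shape is an assumption to verify against the driver version you run.

using System;
using System.Collections.Generic;
using System.Diagnostics.Tracing;
using System.Linq;

// Minimal sketch: subscribe to SqlClient's EventCounters and print them every 10 seconds.
public sealed class SqlClientCounterListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name == "Microsoft.Data.SqlClient.EventSource")
        {
            EnableEvents(eventSource, EventLevel.Informational, EventKeywords.All,
                new Dictionary<string, string> { ["EventCounterIntervalSec"] = "10" });
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        // EventCounters arrive as payload items, each a dictionary describing one counter.
        if (eventData.EventName != "EventCounters" || eventData.Payload is null) return;

        foreach (var counter in eventData.Payload.OfType<IDictionary<string, object>>())
        {
            counter.TryGetValue("Name", out var name);
            counter.TryGetValue("Mean", out var mean);
            counter.TryGetValue("Increment", out var increment);
            Console.WriteLine($"{name}: mean={mean} increment={increment}");
        }
    }
}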
@JRahnama
Member

JRahnama commented Mar 3, 2022

As we have seen before in issue #647, the underlying reason may come from different problems that are not entirely the driver's fault. Any interruption in connectivity, a missed port, or a socket failure could lead to this issue. We cannot say more without more context about the application. The most helpful step would be a minimal repro, which is usually impossible to create. However, we can capture EventSource traces and see what has gone wrong. I would suggest closing this issue and following #647.

@deadwards90
Author

deadwards90 commented Mar 4, 2022

@JRahnama happy to close the issue and add my information to that ticket, but I want to be absolutely sure that the referenced ticket (which I've also included in the original report) is not just for the ReadSniSyncOverAsync errors, which are not in our stack trace.

In the meantime, we'll hook up the EventSource you linked to gather some more information. Did not realise that was available!


EDIT: Any suggestions on which event source traces to enable?

@JRahnama
Member

JRahnama commented Mar 4, 2022

@dantheman999301 if you use PerfView it will capture all events. Make sure you filter them by the Microsoft.Data.SqlClient.EventSource name.

@deadwards90
Author

@JRahnama unfortunately, due to the nature of this issue (it only shows up in our production environment, which is Linux Docker in AKS), I think PerfView is not going to work for us, as nice as it would be.

We've managed to wrangle the EventListener so that it will only log on errors. These errors usually show up for us at least once in the morning in a certain service during weekdays, so fingers crossed I should have something for you on Monday.
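For anyone wanting to do something similar, the listener is roughly the following shape (a simplified sketch, not our exact code; the provider name is the one @JRahnama mentioned, and filtering at EventLevel.Error is an assumption — you may need Informational plus specific keywords to see anything useful).

using System;
using System.Diagnostics.Tracing;

// Simplified sketch of an in-process trace listener for SqlClient.
// Register one instance at startup and keep it alive for the app's lifetime.
public sealed class SqlClientTraceListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name == "Microsoft.Data.SqlClient.EventSource")
        {
            // Narrow the level/keywords as needed; capturing everything can be extremely noisy.
            EnableEvents(eventSource, EventLevel.Error, EventKeywords.All);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        // Forward to your logging pipeline instead of Console in a real app.
        var payload = eventData.Payload is null ? "" : string.Join(" | ", eventData.Payload);
        Console.WriteLine($"[{eventData.Level}] {eventData.EventName}: {payload}");
    }
}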

@JRahnama
Member

JRahnama commented Mar 4, 2022

I can happily point you to PerfCollect for Unix machines, but I'm not sure how it works on Docker. It captures all events, and you can transfer the files to a Windows machine to investigate them.

@deadwards90
Author

@JRahnama the good news is, we managed to get the logs (we think).

The bad news is, there are 200,000 of them. Unfortunately it didn't filter like I thought it would, and as it was happening during a busy period and we had all Keywords enabled, there was a lot to collect.

We need to work out how we're going to export them from Kibana, and obviously it's not as good as a full dump. PerfCollect might work, but we'll need to figure out how to hook it up and run it over what is usually a period of an hour without impacting our production systems. We might also be able to use dotnet-monitor, but that would require some investigation too.

Let me know if it's of any use.

@vincentDAO

We got the same issue 5 days ago, even though our code was working and hadn't changed in the days before.

@JRahnama
Member

@dantheman999301 sorry for the late response, we got busy with the preview release. Any kind of log that shows, or helps us understand, where it happens would be helpful.

@JRahnama
Member

We got the same issue 5 days ago, even though our code was working and hadn't changed in the days before.

The same questions apply to your case as well. Any repro or tracing logs would be helpful. I would also suggest investigating network traces; that could clarify some of the underlying issues.

@oyvost

oyvost commented Apr 5, 2022

258 timeout is a common exception when the DTU limit is reached on Azure SQL. If on Azure, you can try to monitor the Max DTU percentage and see if it hits the limit.
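If you'd rather check this from code than from the portal, a rough sketch (using Microsoft.Data.SqlClient) that samples the sys.dm_db_resource_stats DMV in the target database is below; the connection string is a placeholder, and treating the CPU/IO/log percentages as a DTU proxy is my assumption.

using System;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

// Rough sketch: print recent resource usage for an Azure SQL database.
// sys.dm_db_resource_stats keeps roughly one hour of 15-second samples.
public static class DtuCheck
{
    public static async Task PrintRecentUsageAsync(string connectionString)
    {
        const string query = @"
SELECT TOP (20) end_time, avg_cpu_percent, avg_data_io_percent, avg_log_write_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;";

        using var connection = new SqlConnection(connectionString);
        await connection.OpenAsync();

        using var command = new SqlCommand(query, connection);
        using var reader = await command.ExecuteReaderAsync();
        while (await reader.ReadAsync())
        {
            Console.WriteLine($"{reader.GetDateTime(0):u} cpu={reader.GetDecimal(1)}% " +
                              $"io={reader.GetDecimal(2)}% log={reader.GetDecimal(3)}%");
        }
    }
}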

@deadwards90
Author

deadwards90 commented Apr 20, 2022

@oyvost In the issue we are facing we can see that the connection is not even made to SQL Server so it's not a throttling issue or similar.

For example, we had an error about 20 minutes ago and it was utilising 4% of the available DTUs.

@JRahnama just to get back to you, the logs we thought we had turned out not to be any good; there was a lot of duplication due to the way we wrote the EventListener. The closest we've got to an answer is that we think it's timing out trying to get a connection from the connection pool, even though, from what we could tell from the event counters, there seemed to be plenty of available connections in the pool. We're not overly confident this is the cause, but it's something to go on.

We did see this error once in one of the services that is prone to showing the other error I posted.

Microsoft.Data.SqlClient.SqlException (0x80131904): A connection was successfully established with the server, but then an error occurred during the pre-login handshake. (provider: TCP Provider, error: 0 - Unknown error 16974573)
 ---> System.ComponentModel.Win32Exception (16974573): Unknown error 16974573
   at Microsoft.Data.ProviderBase.DbConnectionPool.CheckPoolBlockingPeriod(Exception e)
   at Microsoft.Data.ProviderBase.DbConnectionPool.CreateObject(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
   at Microsoft.Data.ProviderBase.DbConnectionPool.UserCreateRequest(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
   at Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at Microsoft.Data.ProviderBase.DbConnectionPool.WaitForPendingOpen()

@DLS201

DLS201 commented Jul 1, 2022

Hello,
We have been facing a similar issue for approximately a month, with SQL queries hitting Win32Exceptions with code 258 and no obvious cause. The DB itself reports no issues and very low DTU usage.

@JRahnama
Member

JRahnama commented Jul 4, 2022

@DLS201 can you post the stack trace of the exception? Have you checked the TCP and socket events/logs?

@DLS201

DLS201 commented Jul 4, 2022

Hi,
Here is our error message:

Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
---> System.ComponentModel.Win32Exception (258): No error information
   at Microsoft.Data.SqlClient.SqlCommand.<>c.<ExecuteDbDataReaderAsync>b__207_0(Task`1 result)
   at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
--- End of stack trace from previous location ---
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.InitializeReaderAsync(AsyncEnumerator enumerator, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.<>c__DisplayClass33_0`2.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.MoveNextAsync()

The database is an Azure SQL instance; no errors on that side.

@mekk1t

mekk1t commented Jul 19, 2022

@DLS201 Hello! Any progress on this? I'm encountering the same issue, but on three different occasions. I've documented them as a question on Stack Overflow.

@deadwards90
Author

@mekk1t out of interest, what infrastructure have you got?

We've only ever been able to narrow it down (we think) to a timeout waiting for a connection from the connection pool. That said, increasing the connection pool size and the timeout seems to make no difference.
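For clarity, these are the knobs I mean; a sketch using SqlConnectionStringBuilder is below (the server/database and the numbers are just placeholders, not a recommendation).

using Microsoft.Data.SqlClient;

// Sketch: pool-related settings expressed via the builder. Values are illustrative only.
var builder = new SqlConnectionStringBuilder("Server=tcp:myserver.example.com;Database=mydb;")
{
    Pooling = true,
    MinPoolSize = 10,      // keep some connections warm so bursts don't all wait on new ones
    MaxPoolSize = 200,     // default is 100
    ConnectTimeout = 30    // seconds to wait when opening, including waiting on the pool
};

using var connection = new SqlConnection(builder.ConnectionString);
connection.Open();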

@DLS201

DLS201 commented Aug 10, 2022

Hello,
Azure Support just told us that the product group identified the issue and will fix it Q4 2022.

Regards,

@johngwood

Azure Support just told us that the product group identified the issue and will fix it Q4 2022.

@DLS201 any chance you have more information on this? Maybe a link to a bug or something? We are investigating this issue currently and would like to know what MSFT knows...

@robsaa

robsaa commented Aug 19, 2022

I am having the exact same issue with this setup:

Microsoft.Data.SqlClient version: 4.1.0
.NET target: .NET 6.0
SQL Server version: Microsoft SQL Server Standard (64-bit) 15.0.2070.41 (on-premise)
Operating system: Docker Container running on linux - mcr.microsoft.com/dotnet/aspnet:6.0

@dbeavon

dbeavon commented Aug 23, 2022

@DLS201 do you have any more details on the scope? Does it happen for any .NET client code running anywhere in Azure?

In my case I'm having tons of problems getting a .Net solution to run correctly in Spark for Synapse.

I'm opening a support case as well. But you could save us all many, many days of time with CSS if you would point us in the right direction. Generally CSS seems to not be highly technical, and they seem just as far removed from the Azure product groups as I am!

Interestingly, I can run my Spark cluster locally, outside of Azure, and I have no issues connecting to the same SQL database over ExpressRoute. It is only when my .NET code also runs in Azure that I have trouble connecting to Azure SQL. This seems counter-intuitive. (Another odd data point: the exact same code works fine when running on Azure Databricks VMs, which suggests it may be a problem related specifically to how Synapse interacts with Azure private resources.)

I suspect the problem is with the newer, proprietary layers that Microsoft has been putting on top of their VNETs these days (i.e. "managed VNETs" and "managed private endpoints"). I have noticed other Azure services are extremely flaky as well (e.g. Power BI's managed VNET gateway and ADF pipelines are pretty unpredictable; I have opened support tickets for both of these as well).

@MichelZ

MichelZ commented Sep 14, 2022

@DLS201 Please elaborate! We're seeing this issue from time to time too, in Azure AKS running against Azure SQL Elastic Pool DBs.

@cheenamalhotra
Member

Would appreciate more details about driver namespace + version + target framework in use, to help us identify, track and unblock you with known issues.

@MichelZ

MichelZ commented Sep 14, 2022

You mean like:

  • Microsoft.Data.SqlClient v4.1.1
  • .NET 6
  • Docker aspnet:6.0-bullseye-slim
  • Host image: AKSUbuntu-1804gen2containerd-2022.08.29

@MichelZ

MichelZ commented Sep 14, 2022

OK, so going further down the troubleshooting rabbit hole and reading up on some things (especially in connection with AKS), it seems our issue with this was actually SNAT port exhaustion on the outbound load balancer for the AKS cluster.

Diagnose port exhaustion:
https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-standard-diagnostics#how-do-i-check-my-snat-port-usage-and-allocation

https://docs.microsoft.com/en-us/azure/load-balancer/troubleshoot-outbound-connection

What it looks like: (screenshot of SNAT port usage and allocation metrics omitted)

We have increased the ports per backend instance (to a ridiculously high number :) ), so we hope to never see this again.

https://docs.microsoft.com/en-us/azure/aks/load-balancer-standard#configure-the-allocated-outbound-ports

@MichelZ

MichelZ commented Sep 14, 2022

Actually, it happened again, and this time without hitting the SNAT limit :(

@cheenamalhotra
Member

Since you're able to reproduce the issue in AKS, does it also reproduce in Docker containers or a local Unix environment?
Would it be possible to share a minimal repro here to investigate?

@robsaa

robsaa commented Sep 15, 2022

Came across this website while searching for a solution:

https://support.optimizely.com/hc/en-us/articles/4432366206733-CMS-12-site-crash-due-to-SQL-timeout-error-when-working-in-CMS-edit-mode

Can it be related to ThreadPool.MaxThreads?

I have not tried to adjust MaxThreads myself yet.

@hendxxx

hendxxx commented Oct 1, 2022

<ConcurrentGarbageCollection>false</ConcurrentGarbageCollection>
<ThreadPoolMinThreads>4</ThreadPoolMinThreads>
<ThreadPoolMaxThreads>200</ThreadPoolMaxThreads>
<ServerGarbageCollection>false</ServerGarbageCollection>

Same here, this is my config... it still happens :( Why???

@maxafu

maxafu commented Nov 18, 2022

I'm having the same issue. Any solutions?

@krompaco

krompaco commented Sep 1, 2023

Is anyone experiencing these errors with net7.0?

@jbogard

jbogard commented Sep 1, 2023

Yes, we tried upgrading to .NET 7 and the latest SNI package release (5.1.1). It didn't fix it. The only thing that has made any difference was the suggestion to bump the min thread count. We bumped it to 20 and now very rarely see the issue (maybe 1/day or so).

Nothing seems to eliminate it so far but the thread count workaround does help.

@TroyWitthoeft

TroyWitthoeft commented Sep 2, 2023

To continue the conversation about this issue being related to Linux, and to hopefully narrow in on an event: our Azure Application Insights logs show that these random timeouts AND application performance issues all started after the Azure East US maintenance window QK0S-TC8 did SOMETHING to the Azure Functions host on August 17th 9:00PM EST. Something in that host update caused this unknown error 258 to start appearing.

At that point in time, our host went from Linux 5.10.177.1-1.cm1 to Linux 5.15.116.1-1.cm2, application performance tanked shortly after, and these Unknown error 258 exceptions suddenly started appearing. Some metric digging shows that memory usage (P3V3 with 32 GB) dropped along with performance.


No code or DB changes on our part; suddenly the App Insights logs show kernel 5.15, odd timeout errors, and we presumably can't access memory. @eisenwinter - you said you tried a few different kernels; did you go back to 5.10? Anyone else seeing this issue on 5.10 or lower?

Update: We converted the function app over to Windows. The sql timeout errors are gone and application performance is restored! 🎉

@krompaco

krompaco commented Sep 6, 2023

We upgraded to .NET 7 but still see occurrences.

@sajidur

sajidur commented Sep 24, 2023

I encountered a similar issue where my application was running on Docker within a Red Hat VM. I faced several weeks of challenges while trying to resolve it. Eventually, I decided to switch to a Windows environment, and to my surprise, I encountered no further issues.

I suspect that there might have been a network issue between my Linux VM and the Windows SQL Server database.

@SharePointX

I am having the same problem.

  • System.Data.SqlClient version: 4.8.5
  • .NET target: .NET 8.0
  • SQL Server version: Microsoft SQL Server 2019 Developer (64-bit) - 15.0.2101.7
  • Operating system: Windows Server 2019 Datacenter (10.0)
System.Data.SqlClient.SqlException (0x80131904): Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): The wait operation timed out.
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
   at System.Data.SqlClient.SqlDataReader.get_MetaData()
   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
   at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, SqlDataReader ds)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean asyncWrite, String method)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method)
   at System.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior)
   at System.Data.SqlClient.SqlCommand.ExecuteDbDataReader(CommandBehavior behavior)
   at System.Data.Common.DbDataAdapter.FillInternal(DataSet dataset, DataTable[] datatables, Int32 startRecord, Int32 maxRecords, String srcTable, IDbCommand command, CommandBehavior behavior)
   at System.Data.Common.DbDataAdapter.Fill(DataTable[] dataTables, Int32 startRecord, Int32 maxRecords, IDbCommand command, CommandBehavior behavior)
   at System.Data.Common.DbDataAdapter.Fill(DataTable dataTable)

The table has 10 columns and a total of 22,000,000 rows.
When I call "Fill" via SqlCommand with a query on the PK, the error occurs only the first time.
On the second call, the correct 10,000 rows are retrieved.

It occurs in single-threaded console applications, even when accessed from a single instance.
I believe the problem is not a client application problem, but a database cold start problem.
Is there any way to warm up the database?

@JRahnama
Member

JRahnama commented Oct 6, 2023

@SharePointX have you tried Microsoft.Data.SqlClient? Does this happen consistently in your application? Can you share a sample repro?

@SharePointX

I swapped to Microsoft.Data.SqlClient to check and still got the same error.
However, it does not occur on the second run.

  • 1st run: timeout error
  • 2nd time: Normal termination immediately
  • After restarting SQL Server service: takes a long time but ends normally

The other thing I found out is that the error occurs only when querying a table with a very large number of records in the same DB instance.

@DavoudEshtehari
Member

@SharePointX have you tried increasing the connection timeout to make sure it doesn't time out while retrieving data from the server, and seeing how that affects the execution?

I'd suggest retrieving the data in smaller batches if it fails with large ones, to find the right balance.
If the server is temporarily unavailable (especially over the network), I'd recommend applying a retry pattern at the application level for the remaining data.

There are some thread-starvation scenarios involving the thread pool and connection pool, which I believe don't apply to your case with a single-threaded app. I don't have any ideas about warming up the database; you may want to ask the SQL Server dev team.
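As an illustration of the application-level retry idea (a sketch only, using EF Core since several stack traces in this thread go through it): MyDbContext and connectionString are placeholders, and adding -2 (the client-side timeout error number) to the transient list is an assumption — check whether the default detector already covers your failure.

using System;
using Microsoft.EntityFrameworkCore;

// Sketch: let EF Core's execution strategy retry transient failures.
builder.Services.AddDbContext<MyDbContext>(options =>
    options.UseSqlServer(connectionString, sqlOptions =>
        sqlOptions.EnableRetryOnFailure(
            maxRetryCount: 5,
            maxRetryDelay: TimeSpan.FromSeconds(10),
            errorNumbersToAdd: new[] { -2 })));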

@oyvost

oyvost commented Nov 3, 2023

I revisited this problem and stumbled upon something intriguing while using EF Core 7.0. I discovered that both DbConnection and DbCommand have a CommandTimeout property.

Here's how I configured the database service:

builder.Services
    .AddDbContext<DBContext>(options =>
    {
        var connectionString = "----";
        options.UseSqlServer(connectionString, sqlServerOptionsAction: sqlOption =>
        {
            sqlOption
                .CommandTimeout((int)TimeSpan.FromMinutes(30).TotalSeconds);
        });
    });

When calling the stored procedure, I had to include the following code:

_db.Database.OpenConnection();
var command = _db.Database.GetDbConnection().CreateCommand();
command.CommandText = "StoredProc";
// command.CommandTimeout = _db.Database.GetCommandTimeout().Value; // The command timeout is not set without this line
return await command.ExecuteNonQueryAsync();

I initially assumed that DBCommand.CommandTimeout was inherited from DBConnection.CommandTimeout, but that's not the case. The line mentioned above must be uncommented. After doing so, the timeout issue (258 timeout) is resolved. This behavior was somewhat unexpected. Is there a better way to handle this configuration?

@ErikEJ
Contributor

ErikEJ commented Nov 3, 2023

@oyvost You can set it in the connection string.
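For example, a sketch (the server/database are placeholders; the "Command Timeout" connection string keyword needs a reasonably recent Microsoft.Data.SqlClient, around 2.1 or later if I remember correctly):

using System;
using Microsoft.Data.SqlClient;

// Sketch: set the default command timeout via the connection string instead of per command.
var builder = new SqlConnectionStringBuilder("Server=tcp:myserver.example.com;Database=mydb;")
{
    CommandTimeout = (int)TimeSpan.FromMinutes(30).TotalSeconds
};

// Commands created from this connection should then default to the configured timeout,
// which should remove the need for the manual CommandTimeout copy shown above.
using var connection = new SqlConnection(builder.ConnectionString);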

@L-Palmquist

Quoting @TroyWitthoeft above: "Update: We converted the function app over to Windows. The sql timeout errors are gone and application performance is restored! 🎉"

Just want to mention that we experienced a similar issue and converting to Windows seems to have solved the problem.

@dazinator

I can't switch to Windows to solve this.
I added Polly retries with exponential backoff, and on the third retry when this error occurs, the query succeeds. This has allowed us to remain operational, thank goodness.
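The policy is roughly this shape (a simplified sketch, not our exact code; matching on SqlException.Number == -2 for client-side timeouts is an assumption you may want to widen, and connectionString is a placeholder).

using System;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;
using Polly;

// Sketch: exponential backoff retry around a SQL call (Polly v7-style API).
var retryPolicy = Policy
    .Handle<SqlException>(ex => ex.Number == -2)
    .WaitAndRetryAsync(
        retryCount: 3,
        sleepDurationProvider: attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

await retryPolicy.ExecuteAsync(async () =>
{
    using var connection = new SqlConnection(connectionString);
    await connection.OpenAsync();
    using var command = new SqlCommand("SELECT 1;", connection);
    await command.ExecuteScalarAsync();
});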

@dazinator

Would it be possible to hear from the SqlClient team as to whether any fixes in the latest release (5.1.2) may help with this issue? https://github.com/dotnet/SqlClient/blob/main/release-notes/5.1/5.1.2.md

@dbeavon

dbeavon commented Jan 11, 2024

If you have chronic problems or outages, please open a support ticket with Microsoft and share the outcome here. I think that type of information will allow us to proceed more productively.

I'm a "big-data" developer in Azure, and I have suffered from chronic socket exceptions in multiple Azure platforms - just as everyone else is describing. The main pattern I've noticed is that they happen within Microsoft's proprietary VNET components (private endpoints). They are particularly severe in multi-tenant environments where several customers can be hosted on the same hardware (or even in containers within the same VM).

The problems are related to bugs in Microsofts software-defined-networking components (SDN proxies like "private endpoints" or "managed private endpoints"). I will typically experience these SDN bugs in a large "wave" that impacts me for an hour and then goes away. The problems have been trending for the worse over the past year, and I've opened many tickets (Power BI, ADF, and Synapse Spark).

The problems are not likely to happen on normal TCP (public internet ports). Nor are they common on VNET traffic when "peering" is used. It seems to be a problem that is specific to the funky private endpoints in Azure. I would be interested to know if others can relate to this pattern. Also, the socket exceptions are not specific to SQL. They impact MSAL, ADLS, REST, etc. The underlying TCP exceptions are generally the same in all cases ("connection reset" or "connection timeout").

The Microsoft support cases I have already opened have not gone well. We are a small customer and Microsoft has not taken our problems seriously. I think they are already well-aware of the bugs in their private VNET technology and are trying to keep them hidden. They undermine the severity of these problems by repeating the word "transient" over and over like it was some sort of a magical incantation that makes the bugs more acceptable. To me it seems totally unacceptable to have issues (transient or otherwise) that trigger dozens and dozens of socket exceptions a day.

I find that there are lots of Microsoft's own products that are impacted by these network bugs, notably ADF pipelines. They have had a "VNET IR" that suffers from a massive number of socket exceptions when interacting with resources in the local network (East US). The support teams actually say they have a formalized "policy" to require customers to implement retries on all pipeline activities (they suggest at least three retries with two minute delays in between them). This "policy" is simply a way to avoid the well-known bugs. ADF deployed the VNET IR several years ago and they have probably handled hundreds of CSS tickets about the fragile private endpoints in their managed vnet.

IMPORTANT: Based on a three-year support case with another team (not ADF) I do have bug numbers that apply to private endpoints and these bugs are supposed to be fixed soon (not holding my breath).

The following changes are for the "Azure Network stack":

Feature ending with x509 : Support map space deletion in VFP without breaking flows.
(virtual filtering platform, ie switch?)

The ETA publicized is late 2024

Bug ending with x810 : PE-NC Flow mix up
Network team promised to roll it out early 2024.

Sorry for the long post. I think it is easy for everyone to prove that these problems are specific to Azure, and don't affect on-premise scenarios with Sql Client. The challenge is getting Microsoft to help with them in Azure. I get the sense that there is a substantial lack of honesty and transparency when it comes to networking bugs in Azure. There was only one support scenario where Microsoft was willing to admit to the bugs I've experienced in their SDN technology. In every other Azure platform they are refusing to investigate or discuss the so-called "transient" network errors that cause TCP connections to break between resources that are in direct proximity to each other (0 ms latency).

Here is a discussion of ADF's networking bugs.
https://learn.microsoft.com/en-us/answers/questions/1390627/what-is-the-reason-for-networking-problems-in-adf

Hope this helps. I would appreciate it if those impacted by the socket errors would at least confirm that their scenarios involve VNETS, eg. private endpoints in particular, and possibly managed vnets. If you open a support case with Microsoft, please keep a count of the number of times they use the word "transient" (and they always say "communication failure" rather than socket exception).

@jbogard

jbogard commented Jan 11, 2024

If you have chronic problems or outages, please open a support ticket with Microsoft and share the outcome here. I think that type of information will allow us to proceed more productively.

We did this. It was punted around between teams with no resolution so we closed it on our side because we were tired of spending time answering the same questions over and over. The workarounds posted here have helped but not eliminated the issue. We decided to stick with *nix App Service instances although the time spent futzing around on this issue has cost us more than the savings switching from Windows instances.

@dazinator

dazinator commented Jan 11, 2024

The main pattern I've noticed is that they happen within Microsoft's proprietary VNET components (private endpoints). They are particularly severe in multi-tenant environments where several customers can be hosted on the same hardware (or even in containers within the same VM.

@dbeavon We have a multi-tenant application running as a docker container on an Azure VM - connecting to Azure SQL over a "private endpoint" (which presents itself as a nic on our vnet). So yes I think this is a good match.

I have not tried deleting the private endpoint and going via the public IP; I will give that a shot.
As it stands, we are midway through refactoring our application to make our most impacted SQL calls from a newly dedicated process (the theory being that it was something to do with thread pool ramp-up when other things are sharing the thread pool, and this not playing well with the way the SqlClient library is written, using sync-over-async and such). By giving our task a dedicated dotnet process, hopefully that will help. However, if it doesn't, aside from being really annoyed, at least I can give the public IP a go.

It's too painful to go through support; we are a small team and the amount of time required to get this over the line appears to be too much.

@dbeavon

dbeavon commented Jan 12, 2024

@dazinator I certainly agree about the pain of support. But if Microsoft isn't working on these sorts of cases, then they will have less motivation to fix the underlying bugs. I doubt there is any telemetry captured for the so-called "transient" socket exceptions in their virtualized network technology. If only 1% of the impacted customers are making the effort to report these problems, then Microsoft probably thinks the problems aren't very severe.

And if Microsoft won't prioritize or fix these network bugs, they will impact us forever. Just as a case in point - the managed-vnet-IR in ADF has been buggy for ~4 years. And the managed vnet gateway in Power BI is also impacted by the same private endpoint bugs for ~3 years (the only difference here is that the Power BI team wisely allowed their component to remain in a "preview" state rather than pushing ahead with a premature GA). I have had a case open with the PBI team for the entire time this component has been in preview.

The biggest pain for me has been with the Synapse platform. Hosting my Spark workloads on Synapse has been a real nightmare. I see dozens and dozens of socket exceptions each day, in every possible type of remote connection. However this did NOT happen to me when my Spark workloads were hosted on Azure Databricks (using injected VNETs, rather than private endpoints). Nor do I see these socket exceptions when running a standalone Spark cluster on my own local hardware. It is pretty easy to distinguish the network environments that work fine from the ones that are bug-ridden. There is no excuse for the unreliable nature of Microsoft VNETs. The TCP transport protocol is already designed to be fault-tolerant. Microsoft has really changed the whole equation when it comes to the reliability of TCP connections....

It is especially annoying to think about the fact that these failed connections are taking place between servers that are sitting a hundred yards from each other (or less). ... I will have to find the Azure East US region on google earth to see how big it is. It is certainly not big enough to suffer from "random Internet glitches". I would have better network reliability running my Spark clusters on the Wifi at an airport than I would in Azure!

@TroyWitthoeft

TroyWitthoeft commented Jan 12, 2024

@dbeavon - I know how painful and impossible it can be to track down transient network errors. I bet the MS team does too, hence the run around. Transient errors stink, I feel your pain.

I do want to point out that this particular situation may or may not be related to network errors. In our case, the intermittent Unknown error 258 began with an update to our Linux App Service that clearly changed its operating telemetry, and then this error appeared.

We'll all need to work together to get this one sorted.

@sliekens

It doesn't help that 258 seems to be used for a wide range of problems. Your query is slow? Unknown error 258. The network is congested? Unknown error 258. The DBA didn't like the way you looked at him? UNKNOWN ERROR 258.

@Vandersteen

Vandersteen commented Jan 14, 2024

Quoting @dazinator above: "I have not tried deleting the private endpoint and going via the public IP; I will give that a shot."

We went the other way, from public IP to private endpoint; if my memory serves me, it was after some advice we received from Azure support when we asked them to investigate it. It did not help much.
What did seem to help (a bit - we still get frequent timeouts / 258) is having fewer applications / less concurrency on a single 'VM' by making our 'apps' reserve more CPUs (on k8s), even though they do not really need them judging by the CPU usage.

@dbeavon

dbeavon commented Jan 14, 2024

It would be very easy to send a repro to one of these Azure teams, but they are always unwilling to pursue that option. I have a couple tickets open at the moment where I offered to give the CSS engineers a repro in order to experience these problems for themselves, and they are totally uninterested in taking that path.

(It is strange to behold a vendor that doesn't care about fixing their bugs. When using on-prem software, vendors are typically VERY appreciative if you could provide a consistent reproducible that allowed them to independently review a bug on demand.)

In my experience this problem isn't specific to SQL. In my Synapse-Spark clusters I have chronic socket failures across a number of remotely accessed services in East-US (MSAL, REST, SQL, etc). I just finished up with a support case involving the ADF team whereby they were experiencing a similar problem in their own component (ADF "Linked Service Resolver"). Their component was very unreliable as it was making connections to MSAL from the managed vnet. This particular MS team (that is responsible for the ADF "Linked Service Resolver") says the solution is that every TCP socket which is opened in Azure needs to be wrapped in retry loops that are repeated five times with two minute delays between them. I'm not sure about you folks but it seems absurd. The TCP protocol is supposed to be resilient and fault tolerant, so that application developers don't need to re-invent the wheel and wrap a ton of custom retry logic around every remote TCP connection.

In the end, the ADF team "mitigated" their own component ("Linked Service Resolver") in some way - presumably by using a custom-written retry loop. Then they closed the case without giving any explanation for the elevated number of socket exceptions that were affecting us in the first place. I had hoped to learn from the experience and apply the information to the other types of failing sockets, but this ADF team really didn't give me anything useful as a take-away. Aside from my chronic socket exceptions in ADF/Synapse, I also have frequent outages that are hitting my service-health dashboard. The outages last thirty minutes at a time and come up on a monthly basis, and these outage notifications often use the word "transient communication failures" to explain the RCA. In other words, I think Microsoft product teams are suffering from the network problems themselves, even to the point of publicizing them when they become extreme. Hopefully they will eventually fix these problems for their own sake, (if not for ours.)

@Will-Bill

For those moving to PostgreSQL to mitigate these issues, what are you using for the migration from MS SQL to PostgreSQL?

@Malcolm-Stewart

Malcolm-Stewart commented Feb 14, 2024

Are you running into the Azure Load Balancer issue, where if the connection goes idle for more than 5 seconds, then the connection is terminated? Normal SQL keep-alive is 30 seconds. Lowering it to 3 seconds avoids the issue by ensuring there is never a gap of 5 seconds without some sort of traffic, even if it is a Keep-Alive packet.

A cold SQL Server may be prone to having periodic pauses in getting/returning data, whereas a warmed-up server should have it in cache. This may explain the issues you are running into.

You should be able to see this using a network trace.

@asierpn

asierpn commented Feb 15, 2024

The issue can easily be reproduced by running an ASP.NET Core Web API project in a Linux container and making multiple concurrent calls to an endpoint that queries a SQL Server database immediately after the service starts. The cost of creating each new thread on Linux appears to be very high, and making 30 concurrent queries with no available threads in the thread pool can take more than 60 seconds. If you do the same test running the service on Windows, it works perfectly.

The only workaround for us to get this to work on Linux is to set the ThreadPool.MinThreads to a high number.

@damodar12345

damodar12345 commented Feb 18, 2024

@asierpn @dazinator @MichelZ
Could you please let me know what value of min threads you set as a workaround?

@asierpn

asierpn commented Feb 19, 2024

@asierpn @dazinator @MichelZ Could you please let me know what value of min threads you set as a workaround?

It depends on your application; you have to test different values to find the best one for you. We noticed an improvement when we set it to 20, but eventually we increased it to 100.
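For anyone wondering where to set it, a sketch of the runtime call, done once at startup, is below (100 is just the value that happened to work for us, not a general recommendation; the project-file properties shown earlier in the thread are an alternative).

using System;
using System.Threading;

// Sketch: raise the worker-thread floor once at startup (e.g. at the top of Main).
ThreadPool.GetMinThreads(out _, out var minCompletionPortThreads);
if (!ThreadPool.SetMinThreads(100, minCompletionPortThreads))
{
    Console.WriteLine("ThreadPool.SetMinThreads rejected the requested values.");
}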

@deadwards90
Author

@Malcolm-Stewart I'll try to answer some of your questions, I'm going from memory here as I raised this issue some time ago.

Are you running into the Azure Load Balancer issue, where if the connection goes idle for more than 5 seconds, then the connection is terminated? Normal SQL keep-alive is 30 seconds. Lowering it to 3 seconds avoids the issue by ensuring there is never a gap of 5-seconds without some sort of traffic, even if it is a Keep-Alive packet.

I don't believe we were going through an Azure Load Balancer for this. Given what others in this thread have suggested, I don't think this would be what I was seeing but it might be what others have experienced (see also the port exhaustion suggestions from early on).

A cold SQL Server may be prone to having periodic pauses in getting/returning data, whereas a warmed-up server should have it in cache. This may explain the issues you are running into.

I can say with some confidence this was not the issue. These errors happened sporadically, on hot or cold servers.

You should be able to see this using a network trace.

Certainly would be nice, but the issue we always had was that these were running in containers in K8s and the errors were not easily reproduced. We'd have to leave network traces running for hours or days at a time to try and catch one in the act. @asierpn above recently said this can be fairly easily reproduced just by running in a Linux container and querying (I assume?) any SQL Server instance just as the service is starting.

With that said, we witnessed this on both busy and fairly dormant services that had been up and running for days at a time, so it might be the same error message with a different root cause; very hard to say, though.

If there is anything we (the community) could do here to help nail this down, I'm all ears. Happy to have something running on a machine over a weekend to try and catch it along with some tracing but I'm not sure the best way to set this up.

@David-Engel
Contributor

Just want to clarify one misconception.

The cost of opening each new thread on Linux appears to be very high, and making 30 concurrent queries with no available threads in the thread pool can take more than 60 seconds.

While the cost of creating new threads is higher on Linux than on Windows, for example, it's not that cost that slows things down. The system thread pool implementation in .NET is designed to spin up new threads slowly, in a very conservative manner. If a library or application is spawning many tasks that hold threads for any length of time, it will take a while for the system thread pool to add threads to service the waiting queue of tasks.
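If you want to see whether that is what's happening, a quick sketch that periodically logs thread pool pressure is below (these APIs exist on .NET Core 3.0 and later); sustained growth in PendingWorkItemCount while ThreadCount climbs only slowly is the conservative ramp-up described above.

using System;
using System.Threading;
using System.Threading.Tasks;

// Sketch: log thread pool pressure every few seconds from a background task.
_ = Task.Run(async () =>
{
    while (true)
    {
        ThreadPool.GetMinThreads(out var minWorker, out _);
        Console.WriteLine($"threads={ThreadPool.ThreadCount} " +
                          $"queued={ThreadPool.PendingWorkItemCount} minWorker={minWorker}");
        await Task.Delay(TimeSpan.FromSeconds(5));
    }
});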
