
A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 35 - An internal exception was caught) #2103

Closed
Nico-VanHaaster opened this issue Jul 25, 2023 · 19 comments
Labels
🔗 External Issue is in an external component

Comments

@Nico-VanHaaster

Describe the bug

We are using Entity Framework 7 with the Microsoft.Data.SqlClient package version 5.1.1. The issue we are facing is that the following stack trace appears in groups in our logs every few hours, from all of our services that use a common routine for obtaining a database connection. From the logs, these issues typically clear up within 30 seconds, but during that time there are a significant number of errors (10-15).

These errors are all happening from different services (hosted in an AKS cluster) within the same time frame.

I will add that all the services connect to the Azure SQL Database using private endpoints, and the errors stop after 15-30 seconds. The interval between occurrences is random, anywhere from 30 minutes to 4 hours, so I am at a complete loss as to where to look.

Microsoft.Data.SqlClient.SqlException (0x80131904): A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 35 - An internal exception was caught)
 ---> System.IO.IOException: Unable to read data from the transport connection: Connection reset by peer.
 ---> System.Net.Sockets.SocketException (104): Connection reset by peer
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 count)
   --- End of inner exception stack trace ---
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 count)
   at System.IO.Stream.Read(Span`1 buffer)
   at System.Net.Sockets.NetworkStream.Read(Span`1 buffer)
   at Microsoft.Data.SqlClient.SNI.SslOverTdsStream.Read(Span`1 buffer)
   at System.Net.Security.SslStream.EnsureFullTlsFrameAsync[TIOAdapter](CancellationToken cancellationToken)
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
   at System.Net.Security.SslStream.ReadAsyncInternal[TIOAdapter](Memory`1 buffer, CancellationToken cancellationToken)
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
   at System.Net.Security.SslStream.Read(Byte[] buffer, Int32 offset, Int32 count)
   at Microsoft.Data.SqlClient.SNI.SNITCPHandle.Receive(SNIPacket& packet, Int32 timeoutInMilliseconds)
   at ....
ClientConnectionId:4d28a01d-2de1-4540-b1ca-9e9ef9cd9501

Further technical details

Microsoft.Data.SqlClient version: 5.1.1
.NET target: .NET 7
SQL Server version: Azure SQL
Operating system: Debian Linux - Docker

@JRahnama JRahnama added this to Needs triage in SqlClient Triage Board via automation Jul 25, 2023
@Kaur-Parminder Kaur-Parminder moved this from Needs triage to Needs Investigation in SqlClient Triage Board Jul 25, 2023
@Kaur-Parminder
Contributor

@Nico-VanHaaster Thanks for reporting. We will try to reproduce it on our end and update the issue.

@dmytro-gokun

It looks like we are experiencing the exact same problem in our Azure Container App. The exact same code does not produce this exception when run in a Service Fabric environment.

@dmytro-gokun

@Kaur-Parminder Did you have a chance to investigate this?

@DavoudEshtehari
Member

@dmytro-gokun, that's a different environment configuration, and I'd suggest contacting Azure Support for further assistance with it.

@dmytro-gokun

dmytro-gokun commented Oct 16, 2023

@DavoudEshtehari After some further investigation, here are my findings:

  1. This is a transient network exception;
  2. The real difference is not between Service Fabric/Kubernetes, but in the OS. When running on Service Fabric, we use Windows, and the SqlException then comes back with an SqlError where Number = 10053. When running on Kubernetes/Linux, we get a SqlException where Number = 0.

And this is where the difference happens. When we get 10053, we know it's a transient exception and retry it. For example, EF has this code to determine whether an exception is transient: https://github.com/dotnet/efcore/blob/release/8.0/src/EFCore.SqlServer/Storage/Internal/SqlServerTransientExceptionDetector.cs

But when we get an exception with Number = 0, we do not detect it as transient, and we have to abort the current operation and write the error information to our log.

So, to me this looks like a SqlClient/Linux bug where SqlException.Number is not populated properly. Do you agree?

@Nico-VanHaaster Do you think this also describes your situation?

@ErikEJ
Contributor

ErikEJ commented Oct 16, 2023

Looks related to #1773 and #1902

@dmytro-gokun

@ErikEJ You are right. And both of those issues are closed. So the logical question is: how do we detect transient errors in an OS-independent manner? Is it even possible with the current codebase?
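
One OS-independent workaround (a sketch only, not part of SqlClient) is to stop keying off SqlException.Number alone and instead walk the inner-exception chain for a SocketException, since SocketError values such as 10053 and 10054 are the same on Windows and Linux. The error numbers below are chosen for illustration:

using System;
using System.Net.Sockets;
using Microsoft.Data.SqlClient;

public static class TransientFaultDetection
{
    // Sketch only: treat the exception as transient when either the SqlException
    // already carries a known retryable number (the Windows behavior), or the
    // inner-exception chain contains a SocketException with a retryable code
    // (what we observe on Linux, where Number stays 0).
    public static bool IsTransientSocketFailure(Exception exception)
    {
        if (exception is SqlException sqlException &&
            (sqlException.Number == 10053 || sqlException.Number == 10054 || sqlException.Number == 10060))
        {
            return true;
        }

        for (Exception? inner = exception.InnerException; inner != null; inner = inner.InnerException)
        {
            if (inner is SocketException socketException &&
                (socketException.SocketErrorCode == SocketError.ConnectionAborted || // 10053
                 socketException.SocketErrorCode == SocketError.ConnectionReset ||   // 10054
                 socketException.SocketErrorCode == SocketError.TimedOut))           // 10060
            {
                return true;
            }
        }

        return false;
    }
}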

@JRahnama JRahnama moved this from Needs Investigation to Ideas for Future in SqlClient Triage Board Oct 19, 2023
@JRahnama JRahnama added the 🙌 Up-for-Grabs Anyone interested in working on it can ask to be assigned label Oct 19, 2023
@David-Engel
Contributor

In talking with the ODBC driver team: for their Linux implementation, they actually translate Linux error codes into the equivalent Windows error codes to try to maintain the prior Windows-only behavior under error conditions. So it sounds like this is an enhancement we need to make in our managed SNI layer, which is used by non-Windows clients. I bet this would be helped by implementing #649...
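
For illustration only (this is not the driver's actual code), such a translation could be a small table that maps POSIX errno values to the Winsock numbers that existing Windows-oriented retry logic already understands:

using System.Collections.Generic;

// Hypothetical sketch of an errno -> Winsock translation, in the spirit of what
// the ODBC driver reportedly does on Linux. Left: common Linux errno values;
// right: their Windows equivalents.
internal static class UnixToWindowsSocketError
{
    private static readonly Dictionary<int, int> Map = new()
    {
        { 103, 10053 }, // ECONNABORTED -> WSAECONNABORTED
        { 104, 10054 }, // ECONNRESET   -> WSAECONNRESET
        { 110, 10060 }, // ETIMEDOUT    -> WSAETIMEDOUT
        { 111, 10061 }, // ECONNREFUSED -> WSAECONNREFUSED
    };

    internal static int Translate(int unixErrno) =>
        Map.TryGetValue(unixErrno, out var winsockCode) ? winsockCode : unixErrno;
}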

@JRahnama JRahnama added the 💡 Enhancement New feature request label Oct 24, 2023
@egorshulga

egorshulga commented Nov 2, 2023

Let me share my piece of the investigation. Recently we observed a spike of errors exactly like the ones the topic starter had, and this also impacted our production systems, so we had to take as close a look as we could.

So, we followed the guess that on Linux the error code of the underlying problem is not correctly initialized (it is left as 0), and thus the built-in retry mechanism simply misses the case where it should retry. To track this down we added additional logging to AppInsights (injected into the ShouldRetryOn method of an ExecutionStrategy of the DbContext).

And these are the results: in the image you can see that SqlException.Errors[0].Number [02] is 0, and that is exactly the property used by the built-in retry mechanism of the SqlClient. But we can also see that SqlException.InnerException.InnerException.SocketErrorCode is 10054 [14], and this code is on the list of retryable codes.

So, it seems the design of the retry mechanism assumes that the SocketErrorCode is copied into SqlError.Number, which obviously does not happen in our environment. May I ask the maintainers for advice on where to look next for the source of the trouble? Where is the SqlException created?

[screenshot: AppInsights trace showing SqlException.Errors[0].Number = 0 and SqlException.InnerException.InnerException.SocketErrorCode = 10054]

@egorshulga

It's really hard to find the place where the SocketException should be turned into SqlError.Number.

In the meantime we applied this workaround:

using System;
using System.IO;
using System.Net.Sockets;
using Microsoft.Data.SqlClient;
using Microsoft.EntityFrameworkCore;
using Microsoft.EntityFrameworkCore.Storage;

public class CustomRetryingExecutionStrategy : SqlServerRetryingExecutionStrategy
{
    // Forward EF's dependencies to the base retrying strategy.
    public CustomRetryingExecutionStrategy(ExecutionStrategyDependencies dependencies)
        : base(dependencies)
    {
    }

    protected override bool ShouldRetryOn(Exception e)
    {
        var transactionIsRetried = false;

        // The exception chain observed on Linux: SqlException -> IOException -> SocketException.
        if (e is SqlException sqlException
            && sqlException.InnerException is IOException ioException
            && ioException.InnerException is SocketException socketException
            && socketException.SocketErrorCode == SocketError.ConnectionReset) // 10054
        {
            transactionIsRetried = true;
        }

        // Otherwise defer to the built-in transient-error detection.
        if (!transactionIsRetried)
        {
            transactionIsRetried = base.ShouldRetryOn(e);
        }

        return transactionIsRetried;
    }
}

And registration:

serviceCollection.AddDbContextFactory<XDbContext>((serviceProvider, options) =>
  options.UseSqlServer(connectionString,
    sqlOptions => sqlOptions.ExecutionStrategy(c => new CustomRetryingExecutionStrategy(...)))
);

@DavoudEshtehari
Member

@egorshulga
On Managed SNI, socket exceptions come from the underlying Socket library, either while trying to connect to the server or later when an operation fails. SqlException takes its Number from the first error in the Errors collection; in this case the SqlException carries the inner exceptions but reports error number 0, which doesn't give the retry logic the information it needs and shows an opportunity for improvement. I don't see an easy answer to your question, since Managed SNI errors require improvement following this comment.
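
To see the discrepancy in your own telemetry, a small diagnostic helper (a sketch, not an official pattern) can log every SqlError.Number next to the underlying SocketErrorCode:

using System;
using System.Net.Sockets;
using Microsoft.Data.SqlClient;

public static class SqlExceptionLogger
{
    public static void Dump(SqlException sqlException)
    {
        // SqlException.Number mirrors Errors[0].Number; on Linux this shows 0 here.
        Console.WriteLine($"SqlException.Number = {sqlException.Number}");

        foreach (SqlError error in sqlException.Errors)
        {
            Console.WriteLine($"  SqlError {error.Number}: {error.Message}");
        }

        // Walk the inner exceptions to find the socket error, e.g.
        // SocketError.ConnectionReset (10054) for "Connection reset by peer".
        for (Exception? inner = sqlException.InnerException; inner != null; inner = inner.InnerException)
        {
            if (inner is SocketException socketException)
            {
                Console.WriteLine($"  SocketErrorCode = {socketException.SocketErrorCode} ({(int)socketException.SocketErrorCode})");
            }
        }
    }
}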

@dbeavon

dbeavon commented Jan 23, 2024

We also have a massive number of socket exceptions using Azure SQL from .Net clients that connect over private endpoints.

I think there are some well-known causes for these socket exceptions. The challenge has been getting Microsoft to provide support. These problems occur in their data center, using their SQL client and their Azure SQL database, so you would think that CSS would want to look at a repro for this problem. But I have opened numerous cases with several of the teams that have experience with private endpoints (e.g. the Power BI team, ADF team, Synapse team, and the Private Endpoint team itself), and nobody wants to take my repro! It is very frustrating. The Private Endpoint team at CSS appeared to be just a VPN and DNS troubleshooting team, so I gave up on trying to get them to help.

My next plan is to open an AKS support case. Has anyone tried to give that team a repro for this yet? I know it is a lot of work, but I don't think we can fix private endpoints without forcing Mindtree/Microsoft to review a full repro.

I can share what I have learned, if it helps.... Sorry in advance for the wall of text.

The most drastic reduction in our socket exceptions happened when we increased the threads in the threadpool, specifically the min threads, which are only (8,8) by default on Linux (at least on .NET Core 3.1). See issue #647 and the related comment #647 (comment):
ThreadPool.SetMinThreads(1024, 512);

That fixes the majority of our socket exceptions. Note that this fix is unrelated to private endpoints; it just resolves the threadpool starvation problems on Linux (prior to introducing the fix, you will notice that your call stack waits for async responses from members like "System.Net.Security.SslStream.ReadAsyncInternal").
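
If you want to try the same mitigation, it can be applied once at process startup; the 1024/512 values above are simply what worked here, not a general recommendation. A minimal sketch:

using System;
using System.Threading;

// Run once, as early as possible in Program.cs / Main, before heavy SQL traffic.
ThreadPool.GetMinThreads(out int minWorker, out int minIo);
Console.WriteLine($"Default thread-pool minimums: worker={minWorker}, IOCP={minIo}");

// Values are workload-specific; measure before and after changing them.
if (!ThreadPool.SetMinThreads(1024, 512))
{
    Console.WriteLine("SetMinThreads rejected the requested values; check ThreadPool.GetMaxThreads.");
}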

The remaining socket exceptions are almost certainly due to bugs in the so-called Private Endpoints (aka Managed Private Endpoints). These bugs are well known by the responsible Microsoft team, but not so much by anyone else at Microsoft. Even a PG like the ADF team, which uses private endpoints extensively, doesn't seem to be aware, despite the fact that their product suffers from a massive number of what they call "transient communication failures". We have also had recurring monthly outages that last ~30 minutes and refer to "transient failures" as the underlying cause. In one recent incident, they even made up a story that blamed Cosmos DB for a database crash that lasted thirty minutes. (In reality, I think they had to change their outage notification a little bit, since the normal notification was getting repetitive.)

Thankfully there was one team that shared some of the details about the underlying private endpoint bugs, and only because I have had a support case open with them for over two years. The Power BI team gave me a tiny bit of information, but they won't go so far as to say that the bugs they found are the same ones that impact the other platforms which use private endpoints. It is pretty clear that these network bugs have delayed their GA: they announced their component (the Azure managed VNet gateway) about three years ago and it still hasn't gone GA. I'm assuming that is primarily because of these unresolved networking bugs (you can find the public preview announcement here: https://powerbi.microsoft.com/en-us/blog/announcing-vnet-connectivity-for-pbi-datasets-public-preview/ )

As per information from the Power BI team, here are some changes they expect to receive to improve the "Azure Network stack":

  1. Feature ending with x509: "Support map space deletion in VFP without breaking flows" (VFP = virtual filtering platform, i.e. the virtual switch?). Estimated ETA: late 2024.

  2. Bug ending with x810: "PE-NC Flow mix up". Estimated ETA: the network team promised to roll it out in early 2024.

I also want to point out that these issues in private endpoints are NOT specific to SqlClient, from what I can tell. They seem to affect other types of TCP traffic, particularly long-running operations that hold a connection open for an extended period of time (longer than a second).

I wish Microsoft would share more openly and transparently about these networking bugs. Instead, what they typically do is brush the problem off as a "transient" issue, even though there are underlying bugs responsible. They then ask customers to implement cumbersome workarounds (numerous retries with two-minute delays in between). This seems excessive considering that the TCP protocol is intended to be fault-tolerant in the first place, and that my connections are between resources in direct proximity to each other within the same datacenter!

One of the reasons it is hard to hold Microsoft accountable is that these private endpoint bugs are related to the virtual network (software-defined networking components), not the physical infrastructure. And only a small subset of customers is typically impacted at any time; the impacted customers typically have to be running workloads on the same physical, and possibly even the same virtual, machine (if containers are involved). Interestingly, the Power BI team was never able to recreate the PE bugs when running my repro in their "Canary" region; eventually they had to run the repro in the East US region to see the buggy behavior.

I have kept my Power BI support ticket open for now (going on three years). I think I will try to avoid closing it until they have a permanent fix for the private endpoints. They will finally go GA in February, despite the presence of these bugs. Their plan is to mask over the problem with implicit retries (it is not something I look forward to, considering the long-running queries that pull large amounts of data into Power BI).

I hope this is helpful, if long-winded. I will remember to circle back if/when the AKS support team is willing to take a support case and a repro for these "private endpoint" bugs. I think it is pretty clear that these problems don't happen for me on (1) peered VNETs or on (2) public service endpoints. It is just the funky private endpoints that seem to be buggy. It would be helpful if someone could confirm whether these observations are consistent with their own.

@egorshulga

Hey guys, we also requested support from MS. We had a call with people from the App Services, Networking, and Databases teams. Our outcome from the call was learning about the connectivity architecture of Azure SQL Server, as well as about the Redirect connection policy. The guess we settled on was that this transient SQL error 35 is caused by some internal gateways, which are part of the Azure SQL Server setup and which simply do not handle the load (causing them to restart from time to time). It looks like the MS folks are aware of the problem, although it is not apparent whether it can easily be fixed for everyone.

So, the fix that we were advised to apply (and that we did in the end) was to force the connection policy of the SQL server to Redirect.
[screenshot: Azure portal showing the SQL server connection policy set to Redirect]

We are constantly monitoring it, but for now I can say that we have not had any occurrences of the error since we applied the change (1.5 weeks ago). Before that, the error was happening at least once every 3 days (and often more frequently, up to multiple times per day).

At least the reason and the fix look plausible. Fingers crossed that the error was really caused by those faulty gateways.

@dbeavon

dbeavon commented Jan 23, 2024

@egorshulga Maybe you are on to something regarding the problems with our "private endpoints".

Hopefully your fix works for all of us who are using private endpoints (like the OP and myself). In certain "managed VNet" environments (like ADF pipelines and Synapse Spark) we have no choice but to use "managed private endpoints"; there are no other VNet connectivity options.

Can you please share an ICM number for your case so I can contact the Azure SQL team and see if that issue is the reason for our disconnections as well? I suspect it took a number of days (weeks?) before you got as far as you did on your case. If the gateways are a point of failure, then at the very least we need a way to monitor/quantify the underlying failures (to know whenever a gateway "restarts from time to time").

I'm a bit doubtful that the restarting gateways are the only problem going on, but it is definitely worth a look. In many instances our socket exceptions persist for 30 minutes or so, and we'll see a series of failures in that time (up to 50% of our tasks). It would be surprising if that experience could be explained by a restarting gateway. Also, this wouldn't explain connectivity problems with REST, MSAL, and other resources.

@DavoudEshtehari
Member

Closing this as it is not a driver-relevant issue and reopening #1773 to explore potential enhancements in error messages.

SqlClient Triage Board automation moved this from Ideas for Future to Closed Jan 23, 2024
@DavoudEshtehari DavoudEshtehari added 🔗 External Issue is in an external component and removed 💡 Enhancement New feature request 🙌 Up-for-Grabs Anyone interested in working on it can ask to be assigned labels Jan 23, 2024
@egorshulga

egorshulga commented Jan 24, 2024

I suspect it took a number of days (weeks?)

2 months 😉 But to be fair, we also had a feature freeze during that time, so we contributed to the delay too, unfortunately.

for all of us who are using private endpoints

Right, our setup struggles revolved around a SQL failover group and SQL servers in different regions. We ended up setting up a private endpoint and private DNS zones to resolve the IPs of the actual servers (after a request gets load-balanced by a Traffic Manager, which is apparently what sits under the hood of the failover group).

So in the end it seems to be not a problem with private endpoints themselves, but rather with the SQL server (I would treat those mysterious gateways, which sit in the path of requests in 'Proxy' mode, as part of the SQL server setup). But apparently the problem shows itself only when requests are routed over the MS backbone network. // all of this is mere speculation //

Can you please share an ICM number for your case

Will send in an email

@ventii

ventii commented Mar 21, 2024

Hi @egorshulga, have you had any recurrences since you changed the connection policy to Redirect?

@egorshulga

egorshulga commented Mar 21, 2024

Yes, unfortunately we can still see it. But it is peculiar:

  1. The error message and stack trace are now different. Now it is at InternalOpenAsync:
Stacktrace
System.AggregateException: One or more errors occurred. (A connection was successfully established with the server, but then an error occurred during the login process. (provider: TCP Provider, error: 35 - An internal exception was caught))
 ---> Microsoft.Data.SqlClient.SqlException (0x80131904): A connection was successfully established with the server, but then an error occurred during the login process. (provider: TCP Provider, error: 35 - An internal exception was caught)
 ---> System.IO.IOException: Unable to write data to the transport connection: Connection reset by peer.
 ---> System.Net.Sockets.SocketException (104): Connection reset by peer
   at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 count)
   --- End of inner exception stack trace ---
   at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 count)
   at System.IO.Stream.Write(ReadOnlySpan`1 buffer)
   at System.Net.Sockets.NetworkStream.Write(ReadOnlySpan`1 buffer)
   at Microsoft.Data.SqlClient.SNI.SslOverTdsStream.Write(ReadOnlySpan`1 buffer)
   at System.Net.Security.SyncReadWriteAdapter.WriteAsync(Stream stream, ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at System.Net.Security.SslStream.WriteSingleChunk[TIOAdapter](ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at System.Net.Security.SslStream.WriteAsyncInternal[TIOAdapter](ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at System.Net.Security.SslStream.Write(Byte[] buffer, Int32 offset, Int32 count)
   at Microsoft.Data.SqlClient.SNI.SNIPacket.WriteToStream(Stream stream)
   at Microsoft.Data.SqlClient.SNI.SNITCPHandle.Send(SNIPacket packet)
   at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParserStateObject.ThrowExceptionAndWarning(Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParserStateObject.SNIWritePacket(PacketHandle packet, UInt32& sniError, Boolean canAccumulate, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParserStateObject.WriteSni(Boolean canAccumulate)
   at Microsoft.Data.SqlClient.TdsParserStateObject.WritePacket(Byte flushMode, Boolean canAccumulate)
   at Microsoft.Data.SqlClient.TdsParser.TdsLogin(SqlLogin rec, FeatureExtension requestedFeatures, SessionData recoverySessionData, FederatedAuthenticationFeatureExtensionData fedAuthFeatureExtensionData, SqlConnectionEncryptOption encrypt)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.Login(ServerInfo server, TimeoutTimer timeout, String newPassword, SecureString newSecurePassword, SqlConnectionEncryptOption encrypt)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.AttemptOneLogin(ServerInfo serverInfo, String newPassword, SecureString newSecurePassword, Boolean ignoreSniOpenTimeout, TimeoutTimer timeout, Boolean withFailover)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.LoginNoFailover(ServerInfo serverInfo, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance, SqlConnectionString connectionOptions, SqlCredential credential, TimeoutTimer timeout)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.OpenLoginEnlist(TimeoutTimer timeout, SqlConnectionString connectionOptions, SqlCredential credential, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds..ctor(DbConnectionPoolIdentity identity, SqlConnectionString connectionOptions, SqlCredential credential, Object providerInfo, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance, SqlConnectionString userConnectionOptions, SessionData reconnectSessionData, Boolean applyTransientFaultHandling, String accessToken, DbConnectionPool pool)
   at Microsoft.Data.SqlClient.SqlConnectionFactory.CreateConnection(DbConnectionOptions options, DbConnectionPoolKey poolKey, Object poolGroupProviderInfo, DbConnectionPool pool, DbConnection owningConnection, DbConnectionOptions userOptions)
   at Microsoft.Data.ProviderBase.DbConnectionFactory.CreatePooledConnection(DbConnectionPool pool, DbConnection owningObject, DbConnectionOptions options, DbConnectionPoolKey poolKey, DbConnectionOptions userOptions)
   at Microsoft.Data.ProviderBase.DbConnectionPool.CreateObject(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
   at Microsoft.Data.ProviderBase.DbConnectionPool.UserCreateRequest(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
   at Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at Microsoft.Data.ProviderBase.DbConnectionPool.WaitForPendingOpen()
--- End of stack trace from previous location ---
   at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenInternalAsync(Boolean errorsExpected, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenInternalAsync(Boolean errorsExpected, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenAsync(CancellationToken cancellationToken, Boolean errorsExpected)
   at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerDatabaseCreator.<>c__DisplayClass20_0.<<ExistsAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerDatabaseCreator.<>c__DisplayClass20_0.<<ExistsAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerDatabaseCreator.<>c__DisplayClass20_0.<<ExistsAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.<>c__DisplayClass30_0`2.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalDatabaseCreator.CanConnectAsync(CancellationToken cancellationToken)
ClientConnectionId:43017729-6681-4b40-8623-8e99c28168e0
   --- End of inner exception stack trace ---
  2. Actually, we don't see any impact on real customers' requests. I can still see it in AppInsights via log search, but I'd say it does not bother us anymore.

upd: Ah, I just read the stack trace carefully. It says an internal exception was caught, so I assume it is normal then.

@dbeavon

dbeavon commented Mar 21, 2024

Here is an update about the PE bugs....

The bug ending with x810 (PE-NC Flow mix up) was fixed in early February. As I understand it, it was a general-purpose PE bug that impacted normal scenarios where a client on a VM connects to Azure SQL.

Despite the fix, I still see lots of socket exceptions in PEs, so I'm guessing they are related to the other bug (the one ending with x509: "Support map space deletion in VFP without breaking flows"), although I'm not certain. The way that bug (x509) is documented seems obscure and isn't likely to apply to all my scenarios; it supposedly applies only to scenarios where there are multiple containers on a VM.
So I'm still trying to get a team at Microsoft to take a repro for my Spark problems, but I haven't found one that is eager to do so yet. I'm guessing there are multiple bugs, but nobody seems willing to dive into them. I've learned that it is pretty much impossible to get bugs fixed when working with Microsoft's "professional" support. The back-end PGs do not care very much about these customers; they handle their cases in a sort of "best effort" way, and sometimes that means no effort.
