
Issues after Orleans 3.0 upgrade running on a Service Fabric Cluster #6113

Closed
thakursagar opened this issue Nov 11, 2019 · 51 comments
Labels: stale (Issues with no activity for the past 6 months)


thakursagar commented Nov 11, 2019

I recently upgraded our application, which is deployed on a Service Fabric cluster, to the Orleans 3.0 NuGet packages. After the upgrade, when we run our performance benchmark tests, I see a lot of exceptions being thrown.
It starts with:

Microsoft.Azure.Cosmos.Table.StorageException at Orleans.Clustering.AzureStorage.AzureTableDataManager`1+<>c__DisplayClass28_0+<<ReadTableEntriesAndEtagsAsync>b__0>d.MoveNext
--

Intermediate issue reading Azure storage table OrleansSiloInstances: IsRetriable=False HTTP status code=NotFound REST status code=TableNotFound Exception Type=Microsoft.Azure.Cosmos.Table.StorageException Message='Not Found'
--

Then after a few minutes I see

Orleans.Runtime.SiloUnavailableException
--
The target silo became unavailable for message: NewPlacement Request...

and


Orleans.Runtime.OrleansMessageRejectionException
--

Target S10.2.69.8:20015:311184127 silo is known to be dead

I am using Azure Table Storage clustering for both the client and the server.
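For reference, our clustering setup follows the standard Orleans 3.0 Azure Table Storage configuration; a minimal sketch (the cluster/service IDs and the connection string below are placeholders, not our actual values):

```csharp
using Microsoft.Extensions.Hosting;
using Orleans;
using Orleans.Configuration;
using Orleans.Hosting;

// Silo side (Microsoft.Orleans.Clustering.AzureStorage package).
var host = new HostBuilder()
    .UseOrleans(siloBuilder => siloBuilder
        .Configure<ClusterOptions>(options =>
        {
            options.ClusterId = "my-cluster";   // placeholder
            options.ServiceId = "my-service";   // placeholder
        })
        .UseAzureStorageClustering(options =>
            options.ConnectionString = "DefaultEndpointsProtocol=..."))
    .Build();

// Client side uses the matching gateway list provider.
var client = new ClientBuilder()
    .Configure<ClusterOptions>(options =>
    {
        options.ClusterId = "my-cluster";
        options.ServiceId = "my-service";
    })
    .UseAzureStorageClustering(options =>
        options.ConnectionString = "DefaultEndpointsProtocol=...")
    .Build();
```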


thakursagar commented Nov 11, 2019

Update: I noticed I was still using the old WindowsAzure.Storage NuGet package. I removed it and installed the Microsoft.Azure.Storage.* packages for Blob, Queue, and Common. That resolved the Microsoft.Azure.Cosmos.Table.StorageException; I am no longer seeing it. However, I now see:

[{"severityLevel":"Warning","parsedStack":[{"assembly":"Orleans.Core, Version=2.0.0.0, Culture=neutral, PublicKeyToken=null","method":"Orleans.Runtime.Messaging.ConnectionManager+<ConnectAsync>d__20.MoveNext","level":0,"line":0},{"assembly":"mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089","method":"System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw","level":1,"line":0},{"assembly":"mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089","method":"System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification","level":2,"line":0},{"assembly":"Orleans.Core, Version=2.0.0.0, Culture=neutral, PublicKeyToken=null","method":"Orleans.Runtime.Messaging.ConnectionManager+<GetConnectionAsync>d__15.MoveNext","level":3,"line":0},{"assembly":"mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089","method":"System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw","level":4,"line":0},{"assembly":"mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089","method":"System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification","level":5,"line":0},{"assembly":"Orleans.Core, Version=2.0.0.0, Culture=neutral, PublicKeyToken=null","method":"Orleans.Messaging.ClientMessageCenter+<<GetGatewayConnection>g__ConnectAsync|37_1>d.MoveNext","level":6,"line":0},{"assembly":"mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089","method":"System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw","level":7,"line":0},{"assembly":"mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089","method":"System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification","level":8,"line":0},{"assembly":"Orleans.Core, Version=2.0.0.0, Culture=neutral, 
PublicKeyToken=null","method":"Orleans.Internal.OrleansTaskExtentions+<<ToTypedTask>g__ConvertAsync|4_0>d`1.MoveNext","level":9,"line":0},{"assembly":"mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089","method":"System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw","level":10,"line":0},{"assembly":"mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089","method":"System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification","level":11,"line":0},{"assembly":"Orleans.Core, Version=2.0.0.0, Culture=neutral, PublicKeyToken=null","method":"Orleans.OutsideRuntimeClient+<RefreshGrainTypeResolver>d__56.MoveNext","level":12,"line":0}],"outerId":"0","message":"Unable to connect to endpoint S10.2.69.10:20011:0. See InnerException","type":"Orleans.Runtime.Messaging.ConnectionFailedException","id":"3729534"},{"severityLevel":"Warning","parsedStack":[{"assembly":"mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089","method":"System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess","level":0,"line":0},{"assembly":"mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089","method":"System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification","level":1,"line":0},{"assembly":"Orleans.Core, Version=2.0.0.0, Culture=neutral, PublicKeyToken=null","method":"Orleans.Internal.OrleansTaskExtentions+<MakeCancellable>d__25`1.MoveNext","level":2,"line":0},{"assembly":"mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089","method":"System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw","level":3,"line":0},{"assembly":"mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089","method":"System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess","level":4,"line":0},{"assembly":"mscorlib, Version=4.0.0.0, Culture=neutral, 
PublicKeyToken=b77a5c561934e089","method":"System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification","level":5,"line":0},{"assembly":"Orleans.Core, Version=2.0.0.0, Culture=neutral, PublicKeyToken=null","method":"Orleans.Runtime.Messaging.ConnectionManager+<ConnectAsync>d__20.MoveNext","level":6,"line":0}],"outerId":"3729534","message":"A task was canceled.","type":"System.Threading.Tasks.TaskCanceledException","id":"18829747"}]

which ultimately results, roughly 20 minutes later, in a flood of these SiloUnavailableExceptions:

[{"severityLevel":"Error","parsedStack":[{"assembly":"mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089","method":"System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw","level":0,"line":0},{"assembly":"mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089","method":"System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification","level":1,"line":0},{"assembly":"Cti.Reg.Apex.Grains, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null","method":"","level":2,"line":766,"fileName":""}],"outerId":"0","message":"The target silo became unavailable for message: Request S10.2.69.10:20015:311200119*grn/BE5B12F0/00000000 #2742723: . Target History is: <S10.2.69.7:20017:311200117:*grn/E2326C08/00000000>. See https://aka.ms/orleans-troubleshooting for troubleshooting help.","type":"Orleans.Runtime.SiloUnavailableException","id":"25588762"}]

@thakursagar thakursagar changed the title Issues after Orleans 3.0 upgrade Issues after Orleans 3.0 upgrade running on a Service Fabric Cluster Nov 12, 2019
@sergeybykov (Contributor) commented:

What clustering provider are you using?

@sergeybykov sergeybykov self-assigned this Nov 13, 2019
@sergeybykov sergeybykov added this to the Triage milestone Nov 13, 2019
@thakursagar (Author) commented:

@sergeybykov I'm using Azure Table storage as the clustering provider.

@sergeybykov (Contributor) commented:

Are you performing an in-place upgrade or a new deployment? If the latter, did you use a new cluster ID for it?

Have you looked at the cluster membership table? It should show live state of the cluster.


thakursagar commented Nov 13, 2019

@sergeybykov New deployment. We delete the two Orleans tables (SiloInstances and Reminders) from Azure Table storage every time before we deploy the new code package to Service Fabric.

I looked at the cluster membership table, and it looks like most of the silos were marked dead (more than once) and then came back up.

@sergeybykov (Contributor) commented:

Before you start the benchmark, do you see all silos started and joined to the cluster successfully? If the table is fresh, you should not see any dead silo entries at that point.

@thakursagar (Author) commented:

Yes. Before kicking off the benchmark runs, I made sure all the silos had started and were in the "Active" status in the SiloInstances table. A few minutes after I start the run, I see a lot of exceptions thrown from the Orleans.Core library indicating connection issues with reading the SiloInstances table (which I pasted above in my second comment on this post), followed by lots of SiloUnavailableExceptions. Then after some time the silos seem to come back up, the benchmark run continues, and after a while I see the same behavior again. I suspect something is not working well with the new Microsoft.Azure.Storage.* packages, because when I downgrade Orleans to 2.4.3, which depends on WindowsAzure.Storage, everything works fine and I don't see any of these issues. With the 3.0 version, I have to uninstall the WindowsAzure.Storage package, as Orleans now depends on the new Microsoft.Azure.Storage.* packages.

I am also trying to replace the Azure Table Storage Clustering Membership with SQL clustering right now to rule out any issue with these new Microsoft.Azure.Storage.* packages or Azure Table Storage Clustering.
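For anyone following along, switching the membership provider to ADO.NET looks roughly like this; the invariant and connection string are placeholders, and this assumes the Microsoft.Orleans.Clustering.AdoNet package is referenced and its clustering SQL scripts have been applied to the database:

```csharp
// Silo side: store membership in a SQL database instead of Azure Table.
siloBuilder.UseAdoNetClustering(options =>
{
    options.Invariant = "System.Data.SqlClient";            // ADO.NET provider invariant
    options.ConnectionString = "Server=...;Database=...;";  // placeholder
});

// Client side: read the gateway list from the same database.
clientBuilder.UseAdoNetClustering(options =>
{
    options.Invariant = "System.Data.SqlClient";
    options.ConnectionString = "Server=...;Database=...;";
});
```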

@sergeybykov (Contributor) commented:

> After a few minutes when I start the run, I see a lot of exceptions thrown from the Orleans.Core library indicating connection issues with reading the SiloInstances tables (which I have pasted above in my second comment on this post)

Do you mean this?

Intermediate issue reading Azure storage table OrleansSiloInstances: IsRetriable=False HTTP status code=NotFound REST status code=TableNotFound Exception Type=Microsoft.Azure.Cosmos.Table.StorageException Message='Not Found'

TableNotFound may be caused by you deleting the table. There was a known Azure Table behavior where a previously deleted table would remain unavailable for several minutes afterwards. I don't know how/if that changed with the transition to the new library. I suggest that instead of deleting the table, you use a new cluster ID for each deployment. That's the recommended process.

Note that unavailability of the cluster membership table after a cluster successfully started has no impact on operation of the cluster unless silos leave or join it. This makes me think that the issue with running the benchmark might be somewhere else.
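A minimal sketch of that recommendation: keep the service ID stable across deployments and pick a fresh cluster ID per deployment instead of deleting the table (the names and the `deploymentVersion` variable here are illustrative, not from the issue):

```csharp
// ServiceId stays stable for the lifetime of the service; ClusterId changes
// on every deployment, e.g. derived from a build or deployment number, so
// each deployment gets fresh membership rows without deleting the table.
siloBuilder.Configure<ClusterOptions>(options =>
{
    options.ServiceId = "my-service";                       // stable across deployments
    options.ClusterId = $"my-cluster-{deploymentVersion}";  // fresh per deployment
});
```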

@thakursagar (Author) commented:

No, I meant this. That one was fixed after I uninstalled the WindowsAzure.Storage package and installed the Microsoft.Azure.Storage.* packages.

[Same ConnectionFailedException ("Unable to connect to endpoint S10.2.69.10:20011:0") and TaskCanceledException trace as pasted in my second comment above.]

> After a few minutes when I start the run, I see a lot of exceptions thrown from the Orleans.Core library indicating connection issues with reading the SiloInstances tables (which I have pasted above in my second comment on this post)
>
> Do you mean this?

@thakursagar (Author) commented:


I have tried using a new serviceId and a new clusterId for deployments but that doesn't seem to fix the issue.

@sergeybykov (Contributor) commented:


> No I meant this.

I don't see anything about the table there. But I do see the following:

"Unable to connect to endpoint S10.2.69.10:20011:0

Is silo 10.2.69.10:20011 part of the cluster and running fine at this point? If not, what happened to it?

@thakursagar (Author) commented:


Yes, that silo 10.2.69.10:20011 was running fine at that point. I even changed the clustering membership from Azure Table Storage to ADO.NET clustering, and I still see the same behavior. A few minutes (3-4) after kicking off the run, every silo starts suspecting every other silo in the cluster, which results in a lot of SiloUnavailableExceptions, and some silos are eventually marked as dead. Then they come back up again and the same cycle continues.

@sergeybykov (Contributor) commented:


Hmm. Is this running on a physical cluster, with each silo having a few CPU cores and ServerGC enabled? Can you share the beginning of a silo log or, better, a full silo log?

I have a hard time imagining what would have to happen to a silo for it to stop responding to clustering pings. I guess it is possible to exhaust IO completion ports, but still... @ReubenBond, do you have any other ideas?
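(For context on the ServerGC question: the mscorlib frames in the traces above suggest the silos run on .NET Framework, where server GC is enabled in the silo host's app.config. This is the standard setting, shown here for reference, not a confirmed part of this deployment's configuration:)

```xml
<configuration>
  <runtime>
    <!-- Enables server GC; generally recommended for silos on multi-core hosts -->
    <gcServer enabled="true"/>
  </runtime>
</configuration>
```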

@thakursagar (Author) commented:


We have this running on Azure VM Scale sets with Service Fabric. I will share the silo logs shortly.


thakursagar commented Nov 15, 2019

@sergeybykov Here are the silo logs.
silotraces.zip


thakursagar commented Nov 15, 2019

This is the log from one of the VMs that experienced an unexpected silo shutdown. We are running SF cluster version 6.5.664.9590.

Application: xxxx
Framework Version: v4.0.30319
Description: The application requested process termination through System.Environment.FailFast(string message).
Message: FATAL EXCEPTION from Orleans.Runtime.MembershipService.MembershipTableManager. Context: I have been told I am dead, so this silo will stop! I should be Dead according to membership table (in TryToSuspectOrKill): entry = [SiloAddress=S10.2.69.11:20006:311391329 SiloName=Silo_5c243 Status=Dead HostName=nt-3wcu-2000004 ProxyPort=20005 RoleName=xxxx UpdateZone=0 FaultZone=0 StartTime = 2019-11-14 01:35:30.559 GMT IAmAliveTime = 2019-11-14 01:45:32.171 GMT Suspecters = [S10.2.69.7:20005:311391549, S10.2.69.8:20005:311391330] SuspectTimes = [2019-11-14 01:48:18.358 GMT, 2019-11-14 01:48:20.650 GMT]].. Exception: null.
Current stack:    at System.Environment.GetStackTrace(Exception e, Boolean needFileInfo)
   at System.Environment.get_StackTrace()
   at Orleans.Runtime.FatalErrorHandler.OnFatalException(Object sender, String context, Exception exception)
   at Orleans.Runtime.MembershipService.MembershipTableManager.KillMyselfLocally(String reason)
   at Orleans.Runtime.MembershipService.MembershipTableManager.<TryToSuspectOrKill>d__50.MoveNext()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.MoveNextRunner.Run()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(Action action, Boolean allowInlining, Task& currentTask)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1.TrySetResult(TResult result)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetResult(TResult result)
   at Orleans.Runtime.MembershipService.AzureBasedMembershipTable.<ReadAll>d__10.MoveNext()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.MoveNextRunner.Run()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(Action action, Boolean allowInlining, Task& currentTask)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1.TrySetResult(TResult result)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetResult(TResult result)
   at Orleans.AzureUtils.OrleansSiloInstanceManager.<FindAllSiloEntries>d__29.MoveNext()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.MoveNextRunner.Run()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(Action action, Boolean allowInlining, Task& currentTask)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1.TrySetResult(TResult result)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetResult(TResult result)
   at Orleans.Clustering.AzureStorage.AzureTableDataManager`1.<ReadTableEntriesAndEtagsAsync>d__28.MoveNext()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.MoveNextRunner.Run()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(Action action, Boolean allowInlining, Task& currentTask)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1.TrySetResult(TResult result)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetResult(TResult result)
   at Orleans.Internal.AsyncExecutorWithRetries.<ExecuteWithRetriesHelper>d__4`1.MoveNext()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.MoveNextRunner.Run()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(Action action, Boolean allowInlining, Task& currentTask)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1.TrySetResult(TResult result)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetResult(TResult result)
   at Orleans.Clustering.AzureStorage.AzureTableDataManager`1.<>c__DisplayClass28_0.<<ReadTableEntriesAndEtagsAsync>b__0>d.MoveNext()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.MoveNextRunner.Run()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(Action action, Boolean allowInlining, Task& currentTask)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1.TrySetResult(TResult result)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetResult(TResult result)
   at Microsoft.Azure.Cosmos.Table.RestExecutor.TableCommand.Executor.<ExecuteAsync>d__1`1.MoveNext()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.MoveNextRunner.Run()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(Action action, Boolean allowInlining, Task& currentTask)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1.TrySetResult(TResult result)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetResult(TResult result)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetResult(Task`1 completedTask)
   at Microsoft.Azure.Cosmos.Table.RestExecutor.Utils.AsyncStreamCopier`1.<StartCopyStreamAsync>d__13.MoveNext()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.MoveNextRunner.Run()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(Action action, Boolean allowInlining, Task& currentTask)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1.TrySetResult(TResult result)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetResult(TResult result)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetResult(Task`1 completedTask)
   at Microsoft.Azure.Cosmos.Table.RestExecutor.Utils.AsyncStreamCopier`1.<StartCopyStreamAsyncHelper>d__14.MoveNext()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.MoveNextRunner.Run()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(Action action, Boolean allowInlining, Task& currentTask)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1.TrySetResult(TResult result)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetResult(TResult result)
   at Microsoft.Azure.Cosmos.Table.RestExecutor.Utils.TaskExtensions.<WithCancellation>d__0`1.MoveNext()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.MoveNextRunner.Run()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(Action action, Boolean allowInlining, Task& currentTask)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1.TrySetResult(TResult result)
   at System.Threading.Tasks.TaskFactory.CompleteOnInvokePromise.Invoke(Task completingTask)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1.TrySetResult(TResult result)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetResult(TResult result)
   at System.Net.Http.HttpClientHandler.WebExceptionWrapperStream.<ReadAsync>d__4.MoveNext()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.MoveNextRunner.Run()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(Action action, Boolean allowInlining, Task& currentTask)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1.TrySetResult(TResult result)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncTrimPromise`1.Complete(TInstance thisRef, Func`3 endMethod, IAsyncResult asyncResult, Boolean requiresSynchronization)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncTrimPromise`1.CompleteFromAsyncResult(IAsyncResult asyncResult)
   at System.Net.LazyAsyncResult.Complete(IntPtr userToken)
   at System.Net.LazyAsyncResult.ProtectedInvokeCallback(Object result, IntPtr userToken)
   at System.Net.ChunkParser.CompleteUserRead(Object result)
   at System.Net.ChunkParser.ParseTrailer()
   at System.Net.ChunkParser.ProcessResponse()
   at System.Net.ChunkParser.ReadCallback(IAsyncResult ar)
   at System.Net.LazyAsyncResult.Complete(IntPtr userToken)
   at System.Net.LazyAsyncResult.ProtectedInvokeCallback(Object result, IntPtr userToken)
   at System.Net.Security._SslStream.ProcessFrameBody(Int32 readBytes, Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security._SslStream.ReadFrameCallback(AsyncProtocolRequest asyncRequest)
   at System.Net.AsyncProtocolRequest.CompleteRequest(Int32 result)
   at System.Net.FixedSizeReader.CheckCompletionBeforeNextRead(Int32 bytes)
   at System.Net.FixedSizeReader.ReadCallback(IAsyncResult transportResult)
   at System.Net.LazyAsyncResult.Complete(IntPtr userToken)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Net.ContextAwareResult.Complete(IntPtr userToken)
   at System.Net.LazyAsyncResult.ProtectedInvokeCallback(Object result, IntPtr userToken)
   at System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32 errorCode, UInt32 numBytes, NativeOverlapped* nativeOverlapped)
   at System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32 errorCode, UInt32 numBytes, NativeOverlapped* pOVERLAP)
Stack:
   at System.Environment.FailFast(System.String)
   at Orleans.Runtime.FatalErrorHandler.OnFatalException(System.Object, System.String, System.Exception)
   at Orleans.Runtime.MembershipService.MembershipTableManager.KillMyselfLocally(System.String)
   at Orleans.Runtime.MembershipService.MembershipTableManager+<TryToSuspectOrKill>d__50.MoveNext()
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore+MoveNextRunner.Run()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Action, Boolean, System.Threading.Tasks.Task ByRef)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].TrySetResult(System.__Canon)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].SetResult(System.__Canon)
   at Orleans.Runtime.MembershipService.AzureBasedMembershipTable+<ReadAll>d__10.MoveNext()
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore+MoveNextRunner.Run()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Action, Boolean, System.Threading.Tasks.Task ByRef)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].TrySetResult(System.__Canon)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].SetResult(System.__Canon)
   at Orleans.AzureUtils.OrleansSiloInstanceManager+<FindAllSiloEntries>d__29.MoveNext()
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore+MoveNextRunner.Run()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Action, Boolean, System.Threading.Tasks.Task ByRef)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].TrySetResult(System.__Canon)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].SetResult(System.__Canon)
   at Orleans.Clustering.AzureStorage.AzureTableDataManager`1+<ReadTableEntriesAndEtagsAsync>d__28[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].MoveNext()
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore+MoveNextRunner.Run()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Action, Boolean, System.Threading.Tasks.Task ByRef)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].TrySetResult(System.__Canon)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].SetResult(System.__Canon)
   at Orleans.Internal.AsyncExecutorWithRetries+<ExecuteWithRetriesHelper>d__4`1[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].MoveNext()
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore+MoveNextRunner.Run()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Action, Boolean, System.Threading.Tasks.Task ByRef)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].TrySetResult(System.__Canon)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].SetResult(System.__Canon)
   at Orleans.Clustering.AzureStorage.AzureTableDataManager`1+<>c__DisplayClass28_0+<<ReadTableEntriesAndEtagsAsync>b__0>d[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].MoveNext()
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore+MoveNextRunner.Run()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Action, Boolean, System.Threading.Tasks.Task ByRef)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].TrySetResult(System.__Canon)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].SetResult(System.__Canon)
   at Microsoft.Azure.Cosmos.Table.RestExecutor.TableCommand.Executor+<ExecuteAsync>d__1`1[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].MoveNext()
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore+MoveNextRunner.Run()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Action, Boolean, System.Threading.Tasks.Task ByRef)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1[[System.Threading.Tasks.VoidTaskResult, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].TrySetResult(System.Threading.Tasks.VoidTaskResult)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.Threading.Tasks.VoidTaskResult, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].SetResult(System.Threading.Tasks.VoidTaskResult)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.Threading.Tasks.VoidTaskResult, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].SetResult(System.Threading.Tasks.Task`1<System.Threading.Tasks.VoidTaskResult>)
   at Microsoft.Azure.Cosmos.Table.RestExecutor.Utils.AsyncStreamCopier`1+<StartCopyStreamAsync>d__13[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].MoveNext()
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore+MoveNextRunner.Run()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Action, Boolean, System.Threading.Tasks.Task ByRef)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1[[System.Threading.Tasks.VoidTaskResult, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].TrySetResult(System.Threading.Tasks.VoidTaskResult)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.Threading.Tasks.VoidTaskResult, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].SetResult(System.Threading.Tasks.VoidTaskResult)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.Threading.Tasks.VoidTaskResult, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].SetResult(System.Threading.Tasks.Task`1<System.Threading.Tasks.VoidTaskResult>)
   at Microsoft.Azure.Cosmos.Table.RestExecutor.Utils.AsyncStreamCopier`1+<StartCopyStreamAsyncHelper>d__14[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].MoveNext()
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore+MoveNextRunner.Run()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Action, Boolean, System.Threading.Tasks.Task ByRef)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].TrySetResult(Int32)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].SetResult(Int32)
   at Microsoft.Azure.Cosmos.Table.RestExecutor.Utils.TaskExtensions+<WithCancellation>d__0`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].MoveNext()
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore+MoveNextRunner.Run()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Action, Boolean, System.Threading.Tasks.Task ByRef)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].TrySetResult(System.__Canon)
   at System.Threading.Tasks.TaskFactory+CompleteOnInvokePromise.Invoke(System.Threading.Tasks.Task)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].TrySetResult(Int32)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].SetResult(Int32)
   at System.Net.Http.HttpClientHandler+WebExceptionWrapperStream+<ReadAsync>d__4.MoveNext()
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore+MoveNextRunner.Run()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Action, Boolean, System.Threading.Tasks.Task ByRef)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].TrySetResult(Int32)
   at System.Threading.Tasks.TaskFactory`1+FromAsyncTrimPromise`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].Complete(System.__Canon, System.Func`3<System.__Canon,System.IAsyncResult,Int32>, System.IAsyncResult, Boolean)
   at System.Threading.Tasks.TaskFactory`1+FromAsyncTrimPromise`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].CompleteFromAsyncResult(System.IAsyncResult)
   at System.Net.LazyAsyncResult.Complete(IntPtr)
   at System.Net.LazyAsyncResult.ProtectedInvokeCallback(System.Object, IntPtr)
   at System.Net.ChunkParser.CompleteUserRead(System.Object)
   at System.Net.ChunkParser.ParseTrailer()
   at System.Net.ChunkParser.ProcessResponse()
   at System.Net.ChunkParser.ReadCallback(System.IAsyncResult)
   at System.Net.LazyAsyncResult.Complete(IntPtr)
   at System.Net.LazyAsyncResult.ProtectedInvokeCallback(System.Object, IntPtr)
   at System.Net.Security._SslStream.ProcessFrameBody(Int32, Byte[], Int32, Int32, System.Net.AsyncProtocolRequest)
   at System.Net.Security._SslStream.ReadFrameCallback(System.Net.AsyncProtocolRequest)
   at System.Net.AsyncProtocolRequest.CompleteRequest(Int32)
   at System.Net.FixedSizeReader.CheckCompletionBeforeNextRead(Int32)
   at System.Net.FixedSizeReader.ReadCallback(System.IAsyncResult)
   at System.Net.LazyAsyncResult.Complete(IntPtr)
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
   at System.Net.ContextAwareResult.Complete(IntPtr)
   at System.Net.LazyAsyncResult.ProtectedInvokeCallback(System.Object, IntPtr)
   at System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
   at System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)

@ReubenBond
Member

@thakursagar those appear to be traces from one of the clients - I'm not able to see silo logs there

@thakursagar
Author

@ReubenBond Sorry, I got the wrong logs. I will get the correct ones soon.

@thakursagar
Author

logs.zip
Here you go @ReubenBond. I hope this is what you're looking for.

@ReubenBond
Member

@thakursagar I'm not seeing the OnFatalException call or the "I have been told I am dead" log line in those logs - are there more logs that I can look at?

@thakursagar
Author

@ReubenBond That OnFatalException was from the Event Viewer logs in one of the VMSS nodes. The Service Fabric event logs also show there was an unexpected termination for both the server and client services.

@thakursagar
Author

vmlogs.zip
Here are the logs from the VMs. They're pretty much the same type on all of them. I am out of ideas at this point honestly. Do you think maybe running it on Service Fabric is causing an issue?

@ReubenBond
Member

@thakursagar the log indicates that the silo was declared as dead by other silos in the cluster. This can mean that communication with that silo was lost. It can also mean that the node froze for a very long time (eg, 30s). Analyzing logs from all relevant nodes will often reveal the cause.

@thakursagar
Author

@ReubenBond I tried to run the app in a different environment with a different VMSS size configuration and I could not reproduce the issue. Maybe it is an infrastructure related thing (Azure) at this point.

@thakursagar
Author

thakursagar commented Nov 21, 2019

@ReubenBond Update on this - I tried spinning up another environment identical to the one that was causing issues and observed the same behavior. Looking further into the silo logs, the trend was that silos were suspecting other silos because ping responses were not being received in a timely fashion. So I changed the configuration to set UseLivenessGossip=false and increased the ProbeTimeout to 30 seconds. With that change, the benchmark runs almost finished before seeing the SiloUnavailableExceptions. Any thoughts here? I am thinking of disabling the LivenessEnabled property or increasing the ProbeTimeout further and giving this another shot. Also attaching the silo traces, where you will see messages saying "the ping attempt was cancelled".
query_data (19).zip
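
For reference, the membership tweaks described above would be configured along these lines in the silo builder (option names per Orleans 3.x ClusterMembershipOptions; a sketch of the experiment, not a recommended production configuration):

```csharp
// Sketch only: relaxing liveness settings to test the ping-timeout theory.
// ClusterMembershipOptions lives in the Orleans.Configuration namespace.
builder.Configure<ClusterMembershipOptions>(options =>
{
    options.UseLivenessGossip = false;               // do not gossip liveness changes eagerly
    options.ProbeTimeout = TimeSpan.FromSeconds(30); // allow slower probe responses before counting a miss
});
```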

@ReubenBond
Member

Could you show me your silo configuration code?

@ReubenBond
Member

The logs you uploaded have no message saying "the ping attempt was cancelled"

@thakursagar
Author

@ReubenBond Sorry, the log export had a 10k-line limit. I've filtered those exact messages into this file. I will get the config code soon.
query_data (20).zip

@thakursagar
Author

@ReubenBond Here's the silo config code:

{
    const string serviceName = "xxxx";
    ServiceEventSource.Current.Message($"[PID {Process.GetCurrentProcess().Id}] CreateServiceInstanceListeners()");
    var listener = OrleansServiceListener.CreateStateless((fabricServiceContext, builder) =>
    {
        builder.Configure<ClusterOptions>(options =>
        {
            // The service id is unique for the entire service over its lifetime. This is used to identify
            // persistent state such as reminders and grain state.
            options.ServiceId = serviceName;

            // The cluster id identifies a deployed cluster. Since Service Fabric uses rolling upgrades,
            // the cluster id can be kept constant. This is used to identify which silos belong to a
            // particular cluster.
            options.ClusterId = "xxxxxx";
        });

        var activation = fabricServiceContext.CodePackageActivationContext;
        var keyVault = new AzureKeyVault(new ServiceFabricConfigProvider());
        var serviceConfigProvider = new ServiceFabricConfigProvider();

        var cloudConfig = activation.GetConfigurationPackageObject("Config");
        var storageConnectionString = keyVault.GetSecretByNameAsync("xxxx").GetAwaiter().GetResult();
        var aiInstrumentationKey = keyVault.GetSecretByNameAsync("xxxx").GetAwaiter().GetResult();
        Common.Infrastructure.LoggerFactory.Initialize(aiInstrumentationKey.Value, keyVault);
        var loggerFactory = Common.Infrastructure.LoggerFactory.Instance;
        Log.Logger = new LoggerConfiguration()
            .Enrich.FromLogContext()
            .WriteTo.ApplicationInsights(aiInstrumentationKey.Value, TelemetryConverter.Traces)
            .CreateLogger();
        var minimumLogLevel = new LogLevelHelper(keyVault).GetMinimumLogLevel().GetAwaiter().GetResult();
        if (minimumLogLevel.Equals(LogEventLevel.Debug))
            builder.AddApplicationInsightsTelemetryConsumer(aiInstrumentationKey.Value);

        var invariant = "System.Data.SqlClient"; // for Microsoft SQL Server
        var connectionString = "xxxx";

        // Use ADO.NET for clustering.
        builder.UseAdoNetClustering(options =>
        {
            options.Invariant = invariant;
            options.ConnectionString = connectionString;
        });

        // Use ADO.NET for the reminder service.
        builder.UseAdoNetReminderService(options =>
        {
            options.Invariant = invariant;
            options.ConnectionString = connectionString;
        });

        builder.ConfigureLogging(logging =>
        {
            logging.AddSerilog(dispose: true)
                .AddFilter("", LogLevel.Debug)
                .AddFilter("Orleans", LogLevel.Debug);
        });

        var endpoints = activation.GetEndpoints();
        var siloEndpoint = endpoints["OrleansSiloEndpoint"];
        var gatewayEndpoint = endpoints["OrleansProxyEndpoint"];
        var hostname = fabricServiceContext.NodeContext.IPAddressOrFQDN;
        builder.ConfigureEndpoints(hostname, siloEndpoint.Port, gatewayEndpoint.Port);

        builder.ConfigureApplicationParts(parts =>
        {
            parts.AddApplicationPart(typeof(TenantGrain).Assembly).WithReferences();
            parts.AddApplicationPart(typeof(ITenantGrain).Assembly).WithReferences();
        });

        ConfigureServices(builder, loggerFactory);

        builder.Configure<SiloMessagingOptions>(options => options.ResponseTimeout = TimeSpan.FromMinutes(30));
        builder.Configure<SerializationProviderOptions>(options => options.SerializationProviders.Add(typeof(Orleans.Serialization.ProtobufSerializer).GetTypeInfo()));

        builder.AddStartupTask<Startup>();
    });

    return new[] { listener };
}

@ReubenBond
Member

builder.Configure<SiloMessagingOptions>(options => options.ResponseTimeout = TimeSpan.FromMinutes(30));

I strongly recommend using a value less than 2 minutes for ResponseTimeout. The default of 30 seconds is an appropriate value.

Is there some way that you can get more comprehensive logs? There's not much to work with in those logs. I'd start by looking for warnings and errors from all silos. In particular, is the process freezing? Are blocking operations causing threads on that host to deadlock?
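
The recommended change would be along these lines (a sketch; 30 seconds is the Orleans default):

```csharp
// Keep the response timeout at (or near) the Orleans default of 30 seconds
// instead of 30 minutes, so stuck requests fail fast rather than piling up.
builder.Configure<SiloMessagingOptions>(options =>
    options.ResponseTimeout = TimeSpan.FromSeconds(30));
```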

@thakursagar
Author

@ReubenBond I'm working on getting the silo logs being written to a file which I will share with you.

@thakursagar
Author

@ReubenBond Here is a complete silo log from one of the silos.
log20191121.zip

@thakursagar
Author

@ReubenBond some more silo logs for you.
log20191121-2_2.zip

@thakursagar
Author

@ReubenBond Did you get a chance to go through the silo logs? I looked at warnings and errors from all silos and nothing seems to give me a clue about the root cause. The process just seems to crash with lots of SiloUnavailableExceptions because the silos' ping responses are not received in time. There might be a few blocking operations overall, but this benchmark run works with Orleans 2.4, so my expectation is that it should work with 3.0 as well.

@ReubenBond
Member

The cause of the issue is not immediately obvious. I see that some silo(s) are pausing for a few seconds at a time. All of the silo logs are lumped in together in that one file, so there's no way to distinguish them.

It's interesting that the same silo is declared dead twice in those logs, 10.2.73.6, and the suspecting silos which voted it dead are always Suspecters = [S10.2.73.7:20001:312059015, S10.2.73.8:20001:312059012].

Is there a networking issue? Could you please verify that you are definitely using Orleans 3.0.0 on all hosts and not 3.0.0-beta1?

@sujesharukil

@ReubenBond - do silo pings go through grains? That is, do they use the same dedicated threads as Orleans grain calls?

@ReubenBond
Member

Silo pings use different threads to grains. In 3.0, they are handled directly by the connection processor (see SiloConnection.cs)

@sujesharukil

Thanks for the info @ReubenBond. Are there logs on the ping receivers showing that they received a ping and will respond?
The cancellation of pings in this case seems to be just a wait timeout. The SF Explorer showed all the silos as healthy, and the silo logs show the silos shutting down only because they have been told that they are dead.
There is nothing that tells how the ping was handled or not handled, other than that it timed out.

@ReubenBond
Member

There are logs: you need to enable trace level logging for "Microsoft.Orleans.Networking" to show them.

@sujesharukil

                    .AddFilter("Orleans", LogLevel.Trace);
                });

Like this?

@ReubenBond
Member

Yes, but with Microsoft.Orleans.Networking instead of just Orleans
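
Applied to the logging setup quoted earlier in the thread, that would look roughly like this (a sketch):

```csharp
builder.ConfigureLogging(logging =>
{
    logging.AddSerilog(dispose: true)
        // Trace-level output from the networking layer surfaces ping send/receive activity.
        .AddFilter("Microsoft.Orleans.Networking", LogLevel.Trace)
        .AddFilter("Orleans", LogLevel.Debug);
});
```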

@sujesharukil

thanks, we will give that a shot and reach back

@thakursagar
Author

The cause of the issue is not immediately obvious. I see that some silo(s) are pausing for a few seconds at a time. All of the silo logs are lumped in together in that one file, so there's no way to distinguish them.

It's interesting that the same silo is declared dead twice in those logs, 10.2.73.6, and the suspecting silos which voted it dead are always Suspecters = [S10.2.73.7:20001:312059015, S10.2.73.8:20001:312059012].

Is there a networking issue? Could you please verify that you are definitely using Orleans 3.0.0 on all hosts and not 3.0.0-beta1?

The logs that I have uploaded are from one silo. Each file represents one silo log. I have verified we are using all 3.0 packages.

@ReubenBond
Member

ReubenBond commented Nov 27, 2019

Thanks for the clarification, @thakursagar, I had misinterpreted them because of the doubled-up log lines.

I just noticed that you are using AppInsights for telemetry. I've seen some issues with AppInsights and blocking threads recently (particularly in Flush calls) which could potentially cause issues like this. Diagnosing that is not trivial, but capturing a memory dump can help (I can analyze it if you like, just email me a link to it). Capturing a perf trace can also help and is preferable since it gives an idea of behavior over time rather than just a snapshot.

To capture a trace, download the latest version of PerfView from here: https://github.com/microsoft/perfview/releases/tag/P2.0.48 and copy PerfView.exe to the target machine and execute the following in an elevated command prompt:

.\PerfView.exe /acceptEULA /noGui /threadTime /zip /maxCollectSec:30 /bufferSizeMB:1024 /circularMB:1024 /dataFile:1.etl collect

I can help to analyze the resulting zipped traces.

We can also make some time to diagnose this over a call. That might be faster.

If you are able to send me a dump of the membership table (eg, using Azure Storage Explorer), that is also useful for diagnosing this.

My current inclination is blocked threads on the scheduler, since we see so many stalls.

The ResponseTimeout should definitely be lowered to < 2 minutes.

@sujesharukil

@ReubenBond, we do a few Task.WhenAll calls awaiting a few thousand grain calls each. We have seen these can take a while (30 minutes is harsh, but that was because we didn't know where it would break). Now that you mention stalled and/or blocked threads, that would definitely explain why some of the processes take so long.
We can certainly provide you with the dumps and we greatly appreciate your help in this regard.
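
One common way to keep a fan-out that large from overwhelming the scheduler is to await the calls in bounded batches instead of all at once. A minimal sketch, not Orleans-specific, where the grain calls are represented by placeholder Task factories:

```csharp
// Hypothetical helper: awaits the supplied calls in batches of `batchSize`
// rather than issuing thousands of concurrent requests in one Task.WhenAll.
static async Task FanOutInBatchesAsync(IReadOnlyList<Func<Task>> calls, int batchSize)
{
    for (var i = 0; i < calls.Count; i += batchSize)
    {
        // Start only the current batch, then wait for it to drain.
        var batch = calls.Skip(i).Take(batchSize).Select(call => call());
        await Task.WhenAll(batch);
    }
}
```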

@ReubenBond
Member

Any time, @sujesharukil. I'm flying to the other side of the planet today, but I'll try to help whenever I have connectivity

@thakursagar
Author

@ReubenBond I tried unhooking App Insights from the app but saw the same behavior. I captured the perf traces using PerfView and have emailed you the link to traces. Thanks for all your help!

@ReubenBond
Member

Closing due to inactivity - let us know if this is still an issue

@thakursagar
Author

@ReubenBond sorry, I haven't gotten a chance to do the next steps that we discussed on the email thread. We have put the upgrade on hold for a bit due to other priorities. I did find an interesting issue with some of our grain interfaces, though: we were importing [OneWay] from the System.Runtime.Remoting.Messaging library instead of Orleans.Concurrency. Do you think that might be related to this?

@ReubenBond
Member

It's possible, worth testing.
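
For anyone hitting the same mix-up: the attribute Orleans honors is the one in Orleans.Concurrency, applied to grain interface methods returning Task. A sketch (the interface name is made up):

```csharp
using System.Threading.Tasks;
using Orleans;
using Orleans.Concurrency; // not System.Runtime.Remoting.Messaging

public interface INotifierGrain : IGrainWithGuidKey
{
    [OneWay] // fire-and-forget: the caller does not wait for the call to complete
    Task Notify(string message);
}
```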

@thakursagar
Author

@ReubenBond @sergeybykov Gave another shot this week to update the application to the 3.4.3 version of Orleans. What I am seeing this time is a lot of errors like this:

{"FailedProbeCount":"1","MessageTemplate":"Did not get response for probe #{Id} to silo {Silo} after {Elapsed}. Current number of consecutive failed probes is {FailedProbeCount}","SourceContext":"Orleans.Runtime.MembershipService.SiloHealthMonitor","Elapsed":"00:00:05.0047962","EventId":"{\"Id\":100613}","Silo":"S10.2.69.11:20052:362863605","Id":"5279"}

followed by this:

[{"severityLevel":"Error","parsedStack":[{"assembly":"mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089","method":"System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw","level":0,"line":0},{"assembly":"mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089","method":"System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification","level":1,"line":0},{"assembly":"Grains, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null","method":"Grains.SectionProcessorGrain+<Answers>d__47.MoveNext","level":2,"line":766,"fileName":"FileName.cs"}],"outerId":"0","message":"The target silo became unavailable for message: Request S10.2.69.10:20056:362863155*grn/Grains.ProcessorGrain/0+012018.3107424C-0F21-5FCF-17C8-E542DA6B9848.EB5F71F1-1C7E-9D9B-5C14-6274B9E0203D@5a696d96->S10.2.69.11:20052:362863150*grn/Grains.Grain/0+012018.3107424C-0F21-5FCF-17C8-E542DA6B9848.1@3eaac383 InvokeMethodRequest Interfaces#830933. Target History is: <S10.2.69.11:20052:362863150:*grn/Grains/012018.3107424C-0F21-5FCF-17C8-E542DA6B9848.1:@3eaac383>. See https://aka.ms/orleans-troubleshooting for troubleshooting help.","type":"Orleans.Runtime.SiloUnavailableException","id":"44914034"}]

Do you think anything changed in the 3.x version of Orleans, compared to 2.4.2, with respect to the errors I am seeing, and if so, is there any way to work around it? The same code works very well, without any issues, on Orleans 2.4.2.

@ghost ghost locked as resolved and limited conversation to collaborators Sep 25, 2021
@ghost ghost added the stale Issues with no activity for the past 6 months label Dec 7, 2021