Issues after Orleans 3.0 upgrade running on a Service Fabric Cluster #6113
Update: I noticed I was using the old WindowsAzure.Storage NuGet package, so I removed it and installed the Microsoft.Azure.Storage.* packages for Blob, Queue, and Common. That seems to resolve the
which ultimately results, approximately 20 minutes later, in a ton of these SiloUnavailableExceptions:
What clustering provider are you using?
@sergeybykov I'm using Azure Table storage as the clustering provider.
Are you performing an in-place upgrade or a new deployment? If the latter, did you use a new cluster ID for it? Have you looked at the cluster membership table? It should show the live state of the cluster.
@sergeybykov New deployment. We delete the two Orleans tables from Azure Table Storage (SiloInstances and Reminders) every time before we deploy the new code package to Service Fabric. I looked at the cluster membership table and it looks like most of the silos were dead (more than once) and then came back up.
Before you start the benchmark, do you see all silos started and joined to the cluster successfully? If the table is fresh, you should not see any dead silo entries at that point.
Yes. Before kicking off the benchmark runs, I made sure all the silos started and are in a status of "Active" in the SiloInstances table. A few minutes after I start the run, I see a lot of exceptions thrown from the Orleans.Core library indicating connection issues with reading the SiloInstances table (which I have pasted above in my second comment on this post), followed by lots of SiloUnavailableExceptions. Then after some time the silo seems to come back up, the benchmark run continues, and after a while I see the same behavior again. I feel something is not working well with the new Microsoft.Azure.Storage.* packages, because when I downgrade Orleans to 2.4.3, which has a dependency on WindowsAzure.Storage, everything seems to work well and I don't see any such issues. With the 3.0 version, I have to uninstall the WindowsAzure.Storage package as Orleans seems to have a dependency on the new Microsoft.Azure.Storage.* packages. I am also trying to replace the Azure Table Storage clustering membership with SQL clustering right now, to rule out any issue with these new Microsoft.Azure.Storage.* packages or Azure Table Storage clustering.
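(Editor's note: for reference, swapping the membership provider to SQL Server in Orleans 3.x typically looks roughly like the sketch below, using the Microsoft.Orleans.Clustering.AdoNet package. The connection string and the `siloBuilder` variable are placeholders, not values from this thread.)

```csharp
using Orleans.Hosting;

// Rough sketch of ADO.NET (SQL Server) clustering on the silo side.
// Assumes the Orleans membership tables have already been created in the target
// database and that siloBuilder is the ISiloBuilder from host setup.
siloBuilder.UseAdoNetClustering(options =>
{
    options.Invariant = "System.Data.SqlClient";     // ADO.NET provider invariant
    options.ConnectionString = sqlConnectionString;  // placeholder connection string
});
```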
Do you mean this?
TableNotFound may be caused by you deleting the table. There was a known Azure Table behavior where a previously deleted table would be unavailable for several minutes afterwards. I don't know how/if that changed with the transition to the new library. I suggest that instead of deleting the table, you use a new cluster ID for each deployment. That's the recommended process. Note that unavailability of the cluster membership table after a cluster has successfully started has no impact on operation of the cluster unless silos leave or join it. This makes me think that the issue with running the benchmark might be somewhere else.
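(Editor's note: a minimal sketch of what that recommendation can look like in configuration. The ServiceId value and the build-label environment variable are illustrative placeholders, not anything prescribed in this thread.)

```csharp
using System;
using Orleans.Configuration;

// Sketch: keep ServiceId stable across deployments, but give each deployment its own
// ClusterId so the previous deployment's membership rows are simply ignored rather
// than the table being deleted.
siloBuilder.Configure<ClusterOptions>(options =>
{
    options.ServiceId = "MyService";                                                   // stable across deployments
    options.ClusterId = Environment.GetEnvironmentVariable("BUILD_LABEL") ?? "dev";    // new per deployment
});
```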
No, I meant this. That one was fixed after I uninstalled the WindowsAzure.Storage packages and installed the Microsoft.Azure.Storage.* packages.
I have tried using a new serviceId and a new clusterId for deployments but that doesn't seem to fix the issue.
I don't see anything about the table there. But I do see the following:
Is silo
Yes, that silo
Hmm. Is this running on a physical cluster with each silo having a few CPU cores and ServerGC? Can you share the beginning of a silo log or, better, a full silo log? I have a hard time thinking what should happen to a silo for it to stop responding to clustering pings. I guess it is possible to exhaust IO Completion ports, but still... @ReubenBond, do you have any other idea?
We have this running on Azure VM Scale sets with Service Fabric. I will share the silo logs shortly. |
@sergeybykov Here are the silo logs. |
This is the log from one of the VMs that experienced an unexpected silo shutdown. We are running SF Cluster version 6.5.664.9590.
@thakursagar those appear to be traces from one of the clients - I'm not able to see silo logs there
@ReubenBond Sorry, I got the wrong logs. I will get the correct ones soon.
logs.zip |
@thakursagar I'm not seeing the
@ReubenBond That
vmlogs.zip |
@thakursagar the log indicates that the silo was declared dead by other silos in the cluster. This can mean that communication with that silo was lost. It can also mean that the node froze for a very long time (e.g., 30s). Analyzing logs from all relevant nodes will often reveal the cause.
@ReubenBond I tried to run the app in a different environment with a different VMSS size configuration and I could not reproduce the issue. Maybe it is an infrastructure-related thing (Azure) at this point.
@ReubenBond Update on this - I tried spinning up another identical environment to the one that was causing issues and observed the same behavior. Looking further into the silo logs, I saw that the trend was silos suspecting other silos because ping responses were not being received in a timely fashion. So I changed the configuration to set the
Could you show me your silo configuration code? |
The logs you uploaded have no message saying "the ping attempt was cancelled" |
@ReubenBond Sorry the logs had the 10k limit on export. I've filtered those exact logs in this file. I will get the config code soon. |
@ReubenBond Here's the silo config code:
`builder.Configure<SiloMessagingOptions>(options => options.ResponseTimeout = TimeSpan.FromMinutes(30));`

I strongly recommend using a value less than 2 minutes for `ResponseTimeout`. Is there some way that you can get more comprehensive logs? There's not much to work with in those logs. I'd start by looking for warnings and errors from all silos. In particular, is the process freezing? Are blocking operations causing threads on that host to deadlock?
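(Editor's note: in other words, something much closer to the library default. The 30-second figure below is an illustrative assumption, not a value prescribed in this thread.)

```csharp
using System;
using Orleans.Configuration;

// Sketch: a response timeout well under two minutes, per the advice above.
builder.Configure<SiloMessagingOptions>(options =>
    options.ResponseTimeout = TimeSpan.FromSeconds(30));
```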
@ReubenBond I'm working on getting the silo logs being written to a file which I will share with you. |
@ReubenBond Here is a complete silo log from one of the silos. |
@ReubenBond some more silo logs for you. |
@ReubenBond Did you get a chance to go through the silo logs? I looked at warnings and errors from all silos and nothing seems to give me a clue about the possible root cause. The process just seems to crash with lots of SiloUnavailableExceptions because the silos' ping responses are not received in time. I would say that there might be a few blocking operations overall, but this benchmark run works with Orleans 2.4, so my expectation is that it should work with 3.0 as well.
The cause of the issue is not immediately obvious. I see that some silo(s) are pausing for a few seconds at a time. All of the silo logs are lumped together in that one file, so there's no way to distinguish them. It's interesting that the same silo is declared dead twice in those logs. Is there a networking issue? Could you please verify that you are definitely using Orleans 3.0.0 on all hosts and not 3.0.0-beta1?
@ReubenBond - do silo pings go through grains? As in, do they use the same dedicated Orleans threads?
Silo pings use different threads to grains. In 3.0, they are handled directly by the connection processor (see SiloConnection.cs) |
Thanks for the info @ReubenBond. Are there logs from ping receivers indicating that they have received a ping and will respond?
There are logs: you need to enable trace level logging for "Microsoft.Orleans.Networking" to show them. |
Like this?
Yes, but with Microsoft.Orleans.Networking instead of just Orleans |
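(Editor's note: concretely, that filter can be added wherever logging is configured. A rough sketch, assuming a generic host builder; `hostBuilder` is a placeholder for however the silo host is actually built.)

```csharp
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

// Sketch: raise only the Orleans networking category to Trace so the
// silo-to-silo ping send/receive messages appear in the logs.
hostBuilder.ConfigureLogging(logging =>
{
    logging.AddFilter("Microsoft.Orleans.Networking", LogLevel.Trace);
});
```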
Thanks, we will give that a shot and report back.
The logs that I have uploaded are from one silo. Each file represents one silo log. I have verified we are using all 3.0 packages. |
Thanks for the clarification, @thakursagar, I had misinterpreted them because of the doubled-up log lines. I just noticed that you are using AppInsights for telemetry. I've seen some issues with AppInsights and blocking threads recently (particularly in Flush calls) which could potentially cause issues like this. Diagnosing that is not trivial, but capturing a memory dump can help (I can analyze it if you like, just email me a link to it). Capturing a perf trace can also help and is preferable since it gives an idea of behavior over time rather than just a snapshot. To capture a trace, download the latest version of PerfView from here: https://github.com/microsoft/perfview/releases/tag/P2.0.48 and copy
I can help analyze the resulting zipped traces. We can also make some time to diagnose this over a call. That might be faster. If you are able to send me a dump of the membership table (e.g., using Azure Storage Explorer), that is also useful for diagnosing this. My current inclination is blocked threads on the scheduler, since we see so many stalls. The
@ReubenBond, we do a few Task.WhenAll calls awaiting a few thousand grain calls. We have seen this can take a while (30 minutes is harsh, but that was because we didn't know where it would break). But now that you mention stalled and/or blocked threads, it would definitely make sense why some of the processes take so long.
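(Editor's note: not something from this thread, but as an illustration of one way to bound that kind of fan-out: awaiting the calls in batches rather than all at once. The grain interface and method names below are hypothetical placeholders.)

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Orleans;

// Placeholder grain interface standing in for the application's real one.
public interface IMyGrain : IGrainWithGuidKey
{
    Task DoWorkAsync();
}

public static class FanOut
{
    // Hypothetical sketch: fan out to thousands of grains in bounded batches so a
    // single Task.WhenAll never has an enormous number of calls in flight at once.
    public static async Task RunInBatchesAsync(IGrainFactory grains, IReadOnlyList<Guid> keys, int batchSize = 500)
    {
        for (var i = 0; i < keys.Count; i += batchSize)
        {
            var batch = keys
                .Skip(i)
                .Take(batchSize)
                .Select(key => grains.GetGrain<IMyGrain>(key).DoWorkAsync())
                .ToList();

            await Task.WhenAll(batch); // finish this batch before issuing the next one
        }
    }
}
```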
Any time, @sujesharukil. I'm flying to the other side of the planet today, but I'll try to help whenever I have connectivity |
@ReubenBond I tried unhooking App Insights from the app but saw the same behavior. I captured the perf traces using PerfView and have emailed you the link to traces. Thanks for all your help! |
Closing due to inactivity - let us know if this is still an issue |
@ReubenBond sorry, I haven't gotten a chance to do the next steps that we discussed on the email thread. We have put the upgrade down for a bit due to other priorities. I did find an interesting issue with some of our GrainInterfaces, though - we were importing [OneWay] from the System.Runtime.Remoting.Messaging namespace instead of Orleans.Concurrency. Do you think that might be related to this?
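(Editor's note: for anyone comparing, the difference is just which attribute is in scope. A sketch with a placeholder grain interface; the Orleans runtime only honors Orleans.Concurrency.OneWayAttribute, not the old remoting one.)

```csharp
// Wrong: System.Runtime.Remoting.Messaging.OneWayAttribute is the old .NET remoting
// attribute and means nothing to Orleans.
// using System.Runtime.Remoting.Messaging;

using System.Threading.Tasks;
using Orleans;
using Orleans.Concurrency;   // the [OneWay] that Orleans understands

public interface INotifierGrain : IGrainWithGuidKey   // placeholder interface name
{
    [OneWay]                          // fire-and-forget: caller does not wait for a response
    Task NotifyAsync(string message);
}
```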
It's possible, worth testing. |
@ReubenBond @sergeybykov Gave this another shot this week, updating the application to the 3.4.3 version of Orleans. What I am seeing this time is a lot of errors like this:
followed by this:
Do you think there is anything that changed with respect to the errors I am seeing in the 3.x version as compared to the 2.4.2 version?
I recently upgraded our application, which is deployed on a Service Fabric cluster, to the Orleans 3.0 NuGet packages. After the upgrade, when we run our benchmark tests for performance, I am seeing a lot of exceptions being thrown:
It starts from
Then after a few minutes I see
and
types of messages. I am using the Azure Table Storage Clustering for both the client and server.
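(Editor's note: for context, Azure Table clustering in Orleans 3.x is typically wired up roughly as below on both sides, using the Microsoft.Orleans.Clustering.AzureStorage package. The connection string, IDs, and builder variables are placeholders, not values from this issue.)

```csharp
using Orleans;
using Orleans.Configuration;
using Orleans.Hosting;

// Silo side: publish membership to the Azure Table.
siloBuilder
    .Configure<ClusterOptions>(o => { o.ClusterId = "my-cluster"; o.ServiceId = "my-service"; })
    .UseAzureStorageClustering(o => o.ConnectionString = storageConnectionString);

// Client side: read the same membership table to discover gateways.
clientBuilder
    .Configure<ClusterOptions>(o => { o.ClusterId = "my-cluster"; o.ServiceId = "my-service"; })
    .UseAzureStorageClustering(o => o.ConnectionString = storageConnectionString);
```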