Addressing Silo Failures and Connection Issues in Orleans v3.5.0 Application #8678

Open
thakursagar opened this issue Oct 20, 2023 · 5 comments

thakursagar commented Oct 20, 2023

We have been experiencing issues with our .NET application running on Orleans v3.5.0 within a Service Fabric cluster. We use Azure Table Storage for our clustering membership.

Currently, our production environment comprises 16 silos. However, we have encountered a persistent problem where a random silo fails to send a ping response and is consequently declared DEAD within the cluster. Upon investigating this issue, we noticed that the SiloInstances table displays two "Active" entries for the same silo, each associated with a different port number. The outdated entry also appears in the SuspectingSilos column. Following the silo's declaration as DEAD, we observe a surge in OrleansMessageRejection messages and Silo Unavailable messages.

To mitigate this issue, we manually mark the entry with the incorrect (old) port number in the SiloInstances table as Dead. Subsequently, we restart the remaining silos, prompting them to update their membership tables with the correct port number for the affected silo. Failing to take these steps results in ongoing OrleansMessageRejection exceptions and the consequent stalling of all requests directed to the affected silo.

We are perplexed as to why Orleans does not resolve this situation internally by selecting the correct port number for the affected silo. Additionally, we have observed an influx of Microsoft.AspNetCore.Connections.ConnectionAbortedException exceptions in our logs. This occurs despite the absence of excessive CPU load or resource-intensive processes on the nodes at the time of the incidents.

Please let me know if you require any further information or if there are additional diagnostic steps we can undertake to assist in the resolution process.

ghost added the Needs: triage 🔍 label Oct 20, 2023
thakursagar (Author) commented:

@ReubenBond I also emailed you the Orleans logs


ReubenBond commented Oct 20, 2023

Upon investigating this issue, we noticed that the SiloInstances table displays two "Active" entries for the same silo, each associated with different port numbers

Orleans uses fixed port numbers by default, but when you are running it on Service Fabric, Service Fabric allocates ports from a range of ports to prevent port clashes. When you have two silos with the same IP address but different ports, Orleans cannot tell that they are actually two instances of the same silo because some people run multiple silos per host, and cluster orchestrators like Service Fabric can do so during upgrades.
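
For context, here is a minimal sketch of how a Service Fabric-hosted silo typically picks up its dynamically allocated ports and passes them to Orleans. The endpoint names, cluster/service ids, and connection string below are placeholders, not values taken from your setup:

using System.Fabric;
using Orleans;
using Orleans.Configuration;
using Orleans.Hosting;

internal static class SiloHostFactory
{
    // Sketch only: every literal below is a placeholder.
    public static ISiloHost Build(ServiceContext context, string storageConnectionString)
    {
        var activation = context.CodePackageActivationContext;

        // Service Fabric assigns these ports from its application port range,
        // so they can change each time the service is activated on a node.
        int siloPort = activation.GetEndpoint("OrleansSiloEndpoint").Port;
        int gatewayPort = activation.GetEndpoint("OrleansProxyEndpoint").Port;

        return new SiloHostBuilder()
            .Configure<ClusterOptions>(options =>
            {
                options.ClusterId = "my-cluster";
                options.ServiceId = "my-service";
            })
            .UseAzureStorageClustering(options => options.ConnectionString = storageConnectionString)
            .ConfigureEndpoints(siloPort, gatewayPort)
            .Build();
    }
}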

Based on your logs, I can see that you have configured your cluster so that each silo monitors 1 other silo (NumProbedSilos: 1), but each silo requires 3 votes to be declared dead (NumVotesForDeathDeclaration: 3). So, it is impossible for a silo to ever declare another dead and therefore the cluster membership algorithm cannot do its job. I've opened #8679 to add configuration validation on startup for the relationship between those properties.
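
To illustrate the invariant, a startup check along these lines would catch the mismatch. This is only a sketch of the idea, not necessarily what #8679 implements:

using Microsoft.Extensions.Options;
using Orleans;
using Orleans.Configuration;
using Orleans.Runtime;

// Registered with the silo's service collection, e.g.
// services.AddTransient<IConfigurationValidator, ClusterMembershipOptionsValidator>().
internal sealed class ClusterMembershipOptionsValidator : IConfigurationValidator
{
    private readonly ClusterMembershipOptions _options;

    public ClusterMembershipOptionsValidator(IOptions<ClusterMembershipOptions> options)
    {
        _options = options.Value;
    }

    public void ValidateConfiguration()
    {
        // A silo can never accumulate more votes than the number of silos probing it,
        // so this combination makes death declaration impossible.
        if (_options.NumVotesForDeathDeclaration > _options.NumProbedSilos)
        {
            throw new OrleansConfigurationException(
                $"NumVotesForDeathDeclaration ({_options.NumVotesForDeathDeclaration}) must be " +
                $"less than or equal to NumProbedSilos ({_options.NumProbedSilos}).");
        }
    }
}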

Here's the documentation for those two properties:

/// <summary>
/// Gets or sets the number of silos each silo probes for liveness.
/// </summary>
/// <remarks>
/// This determines how many hosts each host will monitor by default.
/// A low value, such as the default value of three, is generally sufficient and allows for prompt removal of another silo in the event that it stops functioning.
/// When a silo becomes suspicious of another silo, additional silos may begin to probe that silo to speed up the detection of non-functioning silos.
/// </remarks>
/// <value>Each silo will actively monitor up to three other silos by default.</value>
public int NumProbedSilos { get; set; } = 3;

/// <summary>
/// Gets or sets the number of non-expired votes that are needed to declare some silo as down (should be at most <see cref="NumProbedSilos"/>)
/// </summary>
/// <value>Two votes are sufficient for a silo to be declared as down, by default.</value>
public int NumVotesForDeathDeclaration { get; set; } = 2;

Please leave them at their default values, or otherwise ensure that NumVotesForDeathDeclaration is less than or equal to NumProbedSilos. Your settings also allow 5 missed probes with 20s per probe, resulting in about 100s of detection time for each defunct silo - what led to that configuration?
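
Concretely, resetting those values would look something like this, where "siloBuilder" stands in for whatever ISiloHostBuilder you already configure (simply removing any Configure<ClusterMembershipOptions> call has the same effect):

using Orleans.Configuration;
using Orleans.Hosting;

siloBuilder.Configure<ClusterMembershipOptions>(options =>
{
    options.NumProbedSilos = 3;              // default
    options.NumVotesForDeathDeclaration = 2; // default; must be <= NumProbedSilos
    // Leaving ProbeTimeout and NumMissedProbesLimit unset restores the default
    // probe cadence instead of the 20s / 5-missed-probe values mentioned above.
});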

As for why the silo initially fails, it is not clear to me from the logs. Which version of .NET are you running?

thakursagar (Author) commented:

Thank you for the quick response and the explanation @ReubenBond.

We had altered that configuration when we upgraded to Orleans 3.0 and were facing this issue.
I guess we can reset those settings back to the default configuration now and see how it goes.

We are using .NET Framework v4.7.2.

thakursagar (Author) commented:

Also, if it would be helpful, I can get you a larger snapshot of the logs that starts a few hours prior to when we encountered the issue. The logs I sent earlier start only about 10 minutes before the OrleansMessageRejections began appearing.

ReubenBond (Member) commented:

Once the config is fixed, if you're still seeing this, a larger log sample and/or memory dump would be helpful. I wonder if you're hitting thread pool starvation or another issue. Having logs from the silo which crashed would be very useful, along with logs from other silos to correlate events (unresponsiveness, etc).
